CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

Monorepo for Setu docs containing documentation content (MDX), API reference specs (OpenAPI), and two TypeScript pipelines that convert them into vector embeddings for a RAG-powered documentation copilot.

Build & Test Commands

docs-ingestion (primary pipeline)

cd docs-ingestion
npm ci                          # Install dependencies
npm run build                   # TypeScript compilation
npm test                        # Jest tests (ESM mode, requires --experimental-vm-modules)
npm run ingest                  # Build + run full ingestion pipeline
npm run normalize-api-specs     # Normalize OpenAPI specs → .api-reference-normalized/
npm run normalize-mdx           # Normalize MDX → .docs-normalized/
npm run check-token-limits      # Validate token limits on normalized output
npm run smoke-test-ingestion    # End-to-end smoke test
npm run dev                     # Run via tsx without build step
npm run upload-content          # Upload chunks to S3

Running a single test file:

cd docs-ingestion
node --experimental-vm-modules node_modules/jest/bin/jest.js src/normalize-mdx.test.ts

docs-embeddings

cd docs-embeddings
npm ci && npm run build
npm run sync                    # Build + run incremental embedding sync
npm run dry-run                 # Validate without external API calls
npm run embed-all               # Full re-embedding
npm run verify-embed            # Verification script

Architecture

Data Flow

content/*.mdx + api-references/*.json
        │                    │
        ▼                    ▼
  normalize-mdx      normalize-api-specs
        │                    │
        ▼                    ▼
  .docs-normalized/   .api-reference-normalized/
        └────────┬───────────┘
                 ▼
     docs-ingestion pipeline
     (scan → parse → chunk → metadata → deduplicate)
                 │
                 ▼
         output/chunks.json
                 │
                 ▼
     docs-embeddings pipeline
     (filter → embed via Bedrock → upsert Pinecone + upload S3)

Three-Layer RAG Architecture

Pinecone — embeddings + metadata references (NO content stored)
S3 — actual chunk content keyed by content_hash
Claude — retrieval + answer generation
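
The split between Pinecone and S3 can be sketched as a two-step lookup: a vector match carries only a content_hash, and the chunk text is resolved from a separate store. This is a minimal illustration, not the repository's code. The VectorMatch shape, the resolveContent helper, and the in-memory Map standing in for S3 are all assumptions.

```typescript
// Hypothetical shapes: the real Pinecone and S3 clients are replaced by plain data.
interface VectorMatch {
  id: string;
  score: number;
  metadata: { content_hash: string; url: string };
}

// Stand-in for the S3 bucket: content_hash -> chunk text.
type ContentStore = Map<string, string>;

// Resolve vector matches to chunk text by looking up each content_hash.
// Matches whose content is missing from the store are skipped.
function resolveContent(matches: VectorMatch[], store: ContentStore): string[] {
  const texts: string[] = [];
  for (const match of matches) {
    const text = store.get(match.metadata.content_hash);
    if (text !== undefined) texts.push(text);
  }
  return texts;
}

const store: ContentStore = new Map([["abc123", "Chunk text about payment links."]]);
const matches: VectorMatch[] = [
  { id: "chunk-1", score: 0.91, metadata: { content_hash: "abc123", url: "/payments/example" } },
  { id: "chunk-2", score: 0.52, metadata: { content_hash: "missing", url: "/data/example" } },
];
const context = resolveContent(matches, store);
```

The resolved texts would then be passed to Claude as retrieval context.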

docs-ingestion/src/ Key Modules

index.ts — Pipeline orchestrator, handles both MDX and API spec ingestion
scanner.ts — Recursive .md/.mdx file discovery
parser.ts — MDX frontmatter extraction + markdown AST parsing
chunker.ts — Token-based splitting (target 700 tokens, min 500) respecting semantic boundaries; never splits code blocks
metadata.ts — Enriches chunks with SHA256 hashes, URLs, git info, product/category
deduplication.ts — Incremental updates via content hash comparison
text-cleaner.ts — Deduplicates paragraphs/sentences, normalizes whitespace
embedding-helpers.ts — Filters chunks by embeddability (50-1600 tokens, with force-embed overrides for critical content)
normalize-mdx.ts — Strips JSX/HTML from MDX, produces clean markdown
normalize-api-specs.ts — Converts OpenAPI 3.x/Swagger 2.0 specs to markdown with RAG metadata
types.ts — Core interfaces: DocumentChunk, PipelineConfig, ChunkingConfig
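
As a rough illustration of how a chunker can respect the "never splits code blocks" rule, the sketch below packs whole blocks greedily toward the 700-token target. It is not the logic in chunker.ts. The Block type, the packBlocks helper, and the whitespace-based token count are assumptions made for the example, and overlap and the hard max are left out.

```typescript
// Hypothetical block type: a document is assumed to be pre-split into text and code blocks.
interface Block {
  kind: "text" | "code";
  text: string;
}

// Assumption: whitespace-separated words approximate tokens.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

const TARGET_TOKENS = 700;

// Greedily pack blocks into chunks near the target size.
// Blocks are the atomic unit, so a code block is never split across chunks,
// even when it alone exceeds the target.
function packBlocks(blocks: Block[], target: number = TARGET_TOKENS): Block[][] {
  const chunks: Block[][] = [];
  let current: Block[] = [];
  let currentTokens = 0;
  for (const block of blocks) {
    const size = countTokens(block.text);
    // Close the current chunk if adding this block would overshoot the target.
    if (current.length > 0 && currentTokens + size > target) {
      chunks.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(block);
    currentTokens += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```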

docs-embeddings/src/ Key Modules

sync.ts — Incremental sync: only embeds NEW content_hash values, fetches existing embeddings for metadata-only updates
embedder.ts — AWS Bedrock client (Amazon Titan Text Embeddings v2, 1024-dim)
vector-db.ts — Pinecone client (upsert, fetch, delete, list)
content-uploader.ts — S3 upload by content_hash
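
The incremental behaviour described for sync.ts amounts to partitioning chunks by whether their content_hash is already known. A minimal sketch under stated assumptions: the Chunk and SyncPlan types and the planSync function are hypothetical names, and the set of known hashes is assumed to have been read from the vector index beforehand.

```typescript
// Hypothetical minimal chunk shape; real chunks carry more metadata.
interface Chunk {
  id: string;
  content_hash: string;
}

interface SyncPlan {
  toEmbed: Chunk[]; // content_hash not seen before: needs a new embedding
  metadataOnly: Chunk[]; // content_hash already embedded: reuse the stored vector
}

// Split chunks into those that need embedding and those that only need a metadata update.
function planSync(chunks: Chunk[], knownHashes: Set<string>): SyncPlan {
  const plan: SyncPlan = { toEmbed: [], metadataOnly: [] };
  for (const chunk of chunks) {
    if (knownHashes.has(chunk.content_hash)) {
      plan.metadataOnly.push(chunk);
    } else {
      plan.toEmbed.push(chunk);
    }
  }
  return plan;
}
```

Only the toEmbed list would incur embedding calls, which is what keeps repeated syncs cheap.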

Content Structure

Content lives in content/{category}/{product}/ — categories: payments, data, dev-tools
API specs in api-references/{category}/ (JSON/YAML OpenAPI files)
endpoints.json — Product catalog (categories, products, versions, visibility)
menuItems.json — Auto-generated sidebar structure (do not edit manually)
redirects.json — URL redirect mappings

Versioning

The default version's content lives in the product root folder. Older versions go in versioned subfolders (e.g., account-aggregator/v1/). Versions are configured in endpoints.json via the versions and default_version fields.
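
The folder rule above could be expressed as a small path resolver. This is a sketch only: the ProductEntry shape is a guessed subset of an endpoints.json entry (only the versions and default_version field names come from the text), and contentDir is a hypothetical helper.

```typescript
// Hypothetical subset of a product entry in endpoints.json.
interface ProductEntry {
  versions: string[];
  default_version: string;
}

// Return the content folder for a given product version:
// the default version maps to the product root, others to a versioned subfolder.
function contentDir(category: string, product: string, entry: ProductEntry, version: string): string {
  if (entry.versions.indexOf(version) === -1) {
    throw new Error(`Unknown version "${version}" for ${product}`);
  }
  const root = `content/${category}/${product}`;
  return version === entry.default_version ? root : `${root}/${version}`;
}
```

For an entry with versions v1 and v2 and default_version v2, v2 resolves to the product root and v1 to the v1/ subfolder.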

MDX Frontmatter

Every MDX file requires:

---
sidebar_title: Page Title
page_title: Full Page Title — Setu Docs
order: 0
visible_in_sidebar: true
---
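
A check for these required fields might look like the sketch below. It assumes the frontmatter has already been parsed into a plain object, and it infers the expected types from the example values above. The frontmatterErrors helper and REQUIRED_FIELDS map are hypothetical, not part of the pipelines.

```typescript
type Frontmatter = Record<string, unknown>;

// Required keys and their expected JavaScript types (inferred from the example, an assumption).
const REQUIRED_FIELDS: Record<string, "string" | "number" | "boolean"> = {
  sidebar_title: "string",
  page_title: "string",
  order: "number",
  visible_in_sidebar: "boolean",
};

// Return a list of human-readable problems; an empty list means the frontmatter passes.
function frontmatterErrors(data: Frontmatter): string[] {
  const errors: string[] = [];
  for (const key of Object.keys(REQUIRED_FIELDS)) {
    const expected = REQUIRED_FIELDS[key];
    const value = data[key];
    if (value === undefined) {
      errors.push(`missing field: ${key}`);
    } else if (typeof value !== expected) {
      errors.push(`${key} should be a ${expected}`);
    }
  }
  return errors;
}
```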

Assets

Assets are stored in the S3 bucket docs-mdx-assets and served at https://docs-assets.setu.co/latest/{path}, where {path} mirrors the content folder structure.
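
For illustration, the URL format can be wrapped in a small helper. The assetUrl function and the example path are hypothetical. Only the base URL comes from the text above.

```typescript
const ASSET_BASE_URL = "https://docs-assets.setu.co/latest";

// Build the public URL for an asset path that mirrors the content folder layout.
// Leading slashes are dropped and each path segment is URL-encoded.
function assetUrl(assetPath: string): string {
  const clean = assetPath.replace(/^\/+/, "");
  const encoded = clean
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return `${ASSET_BASE_URL}/${encoded}`;
}
```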

Key Constants

| Constant | Value | Location |
| --- | --- | --- |
| Target chunk size | 700 tokens | chunker.ts |
| Min/Max chunk | 500/700 tokens | chunker.ts |
| Hard max chunk | 900 tokens | chunker.ts |
| Min embeddable | 50 tokens | embedding-helpers.ts |
| Max embeddable | 1600 tokens | embedding-helpers.ts (Titan v2 supports 8192) |
| Force-embed patterns | `api-reference/payments/umap` | embedding-helpers.ts (bypasses max) |
| Sentence-split trigger | 1500 tokens | chunker.ts (splits chunks above this) |
| Overlap | 60-100 tokens | chunker.ts |
| Embedding dimensions | 1024 | embedder.ts (Titan v2) |
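
The embeddability limits above can be read as the following filter. This is a sketch, not the code in embedding-helpers.ts. The SizedChunk field names are hypothetical, and it assumes the force-embed override bypasses only the upper bound, as the table suggests.

```typescript
const MIN_EMBED_TOKENS = 50;
const MAX_EMBED_TOKENS = 1600;
const FORCE_EMBED_PATTERNS = ["api-reference/payments/umap"];

// Hypothetical field names for the example.
interface SizedChunk {
  source_path: string;
  token_count: number;
}

// Decide whether a chunk should be embedded.
// Assumption: force-embed patterns bypass the max but not the min.
function isEmbeddable(chunk: SizedChunk): boolean {
  if (chunk.token_count < MIN_EMBED_TOKENS) return false;
  if (chunk.token_count <= MAX_EMBED_TOKENS) return true;
  return FORCE_EMBED_PATTERNS.some((pattern) => chunk.source_path.indexOf(pattern) !== -1);
}
```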

CI Pipeline

.github/workflows/docs-ingestion-ci.yml runs on PRs to main/staging:

Build & test (TypeScript compile + Jest)
API spec normalization (determinism check — runs twice and diffs output)
Token limit compliance check
Ingestion smoke test
Embedding dry-run (validates without external API calls)
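
The determinism check in the second step boils down to running the same normalization twice and comparing outputs. A minimal sketch of that idea follows, assuming a normalizer can be modelled as a function from input text to a map of output file path to content. The Normalizer type and nondeterministicFiles helper are hypothetical, and the CI workflow itself may diff directories instead.

```typescript
// Hypothetical model of a normalizer: input text -> (output file path -> content).
type Normalizer = (input: string) => Map<string, string>;

// Run the normalizer twice on the same input and list files whose output differs
// or that appear in only one of the two runs.
function nondeterministicFiles(normalize: Normalizer, input: string): string[] {
  const first = normalize(input);
  const second = normalize(input);
  const paths = new Set<string>(Array.from(first.keys()).concat(Array.from(second.keys())));
  return Array.from(paths)
    .filter((path) => first.get(path) !== second.get(path))
    .sort();
}
```

An empty result means the two runs agreed on every file.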

Design Constraints

Deterministic: Ingestion pipeline produces identical output for identical input (no randomness, no LLM calls)
Incremental: Both pipelines use content hashing to skip unchanged content
Code blocks are never split by the chunker
Content separation: Pinecone stores only embeddings + metadata; actual content lives in S3
Both pipelines are ES Modules ("type": "module" in package.json)
Generated directories .docs-normalized/ and .api-reference-normalized/ are gitignored build artifacts
