This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Monorepo for Setu docs containing documentation content (MDX), API reference specs (OpenAPI), and two TypeScript pipelines that convert them into vector embeddings for a RAG-powered documentation copilot.
```sh
cd docs-ingestion
npm ci                        # Install dependencies
npm run build                 # TypeScript compilation
npm test                      # Jest tests (ESM mode, requires --experimental-vm-modules)
npm run ingest                # Build + run full ingestion pipeline
npm run normalize-api-specs   # Normalize OpenAPI specs → .api-reference-normalized/
npm run normalize-mdx         # Normalize MDX → .docs-normalized/
npm run check-token-limits    # Validate token limits on normalized output
npm run smoke-test-ingestion  # End-to-end smoke test
npm run dev                   # Run via tsx without build step
npm run upload-content        # Upload chunks to S3
```

Running a single test file:

```sh
cd docs-ingestion
node --experimental-vm-modules node_modules/jest/bin/jest.js src/normalize-mdx.test.ts
```

```sh
cd docs-embeddings
npm ci && npm run build
npm run sync          # Build + run incremental embedding sync
npm run dry-run       # Validate without external API calls
npm run embed-all     # Full re-embedding
npm run verify-embed  # Verification script
```

Data flow:

```
content/*.mdx  +  api-references/*.json
      │                   │
      ▼                   ▼
 normalize-mdx     normalize-api-specs
      │                   │
      ▼                   ▼
.docs-normalized/  .api-reference-normalized/
      └─────────┬─────────┘
                ▼
     docs-ingestion pipeline
     (scan → parse → chunk → metadata → deduplicate)
                │
                ▼
        output/chunks.json
                │
                ▼
     docs-embeddings pipeline
     (filter → embed via Bedrock → upsert Pinecone + upload S3)
```
- Pinecone — embeddings + metadata references (no chunk content stored)
- S3 — actual chunk content, keyed by `content_hash`
- Claude — retrieval + answer generation
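The content separation implies a two-step retrieval at answer time: query Pinecone for matching `content_hash` values, then fetch the actual chunk text from S3. A minimal sketch of that join — the store interfaces and field names here are assumptions for illustration, not the repo's actual clients:

```typescript
// Hypothetical retrieval join: the vector store returns hashes + scores,
// the content store (S3) holds the chunk bodies keyed by content_hash.
interface VectorMatch { contentHash: string; score: number }

interface VectorStore {
  query(embedding: number[], topK: number): Promise<VectorMatch[]>;
}
interface ContentStore {
  get(contentHash: string): Promise<string>;
}

async function retrieve(
  vectors: VectorStore,
  content: ContentStore,
  queryEmbedding: number[],
  topK = 5,
): Promise<{ text: string; score: number }[]> {
  const matches = await vectors.query(queryEmbedding, topK);
  // Join each match with its chunk body, fetched by content_hash.
  return Promise.all(
    matches.map(async (m) => ({
      text: await content.get(m.contentHash),
      score: m.score,
    })),
  );
}
```

The design keeps Pinecone records small and lets chunk text be updated in S3 without touching vectors, at the cost of one extra fetch per match.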
docs-ingestion:

- `index.ts` — Pipeline orchestrator; handles both MDX and API spec ingestion
- `scanner.ts` — Recursive `.md`/`.mdx` file discovery
- `parser.ts` — MDX frontmatter extraction + markdown AST parsing
- `chunker.ts` — Token-based splitting (target 700 tokens, min 500) that respects semantic boundaries; never splits code blocks
- `metadata.ts` — Enriches chunks with SHA256 hashes, URLs, git info, product/category
- `deduplication.ts` — Incremental updates via content-hash comparison
- `text-cleaner.ts` — Deduplicates paragraphs/sentences, normalizes whitespace
- `embedding-helpers.ts` — Filters chunks by embeddability (50–1600 tokens, with force-embed overrides for critical content)
- `normalize-mdx.ts` — Strips JSX/HTML from MDX, produces clean markdown
- `normalize-api-specs.ts` — Converts OpenAPI 3.x/Swagger 2.0 specs to markdown with RAG metadata
- `types.ts` — Core interfaces: `DocumentChunk`, `PipelineConfig`, `ChunkingConfig`
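As a concrete example of the metadata step, a SHA256 content hash like the one `metadata.ts` attaches can be computed with Node's built-in crypto module. The chunk shape below is illustrative only — the real `DocumentChunk` in `types.ts` has more fields:

```typescript
import { createHash } from "node:crypto";

// Illustrative chunk shape; the real DocumentChunk has more fields.
interface ChunkDraft { text: string; sourcePath: string }
interface HashedChunk extends ChunkDraft { contentHash: string }

function withContentHash(chunk: ChunkDraft): HashedChunk {
  // Hash only the text, so metadata-only edits (e.g. a moved file)
  // keep the same hash and don't force re-embedding.
  const contentHash = createHash("sha256")
    .update(chunk.text, "utf8")
    .digest("hex");
  return { ...chunk, contentHash };
}
```

Hashing the text alone (rather than text + path) is what makes the downstream deduplication and incremental sync work across file moves.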
docs-embeddings:

- `sync.ts` — Incremental sync: embeds only new `content_hash` values; fetches existing embeddings for metadata-only updates
- `embedder.ts` — AWS Bedrock client (Amazon Titan Text Embeddings v2, 1024-dim)
- `vector-db.ts` — Pinecone client (upsert, fetch, delete, list)
- `content-uploader.ts` — S3 upload keyed by `content_hash`
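The incremental behaviour of `sync.ts` boils down to a set difference on content hashes. A minimal sketch of that partition — the function and field names are assumptions, not the actual implementation:

```typescript
// Hypothetical partition of chunks into "needs a new embedding" vs
// "already embedded, metadata-only update".
interface SyncChunk { contentHash: string }

function partitionForSync<T extends SyncChunk>(
  chunks: T[],
  existingHashes: Set<string>,
): { toEmbed: T[]; metadataOnly: T[] } {
  const toEmbed: T[] = [];
  const metadataOnly: T[] = [];
  for (const chunk of chunks) {
    // Only hashes the vector DB has never seen need a Bedrock call.
    (existingHashes.has(chunk.contentHash) ? metadataOnly : toEmbed).push(chunk);
  }
  return { toEmbed, metadataOnly };
}
```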
- Content lives in `content/{category}/{product}/` — categories: `payments`, `data`, `dev-tools`
- API specs live in `api-references/{category}/` (JSON/YAML OpenAPI files)
- `endpoints.json` — Product catalog (categories, products, versions, visibility)
- `menuItems.json` — Auto-generated sidebar structure (do not edit manually)
- `redirects.json` — URL redirect mappings
Default-version content lives in the product root folder; older versions go in versioned subfolders (e.g., `account-aggregator/v1/`). Versions are configured in `endpoints.json` via the `versions` and `default_version` fields.
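Under that layout, mapping a product and requested version to a content folder is a small lookup. A hedged sketch — the `ProductEntry` shape mirrors the two fields named above but is not the actual endpoints.json schema:

```typescript
// Hypothetical resolver: the default version lives at the product root,
// older versions in versioned subfolders like account-aggregator/v1/.
interface ProductEntry {
  versions: string[];
  default_version: string;
}

function contentFolder(
  productPath: string,
  entry: ProductEntry,
  version: string,
): string {
  if (!entry.versions.includes(version)) {
    throw new Error(`unknown version ${version} for ${productPath}`);
  }
  return version === entry.default_version
    ? productPath                    // default version: root folder
    : `${productPath}/${version}`;   // older version: subfolder
}
```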
Every MDX file requires frontmatter:

```yaml
---
sidebar_title: Page Title
page_title: Full Page Title — Setu Docs
order: 0
visible_in_sidebar: true
---
```

Assets are stored in the S3 bucket `docs-mdx-assets`. URL format: `https://docs-assets.setu.co/latest/{path}`, where the path mirrors the content folder structure.
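One quick way to enforce those required keys is a small frontmatter check. The sketch below does a naive `---`-delimited scan rather than full YAML parsing, and the helper name is hypothetical (the real `parser.ts` uses proper MDX parsing):

```typescript
// Naive frontmatter key check: collects "key:" names between the two
// leading "---" fences. Illustrative only — not a YAML parser.
const REQUIRED_KEYS = ["sidebar_title", "page_title", "order", "visible_in_sidebar"];

function missingFrontmatterKeys(mdx: string): string[] {
  const match = mdx.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return [...REQUIRED_KEYS]; // no frontmatter block at all
  const keys = new Set(
    match[1]
      .split("\n")
      .map((line) => line.split(":")[0].trim())
      .filter(Boolean),
  );
  return REQUIRED_KEYS.filter((k) => !keys.has(k));
}
```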
| Constant | Value | Location |
|---|---|---|
| Target chunk size | 700 tokens | chunker.ts |
| Min/Max chunk | 500/700 tokens | chunker.ts |
| Hard max chunk | 900 tokens | chunker.ts |
| Min embeddable | 50 tokens | embedding-helpers.ts |
| Max embeddable | 1600 tokens | embedding-helpers.ts (Titan v2 supports 8192) |
| Force-embed patterns | `api-reference/payments/umap` | embedding-helpers.ts (bypasses max) |
| Sentence-split trigger | 1500 tokens | chunker.ts (splits chunks above this) |
| Overlap | 60-100 tokens | chunker.ts |
| Embedding dimensions | 1024 | embedder.ts (Titan v2) |
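The table's limits compose into a simple per-chunk decision. The sketch below shows one plausible way they interact — only the constants come from the table; the classifier logic itself is an illustration, not the actual `chunker.ts`/`embedding-helpers.ts` code:

```typescript
// Constants from the table above; the decision logic is illustrative.
const MIN_CHUNK = 500;
const HARD_MAX_CHUNK = 900;
const SENTENCE_SPLIT_TRIGGER = 1500;
const MIN_EMBEDDABLE = 50;
const MAX_EMBEDDABLE = 1600;

type ChunkFate = "merge-up" | "keep" | "resplit" | "sentence-split";

function chunkFate(tokens: number): ChunkFate {
  if (tokens < MIN_CHUNK) return "merge-up";    // too small: merge with a neighbor
  if (tokens <= HARD_MAX_CHUNK) return "keep";  // within 500–900: acceptable
  if (tokens <= SENTENCE_SPLIT_TRIGGER) return "resplit";
  return "sentence-split";                      // >1500: fall back to sentence splits
}

function isEmbeddable(tokens: number, forceEmbed = false): boolean {
  if (tokens < MIN_EMBEDDABLE) return false;      // too small to embed usefully
  return forceEmbed || tokens <= MAX_EMBEDDABLE;  // force-embed bypasses only the max
}
```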
.github/workflows/docs-ingestion-ci.yml runs on PRs to main/staging:
- Build & test (TypeScript compile + Jest)
- API spec normalization (determinism check — runs twice and diffs output)
- Token limit compliance check
- Ingestion smoke test
- Embedding dry-run (validates without external API calls)
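The determinism check amounts to running normalization twice on the same input and diffing. The same idea expressed as a test helper — a sketch only; the actual workflow performs this at the shell level by diffing output directories:

```typescript
// Sketch of the CI determinism check: run a normalizer twice on the
// same input and require byte-identical output.
function assertDeterministic(
  normalize: (input: string) => string,
  input: string,
): void {
  const first = normalize(input);
  const second = normalize(input);
  if (first !== second) {
    throw new Error("normalization is not deterministic for this input");
  }
}
```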
- Deterministic: Ingestion pipeline produces identical output for identical input (no randomness, no LLM calls)
- Incremental: Both pipelines use content hashing to skip unchanged content
- Code blocks are never split by the chunker
- Content separation: Pinecone stores only embeddings + metadata; actual content lives in S3
- Both pipelines are ES Modules (`"type": "module"` in package.json)
- Generated directories `.docs-normalized/` and `.api-reference-normalized/` are gitignored build artifacts