Deduplicate USC identifiers within import batches #174

Merged: mustyoshi merged 1 commit into master from claude/fix-chromadb-importer-chjJs on Feb 27, 2026

Conversation

@mustyoshi
Collaborator

Summary

Added deduplication logic to prevent processing duplicate USC identifiers within a single import batch in the Chroma US Code importer.

Key Changes

  • Introduced a seen_in_batch set to track USC identifiers already processed in the current batch
  • Added a check to skip rows with duplicate usc_ident values and increment the skip counter
  • Duplicates are detected and skipped before document building, so no document is constructed for a row that would be discarded

Implementation Details

The deduplication occurs early in the row processing loop, before the build_document() call. This ensures that if the same USC identifier appears multiple times within a batch, only the first occurrence is processed and subsequent duplicates are skipped without unnecessary document construction.
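
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: `dedupe_batch` is a hypothetical helper name, rows are assumed to be dict-like with a `usc_ident` key, and `build_document` is passed in as a callable because its real signature is not shown in this PR.

```python
from typing import Any, Callable, Iterable


def dedupe_batch(
    rows: Iterable[dict[str, Any]],
    build_document: Callable[[dict[str, Any]], Any],
) -> tuple[list[Any], int]:
    """Build documents for a batch, skipping repeated usc_ident values.

    Hypothetical helper: the real importer may structure this differently.
    Returns the built documents and the number of rows skipped as duplicates.
    """
    seen_in_batch: set[str] = set()  # identifiers already handled in this batch
    documents: list[Any] = []
    skipped = 0

    for row in rows:
        ident = row["usc_ident"]
        if ident in seen_in_batch:
            # Duplicate within the batch: count it and skip before any
            # document construction happens.
            skipped += 1
            continue
        seen_in_batch.add(ident)
        documents.append(build_document(row))

    return documents, skipped
```

Because the set is created inside the function, it only tracks duplicates within one batch, which matches the scope described in this PR. The first occurrence of each identifier wins, and the input order of the surviving rows is preserved.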

https://claude.ai/code/session_01PaWRuuLei9GtmCUSBUGMfP

…n each batch

The JOIN to usc_section/usc_chapter in fetch_sections_batch can produce
multiple rows with the same usc_ident (e.g. /us/usc/t10/s20251). ChromaDB's
upsert rejects batches that contain duplicate IDs. Added a seen_in_batch set
to skip any repeated ident within a single batch before calling upsert.
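
To check whether a batch would hit the duplicate-ID problem described above, the identifiers can be counted before the upsert call. The sketch below is an assumption-labelled diagnostic only: `find_duplicate_ids` is a hypothetical helper that is not part of this change, and it does not call ChromaDB.

```python
from collections import Counter
from typing import Iterable


def find_duplicate_ids(ids: Iterable[str]) -> dict[str, int]:
    """Return each identifier that appears more than once, with its count.

    Hypothetical diagnostic helper; an empty dict means the batch has
    no repeated IDs.
    """
    counts = Counter(ids)
    return {ident: n for ident, n in counts.items() if n > 1}
```

For example, a batch of identifiers in which `/us/usc/t10/s20251` appears twice would yield a single entry for that identifier with a count of 2.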

https://claude.ai/code/session_01PaWRuuLei9GtmCUSBUGMfP
@mustyoshi merged commit 883d26f into master on Feb 27, 2026
1 check passed
@mustyoshi deleted the claude/fix-chromadb-importer-chjJs branch on February 27, 2026 02:22