
full rebuild of knowledge base code with hybrid search and ampere docs #67

Merged
JoeStech merged 3 commits into main from add_ampere_docs
Mar 20, 2026

Conversation

@JoeStech
Member

No description provided.

Copilot AI review requested due to automatic review settings March 20, 2026 21:54
Contributor

Copilot AI left a comment


Pull request overview

Rebuilds the knowledge base ingestion and retrieval pipeline by adding structured chunking (markdown/html/pdf), expanding the source corpus with Ampere documentation, and switching runtime retrieval to a dense+sparse hybrid search with heuristic reranking.

Changes:

  • Add structured document parsing/chunking (incl. PDF) and enrich chunk metadata (heading paths, doc_type, product, version, resolved_url, content_type).
  • Introduce hybrid retrieval: USearch dense + BM25 sparse fused via RRF and reranked with intent/entity heuristics.
  • Expand sources/evaluation artifacts (Ampere docs + retrieval eval script and question set).
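The RRF fusion step described above can be sketched with a minimal helper (a sketch only; the function name and `k=60` default are illustrative, not the PR's actual API in search_utils.py):

```python
from typing import Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids via Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain
    it; k=60 is the constant from the original RRF paper.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense and sparse retrievers often disagree; RRF rewards documents
# that rank highly in either list without needing score calibration.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
print(rrf_fuse([dense, sparse]))
```

Because RRF uses only ranks, it sidesteps the problem of putting cosine similarities and BM25 scores on a common scale before fusing.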

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Summary per file:

  • mcp-local/utils/search_utils.py: Adds BM25, RRF fusion, heuristic reranking, and updated deduplication helpers.
  • mcp-local/server.py: Uses hybrid search, adds embedding-model cache handling, and returns richer result fields.
  • mcp-local/requirements.txt: Adds the rank-bm25 dependency.
  • mcp-local/Dockerfile: Adds HF/SentenceTransformers cache env vars and pre-downloads the embedding model during build.
  • embedding-generation/vector-db-sources.csv: Adds Ampere reference architectures, tuning guides, tutorials, briefs, etc.
  • embedding-generation/tests/test_generate_chunks.py: Updates tests for enriched chunk dict fields and structured chunking behaviors.
  • embedding-generation/requirements.txt: Adds pypdf and rank-bm25.
  • embedding-generation/local_vectorstore_creation.py: Builds search_text and new metadata fields; removes embedding vectors from metadata JSON; adds cache-folder usage.
  • embedding-generation/generate-chunks.py: Refactors ingestion to structured parsing and per-source chunking; adds URL normalization; introduces a fetch helper and discovery toggles.
  • embedding-generation/document_chunking.py: New structured parsers (md/html/pdf) and chunking utilities producing retrieval-friendly chunk content.
  • embedding-generation/evaluate_retrieval.py: New CLI to evaluate retrieval hit@k/MRR over the local metadata/index.
  • embedding-generation/eval_questions.json: New eval dataset for retrieval checks.
  • embedding-generation/Dockerfile: Adds model pre-download and new args, and includes the structured chunking module in the image.


Comment on lines +155 to +157
ranking = np.argsort(scores)[::-1]
results: List[Dict[str, Any]] = []
for rank, idx in enumerate(ranking[:k], start=1):

Copilot AI Mar 20, 2026


bm25_search sorts the full score array with np.argsort(scores) for every query, which is O(N log N) over the entire corpus. Since only top-k results are needed, use a top-k selection approach (e.g., np.argpartition / partial sort) to reduce query-time overhead for large metadata sets.

Suggested change
ranking = np.argsort(scores)[::-1]
results: List[Dict[str, Any]] = []
for rank, idx in enumerate(ranking[:k], start=1):
# Use a top-k selection instead of fully sorting all scores to improve performance.
scores = np.asarray(scores)
if scores.size == 0 or k <= 0:
return []
k = min(k, scores.size)
# Get indices of the top-k scores (in arbitrary order)
topk_indices = np.argpartition(scores, -k)[-k:]
# Sort the top-k indices by score in descending order
ranking = topk_indices[np.argsort(scores[topk_indices])[::-1]]
results: List[Dict[str, Any]] = []
for rank, idx in enumerate(ranking, start=1):
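The equivalence of the partial-selection approach to a full sort is easy to check with a standalone demo (not the PR's code; random scores stand in for BM25 output):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)  # stand-in for per-document BM25 scores
k = 10

# Full sort: O(N log N) over the whole corpus
full = np.argsort(scores)[::-1][:k]

# Partial selection: O(N) to isolate the top-k, then sort only those k
topk = np.argpartition(scores, -k)[-k:]
partial = topk[np.argsort(scores[topk])[::-1]]

print(np.array_equal(full, partial))
```

With distinct float scores the two orderings agree exactly; only the work done to produce them differs.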

Comment on lines +85 to 87
search_results = hybrid_search(query, USEARCH_INDEX, METADATA, EMBEDDING_MODEL, BM25_INDEX)
deduped = deduplicate_urls(search_results)
# Only return the relevant fields

Copilot AI Mar 20, 2026


knowledge_base_search now formats and returns whatever hybrid_search yields after URL deduping, but it never caps the response size. This can return tens or hundreds of results (especially given hybrid_search's candidate depth), increasing payload size and response time; slice to the configured top-k after deduping (or pass an explicit max to deduplicate_urls).
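A cap can be folded into the dedup pass itself, as in this sketch (the function name, the `resolved_url` field, and the `TOP_K` constant are assumptions based on the fields this PR describes, not the actual deduplicate_urls implementation):

```python
from typing import Any, Dict, List

TOP_K = 5  # illustrative cap; the PR's configured top-k may differ

def dedupe_by_url(results: List[Dict[str, Any]], max_results: int) -> List[Dict[str, Any]]:
    """Keep the first (highest-ranked) hit per URL, stopping at max_results."""
    seen = set()
    deduped: List[Dict[str, Any]] = []
    for item in results:
        url = item.get("resolved_url")
        if url in seen:
            continue
        seen.add(url)
        deduped.append(item)
        if len(deduped) >= max_results:
            break  # response size is bounded regardless of candidate depth
    return deduped

hits = [{"resolved_url": u} for u in ["a", "a", "b", "c", "b", "d", "e", "f"]]
print(len(dedupe_by_url(hits, TOP_K)))
```

Breaking out early also avoids scanning the full candidate list once the cap is reached.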

Comment on lines +769 to 775
except Exception as err:
print(f"Other error occurred: {err}")
with open('info/errors.csv', 'a', newline='') as csvfile:
csv_writer = csv.writer(csvfile)
csv_writer.writerow([url, str(err)])
return None
except Exception as err:

Copilot AI Mar 20, 2026


fetch_with_logging has a duplicated except Exception block; the second one is unreachable and also changes the return type (returns False vs None). Remove the duplicate handler and keep a single consistent error path/return type.
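A consolidated error path might look like this sketch (the function name and CSV path follow the snippet above; the `errors_path` parameter and `os.makedirs` call are additions for testability, not part of the PR):

```python
import csv
import os
import requests

def fetch_with_logging(url: str, errors_path: str = "info/errors.csv"):
    """Fetch a URL; on any failure, log one CSV row and return None."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # turns 4xx/5xx into HTTPError
        return response
    except Exception as err:  # one handler covers HTTP and all other errors
        print(f"Error fetching {url}: {err}")
        os.makedirs(os.path.dirname(errors_path) or ".", exist_ok=True)
        with open(errors_path, "a", newline="") as csvfile:
            csv.writer(csvfile).writerow([url, str(err)])
        return None  # single, consistent error return type
```

Since HTTPError is a subclass of Exception, one handler preserves the logging behavior of both original branches while keeping the return type uniform.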

Comment on lines +763 to +767
except requests.exceptions.HTTPError as http_err:
print(f"HTTP error occurred: {http_err}")
with open('info/errors.csv', 'a', newline='') as csvfile:
csv_writer = csv.writer(csvfile)
csv_writer.writerow([url, str(http_err)])

Copilot AI Mar 20, 2026


fetch_with_logging writes to a hard-coded info/errors.csv, but this PR also makes CHUNK_DETAILS_FILE (and its directory) configurable. If details_file is pointed elsewhere, info/ may not exist and error logging can raise FileNotFoundError. Consider writing errors next to details_file (or ensuring the info directory exists before opening).
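One fix sketched here places the error log next to the configured details file (`errors_path_for` is a hypothetical helper; `details_file` follows the naming in the comment above):

```python
import os

def errors_path_for(details_file: str) -> str:
    """Derive an errors.csv path alongside the configured chunk-details
    file, creating the directory so a later append cannot raise
    FileNotFoundError when details_file points somewhere new."""
    directory = os.path.dirname(details_file) or "."
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, "errors.csv")
```

Computing this once at startup and passing it to the fetch helper keeps the error log co-located with the rest of the run's output.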

JoeStech and others added 2 commits March 20, 2026 16:24
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@JoeStech JoeStech merged commit 360a640 into main Mar 20, 2026
5 checks passed