
full rebuild of knowledge base code with hybrid search and ampere docs #67

Merged
JoeStech merged 3 commits into main from add_ampere_docs
Mar 20, 2026

Conversation

@JoeStech
Member

No description provided.

Copilot AI review requested due to automatic review settings March 20, 2026 21:54
Contributor

Copilot AI left a comment


Pull request overview

Rebuilds the knowledge base ingestion and retrieval pipeline by adding structured chunking (markdown/html/pdf), expanding the source corpus with Ampere documentation, and switching runtime retrieval to a dense+sparse hybrid search with heuristic reranking.

Changes:

  • Add structured document parsing/chunking (incl. PDF) and enrich chunk metadata (heading paths, doc_type, product, version, resolved_url, content_type).
  • Introduce hybrid retrieval: USearch dense + BM25 sparse fused via RRF and reranked with intent/entity heuristics.
  • Expand sources/evaluation artifacts (Ampere docs + retrieval eval script and question set).
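The RRF fusion step described above can be sketched with a minimal helper (a sketch only; the function name and `k=60` default are illustrative, not the PR's actual API in search_utils.py):

```python
from typing import Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids via Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain
    it; k=60 is the constant from the original RRF paper.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense and sparse retrievers often disagree; RRF rewards documents
# that rank highly in either list without needing score calibration.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
print(rrf_fuse([dense, sparse]))
```

Because RRF uses only ranks, it sidesteps the problem of putting cosine similarities and BM25 scores on a common scale before fusing.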

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Summary per file:

  • mcp-local/utils/search_utils.py: Adds BM25, RRF fusion, heuristic reranking, and updated deduplication helpers.
  • mcp-local/server.py: Uses hybrid search, adds embedding-model cache handling, and returns richer result fields.
  • mcp-local/requirements.txt: Adds the rank-bm25 dependency.
  • mcp-local/Dockerfile: Adds HF/SentenceTransformers cache env vars and pre-downloads the embedding model during build.
  • embedding-generation/vector-db-sources.csv: Adds Ampere reference architectures, tuning guides, tutorials, briefs, etc.
  • embedding-generation/tests/test_generate_chunks.py: Updates tests for enriched chunk dict fields and structured chunking behaviors.
  • embedding-generation/requirements.txt: Adds pypdf and rank-bm25.
  • embedding-generation/local_vectorstore_creation.py: Builds search_text and new metadata fields; removes embedding vectors from metadata JSON; adds cache-folder usage.
  • embedding-generation/generate-chunks.py: Refactors ingestion to structured parsing and per-source chunking; adds URL normalization; introduces a fetch helper and discovery toggles.
  • embedding-generation/document_chunking.py: New structured parsers (md/html/pdf) and chunking utilities producing retrieval-friendly chunk content.
  • embedding-generation/evaluate_retrieval.py: New CLI to evaluate retrieval hit@k/MRR over the local metadata/index.
  • embedding-generation/eval_questions.json: New eval dataset for retrieval checks.
  • embedding-generation/Dockerfile: Adds model pre-download and new args, and includes the structured chunking module in the image.


Comment on lines +155 to +157
ranking = np.argsort(scores)[::-1]
results: List[Dict[str, Any]] = []
for rank, idx in enumerate(ranking[:k], start=1):

Copilot AI Mar 20, 2026


bm25_search sorts the full score array with np.argsort(scores) for every query, which is O(N log N) over the entire corpus. Since only top-k results are needed, use a top-k selection approach (e.g., np.argpartition / partial sort) to reduce query-time overhead for large metadata sets.

Suggested change
ranking = np.argsort(scores)[::-1]
results: List[Dict[str, Any]] = []
for rank, idx in enumerate(ranking[:k], start=1):
# Use a top-k selection instead of fully sorting all scores to improve performance.
scores = np.asarray(scores)
if scores.size == 0 or k <= 0:
return []
k = min(k, scores.size)
# Get indices of the top-k scores (in arbitrary order)
topk_indices = np.argpartition(scores, -k)[-k:]
# Sort the top-k indices by score in descending order
ranking = topk_indices[np.argsort(scores[topk_indices])[::-1]]
results: List[Dict[str, Any]] = []
for rank, idx in enumerate(ranking, start=1):
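The equivalence of the partial-selection approach to a full sort is easy to check with a standalone demo (not the PR's code; random scores stand in for BM25 output):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)  # stand-in for per-document BM25 scores
k = 10

# Full sort: O(N log N) over the whole corpus
full = np.argsort(scores)[::-1][:k]

# Partial selection: O(N) to isolate the top-k, then sort only those k
topk = np.argpartition(scores, -k)[-k:]
partial = topk[np.argsort(scores[topk])[::-1]]

print(np.array_equal(full, partial))
```

With distinct float scores the two orderings agree exactly; only the work done to produce them differs.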

Comment on lines +85 to 87
search_results = hybrid_search(query, USEARCH_INDEX, METADATA, EMBEDDING_MODEL, BM25_INDEX)
deduped = deduplicate_urls(search_results)
# Only return the relevant fields

Copilot AI Mar 20, 2026


knowledge_base_search now formats and returns whatever hybrid_search yields after URL deduping, but it never caps the response size. This can return tens or hundreds of results (especially given hybrid_search's candidate depth), increasing payload size and response time; slice to the configured top-k after deduping (or pass an explicit max to deduplicate_urls).
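A cap can be folded into the dedup pass itself, as in this sketch (the function name, the `resolved_url` field, and the `TOP_K` constant are assumptions based on the fields this PR describes, not the actual deduplicate_urls implementation):

```python
from typing import Any, Dict, List

TOP_K = 5  # illustrative cap; the PR's configured top-k may differ

def dedupe_by_url(results: List[Dict[str, Any]], max_results: int) -> List[Dict[str, Any]]:
    """Keep the first (highest-ranked) hit per URL, stopping at max_results."""
    seen = set()
    deduped: List[Dict[str, Any]] = []
    for item in results:
        url = item.get("resolved_url")
        if url in seen:
            continue
        seen.add(url)
        deduped.append(item)
        if len(deduped) >= max_results:
            break  # response size is bounded regardless of candidate depth
    return deduped

hits = [{"resolved_url": u} for u in ["a", "a", "b", "c", "b", "d", "e", "f"]]
print(len(dedupe_by_url(hits, TOP_K)))
```

Breaking out early also avoids scanning the full candidate list once the cap is reached.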

Comment on lines +769 to 775
except Exception as err:
print(f"Other error occurred: {err}")
with open('info/errors.csv', 'a', newline='') as csvfile:
csv_writer = csv.writer(csvfile)
csv_writer.writerow([url, str(err)])
return None
except Exception as err:

Copilot AI Mar 20, 2026


fetch_with_logging has a duplicated except Exception block; the second one is unreachable and also changes the return type (returns False vs None). Remove the duplicate handler and keep a single consistent error path/return type.
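A consolidated error path might look like this sketch (the function name and CSV path follow the snippet above; the `errors_path` parameter and `os.makedirs` call are additions for testability, not part of the PR):

```python
import csv
import os
import requests

def fetch_with_logging(url: str, errors_path: str = "info/errors.csv"):
    """Fetch a URL; on any failure, log one CSV row and return None."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # turns 4xx/5xx into HTTPError
        return response
    except Exception as err:  # one handler covers HTTP and all other errors
        print(f"Error fetching {url}: {err}")
        os.makedirs(os.path.dirname(errors_path) or ".", exist_ok=True)
        with open(errors_path, "a", newline="") as csvfile:
            csv.writer(csvfile).writerow([url, str(err)])
        return None  # single, consistent error return type
```

Since HTTPError is a subclass of Exception, one handler preserves the logging behavior of both original branches while keeping the return type uniform.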

Comment on lines +763 to +767
except requests.exceptions.HTTPError as http_err:
print(f"HTTP error occurred: {http_err}")
with open('info/errors.csv', 'a', newline='') as csvfile:
csv_writer = csv.writer(csvfile)
csv_writer.writerow([url, str(http_err)])

Copilot AI Mar 20, 2026


fetch_with_logging writes to a hard-coded info/errors.csv, but this PR also makes CHUNK_DETAILS_FILE (and its directory) configurable. If details_file is pointed elsewhere, info/ may not exist and error logging can raise FileNotFoundError. Consider writing errors next to details_file (or ensuring the info directory exists before opening).
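One fix sketched here places the error log next to the configured details file (`errors_path_for` is a hypothetical helper; `details_file` follows the naming in the comment above):

```python
import os

def errors_path_for(details_file: str) -> str:
    """Derive an errors.csv path alongside the configured chunk-details
    file, creating the directory so a later append cannot raise
    FileNotFoundError when details_file points somewhere new."""
    directory = os.path.dirname(details_file) or "."
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, "errors.csv")
```

Computing this once at startup and passing it to the fetch helper keeps the error log co-located with the rest of the run's output.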

JoeStech and others added 2 commits March 20, 2026 16:24
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@JoeStech JoeStech merged commit 360a640 into main Mar 20, 2026
5 checks passed