feat: code vectorization pipeline and search-code MCP tool #136

Merged
github-actions[bot] merged 2 commits into master from feature/code-vectorization-mcp
Mar 6, 2026

Conversation

Contributor

@jordanpartridge jordanpartridge commented Mar 6, 2026

Summary

  • Add vectorize-code command to embed tree-sitter symbols into Qdrant's code collection for semantic search
  • Add search-code-tool MCP tool exposing semantic code search to Claude Desktop and Claude Code
  • Update SymbolIndexService for macOS Python 3.12 with 10k file limit and 600s timeout
  • CodeIndexerService gains indexSymbol() and vectorizeFromIndex() with --kind and --language filters

Test plan

  • ./know vectorize-code local/pstrax-laravel --kind=class --language=php embeds PHP classes into Qdrant
  • MCP search-code-tool returns semantic results from the code collection
  • ./know index-code /path/to/repo works with jcodemunch on Python 3.12
  • MCP server lists all 6 tools including search-code-tool

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a new command-line tool to vectorize code repositories for semantic indexing.
    • Introduced semantic code search functionality with support for repository and language filtering.
    • Enhanced knowledge server with integrated code search capabilities.

Tree-sitter symbols can now be embedded into Qdrant for semantic code
search. Adds vectorize-code command with --kind and --language filters,
a search-code-tool exposed via MCP, and updates SymbolIndexService for
macOS Python 3.12 with a 10k file limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 6, 2026

Walkthrough

The PR introduces a complete code vectorization and semantic search workflow: a new console command vectorizes tree-sitter symbols into Qdrant, a new MCP tool enables semantic code search, CodeIndexerService gains indexing and batch-vectorization methods, and SymbolIndexService's Python environment is optimized.

Changes

| Cohort / File(s) | Summary |
|---|---|
| New Vectorization Command: app/Commands/VectorizeCodeCommand.php | New Laravel Zero console command that orchestrates vectorization of indexed symbols into Qdrant. Validates repo argument, loads tree-sitter index file, applies optional kind/language filters, reports progress every 100 items, and returns success/failure exit codes. |
| MCP Code Search Tool: app/Mcp/Tools/SearchCodeTool.php, app/Mcp/Servers/KnowledgeServer.php | New SearchCodeTool for semantic code search with query validation (min 2 chars), result limit capping (max 20), and filter support (repo/language). Includes JSON schema for input validation. Integrated into KnowledgeServer's tools array. |
| Code Indexing Services: app/Services/CodeIndexerService.php, app/Services/SymbolIndexService.php | CodeIndexerService adds indexSymbol() for single-symbol vectorization and vectorizeFromIndex() for batch processing with optional progress callback; updated search result payload to include symbol metadata (name, kind, signature). SymbolIndexService increases Python subprocess timeout to 600s and uses explicit Python 3.12 interpreter path with enhanced environment configuration. |

Sequence Diagrams

sequenceDiagram
    participant CLI as CLI / User
    participant VCmd as VectorizeCodeCommand
    participant SI as SymbolIndexService
    participant CI as CodeIndexerService
    participant Qdrant as Qdrant Vector DB
    
    CLI->>VCmd: Execute with repo, kinds, language
    VCmd->>VCmd: Load index from ~/.code-index/{repo-slug}.json
    VCmd->>CI: Ensure code collection exists
    CI->>Qdrant: Verify/Create collection
    Qdrant-->>CI: Collection ready
    
    VCmd->>CI: vectorizeFromIndex(indexPath, repo, kinds, language, onProgress)
    loop For each symbol in index
        CI->>SI: Detect language
        SI-->>CI: Language detected
        CI->>CI: indexSymbol(text, filepath, repo, language, ...)
        CI->>CI: Generate embedding
        CI->>Qdrant: Upsert vector + payload
        Qdrant-->>CI: Success/Error
        CI-->>VCmd: Progress callback (every 100 items)
    end
    CI-->>VCmd: Return {success, total, failed}
    VCmd->>CLI: Exit with status code
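
The batch flow in the diagram above can be sketched roughly as follows. This is a Python illustration rather than the project's PHP, and it is not the PR's implementation: the extension map, the `index_symbol` callable, and the way the filters are applied are all assumptions made for the example. Only the kind/language filters, the progress callback every 100 items, and the `success`/`failed`/`total` result keys come from the PR discussion.

```python
import os
from typing import Callable, Iterable, Optional

# Hypothetical extension-to-language map; the real detectLanguage() may differ.
EXTENSION_LANGUAGES = {"php": "php", "py": "python", "js": "javascript", "ts": "typescript"}


def detect_language(filepath: str) -> str:
    """Guess a language from the file extension (assumed behaviour)."""
    ext = os.path.splitext(filepath)[1].lstrip(".").lower()
    return EXTENSION_LANGUAGES.get(ext, "unknown")


def vectorize_from_index(
    symbols: Iterable[dict],
    index_symbol: Callable[[dict], bool],
    kinds: tuple = (),
    language: Optional[str] = None,
    on_progress: Optional[Callable[[int, int], None]] = None,
    progress_every: int = 100,
) -> dict:
    """Filter symbols, index each one, and report progress periodically."""
    selected = []
    for symbol in symbols:
        if kinds and symbol.get("kind") not in kinds:
            continue  # kind filter
        if language is not None and detect_language(symbol.get("file", "")) != language:
            continue  # language filter; the parameter itself is never reassigned
        selected.append(symbol)

    result = {"success": 0, "failed": 0, "total": len(selected)}
    for done, symbol in enumerate(selected, start=1):
        # index_symbol stands in for "embed + upsert" and returns True on success.
        result["success" if index_symbol(symbol) else "failed"] += 1
        if on_progress is not None and done % progress_every == 0:
            on_progress(done, result["total"])
    return result
```

Filtering before counting keeps `total` equal to the number of symbols actually attempted, which makes the final "N/total vectorized" message easier to interpret.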
sequenceDiagram
    participant User as MCP Client / User
    participant SearchTool as SearchCodeTool
    participant CI as CodeIndexerService
    participant Qdrant as Qdrant Vector DB
    
    User->>SearchTool: handle(query, repo?, language?, limit?)
    SearchTool->>SearchTool: Validate query (min 2 chars)
    SearchTool->>SearchTool: Cap limit at 20
    SearchTool->>SearchTool: Build filters from repo/language
    SearchTool->>CI: search(query, limit, filters)
    CI->>Qdrant: Vector search
    Qdrant-->>CI: Results with scores
    CI->>CI: Format results (filepath, repo, language, symbol details, line, score, content)
    CI-->>SearchTool: Formatted results array
    SearchTool-->>User: Response {results, total}
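
The input handling in the search diagram above (minimum query length of 2, limit capped at 20, optional repo/language filters) can be sketched as below. This is a Python illustration, not the SearchCodeTool source: the default limit of 10, the whitespace trimming, and the function name are assumptions.

```python
from typing import Optional

MIN_QUERY_LENGTH = 2   # from the PR walkthrough
MAX_LIMIT = 20         # from the PR walkthrough
DEFAULT_LIMIT = 10     # assumption: the PR does not state the default


def build_search_request(
    query: str,
    repo: Optional[str] = None,
    language: Optional[str] = None,
    limit: Optional[int] = None,
) -> dict:
    """Validate the query, clamp the limit, and collect only the filters that were set."""
    query = query.strip()  # assumption: surrounding whitespace is not significant
    if len(query) < MIN_QUERY_LENGTH:
        raise ValueError(f"query must be at least {MIN_QUERY_LENGTH} characters")

    effective_limit = DEFAULT_LIMIT if limit is None else max(1, min(limit, MAX_LIMIT))
    filters = {key: value for key, value in (("repo", repo), ("language", language)) if value}
    return {"query": query, "limit": effective_limit, "filters": filters}
```

Omitting unset filters, rather than passing empty values, avoids accidentally constraining the vector search to an empty repo or language.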

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Whiskers twitch with coding delight,
Vectors dance through the database night,
Symbols indexed, semantics take flight,
Search and vectorize—what pure insight!
The warren now finds what's hiding just right. 🔍✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: code vectorization pipeline and search-code MCP tool' directly and clearly describes the main changes: a new vectorization pipeline and a new MCP tool for code search. |

github-actions bot commented Mar 6, 2026

📊 Coverage Report

| Metric | Coverage | Threshold | Status |
|---|---|---|---|
| Lines | 97.6% | 95% | |

Files Below Threshold

| File | Coverage | Uncovered Lines |
|---|---|---|
| app/Enums/ObservationType.php | 0% | None |
| app/Exceptions/Qdrant/QdrantException.php | 0% | None |
| app/Integrations/Qdrant/Requests/ScrollPoints.php | 0% | None |
| app/Mcp/Servers/KnowledgeServer.php | 0% | None |
| app/Services/AgentHealthService.php | 0% | None |
| app/Mcp/Tools/RememberTool.php | 66.7% | 106, 107, 108, 109, 110... (+19 more) |
| app/Mcp/Tools/CorrectTool.php | 70.4% | 53, 54, 55, 56, 57... (+3 more) |
| app/Mcp/Tools/SearchCodeTool.php | 72.7% | 72, 73, 74, 75, 76... (+7 more) |
| app/Providers/McpServiceProvider.php | 73.7% | 44, 46, 48, 49, 50 |
| app/Mcp/Tools/RecallTool.php | 75.9% | 81, 82, 83, 84, 85... (+15 more) |

🏆 Synapse Sentinel Gate

github-actions bot commented Mar 6, 2026

🔧 Synapse Sentinel: 1 check needs attention

The following issues must be resolved before this PR can be merged:


All tests passed.

---

Quick Reference:

  • PHPStan errors → Fix type mismatches first, then missing types
  • Test failures → Read the assertion message, trace expected vs actual
  • Style issues → Run composer format to auto-fix

🤖 Generated by Synapse Sentinel - View Run

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/Services/CodeIndexerService.php (1)

173-175: ⚠️ Potential issue | 🟡 Minor

Update the return type docblock to match actual return structure.

The docblock declares the return type without symbol_name, symbol_kind, signature, start_line, or end_line, but the implementation at lines 207-211 includes these fields. This causes PHPStan errors downstream in SearchCodeTool.php.

📝 Suggested docblock fix
     /**
      * Search code semantically.
      *
      * @param  array{repo?: string, language?: string}  $filters
-     * @return array<array{filepath: string, repo: string, language: string, content: string, score: float, functions: array<string>}>
+     * @return array<array{filepath: string, repo: string, language: string, content: string, score: float, functions: array<string>, symbol_name: string|null, symbol_kind: string|null, signature: string|null, start_line: int, end_line: int}>
      */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/Services/CodeIndexerService.php` around lines 173 - 175, The method in
CodeIndexerService returns entries that include additional keys (symbol_name,
symbol_kind, signature, start_line, end_line) but the docblock return type is
missing them; update the method's `@return` annotation in CodeIndexerService.php
to list those keys with types (e.g. array<array{filepath:string, repo:string,
language:string, content:string, score:float, functions:array<string>,
symbol_name?:string, symbol_kind?:string, signature?:string, start_line?:int,
end_line?:int}>) so it matches the actual structure used by SearchCodeTool.php
and PHPStan.
🧹 Nitpick comments (3)
app/Mcp/Servers/KnowledgeServer.php (1)

20-20: Consider updating the #[Instructions] attribute to include the new search-code tool.

The instructions mention recall, remember, correct, context, and stats, but omit the newly added SearchCodeTool. Users interacting via MCP will not be informed about the code search capability.

📝 Suggested update
-#[Instructions('Semantic knowledge base with vector search. Use `recall` to search, `remember` to capture discoveries, `correct` to fix wrong knowledge, `context` to load project-relevant entries, and `stats` for health checks. All tools auto-detect the current project from git context.')]
+#[Instructions('Semantic knowledge base with vector search. Use `recall` to search, `remember` to capture discoveries, `correct` to fix wrong knowledge, `context` to load project-relevant entries, `stats` for health checks, and `search-code` for semantic code search across indexed repositories. All tools auto-detect the current project from git context.')]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/Mcp/Servers/KnowledgeServer.php` at line 20, The #[Instructions(...)]
attribute on the KnowledgeServer class/documentation string currently lists
tools (recall, remember, correct, context, stats) but omits the new
SearchCodeTool; update that attribute to mention the new "search-code" tool and
a short usage hint (e.g., "Use `search-code` to search repository source code")
so users know MCP supports code search; locate the attribute token
#[Instructions(...)] in KnowledgeServer.php and append the new tool name and
brief usage note to the existing instructions string.
app/Commands/VectorizeCodeCommand.php (1)

79-81: Consider returning FAILURE when all symbols fail.

The command returns SUCCESS even when $result['failed'] === $result['total'] (complete failure). This could mask issues in CI/CD pipelines.

🔧 Suggested fix
         info("Done: {$result['success']}/{$result['total']} symbols vectorized, {$result['failed']} failed");

-        return self::SUCCESS;
+        return $result['success'] > 0 ? self::SUCCESS : self::FAILURE;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/Commands/VectorizeCodeCommand.php` around lines 79 - 81, The command
currently always returns self::SUCCESS after logging "Done:
{$result['success']}/{$result['total']} symbols vectorized, {$result['failed']}
failed" even when all symbols failed; update the VectorizeCodeCommand (e.g., in
the method that produces $result) to check if $result['failed'] ===
$result['total'] and return self::FAILURE in that case (otherwise keep returning
self::SUCCESS), ensuring you reference the existing $result['failed'] and
$result['total'] variables and preserve the info() log line.
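
The exit-code rule from the suggested fix can be sketched in a few lines. This is a Python illustration of the decision only, not the command's code; it assumes the conventional values 0 for success and 1 for failure and the `success`/`failed`/`total` keys shown in the log line above.

```python
SUCCESS, FAILURE = 0, 1  # conventional process exit codes


def exit_code(result: dict) -> int:
    """Report failure unless at least one symbol was vectorized."""
    return SUCCESS if result["success"] > 0 else FAILURE
```

Note that under this rule an empty index (`total` of 0) also exits with FAILURE, while a partially failed run still exits with SUCCESS; whether either outcome is desirable for CI is a judgment call for the maintainers.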
app/Services/CodeIndexerService.php (1)

269-276: Consider making SymbolIndexService a constructor dependency.

Passing SymbolIndexService as a method parameter creates an inconsistent API pattern compared to other dependencies like EmbeddingServiceInterface which are injected via constructor. This also makes testing and mocking harder.

♻️ Suggested refactor

Inject via constructor:

 public function __construct(
     private readonly EmbeddingServiceInterface $embeddingService,
+    private readonly SymbolIndexService $symbolIndex,
     private readonly int $vectorSize = 1024,
 ) {

Then update method signature:

 public function vectorizeFromIndex(
     string $indexPath,
     string $repo,
-    SymbolIndexService $symbolIndex,
     array $kinds = [],
     ?string $language = null,
     ?callable $onProgress = null,
 ): array {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/Services/CodeIndexerService.php` around lines 269 - 276, The method
vectorizeFromIndex currently accepts SymbolIndexService as a parameter which is
inconsistent with other injected services (e.g. EmbeddingServiceInterface);
refactor by adding a private property (e.g. private SymbolIndexService
$symbolIndex) and accept SymbolIndexService in the class constructor, assign it
to the property, remove the SymbolIndexService parameter from vectorizeFromIndex
signature, and update all internal references in vectorizeFromIndex to use
$this->symbolIndex; also update all call sites and unit tests to stop passing
the service into vectorizeFromIndex and instead construct/mocks should be
provided to the class constructor.
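
The refactor suggested above, moving the collaborator from a method parameter to the constructor, looks roughly like this as a Python sketch. The class names mirror the PR loosely, but the `load` method and the fake are hypothetical and exist only to show why constructor injection makes substitution in tests straightforward.

```python
from dataclasses import dataclass
from typing import Protocol


class SymbolIndex(Protocol):
    # Hypothetical method; the real SymbolIndexService API is not shown in the PR.
    def load(self, index_path: str) -> list: ...


@dataclass(frozen=True)
class CodeIndexer:
    embedding_service: object
    symbol_index: SymbolIndex      # injected once, alongside the other dependencies
    vector_size: int = 1024

    def vectorize_from_index(self, index_path: str, repo: str) -> int:
        # The collaborator comes from the constructor, not from the call site.
        return len(self.symbol_index.load(index_path))


class FakeSymbolIndex:
    """Test double that satisfies the SymbolIndex protocol."""

    def load(self, index_path: str) -> list:
        return [{"name": "User", "kind": "class"}]


indexer = CodeIndexer(embedding_service=None, symbol_index=FakeSymbolIndex())
```

Callers then invoke `indexer.vectorize_from_index(path, repo)` without knowing which symbol index implementation is in use.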

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e8c7e730-e46e-471d-8567-e87c2e4b1e78

📥 Commits

Reviewing files that changed from the base of the PR and between 0d25c79 and e50e0f4.

📒 Files selected for processing (5)
  • app/Commands/VectorizeCodeCommand.php
  • app/Mcp/Servers/KnowledgeServer.php
  • app/Mcp/Tools/SearchCodeTool.php
  • app/Services/CodeIndexerService.php
  • app/Services/SymbolIndexService.php

Comment on lines +325 to +326
$ext = strtolower(pathinfo($symbol['file'] ?? '', PATHINFO_EXTENSION));
$language = $this->detectLanguage($ext);

⚠️ Potential issue | 🔴 Critical

Critical: Variable shadowing breaks language filtering.

The loop variable $language shadows the method parameter $language, causing the filter to be ignored after the first iteration. The detected language from the current file overwrites the user-supplied filter.

🐛 Fix: Use a different variable name
             $ext = strtolower(pathinfo($symbol['file'] ?? '', PATHINFO_EXTENSION));
-            $language = $this->detectLanguage($ext);
+            $detectedLanguage = $this->detectLanguage($ext);

             $result = $this->indexSymbol(
                 text: $text,
                 filepath: $symbol['file'] ?? '',
                 repo: $repo,
-                language: $language,
+                language: $detectedLanguage,
                 symbolName: $symbol['name'] ?? '',
                 symbolKind: $symbol['kind'] ?? '',
                 line: (int) ($symbol['line'] ?? 0),
                 signature: $symbol['signature'] ?? '',
             );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/Services/CodeIndexerService.php` around lines 325 - 326, The local
variable $language in the loop shadows the method parameter $language and causes
the user-supplied language filter to be overwritten; rename the loop-local
variable (e.g., $fileLanguage) so you call $this->detectLanguage($ext) into
$fileLanguage and use $fileLanguage for per-file checks while leaving the method
parameter $language intact, updating any subsequent comparisons in the loop to
reference $fileLanguage instead of $language (see detectLanguage call and
surrounding loop inside CodeIndexerService::...).
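
The hazard described in this comment can be reproduced in a small, self-contained form. The sketch below is Python, not the PR's PHP, and the exact position of the filter check inside the real loop is an assumption (the full loop is not shown in this conversation). It only demonstrates the general point: once the loop reassigns the parameter, the caller's filter is gone.

```python
import os

# Hypothetical extension map standing in for detectLanguage().
EXTENSION_LANGUAGES = {"php": "php", "py": "python"}


def detect_language(filepath):
    ext = os.path.splitext(filepath)[1].lstrip(".").lower()
    return EXTENSION_LANGUAGES.get(ext, "unknown")


def select_buggy(symbols, language=None):
    selected = []
    for symbol in symbols:
        symbol_language = detect_language(symbol["file"])
        language = symbol_language  # BUG: clobbers the caller's filter
        if language is not None and symbol_language != language:
            continue  # never reached: both sides now hold the same value
        selected.append(symbol["name"])
    return selected


def select_fixed(symbols, language=None):
    selected = []
    for symbol in symbols:
        detected_language = detect_language(symbol["file"])  # distinct name
        if language is not None and detected_language != language:
            continue
        selected.append(symbol["name"])
    return selected


SYMBOLS = [
    {"name": "User", "file": "app/User.php"},
    {"name": "helper", "file": "tools/helper.py"},
    {"name": "Order", "file": "app/Order.php"},
]
```

With `language="php"`, `select_buggy(SYMBOLS, "php")` keeps all three symbols, while `select_fixed(SYMBOLS, "php")` keeps only `User` and `Order`.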

 PYTHON;

-        $result = Process::timeout(120)->run(['python3', '-c', $script]);
+        $result = Process::timeout(600)->run(['/opt/homebrew/opt/python@3.12/bin/python3.12', '-c', $script]);

⚠️ Potential issue | 🟠 Major

Hardcoded Python path breaks portability.

The path /opt/homebrew/opt/python@3.12/bin/python3.12 is specific to macOS with Homebrew on Apple Silicon. This will fail on:

  • Intel Macs (/usr/local/opt/...)
  • Linux systems
  • Any system using pyenv, asdf, or system Python
🔧 Suggested fix: Use configurable or discoverable path
-        $result = Process::timeout(600)->run(['/opt/homebrew/opt/python@3.12/bin/python3.12', '-c', $script]);
+        $pythonPath = config('services.jcodemunch.python_path', 'python3');
+        $result = Process::timeout(600)->run([$pythonPath, '-c', $script]);

Then add to config/services.php:

'jcodemunch' => [
    'python_path' => env('JCODEMUNCH_PYTHON_PATH', 'python3'),
],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/Services/SymbolIndexService.php` at line 51, The hardcoded Python binary
in the Process run call (Process::timeout(600)->run([...])) breaks portability;
replace the literal '/opt/homebrew/opt/python@3.12/bin/python3.12' with a
configurable path pulled from config('services.jcodemunch.python_path',
'python3') (or env('JCODEMUNCH_PYTHON_PATH', 'python3') if you prefer), add the
suggested 'jcodemunch' => ['python_path' => env('JCODEMUNCH_PYTHON_PATH',
'python3')] entry to config/services.php, and update the code that constructs
the Process (the array passed to run in SymbolIndexService.php) to use that
config value so systems fall back to 'python3' if no env override is provided.
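
The configurable-interpreter idea can be sketched as below, in Python rather than the project's PHP. The environment variable name `JCODEMUNCH_PYTHON_PATH` and the `python3` fallback come from the suggestion above; the up-front `shutil.which` check and both function names are additions for this example, not part of the proposed fix.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional


def resolve_python(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the interpreter to use, falling back to 'python3' on PATH."""
    environ = os.environ if environ is None else environ
    candidate = environ.get("JCODEMUNCH_PYTHON_PATH", "python3")
    resolved = shutil.which(candidate)  # accepts bare names and absolute paths
    if resolved is None:
        raise FileNotFoundError(f"Python interpreter not found: {candidate}")
    return resolved


def run_indexer_script(
    script: str,
    environ: Optional[Mapping[str, str]] = None,
    timeout: int = 600,  # matches the 600s timeout introduced by this PR
) -> subprocess.CompletedProcess:
    """Run an inline script with the resolved interpreter."""
    python = resolve_python(environ)
    return subprocess.run(
        [python, "-c", script], capture_output=True, text=True, timeout=timeout
    )
```

Failing early with a clear message when the interpreter is missing tends to be easier to diagnose than a generic process error surfacing later in the indexing step.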

…overage gate

Fix $language parameter being overwritten inside the foreach loop in
vectorizeFromIndex(), which broke language filtering during vectorization.

Add 28 tests across 3 files covering SearchCodeTool, VectorizeCodeCommand,
and new CodeIndexerService methods (indexSymbol, vectorizeFromIndex).
Coverage: 94.8% → 97.5%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Mar 6, 2026

🏆 Sentinel Certified

Tests & Coverage: 0 tests passed
Security Audit: No security vulnerabilities found
Pest Syntax: All test files use describe/it syntax


Add this badge to your README:

[![Sentinel Certified](https://img.shields.io/github/actions/workflow/status/conduit-ui/knowledge/gate.yml?label=Sentinel%20Certified&style=flat-square)](https://github.com/conduit-ui/knowledge/actions/workflows/gate.yml)

@github-actions github-actions bot merged commit 6020a3f into master Mar 6, 2026
1 check passed
@github-actions github-actions bot deleted the feature/code-vectorization-mcp branch March 6, 2026 15:59