feat: code vectorization pipeline and search-code MCP tool #136
github-actions[bot] merged 2 commits into master
Tree-sitter symbols can now be embedded into Qdrant for semantic code search. Adds vectorize-code command with --kind and --language filters, a search-code-tool exposed via MCP, and updates SymbolIndexService for macOS Python 3.12 with a 10k file limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough
The PR introduces a complete code vectorization and semantic search workflow: a new console command vectorizes tree-sitter symbols into Qdrant, a new MCP tool enables semantic code search, CodeIndexerService gains indexing and batch-vectorization methods, and SymbolIndexService's Python environment is optimized.
Sequence Diagrams
sequenceDiagram
participant CLI as CLI / User
participant VCmd as VectorizeCodeCommand
participant SI as SymbolIndexService
participant CI as CodeIndexerService
participant Qdrant as Qdrant Vector DB
CLI->>VCmd: Execute with repo, kinds, language
VCmd->>VCmd: Load index from ~/.code-index/{repo-slug}.json
VCmd->>CI: Ensure code collection exists
CI->>Qdrant: Verify/Create collection
Qdrant-->>CI: Collection ready
VCmd->>CI: vectorizeFromIndex(indexPath, repo, kinds, language, onProgress)
loop For each symbol in index
CI->>SI: Detect language
SI-->>CI: Language detected
CI->>CI: indexSymbol(text, filepath, repo, language, ...)
CI->>CI: Generate embedding
CI->>Qdrant: Upsert vector + payload
Qdrant-->>CI: Success/Error
CI-->>VCmd: Progress callback (every 100 items)
end
CI-->>VCmd: Return {success, total, failed}
VCmd->>CLI: Exit with status code
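The loop in the diagram above can be sketched in plain PHP. This is a minimal illustration, not the PR's implementation: `vectorizeSymbolsSketch()` and `exitCodeSketch()` are hypothetical names, the `$indexOne` callback stands in for embedding plus the Qdrant upsert, and the exit-code rule (fail only when every selected symbol failed, succeed on an empty selection) is an assumption.

```php
<?php
// Sketch only: both helpers are hypothetical, not the PR's
// CodeIndexerService or VectorizeCodeCommand.

function vectorizeSymbolsSketch(
    array $symbols,
    array $kinds,
    callable $indexOne,
    ?callable $onProgress = null,
    int $progressEvery = 100
): array {
    // An empty $kinds list is assumed to mean "no kind filter".
    $selected = array_values(array_filter(
        $symbols,
        static fn (array $s): bool => $kinds === [] || in_array($s['kind'] ?? '', $kinds, true)
    ));

    $result = ['success' => 0, 'failed' => 0, 'total' => count($selected)];

    foreach ($selected as $i => $symbol) {
        // $indexOne is assumed to return true on success, false on failure.
        if ($indexOne($symbol)) {
            $result['success']++;
        } else {
            $result['failed']++;
        }

        // Report progress once per batch rather than once per symbol.
        $done = $i + 1;
        if ($onProgress !== null && $done % $progressEvery === 0) {
            $onProgress($done, $result['total']);
        }
    }

    return $result;
}

function exitCodeSketch(array $result): int
{
    // Assumption: an empty selection is not treated as an error.
    if ($result['total'] === 0) {
        return 0;
    }

    // Non-zero only when every selected symbol failed.
    return $result['failed'] === $result['total'] ? 1 : 0;
}
```

Keeping the counters in one array mirrors the `{success, total, failed}` shape in the diagram, so the caller can both print a summary and derive an exit status from the same value.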
sequenceDiagram
participant User as MCP Client / User
participant SearchTool as SearchCodeTool
participant CI as CodeIndexerService
participant Qdrant as Qdrant Vector DB
User->>SearchTool: handle(query, repo?, language?, limit?)
SearchTool->>SearchTool: Validate query (min 2 chars)
SearchTool->>SearchTool: Cap limit at 20
SearchTool->>SearchTool: Build filters from repo/language
SearchTool->>CI: search(query, limit, filters)
CI->>Qdrant: Vector search
Qdrant-->>CI: Results with scores
CI->>CI: Format results (filepath, repo, language, symbol details, line, score, content)
CI-->>SearchTool: Formatted results array
SearchTool-->>User: Response {results, total}
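The validation steps in the second diagram (minimum query length, limit cap, filter construction) might look roughly like the following. It is a sketch under stated assumptions: `normalizeSearchRequestSketch()` is a hypothetical helper, the default limit of 10, the lower bound of 1, and the whitespace trimming are not stated in the PR, and length is measured in bytes via `strlen()`.

```php
<?php
// Sketch only: a hypothetical helper, not the PR's SearchCodeTool.

function normalizeSearchRequestSketch(
    string $query,
    ?string $repo = null,
    ?string $language = null,
    int $limit = 10
): array {
    $query = trim($query);

    // The diagram states a 2-character minimum; strlen() counts bytes.
    if (strlen($query) < 2) {
        throw new InvalidArgumentException('Query must be at least 2 characters.');
    }

    // Only include filters the caller actually supplied.
    $filters = array_filter(
        ['repo' => $repo, 'language' => $language],
        static fn (?string $value): bool => $value !== null && $value !== ''
    );

    return [
        'query' => $query,
        'limit' => max(1, min($limit, 20)), // cap at 20 as in the diagram
        'filters' => $filters,
    ];
}
```

Normalizing the request before calling the search service keeps the service itself free of input-handling concerns.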
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📊 Coverage Report
Files Below Threshold
🏆 Synapse Sentinel Gate
🔧 Synapse Sentinel: 1 check needs attention
The following issues must be resolved before this PR can be merged:
All tests passed.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/Services/CodeIndexerService.php (1)
173-175: ⚠️ Potential issue | 🟡 Minor
Update the return type docblock to match actual return structure.
The docblock declares the return type without `symbol_name`, `symbol_kind`, `signature`, `start_line`, or `end_line`, but the implementation at lines 207-211 includes these fields. This causes PHPStan errors downstream in `SearchCodeTool.php`.
📝 Suggested docblock fix
 /**
  * Search code semantically.
  *
  * @param array{repo?: string, language?: string} $filters
- * @return array<array{filepath: string, repo: string, language: string, content: string, score: float, functions: array<string>}>
+ * @return array<array{filepath: string, repo: string, language: string, content: string, score: float, functions: array<string>, symbol_name: string|null, symbol_kind: string|null, signature: string|null, start_line: int, end_line: int}>
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/Services/CodeIndexerService.php` around lines 173 - 175, The method in CodeIndexerService returns entries that include additional keys (symbol_name, symbol_kind, signature, start_line, end_line) but the docblock return type is missing them; update the method's `@return` annotation in CodeIndexerService.php to list those keys with types (e.g. array<array{filepath:string, repo:string, language:string, content:string, score:float, functions:array<string>, symbol_name?:string, symbol_kind?:string, signature?:string, start_line?:int, end_line?:int}>) so it matches the actual structure used by SearchCodeTool.php and PHPStan.
🧹 Nitpick comments (3)
app/Mcp/Servers/KnowledgeServer.php (1)
20-20: Consider updating the `#[Instructions]` attribute to include the new `search-code` tool.
The instructions mention `recall`, `remember`, `correct`, `context`, and `stats`, but omit the newly added `SearchCodeTool`. Users interacting via MCP will not be informed about the code search capability.
📝 Suggested update
-#[Instructions('Semantic knowledge base with vector search. Use `recall` to search, `remember` to capture discoveries, `correct` to fix wrong knowledge, `context` to load project-relevant entries, and `stats` for health checks. All tools auto-detect the current project from git context.')]
+#[Instructions('Semantic knowledge base with vector search. Use `recall` to search, `remember` to capture discoveries, `correct` to fix wrong knowledge, `context` to load project-relevant entries, `stats` for health checks, and `search-code` for semantic code search across indexed repositories. All tools auto-detect the current project from git context.')]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/Mcp/Servers/KnowledgeServer.php` at line 20, The #[Instructions(...)] attribute on the KnowledgeServer class/documentation string currently lists tools (recall, remember, correct, context, stats) but omits the new SearchCodeTool; update that attribute to mention the new "search-code" tool and a short usage hint (e.g., "Use `search-code` to search repository source code") so users know MCP supports code search; locate the attribute token #[Instructions(...)] in KnowledgeServer.php and append the new tool name and brief usage note to the existing instructions string.
app/Commands/VectorizeCodeCommand.php (1)
79-81: Consider returning FAILURE when all symbols fail.
The command returns `SUCCESS` even when `$result['failed'] === $result['total']` (complete failure). This could mask issues in CI/CD pipelines.
🔧 Suggested fix
     info("Done: {$result['success']}/{$result['total']} symbols vectorized, {$result['failed']} failed");
-    return self::SUCCESS;
+    return $result['success'] > 0 ? self::SUCCESS : self::FAILURE;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/Commands/VectorizeCodeCommand.php` around lines 79 - 81, The command currently always returns self::SUCCESS after logging "Done: {$result['success']}/{$result['total']} symbols vectorized, {$result['failed']} failed" even when all symbols failed; update the VectorizeCodeCommand (e.g., in the method that produces $result) to check if $result['failed'] === $result['total'] and return self::FAILURE in that case (otherwise keep returning self::SUCCESS), ensuring you reference the existing $result['failed'] and $result['total'] variables and preserve the info() log line.
app/Services/CodeIndexerService.php (1)
269-276: Consider making `SymbolIndexService` a constructor dependency.
Passing `SymbolIndexService` as a method parameter creates an inconsistent API pattern compared to other dependencies like `EmbeddingServiceInterface` which are injected via constructor. This also makes testing and mocking harder.
♻️ Suggested refactor
Inject via constructor:
 public function __construct(
     private readonly EmbeddingServiceInterface $embeddingService,
+    private readonly SymbolIndexService $symbolIndex,
     private readonly int $vectorSize = 1024,
 ) {
Then update method signature:
 public function vectorizeFromIndex(
     string $indexPath,
     string $repo,
-    SymbolIndexService $symbolIndex,
     array $kinds = [],
     ?string $language = null,
     ?callable $onProgress = null,
 ): array {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/Services/CodeIndexerService.php` around lines 269 - 276, The method vectorizeFromIndex currently accepts SymbolIndexService as a parameter which is inconsistent with other injected services (e.g. EmbeddingServiceInterface); refactor by adding a private property (e.g. private SymbolIndexService $symbolIndex) and accept SymbolIndexService in the class constructor, assign it to the property, remove the SymbolIndexService parameter from vectorizeFromIndex signature, and update all internal references in vectorizeFromIndex to use $this->symbolIndex; also update all call sites and unit tests to stop passing the service into vectorizeFromIndex and instead construct/mocks should be provided to the class constructor.
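To illustrate why constructor injection tends to simplify testing, here is a minimal sketch using stand-in types. `SymbolSourceSketch`, `IndexerSketch`, and `InMemorySymbolSource` are hypothetical names introduced for this example; the project's real `SymbolIndexService` is a concrete class whose methods are not shown in this thread. The sketch assumes PHP 8.1 or later for `readonly` promoted properties.

```php
<?php
// Sketch only: hypothetical stand-ins, not the project's real classes.

interface SymbolSourceSketch
{
    /** @return array<int, array{name: string, kind: string}> */
    public function symbols(string $indexPath): array;
}

final class IndexerSketch
{
    // The dependency arrives once, at construction time.
    public function __construct(private readonly SymbolSourceSketch $symbolSource)
    {
    }

    /** @param array<int, string> $kinds empty means "all kinds" (assumed) */
    public function countSymbols(string $indexPath, array $kinds = []): int
    {
        $symbols = $this->symbolSource->symbols($indexPath);

        if ($kinds === []) {
            return count($symbols);
        }

        return count(array_filter(
            $symbols,
            static fn (array $s): bool => in_array($s['kind'], $kinds, true)
        ));
    }
}

// A test double that needs no Python process or index file.
final class InMemorySymbolSource implements SymbolSourceSketch
{
    public function __construct(private readonly array $symbols)
    {
    }

    public function symbols(string $indexPath): array
    {
        return $this->symbols;
    }
}
```

With this shape, a test builds the indexer with an in-memory source once, and method signatures no longer carry the service as an extra argument.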
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/Services/CodeIndexerService.php`:
- Around line 325-326: The local variable $language in the loop shadows the
method parameter $language and causes the user-supplied language filter to be
overwritten; rename the loop-local variable (e.g., $fileLanguage) so you call
$this->detectLanguage($ext) into $fileLanguage and use $fileLanguage for
per-file checks while leaving the method parameter $language intact, updating
any subsequent comparisons in the loop to reference $fileLanguage instead of
$language (see detectLanguage call and surrounding loop inside
CodeIndexerService::...).
In `@app/Services/SymbolIndexService.php`:
- Line 51: The hardcoded Python binary in the Process run call
(Process::timeout(600)->run([...])) breaks portability; replace the literal
'/opt/homebrew/opt/python@3.12/bin/python3.12' with a configurable path pulled
from config('services.jcodemunch.python_path', 'python3') (or
env('JCODEMUNCH_PYTHON_PATH', 'python3') if you prefer), add the suggested
'jcodemunch' => ['python_path' => env('JCODEMUNCH_PYTHON_PATH', 'python3')]
entry to config/services.php, and update the code that constructs the Process
(the array passed to run in SymbolIndexService.php) to use that config value so
systems fall back to 'python3' if no env override is provided.
---
Outside diff comments:
In `@app/Services/CodeIndexerService.php`:
- Around line 173-175: The method in CodeIndexerService returns entries that
include additional keys (symbol_name, symbol_kind, signature, start_line,
end_line) but the docblock return type is missing them; update the method's
`@return` annotation in CodeIndexerService.php to list those keys with types (e.g.
array<array{filepath:string, repo:string, language:string, content:string,
score:float, functions:array<string>, symbol_name?:string, symbol_kind?:string,
signature?:string, start_line?:int, end_line?:int}>) so it matches the actual
structure used by SearchCodeTool.php and PHPStan.
---
Nitpick comments:
In `@app/Commands/VectorizeCodeCommand.php`:
- Around line 79-81: The command currently always returns self::SUCCESS after
logging "Done: {$result['success']}/{$result['total']} symbols vectorized,
{$result['failed']} failed" even when all symbols failed; update the
VectorizeCodeCommand (e.g., in the method that produces $result) to check if
$result['failed'] === $result['total'] and return self::FAILURE in that case
(otherwise keep returning self::SUCCESS), ensuring you reference the existing
$result['failed'] and $result['total'] variables and preserve the info() log
line.
In `@app/Mcp/Servers/KnowledgeServer.php`:
- Line 20: The #[Instructions(...)] attribute on the KnowledgeServer
class/documentation string currently lists tools (recall, remember, correct,
context, stats) but omits the new SearchCodeTool; update that attribute to
mention the new "search-code" tool and a short usage hint (e.g., "Use
`search-code` to search repository source code") so users know MCP supports code
search; locate the attribute token #[Instructions(...)] in KnowledgeServer.php
and append the new tool name and brief usage note to the existing instructions
string.
In `@app/Services/CodeIndexerService.php`:
- Around line 269-276: The method vectorizeFromIndex currently accepts
SymbolIndexService as a parameter which is inconsistent with other injected
services (e.g. EmbeddingServiceInterface); refactor by adding a private property
(e.g. private SymbolIndexService $symbolIndex) and accept SymbolIndexService in
the class constructor, assign it to the property, remove the SymbolIndexService
parameter from vectorizeFromIndex signature, and update all internal references
in vectorizeFromIndex to use $this->symbolIndex; also update all call sites and
unit tests to stop passing the service into vectorizeFromIndex and instead
construct/mocks should be provided to the class constructor.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e8c7e730-e46e-471d-8567-e87c2e4b1e78
📒 Files selected for processing (5)
app/Commands/VectorizeCodeCommand.php
app/Mcp/Servers/KnowledgeServer.php
app/Mcp/Tools/SearchCodeTool.php
app/Services/CodeIndexerService.php
app/Services/SymbolIndexService.php
app/Services/CodeIndexerService.php
Outdated
$ext = strtolower(pathinfo($symbol['file'] ?? '', PATHINFO_EXTENSION));
$language = $this->detectLanguage($ext);
Critical: Variable shadowing breaks language filtering.
The loop variable $language shadows the method parameter $language, causing the filter to be ignored after the first iteration. The detected language from the current file overwrites the user-supplied filter.
🐛 Fix: Use a different variable name
 $ext = strtolower(pathinfo($symbol['file'] ?? '', PATHINFO_EXTENSION));
-$language = $this->detectLanguage($ext);
+$detectedLanguage = $this->detectLanguage($ext);
 $result = $this->indexSymbol(
     text: $text,
     filepath: $symbol['file'] ?? '',
     repo: $repo,
-    language: $language,
+    language: $detectedLanguage,
     symbolName: $symbol['name'] ?? '',
     symbolKind: $symbol['kind'] ?? '',
     line: (int) ($symbol['line'] ?? 0),
     signature: $symbol['signature'] ?? '',
 );
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
$ext = strtolower(pathinfo($symbol['file'] ?? '', PATHINFO_EXTENSION));
$detectedLanguage = $this->detectLanguage($ext);
$result = $this->indexSymbol(
    text: $text,
    filepath: $symbol['file'] ?? '',
    repo: $repo,
    language: $detectedLanguage,
    symbolName: $symbol['name'] ?? '',
    symbolKind: $symbol['kind'] ?? '',
    line: (int) ($symbol['line'] ?? 0),
    signature: $symbol['signature'] ?? '',
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/Services/CodeIndexerService.php` around lines 325 - 326, The local
variable $language in the loop shadows the method parameter $language and causes
the user-supplied language filter to be overwritten; rename the loop-local
variable (e.g., $fileLanguage) so you call $this->detectLanguage($ext) into
$fileLanguage and use $fileLanguage for per-file checks while leaving the method
parameter $language intact, updating any subsequent comparisons in the loop to
reference $fileLanguage instead of $language (see detectLanguage call and
surrounding loop inside CodeIndexerService::...).
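PHP variables are function-scoped rather than block-scoped, so assigning to `$language` inside a `foreach` replaces the parameter's value for the rest of the method. The sketch below shows one way this can defeat a filter. It is an illustration under assumptions: the function names and the extension map are hypothetical, and the filter check is placed right after detection, whereas the actual position of the check in `vectorizeFromIndex()` is not shown in this thread.

```php
<?php
// Sketch only: hypothetical helpers, not the PR's CodeIndexerService.

function detectLanguageSketch(string $ext): string
{
    $map = ['php' => 'php', 'py' => 'python', 'ts' => 'typescript'];

    return $map[$ext] ?? 'unknown';
}

function selectFilesBuggy(array $files, ?string $language): array
{
    $picked = [];
    foreach ($files as $file) {
        $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
        $language = detectLanguageSketch($ext); // overwrites the caller's filter

        if ($language !== null && detectLanguageSketch($ext) !== $language) {
            continue; // unreachable: both sides now hold the same value
        }
        $picked[] = $file;
    }

    return $picked;
}

function selectFilesFixed(array $files, ?string $language): array
{
    $picked = [];
    foreach ($files as $file) {
        $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
        $detectedLanguage = detectLanguageSketch($ext); // separate name

        if ($language !== null && $detectedLanguage !== $language) {
            continue; // the filter is still the caller's value
        }
        $picked[] = $file;
    }

    return $picked;
}
```

For `['a.php', 'b.py', 'c.ts']` with a `php` filter, the first version keeps all three files while the second keeps only `a.php`.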
 PYTHON;

-$result = Process::timeout(120)->run(['python3', '-c', $script]);
+$result = Process::timeout(600)->run(['/opt/homebrew/opt/python@3.12/bin/python3.12', '-c', $script]);
Hardcoded Python path breaks portability.
The path `/opt/homebrew/opt/python@3.12/bin/python3.12` is specific to macOS with Homebrew on Apple Silicon. This will fail on:
- Intel Macs (`/usr/local/opt/...`)
- Linux systems
- Any system using pyenv, asdf, or system Python
🔧 Suggested fix: Use configurable or discoverable path
-$result = Process::timeout(600)->run(['/opt/homebrew/opt/python@3.12/bin/python3.12', '-c', $script]);
+$pythonPath = config('services.jcodemunch.python_path', 'python3');
+$result = Process::timeout(600)->run([$pythonPath, '-c', $script]);
Then add to config/services.php:
'jcodemunch' => [
    'python_path' => env('JCODEMUNCH_PYTHON_PATH', 'python3'),
],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/Services/SymbolIndexService.php` at line 51, The hardcoded Python binary
in the Process run call (Process::timeout(600)->run([...])) breaks portability;
replace the literal '/opt/homebrew/opt/python@3.12/bin/python3.12' with a
configurable path pulled from config('services.jcodemunch.python_path',
'python3') (or env('JCODEMUNCH_PYTHON_PATH', 'python3') if you prefer), add the
suggested 'jcodemunch' => ['python_path' => env('JCODEMUNCH_PYTHON_PATH',
'python3')] entry to config/services.php, and update the code that constructs
the Process (the array passed to run in SymbolIndexService.php) to use that
config value so systems fall back to 'python3' if no env override is provided.
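Outside a framework, the same fallback behaviour can be sketched with plain `getenv()`. This is an illustration, not the project's code: `resolvePythonBinarySketch()` and `buildPythonCommandSketch()` are hypothetical helpers, and only the `JCODEMUNCH_PYTHON_PATH` variable name and the `python3` default come from the suggestion above.

```php
<?php
// Sketch only: hypothetical helpers that mimic the env-with-default lookup
// without Laravel's config() or env() functions.

function resolvePythonBinarySketch(?string $override = null, string $default = 'python3'): string
{
    // An explicit override wins; otherwise consult the environment.
    // getenv() returns false when the variable is not set.
    $candidate = $override ?? getenv('JCODEMUNCH_PYTHON_PATH');

    if (is_string($candidate) && trim($candidate) !== '') {
        return trim($candidate);
    }

    return $default;
}

function buildPythonCommandSketch(string $script, ?string $override = null): array
{
    // Argument-array form, matching the shape passed to run() in the diff.
    return [resolvePythonBinarySketch($override), '-c', $script];
}
```

Defaulting to `python3` lets the operating system's `PATH` decide which interpreter runs, while a single environment variable still allows pinning a specific binary on machines that need it.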
…overage gate Fix $language parameter being overwritten inside the foreach loop in vectorizeFromIndex(), which broke language filtering during vectorization. Add 28 tests across 3 files covering SearchCodeTool, VectorizeCodeCommand, and new CodeIndexerService methods (indexSymbol, vectorizeFromIndex). Coverage: 94.8% → 97.5%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🏆 Sentinel Certified
✅ Tests & Coverage: 0 tests passed
Summary
- `vectorize-code` command to embed tree-sitter symbols into Qdrant's `code` collection for semantic search
- `search-code-tool` MCP tool exposing semantic code search to Claude Desktop and Claude Code
- `SymbolIndexService` for macOS Python 3.12 with 10k file limit and 600s timeout
- `CodeIndexerService` gains `indexSymbol()` and `vectorizeFromIndex()` with `--kind` and `--language` filters
Test plan
- `./know vectorize-code local/pstrax-laravel --kind=class --language=php` embeds PHP classes into Qdrant
- `search-code-tool` returns semantic results from the `code` collection
- `./know index-code /path/to/repo` works with jcodemunch on Python 3.12
- `search-code-tool`
🤖 Generated with Claude Code
Summary by CodeRabbit