Research: Cross-Repository Graph Support
🎯 Goal
Design and implement cross-repository graph support that enables CALLS edges between projects, dependency tracking across repos, and unified semantic search across multiple related codebases.
📋 Current State
- ✅ Single-project indexing with project_id isolation
- ✅ CROSS_FILE_CALL edges for cross-file calls within a project
- ✅ DEPENDS_ON edges for import/include directives
- ❌ No CALLS edges between different project_id values
- ❌ No mechanism to declare inter-project dependencies
- ❌ No cross-project symbol resolution
- ❌ Qdrant collections are per-project (no shared embeddings)
🔍 Research Questions
1. Project Dependency Configuration
Questions:
- How do we declare that project A depends on project B?
- Should dependencies be declared in a config file or discovered automatically?
- How should we handle versioned dependencies (A v1.0 depends on B v2.3)?
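For the versioning question, one option is a small constraint check at index time. The following is a minimal sketch, assuming versions are plain dotted integers and constraints are either "*" or a comparison operator followed by a version. parse_version and satisfies are hypothetical helpers, not existing ast-rag functions.

```python
import operator
import re

# Comparison operators supported by this sketch (an assumption, not a spec).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "2.3" into (2, 3, 0); only plain dotted integers are handled."""
    parts = tuple(int(part) for part in text.strip().split("."))
    # Pad to three components so "1.0" and "1.0.0" compare as equal.
    return parts + (0,) * (3 - len(parts))


def satisfies(version: str, constraint: str) -> bool:
    """Return True if `version` meets `constraint` ("*" matches anything)."""
    constraint = constraint.strip()
    if constraint == "*":
        return True
    match = re.fullmatch(r"(>=|<=|==|>|<)\s*([\d.]+)", constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    op, wanted = match.groups()
    return _OPS[op](parse_version(version), parse_version(wanted))
```

A resolver could call satisfies(indexed_version, dep_constraint) before adding a dependency's symbols to its index, and skip or warn on a mismatch. Pre-release tags and compound ranges are out of scope for this sketch.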
Proposed Config Format:
{
  "neo4j": {
    "project_id": "microservice-a",
    "dependencies": [
      {"project_id": "shared-lib-b", "version": ">=1.0.0"},
      {"project_id": "common-utils", "version": "*"}
    ]
  }
}
2. Cross-Project Symbol Resolution
Challenges:
- Current ID scheme: SHA256(project_id:file_path:kind:qualified_name)
- Same class in different projects has different IDs
- How to resolve import com.example.Service to the correct project?
Approaches:
A. Import-Based Resolution:
# When parsing a file that imports com.example.Service
if import_path in cross_project_index:
    target_project = cross_project_index[import_path]
    # Create CALLS edge to target_project:Service
B. Type-Based Resolution:
# Use type information from signatures
if method_returns_type == "com.example.Service":
    # Search across all dependent projects
    matches = find_in_projects("com.example.Service", projects=dependencies)
C. Hybrid (Proposed):
- Import-based for direct dependencies
- Type-based for indirect/transitive calls
- Confidence scores: direct=0.9, type-based=0.6
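The hybrid strategy can be sketched as a two-step lookup: try the explicit import first, then fall back to a type-based search across dependencies. This is only an illustration under stated assumptions: Resolution, resolve_hybrid, import_index, and type_index are hypothetical names, and the in-memory dictionaries stand in for whatever index the real resolver would use.

```python
from dataclasses import dataclass
from typing import Optional

# Confidence values taken from the hybrid proposal.
DIRECT_CONFIDENCE = 0.9
TYPE_BASED_CONFIDENCE = 0.6


@dataclass(frozen=True)
class Resolution:
    project_id: str
    method: str  # "import" or "type"
    confidence: float


def resolve_hybrid(
    symbol: str,
    imports: set[str],
    import_index: dict[str, str],
    type_index: dict[str, list[str]],
) -> Optional[Resolution]:
    """Resolve `symbol` to a dependency project, preferring explicit imports.

    import_index maps an imported qualified name to the project exporting it.
    type_index maps a type name to every dependency project that defines it.
    Both are hypothetical in-memory structures used only for this sketch.
    """
    # 1. Import-based: the file explicitly imports the symbol.
    if symbol in imports and symbol in import_index:
        return Resolution(import_index[symbol], "import", DIRECT_CONFIDENCE)

    # 2. Type-based: the symbol only appears in a signature, so search all
    #    dependencies and accept the match only when it is unambiguous.
    candidates = type_index.get(symbol, [])
    if len(candidates) == 1:
        return Resolution(candidates[0], "type", TYPE_BASED_CONFIDENCE)
    return None
```

Returning None on ambiguous type matches is a design choice of this sketch; an alternative would be to emit one low-confidence edge per candidate.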
3. Graph Schema Changes
Current Node ID:
raw = f"{self.project_id}:{self.file_path}:{self.kind.value}:{self.qualified_name}"
self.id = hashlib.sha256(raw.encode()).hexdigest()[:24]
Proposed Changes:
- Keep project-specific IDs (for isolation)
- Add global lookup index for cross-project resolution
- Add CROSS_PROJECT_CALL edge type with metadata
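One possible shape for the global lookup index is a secondary map from qualified name to the project-scoped node IDs, which leaves the existing ID scheme untouched. A minimal sketch follows; GlobalSymbolIndex and make_node_id are hypothetical names, and make_node_id simply mirrors the current node ID formula with kind passed as a plain string.

```python
import hashlib
from collections import defaultdict


def make_node_id(project_id: str, file_path: str, kind: str, qualified_name: str) -> str:
    """Project-scoped ID, following the current scheme (first 24 hex chars)."""
    raw = f"{project_id}:{file_path}:{kind}:{qualified_name}"
    return hashlib.sha256(raw.encode()).hexdigest()[:24]


class GlobalSymbolIndex:
    """Hypothetical secondary index: qualified name -> [(project_id, node_id)]."""

    def __init__(self) -> None:
        self._by_name: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add(self, project_id: str, file_path: str, kind: str, qualified_name: str) -> str:
        """Register a symbol and return its project-scoped node ID."""
        node_id = make_node_id(project_id, file_path, kind, qualified_name)
        self._by_name[qualified_name].append((project_id, node_id))
        return node_id

    def lookup(self, qualified_name: str, projects: set[str]) -> list[tuple[str, str]]:
        """Return matches restricted to the given projects (e.g. declared dependencies)."""
        return [
            (pid, nid)
            for pid, nid in self._by_name.get(qualified_name, [])
            if pid in projects
        ]
```

Because the same qualified name in two projects still hashes to two different IDs, isolation is preserved and the index only adds a way to find candidates across projects.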
New Edge Type:
class ASTEdge(BaseModel):
    kind: EdgeKind
    from_id: str
    to_id: str
    # New fields:
    target_project_id: Optional[str] = None  # Set only for cross-project edges
    resolution_method: str  # "import", "type", "name"
    confidence: float  # 0.0-1.0
4. Neo4j Multi-Project Queries
Challenge: How to query across projects efficiently?
Approach:
// Query: Find all callers of Service.process() across all projects
MATCH (caller)-[c:CALLS|CROSS_PROJECT_CALL]->(target)
WHERE target.qualified_name CONTAINS 'Service.process'
  AND target.project_id IN $project_ids
RETURN caller, c, target
Index Requirements:
CREATE INDEX project_qualified_name IF NOT EXISTS
FOR (n:Function) ON (n.project_id, n.qualified_name);
5. Qdrant Cross-Project Search
Current: One collection per project (ast_rag_{project_id})
Options:
A. Single Shared Collection:
- All projects index to same Qdrant collection
- Add project_id to payload for filtering
- Pro: Unified semantic search
- Con: Less isolation, larger collection
B. Multi-Collection Search:
- Keep per-project collections
- Search all dependent collections, merge results
- Pro: Better isolation
- Con: More complex search logic
C. Hybrid (Proposed):
- Default: per-project collections
- Optional: shared collection for cross-project search
- Config flag: qdrant.shared_collection: true/false
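The "search all dependent collections, merge results" step of option B could look roughly like the sketch below. To avoid assuming any particular Qdrant client call, the per-collection search is injected as a callable; Hit, SearchFn, and search_across_projects are hypothetical names. The sketch also assumes all projects use the same embedding model and distance metric, since scores are otherwise not comparable across collections.

```python
import heapq
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hit:
    project_id: str
    node_id: str
    score: float  # higher is better; assumes one embedding model for all projects


# Hypothetical signature: (collection_name, query, limit) -> hits for that collection.
SearchFn = Callable[[str, str, int], list[Hit]]


def search_across_projects(
    query: str,
    project_ids: list[str],
    search_fn: SearchFn,
    limit: int = 10,
) -> list[Hit]:
    """Query each per-project collection and keep the best `limit` hits overall."""
    all_hits: list[Hit] = []
    for project_id in project_ids:
        collection = f"ast_rag_{project_id}"  # naming pattern from the current setup
        all_hits.extend(search_fn(collection, query, limit))
    # nlargest returns the hits sorted by descending score.
    return heapq.nlargest(limit, all_hits, key=lambda hit: hit.score)
```

Requesting `limit` hits from every collection guarantees the merged top results are exact, at the cost of fetching more candidates than are returned.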
📐 Proposed Architecture
Configuration
File: ast_rag/dto/config.py
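from typing import Optional
from pydantic import BaseModel, Field

class ProjectDependency(BaseModel):
    project_id: str
    version_constraint: Optional[str] = None  # e.g., ">=1.0.0"
    path: Optional[str] = None  # Local path for monorepo

# Declared after ProjectDependency so the type annotation resolves.
class Neo4jConfig(BaseModel):
    project_id: str = "default"
    # Pydantic's Field (not dataclasses.field) supplies the per-instance default list.
    dependencies: list[ProjectDependency] = Field(default_factory=list)
Cross-Project Index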
File: ast_rag/services/cross_project_resolver.py (new)
from typing import Optional

from neo4j import Driver


class CrossProjectResolver:
    def __init__(self, driver: Driver, dependencies: list[ProjectDependency]):
        self._driver = driver
        self._dependencies = dependencies
        self._symbol_index: dict[str, str] = {}  # symbol -> project_id

    def build_index(self) -> None:
        """Build in-memory index of exported symbols from dependencies."""
        for dep in self._dependencies:
            symbols = self._get_exported_symbols(dep.project_id)
            for symbol in symbols:
                # If two dependencies export the same name, the last one wins.
                self._symbol_index[symbol] = dep.project_id

    def resolve(self, symbol_name: str) -> Optional[tuple[str, str]]:
        """Resolve symbol to (project_id, node_id)."""
        project_id = self._symbol_index.get(symbol_name)
        if project_id:
            # NOTE: _get_node_id is not defined in this sketch yet.
            node_id = self._get_node_id(project_id, symbol_name)
            return (project_id, node_id)
        return None

    def _get_exported_symbols(self, project_id: str) -> list[str]:
        # Query Neo4j for public classes/functions
        cypher = """
        MATCH (n)
        WHERE n.project_id = $project_id
          AND n.valid_to IS NULL
          AND (n.kind IN ['CLASS', 'FUNCTION'] OR n.visibility = 'public')
        RETURN n.qualified_name AS qualified_name
        """
        # Run the query and collect the names (the original draft stopped before this step).
        with self._driver.session() as session:
            result = session.run(cypher, project_id=project_id)
            return [record["qualified_name"] for record in result]
Parser Manager Changes
File: ast_rag/services/parsing/parser_manager.py
class ParserManager:
    def __init__(
        self,
        project_id: str = "default",
        cross_project_resolver: Optional[CrossProjectResolver] = None,
    ):
        self._project_id = project_id
        self._resolver = cross_project_resolver

    def _resolve_cross_project_call(
        self,
        symbol_name: str,
        caller_id: str,
    ) -> Optional[ASTEdge]:
        # Without a resolver, cross-project resolution is disabled.
        if not self._resolver:
            return None
        result = self._resolver.resolve(symbol_name)
        if result:
            target_project_id, target_node_id = result
            return ASTEdge(
                kind=EdgeKind.CROSS_PROJECT_CALL,
                from_id=caller_id,
                to_id=target_node_id,
                target_project_id=target_project_id,
                # NOTE: the hybrid proposal in section 2 lists 0.9 for direct
                # (import-based) resolution; the two values need to be reconciled.
                confidence=0.8,
                resolution_method="import",
            )
        return None
🧪 Implementation Plan
Phase 1: Configuration & Dependency Tracking
- Add dependencies field to Neo4jConfig
- Add DEPENDS_ON_PROJECT edge type
- Create project dependency graph in Neo4j
Phase 2: Cross-Project Symbol Index
- Implement CrossProjectResolver class
- Build index of exported symbols from dependencies
- Add CLI command: ast-rag projects list
Phase 3: Cross-Project CALLS Edges
- Modify extract_edges() to check cross-project resolver
- Create CROSS_PROJECT_CALL edges
- Add confidence scoring
Phase 4: Unified Search
- Modify hybrid_search() to search across dependencies
- Add project filtering to search API
- CLI: ast-rag search --projects=all "query"
Phase 5: Monorepo Support
- Detect monorepo structure (multiple projects in one repo)
- Auto-discover project dependencies from package.json, setup.py, etc.
- Incremental indexing per project
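For the package.json case, auto-discovery could start from a sketch like the one below: read the manifest and keep only the dependencies that correspond to already-indexed projects. discover_npm_dependencies is a hypothetical helper, and the sketch assumes the npm package name equals the project_id, which may not hold in practice.

```python
import json
from pathlib import Path


def discover_npm_dependencies(package_json: Path, known_projects: set[str]) -> dict[str, str]:
    """Return {project_id: version_range} for dependencies that are indexed projects.

    Assumes the npm package name equals the project_id; a real mapping
    would likely need explicit configuration.
    """
    manifest = json.loads(package_json.read_text(encoding="utf-8"))
    found: dict[str, str] = {}
    for section in ("dependencies", "devDependencies"):
        for name, version_range in manifest.get(section, {}).items():
            if name in known_projects:
                # Keep the first occurrence if a package is listed in both sections.
                found.setdefault(name, version_range)
    return found
```

The returned version ranges use npm syntax (for example "^1.2.0"), so they would need translating before being compared against constraints written in the config format proposed earlier.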
⚠️ Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Circular dependencies | High | Detect cycles, warn user, break manually |
| Symbol name collisions | Medium | Use fully qualified names, namespace prefixes |
| Performance (large index) | Medium | Lazy loading, LRU cache for resolver |
| Version mismatches | Low | Version constraints in config, validation |
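The cycle detection mentioned as a mitigation for circular dependencies can be done with a depth-first search over the declared dependency map. A minimal sketch, where find_cycle is a hypothetical helper and deps stands for the dependencies loaded from each project's config:

```python
from typing import Optional


def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a path (first node repeated at the end), or None.

    `deps` maps a project_id to the project_ids it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    state: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # Back edge: dep is already on the current path.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle is not None:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for project in deps:
        if state.get(project, WHITE) == WHITE:
            cycle = visit(project)
            if cycle is not None:
                return cycle
    return None
```

Returning the offending path rather than a boolean lets the warning name the exact projects involved, which helps when the cycle has to be broken manually.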
📊 Success Metrics
- Can create CALLS edges between two projects
- Cross-project search returns results from all dependencies
- Few false positives in symbol resolution (>90% precision)
- Index build time <10s for 10 projects
- CLI: ast-rag search --projects=all works
📚 References
- Bazel BUILD files for dependency graphs
- TypeScript project references
- Python namespace packages
- Maven/Gradle dependency resolution
🎯 Deliverables
- Research document (this file)
- Prototype: CrossProjectResolver class
- Config schema for dependencies
- CLI commands for multi-project operations
- Implementation GitHub issues
Labels: research, enhancement, cross-project, monorepo
Priority: High
Estimated Research Time: 2-3 days