Research: Cross-Repository Graph Support (monorepo, cross-project calls) #35

@lexasub

Research: Cross-Repository Graph Support

🎯 Goal

Design and implement cross-repository graph support that enables CALLS edges between projects, dependency tracking across repos, and unified semantic search across multiple related codebases.

📋 Current State

  • ✅ Single-project indexing with project_id isolation
  • ✅ CROSS_FILE_CALL edges for cross-file calls within a project
  • ✅ DEPENDS_ON edges for import/include directives
  • ❌ No CALLS edges between different project_id values
  • ❌ No mechanism to declare inter-project dependencies
  • ❌ No cross-project symbol resolution
  • ❌ Qdrant collections are per-project (no shared embeddings)

🔍 Research Questions

1. Project Dependency Configuration

Questions:

  • How do we declare that project A depends on project B?
  • Should dependencies be declared in a config file or discovered automatically?
  • How do we handle versioned dependencies (A v1.0 depends on B v2.3)?

Proposed Config Format:

{
  "neo4j": {
    "project_id": "microservice-a",
    "dependencies": [
      {"project_id": "shared-lib-b", "version_constraint": ">=1.0.0"},
      {"project_id": "common-utils", "version_constraint": "*"}
    ]
  }
}
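
A minimal sketch of how such a config could be loaded and a constraint checked. It assumes plain dotted numeric versions and only the operators >=, <=, ==, >, < plus "*". The names DependencySpec, load_dependencies and satisfies are hypothetical and not part of ast-rag; the loader accepts either version or version_constraint as the key name. A real implementation would probably delegate constraint matching to an existing version-specifier library.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencySpec:
    """One declared inter-project dependency (hypothetical helper type)."""
    project_id: str
    constraint: str = "*"


def load_dependencies(config_text: str) -> list[DependencySpec]:
    """Read neo4j.dependencies from a JSON config string."""
    config = json.loads(config_text)
    raw = config.get("neo4j", {}).get("dependencies", [])
    return [
        DependencySpec(
            project_id=d["project_id"],
            # Accept either key name; default to "any version".
            constraint=d.get("version_constraint", d.get("version", "*")),
        )
        for d in raw
    ]


_CONSTRAINT = re.compile(r"^(>=|<=|==|>|<)?\s*(\d+(?:\.\d+)*)$")


def _parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, constraint: str) -> bool:
    """Return True if a dotted numeric version meets the constraint."""
    constraint = constraint.strip()
    if constraint == "*":
        return True
    match = _CONSTRAINT.match(constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    op = match.group(1) or "=="
    have, want = _parse(version), _parse(match.group(2))
    # Pad with zeros so that 2.3 and 2.3.0 compare as equal.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return {
        ">=": have >= want,
        "<=": have <= want,
        "==": have == want,
        ">": have > want,
        "<": have < want,
    }[op]
```

For example, satisfies("2.3", ">=1.0.0") holds while satisfies("0.9.9", ">=1.0.0") does not.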

2. Cross-Project Symbol Resolution

Challenges:

  • Current ID scheme: the first 24 hex characters of SHA256(project_id:file_path:kind:qualified_name)
  • Because project_id is part of the hash, the same class in different projects has different IDs
  • How do we resolve import com.example.Service to the correct project?

Approaches:

A. Import-Based Resolution:

# When parsing file that imports com.example.Service
if import_path in cross_project_index:
    target_project = cross_project_index[import_path]
    # Create CALLS edge to target_project:Service

B. Type-Based Resolution:

# Use type information from signatures
if method_returns_type == "com.example.Service":
    # Search across all dependent projects
    matches = find_in_projects("com.example.Service", projects=dependencies)

C. Hybrid (Proposed):

  • Import-based for direct dependencies
  • Type-based for indirect/transitive calls
  • Confidence scores: direct=0.9, type-based=0.6
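
The hybrid strategy could be sketched as follows. This is an illustration under assumptions, not the project's implementation: Resolution, resolve_hybrid, import_index and type_index are hypothetical names, and the choice to reject ambiguous type-based matches (rather than guess) is one possible policy.

```python
from dataclasses import dataclass
from typing import Optional

DIRECT_CONFIDENCE = 0.9  # import-based, from approach C above
TYPE_CONFIDENCE = 0.6    # type-based, from approach C above


@dataclass(frozen=True)
class Resolution:
    project_id: str
    method: str        # "import" or "type"
    confidence: float


def resolve_hybrid(
    symbol: str,
    file_imports: set[str],
    import_index: dict[str, str],
    type_index: dict[str, list[str]],
) -> Optional[Resolution]:
    """Try import-based resolution first, then fall back to type-based.

    import_index maps a qualified name to the direct dependency exporting it.
    type_index maps a qualified name to every dependent project defining it.
    """
    # 1. The file explicitly imports the symbol and a direct dependency exports it.
    if symbol in file_imports and symbol in import_index:
        return Resolution(import_index[symbol], "import", DIRECT_CONFIDENCE)

    # 2. Otherwise accept a type-based match only when it is unambiguous.
    candidates = type_index.get(symbol, [])
    if len(candidates) == 1:
        return Resolution(candidates[0], "type", TYPE_CONFIDENCE)

    # Unknown symbol, or several projects define it: do not create an edge.
    return None
```

Returning None for ambiguous matches trades recall for precision; an alternative would be to emit one low-confidence edge per candidate.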

3. Graph Schema Changes

Current Node ID:

raw = f"{self.project_id}:{self.file_path}:{self.kind.value}:{self.qualified_name}"
self.id = hashlib.sha256(raw.encode()).hexdigest()[:24]

Proposed Changes:

  • Keep project-specific IDs (for isolation)
  • Add global lookup index for cross-project resolution
  • Add CROSS_PROJECT_CALL edge type with metadata
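
A rough sketch of how project-specific IDs and a global lookup index can coexist. The node_id function mirrors the current ID scheme shown above; GlobalSymbolIndex and its methods are hypothetical names used only for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Optional


def node_id(project_id: str, file_path: str, kind: str, qualified_name: str) -> str:
    """Project-specific ID: first 24 hex chars of the SHA-256 digest."""
    raw = f"{project_id}:{file_path}:{kind}:{qualified_name}"
    return hashlib.sha256(raw.encode()).hexdigest()[:24]


class GlobalSymbolIndex:
    """Maps a qualified name to every (project_id, node_id) that defines it."""

    def __init__(self) -> None:
        self._by_name: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add(self, project_id: str, file_path: str, kind: str, qualified_name: str) -> str:
        nid = node_id(project_id, file_path, kind, qualified_name)
        self._by_name[qualified_name].append((project_id, nid))
        return nid

    def lookup(
        self, qualified_name: str, projects: Optional[set[str]] = None
    ) -> list[tuple[str, str]]:
        """Return all definitions, optionally restricted to some projects."""
        hits = self._by_name.get(qualified_name, [])
        if projects is None:
            return list(hits)
        return [(pid, nid) for pid, nid in hits if pid in projects]
```

Keeping a list per name, instead of a single entry, makes symbol name collisions visible to the caller instead of silently dropping one definition.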

New Edge Type:

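from typing import Optional

from pydantic import BaseModel

class ASTEdge(BaseModel):
    kind: EdgeKind
    from_id: str
    to_id: str
    # New fields. Each has a default so existing intra-project edges still validate.
    target_project_id: Optional[str] = None  # Set only for cross-project edges
    resolution_method: Optional[str] = None  # "import", "type", or "name"
    confidence: float = 1.0  # 0.0-1.0; the 1.0 default for existing edges is an assumption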

4. Neo4j Multi-Project Queries

Challenge: How to query across projects efficiently?

Approach:

// Query: Find all callers of Service.process() across all projects
MATCH (caller)-[c:CALLS|CROSS_PROJECT_CALL]->(target)
WHERE target.qualified_name CONTAINS 'Service.process'
  AND target.project_id IN $project_ids
RETURN caller, c, target

Index Requirements:

CREATE INDEX project_qualified_name IF NOT EXISTS 
    FOR (n:Function) ON (n.project_id, n.qualified_name);

5. Qdrant Cross-Project Search

Current: One collection per project (ast_rag_{project_id})

Options:

A. Single Shared Collection:

  • All projects index to same Qdrant collection
  • Add project_id to payload for filtering
  • Pro: Unified semantic search
  • Con: Less isolation, larger collection

B. Multi-Collection Search:

  • Keep per-project collections
  • Search all dependent collections, merge results
  • Pro: Better isolation
  • Con: More complex search logic

C. Hybrid (Proposed):

  • Default: per-project collections
  • Optional: shared collection for cross-project search
  • Config flag: qdrant.shared_collection: true/false
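
Option B (multi-collection search) mostly comes down to merging ranked hits. The sketch below shows that merge step only. search_one is a hypothetical callable standing in for whatever per-collection Qdrant query the project uses; it is assumed to return (score, node_id) pairs where a higher score is better. Scores are only comparable across collections if every project uses the same embedding model and distance metric.

```python
import heapq
from typing import Callable, Iterable

# (score, project_id, node_id)
Hit = tuple[float, str, str]


def collection_name(project_id: str) -> str:
    """Per-project collection naming used today: ast_rag_{project_id}."""
    return f"ast_rag_{project_id}"


def search_across(
    project_ids: Iterable[str],
    search_one: Callable[[str, int], list[tuple[float, str]]],
    limit: int = 10,
) -> list[Hit]:
    """Query each project's collection and keep the best `limit` hits overall."""
    hits: list[Hit] = []
    for project_id in project_ids:
        for score, node_id in search_one(collection_name(project_id), limit):
            hits.append((score, project_id, node_id))
    # Highest score first.
    return heapq.nlargest(limit, hits, key=lambda hit: hit[0])
```

Each hit keeps its project_id, so the same filtering that a shared collection would do through the payload can be applied to the merged list.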

📐 Proposed Architecture

Configuration

File: ast_rag/dto/config.py

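from typing import Optional

from pydantic import BaseModel, Field

class ProjectDependency(BaseModel):
    project_id: str
    version_constraint: Optional[str] = None  # e.g., ">=1.0.0"
    path: Optional[str] = None  # Local path for monorepo

# Defined after ProjectDependency so the annotation below resolves.
class Neo4jConfig(BaseModel):
    project_id: str = "default"
    # pydantic's Field (not dataclasses.field) supplies the default factory.
    dependencies: list[ProjectDependency] = Field(default_factory=list)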

Cross-Project Index

File: ast_rag/services/cross_project_resolver.py (new)

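from typing import Optional

from neo4j import Driver

from ast_rag.dto.config import ProjectDependency

class CrossProjectResolver:
    def __init__(self, driver: Driver, dependencies: list[ProjectDependency]):
        self._driver = driver
        self._dependencies = dependencies
        self._symbol_index: dict[str, str] = {}  # symbol -> project_id

    def build_index(self) -> None:
        """Build in-memory index of exported symbols from dependencies."""
        for dep in self._dependencies:
            for symbol in self._get_exported_symbols(dep.project_id):
                # On a name collision the later dependency overwrites the earlier one.
                self._symbol_index[symbol] = dep.project_id

    def resolve(self, symbol_name: str) -> Optional[tuple[str, str]]:
        """Resolve symbol to (project_id, node_id), or None if unknown."""
        project_id = self._symbol_index.get(symbol_name)
        if project_id is None:
            return None
        node_id = self._get_node_id(project_id, symbol_name)
        return (project_id, node_id) if node_id else None

    def _get_exported_symbols(self, project_id: str) -> list[str]:
        # Query Neo4j for live classes/functions plus any node marked public.
        cypher = """
        MATCH (n)
        WHERE n.project_id = $project_id
          AND n.valid_to IS NULL
          AND (n.kind IN ['CLASS', 'FUNCTION'] OR n.visibility = 'public')
        RETURN n.qualified_name AS qualified_name
        """
        with self._driver.session() as session:
            result = session.run(cypher, project_id=project_id)
            return [record["qualified_name"] for record in result]

    def _get_node_id(self, project_id: str, symbol_name: str) -> Optional[str]:
        # Assumption: nodes store their hashed ID in an `id` property.
        cypher = """
        MATCH (n)
        WHERE n.project_id = $project_id
          AND n.qualified_name = $name
          AND n.valid_to IS NULL
        RETURN n.id AS id LIMIT 1
        """
        with self._driver.session() as session:
            record = session.run(cypher, project_id=project_id, name=symbol_name).single()
            return record["id"] if record else None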

Parser Manager Changes

File: ast_rag/services/parsing/parser_manager.py

class ParserManager:
    def __init__(
        self,
        project_id: str = "default",
        cross_project_resolver: Optional[CrossProjectResolver] = None,
    ):
        self._project_id = project_id
        self._resolver = cross_project_resolver
    
    def _resolve_cross_project_call(
        self,
        symbol_name: str,
        caller_id: str,
    ) -> Optional[ASTEdge]:
        if not self._resolver:
            return None
        
        result = self._resolver.resolve(symbol_name)
        if result:
            target_project_id, target_node_id = result
            return ASTEdge(
                kind=EdgeKind.CROSS_PROJECT_CALL,
                from_id=caller_id,
                to_id=target_node_id,
                target_project_id=target_project_id,
                confidence=0.9,  # direct (import-based) resolution score
                resolution_method="import",
            )
        return None

🧪 Implementation Plan

Phase 1: Configuration & Dependency Tracking

  1. Add dependencies field to Neo4jConfig
  2. Add DEPENDS_ON_PROJECT edge type
  3. Create project dependency graph in Neo4j

Phase 2: Cross-Project Symbol Index

  1. Implement CrossProjectResolver class
  2. Build index of exported symbols from dependencies
  3. Add CLI command: ast-rag projects list

Phase 3: Cross-Project CALLS Edges

  1. Modify extract_edges() to check cross-project resolver
  2. Create CROSS_PROJECT_CALL edges
  3. Add confidence scoring

Phase 4: Unified Search

  1. Modify hybrid_search() to search across dependencies
  2. Add project filtering to search API
  3. CLI: ast-rag search --projects=all "query"

Phase 5: Monorepo Support

  1. Detect monorepo structure (multiple projects in one repo)
  2. Auto-discover project dependencies from package.json, setup.py, etc.
  3. Incremental indexing per project
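
Auto-discovery for the package.json case could look roughly like the sketch below. It is an assumption-laden illustration: discover_projects is a hypothetical name, each package.json with a name field is treated as one project, and only dependencies on other projects found in the same repo are kept. Manifests such as setup.py are not handled here, since they are executable code and not plain data.

```python
import json
from pathlib import Path


def discover_projects(repo_root: Path) -> dict[str, list[str]]:
    """Map each project in the repo to the in-repo projects it depends on."""
    manifests: dict[str, dict] = {}
    for manifest in repo_root.rglob("package.json"):
        # Skip vendored third-party packages.
        if "node_modules" in manifest.relative_to(repo_root).parts:
            continue
        data = json.loads(manifest.read_text(encoding="utf-8"))
        name = data.get("name")
        if name:
            manifests[name] = data

    graph: dict[str, list[str]] = {}
    for name, data in manifests.items():
        declared = {
            **data.get("dependencies", {}),
            **data.get("devDependencies", {}),
        }
        # Keep only dependencies that are themselves projects in this repo.
        graph[name] = sorted(dep for dep in declared if dep in manifests)
    return graph
```

The resulting mapping has the same shape as a hand-written dependencies list, so both sources could feed the same DEPENDS_ON_PROJECT edges.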

⚠️ Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Circular dependencies | High | Detect cycles, warn the user, break manually |
| Symbol name collisions | Medium | Use fully qualified names, namespace prefixes |
| Performance (large index) | Medium | Lazy loading, LRU cache for resolver |
| Version mismatches | Low | Version constraints in config, validation |
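
For the circular-dependency risk, cycle detection can lean on Python's standard graphlib module. A minimal sketch, assuming the project dependency graph is available as a mapping from project_id to the project_ids it depends on (find_cycle and indexing_order are hypothetical names):

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional


def find_cycle(dependencies: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of project IDs, or None."""
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as exc:
        # graphlib reports the cycle as the second element of args;
        # the first and last node in that list are the same.
        return list(exc.args[1])
    return None


def indexing_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Order projects so that every dependency is indexed before its dependents.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

The cycle returned by find_cycle can be shown to the user in the warning, and indexing_order gives a build order for the symbol index once the cycle is broken.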

📊 Success Metrics

  • Can create CALLS edges between two projects
  • Cross-project search returns results from all dependencies
  • Few false positives in symbol resolution (precision >90%)
  • Index build time <10s for 10 projects
  • CLI: ast-rag search --projects=all works

📚 References

  • Bazel BUILD files for dependency graphs
  • TypeScript project references
  • Python namespace packages
  • Maven/Gradle dependency resolution

🎯 Deliverables

  1. Research document (this file)
  2. Prototype: CrossProjectResolver class
  3. Config schema for dependencies
  4. CLI commands for multi-project operations
  5. Implementation GitHub issues

Labels: research, enhancement, cross-project, monorepo
Priority: High
Estimated Research Time: 2-3 days
