Research: Git Integration Layer (Commit/Author nodes, blame tracking) #34

@lexasub

Research: Git Integration Layer for AST-RAG

🎯 Goal

Design and prototype a full Git integration layer that adds commit/author nodes and blame-tracking edges to the AST graph, enabling queries like "who wrote this function?" and "when was this bug introduced?".

📋 Current State

  • ✅ MVCC versioning via valid_from / valid_to properties (commit hashes)
  • ✅ Incremental updates via update_from_git() using git diff
  • ✅ Diff computation between commits
  • ❌ No Commit or Author nodes in the graph
  • ❌ No edges like AUTHORED, COMMITTED_ON, MODIFIED, INTRODUCED
  • ❌ No blame information (who wrote each line/function)
  • ❌ No temporal queries (show graph state at commit X)

🔍 Research Questions

1. Graph Schema Design

Questions:

  • Should Commit and Author be separate node types or properties?
  • How to handle merge commits (multiple parents)?
  • Should every AST node have edges to commits, or only track changes?

Proposed Schema:

// New Node Types
(:Commit {
    hash: str,
    author_email: str,
    author_name: str,
    committer_email: str,
    committer_name: str,
    message: str,
    timestamp: datetime,
    parents: list[str]
})

(:Author {
    email: str,
    name: str,
    first_commit: datetime,
    last_commit: datetime
})

// New Edge Types
(:Author)-[:AUTHORED]->(:Commit)
(:Commit)-[:MODIFIED]->(:Function {valid_from: commit.hash})
(:Commit)-[:INTRODUCED]->(:Function)
(:Function)-[:CHANGED_IN]->(:Commit)
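
As a rough illustration of how the proposed Commit/Author nodes and the AUTHORED edge could be written idempotently, here is a minimal sketch. The Cypher template and the helper `commit_params` are hypothetical and not part of the existing codebase; property names follow the proposed schema above, and the timestamp is passed as an ISO 8601 string and converted with Cypher's `datetime()`.

```python
from datetime import datetime, timezone

# Hypothetical upsert template. MERGE keeps re-runs idempotent, and the
# CASE expressions widen the author's first/last commit window.
UPSERT_COMMIT = """
WITH datetime($timestamp) AS ts
MERGE (a:Author {email: $author_email})
  ON CREATE SET a.name = $author_name, a.first_commit = ts, a.last_commit = ts
  ON MATCH SET a.first_commit = CASE WHEN ts < a.first_commit THEN ts ELSE a.first_commit END,
               a.last_commit = CASE WHEN ts > a.last_commit THEN ts ELSE a.last_commit END
MERGE (c:Commit {hash: $hash})
  ON CREATE SET c.author_email = $author_email,
                c.author_name = $author_name,
                c.message = $message,
                c.timestamp = ts,
                c.parents = $parents
MERGE (a)-[:AUTHORED]->(c)
"""


def commit_params(hash, author_name, author_email, message, timestamp, parents):
    """Build the parameter map for UPSERT_COMMIT (hypothetical helper)."""
    return {
        "hash": hash,
        "author_name": author_name,
        "author_email": author_email,
        "message": message,
        "timestamp": timestamp.isoformat(),  # ISO 8601, parsed by datetime() in Cypher
        "parents": list(parents),
    }


params = commit_params(
    "abc123", "Ada", "ada@example.com", "Initial commit",
    datetime(2024, 1, 1, tzinfo=timezone.utc), [],
)
```

Keeping `parents` as a list property is the simplest way to record merge commits; modelling parents as edges between Commit nodes is an alternative the research would need to compare.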

2. Performance Impact

Questions:

  • How many new nodes/edges will be added? (estimate: 1 commit per change, 1 author per commit)
  • Will queries slow down with blame edges?
  • Should blame edges be lazy-loaded or precomputed?

Estimates for medium project (1000 commits, 5000 functions):

  • New Commit nodes: ~1000
  • New Author nodes: ~10-50
  • New MODIFIED edges: ~10,000-50,000 (commits × ~10-50 functions touched per commit)
  • Graph size increase: 2-5×
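
The arithmetic behind these estimates can be made explicit with a small calculator. This is a sketch under one stated assumption: each commit touches roughly 10-50 functions, which is the figure that reproduces the 10,000-50,000 edge range for 1000 commits. The function name `estimate_growth` is hypothetical.

```python
def estimate_growth(commits: int, authors: int, functions_per_commit: int) -> dict:
    """Estimate nodes and edges added by the Git layer (rough, assumption-based)."""
    new_nodes = commits + authors              # one Commit node each, plus Author nodes
    authored_edges = commits                   # one AUTHORED edge per commit
    modified_edges = commits * functions_per_commit
    return {
        "nodes": new_nodes,
        "edges": authored_edges + modified_edges,
    }


low = estimate_growth(commits=1000, authors=10, functions_per_commit=10)
high = estimate_growth(commits=1000, authors=50, functions_per_commit=50)
```

Comparing `low` and `high` against the size of the existing graph would show whether the 2-5× figure holds for a given repository.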

3. Blame Analysis Implementation

Approach:

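# Use GitPython blame API
import git

repo = git.Repo(path)  # path: repository root (supplied by the caller)
blame = repo.blame('HEAD', 'path/to/file.py')  # (commit, lines) pairs for consecutive runs of lines
for commit, lines in blame:
    # commit.author, commit.message, lines
    # TODO: link the AST nodes covering these lines to commit.hexsha
    print(commit.hexsha, commit.author.email, len(lines))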

Challenges:

  • Blame is line-based, AST nodes are function/class-based
  • Need to map line ranges to AST nodes
  • Handle moved/renamed functions (git blame -C)
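
The line-range mapping could look something like the following sketch. It assumes blame output has already been reduced to hunks with 1-based inclusive line ranges, and that AST nodes expose start and end lines. The types `BlameHunk` and `AstSpan` and the function `attribute_lines` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BlameHunk:
    commit: str
    start_line: int  # 1-based, inclusive
    end_line: int    # inclusive


@dataclass(frozen=True)
class AstSpan:
    name: str
    start_line: int
    end_line: int


def attribute_lines(hunks, spans):
    """For each AST span, count how many of its lines each commit last touched."""
    result = {}
    for span in spans:
        counts = defaultdict(int)
        for hunk in hunks:
            # Size of the intersection of two inclusive ranges
            overlap = min(span.end_line, hunk.end_line) - max(span.start_line, hunk.start_line) + 1
            if overlap > 0:
                counts[hunk.commit] += overlap
        result[span.name] = dict(counts)
    return result


hunks = [BlameHunk("a1", 1, 4), BlameHunk("b2", 5, 10)]
spans = [AstSpan("foo", 3, 6), AstSpan("bar", 8, 10)]
attribution = attribute_lines(hunks, spans)
```

The per-commit line counts could feed a MODIFIED edge weight, so a function mostly written by one commit is distinguishable from one that was only lightly touched. The nested loop is quadratic; an interval tree or a sorted sweep would be worth considering for large files.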

4. Temporal Queries

Use Cases:

  • "Show me the graph as it was at commit abc123"
  • "When was function X introduced?"
  • "Who last modified this class?"
  • "Show all changes between two dates"

Implementation Options:

  1. Snapshot approach: Store full graph snapshots per commit (expensive)
  2. Delta approach: Reconstruct state by replaying commits (slow queries)
  3. Hybrid: Current MVCC + Commit edges (proposed)
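
For the hybrid option, a point-in-time check on the existing MVCC properties might be sketched as below. Because valid_from / valid_to hold commit hashes, the check needs an ordering of commits; this sketch assumes a linear, oldest-first history (for example, a first-parent walk) and treats valid_to as exclusive. Both are assumptions, and `visible_at` is a hypothetical helper.

```python
from typing import Optional, Sequence


def visible_at(
    valid_from: str,
    valid_to: Optional[str],
    target: str,
    history: Sequence[str],
) -> bool:
    """Return True if a node version is part of the graph at commit `target`.

    `history` lists commit hashes oldest-first; `valid_to=None` means the
    version is still current.
    """
    order = {commit: index for index, commit in enumerate(history)}
    at = order[target]
    if order[valid_from] > at:
        return False  # introduced after the target commit
    return valid_to is None or order[valid_to] > at


history = ["c1", "c2", "c3", "c4"]
```

With Commit nodes in the graph, the `history` ordering could come from the Commit timestamps or parent links instead of being passed in, which is where the proposed Commit edges would help.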

📐 Proposed Architecture

Schema Changes

File: ast_rag/schema/graph_schema.cql

// New constraints
CREATE CONSTRAINT commit_hash IF NOT EXISTS FOR (c:Commit) REQUIRE c.hash IS UNIQUE;
CREATE CONSTRAINT author_email IF NOT EXISTS FOR (a:Author) REQUIRE a.email IS UNIQUE;

// New indexes
CREATE INDEX commit_timestamp IF NOT EXISTS FOR (c:Commit) ON (c.timestamp);
// Relationship index; assumes AUTHORED edges carry a denormalized author_email property
CREATE INDEX authored_by IF NOT EXISTS FOR ()-[a:AUTHORED]-() ON (a.author_email);

New DTOs

File: ast_rag/dto/git.py (new)

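from datetime import datetime

from pydantic import BaseModel  # assuming the DTOs are pydantic models

class GitCommit(BaseModel):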
    hash: str
    short_hash: str
    author_name: str
    author_email: str
    committer_name: str
    committer_email: str
    message: str
    timestamp: datetime
    parents: list[str]

class GitBlameEntry(BaseModel):
    commit: GitCommit
    start_line: int
    end_line: int
    path: str

New Services

File: ast_rag/services/git_service.py (new)

class GitService:
    # Method stubs; "Author" is a DTO that still needs to be defined
    def extract_commits(self, repo_path: str, from_commit: str, to_commit: str) -> list[GitCommit]: ...
    def get_blame(self, repo_path: str, file_path: str, commit: str) -> list[GitBlameEntry]: ...
    def get_author(self, email: str) -> "Author": ...
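
One way `extract_commits` could obtain its data without GitPython is to parse plain `git log` output, which is the approach sketched here (a swapped-in technique, not the one named above). The format string uses standard `git log --format` placeholders with ASCII unit and record separators. `CommitRecord` and `parse_git_log` are hypothetical names, and `%s` yields only the subject line rather than the full message.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

FIELD_SEP = "\x1f"   # ASCII unit separator between fields
RECORD_SEP = "\x1e"  # ASCII record separator between commits

# %H hash, %an/%ae author name/email, %cn/%ce committer name/email,
# %ct committer date (Unix timestamp), %P parent hashes, %s subject
LOG_FORMAT = "%H%x1f%an%x1f%ae%x1f%cn%x1f%ce%x1f%ct%x1f%P%x1f%s%x1e"


@dataclass
class CommitRecord:
    hash: str
    author_name: str
    author_email: str
    committer_name: str
    committer_email: str
    timestamp: datetime
    parents: list[str]
    message: str


def parse_git_log(raw: str) -> list[CommitRecord]:
    """Parse output produced by `git log --format=<LOG_FORMAT>`."""
    records = []
    for chunk in raw.split(RECORD_SEP):
        chunk = chunk.strip("\n")
        if not chunk:
            continue
        h, an, ae, cn, ce, ct, parents, subject = chunk.split(FIELD_SEP, 7)
        records.append(CommitRecord(
            hash=h, author_name=an, author_email=ae,
            committer_name=cn, committer_email=ce,
            timestamp=datetime.fromtimestamp(int(ct), tz=timezone.utc),
            parents=parents.split(),  # empty for root commits, two or more for merges
            message=subject,
        ))
    return records


sample = (
    "aaa\x1fAlice\x1falice@example.com\x1fBob\x1fbob@example.com\x1f0\x1fp1 p2\x1fMerge branch\x1e\n"
    "bbb\x1fAlice\x1falice@example.com\x1fAlice\x1falice@example.com\x1f60\x1f\x1fInitial\x1e\n"
)
commits = parse_git_log(sample)
```

The raw text would come from something like `git -C <repo_path> log --format=<LOG_FORMAT> <from_commit>..<to_commit>`, run through `subprocess`. Each `CommitRecord` maps field-for-field onto the proposed GitCommit DTO, apart from `short_hash`.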

File: ast_rag/services/graph_updater_service.py (modify)

def update_from_git(
    self,
    # ... existing params
    extract_git_metadata: bool = True,  # New flag
) -> DiffResult:
    # After applying diff:
    if extract_git_metadata:
        self._update_git_nodes(diff, new_commit_hash)

🧪 Prototype Plan

Phase 1: Basic Commit/Author Nodes

  1. Add Commit and Author node types to schema
  2. Extract commit metadata during update_from_git()
  3. Create AUTHORED and COMMITTED_ON edges
  4. Test with small repository

Phase 2: Blame Integration

  1. Implement GitService.get_blame()
  2. Map blame line ranges to AST nodes
  3. Create MODIFIED edges from commits to AST nodes
  4. Add confidence scores (direct blame = 1.0, inherited = 0.5)

Phase 3: Temporal Queries API

  1. Add get_node_at_commit(node_id, commit_hash) method
  2. Add get_commit_history(node_id) method
  3. Add get_changes_between(from_commit, to_commit) method
  4. CLI commands: ast-rag blame <function>, ast-rag history <function>
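
The two CLI commands could be wired up roughly as follows. This sketch uses the standard-library `argparse`; the project's actual CLI framework is not stated here, so treat the structure and the `build_parser` name as assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing `blame` and `history` subcommands (sketch)."""
    parser = argparse.ArgumentParser(prog="ast-rag")
    subcommands = parser.add_subparsers(dest="command", required=True)

    blame = subcommands.add_parser("blame", help="show who last modified a function")
    blame.add_argument("function", help="name of the function to inspect")

    history = subcommands.add_parser("history", help="list commits that changed a function")
    history.add_argument("function", help="name of the function to inspect")

    return parser


args = build_parser().parse_args(["blame", "parse_file"])
```

Each subcommand would then dispatch to the corresponding API method, for example `history` to get_commit_history(node_id) once the function name has been resolved to a node.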

Phase 4: Performance Optimization

  1. Benchmark query performance with blame edges
  2. Add edge expiration (don't track every minor change)
  3. Consider edge compression (group by author/date ranges)
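
Edge compression by author and date range might work along these lines. This is a minimal sketch assuming each modification of a node is known as an (author, commit, timestamp) triple. Consecutive modifications by the same author collapse into one edge carrying a date range and a count. `Modification`, `CompressedEdge` and `compress` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass(frozen=True)
class Modification:
    author_email: str
    commit: str
    timestamp: datetime


@dataclass(frozen=True)
class CompressedEdge:
    author_email: str
    first: datetime
    last: datetime
    commit_count: int


def compress(mods):
    """Collapse runs of consecutive same-author modifications into single edges."""
    ordered = sorted(mods, key=lambda m: m.timestamp)
    edges = []
    for author, run in groupby(ordered, key=lambda m: m.author_email):
        run = list(run)
        edges.append(CompressedEdge(author, run[0].timestamp, run[-1].timestamp, len(run)))
    return edges


mods = [
    Modification("alice@example.com", "c1", datetime(2024, 1, 1)),
    Modification("alice@example.com", "c2", datetime(2024, 1, 5)),
    Modification("bob@example.com", "c3", datetime(2024, 2, 1)),
    Modification("alice@example.com", "c4", datetime(2024, 3, 1)),
]
edges = compress(mods)
```

The trade-off is that individual commit hashes are lost on the compressed edge, so "when was this bug introduced?" queries would still need the uncompressed data or a fallback to git.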

⚠️ Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Graph size explosion | High | Limit blame tracking to function/class level, not lines |
| Query performance degradation | Medium | Add indexes, use edge type filtering |
| Git blame is slow | Medium | Cache blame results, lazy loading |
| Merge commit complexity | Low | Track first parent only, or all parents with weights |

📊 Success Metrics

  • Can query "who wrote function X?" in <100ms
  • Can answer "when was this bug introduced?" with a commit hash
  • Graph size increase <3× for typical projects
  • No regression in existing query performance
  • CLI command ast-rag blame <function> works

🎯 Deliverables

  1. Research document (this file) with findings
  2. Prototype implementation in feature branch
  3. Performance benchmarks
  4. Final design document with recommendations
  5. GitHub issues for implementation tasks

Labels: research, enhancement, git-integration
Priority: High
Estimated Research Time: 2-3 days
