Research: Git Integration Layer for AST-RAG
🎯 Goal
Design and prototype a full Git integration layer that adds commit/author nodes and blame-tracking edges to the AST graph, enabling queries like "who wrote this function?" and "when was this bug introduced?".
📋 Current State
- ✅ MVCC versioning via `valid_from`/`valid_to` properties (commit hashes)
- ✅ Incremental updates via `update_from_git()` using git diff
- ✅ Diff computation between commits
- ❌ No `Commit` or `Author` nodes in the graph
- ❌ No edges like `AUTHORED`, `COMMITTED_ON`, `MODIFIED`, `INTRODUCED`
- ❌ No blame information (who wrote each line/function)
- ❌ No temporal queries (show graph state at commit X)
🔍 Research Questions
1. Graph Schema Design
Questions:
- Should `Commit` and `Author` be separate node types or properties?
- How to handle merge commits (multiple parents)?
- Should every AST node have edges to commits, or only track changes?
Proposed Schema:
// New Node Types
(:Commit {
  hash: str,
  author_email: str,
  author_name: str,
  committer_email: str,
  committer_name: str,
  message: str,
  timestamp: datetime,
  parents: list[str]
})
(:Author {
  email: str,
  name: str,
  first_commit: datetime,
  last_commit: datetime
})
// New Edge Types
(:Author)-[:AUTHORED]->(:Commit)
(:Commit)-[:MODIFIED]->(:Function {valid_from: commit.hash})
(:Commit)-[:INTRODUCED]->(:Function)
(:Function)-[:CHANGED_IN]->(:Commit)
2. Performance Impact
Questions:
- How many new nodes/edges will be added? (estimate: 1 commit per change, 1 author per commit)
- Will queries slow down with blame edges?
- Should blame edges be lazy-loaded or precomputed?
Estimates for medium project (1000 commits, 5000 functions):
- New `Commit` nodes: ~1000
- New `Author` nodes: ~10-50
- New `MODIFIED` edges: ~10,000-50,000 (commits × functions touched per commit, assuming roughly 10-50 per commit; not the full 5000 × 1000 product)
- Graph size increase: 2-5×
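The estimates above can be reproduced with back-of-envelope arithmetic. The sketch below assumes each commit touches 10 to 50 functions on average, a figure chosen here to match the ~10,000-50,000 range rather than one measured on a real repository. The names `GitLayerEstimate` and `estimate_git_layer_size` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitLayerEstimate:
    commit_nodes: int
    author_nodes: int
    authored_edges: int
    modified_edges: int


def estimate_git_layer_size(
    commits: int, authors: int, functions_per_commit: float
) -> GitLayerEstimate:
    """Rough size of the Git layer.

    Assumes one Commit node and one AUTHORED edge per commit, plus one
    MODIFIED edge for every function touched by a commit.
    """
    return GitLayerEstimate(
        commit_nodes=commits,
        author_nodes=authors,
        authored_edges=commits,
        modified_edges=round(commits * functions_per_commit),
    )


# Medium project from the estimate: 1000 commits, 10-50 authors.
low = estimate_git_layer_size(commits=1000, authors=10, functions_per_commit=10)
high = estimate_git_layer_size(commits=1000, authors=50, functions_per_commit=50)
print(low.modified_edges, high.modified_edges)
```

Running the same helper against the real commit count and an average diff size taken from `git log --stat` would give a project-specific number before any schema work starts.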
3. Blame Analysis Implementation
Approach:
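# Use the GitPython blame API
import git

repo = git.Repo(path)  # path: local checkout of the repository
blame = repo.blame('HEAD', 'path/to/file.py')
for commit, lines in blame:
    # Each entry pairs a commit with the consecutive lines it last touched:
    # commit.author, commit.message, lines
    # Link AST nodes to commits
    ...
Challenges: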
- Blame is line-based, AST nodes are function/class-based
- Need to map line ranges to AST nodes
- Handle moved/renamed functions (git blame -C)
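One way to bridge line-based blame and node-based AST data is to intersect blame line ranges with each node's line span. The following is a minimal sketch using only the standard-library `ast` module on Python source. It assumes blame entries have already been reduced to `(commit_hash, start_line, end_line)` tuples with 1-based inclusive lines. The helper names are hypothetical, and the real AST-RAG parser may expose node ranges differently.

```python
import ast
from collections import Counter, defaultdict


def function_ranges(source: str) -> dict[str, tuple[int, int]]:
    """Map each function/class name to its (first_line, last_line) span.

    Simplification: nodes are keyed by bare name, so same-named methods in
    different classes would overwrite each other.
    """
    ranges = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            ranges[node.name] = (node.lineno, node.end_lineno)
    return ranges


def lines_per_commit(ranges, blame_entries):
    """Count, per AST node, how many of its lines each commit last touched."""
    attribution = defaultdict(Counter)
    for name, (first, last) in ranges.items():
        for commit_hash, start, end in blame_entries:
            overlap = min(last, end) - max(first, start) + 1
            if overlap > 0:
                attribution[name][commit_hash] += overlap
    return attribution


SOURCE = "def a():\n    return 1\n\n\ndef b():\n    x = 2\n    return x\n"
BLAME = [("c1", 1, 5), ("c2", 6, 6), ("c1", 7, 7)]

attribution = lines_per_commit(function_ranges(SOURCE), BLAME)
for name, counts in attribution.items():
    print(name, dict(counts))
```

The per-commit line counts could later feed the edge weights or confidence scores discussed in the prototype plan, for example by normalizing each count by the node's total line count.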
4. Temporal Queries
Use Cases:
- "Show me the graph as it was at commit abc123"
- "When was function X introduced?"
- "Who last modified this class?"
- "Show all changes between two dates"
Implementation Options:
- Snapshot approach: Store full graph snapshots per commit (expensive)
- Delta approach: Reconstruct state by replaying commits (slow queries)
- Hybrid: Current MVCC + Commit edges (proposed)
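For the hybrid option, "graph state at commit X" reduces to a visibility check on each node version. Because `valid_from`/`valid_to` hold commit hashes rather than timestamps, the check needs an ordering of commits. The sketch below assumes a single linear history (for example the first-parent chain), that `valid_to` names the first commit in which the version is no longer valid, and that `None` means "still current". These semantics are assumptions, not confirmed behavior of the existing MVCC code.

```python
from typing import Optional


def visible_at(
    valid_from: str,
    valid_to: Optional[str],
    target: str,
    history: list[str],
) -> bool:
    """Return True if a node version exists in the graph as of `target`.

    `history` lists commit hashes oldest-to-newest along one line of history.
    The version is visible in the half-open interval [valid_from, valid_to).
    """
    position = {commit: index for index, commit in enumerate(history)}
    start = position[valid_from]
    end = position[valid_to] if valid_to is not None else len(history)
    return start <= position[target] < end


history = ["c1", "c2", "c3", "c4"]
print(visible_at("c2", "c4", "c3", history))
print(visible_at("c2", None, "c4", history))
```

In a graph database the same comparison would more likely run on an indexed, monotonically increasing commit sequence number stored next to the hash, since hashes themselves carry no order.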
📐 Proposed Architecture
Schema Changes
File: ast_rag/schema/graph_schema.cql
// New constraints
CREATE CONSTRAINT commit_hash IF NOT EXISTS FOR (c:Commit) REQUIRE c.hash IS UNIQUE;
CREATE CONSTRAINT author_email IF NOT EXISTS FOR (a:Author) REQUIRE a.email IS UNIQUE;
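// New indexes
CREATE INDEX commit_timestamp IF NOT EXISTS FOR (c:Commit) ON (c.timestamp);
// Note: this index only helps if AUTHORED relationships carry an author_email property, which the proposed schema does not define yet.
CREATE INDEX authored_by IF NOT EXISTS FOR ()-[a:AUTHORED]-() ON (a.author_email);
New DTOs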
File: ast_rag/dto/git.py (new)
from datetime import datetime

from pydantic import BaseModel


class GitCommit(BaseModel):
    hash: str
    short_hash: str
    author_name: str
    author_email: str
    committer_name: str
    committer_email: str
    message: str
    timestamp: datetime
    parents: list[str]


class GitBlameEntry(BaseModel):
    commit: GitCommit
    start_line: int
    end_line: int
    path: str
New Services
File: ast_rag/services/git_service.py (new)
class GitService:
    def extract_commits(self, repo_path: str, from_commit: str, to_commit: str) -> list[GitCommit]: ...
    def get_blame(self, repo_path: str, file_path: str, commit: str) -> list[GitBlameEntry]: ...
    # Author is not among the DTOs listed above and still needs to be defined.
    def get_author(self, email: str) -> Author: ...
File: ast_rag/services/graph_updater_service.py (modify)
def update_from_git(
    # ... existing params
    extract_git_metadata: bool = True,  # New flag
) -> DiffResult:
    # After applying diff:
    if extract_git_metadata:
        self._update_git_nodes(diff, new_commit_hash)
🧪 Prototype Plan
Phase 1: Basic Commit/Author Nodes
- Add `Commit` and `Author` node types to schema
- Extract commit metadata during `update_from_git()`
- Create `AUTHORED` and `COMMITTED_ON` edges
- Test with small repository
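A possible shape for the Phase 1 write path is an idempotent upsert, so that re-running `update_from_git()` over the same commits does not duplicate nodes. The sketch below only builds a parameterized Cypher statement and its parameters. The function name `build_commit_upsert` and the plain-dict input are hypothetical, and `COMMITTED_ON` is left out because its endpoints are not specified in the proposed schema.

```python
from datetime import datetime, timezone

# MERGE keeps the write idempotent: matching nodes and edges are reused.
COMMIT_UPSERT = """
MERGE (a:Author {email: $author_email})
  ON CREATE SET a.name = $author_name
MERGE (c:Commit {hash: $hash})
  ON CREATE SET c.message = $message,
                c.timestamp = $timestamp,
                c.parents = $parents
MERGE (a)-[:AUTHORED]->(c)
"""


def build_commit_upsert(commit: dict) -> tuple[str, dict]:
    """Return (query, params) for upserting one commit and its author."""
    params = {
        "hash": commit["hash"],
        "author_email": commit["author_email"],
        "author_name": commit["author_name"],
        "message": commit["message"],
        "timestamp": commit["timestamp"],
        "parents": list(commit["parents"]),
    }
    return COMMIT_UPSERT, params


query, params = build_commit_upsert(
    {
        "hash": "abc123",
        "author_email": "dev@example.com",
        "author_name": "Example Dev",
        "message": "Fix parser edge case",
        "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "parents": ["def456"],
    }
)
print(sorted(params))
```

With the official Neo4j Python driver, the pair would typically be passed to `session.run(query, params)`. Keeping the query builder separate from the driver call lets the small-repository test assert on the parameters without a running database.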
Phase 2: Blame Integration
- Implement `GitService.get_blame()`
- Map blame line ranges to AST nodes
- Create `MODIFIED` edges from commits to AST nodes
- Add confidence scores (direct blame = 1.0, inherited = 0.5)
Phase 3: Temporal Queries API
- Add `get_node_at_commit(node_id, commit_hash)` method
- Add `get_commit_history(node_id)` method
- Add `get_changes_between(from_commit, to_commit)` method
- CLI commands: `ast-rag blame <function>`, `ast-rag history <function>`
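The two CLI commands could be wired up roughly as follows. This is a sketch using the standard-library `argparse`; the existing `ast-rag` CLI may be built on a different framework, and the argument name `function` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing `ast-rag blame <function>` and `ast-rag history <function>`."""
    parser = argparse.ArgumentParser(prog="ast-rag")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name, help_text in (
        ("blame", "show which commits and authors last touched a function"),
        ("history", "list the commits that changed a function"),
    ):
        command = subcommands.add_parser(name, help=help_text)
        command.add_argument("function", help="qualified name of the function")
    return parser


args = build_parser().parse_args(["blame", "my_module.my_function"])
print(args.command, args.function)
```

Each subcommand would then dispatch to the matching temporal query method, for example `history` to `get_commit_history(node_id)` once the function name has been resolved to a node ID.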
Phase 4: Performance Optimization
- Benchmark query performance with blame edges
- Add edge expiration (don't track every minor change)
- Consider edge compression (group by author/date ranges)
⚠️ Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Graph size explosion | High | Limit blame tracking to function/class level, not lines |
| Query performance degradation | Medium | Add indexes, use edge type filtering |
| Git blame is slow | Medium | Cache blame results, lazy loading |
| Merge commit complexity | Low | Track first parent only, or all parents with weights |
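The "first parent only" mitigation for merge commits can be sketched as a walk over the `parents` lists already planned for `Commit` nodes. The function below assumes parents are stored in Git's order (mainline parent first) and takes a hypothetical in-memory mapping of hash to parent hashes. It follows the same mainline that `git log --first-parent` does.

```python
def first_parent_chain(head: str, parents: dict[str, list[str]]) -> list[str]:
    """Follow the first parent from `head` back to the root.

    Returns hashes newest-to-oldest. Side branches merged into the mainline
    are skipped; only the merge commit itself appears in the chain.
    """
    chain = []
    seen = set()
    current = head
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        commit_parents = parents.get(current, [])
        current = commit_parents[0] if commit_parents else None
    return chain


# "m" merges feature commit "f1" into the mainline commit "c2".
parents = {"m": ["c2", "f1"], "c2": ["c1"], "f1": ["c1"], "c1": []}
print(first_parent_chain("m", parents))
```

The trade-off is that work done on `f1` is attributed to the merge commit `m` on the mainline, which keeps history linear for temporal queries but hides the original author unless the other parents are tracked separately.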
📊 Success Metrics
- Can query "who wrote function X?" in <100ms
- Can query "when was bug introduced?" with commit hash
- Graph size increase <3× for typical projects
- No regression in existing query performance
- CLI command `ast-rag blame <function>` works
📚 References
- GitPython documentation: https://gitpython.readthedocs.io/
- Neo4j temporal graph patterns: https://neo4j.com/developer/cypher/guide-modeling-product-state/
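- GitHub REST API, repository contents (the REST API has no dedicated blame endpoint; blame data is exposed through the GraphQL API): https://docs.github.com/en/rest/repos/contents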
🎯 Deliverables
- Research document (this file) with findings
- Prototype implementation in feature branch
- Performance benchmarks
- Final design document with recommendations
- GitHub issues for implementation tasks
Labels: research, enhancement, git-integration
Priority: High
Estimated Research Time: 2-3 days