Description
In the context of the project's evolution and the existing issues regarding multi-project support, we are missing a crucial dimension: Time (Versioning). Code within the same repository can exist in multiple states (branches, tags, commit history), and the RAG engine must be capable of handling this context.
This task is a "Research of Research" (Meta-Task). Its goal is not immediate feature implementation, but a deep analysis of the problem space, identification of use cases, and the creation of specific technical tasks (Sub-issues) based on findings.
Research Goals
Define key use cases for RAG interaction with versioned code.
Identify technical constraints and risks (performance issues, data duplication).
Formulate a strategy for how raged will "travel" through history (git worktree, shallow clones, diff-based indexing).
Define requirements for the interface (CLI, API) and internal storage structure.
Key Research Areas
1. Use Cases
We need to categorize the queries a user might make:
Diffing: "What is the difference between the implementation of X in branch main vs feature/Y?"
Archeology (History): "Why was this code written this way?" (find the relevant commits and commit messages).
Isolation (Branch Context): Questions that must be answered strictly within the context of a specific version (e.g., for deprecated APIs).
CI/CD Check: Automatic analysis of changes in PRs (diff-only indexing).
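For the CI/CD use case, diff-only indexing could start from the tab-separated output of `git diff --name-status` between the PR base and head. The following is a minimal sketch, not a proposed design: `IndexPlan` and `plan_from_name_status` are hypothetical names, and the sample string stands in for real command output.

```python
from dataclasses import dataclass, field


@dataclass
class IndexPlan:
    """Paths to (re)embed and paths to drop from the index for one PR."""
    reindex: set[str] = field(default_factory=set)
    delete: set[str] = field(default_factory=set)


def plan_from_name_status(output: str) -> IndexPlan:
    """Build an IndexPlan from `git diff --name-status` output.

    Each line is tab-separated: a status letter (followed by a similarity
    score for renames and copies), then one or two paths.
    """
    plan = IndexPlan()
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        kind = status[0]
        if kind == "D":
            # Deleted file: its chunks should leave the index.
            plan.delete.add(paths[0])
        elif kind == "R":
            # Rename: drop the old path, index the new one.
            plan.delete.add(paths[0])
            plan.reindex.add(paths[1])
        elif kind == "C":
            # Copy: the source stays, only the new path needs indexing.
            plan.reindex.add(paths[1])
        else:
            # A (added), M (modified), T (type change): content changed here.
            plan.reindex.add(paths[0])
    return plan


# Hypothetical sample standing in for real `git diff --name-status` output.
sample = "M\tsrc/api.py\nA\tsrc/new.py\nD\tsrc/old.py\nR100\tsrc/a.py\tsrc/b.py\n"
plan = plan_from_name_status(sample)
```

Whether renames should trigger a re-embed or only a metadata update is one of the questions this research would need to settle.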
2. Challenges & Mitigations
Storage Explosion (Duplication): If we simply index every branch separately, the vector database will bloat, since most of the code (on the order of 90%, as a rough estimate) is identical across branches.
Hypothesis: Use content hashing (content-addressable storage) or index only diffs.
Context Confusion: The LLM might mix code from different branches in a single response (version hallucinations).
Hypothesis: Strict metadata filtering in the vector DB.
Index Staleness: The develop branch moves forward, but the index still reflects an older commit.
Hypothesis: Integration with git hooks or incremental indexing mechanisms.
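The first two hypotheses can be illustrated together. The sketch below is a toy in-memory stand-in for a vector store, under these assumptions: chunks are keyed by a SHA-256 of their text, each chunk carries the set of refs that contain it, and a substring match replaces real similarity search. `VersionedIndex`, `Chunk`, and `embed_calls` are hypothetical names, not part of raged.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Branches/tags whose tree contains this exact chunk.
    refs: set[str] = field(default_factory=set)


class VersionedIndex:
    """Toy index: one entry per unique chunk, tagged with the refs that use it."""

    def __init__(self) -> None:
        self.chunks: dict[str, Chunk] = {}
        # Counts how often the (expensive) embedding step would have to run.
        self.embed_calls = 0

    def add(self, ref: str, text: str) -> str:
        """Register `text` for `ref`; embed only if the content is new."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        chunk = self.chunks.get(key)
        if chunk is None:
            self.embed_calls += 1
            chunk = self.chunks[key] = Chunk(text)
        chunk.refs.add(ref)
        return key

    def search(self, ref: str, needle: str) -> list[str]:
        """Strict filter: only chunks tagged with `ref` are candidates."""
        return [
            c.text
            for c in self.chunks.values()
            if ref in c.refs and needle in c.text
        ]


index = VersionedIndex()
shared = "def connect(): ..."
index.add("main", shared)
index.add("feature/Y", shared)  # same content: no second embedding
index.add("main", "def parse_v1(): ...")
index.add("feature/Y", "def parse_v2(): ...")
```

Four `add` calls produce three stored chunks, and a query scoped to `main` never sees `parse_v2`. The sketch does not address staleness: when a branch moves, refs would also have to be removed from chunks that no longer exist at the new tip.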
3. Implementation Strategies
Git Worktree: Should we leverage native worktree support to keep several versions checked out side by side on disk?
Semantic Search Patterns: Are there existing patterns for versioned code among embedding providers?
Git Native Approach: Should we parse .git objects directly to access old file versions without checking out?
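To make the Git-native option concrete, here is a minimal sketch of reading a loose object straight from the object store. It assumes a SHA-1 repository and handles loose objects only; packed objects, which make up most of a real repository's history, are not covered and would call for `git cat-file`, `git show <rev>:<path>`, or a Git library instead. `write_loose_object` is a hypothetical helper that exists only so the example runs without a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def object_id(kind: str, body: bytes) -> str:
    """Git object ID: SHA-1 over '<kind> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def read_loose_object(git_dir: Path, oid: str) -> tuple[str, bytes]:
    """Read <git_dir>/objects/<first 2 hex chars>/<remaining chars>."""
    raw = zlib.decompress((git_dir / "objects" / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\x00")
    kind, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError(f"corrupt object {oid}")
    return kind, body


def write_loose_object(git_dir: Path, kind: str, body: bytes) -> str:
    """Demo helper: store an object using Git's loose-object layout."""
    oid = object_id(kind, body)
    path = git_dir / "objects" / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    path.write_bytes(zlib.compress(header + body))
    return oid


# Round trip against a throwaway directory that mimics a .git folder.
with tempfile.TemporaryDirectory() as tmp:
    fake_git_dir = Path(tmp)
    blob_id = write_loose_object(fake_git_dir, "blob", b"print('old version')\n")
    kind, body = read_loose_object(fake_git_dir, blob_id)
```

Because the object ID is itself a content hash, blob IDs could double as deduplication keys at file granularity, which ties this option back to the content-addressable storage hypothesis.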
Expected Deliverables
Upon completion of this research, a summary should be posted as a comment on this issue, and separate Story/Task issues should be created:
RFC (Request for Comments): A document outlining the chosen architecture for version storage.
Sub-issues: Concrete tasks (e.g., "Add git_commit_hash to metadata schema", "Implement branch filtering in query engine").
Roadmap Decision: A decision on whether this fits into the MVP or should be deferred as future work.
Rationale
Without this preliminary stage, we risk implementing a "naive" solution (simply scanning all worktrees as separate projects), which would lead to index bloat and logical errors in LLM responses.