Implementation Guide: Documentation Semantic Search
🎯 Goal
Extend semantic search to index docstrings, inline comments, and README files, enabling natural language queries like "find function that validates user input" to match code with relevant documentation.
📋 Current State
What's indexed:
- Function/method/class names
- Signatures (parameters, return types)
- File paths and line numbers
- Language and node kind
What's NOT indexed:
- ❌ Docstrings (Python `"""..."""`, Java `/** ... */`)
- ❌ Inline comments (`# comment`, `// comment`)
- ❌ README.md and other documentation files
- ❌ Source code bodies
🔧 Implementation Plan
Step 1: Add Docstring Extraction to Parser
File: ast_rag/services/parsing/language_queries.py
Add new query types for docstrings:
```python
PYTHON_QUERIES = {
    # ... existing queries
    "docstring": """
    (function_definition
      name: (identifier) @name
      body: (block
        (expression_statement
          (string) @docstring
        )?
      )
    ) @node

    (class_definition
      name: (identifier) @name
      body: (block
        (expression_statement
          (string) @docstring
        )?
      )
    ) @node
    """,
}
```
```python
JAVA_QUERIES = {
    # ... existing queries
    "method_docstring": """
    (method_declaration
      (modifiers)? @modifiers
      (comment) @docstring
    ) @node
    """,
    "class_docstring": """
    (class_declaration
      name: (identifier) @name
      (comment)* @docstring
    ) @node
    """,
}
```
File: ast_rag/services/parsing/parser_manager.py
Modify extract_nodes() to capture docstrings:
```python
def extract_nodes(...) -> list[ASTNode]:
    # ... existing code

    # Extract docstrings
    docstring_query = compiled.get("docstring") or compiled.get("method_docstring")
    if docstring_query:
        for _, md in QueryCursor(docstring_query).matches(tree.root_node):
            docstring_text = source[md["docstring"].start_byte:md["docstring"].end_byte].decode()
            # Clean docstring (remove quotes, asterisks, etc.)
            docstring_clean = self._clean_docstring(docstring_text)
            # Attach to parent node
            parent_node = find_parent_node(md["node"])
            if parent_node:
                parent_node.docstring = docstring_clean
```
Step 2: Update ASTNode Model
File: ast_rag/dto/node.py
Add optional docstring field:
```python
class ASTNode(BaseModel):
    # ... existing fields
    docstring: Optional[str] = None  # New field
    source_text: Optional[str] = None

    def to_neo4j_props(self) -> dict[str, Any]:
        props = {
            # ... existing props
        }
        if self.docstring:
            props["docstring"] = self.docstring
        return props
```
Step 3: Update Embedding Summary
File: ast_rag/services/embedding_manager.py
Modify build_summary() to include docstring:
```python
def build_summary(node: ASTNode) -> str:
    """Build summary including docstring for better semantic search."""
    sig_part = f" | signature: {node.signature}" if node.signature else ""
    docstring_part = ""
    if node.docstring:
        # Truncate long docstrings
        docstring = node.docstring.strip()
        if len(docstring) > 500:
            docstring = docstring[:497] + "..."
        docstring_part = f" | docs: {docstring}"
    return (
        f"{node.lang.value} {node.kind.value}: {node.qualified_name}"
        f"{sig_part}"
        f"{docstring_part}"
        f" | file: {node.file_path}:{node.start_line}"
    )
```
Step 4: Add README/Markdown Support
File: ast_rag/services/parsing/parser_manager.py
Add markdown to supported extensions:
```python
EXT_TO_LANG: dict[str, str] = {
    # ... existing
    ".md": "markdown",
    ".rst": "rst",
    ".txt": "text",
}
```
File: ast_rag/dto/enums.py
Add new language enum:
```python
class Language(str, Enum):
    # ... existing
    MARKDOWN = "markdown"
    RST = "rst"
    TEXT = "text"
```
File: ast_rag/services/parsing/language_queries.py
Add markdown "parsing" (just extract text):
```python
import re
from pathlib import Path

def extract_markdown_sections(file_path: str, source: bytes) -> list[ASTNode]:
    """Extract sections from markdown files as pseudo-nodes, one per header."""
    content = source.decode("utf-8")
    nodes: list[ASTNode] = []
    current_header = "README"
    start_line = 1
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            nodes.append(ASTNode(
                kind=NodeKind.BLOCK,  # or a new kind DOCUMENTATION
                name=current_header,
                qualified_name=f"{Path(file_path).stem}.{current_header}",
                lang=Language.MARKDOWN,
                file_path=file_path,
                start_line=start_line,
                source_text=text,
            ))

    # Walk line by line so each section gets an accurate start_line;
    # re.split on the header pattern would interleave hash marks, titles,
    # and bodies, and loses line-number information.
    for line_num, line in enumerate(content.splitlines(), start=1):
        match = re.match(r"^(#+)\s+(.+)$", line)
        if match:
            flush()  # close out the previous section
            current_header = match.group(2).strip()
            start_line = line_num
            buffer = []
        else:
            buffer.append(line)
    flush()  # trailing section
    return nodes
```
Step 5: Update Embedding Config
File: ast_rag/dto/config.py
Add documentation to embeddable kinds:
```python
# In embedding_manager.py, update EMBEDDABLE_KINDS
EMBEDDABLE_KINDS = frozenset([
    NodeKind.CLASS, NodeKind.INTERFACE, NodeKind.STRUCT, NodeKind.ENUM,
    NodeKind.TRAIT, NodeKind.FUNCTION, NodeKind.METHOD,
    NodeKind.CONSTRUCTOR, NodeKind.DESTRUCTOR,
    NodeKind.BLOCK,  # For documentation sections
])
```
Step 6: Neo4j Schema Updates
File: ast_rag/schema/graph_schema.cql
Add docstring index:
```cypher
// Full-text index for docstrings (Neo4j 4.x procedure syntax)
CALL db.index.fulltext.createNodeIndex('ast_docstring_fulltext',
  ['Function', 'Method', 'Class', 'Interface', 'Module'],
  ['docstring'],
  {analyzer: 'english'}
);

// Neo4j 5+ replaces this procedure with DDL:
// CREATE FULLTEXT INDEX ast_docstring_fulltext
// FOR (n:Function|Method|Class|Interface|Module) ON EACH [n.docstring]
// OPTIONS {indexConfig: {`fulltext.analyzer`: 'english'}};
```
Step 7: Update Hybrid Search
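The `_search_docstrings` keyword lookup in this step would presumably run the standard `queryNodes` procedure against the Step 6 index, along these lines (the returned property names assume the node schema above):

```cypher
CALL db.index.fulltext.queryNodes('ast_docstring_fulltext', 'validate user input')
YIELD node, score
RETURN node.qualified_name, node.file_path, score
ORDER BY score DESC
LIMIT 10;
```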
File: ast_rag/services/embedding_manager.py
Update hybrid_search() to also search docstrings:
```python
def hybrid_search(
    self,
    query: str,
    limit: int = 10,
    search_docstrings: bool = True,  # New parameter
    # ...
) -> list[SearchResult]:
    # ... existing vector search

    # Keyword search in docstrings
    if search_docstrings:
        docstring_results = self._search_docstrings(query, limit * 2)
        # Merge with existing results
```
📊 Expected Impact
Benefits:
- Better semantic search accuracy (more context)
- Find code by what it does, not just name
- Documentation becomes searchable
- README examples are discoverable
Trade-offs:
- Larger embeddings (more text)
- Slightly slower indexing
- More noise in search results (need tuning)
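The "need tuning" point is largely about how the two ranked lists from Step 7 get merged. The plan leaves the merge unspecified; reciprocal rank fusion is a common, parameter-light choice. A sketch over bare node IDs (the real code would merge `SearchResult` objects keyed by node identity):

```python
def rrf_merge(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists with reciprocal rank fusion.

    Each result scores 1 / (k + rank) per list it appears in, so items
    found by both vector and keyword search rise to the top. k=60 is the
    conventional default from the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["a", "b", "c"], ["b", "d"]))
# "b" appears in both lists, so it ranks first
```

RRF avoids having to normalize cosine similarities against full-text scores, which is the usual source of noise when naively summing the two.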
🧪 Testing
```python
def test_docstring_indexing():
    # Index a file with docstrings
    nodes = parse_file("test.py")
    assert nodes[0].docstring is not None

    # Build embeddings
    embed_manager.build_embeddings(nodes)

    # Search by docstring content (guard against nodes with no docstring)
    results = embed_manager.hybrid_search("validate user input")
    assert any("validate" in (r.node.docstring or "") for r in results)
```
📁 Files to Modify
- `ast_rag/services/parsing/language_queries.py` - Add docstring queries
- `ast_rag/services/parsing/parser_manager.py` - Extract docstrings, add markdown
- `ast_rag/dto/node.py` - Add docstring field
- `ast_rag/dto/enums.py` - Add MARKDOWN language
- `ast_rag/services/embedding_manager.py` - Include docstring in embeddings
- `ast_rag/schema/graph_schema.cql` - Add docstring index
- `ast_rag_config.json` - Add .md to extensions
⏱️ Estimated Time
- 4-6 hours for implementation
- 1-2 hours for testing
- 1 hour for documentation
Labels: enhancement, semantic-search, documentation
Priority: Medium
Implementation Time: 6-9 hours total