Implement: Documentation Semantic Search (docstrings, comments, README) #37

@lexasub

Description

Implementation Guide: Documentation Semantic Search

🎯 Goal

Extend semantic search to index docstrings, inline comments, and README files, enabling natural language queries like "find function that validates user input" to match code with relevant documentation.

📋 Current State

What's indexed:

  • Function/method/class names
  • Signatures (parameters, return types)
  • File paths and line numbers
  • Language and node kind

What's NOT indexed:

  • ❌ Docstrings (Python """...""", Java /**...*/)
  • ❌ Inline comments (# comment, // comment)
  • ❌ README.md and other documentation files
  • ❌ Source code bodies

🔧 Implementation Plan

Step 1: Add Docstring Extraction to Parser

File: ast_rag/services/parsing/language_queries.py

Add new query types for docstrings:

PYTHON_QUERIES = {
    # ... existing queries
    "docstring": """
(function_definition
  name: (identifier) @name
  body: (block
    (expression_statement
      (string) @docstring
    )?
  )
) @node
""",
    
(class_definition
  name: (identifier) @name
  body: (block
    (expression_statement
      (string) @docstring
    )?
  )
) @node
""",
}
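As a quick way to sanity-check what these queries should capture, Python's stdlib `ast` module extracts the same docstrings. This is a validation sketch only (sample source and function name are illustrative); the indexer itself would still use the tree-sitter queries above:

```python
import ast

SAMPLE = '''
def validate_user(name):
    """Check that the user name is non-empty."""
    return bool(name)

class Validator:
    """Validates user input."""
'''

def stdlib_docstrings(source: str) -> dict[str, str]:
    """Map function/class names to their docstrings via the stdlib ast module."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_docstring(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and ast.get_docstring(node)
    }

print(stdlib_docstrings(SAMPLE))
# {'validate_user': 'Check that the user name is non-empty.', 'Validator': 'Validates user input.'}
```

Any function/class the tree-sitter `docstring` query misses but `ast.get_docstring` finds indicates a gap in the query.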

JAVA_QUERIES = {
    # ... existing queries
    "method_docstring": """
(method_declaration
  (modifiers)? @modifiers
  (comment) @docstring
) @node
""",
    
"class_docstring": """
(class_declaration
  name: (identifier) @name
  (comment)* @docstring
) @node
""",
}

Note: in tree-sitter-java, Javadoc comments typically parse as siblings *preceding* the declaration rather than as children of it, so these queries may need to anchor on the enclosing class body and pair each `(comment)` with the following declaration instead.

File: ast_rag/services/parsing/parser_manager.py

Modify extract_nodes() to capture docstrings:

def extract_nodes(...) -> list[ASTNode]:
    # ... existing code

    # Extract docstrings
    docstring_query = compiled.get("docstring") or compiled.get("method_docstring")
    if docstring_query:
        for _, md in QueryCursor(docstring_query).matches(tree.root_node):
            # matches() maps each capture name to a *list* of nodes, and the
            # docstring capture is optional ("?"), so it may be absent
            if not md.get("docstring"):
                continue
            ds = md["docstring"][0]
            docstring_text = source[ds.start_byte:ds.end_byte].decode()
            # Clean docstring (remove quotes, asterisks, etc.)
            docstring_clean = self._clean_docstring(docstring_text)

            # Attach to parent node
            parent_node = find_parent_node(md["node"][0])
            if parent_node:
                parent_node.docstring = docstring_clean

Step 2: Update ASTNode Model

File: ast_rag/dto/node.py

Add optional docstring field:

class ASTNode(BaseModel):
    # ... existing fields
    docstring: Optional[str] = None  # New field
    source_text: Optional[str] = None
    
    def to_neo4j_props(self) -> dict[str, Any]:
        props = {
            # ... existing props
        }
        if self.docstring:
            props["docstring"] = self.docstring
        return props

Step 3: Update Embedding Summary

File: ast_rag/services/embedding_manager.py

Modify build_summary() to include docstring:

def build_summary(node: ASTNode) -> str:
    """Build summary including docstring for better semantic search."""
    sig_part = f" | signature: {node.signature}" if node.signature else ""
    docstring_part = ""
    
    if node.docstring:
        # Truncate long docstrings
        docstring = node.docstring.strip()
        if len(docstring) > 500:
            docstring = docstring[:497] + "..."
        docstring_part = f" | docs: {docstring}"
    
    return (
        f"{node.lang.value} {node.kind.value}: {node.qualified_name}"
        f"{sig_part}"
        f"{docstring_part}"
        f" | file: {node.file_path}:{node.start_line}"
    )

Step 4: Add README/Markdown Support

File: ast_rag/services/parsing/parser_manager.py

Add markdown and other documentation extensions to the supported set:

EXT_TO_LANG: dict[str, str] = {
    # ... existing
    ".md": "markdown",
    ".rst": "rst",
    ".txt": "text",
}

File: ast_rag/dto/enums.py

Add new language enum:

class Language(str, Enum):
    # ... existing
    MARKDOWN = "markdown"
    RST = "rst"
    TEXT = "text"

File: ast_rag/services/parsing/language_queries.py

Add markdown "parsing" (just extract text):

def extract_markdown_sections(file_path: str, source: bytes) -> list[ASTNode]:
    """Extract sections from markdown files as pseudo-nodes."""
    content = source.decode("utf-8")
    nodes = []

    # Split on header lines; the capture group keeps each header line in
    # the result, so odd indices are headers and even indices are bodies
    parts = re.split(r'^(#+\s+.+)$', content, flags=re.MULTILINE)

    current_header = "README"
    offset = 0  # running character offset, used to derive line numbers

    for i, part in enumerate(parts):
        if i % 2 == 1:
            # This is a header line
            current_header = part.lstrip('#').strip()
        elif part.strip():
            # This is section content: create a pseudo-node for it
            node = ASTNode(
                kind=NodeKind.BLOCK,  # or a new kind, e.g. DOCUMENTATION
                name=current_header,
                qualified_name=f"{Path(file_path).stem}.{current_header}",
                lang=Language.MARKDOWN,
                file_path=file_path,
                start_line=content.count('\n', 0, offset) + 1,
                source_text=part.strip(),
            )
            nodes.append(node)
        offset += len(part)

    return nodes
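The header-splitting idea can be exercised in isolation. This self-contained sketch (no ast_rag imports; `split_markdown` is a hypothetical name) pairs each header with the body that follows it:

```python
import re

def split_markdown(content: str) -> list[tuple[str, str]]:
    """Pair each markdown header with the body text that follows it."""
    # The capture group keeps header lines in the split result,
    # so odd indices are headers and even indices are bodies
    parts = re.split(r'^(#+\s+.+)$', content, flags=re.MULTILINE)
    sections, header = [], "README"
    for i, part in enumerate(parts):
        if i % 2 == 1:
            header = part.lstrip('#').strip()
        elif part.strip():
            sections.append((header, part.strip()))
    return sections

sample = "# Install\npip install ast-rag\n\n## Usage\nRun the indexer."
print(split_markdown(sample))
# [('Install', 'pip install ast-rag'), ('Usage', 'Run the indexer.')]
```

Using index parity rather than `part.startswith('#')` avoids misclassifying body lines that happen to begin with `#`.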

Step 5: Update Embedding Config

File: ast_rag/services/embedding_manager.py

Add documentation sections to the embeddable kinds:

# Update EMBEDDABLE_KINDS so documentation pseudo-nodes get embedded
EMBEDDABLE_KINDS = frozenset([
    NodeKind.CLASS, NodeKind.INTERFACE, NodeKind.STRUCT, NodeKind.ENUM,
    NodeKind.TRAIT, NodeKind.FUNCTION, NodeKind.METHOD,
    NodeKind.CONSTRUCTOR, NodeKind.DESTRUCTOR,
    NodeKind.BLOCK,  # For documentation sections
])

Step 6: Neo4j Schema Updates

File: ast_rag/schema/graph_schema.cql

Add docstring index:

// Full-text index for docstrings (Neo4j 5 syntax; on Neo4j 4.x use the
// legacy db.index.fulltext.createNodeIndex procedure instead)
CREATE FULLTEXT INDEX ast_docstring_fulltext IF NOT EXISTS
FOR (n:Function|Method|Class|Interface|Module)
ON EACH [n.docstring]
OPTIONS {indexConfig: {`fulltext.analyzer`: 'english'}};
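Once the index exists, docstring keyword search is a single procedure call. A sketch (assumes nodes carry a `qualified_name` property, as elsewhere in this plan):

```cypher
CALL db.index.fulltext.queryNodes('ast_docstring_fulltext', 'validate user input')
YIELD node, score
RETURN node.qualified_name, node.docstring, score
ORDER BY score DESC
LIMIT 10;
```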

Step 7: Update Hybrid Search

File: ast_rag/services/embedding_manager.py

Update hybrid_search() to also search docstrings:

def hybrid_search(
    self,
    query: str,
    limit: int = 10,
    search_docstrings: bool = True,  # New parameter
    # ...
) -> list[SearchResult]:
    # ... existing vector search
    
    # Keyword search in docstrings
    if search_docstrings:
        docstring_results = self._search_docstrings(query, limit * 2)
        # Merge with existing results
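The merge step is left open above; one common choice is reciprocal rank fusion (RRF) over the two ranked lists. A self-contained sketch operating on node-id lists (`rrf_merge` and the id-list shape are assumptions, not the existing API):

```python
def rrf_merge(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked result lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, node_id in enumerate(ranking):
            # 1/(k + rank) damping: items ranked high in either list win
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["a", "b", "c"], ["b", "d"]))
# ['b', 'a', 'd', 'c']
```

RRF needs no score normalization between the vector and full-text backends, which is why it is a reasonable default before tuning.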

📊 Expected Impact

Benefits:

  • Better semantic search accuracy (more context)
  • Find code by what it does, not just name
  • Documentation becomes searchable
  • README examples are discoverable

Trade-offs:

  • Larger embeddings (more text)
  • Slightly slower indexing
  • More noise in search results (need tuning)

🧪 Testing

def test_docstring_indexing():
    # Index a file with docstrings
    nodes = parse_file("test.py")
    assert any(n.docstring for n in nodes)

    # Build embeddings
    embed_manager.build_embeddings(nodes)

    # Search by docstring content (guard against nodes without docstrings)
    results = embed_manager.hybrid_search("validate user input")
    assert any(r.node.docstring and "validate" in r.node.docstring for r in results)

📁 Files to Modify

  1. ast_rag/services/parsing/language_queries.py - Add docstring queries
  2. ast_rag/services/parsing/parser_manager.py - Extract docstrings, add markdown
  3. ast_rag/dto/node.py - Add docstring field
  4. ast_rag/dto/enums.py - Add MARKDOWN language
  5. ast_rag/services/embedding_manager.py - Include docstring in embeddings
  6. ast_rag/schema/graph_schema.cql - Add docstring index
  7. ast_rag_config.json - Add .md to extensions

⏱️ Estimated Time

  • 4-6 hours for implementation
  • 1-2 hours for testing
  • 1 hour for documentation

Labels: enhancement, semantic-search, documentation
Priority: Medium
Implementation Time: 6-9 hours
