Implement: Documentation Semantic Search (docstrings, comments, README) #37

@lexasub

Description

Implementation Guide: Documentation Semantic Search

🎯 Goal

Extend semantic search to index docstrings, inline comments, and README files, enabling natural language queries like "find function that validates user input" to match code with relevant documentation.

📋 Current State

What's indexed:

  • Function/method/class names
  • Signatures (parameters, return types)
  • File paths and line numbers
  • Language and node kind

What's NOT indexed:

  • ❌ Docstrings (Python """...""", Java /**...*/)
  • ❌ Inline comments (# comment, // comment)
  • ❌ README.md and other documentation files
  • ❌ Source code bodies

🔧 Implementation Plan

Step 1: Add Docstring Extraction to Parser

File: ast_rag/services/parsing/language_queries.py

Add new query types for docstrings:

PYTHON_QUERIES = {
    # ... existing queries
    "docstring": """
(function_definition
  name: (identifier) @name
  body: (block
    (expression_statement
      (string) @docstring
    )?
  )
) @node
""",
    
(class_definition
  name: (identifier) @name
  body: (block
    (expression_statement
      (string) @docstring
    )?
  )
) @node
""",
}
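As a quick way to sanity-check what these queries should capture, Python's stdlib `ast` module extracts the same docstrings. This is a validation sketch only (sample source and function name are illustrative); the indexer itself would still use the tree-sitter queries above:

```python
import ast

SAMPLE = '''
def validate_user(name):
    """Check that the user name is non-empty."""
    return bool(name)

class Validator:
    """Validates user input."""
'''

def stdlib_docstrings(source: str) -> dict[str, str]:
    """Map function/class names to their docstrings via the stdlib ast module."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_docstring(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and ast.get_docstring(node)
    }

print(stdlib_docstrings(SAMPLE))
# {'validate_user': 'Check that the user name is non-empty.', 'Validator': 'Validates user input.'}
```

Any function/class the tree-sitter `docstring` query misses but `ast.get_docstring` finds indicates a gap in the query.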

JAVA_QUERIES = {
    # ... existing queries
    "method_docstring": """
(method_declaration
  (modifiers)? @modifiers
  (comment) @docstring
) @node
""",
    
"class_docstring": """
(class_declaration
  name: (identifier) @name
  (comment)* @docstring
) @node
""",
}

Note: in tree-sitter-java, Javadoc comments typically parse as siblings *preceding* the declaration rather than as children of it, so these queries may need to anchor on the enclosing class body and pair each `(comment)` with the following declaration instead.

File: ast_rag/services/parsing/parser_manager.py

Modify extract_nodes() to capture docstrings:

def extract_nodes(...) -> list[ASTNode]:
    # ... existing code

    # Extract docstrings
    docstring_query = compiled.get("docstring") or compiled.get("method_docstring")
    if docstring_query:
        for _, md in QueryCursor(docstring_query).matches(tree.root_node):
            # matches() maps each capture name to a *list* of nodes, and the
            # docstring capture is optional ("?"), so it may be absent
            if not md.get("docstring"):
                continue
            ds = md["docstring"][0]
            docstring_text = source[ds.start_byte:ds.end_byte].decode()
            # Clean docstring (remove quotes, asterisks, etc.)
            docstring_clean = self._clean_docstring(docstring_text)

            # Attach to parent node
            parent_node = find_parent_node(md["node"][0])
            if parent_node:
                parent_node.docstring = docstring_clean

Step 2: Update ASTNode Model

File: ast_rag/dto/node.py

Add optional docstring field:

class ASTNode(BaseModel):
    # ... existing fields
    docstring: Optional[str] = None  # New field
    source_text: Optional[str] = None
    
    def to_neo4j_props(self) -> dict[str, Any]:
        props = {
            # ... existing props
        }
        if self.docstring:
            props["docstring"] = self.docstring
        return props

Step 3: Update Embedding Summary

File: ast_rag/services/embedding_manager.py

Modify build_summary() to include docstring:

def build_summary(node: ASTNode) -> str:
    """Build summary including docstring for better semantic search."""
    sig_part = f" | signature: {node.signature}" if node.signature else ""
    docstring_part = ""
    
    if node.docstring:
        # Truncate long docstrings
        docstring = node.docstring.strip()
        if len(docstring) > 500:
            docstring = docstring[:497] + "..."
        docstring_part = f" | docs: {docstring}"
    
    return (
        f"{node.lang.value} {node.kind.value}: {node.qualified_name}"
        f"{sig_part}"
        f"{docstring_part}"
        f" | file: {node.file_path}:{node.start_line}"
    )

Step 4: Add README/Markdown Support

File: ast_rag/services/parsing/parser_manager.py

Add markdown and other documentation extensions to the supported set:

EXT_TO_LANG: dict[str, str] = {
    # ... existing
    ".md": "markdown",
    ".rst": "rst",
    ".txt": "text",
}

File: ast_rag/dto/enums.py

Add new language enum:

class Language(str, Enum):
    # ... existing
    MARKDOWN = "markdown"
    RST = "rst"
    TEXT = "text"

File: ast_rag/services/parsing/language_queries.py

Add markdown "parsing" (just extract text):

def extract_markdown_sections(file_path: str, source: bytes) -> list[ASTNode]:
    """Extract sections from markdown files as pseudo-nodes."""
    content = source.decode("utf-8")
    nodes = []

    # Split on header lines; the capture group keeps each header line in
    # the result, so odd indices are headers and even indices are bodies
    parts = re.split(r'^(#+\s+.+)$', content, flags=re.MULTILINE)

    current_header = "README"
    offset = 0  # running character offset, used to derive line numbers

    for i, part in enumerate(parts):
        if i % 2 == 1:
            # This is a header line
            current_header = part.lstrip('#').strip()
        elif part.strip():
            # This is section content: create a pseudo-node for it
            node = ASTNode(
                kind=NodeKind.BLOCK,  # or a new kind, e.g. DOCUMENTATION
                name=current_header,
                qualified_name=f"{Path(file_path).stem}.{current_header}",
                lang=Language.MARKDOWN,
                file_path=file_path,
                start_line=content.count('\n', 0, offset) + 1,
                source_text=part.strip(),
            )
            nodes.append(node)
        offset += len(part)

    return nodes
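The header-splitting idea can be exercised in isolation. This self-contained sketch (no ast_rag imports; `split_markdown` is a hypothetical name) pairs each header with the body that follows it:

```python
import re

def split_markdown(content: str) -> list[tuple[str, str]]:
    """Pair each markdown header with the body text that follows it."""
    # The capture group keeps header lines in the split result,
    # so odd indices are headers and even indices are bodies
    parts = re.split(r'^(#+\s+.+)$', content, flags=re.MULTILINE)
    sections, header = [], "README"
    for i, part in enumerate(parts):
        if i % 2 == 1:
            header = part.lstrip('#').strip()
        elif part.strip():
            sections.append((header, part.strip()))
    return sections

sample = "# Install\npip install ast-rag\n\n## Usage\nRun the indexer."
print(split_markdown(sample))
# [('Install', 'pip install ast-rag'), ('Usage', 'Run the indexer.')]
```

Using index parity rather than `part.startswith('#')` avoids misclassifying body lines that happen to begin with `#`.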

Step 5: Update Embedding Config

File: ast_rag/services/embedding_manager.py

Add documentation sections to the embeddable kinds:

# Update EMBEDDABLE_KINDS so documentation pseudo-nodes get embedded
EMBEDDABLE_KINDS = frozenset([
    NodeKind.CLASS, NodeKind.INTERFACE, NodeKind.STRUCT, NodeKind.ENUM,
    NodeKind.TRAIT, NodeKind.FUNCTION, NodeKind.METHOD,
    NodeKind.CONSTRUCTOR, NodeKind.DESTRUCTOR,
    NodeKind.BLOCK,  # For documentation sections
])

Step 6: Neo4j Schema Updates

File: ast_rag/schema/graph_schema.cql

Add docstring index:

// Full-text index for docstrings (Neo4j 5 syntax; on Neo4j 4.x use the
// legacy db.index.fulltext.createNodeIndex procedure instead)
CREATE FULLTEXT INDEX ast_docstring_fulltext IF NOT EXISTS
FOR (n:Function|Method|Class|Interface|Module)
ON EACH [n.docstring]
OPTIONS {indexConfig: {`fulltext.analyzer`: 'english'}};
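Once the index exists, docstring keyword search is a single procedure call. A sketch (assumes nodes carry a `qualified_name` property, as elsewhere in this plan):

```cypher
CALL db.index.fulltext.queryNodes('ast_docstring_fulltext', 'validate user input')
YIELD node, score
RETURN node.qualified_name, node.docstring, score
ORDER BY score DESC
LIMIT 10;
```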

Step 7: Update Hybrid Search

File: ast_rag/services/embedding_manager.py

Update hybrid_search() to also search docstrings:

def hybrid_search(
    self,
    query: str,
    limit: int = 10,
    search_docstrings: bool = True,  # New parameter
    # ...
) -> list[SearchResult]:
    # ... existing vector search
    
    # Keyword search in docstrings
    if search_docstrings:
        docstring_results = self._search_docstrings(query, limit * 2)
        # Merge with existing results
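The merge step is left open above; one common choice is reciprocal rank fusion (RRF) over the two ranked lists. A self-contained sketch operating on node-id lists (`rrf_merge` and the id-list shape are assumptions, not the existing API):

```python
def rrf_merge(vector_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked result lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, node_id in enumerate(ranking):
            # 1/(k + rank) damping: items ranked high in either list win
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["a", "b", "c"], ["b", "d"]))
# ['b', 'a', 'd', 'c']
```

RRF needs no score normalization between the vector and full-text backends, which is why it is a reasonable default before tuning.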

📊 Expected Impact

Benefits:

  • Better semantic search accuracy (more context)
  • Find code by what it does, not just name
  • Documentation becomes searchable
  • README examples are discoverable

Trade-offs:

  • Larger embeddings (more text)
  • Slightly slower indexing
  • More noise in search results (need tuning)

🧪 Testing

def test_docstring_indexing():
    # Index a file with docstrings
    nodes = parse_file("test.py")
    assert any(n.docstring for n in nodes)

    # Build embeddings
    embed_manager.build_embeddings(nodes)

    # Search by docstring content (guard against nodes without docstrings)
    results = embed_manager.hybrid_search("validate user input")
    assert any(r.node.docstring and "validate" in r.node.docstring for r in results)

📁 Files to Modify

  1. ast_rag/services/parsing/language_queries.py - Add docstring queries
  2. ast_rag/services/parsing/parser_manager.py - Extract docstrings, add markdown
  3. ast_rag/dto/node.py - Add docstring field
  4. ast_rag/dto/enums.py - Add MARKDOWN language
  5. ast_rag/services/embedding_manager.py - Include docstring in embeddings
  6. ast_rag/schema/graph_schema.cql - Add docstring index
  7. ast_rag_config.json - Add .md to extensions

⏱️ Estimated Time

  • 4-6 hours for implementation
  • 1-2 hours for testing
  • 1 hour for documentation

Labels: enhancement, semantic-search, documentation
Priority: Medium
Implementation Time: 6-9 hours
