BRIC-7: Add BM25 full-text search with pg_textsearch and Alembic migrations #7
Merged
- Add BM25 search using PostgreSQL pg_textsearch extension
- Implement Reciprocal Rank Fusion (RRF) for hybrid search
- Add hybrid+ query mode combining BM25 + vector search in parallel
- Add bm25-only query mode for full-text search
- Implement PostgresBM25Adapter with connection pool management
- Auto-indexing via database triggers
- Add comprehensive unit tests for BM25 and RRF
- Add database migration for pg_textsearch support
- Update Dockerfile to install pg_textsearch extension
- Fix critical issues: connection pool leak, error handling, edge cases

Closes BRIC-7
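For reviewers unfamiliar with RRF, here is a minimal sketch of how two ranked lists can be fused. It is not the PR's combiner: the function name `rrf_combine`, the result keys, and the default `k=60` are assumptions for illustration (the PR exposes the constant as `BM25_RRF_K`). Each chunk scores the sum of `1 / (k + rank)` over the lists it appears in.

```python
from collections import defaultdict


def rrf_combine(bm25_ids, vector_ids, k=60):
    """Fuse two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Hypothetical helper: names and the k=60 default are illustrative only.
    """
    scores = defaultdict(float)
    ranks = {}
    sources = (("bm25_rank", bm25_ids), ("vector_rank", vector_ids))
    for source, ids in sources:
        for rank, chunk_id in enumerate(ids, start=1):
            # Each list contributes 1 / (k + rank) for the chunks it contains.
            scores[chunk_id] += 1.0 / (k + rank)
            entry = ranks.setdefault(
                chunk_id, {"bm25_rank": None, "vector_rank": None}
            )
            entry[source] = rank
    # Highest fused score first; ties are broken by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [
        {"chunk_id": cid, "rrf_score": scores[cid], **ranks[cid]}
        for cid in ordered
    ]
```

A chunk found by only one engine keeps `None` for the other engine's rank, so it still appears in the fused list but with a lower score than a chunk ranked by both.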
- Remove redundant hasattr checks in query_use_case
- Simplify error handling in pg_textsearch_adapter (merge duplicate except blocks)
- Remove docling TXT patch from Dockerfile (no longer needed)
- Move BM25 migration SQL to migrations/ directory
- Add alembic>=1.13.0 dependency
- Create src/alembic.ini configuration
- Create src/alembic/env.py for async migrations (no models, raw SQL)
- Create src/alembic/versions/001_add_bm25_support.py migration
- Add lifespan to main.py for all transport modes (stdio/sse/streamable)
- Run migrations at startup via asyncio.to_thread()
- Close BM25 adapter pool on shutdown
- Add close() method to BM25EnginePort interface
- Remove migrations/001_add_bm25_support.sql (converted to Alembic)
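The startup and shutdown flow described in this commit can be sketched as below. This is an illustration under assumptions, not the code in main.py: `make_lifespan`, `run_migrations`, and `bm25_adapter` are hypothetical names, and the migration callable stands in for whatever invokes Alembic. Note that a later commit in this PR moves the migration to run synchronously before uvicorn starts.

```python
import asyncio
from contextlib import asynccontextmanager


def make_lifespan(run_migrations, bm25_adapter):
    """Build a lifespan context manager (hypothetical factory).

    run_migrations: blocking callable, assumed to apply pending migrations.
    bm25_adapter: object with an async close() method.
    """

    @asynccontextmanager
    async def lifespan(app):
        # Run the blocking migration in a worker thread so the event loop
        # stays responsive. Any exception propagates and aborts startup.
        await asyncio.to_thread(run_migrations)
        try:
            yield
        finally:
            # Release the BM25 connection pool on shutdown.
            await bm25_adapter.close()

    return lifespan
```

Because the migration call sits outside the `try` block, a failed migration stops the server before it begins serving and the adapter is never asked to close a pool that was not needed.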
…(BRIC-7)

- Add test_lifespan.py: lifespan startup/shutdown, migration errors, BM25 pool close
- Add test_alembic_config.py: config files, URL conversion, migration validation
- Add hybrid+, bm25-only, and bm25-unavailable tests to test_query_use_case.py
- Add close() tests to test_pg_textsearch_adapter.py
Critical:
- C1: Lifespan now raises on migration failure instead of silently continuing
- C2: Add asyncio.Lock to prevent race condition in BM25 pool initialization

High:
- H2: Handle asyncio.gather exceptions in hybrid+ mode with return_exceptions=True
- H4: Add WARNING comment about backfill on large tables in migration

Suggestions:
- S5: Add ge=1 validation on BM25_RRF_K config field
- S7: Add close() method assertion in BM25EnginePort test
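The C2 and H2 fixes follow two common asyncio patterns, sketched here with hypothetical names (`LazyPool`, `search_both`); the adapter's real structure is not shown in this PR description. The lock uses a double check so that concurrent first callers create only one pool, and `return_exceptions=True` lets one failing engine degrade to an empty result instead of failing the whole query.

```python
import asyncio


class LazyPool:
    """Create a pool once, even under concurrent first use (illustrative)."""

    def __init__(self, create_pool):
        self._create_pool = create_pool  # async factory supplied by caller
        self._pool = None
        self._lock = asyncio.Lock()

    async def get(self):
        if self._pool is None:
            async with self._lock:
                # Re-check: another task may have created the pool while
                # this one was waiting for the lock.
                if self._pool is None:
                    self._pool = await self._create_pool()
        return self._pool


async def search_both(bm25_search, vector_search, query):
    """Run both searches in parallel; a failed side becomes an empty list."""
    bm25, vector = await asyncio.gather(
        bm25_search(query), vector_search(query), return_exceptions=True
    )
    if isinstance(bm25, BaseException):
        bm25 = []
    if isinstance(vector, BaseException):
        vector = []
    return bm25, vector
```

A real implementation would probably log the swallowed exception before substituting the empty list, so that a silently degraded hybrid+ query remains diagnosable.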
- Reduce verbose docstrings in adapter, use cleaner type annotations
- Simplify dependencies.py (remove redundant comments)
- Streamline query_use_case.py (reduce nested conditionals)
- Simplify main.py lifespan (remove redundant comments)
- Clean up env.py and config.py
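The async_engine_from_config call in Alembic's env.py (a SQLAlchemy function) requires a postgresql+asyncpg:// URL. Previously, get_url() stripped +asyncpg, so SQLAlchemy fell back to the default psycopg2 driver and failed with a 'No module named psycopg2' error.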
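A minimal sketch of the URL handling this fix implies, assuming the goal is to always hand the async engine an asyncpg URL. The helper name `to_asyncpg_url` is hypothetical and the body is not taken from the PR's get_url().

```python
def to_asyncpg_url(url: str) -> str:
    """Return the URL with the postgresql+asyncpg:// scheme (illustrative).

    Accepts postgresql://, postgres://, or postgresql+<driver>:// inputs.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    # Drop any existing driver suffix such as +psycopg2 or +asyncpg.
    dialect = scheme.split("+", 1)[0]
    if dialect not in ("postgresql", "postgres"):
        raise ValueError(f"unsupported dialect: {dialect!r}")
    return "postgresql+asyncpg://" + rest
```

Normalizing to the async driver, rather than stripping the suffix, keeps a URL that is already correct unchanged.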
The chunks table doesn't exist at startup — LightRAG uses lightrag_doc_chunks. BM25 adapter needs its own chunks table with tsvector column. Migration now creates the full table with IF NOT EXISTS instead of ALTER.
Shared DB with composable-agents caused 'Can't locate revision 002' error. Use raganything_alembic_version table to isolate migration histories. Also fix: create chunks table in migration (table doesn't exist at startup).
- Add CREATE EXTENSION IF NOT EXISTS pg_textsearch to migration
- Add BM25 index directly (no longer conditional) since extension is guaranteed
- Add shared_preload_libraries=pg_textsearch to bricks-db in docker-compose
- Drop pg_textsearch extension in downgrade
- Add hybrid+ and bm25 query modes to documentation
- Add BM25 configuration section
- Add Database Migrations section with Alembic details
- Document hybrid+ response format with RRF scoring
- Add BM25 env variables to .env.example
- Update project structure with new files
…nk_id

- RRF combiner now matches by chunk_id (database hash ID) instead of reference_id (per-file sequential number), fixing bm25_rank always being null in hybrid+ results
- BM25 adapter queries lightrag_doc_chunks directly (no separate chunks table)
- BM25 SQL uses to_bm25query(query, index_name) with GIN pre-filter for correctness
- Added _make_workspace() to BM25 adapter matching LightRAGAdapter's workspace mapping
- Alembic migration runs synchronously before uvicorn (fixes event loop deadlock)
- Logging visible in Docker via custom LOG_CONFIG dict passed to both dictConfig and uvicorn
- Empty folder indexing returns SUCCESS with 'No files found' instead of FAILED
- file_extensions empty string coerced to None via BeforeValidator
- HybridSearchResult now includes optional reference_id field from vector results
- ChunkResponse.reference_id is now Optional (null for BM25-only results)
- All 116 tests passing
- BM25 adapter now accepts text_config parameter (default: english, env: BM25_TEXT_CONFIG)
- Creates text-config-specific BM25 index (e.g. idx_lightrag_chunks_bm25_french)
- Auto-rebuilds content_tsv and trigger function when text_config changes
- Removed GIN tsvector pre-filter (was too strict with AND matching for multi-word queries)
- BM25 ranking via to_bm25query handles relevance scoring directly
- Updated .env.raganything-api to BM25_TEXT_CONFIG=french
- All 121 tests passing
…k response

These are internal ranking details not useful for API consumers.
Jira
BRIC-7
Changes
Tests
DO NOT MERGE