BRIC-7: Add BM25 full-text search with pg_textsearch and Alembic migrations by Kaiohz · Pull Request #7 · SoluDevTech/mcp-raganything

Kaiohz · 2026-04-08T06:19:06Z

Jira

Changes

Add BM25 full-text search via PostgreSQL pg_textsearch extension
Add hybrid+ query mode: parallel BM25 + vector search with RRF
Add bm25 query mode: BM25-only search
Add Alembic database migrations running at FastAPI lifespan startup
Add PostgresBM25Adapter with async connection pool (double-checked locking)
Add RRFCombiner for combining BM25 and vector search results
Add chunks table migration with tsvector, GIN index, BM25 index, trigger
Add shared_preload_libraries=pg_textsearch to bricks-db docker-compose
Use separate raganything_alembic_version table for migration isolation
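
The RRF step listed above can be illustrated with a minimal sketch. This is not the PR's RRFCombiner; the function name `rrf_combine` and the constant `k=60` (a commonly used default) are assumptions for illustration. Each chunk's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in.

```python
from collections import defaultdict


def rrf_combine(bm25_ids, vector_ids, k=60):
    """Fuse two best-first lists of chunk IDs with Reciprocal Rank Fusion.

    `k` is an assumed smoothing constant; the PR exposes it as BM25_RRF_K.
    """
    scores = defaultdict(float)
    for ranked in (bm25_ids, vector_ids):
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_combine(["a", "b", "c"], ["b", "d"])
print([chunk_id for chunk_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

A chunk found by both searches ("b" here) outranks chunks found by only one, even when it is not the top hit of either list.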

Tests

109 unit tests passed
Linter: ruff check + format passed
Code review: 2 critical + 4 high issues fixed
SonarQube: 0 new issues
Trivy: 0 CRITICAL/HIGH vulnerabilities
Docker QA: pg_textsearch installed, BM25 index created, service healthy

DO NOT MERGE

- Add BM25 search using PostgreSQL pg_textsearch extension
- Implement Reciprocal Rank Fusion (RRF) for hybrid search
- Add hybrid+ query mode combining BM25 + vector search in parallel
- Add bm25-only query mode for full-text search
- Implement PostgresBM25Adapter with connection pool management
- Auto-indexing via database triggers
- Add comprehensive unit tests for BM25 and RRF
- Add database migration for pg_textsearch support
- Update Dockerfile to install pg_textsearch extension
- Fix critical issues: connection pool leak, error handling, edge cases

Closes BRIC-7
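
As a rough illustration of running the BM25 and vector searches in parallel, the sketch below uses `asyncio.gather` with `return_exceptions=True` so that a failure in one branch does not discard the other branch's results. The functions `bm25_search`, `vector_search`, and `hybrid_search` are hypothetical stand-ins, not the PR's code.

```python
import asyncio


async def bm25_search(query):
    # Hypothetical stand-in for the BM25 engine; raises to simulate an outage.
    raise ConnectionError("bm25 backend unavailable")


async def vector_search(query):
    # Hypothetical stand-in for the vector search call.
    return ["chunk-1", "chunk-2"]


async def hybrid_search(query):
    # return_exceptions=True delivers failures as values instead of raising,
    # so each branch can be inspected independently.
    bm25, vector = await asyncio.gather(
        bm25_search(query), vector_search(query), return_exceptions=True
    )
    bm25_hits = [] if isinstance(bm25, Exception) else bm25
    vector_hits = [] if isinstance(vector, Exception) else vector
    return bm25_hits, vector_hits


result = asyncio.run(hybrid_search("postgres full-text search"))
print(result)  # ([], ['chunk-1', 'chunk-2'])
```

Treating a failed branch as an empty result list is one possible policy; logging the exception before discarding it would usually be advisable.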

- Remove redundant hasattr checks in query_use_case
- Simplify error handling in pg_textsearch_adapter (merge duplicate except blocks)
- Remove docling TXT patch from Dockerfile (no longer needed)
- Move BM25 migration SQL to migrations/ directory

- Add alembic>=1.13.0 dependency
- Create src/alembic.ini configuration
- Create src/alembic/env.py for async migrations (no models, raw SQL)
- Create src/alembic/versions/001_add_bm25_support.py migration
- Add lifespan to main.py for all transport modes (stdio/sse/streamable)
- Run migrations at startup via asyncio.to_thread()
- Close BM25 adapter pool on shutdown
- Add close() method to BM25EnginePort interface
- Remove migrations/001_add_bm25_support.sql (converted to Alembic)
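
The startup and shutdown flow described in this commit could look roughly like the following sketch. It uses only the standard library: `run_migrations` and `FakePool` are placeholders for the blocking Alembic upgrade and the BM25 adapter pool, and the `app` argument mirrors the shape of a FastAPI lifespan without depending on FastAPI. Note that a later commit in this PR moves the migration to run synchronously before uvicorn.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


def run_migrations():
    # Placeholder for the blocking Alembic upgrade call.
    events.append("migrated")


class FakePool:
    # Placeholder for the BM25 adapter's connection pool.
    async def close(self):
        events.append("pool closed")


@asynccontextmanager
async def lifespan(app):
    # Run the blocking migration in a worker thread so the event loop stays
    # responsive. An exception here propagates and aborts startup.
    await asyncio.to_thread(run_migrations)
    pool = FakePool()
    try:
        yield
    finally:
        # Always release the pool on shutdown, even after an error.
        await pool.close()


async def main():
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(main())
print(events)  # ['migrated', 'serving', 'pool closed']
```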

…(BRIC-7)

- Add test_lifespan.py: lifespan startup/shutdown, migration errors, BM25 pool close
- Add test_alembic_config.py: config files, URL conversion, migration validation
- Add hybrid+, bm25-only, and bm25-unavailable tests to test_query_use_case.py
- Add close() tests to test_pg_textsearch_adapter.py

… (BRIC-7)

Critical:
- C1: Lifespan now raises on migration failure instead of silently continuing
- C2: Add asyncio.Lock to prevent race condition in BM25 pool initialization

High:
- H2: Handle asyncio.gather exceptions in hybrid+ mode with return_exceptions=True
- H4: Add WARNING comment about backfill on large tables in migration

Suggestions:
- S5: Add ge=1 validation on BM25_RRF_K config field
- S7: Add close() method assertion in BM25EnginePort test
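
The race fixed in C2 can be illustrated with a minimal double-checked locking sketch around `asyncio.Lock`. The class `LazyPool` and the factory `make_pool` are hypothetical names; the real adapter presumably creates a database connection pool rather than a plain object.

```python
import asyncio


class LazyPool:
    """Creates a shared resource once, even under concurrent first use."""

    def __init__(self, factory):
        self._factory = factory  # hypothetical async callable that builds the pool
        self._pool = None
        self._lock = asyncio.Lock()

    async def get(self):
        # First check without the lock: the cheap path once the pool exists.
        if self._pool is None:
            async with self._lock:
                # Second check: another task may have built it while we waited.
                if self._pool is None:
                    self._pool = await self._factory()
        return self._pool


created = []


async def make_pool():
    await asyncio.sleep(0)  # yield so concurrent callers can interleave
    created.append(object())
    return created[-1]


async def main():
    lazy = LazyPool(make_pool)
    return await asyncio.gather(*(lazy.get() for _ in range(5)))


pools = asyncio.run(main())
print(len(created))  # 1
```

Without the lock, every task that reaches the first check before the factory finishes would build its own pool, and all but the last would leak.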

- Reduce verbose docstrings in adapter, use cleaner type annotations
- Simplify dependencies.py (remove redundant comments)
- Streamline query_use_case.py (reduce nested conditionals)
- Simplify main.py lifespan (remove redundant comments)
- Clean up env.py and config.py

…BRIC-7)

Alembic's async_engine_from_config requires postgresql+asyncpg:// URL. Previously get_url() was stripping +asyncpg, causing 'No module named psycopg2' error.
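
A small helper along these lines would normalize a PostgreSQL URL so that it always selects the asyncpg driver. The name `to_asyncpg_url` is hypothetical and this is only a sketch of the idea, not the PR's get_url().

```python
def to_asyncpg_url(url):
    """Return a SQLAlchemy-style URL that selects the asyncpg driver."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("not a database URL")
    # Accept postgres://, postgresql://, or postgresql+<driver>://
    dialect = scheme.split("+", 1)[0]
    if dialect not in ("postgres", "postgresql"):
        raise ValueError(f"unsupported dialect: {dialect!r}")
    return f"postgresql+asyncpg://{rest}"


print(to_asyncpg_url("postgresql://user:secret@db:5432/app"))
# postgresql+asyncpg://user:secret@db:5432/app
```

Without an explicit driver suffix, SQLAlchemy falls back to psycopg2 for the postgresql dialect, which matches the error quoted in the commit message.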

The chunks table doesn't exist at startup — LightRAG uses lightrag_doc_chunks. BM25 adapter needs its own chunks table with tsvector column. Migration now creates the full table with IF NOT EXISTS instead of ALTER.

Shared DB with composable-agents caused 'Can't locate revision 002' error. Use raganything_alembic_version table to isolate migration histories. Also fix: create chunks table in migration (table doesn't exist at startup).

- Add CREATE EXTENSION IF NOT EXISTS pg_textsearch to migration
- Add BM25 index directly (no longer conditional) since extension is guaranteed
- Add shared_preload_libraries=pg_textsearch to bricks-db in docker-compose
- Drop pg_textsearch extension in downgrade

- Add hybrid+ and bm25 query modes to documentation
- Add BM25 configuration section
- Add Database Migrations section with Alembic details
- Document hybrid+ response format with RRF scoring
- Add BM25 env variables to .env.example
- Update project structure with new files

…nk_id

- RRF combiner now matches by chunk_id (database hash ID) instead of reference_id (per-file sequential number), fixing bm25_rank always being null in hybrid+ results
- BM25 adapter queries lightrag_doc_chunks directly (no separate chunks table)
- BM25 SQL uses to_bm25query(query, index_name) with GIN pre-filter for correctness
- Added _make_workspace() to BM25 adapter matching LightRAGAdapter's workspace mapping
- Alembic migration runs synchronously before uvicorn (fixes event loop deadlock)
- Logging visible in Docker via custom LOG_CONFIG dict passed to both dictConfig and uvicorn
- Empty folder indexing returns SUCCESS with 'No files found' instead of FAILED
- file_extensions empty string coerced to None via BeforeValidator
- HybridSearchResult now includes optional reference_id field from vector results
- ChunkResponse.reference_id is now Optional (null for BM25-only results)
- All 116 tests passing
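
For the logging change, a configuration of roughly this shape sends all log records to stdout, where Docker can collect them. The contents of this `LOG_CONFIG` are an assumption (the PR's actual dict is not shown); the same dict can also be handed to uvicorn through its `log_config` argument.

```python
import logging
import logging.config

# Assumed contents; the PR's actual LOG_CONFIG may differ.
LOG_CONFIG = {
    "version": 1,
    # Keep loggers created before this call (e.g. by imported libraries).
    "disable_existing_loggers": False,
    "formatters": {
        "default": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "default",
        },
    },
    "root": {"handlers": ["stdout"], "level": "INFO"},
}

logging.config.dictConfig(LOG_CONFIG)
logging.getLogger("raganything").info("logging configured")
```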

- BM25 adapter now accepts text_config parameter (default: english, env: BM25_TEXT_CONFIG)
- Creates text-config-specific BM25 index (e.g. idx_lightrag_chunks_bm25_french)
- Auto-rebuilds content_tsv and trigger function when text_config changes
- Removed GIN tsvector pre-filter (was too strict with AND matching for multi-word queries)
- BM25 ranking via to_bm25query handles relevance scoring directly
- Updated .env.raganything-api to BM25_TEXT_CONFIG=french
- All 121 tests passing
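
One way to derive the per-language index name from the configured text search configuration is sketched below. The helper `bm25_index_name` is hypothetical, and applying the same naming pattern to every language (including the english default) is an assumption extrapolated from the french example above.

```python
import os
import re


def bm25_index_name(text_config):
    """Build the per-language BM25 index name for a text search configuration."""
    # Only plain identifiers are accepted, because the value ends up inside a
    # SQL identifier, which cannot be bound as a query parameter.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", text_config):
        raise ValueError(f"invalid text search configuration: {text_config!r}")
    return f"idx_lightrag_chunks_bm25_{text_config}"


# BM25_TEXT_CONFIG and its english default come from the commit message.
text_config = os.environ.get("BM25_TEXT_CONFIG", "english")
print(bm25_index_name(text_config))
```

Validating the value before interpolating it matters here because index and configuration names are spliced into DDL statements rather than passed as parameters.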

…k response

These are internal ranking details not useful for API consumers.

Kaiohz added 17 commits April 7, 2026 19:44

style: Fix lint issues - combine with statements, trailing whitespace…

804912d

… (BRIC-7)

refactor: Reduce cognitive complexity in RRF combiner (sonar S3776) (…

c3e5cb3

…BRIC-7)

chore: Update uv.lock for alembic dependency (BRIC-7)

c8751d5

fix: Use asyncpg driver URL in Alembic env.py (BRIC-7)

b917fd6

fix: Create chunks table in migration instead of ALTER TABLE (BRIC-7)

9b7cbcd

fix: Use separate alembic version table for raganything (BRIC-7)

e15d418

refactor: remove score/bm25_rank/vector_rank/combined_score from chun…

4c725a5

Kaiohz marked this pull request as ready for review April 9, 2026 07:26

Kaiohz merged commit efca694 into main Apr 9, 2026
1 check passed

Kaiohz merged 17 commits into main from BRIC-7/add-bm25-pg-textsearch
