BRIC-7: Add BM25 full-text search with pg_textsearch and Alembic migrations #7

Merged
Kaiohz merged 17 commits into main from BRIC-7/add-bm25-pg-textsearch
Apr 9, 2026

Kaiohz (Collaborator) commented Apr 8, 2026

Jira

BRIC-7

Changes

  • Add BM25 full-text search via PostgreSQL pg_textsearch extension
  • Add hybrid+ query mode: parallel BM25 + vector search, merged with Reciprocal Rank Fusion (RRF)
  • Add bm25 query mode: BM25-only search
  • Add Alembic database migrations running at FastAPI lifespan startup
  • Add PostgresBM25Adapter with async connection pool (double-checked locking)
  • Add RRFCombiner for combining BM25 and vector search results
  • Add chunks table migration with tsvector, GIN index, BM25 index, trigger
  • Add shared_preload_libraries=pg_textsearch to bricks-db docker-compose
  • Use separate raganything_alembic_version table for migration isolation
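
The RRFCombiner code is not shown in this conversation. As a rough illustration of what Reciprocal Rank Fusion does, here is a minimal sketch. The function name `rrf_combine`, the plain lists of chunk IDs as input, and the default `k=60` are assumptions made for the example, not the PR's actual interface.

```python
from collections import defaultdict


def rrf_combine(bm25_ids, vector_ids, k=60):
    """Fuse two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Hypothetical helper: each input is ordered best-first, and every
    list contributes 1 / (k + rank) to the score of each ID it contains.
    """
    scores = defaultdict(float)
    for ranked in (bm25_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A chunk that appears in both lists collects two contributions, so it tends to outrank a chunk that only one backend found. A larger `k` flattens the difference between adjacent ranks.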

Tests

  • 109 unit tests passed
  • Linter: ruff check + format passed
  • Code review: 2 critical + 4 high issues fixed
  • SonarQube: 0 new issues
  • Trivy: 0 CRITICAL/HIGH vulnerabilities
  • Docker QA: pg_textsearch installed, BM25 index created, service healthy

DO NOT MERGE

Kaiohz added 17 commits April 7, 2026 19:44
- Add BM25 search using PostgreSQL pg_textsearch extension
- Implement Reciprocal Rank Fusion (RRF) for hybrid search
- Add hybrid+ query mode combining BM25 + vector search in parallel
- Add bm25-only query mode for full-text search
- Implement PostgresBM25Adapter with connection pool management
- Auto-indexing via database triggers
- Add comprehensive unit tests for BM25 and RRF
- Add database migration for pg_textsearch support
- Update Dockerfile to install pg_textsearch extension
- Fix critical issues: connection pool leak, error handling, edge cases

Closes BRIC-7
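
To make the "in parallel" part of hybrid+ concrete, the sketch below runs two searches concurrently with `asyncio.gather`. The names `hybrid_search`, `bm25_search`, and `vector_search` are hypothetical stand-ins for whatever the use case actually calls, and degrading a failed branch to an empty result is one possible policy, not necessarily the one this PR chose.

```python
import asyncio


async def hybrid_search(bm25_search, vector_search, query, top_k=10):
    """Run both searches concurrently (hypothetical helper).

    bm25_search and vector_search are assumed to be async callables
    taking (query, top_k) and returning a list of hits.
    """
    bm25_result, vector_result = await asyncio.gather(
        bm25_search(query, top_k),
        vector_search(query, top_k),
        return_exceptions=True,  # a failure is returned, not raised
    )
    bm25_failed = isinstance(bm25_result, BaseException)
    vector_failed = isinstance(vector_result, BaseException)
    if bm25_failed and vector_failed:
        # Nothing usable came back, so surface one of the errors.
        raise vector_result
    # Assumed policy: a single failed branch degrades to an empty list.
    return ([] if bm25_failed else bm25_result,
            [] if vector_failed else vector_result)
```

With `return_exceptions=True`, an error in one backend does not discard the result of the other, which is what lets a hybrid query fall back to whichever search succeeded.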
- Remove redundant hasattr checks in query_use_case
- Simplify error handling in pg_textsearch_adapter (merge duplicate except blocks)
- Remove docling TXT patch from Dockerfile (no longer needed)
- Move BM25 migration SQL to migrations/ directory
- Add alembic>=1.13.0 dependency
- Create src/alembic.ini configuration
- Create src/alembic/env.py for async migrations (no models, raw SQL)
- Create src/alembic/versions/001_add_bm25_support.py migration
- Add lifespan to main.py for all transport modes (stdio/sse/streamable)
- Run migrations at startup via asyncio.to_thread()
- Close BM25 adapter pool on shutdown
- Add close() method to BM25EnginePort interface
- Remove migrations/001_add_bm25_support.sql (converted to Alembic)
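
The lifespan described in this commit could look roughly like the sketch below. It uses only the standard library; `make_lifespan`, `run_migrations`, and `bm25_adapter` are hypothetical names, with `run_migrations` assumed to be a blocking function and `bm25_adapter` assumed to expose an async `close()`.

```python
import asyncio
from contextlib import asynccontextmanager


def make_lifespan(run_migrations, bm25_adapter):
    """Build a lifespan context manager (hypothetical factory)."""

    @asynccontextmanager
    async def lifespan(app):
        # Run the blocking migration call in a worker thread so the
        # event loop stays free. An exception here propagates, which
        # aborts startup instead of serving on an unmigrated schema.
        await asyncio.to_thread(run_migrations)
        try:
            yield
        finally:
            # Release the BM25 connection pool on shutdown.
            await bm25_adapter.close()

    return lifespan
```

An object like this can be passed as `FastAPI(lifespan=...)`. Note that `asyncio.to_thread` requires Python 3.9 or later.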
…(BRIC-7)

- Add test_lifespan.py: lifespan startup/shutdown, migration errors, BM25 pool close
- Add test_alembic_config.py: config files, URL conversion, migration validation
- Add hybrid+, bm25-only, and bm25-unavailable tests to test_query_use_case.py
- Add close() tests to test_pg_textsearch_adapter.py
Critical:
- C1: Lifespan now raises on migration failure instead of silently continuing
- C2: Add asyncio.Lock to prevent race condition in BM25 pool initialization

High:
- H2: Handle asyncio.gather exceptions in hybrid+ mode with return_exceptions=True
- H4: Add WARNING comment about backfill on large tables in migration

Suggestions:
- S5: Add ge=1 validation on BM25_RRF_K config field
- S7: Add close() method assertion in BM25EnginePort test
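
The race fixed in C2 is the classic lazy-initialization problem: two coroutines both see "no pool yet" and both create one. A minimal sketch of double-checked locking with `asyncio.Lock` follows. `LazyPool` and `create_pool` are hypothetical names, and the pool is assumed to be any object with an async `close()`.

```python
import asyncio


class LazyPool:
    """Create a connection pool at most once (hypothetical wrapper)."""

    def __init__(self, create_pool):
        self._create_pool = create_pool  # async factory returning a pool
        self._pool = None
        self._lock = asyncio.Lock()

    async def get(self):
        # First check, without the lock: the common fast path.
        if self._pool is None:
            async with self._lock:
                # Second check, under the lock: another coroutine may
                # have finished creating the pool while we were waiting.
                if self._pool is None:
                    self._pool = await self._create_pool()
        return self._pool

    async def close(self):
        async with self._lock:
            if self._pool is not None:
                await self._pool.close()
                self._pool = None
```

Without the second check, every coroutine that queued on the lock during creation would build its own pool and overwrite the previous one, leaking connections.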
- Reduce verbose docstrings in adapter, use cleaner type annotations
- Simplify dependencies.py (remove redundant comments)
- Streamline query_use_case.py (reduce nested conditionals)
- Simplify main.py lifespan (remove redundant comments)
- Clean up env.py and config.py
Alembic's async_engine_from_config requires a postgresql+asyncpg:// URL.
Previously get_url() stripped the +asyncpg suffix, so SQLAlchemy fell back to its default psycopg2 driver and failed with 'No module named psycopg2'.
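
The actual get_url() is not shown here. As an illustration of the fix, a helper along these lines keeps or adds the asyncpg driver instead of removing it. The name `to_asyncpg_url` and the decision to normalize the `postgres://` alias are assumptions.

```python
def to_asyncpg_url(url: str) -> str:
    """Return a PostgreSQL URL that selects the asyncpg driver.

    Hypothetical helper: any existing "+driver" part of the scheme is
    replaced with "+asyncpg".
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("expected a URL of the form scheme://...")
    dialect = scheme.split("+", 1)[0]
    if dialect not in ("postgresql", "postgres"):
        raise ValueError(f"unsupported dialect: {dialect!r}")
    return f"postgresql+asyncpg://{rest}"
```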
The chunks table doesn't exist at startup; LightRAG stores its chunks in lightrag_doc_chunks.
The BM25 adapter needs its own chunks table with a tsvector column.
The migration now creates the full table with CREATE TABLE IF NOT EXISTS instead of ALTER TABLE.
Sharing a database with composable-agents caused a 'Can't locate revision 002' error.
Use a raganything_alembic_version table to isolate the migration histories.
Also fix: create the chunks table in the migration (the table doesn't exist at startup).
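
In Alembic, the version table is chosen through the `version_table` argument of `context.configure()`. The fragment below sketches how an env.py might set it. The function name `do_run_migrations` follows Alembic's async template, and the rest of this project's env.py is not shown in the conversation, so treat the surrounding details as assumptions.

```python
VERSION_TABLE = "raganything_alembic_version"


def do_run_migrations(connection) -> None:
    """Configure Alembic to track revisions in a service-specific table."""
    # Imported here because alembic.context is only usable while an
    # Alembic command is running; env.py normally imports it at the top.
    from alembic import context

    context.configure(
        connection=connection,
        target_metadata=None,  # raw-SQL migrations, no ORM models
        version_table=VERSION_TABLE,  # default would be "alembic_version"
    )
    with context.begin_transaction():
        context.run_migrations()
```

With a dedicated table, each service sharing the database records its own current revision, so one service never tries to resolve a revision ID that only exists in the other's migration directory.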
- Add CREATE EXTENSION IF NOT EXISTS pg_textsearch to migration
- Add BM25 index directly (no longer conditional) since extension is guaranteed
- Add shared_preload_libraries=pg_textsearch to bricks-db in docker-compose
- Drop pg_textsearch extension in downgrade
- Add hybrid+ and bm25 query modes to documentation
- Add BM25 configuration section
- Add Database Migrations section with Alembic details
- Document hybrid+ response format with RRF scoring
- Add BM25 env variables to .env.example
- Update project structure with new files
…nk_id

- RRF combiner now matches by chunk_id (database hash ID) instead of reference_id
  (per-file sequential number), fixing bm25_rank always being null in hybrid+ results
- BM25 adapter queries lightrag_doc_chunks directly (no separate chunks table)
- BM25 SQL uses to_bm25query(query, index_name) with GIN pre-filter for correctness
- Added _make_workspace() to BM25 adapter matching LightRAGAdapter's workspace mapping
- Alembic migration runs synchronously before uvicorn (fixes event loop deadlock)
- Logging visible in Docker via custom LOG_CONFIG dict passed to both dictConfig and uvicorn
- Empty folder indexing returns SUCCESS with 'No files found' instead of FAILED
- file_extensions empty string coerced to None via BeforeValidator
- HybridSearchResult now includes optional reference_id field from vector results
- ChunkResponse.reference_id is now Optional (null for BM25-only results)
- All 116 tests passing
- BM25 adapter now accepts text_config parameter (default: english, env: BM25_TEXT_CONFIG)
- Creates text-config-specific BM25 index (e.g. idx_lightrag_chunks_bm25_french)
- Auto-rebuilds content_tsv and trigger function when text_config changes
- Removed GIN tsvector pre-filter (was too strict with AND matching for multi-word queries)
- BM25 ranking via to_bm25query handles relevance scoring directly
- Updated .env.raganything-api to BM25_TEXT_CONFIG=french
- All 121 tests passing
…k response

These are internal ranking details not useful for API consumers.
Kaiohz marked this pull request as ready for review April 9, 2026 07:26
Kaiohz merged commit efca694 into main Apr 9, 2026
1 check passed