DuckDB Metadata Index #80

Merged
kylebd99 merged 9 commits into main from duckdb-metadata-index on Mar 9, 2026
albert-du (Collaborator) commented Mar 8, 2026

#73

Implements a DuckDB metadata index.

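DuckDB's worst-case query performance is better than the current SQLite index (roughly 5 to 9 ms per query in every scenario, versus about 340 ms for SQLite with a domain filter), but DuckDB is slower than a better-optimized SQLite index in every scenario.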

poetry run python govscape/benchmarks/metadata_index_benchmark.py --documents 5000000 --queries 100
Index            Docs  Ingest(s)   Docs/s     Size  Scenario            Q/s    Lat(ms)
--------------------------------------------------------------------------------------
sqlite         5000000    16.6362   300550  886.5MB  no_filter        3220.3       0.31
                                                     domain_filter       2.9     344.46
                                                     date_filter      2451.8       0.41
                                                     all_filters         2.9     341.45
--------------------------------------------------------------------------------------
improved-sqlite 5000000     9.0737   551043  807.7MB  no_filter        9523.6       0.11
                                                      domain_filter   19925.2       0.05
                                                      date_filter     15548.3       0.06
                                                      all_filters     21146.7       0.05
--------------------------------------------------------------------------------------
duckdb         5000000     5.2445   953388  328.0MB  no_filter         108.0       9.26
                                                     domain_filter     172.2       5.81
                                                     date_filter       108.3       9.23
                                                     all_filters       188.5       5.30
--------------------------------------------------------------------------------------
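For reading the table: the Q/s and Lat(ms) columns look like reciprocals of each other (for example, 1000 / 3220.3 ≈ 0.31), and Docs/s is close to Docs divided by Ingest(s). The sketch below shows that relationship, assuming queries run one at a time. The helper names are hypothetical and are not taken from the benchmark script.

```python
from __future__ import annotations


def summarize(num_queries: int, elapsed_seconds: float) -> tuple[float, float]:
    """Return (queries per second, mean latency in ms) for a sequential run.

    Assumption: queries are issued one after another, so throughput and
    mean latency are reciprocals of each other.
    """
    if num_queries <= 0 or elapsed_seconds <= 0:
        raise ValueError("num_queries and elapsed_seconds must be positive")
    qps = num_queries / elapsed_seconds
    latency_ms = 1000.0 * elapsed_seconds / num_queries
    return qps, latency_ms


def docs_per_second(num_docs: int, ingest_seconds: float) -> float:
    """Return ingest throughput in documents per second."""
    if ingest_seconds <= 0:
        raise ValueError("ingest_seconds must be positive")
    return num_docs / ingest_seconds


if __name__ == "__main__":
    # Hypothetical timing: 100 queries finishing in 0.5 s.
    qps, latency_ms = summarize(100, 0.5)
    print(f"{qps:.1f} Q/s, {latency_ms:.2f} ms")
```

Small differences between the printed Docs/s and Docs / Ingest(s) are expected if the script rounds the ingest time for display.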

I added a benchmark covering four query scenarios:

  • no_filter: searches for random PDF names
  • domain_filter: PDF names and subdomain
  • date_filter: PDF names and crawl ranges
  • all_filters: PDF names, subdomains, and crawl ranges
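As a rough illustration of how the four scenarios differ, here is a minimal sketch using Python's standard-library sqlite3 module. The table name `documents`, the columns `pdf_name`, `subdomain`, and `crawl_date`, and the helper functions are all assumptions made for this example and may not match the real index schema. The scenario names follow the benchmark output.

```python
import sqlite3


def build_query(pdf_name, subdomain=None, crawl_range=None):
    """Build a parameterized lookup for one PDF name with optional filters."""
    clauses = ["pdf_name = ?"]
    params = [pdf_name]
    if subdomain is not None:
        clauses.append("subdomain = ?")
        params.append(subdomain)
    if crawl_range is not None:
        start, end = crawl_range
        clauses.append("crawl_date BETWEEN ? AND ?")
        params.extend([start, end])
    sql = (
        "SELECT pdf_name, subdomain, crawl_date FROM documents WHERE "
        + " AND ".join(clauses)
    )
    return sql, params


def make_demo_db():
    """Create a tiny in-memory table with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (pdf_name TEXT, subdomain TEXT, crawl_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [
            ("a.pdf", "www.example.gov", "2024-01-15"),
            ("a.pdf", "data.example.gov", "2025-06-01"),
            ("b.pdf", "www.example.gov", "2025-06-01"),
        ],
    )
    return conn


# Filter arguments for each scenario (values are made up for the demo).
SCENARIOS = {
    "no_filter": {},
    "domain_filter": {"subdomain": "www.example.gov"},
    "date_filter": {"crawl_range": ("2025-01-01", "2025-12-31")},
    "all_filters": {
        "subdomain": "www.example.gov",
        "crawl_range": ("2025-01-01", "2025-12-31"),
    },
}


def run_scenario(conn, name, pdf_name):
    """Run one scenario and return the matching rows."""
    sql, params = build_query(pdf_name, **SCENARIOS[name])
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    for name in SCENARIOS:
        print(name, run_scenario(conn, name, "a.pdf"))
```

Each added filter only appends another predicate to the same lookup, so the performance differences between scenarios come from how the index handles those extra predicates, not from a different query shape.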

I also cleaned up a duplicate .gitignore entry.

albert-du marked this pull request as draft on March 8, 2026 05:07
albert-du linked an issue on Mar 8, 2026 that may be closed by this pull request
kylebd99 marked this pull request as ready for review on March 9, 2026 20:10
kylebd99 (Collaborator) commented Mar 9, 2026

This is great. Cool to see that SQLite holds up when properly optimized. I removed the non-optimized SQLite version because it appears to be fully superseded.
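The thread does not say what the SQLite optimization consisted of. Purely as a general illustration of how index design can change a filtered lookup, the sketch below compares SQLite's query plan before and after adding a composite index. The schema and the index name `idx_subdomain_pdf` are hypothetical, and this is not a description of the change made in this PR.

```python
import sqlite3


def query_plan(conn, sql, params):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); the detail column is readable.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical schema, not the real metadata index layout.
conn.execute(
    "CREATE TABLE documents (pdf_name TEXT, subdomain TEXT, crawl_date TEXT)"
)

sql = "SELECT pdf_name FROM documents WHERE subdomain = ? AND pdf_name = ?"
params = ("www.example.gov", "a.pdf")

# Without any index, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, params)

# A composite index lets both equality predicates be resolved in the index.
conn.execute("CREATE INDEX idx_subdomain_pdf ON documents (subdomain, pdf_name)")
plan_after = query_plan(conn, sql, params)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should report a table scan and the second a search using the new index. The exact wording of the plan text varies by SQLite version.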

kylebd99 merged commit ac06aa0 into main on Mar 9, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Implement a DuckDB Metadata Index