experiment: stream CSV import via DuckDB Appender to show progress #52
Description
Opening a large archive (100GB+) can take 10+ minutes with no meaningful feedback during the DuckDB import step, which is likely the dominant bottleneck. The current `create_from_core_files` uses a single `CREATE TABLE AS SELECT * FROM read_csv(...)`: one atomic SQL call with no progress API.
Proposed approach
Replace the `read_csv` SQL with Rust-side CSV streaming into DuckDB's `Appender` API in batches. Thread a `progress: f64` callback (0.0–1.0) up through `create_from_core_files` → `Archive::open` → `open_archive`, and surface it as real percentage progress on the loading screen.
Key changes
- `src-tauri/Cargo.toml`: add `csv = "1.3.1"`
- `ArchiveOpenProgress::CreatingDatabase`: add a `progress: f64` field; change the internal progress channel from `String` to the enum directly
- `Database::create_from_core_files`: add an `on_progress: F` callback parameter; count rows first (cheap line scan), build the `CREATE TABLE` DDL from the sniffed columns + `TYPE_OVERRIDES`, stream rows via the `Appender` in 50k-row batches, calling `on_progress` after each flush
- `dwca/archive.rs`: thread the progress closure into `create_from_core_files`
- `+page.svelte`: read `progress.progress` on `creatingDatabase` events; update the `archiveLoadingProgress` derivation to use the actual value in the 40–95% range
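The batching and progress logic can be sketched independently of DuckDB by abstracting the appender calls behind closures. This is a minimal sketch, not the actual implementation: `append` and `flush` stand in for duckdb-rs's `Appender::append_row` / `Appender::flush`, and the function name and generic wiring are illustrative.

```rust
/// Stream rows into a sink in fixed-size batches, reporting progress in
/// [0.0, 1.0] after each flush. `append`/`flush` are placeholders for the
/// real Appender calls; `on_progress` is the callback threaded up to the UI.
fn stream_in_batches<R, A, F, P>(
    rows: impl Iterator<Item = R>,
    total_rows: usize,
    batch_size: usize,
    mut append: A,
    mut flush: F,
    mut on_progress: P,
) where
    A: FnMut(R),
    F: FnMut(),
    P: FnMut(f64),
{
    // Guard the zero-row case up front (see Risks below).
    if total_rows == 0 {
        on_progress(1.0);
        return;
    }
    let mut done = 0usize;
    let mut in_batch = 0usize;
    for row in rows {
        append(row);
        done += 1;
        in_batch += 1;
        if in_batch == batch_size {
            flush();
            in_batch = 0;
            on_progress(done as f64 / total_rows as f64);
        }
    }
    // Flush the final partial batch, so the last report reaches 1.0.
    if in_batch > 0 {
        flush();
        on_progress(done as f64 / total_rows as f64);
    }
}
```

Keeping the progress arithmetic in a pure function like this also makes the 50k-batch and empty-archive behaviour trivially unit-testable without a database.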
Type conversion per row
- `VARCHAR`: pass as `&str`
- `DOUBLE` (`decimalLatitude`, `decimalLongitude`): `parse::<f64>().ok()` → `Option<f64>`; empty → `None`
- `BOOLEAN` (`captive`, `hasCoordinate`, etc.): match `"true"`/`"1"` → `Some(true)`, `"false"`/`"0"` → `Some(false)`, `""` → `None`
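As plain helper functions (names hypothetical, applied to each CSV field before appending), the conversions above could look like:

```rust
/// Empty field → NULL; unparseable field → NULL (per the rules above).
fn parse_double(field: &str) -> Option<f64> {
    if field.is_empty() {
        None
    } else {
        field.parse::<f64>().ok()
    }
}

/// Lowercase, then accept both textual and numeric boolean forms;
/// warn on anything unrecognised (see the Risks table).
fn parse_boolean(field: &str) -> Option<bool> {
    match field.to_ascii_lowercase().as_str() {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        "" => None,
        other => {
            eprintln!("warning: unrecognised boolean value: {other:?}");
            None
        }
    }
}
```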
Extension tables are left on `read_csv` for now (they are usually much smaller than occurrences).
Risks
| Risk | Mitigation |
|---|---|
| `Appender` column order diverges from `CREATE TABLE` | Build both from a single shared `Vec<(col, type)>` |
| `BOOLEAN` format variation across archives | Lowercase + check both forms; warn on unrecognised values |
| Archives with no rows | Guard `total_rows == 0`, call `on_progress(1.0)` immediately |
| Regression on `NULL` / empty-column-drop behaviour | Existing tests cover this; add explicit test for `NULL` handling in the `Appender` path |
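The first mitigation could be sketched roughly as follows: derive both the DDL and the appender's column order from one shared schema slice, so the two cannot drift apart. Function names and the example columns are illustrative, not the real sniffed schema.

```rust
/// Build CREATE TABLE DDL from a shared (name, type) schema slice.
fn create_table_ddl(table: &str, schema: &[(&str, &str)]) -> String {
    let cols: Vec<String> = schema
        .iter()
        .map(|(name, ty)| format!("\"{name}\" {ty}"))
        .collect();
    format!("CREATE TABLE \"{table}\" ({})", cols.join(", "))
}

/// Append rows in exactly this order, derived from the same slice.
fn appender_columns<'a>(schema: &'a [(&'a str, &'a str)]) -> Vec<&'a str> {
    schema.iter().map(|(name, _)| *name).collect()
}
```

Because both functions consume the same slice, a reordered or dropped column changes the DDL and the append order together, rather than silently misaligning them.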
Why this is an experiment
Estimated 50k–100k tokens to implement, with the wide range driven by how finicky the `Appender` type handling turns out to be. The improvement is real but only matters for very large archives. Worth doing if those users are a priority.