experiment: stream CSV import via DuckDB Appender to show progress #52

@kueda

Description

Opening a large archive (100GB+) can take 10+ minutes with no meaningful feedback during the DuckDB import step, which is likely the dominant bottleneck. The current create_from_core_files uses a single CREATE TABLE AS SELECT * FROM read_csv(...) — one atomic SQL call with no progress API.

Proposed approach

Replace the read_csv SQL with Rust-side CSV streaming into DuckDB's Appender API in batches. Thread a progress: f64 callback (0.0–1.0) up through create_from_core_files → Archive::open → open_archive, and surface it as real percentage progress on the loading screen.
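A minimal sketch of the callback threading, with hypothetical signatures (the real functions take more arguments and return Results):

```rust
// Hypothetical sketch: a progress callback (0.0..=1.0) threaded through the
// open chain. Names mirror the issue; bodies are placeholders.
fn create_from_core_files<F: Fn(f64)>(_path: &str, on_progress: F) {
    on_progress(0.0);
    // ... stream rows via the Appender, reporting fractions as batches flush ...
    on_progress(1.0);
}

fn open_archive(path: &str, report: impl Fn(f64)) {
    // Each layer just forwards the closure downward.
    create_from_core_files(path, |p| report(p));
}
```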

Key changes

  • src-tauri/Cargo.toml: add csv = "1.3.1"
  • ArchiveOpenProgress::CreatingDatabase: add progress: f64 field; change internal progress channel from String to the enum directly
  • Database::create_from_core_files: add on_progress: F callback parameter; count rows first (cheap line scan), build CREATE TABLE DDL from sniffed columns + TYPE_OVERRIDES, stream rows via Appender in 50k-row batches, calling on_progress after each flush
  • dwca/archive.rs: thread progress closure into create_from_core_files
  • +page.svelte: read progress.progress on creatingDatabase events; update archiveLoadingProgress derivation to use actual value in the 40–95% range
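The batching and progress-reporting logic from the list above can be sketched in isolation. Here `append` and `flush` are closures standing in for duckdb's `Appender::append_row` / `Appender::flush`, so the sketch carries no DuckDB dependency; the 50k batch size and the `total_rows == 0` guard are from the issue:

```rust
const BATCH_SIZE: usize = 50_000;

// Stream rows into a sink in fixed-size batches, reporting a 0.0..=1.0
// fraction after each flush. `total_rows` comes from the cheap line scan.
fn stream_rows<R>(
    rows: impl Iterator<Item = R>,
    total_rows: usize,
    mut append: impl FnMut(R),
    mut flush: impl FnMut(),
    on_progress: impl Fn(f64),
) {
    // Guard: an archive with no rows reports completion immediately.
    if total_rows == 0 {
        on_progress(1.0);
        return;
    }
    let mut done = 0usize;
    for row in rows {
        append(row);
        done += 1;
        if done % BATCH_SIZE == 0 {
            flush();
            on_progress(done as f64 / total_rows as f64);
        }
    }
    // Flush the final partial batch, if any.
    if done % BATCH_SIZE != 0 {
        flush();
        on_progress(done as f64 / total_rows as f64);
    }
}
```

Keeping this loop generic over the sink also makes the batching and progress arithmetic testable without a database.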

Type conversion per row

  • VARCHAR: pass as &str
  • DOUBLE (decimalLatitude, decimalLongitude): parse::<f64>().ok() → Option<f64>, empty → None
  • BOOLEAN (captive, hasCoordinate, etc.): match "true"/"1" → Some(true), "false"/"0" → Some(false), "" → None
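The conversions above as standalone helpers (function names are illustrative, not from the codebase):

```rust
// DOUBLE columns (decimalLatitude, decimalLongitude): empty or
// unparsable fields map to NULL (None).
fn parse_double(field: &str) -> Option<f64> {
    field.parse::<f64>().ok()
}

// BOOLEAN columns (captive, hasCoordinate, etc.): accept word and numeric
// forms case-insensitively; anything else maps to NULL (and should warn).
fn parse_boolean(field: &str) -> Option<bool> {
    match field.to_ascii_lowercase().as_str() {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}
```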

Extension tables are left on read_csv for now (usually much smaller than occurrences).

Risks

  • Risk: Appender column order diverges from the CREATE TABLE DDL. Mitigation: build both from a single shared Vec<(col, type)>.
  • Risk: BOOLEAN format varies across archives. Mitigation: lowercase and check both forms; warn on unrecognised values.
  • Risk: archives with no rows. Mitigation: guard on total_rows == 0 and call on_progress(1.0) immediately.
  • Risk: regression in NULL / empty-column-drop behaviour. Mitigation: existing tests cover this; add an explicit test for NULL handling in the Appender path.

Why this is an experiment

Estimated 50k–100k tokens to implement, with the wide range driven by how finicky the Appender type handling turns out to be. The improvement is real but only matters for very large archives. Worth doing if those users are a priority.
