Tooling to store and analyze disk reliability data, especially from Backblaze.
WARNING: This project is in active development and still alpha-quality.
diskstats ingests large SMART datasets (currently Backblaze), normalizes them into query-friendly relational tables, and provides a foundation for reliability analysis.
Today the repository focuses on:
- repeatable ingest and normalization pipelines
- stable table design for daily drive telemetry
- canonical model/manufacturer mapping to tame source data noise
- operational tooling for long-running jobs
Pipeline shape:
- Download/extract quarterly archives (`bin/bb_dl`, optionally `bin/bb_dl_legacy`)
- Raw ingest into `bb.drive_stats_raw` (`bin/bb_load.py`)
- Normalize into canonical tables (`bin/bb_norm` → `bb.load_drive_day_backfill`)
- Query/analysis against normalized facts (`public.drive_day`) and dimensions (`public.drive`, `public.drive_model`, etc.)
Data domains:
- `bb.*`: source-specific raw/staging objects (Backblaze)
- `public.*`: provider-agnostic canonical dimensions/facts
Prerequisites:
- PostgreSQL with `psql` available in `PATH`
- Python 3 with `psycopg2` installed (required by `bin/bb_load.py`)
- `wget`, `zip`, `unzip`, and `flock` in `PATH`
- A host/environment that can run long jobs reliably (for example, a utility VM + `screen`/`tmux`)
- Enough CPU and memory bandwidth for long normalization runs (multi-core server recommended)
- Sufficient local storage: plan for at least 500 GB free for full download/extract/load plus database growth
- A `PSQL_DSN` value, via environment variable or `.env` in the project root
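The scripts read `PSQL_DSN` from the environment, with `.env` in the project root as a fallback. A minimal sketch of that lookup logic, assuming a simple `KEY=VALUE` file format (the project's own scripts may parse `.env` differently):

```python
import os
from pathlib import Path
from typing import Optional

def resolve_psql_dsn(env_file: Path = Path(".env")) -> Optional[str]:
    """Return PSQL_DSN from the environment, else from a KEY=VALUE .env file."""
    dsn = os.environ.get("PSQL_DSN")
    if dsn:
        return dsn
    if env_file.is_file():
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line.startswith("PSQL_DSN="):
                # Split on the first '=' only; DSNs may contain '=' in options.
                return line.split("=", 1)[1].strip().strip('"')
    return None
```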
Run commands from the repository root (diskstats/) inside a persistent shell session (for example tmux/screen). Both download/ingest and normalization are long-running jobs on full datasets. Scripts create/use these directories and files:
- `downloads/`: downloaded archives
- `expanded/`: extracted CSV payloads
- `logs/`: job logs (`dl_*.log`, `norm_*.log`, etc.)
- `state_*.lock`: lock files used to prevent overlapping runs
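The lock files follow the usual `flock` pattern: a second run that cannot take the lock backs off instead of overlapping the first. A hypothetical Python equivalent of what the shell scripts do with `flock` (the function name and behavior here are illustrative, not the scripts' actual code):

```python
import fcntl

def try_lock(path: str):
    """Open (creating if needed) a lock file and try to take an exclusive,
    non-blocking flock on it. Returns the open file object while the lock
    is held, or None if another run already holds it."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None
```

The lock is tied to the open file description, so it is released automatically if the process dies, which is why stale runs do not wedge the pipeline.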
Set `PSQL_DSN` to the target diskstats DB for ingest/normalization:

```sh
export PSQL_DSN="postgresql://user:pass@localhost:5432/diskstats"
```

or create `.env` in the project root:

```sh
PSQL_DSN=postgresql://user:pass@localhost:5432/diskstats
```

Then run:
```sh
bin/db_setup "postgresql://user:pass@localhost:5432/postgres"
bin/db_preflight
bin/bb_dl_legacy   # Optional if you want pre-2016 data
bin/bb_dl
bin/bb_norm
```

diskstats ships with a collection of analytics materialized views in `sql/builtin-queries.sql`, installed by `bin/db_setup` (unless `--no-builtin-queries` is passed).
Added built-ins:
- `public.model_fleet_latest_stats`: most recent-day fleet size/failure snapshot by model
- `public.model_reliability_90d`: trailing 90-day model reliability summary
- `public.model_smart_signals_30d`: trailing 30-day SMART signal prevalence and failure co-occurrence
- `public.model_lifetime_leaderboard`: lifetime model leaderboard (min 10,000 drive-days)
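The reliability views express failure rates per drive-year of exposure. A back-of-envelope version of that metric (the views' exact SQL may differ):

```python
def annualized_failures_per_drive_year(failures: int, drive_days: int) -> float:
    """Annualize a failure count over observed drive-days:
    drive_days / 365.25 is the fleet's exposure in drive-years."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365.25)
```

For example, 4 failures over 365,250 drive-days (1,000 drive-years of exposure) is an annualized rate of 0.004 failures per drive-year.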
Related existing analytics materialized views already in schema:
- `public.drive_day_growth`
- `public.drive_lifecycle`
- `public.model_hazard_5k`
- `public.model_lifetime_stats`
- `public.model_quarter_stats`
Refresh the full analytics collection after large ingest/normalization runs:
```sh
psql "$PSQL_DSN" -c "CALL public.refresh_analytics_materialized_views();"
```

Example usage:

```sh
psql "$PSQL_DSN" -c "SELECT * FROM public.model_reliability_90d ORDER BY annualized_failures_per_drive_year DESC LIMIT 20;"
```

`bin/db_setup` creates/rebuilds schema, seed data, and quarterly partitions. Use an admin-capable DSN (commonly `.../postgres`) because it can `DROP DATABASE` and `CREATE DATABASE`.
Useful flags:
- `--no-model-seed`
- `--no-model-alias`
- `--no-builtin-queries`
`bin/db_preflight` checks local tooling, DB connectivity, expected schema objects, and partition coverage before long jobs.
`bin/bb_dl` and `bin/bb_dl_legacy` download and ingest Backblaze quarterly datasets into `bb.drive_stats_raw`. Run them from a persistent shell (`tmux`/`screen`) for full loads.
- `bb_dl_legacy`: older dataset ranges
- `bb_dl`: current quarterly dataset pages/archives
Both scripts are lock-protected and should not run in parallel with each other.
`bin/bb_norm` calls `bb.load_drive_day_backfill(start_year, end_year, continue_on_error)` to normalize raw ingest into `public.drive_day` and related canonical dimensions.
- Emits quarter-level progress notices (start/skip/done/error, rows, elapsed)
- Emits monthly chunk heartbeats within each quarter
- Intended for very long runs; use `tmux`/`screen`
Observed runtime guidance:
- Full raw ingest can take many hours (around 8 hours in one observed environment) and may process hundreds of millions of rows
- Full normalization can run for multiple days
- Normalization is typically CPU + memory-bandwidth bound
Recommended process:
- Run `bin/db_preflight`
- Start downloads/ingest in a persistent session (`tmux`/`screen`)
- Start normalization in a persistent session (`tmux`/`screen`)
- Monitor `logs/` and keep lock files intact
`bb.load_drive_day_backfill` performs internal COMMITs. Ensure the procedure is called as a top-level statement from `psql` (not inside an explicit `BEGIN ... COMMIT` block, and not wrapped in a way that forces transaction-block semantics).
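The same constraint applies to any driver that invokes the procedure. With psycopg2 (already a prerequisite), that means enabling autocommit before the `CALL`; a sketch, not tested against the real schema:

```python
BACKFILL_SQL = "CALL bb.load_drive_day_backfill(%s, %s, %s)"

def run_backfill(dsn: str, start_year: int, end_year: int,
                 continue_on_error: bool = True) -> None:
    """Invoke the backfill procedure outside any transaction block.
    autocommit=True is required: the procedure issues internal COMMITs,
    which fail inside an explicit transaction."""
    import psycopg2  # local import: only needed when actually running
    conn = psycopg2.connect(dsn)
    try:
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(BACKFILL_SQL, (start_year, end_year, continue_on_error))
    finally:
        conn.close()
```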
Run `db_setup` (or `public.ensure_core_partitions(start_year, end_year)`) to create missing quarterly partitions.
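Quarterly partitions each cover a half-open three-month date range. A small helper showing the bounds such a procedure needs to cover for a year range (the quarter labels and the procedure's internals are assumptions here):

```python
from datetime import date

def quarter_bounds(start_year: int, end_year: int):
    """Yield (label, first_day, next_quarter_first_day) for every quarter
    in [start_year, end_year], matching half-open partition ranges."""
    for year in range(start_year, end_year + 1):
        for q in range(1, 5):
            first = date(year, 3 * q - 2, 1)
            nxt = date(year + 1, 1, 1) if q == 4 else date(year, 3 * q + 1, 1)
            yield (f"{year}q{q}", first, nxt)
```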
Export `PSQL_DSN` or add it to `.env` in the project root.
- `seed-manufacturer.sql`: curated manufacturer reference rows
- `seed-drive-model-curated.sql`: curated `public.drive_model` rows
- `seed-model-alias.sql`: deterministic exact aliases (raw `model_name` -> `model_id`)
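The alias seed is a deterministic, exact mapping from raw model strings to canonical model ids — conceptually a plain dictionary lookup with no fuzzy matching. The mapping below is made up for illustration:

```python
# Hypothetical alias rows: raw model_name -> canonical model_id.
MODEL_ALIASES = {
    "ST4000DM000": 1,
    "Seagate ST4000DM000": 1,  # vendor-prefixed spelling of the same model
}

def resolve_model_id(raw_model_name: str):
    """Exact lookup only: unknown spellings return None rather than being
    fuzzily matched, which keeps the mapping deterministic and auditable."""
    return MODEL_ALIASES.get(raw_model_name)
```

Unmapped spellings surface as misses to be triaged into the seed file, rather than being silently guessed at.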
For SQL object layout and procedure inventory, see sql/README.md.