Skip to content

hesscl/quackrfss

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

12 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ¦† quackrfss

BRFSS survey data β†’ DuckDB. Clone, run one command, start analyzing.

No manual downloads. No SAS. No hours of prep.

git clone https://github.com/hesscl/quackrfss
cd quackrfss
pip install uv && uv sync
quackrfss                  # builds brfss.duckdb with 1990–2024 data (~10.1M respondents)

Or query instantly β€” no build required:

import duckdb
con = duckdb.connect()
con.sql("SELECT GENHLTH_lbl, COUNT(*) FROM read_parquet('hf://datasets/hesscl/quackrfss/data/BRFSS_2024.parquet') GROUP BY 1 ORDER BY 2 DESC").show()

✨ What it does

Downloads CDC Behavioral Risk Factor Surveillance System (BRFSS) data, converts it from SAS Transport (XPT) to Parquet, and builds a DuckDB database you can query instantly:

Object Type Description
brfss VIEW All years unified. NULL where a variable was absent in a given year.
brfss_2024 … brfss_1990 VIEW Per-year views backed directly by Parquet files.
variable_labels TABLE (var, year, label, section) β€” human name for each variable.
value_labels TABLE (var, year, value, label) β€” what each numeric code means.

Every categorical variable gets a *_lbl companion column baked into the Parquet (e.g. GENHLTH_lbl = 'Good' alongside GENHLTH = 2). The .duckdb file is tiny (< 10 MB) β€” it stores only views and metadata; the Parquet files are the source of truth.


πŸš€ Quickstart

Query without building

All 35 years of Parquet files are published at hesscl/quackrfss on Hugging Face. Query any year directly with DuckDB β€” no clone, no build, no 2.5 GB download:

import duckdb
con = duckdb.connect()

# Single year
con.sql("SELECT * FROM read_parquet('hf://datasets/hesscl/quackrfss/data/BRFSS_2024.parquet') LIMIT 5").show()

# All years at once
con.sql("SELECT YEAR, COUNT(*) FROM read_parquet('hf://datasets/hesscl/quackrfss/data/BRFSS_*.parquet') GROUP BY 1 ORDER BY 1").show()

Build locally

# Install dependencies (Python 3.11+)
pip install uv
uv sync

# Full build: all years, all stages
quackrfss

# Just one year to try it out
quackrfss --years 2024

# Specific years, already have raw files
quackrfss --years 2022 2023 --skip-download

# Re-run from scratch
quackrfss --force

Then query:

import duckdb
con = duckdb.connect("brfss.duckdb")

# Poor/fair health by state and year
con.sql("""
    SELECT
        _STATE_lbl AS state,
        YEAR,
        ROUND(100.0 * COUNT(*) FILTER (WHERE GENHLTH_lbl IN ('Fair', 'Poor'))
              / COUNT(*), 1) AS pct_fair_poor
    FROM brfss
    WHERE GENHLTH_lbl IS NOT NULL
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 20
""").show()
# DuckDB CLI
duckdb brfss.duckdb
D SELECT GENHLTH_lbl, COUNT(*) FROM brfss_2024 GROUP BY 1 ORDER BY 2 DESC;

πŸ”§ Pipeline stages

download  β†’  parse_layout + parse_formats + parse_sasout  β†’  load  β†’  schema
 (XPT ZIP)    (HTML layouts, SAS formats, sasout files)      (Parquet)   (DuckDB)

Each stage is idempotent β€” re-running skips work that's already done unless --force is passed.

# Individual stages
python -m scripts.download       --years 2024
python -m scripts.parse_layout   --years 2024    # writes metadata/layouts/
python -m scripts.parse_formats  --years 2024    # writes metadata/labels/ (2000+)
python -m scripts.parse_sasout   --years 1995    # writes metadata/labels/ (1990–1999)
python -m scripts.load           --years 2024    # writes data/parquet/
python -m scripts.schema                         # (re)builds brfss.duckdb

πŸ” Metadata lookup

-- What does a variable name mean?
SELECT * FROM variable_labels WHERE var = 'MENTHLTH';

-- What do the numeric codes mean?
SELECT * FROM value_labels WHERE var = 'GENHLTH' AND year = 2024 ORDER BY value::INT;

-- Find variables related to diabetes across all years
SELECT DISTINCT var, label FROM variable_labels WHERE lower(label) LIKE '%diabet%';

βš–οΈ Survey weights

BRFSS uses complex sampling. For population-level estimates use the appropriate weight for your year range:

  • 2011–2024: _LLCPWT (dual-frame landline + cellphone)
  • 1990–2010: _FINALWT (landline only)
df = con.execute("""
    SELECT _STATE_lbl AS state, GENHLTH, _LLCPWT AS weight
    FROM brfss_2024
    WHERE GENHLTH < 7
""").df()
# Then use a survey-weighted analysis package (e.g. samplics, weightedstats)

Cross-era comparisons (pre- vs post-2011) require care due to the sampling frame change.


πŸ’Ύ Storage

Artifact Approx. size
Raw XPT ZIPs (35 years) ~2.5 GB
Parquet files (35 years) ~850 MB
brfss.duckdb < 10 MB (views + metadata only)

data/ and brfss.duckdb are gitignored β€” everyone builds from source. Alternatively, the Parquet files are hosted publicly on Hugging Face and can be queried directly without any local build.


πŸ“ Notes on specific years

  • 2020: COVID-19 forced telephone-only collection and a lower response rate. The brfss_2020 view is included; treat cross-year comparisons carefully.
  • 2011: No variable layout HTML is available from CDC (PDF only). Variable labels are sourced from XPT metadata instead; value labels are unaffected.
  • 2000–2005: No variable layout HTML available (PDF only). Variable labels come from XPT metadata; value labels from the .sas format file.
  • 2006–2010: Layout HTML exists but has only 3 columns (start position, variable name, field length) β€” no variable label or section. Labels come from XPT metadata.
  • 1998: Has a standard PROC FORMAT .sas file β€” value labels are fully parsed.
  • 1990–1997: Value labels come from SAS DATA step sasout files. Three comment styles are handled: /* */ blocks (1990–1993, 1995–1997), and * ... *; star-comment boxes (1994).
  • 1999: The SASOUT99.sas file uses a single multi-variable LABEL block with no per-variable value comments. Data loads correctly but without *_lbl columns.
  • Variable drift: Variables are added and dropped year to year. The unified brfss view fills gaps with NULL. Use variable_labels to check which years a given variable appears in.
  • Format files: 2023–2024 ship .zip format archives; 2000–2022 ship raw .sas format files; 1990–1999 use sasout DATA step programs. All are handled transparently.
  • Dual-frame sampling: 2011 introduced combined landline + cellphone sampling and the _LLCPWT weight. Years 2011–2024 are broadly comparable. Pre-2011 data uses landline-only sampling β€” cross-era comparisons require care.

πŸ“¦ Dependencies

Package Purpose
duckdb Analytical database
pyreadstat XPT file reading
pyarrow Parquet writing
httpx HTTP downloads
beautifulsoup4 HTML layout parsing
rich Progress display
click CLI

Dev extras (for notebooks): uv sync --extra dev


πŸ—ΊοΈ Roadmap

Phase 1 β€” 2011–2024 βœ… Done

Covers the dual-frame (landline + cellphone) era with _LLCPWT weighting.

Phase 2 β€” 2000–2010 βœ… Done

Landline-only era with _FINALWT weighting. All 11 years in the manifest.

Phase 3 β€” 1990–1999 βœ… Done

Early BRFSS with _FINALWT weighting and smaller variable sets. Value labels parsed from SAS DATA step sasout files via regex extraction of embedded comment blocks. 1998 has a proper PROC FORMAT file and is fully labelled.

Other improvements

  • Weighted analysis helpers β€” thin wrappers that apply _LLCPWT by default for common aggregations
  • Optional materialization β€” --materialize flag to copy views into real DuckDB tables for environments where Parquet paths change
  • GitHub Actions CI β€” test the pipeline against a single year on each push
  • Validation checks β€” compare loaded row counts against expected counts in years.json
  • Example notebooks β€” Jupyter notebooks for common analyses (prevalence trends, state maps, weighted estimates)
  • Published artifact β€” Parquet files hosted at hesscl/quackrfss on Hugging Face; query any year without building

πŸ“„ License

Pipeline code: MIT. BRFSS data is public domain (CDC / US Government).

About

πŸ¦† BRFSS survey data (1990–2024) β†’ DuckDB. No SAS, no manual downloads, just data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages