BRFSS survey data → DuckDB. Clone, run one command, start analyzing.
No manual downloads. No SAS. No hours of prep.
git clone https://github.com/hesscl/quackrfss
cd quackrfss
pip install uv && uv sync
quackrfss  # builds brfss.duckdb with 1990–2024 data (~10.1M respondents)

Or query instantly, no build required:
import duckdb
con = duckdb.connect()
con.sql("SELECT GENHLTH_lbl, COUNT(*) FROM read_parquet('hf://datasets/hesscl/quackrfss/data/BRFSS_2024.parquet') GROUP BY 1 ORDER BY 2 DESC").show()

Downloads CDC Behavioral Risk Factor Surveillance System (BRFSS) data, converts it from SAS Transport (XPT) to Parquet, and builds a DuckDB database you can query instantly:
| Object | Type | Description |
|---|---|---|
| brfss | VIEW | All years unified. NULL where a variable was absent in a given year. |
| brfss_2024 … brfss_1990 | VIEW | Per-year views backed directly by Parquet files. |
| variable_labels | TABLE | (var, year, label, section): human name for each variable. |
| value_labels | TABLE | (var, year, value, label): what each numeric code means. |
Every categorical variable gets a *_lbl companion column baked into the Parquet (e.g. GENHLTH_lbl = 'Good' alongside GENHLTH = 3). The .duckdb file is tiny (< 10 MB) because it stores only views and metadata; the Parquet files are the source of truth.
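To make the relationship between a raw code column and the value_labels table concrete, here is a minimal sketch that builds a query pairing each code with its label. The helper names `label_join_sql` and `show_codes` are hypothetical and not part of quackrfss, and the cast on `value` assumes codes may be stored as text in the metadata table; check the actual column types before relying on it.

```python
import re

# Only plain identifiers are interpolated into SQL (variable names such as _STATE).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def label_join_sql(var: str, year: int) -> str:
    """Build a query that counts respondents per raw code and looks up its label."""
    if not _IDENT.match(var):
        raise ValueError(f"not a plain identifier: {var!r}")
    year = int(year)
    return "\n".join([
        f'SELECT t."{var}" AS code, v.label AS label, COUNT(*) AS n',
        f"FROM brfss_{year} AS t",
        "LEFT JOIN value_labels AS v",
        f"  ON v.var = '{var}' AND v.year = {year}",
        # Assumption: codes may be stored as text in value_labels, hence the cast.
        f' AND TRY_CAST(v.value AS DOUBLE) = t."{var}"',
        "GROUP BY 1, 2",
        "ORDER BY 1",
    ])


def show_codes(db_path: str = "brfss.duckdb", var: str = "GENHLTH", year: int = 2024) -> None:
    """Run the query against a locally built database and print the result."""
    import duckdb  # imported lazily so the SQL builder works without duckdb installed

    con = duckdb.connect(db_path, read_only=True)
    con.sql(label_join_sql(var, year)).show()
```

For everyday analysis the baked-in *_lbl columns should make this join unnecessary; the query is mainly useful for auditing codes that have no label.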
All 35 years of Parquet files are published at hesscl/quackrfss on Hugging Face. Query any year directly with DuckDB, with no clone, no build, and no 2.5 GB download:
import duckdb
con = duckdb.connect()
# Single year
con.sql("SELECT * FROM read_parquet('hf://datasets/hesscl/quackrfss/data/BRFSS_2024.parquet') LIMIT 5").show()
# All years at once
con.sql("SELECT YEAR, COUNT(*) FROM read_parquet('hf://datasets/hesscl/quackrfss/data/BRFSS_*.parquet') GROUP BY 1 ORDER BY 1").show()

# Install dependencies (Python 3.11+)
pip install uv
uv sync
# Full build: all years, all stages
quackrfss
# Just one year to try it out
quackrfss --years 2024
# Specific years, already have raw files
quackrfss --years 2022 2023 --skip-download
# Re-run from scratch
quackrfss --force

Then query:
import duckdb
con = duckdb.connect("brfss.duckdb")
# Poor/fair health by state and year
con.sql("""
SELECT
_STATE_lbl AS state,
YEAR,
ROUND(100.0 * COUNT(*) FILTER (WHERE GENHLTH_lbl IN ('Fair', 'Poor'))
/ COUNT(*), 1) AS pct_fair_poor
FROM brfss
WHERE GENHLTH_lbl IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 20
""").show()

# DuckDB CLI
duckdb brfss.duckdb
D SELECT GENHLTH_lbl, COUNT(*) FROM brfss_2024 GROUP BY 1 ORDER BY 2 DESC;

download  →  parse_layout + parse_formats + parse_sasout  →  load  →  schema
(XPT ZIP)    (HTML layouts, SAS formats, sasout files)       (Parquet)  (DuckDB)
Each stage is idempotent: re-running skips work that's already done unless --force is passed.
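The skip-unless-forced behaviour can be sketched as follows. This is an illustration of the general pattern, not the project's actual implementation; `run_stage` and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def run_stage(output: Path, build: Callable[[Path], None], force: bool = False) -> bool:
    """Run build() unless the output already exists. Returns True if work was done."""
    if output.exists() and not force:
        return False  # already built: skip
    output.parent.mkdir(parents=True, exist_ok=True)
    tmp = output.with_name(output.name + ".tmp")
    build(tmp)           # write to a temporary path first
    tmp.replace(output)  # rename into place so a crash never leaves a partial output
    return True
```

Writing to a temporary file and renaming it keeps the existence check trustworthy: a file at the final path implies the stage finished.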
# Individual stages
python -m scripts.download --years 2024
python -m scripts.parse_layout --years 2024 # writes metadata/layouts/
python -m scripts.parse_formats --years 2024 # writes metadata/labels/ (2000+)
python -m scripts.parse_sasout --years 1995 # writes metadata/labels/ (1990–1999)
python -m scripts.load --years 2024 # writes data/parquet/
python -m scripts.schema # (re)builds brfss.duckdb

-- What does a variable name mean?
SELECT * FROM variable_labels WHERE var = 'MENTHLTH';
-- What do the numeric codes mean?
SELECT * FROM value_labels WHERE var = 'GENHLTH' AND year = 2024 ORDER BY value::INT;
-- Find variables related to diabetes across all years
SELECT DISTINCT var, label FROM variable_labels WHERE lower(label) LIKE '%diabet%';

BRFSS uses complex sampling. For population-level estimates use the appropriate weight for your year range:
- 2011–2024: _LLCPWT (dual-frame landline + cellphone)
- 1990–2010: _FINALWT (landline only)
df = con.execute("""
SELECT _STATE_lbl AS state, GENHLTH, _LLCPWT AS weight
FROM brfss_2024
WHERE GENHLTH < 7
""").df()
# Then use a survey-weighted analysis package (e.g. samplics, weightedstats)

Cross-era comparisons (pre- vs post-2011) require care due to the sampling frame change.
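As a rough illustration of the weighting rules above, the sketch below picks the weight column for a year and computes a weighted share from (value, weight) pairs. The function names are hypothetical, and the example assumes GENHLTH codes 4 and 5 mean Fair and Poor; confirm this against value_labels. It yields a point estimate only; standard errors need a design-aware survey package.

```python
from typing import Callable, Iterable, Optional, Tuple


def weight_column(year: int) -> str:
    """Return the weight variable named in this README for a given survey year."""
    if 2011 <= year <= 2024:
        return "_LLCPWT"   # dual-frame landline + cellphone
    if 1990 <= year <= 2010:
        return "_FINALWT"  # landline only
    raise ValueError(f"year outside 1990-2024: {year}")


def weighted_share(
    rows: Iterable[Tuple[Optional[float], Optional[float]]],
    is_hit: Callable[[float], bool],
) -> float:
    """Weighted proportion of rows whose value satisfies is_hit."""
    total = 0.0
    hit = 0.0
    for value, weight in rows:
        if value is None or weight is None:
            continue  # skip respondents missing the value or the weight
        total += weight
        if is_hit(value):
            hit += weight
    if total == 0:
        raise ValueError("no weighted observations")
    return hit / total


# Toy data, not real BRFSS records: (GENHLTH code, weight)
toy = [(1, 100.0), (4, 50.0), (5, 50.0), (3, 200.0)]
share = weighted_share(toy, lambda code: code in (4, 5))  # 100 / 400
```

With a real query, the rows could come from `con.execute(...).fetchall()` selecting the variable and the column returned by `weight_column(year)`.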
| Artifact | Approx. size |
|---|---|
| Raw XPT ZIPs (35 years) | ~2.5 GB |
| Parquet files (35 years) | ~850 MB |
| brfss.duckdb | < 10 MB (views + metadata only) |
data/ and brfss.duckdb are gitignored, so everyone builds from source. Alternatively, the Parquet files are hosted publicly on Hugging Face and can be queried directly without any local build.
- 2020: COVID-19 forced telephone-only collection and a lower response rate. The brfss_2020 view is included; treat cross-year comparisons carefully.
- 2011: No variable layout HTML is available from CDC (PDF only). Variable labels are sourced from XPT metadata instead; value labels are unaffected.
- 2000–2005: No variable layout HTML available (PDF only). Variable labels come from XPT metadata; value labels from the .sas format file.
- 2006–2010: Layout HTML exists but has only 3 columns (start position, variable name, field length), with no variable label or section. Labels come from XPT metadata.
- 1998: Has a standard PROC FORMAT .sas file, so value labels are fully parsed.
- 1990–1997: Value labels come from SAS DATA step sasout files. Three comment styles are handled: /* */ blocks (1990–1993, 1995–1997), and * ... *; star-comment boxes (1994).
- 1999: The SASOUT99.sas file uses a single multi-variable LABEL block with no per-variable value comments. Data loads correctly but without *_lbl columns.
- Variable drift: Variables are added and dropped year to year. The unified brfss view fills gaps with NULL. Use variable_labels to check which years a given variable appears in.
- Format files: 2023–2024 ship .zip format archives; 2000–2022 ship raw .sas format files; 1990–1999 use sasout DATA step programs. All are handled transparently.
- Dual-frame sampling: 2011 introduced combined landline + cellphone sampling and the _LLCPWT weight. Years 2011–2024 are broadly comparable. Pre-2011 data uses landline-only sampling, so cross-era comparisons require care.
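For the variable-drift caveat, one way to see which years a variable covers is to pull its years from variable_labels and collapse them into contiguous runs. This is a sketch under the assumption that `year` is stored as an integer-like column; `year_ranges` and `variable_coverage` are hypothetical helpers, not part of the package.

```python
from typing import Iterable, List, Tuple


def year_ranges(years: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse years into inclusive (start, end) runs of consecutive years."""
    runs: List[Tuple[int, int]] = []
    for y in sorted(set(years)):
        if runs and y == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], y)  # extend the current run
        else:
            runs.append((y, y))          # start a new run
    return runs


def variable_coverage(con, var: str) -> List[Tuple[int, int]]:
    """con is an open DuckDB connection to brfss.duckdb."""
    rows = con.execute(
        "SELECT DISTINCT year FROM variable_labels WHERE var = ? ORDER BY year",
        [var],
    ).fetchall()
    return year_ranges(int(r[0]) for r in rows)
```

For example, `variable_coverage(duckdb.connect("brfss.duckdb"), "MENTHLTH")` would return one tuple per unbroken stretch of years, which makes gaps easy to spot before querying the unified view.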
| Package | Purpose |
|---|---|
| duckdb | Analytical database |
| pyreadstat | XPT file reading |
| pyarrow | Parquet writing |
| httpx | HTTP downloads |
| beautifulsoup4 | HTML layout parsing |
| rich | Progress display |
| click | CLI |
Dev extras (for notebooks): uv sync --extra dev
2011–2024: Covers the dual-frame (landline + cellphone) era with _LLCPWT weighting.
2000–2010: Landline-only era with _FINALWT weighting. All 11 years in the manifest.
1990–1999: Early BRFSS with _FINALWT weighting and smaller variable sets. Value labels parsed
from SAS DATA step sasout files via regex extraction of embedded comment blocks.
1998 has a proper PROC FORMAT file and is fully labelled.
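The regex-based extraction mentioned above might look roughly like the sketch below. The comment layout in `SAMPLE` is invented for illustration (a variable name followed by "code = label" lines inside a /* */ block); real sasout files may be laid out differently, and `parse_value_comments` is a hypothetical name rather than the project's parser.

```python
import re
from typing import Dict

# Hypothetical layout, made up for this example; not copied from a CDC file.
SAMPLE = """
/* GENHLTH
     1 = Excellent
     2 = Very good
     3 = Good
*/
"""

BLOCK = re.compile(r"/\*(.*?)\*/", re.DOTALL)                    # body of each /* ... */ block
PAIR = re.compile(r"^\s*(\d+)\s*=\s*(.+?)\s*$", re.MULTILINE)   # "code = label" lines


def parse_value_comments(text: str) -> Dict[str, Dict[int, str]]:
    """Map variable name -> {code: label} for every comment block that has pairs."""
    labels: Dict[str, Dict[int, str]] = {}
    for body in BLOCK.findall(text):
        lines = body.strip().splitlines()
        if not lines:
            continue
        var = lines[0].strip()  # assumed: first line of the block names the variable
        pairs = {int(code): label for code, label in PAIR.findall(body)}
        if var and pairs:
            labels[var] = pairs
    return labels


parsed = parse_value_comments(SAMPLE)
```

A star-comment style such as the 1994 boxes would need its own block pattern; only the /* */ case is sketched here.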
- Weighted analysis helpers: thin wrappers that apply _LLCPWT by default for common aggregations
- Optional materialization: --materialize flag to copy views into real DuckDB tables for environments where Parquet paths change
- GitHub Actions CI: test the pipeline against a single year on each push
- Validation checks: compare loaded row counts against expected counts in years.json
- Example notebooks: Jupyter notebooks for common analyses (prevalence trends, state maps, weighted estimates)
- Published artifact: Parquet files hosted at hesscl/quackrfss on Hugging Face; query any year without building
Pipeline code: MIT. BRFSS data is public domain (CDC / US Government).