An R Markdown template for automated data consistency analysis of household surveys. One .Rmd file reads the raw database, verifies structural integrity, checks skip logic, flags missing values, and produces a PDF report with a formal ruling — all reactive to the data. When the database gets corrected and resubmitted, you re-run the document and every finding, count, and paragraph updates automatically.
Data consistency analysis for household surveys usually happens in a disconnected workflow: someone reviews the database in R or Excel, writes findings in a Word document, and sends it to the data provider. When the corrected database comes back, the entire review has to be repeated manually. This template eliminates that manual loop.
The system has three layers:
A global list that collects every finding classified by severity (critical, medium, minor) as the document generates:
add_incidence <- function(severity, section, variable, description, n = NA) {
new_row <- data.frame(Section = section, Variable = variable,
Description = description, N_Affected = n,
stringsAsFactors = FALSE)
if(severity == "critical") incidences$critical <<- rbind(incidences$critical, new_row)
else if(severity == "medium") incidences$medium <<- rbind(incidences$medium, new_row)
else incidences$minor <<- rbind(incidences$minor, new_row)
}Each survey module has its own code chunk that evaluates the instrument's rules and calls add_incidence() when it finds a problem. This includes duplicate detection, household composition rules, missing value checks, and skip logic validation across all six modules (roster, dwelling, income, food consumption, health, education).
Below each verification, the report text uses inline R to write the result automatically. If there are no errors, the paragraph says everything is in order; if there are, it reports the exact count and severity:
**Result:** `r ifelse(n_error_rent == 0,
"All records comply with the skip rule.",
paste("**MEDIUM:**", n_error_rent, "households that own their dwelling reported a monthly rent."))`At the end, the ruling generates automatically based on accumulated incidences:
ruling <- if(n_critical > 0) "REJECTED"
else if(n_medium > 0 | n_minor > 0) "ACCEPTED WITH OBSERVATIONS"
else "ACCEPTED"The sample survey has six modules. The template verifies:
Roster
- Duplicate person IDs
- Household composition — missing head of household, multiple heads, head under 15
- Functional screening completeness (four domains)
Dwelling
- Duplicate household IDs
- Missing values in mandatory fields (wall, roof, floor materials, rooms, water source, cooking fuel)
- Skip logic — monthly rent reported by owners, electricity hours reported by households without electricity, shared toilet reported by households without a toilet
Income
- Skip logic — main activity reported by persons who did not work, remittance amount reported by non-recipients
Food consumption
- Skip logic across ten food groups — days consumed and source reported for food groups the household did not consume
Health
- Skip logic — treatment-seeking reported by persons who were not ill, coverage type reported by persons without health coverage
Education
- Skip logic — school type reported by persons not currently enrolled
- Clone this repo
- Run
data/generate_sample_data.Rto create the sample dataset (or replace it with your own) - Knit
consistency_check.Rmd
source("data/generate_sample_data.R")
rmarkdown::render("consistency_check.Rmd")The sample dataset includes deliberately injected errors (duplicates, missing values, skip logic violations across all modules) so you can see the system in action.
The template is designed to be extended. To add a new verification:
- Add a code chunk that identifies the problem
- Call
add_incidence()with the appropriate severity - Add an inline R paragraph below the chunk that reports the result reactively
The accumulator collects everything, and the final ruling updates automatically.
- R >= 4.0
- Packages:
readxl,tidyverse,knitr,kableExtra,openxlsx(for sample data generation) - A LaTeX distribution for PDF output (or change the output to
html_document)
Camila Bidó Cuello — Bidó Social Science Consulting
MIT