survey-consistency-check

An R Markdown template for automated data consistency analysis of household surveys. One .Rmd file reads the raw database, verifies structural integrity, checks skip logic, flags missing values, and produces a PDF report with a formal ruling — all reactive to the data. When the database gets corrected and resubmitted, you re-run the document and every finding, count, and paragraph updates automatically.

The problem

Data consistency analysis for household surveys usually happens in a disconnected workflow: someone reviews the database in R or Excel, writes findings in a Word document, and sends it to the data provider. When the corrected database comes back, the entire review has to be repeated manually. This template eliminates that manual loop.

How it works

The system has three layers:

1. Incidence accumulator

A global list that collects every finding classified by severity (critical, medium, minor) as the document generates:

add_incidence <- function(severity, section, variable, description, n = NA) {
  new_row <- data.frame(Section = section, Variable = variable, 
                        Description = description, N_Affected = n,
                        stringsAsFactors = FALSE)
  if(severity == "critical")    incidences$critical <<- rbind(incidences$critical, new_row)
  else if(severity == "medium") incidences$medium   <<- rbind(incidences$medium, new_row)
  else                          incidences$minor    <<- rbind(incidences$minor, new_row)
}

2. Module-by-module verification blocks

Each survey module has its own code chunk that evaluates the instrument's rules and calls add_incidence() when it finds a problem. This includes duplicate detection, household composition rules, missing value checks, and skip logic validation across all six modules (roster, dwelling, income, food consumption, health, education).

3. Reactive inline text

Below each verification, the report text uses inline R to write the result automatically. If there are no errors, the paragraph says everything is in order; if there are, it reports the exact count and severity:

**Result:** `r ifelse(n_error_rent == 0, 
  "All records comply with the skip rule.", 
  paste("**MEDIUM:**", n_error_rent, "households that own their dwelling reported a monthly rent."))`

At the end, the ruling generates automatically based on accumulated incidences:

ruling <- if(n_critical > 0) "REJECTED" 
          else if(n_medium > 0 | n_minor > 0) "ACCEPTED WITH OBSERVATIONS" 
          else "ACCEPTED"

What the template checks

The sample survey has six modules. The template verifies:

Roster

Duplicate person IDs
Household composition — missing head of household, multiple heads, head under 15
Functional screening completeness (four domains)

Dwelling

Duplicate household IDs
Missing values in mandatory fields (wall, roof, floor materials, rooms, water source, cooking fuel)
Skip logic — monthly rent reported by owners, electricity hours reported by households without electricity, shared toilet reported by households without a toilet

Income

Skip logic — main activity reported by persons who did not work, remittance amount reported by non-recipients

Food consumption

Skip logic across ten food groups — days consumed and source reported for food groups the household did not consume

Health

Skip logic — treatment-seeking reported by persons who were not ill, coverage type reported by persons without health coverage

Education

Skip logic — school type reported by persons not currently enrolled

Quick start

Clone this repo
Run data/generate_sample_data.R to create the sample dataset (or replace it with your own)
Knit consistency_check.Rmd

source("data/generate_sample_data.R")
rmarkdown::render("consistency_check.Rmd")

The sample dataset includes deliberately injected errors (duplicates, missing values, skip logic violations across all modules) so you can see the system in action.

Adapting to your own survey

The template is designed to be extended. To add a new verification:

Add a code chunk that identifies the problem
Call add_incidence() with the appropriate severity
Add an inline R paragraph below the chunk that reports the result reactively

The accumulator collects everything, and the final ruling updates automatically.

Requirements

R >= 4.0
Packages: readxl, tidyverse, knitr, kableExtra, openxlsx (for sample data generation)
A LaTeX distribution for PDF output (or change the output to html_document)

Author

Camila Bidó Cuello — Bidó Social Science Consulting

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
LICENSE		LICENSE
README.md		README.md
consistency_check.Rmd		consistency_check.Rmd
generate_sample_data.R		generate_sample_data.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

survey-consistency-check

The problem

How it works

1. Incidence accumulator

2. Module-by-module verification blocks

3. Reactive inline text

What the template checks

Quick start

Adapting to your own survey

Requirements

Author

License

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

survey-consistency-check

The problem

How it works

1. Incidence accumulator

2. Module-by-module verification blocks

3. Reactive inline text

What the template checks

Quick start

Adapting to your own survey

Requirements

Author

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages