joaoascenso02/envpipe

EnvPipe

EnvPipe is a data product for environmental data analysis, built with Databricks.
It focuses on air quality, weather, and health data, enabling end-to-end data processing from ingestion to insight-ready outputs.

Developed during my 2025 summer internship at NTT Data.

Architecture

EnvPipe combines a hub-and-spoke topology with the medallion architecture:

  • Bronze layer (spokes): ingestion of raw data from multiple domains.
  • Silver layer (hub): integrated and standardized datasets, following the Inmon approach.
  • Gold layer (spokes): insight-ready data products for analytics and health risk assessment.
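The flow through the three layers can be illustrated with a minimal sketch in plain Python. This is not the project's code: the actual notebooks run on Databricks, and the sample records, field names, functions (`to_silver`, `to_gold`), and the PM2.5 threshold below are all hypothetical, chosen only to show how raw spoke data is integrated in the hub and then reshaped into an insight-ready output.

```python
from collections import defaultdict
from statistics import mean

# Bronze (spokes): raw records as ingested, one list per source domain.
# Hypothetical sample data; values are still untyped strings.
bronze_air = [
    {"station": "A", "hour": "2025-07-01T10:00", "pm25": "12.5"},
    {"station": "A", "hour": "2025-07-01T11:00", "pm25": "40.0"},
]
bronze_weather = [
    {"station": "A", "hour": "2025-07-01T10:00", "temp_c": "24.1"},
    {"station": "A", "hour": "2025-07-01T11:00", "temp_c": "26.3"},
]


def to_silver(air, weather):
    """Silver (hub): join both domains on (station, hour) and cast to floats."""
    weather_by_key = {(w["station"], w["hour"]): w for w in weather}
    silver_rows = []
    for a in air:
        w = weather_by_key.get((a["station"], a["hour"]))
        if w is None:
            continue  # inner join: skip air readings without matching weather
        silver_rows.append({
            "station": a["station"],
            "hour": a["hour"],
            "pm25": float(a["pm25"]),
            "temp_c": float(w["temp_c"]),
        })
    return silver_rows


def to_gold(silver_rows, pm25_threshold=35.0):
    """Gold (spokes): per-station summary with an illustrative risk flag.

    The threshold is an assumption for this sketch, not a project setting.
    """
    pm25_by_station = defaultdict(list)
    for row in silver_rows:
        pm25_by_station[row["station"]].append(row["pm25"])
    return {
        station: {
            "mean_pm25": mean(values),
            "elevated": mean(values) > pm25_threshold,
        }
        for station, values in pm25_by_station.items()
    }


silver = to_silver(bronze_air, bronze_weather)
gold = to_gold(silver)
```

Keeping integration in a single Silver step means every Gold output reads from the same standardized tables, which is the main point of placing the hub in the middle.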

Data Sources

Setup

  1. Run scripts/catalog_setup.ipynb and follow the instructions in it.
  2. Run the setup pipeline once: this ingests historical data, prepares Silver tables, and trains prediction models.
  3. Schedule the forecast pipeline to run every hour: this ingests forecast data, generates predictions, and updates insights.
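The run order described in the steps above can be sketched as follows. This is only an illustration in plain Python: the step names and the helpers `run_order` and `next_hourly_run` are hypothetical, and on Databricks the hourly schedule would normally be configured on the job itself rather than computed in code.

```python
from datetime import datetime, timedelta

# Hypothetical step names mirroring the setup instructions.
SETUP_STEPS = [
    "ingest_historical_data",
    "prepare_silver_tables",
    "train_prediction_models",
]
FORECAST_STEPS = [
    "ingest_forecast_data",
    "generate_predictions",
    "update_insights",
]


def run_order(hours):
    """Return the steps executed when setup runs once, then forecast runs hourly."""
    order = list(SETUP_STEPS)
    for _ in range(hours):
        order.extend(FORECAST_STEPS)
    return order


def next_hourly_run(now):
    """Return the start of the next full hour after `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


if __name__ == "__main__":
    for step in run_order(hours=2):
        print(step)
    print(next_hourly_run(datetime(2025, 7, 1, 10, 17)))
```

The setup steps appear exactly once at the start because the forecast pipeline depends on the Silver tables and trained models they produce.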

Pipeline configuration files (.yml) are stored under the pipeline-config/ folder to ensure reproducibility. Each notebook also includes descriptions/explanations that document design choices and logic. Mermaid diagrams for each data layer and the pipelines are available in the diagrams/ folder.

Progress

  • Understand the data - done
  • Catalog setup - done
  • Bronze layer (data ingestion) - done
  • Silver layer (joined data, training set, models, forecasts) - done
  • Gold layer (feature importance, pollution patterns, health risk) - done
  • Setup and Forecast pipelines - done (next step: DLT integration)
  • Dashboards for insights - done

EnvPipe delivers an end-to-end data product: its pipelines ingest, integrate, and enrich environmental data into insight-ready outputs, which are accessible in a dashboard.

About

Connecting air quality and weather to health risks with ML and Databricks
