EnvPipe is a data product for environmental data analysis, built with Databricks.
It focuses on air quality, weather, and health data, enabling end-to-end data processing from ingestion to insight-ready outputs.
Developed during my 2025 summer internship at NTT Data.
EnvPipe follows a Hub & Spokes topological model with the medallion architecture:
- Bronze layer (spokes): ingestion of raw data from multiple domains.
- Silver layer (hub): integrated and standardized datasets, following the Inmon approach.
- Gold layer (spokes): insight-ready data products for analytics and health risk assessment.
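
The layering above can be illustrated with a minimal plain-Python sketch. This is not the project's actual notebook code (which runs on Databricks); the record fields (`ts`, `station`, `temp_c`, `pm10`), the sample values, and the function names `build_silver` and `build_gold` are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Bronze (spokes): raw records kept per source domain, exactly as ingested.
# All field names and values here are invented for illustration.
bronze = {
    "weather": [
        {"ts": "2025-07-01T10:00", "station": "A", "temp_c": "24.5"},
        {"ts": "2025-07-01T11:00", "station": "A", "temp_c": "26.0"},
    ],
    "air_quality": [
        {"ts": "2025-07-01T10:00", "station": "A", "pm10": "18"},
        {"ts": "2025-07-01T11:00", "station": "A", "pm10": "31"},
    ],
}


def build_silver(bronze_layer):
    """Hub: cast raw strings to numbers and integrate domains on (station, ts)."""
    merged = defaultdict(dict)
    for row in bronze_layer["weather"]:
        merged[(row["station"], row["ts"])]["temp_c"] = float(row["temp_c"])
    for row in bronze_layer["air_quality"]:
        merged[(row["station"], row["ts"])]["pm10"] = float(row["pm10"])
    return [
        {"station": station, "ts": ts, **values}
        for (station, ts), values in sorted(merged.items())
    ]


def build_gold(silver_rows):
    """Spoke: derive one insight-ready summary per station from the hub."""
    by_station = defaultdict(list)
    for row in silver_rows:
        if "pm10" in row:
            by_station[row["station"]].append(row["pm10"])
    return {
        station: {"mean_pm10": mean(values), "max_pm10": max(values)}
        for station, values in by_station.items()
    }


silver = build_silver(bronze)
gold = build_gold(silver)
print(gold)
```

The point of the hub is that every Gold product reads from the same integrated Silver tables rather than from the raw sources directly.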
- Weather data from Open-Meteo APIs:
- Air quality data from QualAR
- Health data from:
- Run and follow the instructions in scripts/catalog_setup.ipynb
- Run the setup pipeline once: this ingests historical data, prepares Silver tables, and trains prediction models
- Schedule the forecast pipeline to run every hour: this ingests forecast data, generates predictions, and updates insights
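
As a rough sketch of the hourly schedule, assuming the forecast pipeline is run as a Databricks job: Databricks job schedules use Quartz cron syntax, so an hourly trigger could be expressed as below. The job name `envpipe-forecast` and the `UTC` timezone are hypothetical placeholders; the actual settings live in the project's pipeline configuration files.

```python
import json

# Quartz cron fields: seconds, minutes, hours, day-of-month, month, day-of-week.
# "0 0 * * * ?" fires at the top of every hour.
HOURLY_QUARTZ_CRON = "0 0 * * * ?"


def hourly_schedule(timezone_id="UTC"):
    """Build the schedule settings for a job that should run once per hour."""
    return {
        "quartz_cron_expression": HOURLY_QUARTZ_CRON,
        "timezone_id": timezone_id,  # assumption: adjust to the data's timezone
        "pause_status": "UNPAUSED",
    }


forecast_job = {
    "name": "envpipe-forecast",  # hypothetical job name
    "schedule": hourly_schedule(),
}

print(json.dumps(forecast_job, indent=2))
```

The setup pipeline needs no schedule, since it is meant to be triggered manually once.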
Pipeline configuration files (.yml) are stored under the pipeline-config/ folder to ensure reproducibility. Each notebook also includes explanations that document its design choices and logic. Mermaid diagrams for each data layer and for the pipelines are available in the diagrams/ folder.
- Understand the data - done
- Catalog setup - done
- Bronze layer (data ingestion) - done
- Silver layer (joined data, training set, models, forecasts) - done
- Gold layer (feature importance, pollution patterns, health risk) - done
- Setup and Forecast pipelines - done (next step: DLT integration)
- Dashboards for insights - done
EnvPipe delivers an end-to-end data product: its pipelines ingest, integrate, and enrich environmental data into insight-ready outputs, which are accessible through a dashboard.