Airflow Tutorial - Employee Data Processing

A hands-on Apache Airflow tutorial demonstrating ETL workflows with Docker, PostgreSQL, and modern Airflow 3.0 features.

Overview

This project showcases a real-world ETL pipeline that:

  • Fetches employee data from a remote CSV source
  • Loads data into a PostgreSQL database
  • Performs upsert operations to handle data updates
  • Demonstrates Airflow task orchestration and dependency management
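
The same flow in miniature, using only the Python standard library: SQLite stands in for PostgreSQL and an inline string stands in for the remote CSV. This is a sketch of the idea, not the project's actual code (the real DAG targets PostgreSQL, whose upsert syntax differs slightly):

```python
# Miniature ETL: parse CSV -> stage in a temp table -> upsert into the
# main table. SQLite stands in for PostgreSQL here.
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("CREATE TABLE employees_temp (id INTEGER, name TEXT, salary INTEGER)")

MERGE_SQL = """
    INSERT INTO employees
    SELECT DISTINCT id, name, salary FROM employees_temp
    WHERE true  -- SQLite needs this clause to disambiguate the upsert syntax
    ON CONFLICT (id) DO UPDATE SET name = excluded.name, salary = excluded.salary
"""

def run_pipeline(csv_text: str) -> None:
    """Fetch (here: parse) the CSV, stage it, then merge into employees."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute("DELETE FROM employees_temp")
    conn.executemany("INSERT INTO employees_temp VALUES (:id, :name, :salary)", rows)
    conn.execute(MERGE_SQL)

# First delivery contains an exact duplicate row; DISTINCT drops it.
run_pipeline("id,name,salary\n1,Alice,100\n1,Alice,100\n2,Bob,90\n")
# Second delivery re-sends id=1 with a new salary; ON CONFLICT updates it.
run_pipeline("id,name,salary\n1,Alice,110\n")
print(conn.execute("SELECT id, name, salary FROM employees ORDER BY id").fetchall())
# → [(1, 'Alice', 110), (2, 'Bob', 90)]
```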

Architecture

  • Airflow 3.0.3 with CeleryExecutor
  • PostgreSQL 13 for metadata and data storage
  • Redis 7.2 as message broker
  • Docker Compose for container orchestration

Services include:

  • API Server (port 8080)
  • Scheduler
  • DAG Processor
  • Celery Worker
  • Triggerer
  • Flower (monitoring, optional)

Prerequisites

  • Docker (20.10+)
  • Docker Compose (2.0+)
  • At least 4GB RAM
  • At least 2 CPU cores
  • 10GB free disk space

Quick Start

1. Clone and Setup

git clone https://github.com/czubi1928/de_airflow_tutorial.git
cd de_airflow_tutorial

2. Create Environment File

Create a .env file in the project root (optional, defaults are provided):

AIRFLOW_UID=50000
AIRFLOW_PROJ_DIR=.
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow

3. Start Services

docker-compose up -d

4. Access Airflow UI

Open your browser and navigate to: http://localhost:8080

Default credentials:

  • Username: airflow
  • Password: airflow

5. Configure Database Connection

Before running the DAG, set up the PostgreSQL connection:

  1. Go to Admin → Connections
  2. Click + to add a new connection
  3. Set the following:
    • Connection Id: tutorial_pg_conn
    • Connection Type: Postgres
    • Host: postgres
    • Schema: airflow
    • Login: airflow
    • Password: airflow
    • Port: 5432
  4. Click Save
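
Alternatively, Airflow reads connections from environment variables named AIRFLOW_CONN_<CONN_ID> in URI form, which avoids the manual UI step (set it under environment: in docker-compose.yaml). A small sketch assembling the URI for the values above; percent-encoding matters if credentials contain special characters:

```python
# Build an Airflow connection URI (scheme = connection type).
from urllib.parse import quote

def airflow_conn_uri(conn_type, login, password, host, port, schema):
    return f"{conn_type}://{quote(login)}:{quote(password)}@{host}:{port}/{schema}"

uri = airflow_conn_uri("postgres", "airflow", "airflow", "postgres", 5432, "airflow")
print(f"AIRFLOW_CONN_TUTORIAL_PG_CONN={uri}")
# → AIRFLOW_CONN_TUTORIAL_PG_CONN=postgres://airflow:airflow@postgres:5432/airflow
```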

6. Run the DAG

  1. Navigate to the DAGs page
  2. Find process_employees
  3. Toggle it ON
  4. Click the Play button to trigger manually

Project Structure

de_airflow_tutorial/
├── dags/                      # Airflow DAG definitions
│   ├── process_employees.py   # Main ETL pipeline
│   └── files/                 # Data files directory
├── logs/                      # Airflow logs
├── plugins/                   # Custom Airflow plugins
├── config/                    # Airflow configuration
├── docker-compose.yaml        # Docker services definition
├── requirements.txt           # Python dependencies
└── README.md                  # This file

DAG Details

process_employees

Schedule: Daily at midnight (0 0 * * *)

Tasks:

  1. create_employees_table - Creates main employees table if not exists
  2. create_employees_temp_table - Creates temporary staging table
  3. get_data - Downloads CSV from GitHub and loads into temp table
  4. merge_data - Merges data from temp to main table using UPSERT
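
In TaskFlow form, the wiring looks roughly like this. This is a schematic, not the project's exact file: it needs an Airflow installation to run, and the table-creation SQL is elided.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(
    dag_id="process_employees",
    schedule="0 0 * * *",          # daily at midnight
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def process_employees():
    # The two CREATE TABLE steps (main and staging tables) run first
    # as SQL tasks; they are omitted from this sketch.

    @task
    def get_data():
        # Download the CSV and COPY it into employees_temp through
        # the tutorial_pg_conn connection.
        ...

    @task
    def merge_data():
        # INSERT DISTINCT rows from employees_temp into employees,
        # resolving key collisions with ON CONFLICT ... DO UPDATE.
        ...

    get_data() >> merge_data()

process_employees()
```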

Data Quality Features:

  • Duplicate removal via DISTINCT
  • Conflict resolution with ON CONFLICT clause
  • Error handling with try-except blocks
  • Alerting on failure via email (when configured)

Monitoring

Airflow UI

  • DAGs: Overview of all workflows
  • Grid View: Task execution timeline
  • Graph: Visual dependency graph
  • Logs: Detailed task logs

Flower (Celery Monitoring)

Enable Flower for Celery worker monitoring:

docker-compose --profile flower up -d

Access at: http://localhost:5555

Common Operations

View Logs

docker-compose logs -f airflow-scheduler
docker-compose logs -f airflow-worker

Stop Services

docker-compose down

Stop and Remove Volumes (Clean Start)

docker-compose down -v

Rebuild After Configuration Changes

docker-compose up -d --build

Execute Airflow CLI Commands

docker-compose run airflow-worker airflow dags list
docker-compose run airflow-worker airflow tasks list process_employees

Troubleshooting

Permission Issues (Linux/Mac)

Set the correct AIRFLOW_UID in .env:

echo "AIRFLOW_UID=$(id -u)" > .env

Connection Refused Errors

Ensure PostgreSQL is healthy:

docker-compose ps postgres
docker-compose logs postgres

DAG Not Appearing

  1. Check for syntax errors: docker-compose run airflow-worker airflow dags list-import-errors
  2. Verify file is in dags/ directory
  3. Wait 30-60 seconds for DAG processor to detect changes
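
A DAG file with a syntax error never reaches the UI. Besides the CLI check above, you can byte-compile the file locally with no Airflow install; a small helper (the broken-file demo names are illustrative):

```python
# Quick syntax check for a DAG file: byte-compile it and report success.
import pathlib
import py_compile
import tempfile

def dag_file_parses(path: str) -> bool:
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False

# Demo with a deliberately broken file:
broken = pathlib.Path(tempfile.mkdtemp()) / "bad_dag.py"
broken.write_text("def task(:\n")      # invalid syntax
print(dag_file_parses(str(broken)))    # → False
```

Run it against your real file with dag_file_parses("dags/process_employees.py"). Note this only catches syntax errors; import-time failures (missing packages, bad top-level code) still require the list-import-errors check.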

Memory/Resource Issues

Increase Docker resources in Docker Desktop settings:

  • Memory: 4GB minimum, 8GB recommended
  • CPUs: 2 minimum, 4 recommended

Extending the Tutorial

Add New DAGs

  1. Create Python file in dags/ directory
  2. Define DAG using @dag decorator or DAG() class
  3. Wait for automatic detection or restart scheduler

Add Custom Operators

Place custom operators in plugins/ directory.

Install Additional Dependencies

For quick tests:

docker-compose down
# Edit docker-compose.yaml: _PIP_ADDITIONAL_REQUIREMENTS
docker-compose up -d

For production: Build custom Docker image (see Airflow docs).

Configure Email Alerts

Add to config/airflow.cfg or environment variables:

AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com
AIRFLOW__SMTP__SMTP_USER: your-email@gmail.com
AIRFLOW__SMTP__SMTP_PASSWORD: your-app-password
AIRFLOW__SMTP__SMTP_PORT: 587
AIRFLOW__SMTP__SMTP_MAIL_FROM: your-email@gmail.com

Then add to DAG:

default_args={
    'email': ['your-email@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
}

Security Notes

⚠️ This setup is for DEVELOPMENT/LEARNING ONLY

For production:

  • Change default passwords
  • Use secrets management (e.g., AWS Secrets Manager, HashiCorp Vault)
  • Enable SSL/TLS
  • Configure proper authentication (LDAP, OAuth)
  • Set AIRFLOW__CORE__FERNET_KEY
  • Use environment-specific configuration

License

This tutorial project is open source for educational purposes.
