A hands-on Apache Airflow tutorial demonstrating ETL workflows with Docker, PostgreSQL, and modern Airflow 3.0 features.
This project showcases a real-world ETL pipeline that:
- Fetches employee data from a remote CSV source
- Loads data into a PostgreSQL database
- Performs upsert operations to handle data updates
- Demonstrates Airflow task orchestration and dependency management
The stack:
- Airflow 3.0.3 with CeleryExecutor
- PostgreSQL 13 for metadata and data storage
- Redis 7.2 as message broker
- Docker Compose for container orchestration
Services include:
- API Server (port 8080)
- Scheduler
- DAG Processor
- Celery Worker
- Triggerer
- Flower (monitoring, optional)
Prerequisites:
- Docker (20.10+)
- Docker Compose (2.0+)
- At least 4GB RAM
- At least 2 CPU cores
- 10GB free disk space
Enter the project directory:

```bash
cd de_airflow_tutorial
```

Create a `.env` file in the project root (optional, defaults are provided):

```
AIRFLOW_UID=50000
AIRFLOW_PROJ_DIR=.
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
```

Start the services:

```bash
docker-compose up -d
```

Open your browser and navigate to: http://localhost:8080
Default credentials:
- Username: `airflow`
- Password: `airflow`
Before running the DAG, set up the PostgreSQL connection:
- Go to Admin → Connections
- Click + to add a new connection
- Set the following:
  - Connection Id: `tutorial_pg_conn`
  - Connection Type: `Postgres`
  - Host: `postgres`
  - Schema: `airflow`
  - Login: `airflow`
  - Password: `airflow`
  - Port: `5432`
- Click Save
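As an alternative to the UI, Airflow also resolves environment variables named `AIRFLOW_CONN_<CONN_ID>` into connections. A sketch mirroring the values above (in this project you would add the variable to the compose file's `environment` block rather than `export`ing it, so every Airflow service sees it):

```shell
# Define tutorial_pg_conn without the UI: Airflow parses
# AIRFLOW_CONN_<CONN_ID> (ID uppercased) as a connection URI.
export AIRFLOW_CONN_TUTORIAL_PG_CONN='postgres://airflow:airflow@postgres:5432/airflow'
```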
- Navigate to the DAGs page
- Find `process_employees`
- Toggle it ON
- Click the Play button to trigger manually
```
de_airflow_tutorial/
├── dags/                     # Airflow DAG definitions
│   ├── process_employees.py  # Main ETL pipeline
│   └── files/                # Data files directory
├── logs/                     # Airflow logs
├── plugins/                  # Custom Airflow plugins
├── config/                   # Airflow configuration
├── docker-compose.yaml       # Docker services definition
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
Schedule: Daily at midnight (0 0 * * *)
Tasks:
- `create_employees_table` - Creates the main employees table if it does not exist
- `create_employees_temp_table` - Creates a temporary staging table
- `get_data` - Downloads the CSV from GitHub and loads it into the temp table
- `merge_data` - Merges data from the temp table into the main table using UPSERT
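As a sketch of what the merge step can look like, the snippet below de-duplicates with `DISTINCT`, upserts with `ON CONFLICT`, and wraps execution in try-except. The table and column names are assumptions based on the employees CSV, and `run_query` is a stand-in for whatever database client the task actually uses (e.g. a Postgres hook):

```python
# Hedged sketch of the merge step. Table/column names are assumed
# from the employees CSV; adjust them to your schema.
MERGE_SQL = """
INSERT INTO employees
SELECT * FROM (SELECT DISTINCT * FROM employees_temp) t
ON CONFLICT ("Serial Number") DO UPDATE
SET "Serial Number" = excluded."Serial Number";
"""

def merge_data(run_query) -> int:
    """Run the UPSERT via the supplied query runner; 0 on success, 1 on failure."""
    try:
        run_query(MERGE_SQL)
        return 0
    except Exception:
        return 1
```

Returning 0/1 keeps the task's outcome explicit; alternatively, let the exception propagate so Airflow marks the task failed and fires the email alert.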
Data Quality Features:
- Duplicate removal via DISTINCT
- Conflict resolution with ON CONFLICT clause
- Error handling with try-except blocks
- Alerting on failure via email (when configured)
- DAGs: Overview of all workflows
- Grid View: Task execution timeline
- Graph: Visual dependency graph
- Logs: Detailed task logs
Enable Flower for Celery worker monitoring:
```bash
docker-compose --profile flower up -d
```

Access at: http://localhost:5555
View logs:

```bash
docker-compose logs -f airflow-scheduler
docker-compose logs -f airflow-worker
```

Stop all services:

```bash
docker-compose down
```

Stop services and remove volumes:

```bash
docker-compose down -v
```

Rebuild and restart:

```bash
docker-compose up -d --build
```

List DAGs and their tasks:

```bash
docker-compose run airflow-worker airflow dags list
docker-compose run airflow-worker airflow tasks list process_employees
```

Set the correct AIRFLOW_UID in .env:

```bash
echo "AIRFLOW_UID=$(id -u)" > .env
```

Ensure PostgreSQL is healthy:

```bash
docker-compose ps postgres
docker-compose logs postgres
```

If a DAG does not appear:

- Check for syntax errors:

  ```bash
  docker-compose run airflow-worker airflow dags list-import-errors
  ```

- Verify the file is in the `dags/` directory
- Wait 30-60 seconds for the DAG processor to detect changes
Increase Docker resources in Docker Desktop settings:
- Memory: 4GB minimum, 8GB recommended
- CPUs: 2 minimum, 4 recommended
- Create a Python file in the `dags/` directory
- Define the DAG using the `@dag` decorator or the `DAG()` class
- Wait for automatic detection or restart the scheduler
Place custom operators in the `plugins/` directory.
For quick tests:
```bash
docker-compose down
# Edit docker-compose.yaml: _PIP_ADDITIONAL_REQUIREMENTS
docker-compose up -d
```

For production: build a custom Docker image (see the Airflow docs).
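In the stock compose file the variable sits in the shared `x-airflow-common` environment block; setting it looks like this (the pinned package is only an example):

```yaml
# docker-compose.yaml, under x-airflow-common -> environment
_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-pandas==2.2.2}
```

The packages are pip-installed every time a container starts, which slows startup; that is why a custom image is the production route.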
Add to config/airflow.cfg or set environment variables:

```
AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com
AIRFLOW__SMTP__SMTP_USER: your-email@gmail.com
AIRFLOW__SMTP__SMTP_PASSWORD: your-app-password
AIRFLOW__SMTP__SMTP_PORT: 587
AIRFLOW__SMTP__SMTP_MAIL_FROM: your-email@gmail.com
```

Then add to the DAG:
```python
default_args={
    'email': ['your-email@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
}
```

For production:
- Change default passwords
- Use secrets management (e.g., AWS Secrets Manager, HashiCorp Vault)
- Enable SSL/TLS
- Configure proper authentication (LDAP, OAuth)
- Set AIRFLOW__CORE__FERNET_KEY
- Use environment-specific configuration
This tutorial project is open source for educational purposes.