A fully automated data engineering pipeline that scrapes daily product prices from Daraz.pk across 5 categories and loads them into Google BigQuery for price trend analysis.
Daraz.pk API → Scraper → Transform → Quality Check → BigQuery
↑
Airflow (Docker)
Runs daily @ 2AM
- Scrapes ~600 products daily across 5 categories from Daraz's internal JSON API
- Cleans and standardizes raw data (price types, URL fixes, deduplication)
- Runs data quality checks before loading (invalid prices, missing fields, duplicates)
- Loads clean data into BigQuery partitioned by date — building a historical price dataset over time
- Orchestrated by Apache Airflow running in Docker — fully automated, no manual intervention
| Category | Products/Day |
|---|---|
| Mobile Phones | ~120 |
| Laptops | ~120 |
| Home Appliances | ~120 |
| Men's Fashion | ~120 |
| Groceries | ~120 |
| Tool | Purpose |
|---|---|
| Python 3.11 | Core pipeline language |
| requests + fake-useragent | Web scraping |
| pandas + pandas-gbq | Data transformation and BigQuery loading |
| Google BigQuery | Data warehouse |
| Apache Airflow | Pipeline orchestration |
| Docker | Containerized Airflow environment |
daraz-intel/
├── pipeline/
│ ├── scraper.py # Hits Daraz internal API, extracts product data
│ ├── transform.py # Cleans types, fixes URLs, deduplicates
│ ├── quality.py # Validates data before loading
│ ├── loader.py # Loads to BigQuery via pandas-gbq
│ └── logger.py # Centralized logging to logs/pipeline.log
├── dags/
│ └── pipeline_dag.py # Airflow DAG — scrape → transform → quality → load
├── logs/
│ └── pipeline.log # Full pipeline run logs
├── docker-compose.yml # Airflow cluster setup
└── requirements.txt # Python dependencies
Table: daraz-intel.daraz_intel.products
Partitioned by: scraped_date
| Column | Type | Description |
|---|---|---|
| product_id | STRING | Unique Daraz product ID |
| product_name | STRING | Full product title |
| category | STRING | One of 5 tracked categories |
| brand | STRING | Brand name |
| seller_name | STRING | Seller on Daraz |
| price_pkr | INTEGER | Current price in PKR |
| original_price | INTEGER | Original price before discount |
| discount_pct | FLOAT | Discount percentage |
| rating | FLOAT | Product rating (0-5) |
| review_count | INTEGER | Number of reviews |
| in_stock | BOOLEAN | Stock availability |
| is_price_valid | BOOLEAN | False if price data is corrupted |
| scraped_date | DATE | Date of scrape (partition key) |
- Python 3.11
- Docker Desktop
- Google Cloud account with BigQuery enabled
git clone https://github.com/yourusername/daraz-intel.git
cd daraz-intel
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt- Create a Google Cloud project named
daraz-intel - Enable the BigQuery API
- Create a dataset named
daraz_intel - Create a service account with
BigQuery Data EditorandBigQuery Job Userroles - Download the JSON key and save as
gcp-key.jsonin the project root
python -m pipeline.loaderThis runs the full pipeline once — scrape → transform → quality → load.
# Start Airflow
docker compose up -d
# Open UI at http://localhost:8080
# Login: airflow / airflow
# Trigger the daraz_intel_pipeline DAG
# Stop Airflow when done
docker compose downThe pipeline flags but does not drop bad records. Every record loaded to BigQuery has an is_price_valid column:
TRUE— price data is reliableFALSE— price exceeds original price (corrupted Daraz API data)
Filter in BigQuery:
SELECT * FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUEAverage price by category today
SELECT category, ROUND(AVG(price_pkr), 0) as avg_price
FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE
GROUP BY category
ORDER BY avg_price DESCPrice trend for a specific product
SELECT scraped_date, price_pkr, discount_pct
FROM `daraz-intel.daraz_intel.products`
WHERE product_id = '926915234'
AND is_price_valid = TRUE
ORDER BY scraped_dateTop discounted products today
SELECT product_name, category, price_pkr, original_price, discount_pct
FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE
ORDER BY discount_pct DESC
LIMIT 20Built by Aryan — data engineering portfolio project.