Daraz Intel — Price Intelligence Pipeline

A fully automated data engineering pipeline that scrapes daily product prices from Daraz.pk across 5 categories and loads them into Google BigQuery for price trend analysis.

Architecture

Daraz.pk API → Scraper → Transform → Quality Check → BigQuery
                                                         ↑
                                                    Airflow (Docker)
                                                    Runs daily @ 2AM

What It Does

Scrapes ~600 products daily across 5 categories from Daraz's internal JSON API
Cleans and standardizes raw data (price types, URL fixes, deduplication)
Runs data quality checks before loading (invalid prices, missing fields, duplicates)
Loads clean data into BigQuery partitioned by date — building a historical price dataset over time
Orchestrated by Apache Airflow running in Docker — fully automated, no manual intervention
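
The cleaning step described above can be sketched as follows. This is a minimal, dependency-free illustration, not the code in transform.py (which uses pandas). The input formats ("Rs. 1,299" style price strings, protocol-relative URLs) and the product_url field name are assumptions made for the example.

```python
import math
import re

# Assumption: prices may arrive as numbers or as strings such as "Rs. 1,299".
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def clean_price(raw):
    """Return the price as a whole number of rupees, or None if unparseable."""
    if raw is None or isinstance(raw, bool):
        return None
    if isinstance(raw, (int, float)):
        return int(round(raw)) if math.isfinite(raw) else None
    match = _PRICE_RE.search(str(raw))
    if match is None:
        return None
    return int(round(float(match.group().replace(",", ""))))


def fix_url(url):
    """Turn protocol-relative or site-relative links into absolute https URLs."""
    if not url:
        return None
    url = url.strip()
    if url.startswith("//"):
        return "https:" + url
    if url.startswith("/"):
        return "https://www.daraz.pk" + url
    return url


def transform(records):
    """Clean each record and keep only the first record per product_id."""
    seen = set()
    cleaned = []
    for rec in records:
        pid = str(rec.get("product_id") or "").strip()
        if pid and pid in seen:
            continue  # duplicate of a product already kept
        seen.add(pid)
        cleaned.append({
            **rec,
            "product_id": pid,
            "price_pkr": clean_price(rec.get("price_pkr")),
            # "product_url" is a hypothetical field name used for illustration
            "product_url": fix_url(rec.get("product_url")),
        })
    return cleaned
```

Keeping the first occurrence per product_id is one possible deduplication rule; the actual pipeline may choose differently.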

Categories Tracked

Category	Products/Day
Mobile Phones	~120
Laptops	~120
Home Appliances	~120
Men's Fashion	~120
Groceries	~120

Tech Stack

Tool	Purpose
Python 3.11	Core pipeline language
requests + fake-useragent	Web scraping
pandas + pandas-gbq	Data transformation and BigQuery loading
Google BigQuery	Data warehouse
Apache Airflow	Pipeline orchestration
Docker	Containerized Airflow environment

Project Structure

daraz-intel/
├── pipeline/
│   ├── scraper.py       # Hits Daraz internal API, extracts product data
│   ├── transform.py     # Cleans types, fixes URLs, deduplicates
│   ├── quality.py       # Validates data before loading
│   ├── loader.py        # Loads to BigQuery via pandas-gbq
│   └── logger.py        # Centralized logging to logs/pipeline.log
├── dags/
│   └── pipeline_dag.py  # Airflow DAG — scrape → transform → quality → load
├── logs/
│   └── pipeline.log     # Full pipeline run logs
├── docker-compose.yml   # Airflow cluster setup
└── requirements.txt     # Python dependencies

BigQuery Schema

Table: daraz-intel.daraz_intel.products
Partitioned by: scraped_date

Column	Type	Description
product_id	STRING	Unique Daraz product ID
product_name	STRING	Full product title
category	STRING	One of 5 tracked categories
brand	STRING	Brand name
seller_name	STRING	Seller on Daraz
price_pkr	INTEGER	Current price in PKR
original_price	INTEGER	Original price before discount
discount_pct	FLOAT	Discount percentage
rating	FLOAT	Product rating (0-5)
review_count	INTEGER	Number of reviews
in_stock	BOOLEAN	Stock availability
is_price_valid	BOOLEAN	False if price data is corrupted
scraped_date	DATE	Date of scrape (partition key)
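
The schema above can also be written down in Python and used to sanity-check a record before loading. The sketch below is illustrative only: the list uses the name/type dictionary format that pandas-gbq accepts for its table_schema argument, while the schema_errors helper and the mapping from BigQuery types to Python types are assumptions made for this example, not part of loader.py.

```python
from datetime import date

# Columns and types copied from the schema table above.
TABLE_SCHEMA = [
    {"name": "product_id", "type": "STRING"},
    {"name": "product_name", "type": "STRING"},
    {"name": "category", "type": "STRING"},
    {"name": "brand", "type": "STRING"},
    {"name": "seller_name", "type": "STRING"},
    {"name": "price_pkr", "type": "INTEGER"},
    {"name": "original_price", "type": "INTEGER"},
    {"name": "discount_pct", "type": "FLOAT"},
    {"name": "rating", "type": "FLOAT"},
    {"name": "review_count", "type": "INTEGER"},
    {"name": "in_stock", "type": "BOOLEAN"},
    {"name": "is_price_valid", "type": "BOOLEAN"},
    {"name": "scraped_date", "type": "DATE"},
]

# Assumption: Python types this sketch accepts for each BigQuery type.
_PY_TYPES = {
    "STRING": (str,),
    "INTEGER": (int,),
    "FLOAT": (float, int),
    "BOOLEAN": (bool,),
    "DATE": (date,),
}


def schema_errors(record):
    """Return a list of schema problems for one record (empty if none)."""
    errors = []
    for field in TABLE_SCHEMA:
        name, bq_type = field["name"], field["type"]
        if name not in record:
            errors.append(f"missing column: {name}")
            continue
        value = record[name]
        if value is None:
            continue  # NULLs are allowed in this sketch
        # bool is a subclass of int, so reject it explicitly for numeric columns
        wrong_bool = isinstance(value, bool) and bq_type != "BOOLEAN"
        if wrong_bool or not isinstance(value, _PY_TYPES[bq_type]):
            errors.append(
                f"{name}: expected {bq_type}, got {type(value).__name__}"
            )
    return errors
```

Passing the same list as table_schema to pandas_gbq.to_gbq would keep the loaded column types aligned with this table instead of relying on type inference.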

Setup

Prerequisites

Python 3.11
Docker Desktop
Google Cloud account with BigQuery enabled

Installation

git clone https://github.com/yourusername/daraz-intel.git
cd daraz-intel
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

BigQuery Setup

Create a Google Cloud project named daraz-intel
Enable the BigQuery API
Create a dataset named daraz_intel
Create a service account with BigQuery Data Editor and BigQuery Job User roles
Download the JSON key and save as gcp-key.json in the project root
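
Before the first run it can help to confirm that the downloaded key is in place and looks like a service account key. The following is a small optional check, not part of the pipeline; the function names are made up for this example, and the field list covers only a few of the fields Google includes in service account key files.

```python
import json
from pathlib import Path

# A subset of the fields found in a Google service account key file.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def key_problems(data):
    """Return a list of problems with an already-parsed key file."""
    if not isinstance(data, dict):
        return ["key file does not contain a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if "type" in data and data["type"] != "service_account":
        problems.append(f"unexpected key type: {data['type']!r}")
    return problems


def check_key_file(path="gcp-key.json"):
    """Load the key file at `path` and return a list of problems (empty if OK)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"{path} not found"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    return key_problems(data)


if __name__ == "__main__":
    for problem in check_key_file():
        print("gcp-key.json:", problem)
```

An empty result only means the file has the expected shape; it does not verify that the key is active or that the roles above were granted.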

Running Locally (without Airflow)

python -m pipeline.loader

This runs the full pipeline once — scrape → transform → quality → load.

Running with Airflow

# Start Airflow
docker compose up -d

# Open UI at http://localhost:8080
# Login: airflow / airflow
# Trigger the daraz_intel_pipeline DAG

# Stop Airflow when done
docker compose down

Data Quality

The pipeline flags records with bad price data instead of dropping them. Every record loaded into BigQuery carries an is_price_valid column:

TRUE — price data is reliable
FALSE — price exceeds original price (corrupted Daraz API data)
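
The flagging rule can be sketched in a few lines. This is an illustration rather than the code in quality.py: the documented rule is that a price above the original price is invalid; treating a missing or non-positive price as invalid, and a missing original price as acceptable, are additional assumptions made for this example.

```python
def flag_price_validity(record):
    """Return a copy of the record with an is_price_valid flag added."""
    price = record.get("price_pkr")
    original = record.get("original_price")
    is_valid = bool(
        price is not None
        and price > 0  # assumption: zero or negative prices are invalid
        # documented rule: the price must not exceed the original price
        and (original is None or price <= original)
    )
    return {**record, "is_price_valid": is_valid}


def flag_all(records):
    """Flag every record; nothing is dropped, so the row count is unchanged."""
    return [flag_price_validity(rec) for rec in records]
```

Because records are only flagged, analyses that need the raw scrape can still see every row, while the queries below filter on the flag.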

Filter in BigQuery:

SELECT * FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE

Sample Analytics Queries

Average price by category today

SELECT category, ROUND(AVG(price_pkr), 0) as avg_price
FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE
GROUP BY category
ORDER BY avg_price DESC

Price trend for a specific product

SELECT scraped_date, price_pkr, discount_pct
FROM `daraz-intel.daraz_intel.products`
WHERE product_id = '926915234'
AND is_price_valid = TRUE
ORDER BY scraped_date

Top discounted products today

SELECT product_name, category, price_pkr, original_price, discount_pct
FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE
ORDER BY discount_pct DESC
LIMIT 20

Author

Built by Aryan — data engineering portfolio project.
