Skip to content

aryanlytics/daraz-intel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Daraz Intel — Price Intelligence Pipeline

A fully automated data engineering pipeline that scrapes daily product prices from Daraz.pk across 5 categories and loads them into Google BigQuery for price trend analysis.


Architecture

Daraz.pk API → Scraper → Transform → Quality Check → BigQuery
                                                         ↑
                                                    Airflow (Docker)
                                                    Runs daily @ 2AM

What It Does

  • Scrapes ~600 products daily across 5 categories from Daraz's internal JSON API
  • Cleans and standardizes raw data (price types, URL fixes, deduplication)
  • Runs data quality checks before loading (invalid prices, missing fields, duplicates)
  • Loads clean data into BigQuery partitioned by date — building a historical price dataset over time
  • Orchestrated by Apache Airflow running in Docker — fully automated, no manual intervention

Categories Tracked

Category Products/Day
Mobile Phones ~120
Laptops ~120
Home Appliances ~120
Men's Fashion ~120
Groceries ~120

Tech Stack

Tool Purpose
Python 3.11 Core pipeline language
requests + fake-useragent Web scraping
pandas + pandas-gbq Data transformation and BigQuery loading
Google BigQuery Data warehouse
Apache Airflow Pipeline orchestration
Docker Containerized Airflow environment

Project Structure

daraz-intel/
├── pipeline/
│   ├── scraper.py       # Hits Daraz internal API, extracts product data
│   ├── transform.py     # Cleans types, fixes URLs, deduplicates
│   ├── quality.py       # Validates data before loading
│   ├── loader.py        # Loads to BigQuery via pandas-gbq
│   └── logger.py        # Centralized logging to logs/pipeline.log
├── dags/
│   └── pipeline_dag.py  # Airflow DAG — scrape → transform → quality → load
├── logs/
│   └── pipeline.log     # Full pipeline run logs
├── docker-compose.yml   # Airflow cluster setup
└── requirements.txt     # Python dependencies

BigQuery Schema

Table: daraz-intel.daraz_intel.products Partitioned by: scraped_date

Column Type Description
product_id STRING Unique Daraz product ID
product_name STRING Full product title
category STRING One of 5 tracked categories
brand STRING Brand name
seller_name STRING Seller on Daraz
price_pkr INTEGER Current price in PKR
original_price INTEGER Original price before discount
discount_pct FLOAT Discount percentage
rating FLOAT Product rating (0-5)
review_count INTEGER Number of reviews
in_stock BOOLEAN Stock availability
is_price_valid BOOLEAN False if price data is corrupted
scraped_date DATE Date of scrape (partition key)

Setup

Prerequisites

  • Python 3.11
  • Docker Desktop
  • Google Cloud account with BigQuery enabled

Installation

git clone https://github.com/yourusername/daraz-intel.git
cd daraz-intel
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

BigQuery Setup

  1. Create a Google Cloud project named daraz-intel
  2. Enable the BigQuery API
  3. Create a dataset named daraz_intel
  4. Create a service account with BigQuery Data Editor and BigQuery Job User roles
  5. Download the JSON key and save as gcp-key.json in the project root

Running Locally (without Airflow)

python -m pipeline.loader

This runs the full pipeline once — scrape → transform → quality → load.

Running with Airflow

# Start Airflow
docker compose up -d

# Open UI at http://localhost:8080
# Login: airflow / airflow
# Trigger the daraz_intel_pipeline DAG

# Stop Airflow when done
docker compose down

Data Quality

The pipeline flags but does not drop bad records. Every record loaded to BigQuery has an is_price_valid column:

  • TRUE — price data is reliable
  • FALSE — price exceeds original price (corrupted Daraz API data)

Filter in BigQuery:

SELECT * FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE

Sample Analytics Queries

Average price by category today

SELECT category, ROUND(AVG(price_pkr), 0) as avg_price
FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE
GROUP BY category
ORDER BY avg_price DESC

Price trend for a specific product

SELECT scraped_date, price_pkr, discount_pct
FROM `daraz-intel.daraz_intel.products`
WHERE product_id = '926915234'
AND is_price_valid = TRUE
ORDER BY scraped_date

Top discounted products today

SELECT product_name, category, price_pkr, original_price, discount_pct
FROM `daraz-intel.daraz_intel.products`
WHERE scraped_date = CURRENT_DATE()
AND is_price_valid = TRUE
ORDER BY discount_pct DESC
LIMIT 20

Author

Built by Aryan — data engineering portfolio project.

About

A fully automated data engineering pipeline that scrapes daily product prices from Daraz.pk across 5 categories and loads them into Google BigQuery for price trend analysis.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages