This repository collects, cleans, and uploads App Store reviews for a fixed set of apps/countries, then supports optional multilingual sentiment/topic enrichment. It is designed so contributors and operators can run the full workflow locally or on a daily GitHub Actions schedule.
- Contributors maintaining ETL scripts and SQL schema.
- Analysts who need clean review data in Supabase.
- Operators who want to run or "host" the pipeline locally.
- Scrapes App Store reviews by app and country from `config/apps.json`.
- Cleans and normalizes text (emoji/URL cleanup, dedupe, language filtering).
- Runs incremental filtering against Supabase by `(source, app_name, country, source_review_id)`.
- Uploads cleaned rows into the Supabase table `clean_reviews`.
- Optionally runs ML sentiment/topic analysis for downstream reporting.
```mermaid
graph LR
  A[scripts/01_scrape.py] --> B[scripts/02_process_reviews.py]
  B --> C[scripts/03_upload_to_supabase.py]
  C --> D[(Supabase clean_reviews)]
  D --> E[notebooks + ml pipeline]
```
Runtime artifacts are written to:

- `data/raw/` (scraped files + run summary)
- `data/processed/` (cleaned CSV outputs)
- `data/metadata/` (incremental and cleaning summaries)
- `data/logs/` (reserved for logs)
```
.
|- config/
|  `- apps.json
|- scripts/
|  |- 01_scrape.py
|  |- 02_process_reviews.py
|  |- 03_upload_to_supabase.py
|  `- utils_supabase.py
|- sql/
|  |- 001_create_tables.sql
|  `- 002_views_and_indexes.sql
|- ml/
|  |- README.md
|  `- pipeline/sentiment_topics.py
|- notebooks/
|- tests/
`- .github/workflows/
   |- pipeline.yml
   `- test_pipeline_connection_supabase.yml
```
- Python 3.11
- Git
- A Supabase project (required for the upload step; recommended for incremental dedupe during processing)
- Optional ML extras from `ml_requirements.txt`
```
git clone https://github.com/<your-org-or-user>/data_pipeline.git
cd data_pipeline
```

PowerShell:

```
python -m venv .venv
. .\.venv\Scripts\Activate.ps1
```

bash/zsh:

```
python -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```
pip install -r requirements.txt
```

Create `.env` in the repo root:

```
SUPABASE_URL=https://YOUR_PROJECT_ID.supabase.co
SUPABASE_SERVICE_ROLE_KEY=YOUR_SERVICE_ROLE_KEY
```

Optional: also load them into your current shell session.

PowerShell:

```
$env:SUPABASE_URL="https://YOUR_PROJECT_ID.supabase.co"
$env:SUPABASE_SERVICE_ROLE_KEY="YOUR_SERVICE_ROLE_KEY"
```

bash/zsh:

```
export SUPABASE_URL="https://YOUR_PROJECT_ID.supabase.co"
export SUPABASE_SERVICE_ROLE_KEY="YOUR_SERVICE_ROLE_KEY"
```

Notes:
- `.env` is loaded by `scripts/utils_supabase.py`.
- You can still run the scrape/process steps without valid Supabase credentials; `02_process_reviews.py` will disable incremental filtering if the Supabase lookup fails.
- The upload step (`03_upload_to_supabase.py`) requires valid Supabase credentials.
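The exact contents of `scripts/utils_supabase.py` are not reproduced here; a minimal sketch of the pattern the notes above imply, assuming `python-dotenv` and `supabase-py` (function name and error message are illustrative):

```python
import os


def get_supabase_client():
    """Return a Supabase client, or raise if credentials are missing.

    Imports are kept inside the function so the error path can be
    exercised even without supabase-py installed.
    """
    try:
        from dotenv import load_dotenv  # python-dotenv
        load_dotenv()  # reads .env from the current working directory
    except ImportError:
        pass  # .env loading is optional if the vars are already exported
    url = os.environ.get("SUPABASE_URL")
    key = os.environ.get("SUPABASE_SERVICE_ROLE_KEY")
    if not url or not key:
        raise RuntimeError("Set SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY")
    from supabase import create_client
    return create_client(url, key)
```

Importing lazily also keeps scrape/process runnable without the Supabase dependency, matching the notes above.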
- Create a Supabase project.
- Open the SQL Editor in Supabase.
- Run the SQL files in this order:
  1. `sql/001_create_tables.sql`
  2. `sql/002_views_and_indexes.sql`
- Verify:
  - The table `clean_reviews` exists.
  - A unique constraint exists on `(source, app_name, country, source_review_id)`.
  - The view `v_reviews_daily_stats` exists.
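For orientation, a hypothetical schema sketch consistent with the constraint described above; the authoritative definitions live in `sql/001_create_tables.sql`, and every column other than the four key columns is an assumption:

```sql
-- Illustrative only; see sql/001_create_tables.sql for the real DDL.
create table if not exists clean_reviews (
    id bigint generated always as identity primary key,
    source text not null,
    app_name text not null,
    country text not null,
    source_review_id text not null,
    rating smallint,
    review_text text,
    review_date timestamptz,
    unique (source, app_name, country, source_review_id)
);
```

The unique constraint is what makes both the incremental dedupe and the batched upserts idempotent.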
Edit `config/apps.json`:

- `apps`: array of app objects with:
  - `name` (internal app label)
  - `id` (App Store app ID)
- `countries`: two-letter country codes (App Store review feed locales).
- `source`: source label stored in the dataset (default `app_store`).
- `scrape_delay_seconds`: delay in seconds between app-country calls.
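How the scraper consumes this config is not shown in the README; a sketch of the implied iteration over every app-country pair (the helper name is illustrative):

```python
# Sketch: enumerate every (app, country) combination from the config,
# honoring scrape_delay_seconds between calls.
import itertools
import json
import time


def iter_app_country_pairs(config: dict):
    """Yield (app, country) pairs for every configured combination."""
    yield from itertools.product(config["apps"], config["countries"])


if __name__ == "__main__":
    with open("config/apps.json", encoding="utf-8") as f:
        cfg = json.load(f)
    for app, country in iter_app_country_pairs(cfg):
        print(f"would scrape {app['name']} ({app['id']}) in {country}")
        time.sleep(cfg.get("scrape_delay_seconds", 1))
```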
Example:

```json
{
  "apps": [
    { "name": "tinder", "id": 547702041 },
    { "name": "yubo", "id": 1038653883 }
  ],
  "countries": ["us", "fr"],
  "source": "app_store",
  "scrape_delay_seconds": 1
}
```

Run the scraper:

```
python scripts/01_scrape.py
```

Expected output:
- CSV snapshots in `data/raw/`, named like `<app>_<country>_<YYYY-MM-DD>.csv`
- A run summary JSON in `data/raw/`: `run_summary_<YYYY-MM-DD_HH-MM-SS>.json`
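The fetch mechanism inside `01_scrape.py` is not reproduced here; one common approach for per-country App Store reviews is Apple's public customer-reviews RSS feed, sketched below (the URL pattern and the use of `requests` are assumptions, not a description of the actual script):

```python
def review_feed_url(app_id: int, country: str, page: int = 1) -> str:
    """Build the iTunes customer-reviews RSS URL for one app/country page."""
    return (
        f"https://itunes.apple.com/{country}/rss/customerreviews/"
        f"page={page}/id={app_id}/sortby=mostrecent/json"
    )


def fetch_reviews(app_id: int, country: str, page: int = 1) -> list[dict]:
    """Fetch one page of reviews; returns [] when the feed has no entries."""
    import requests  # imported lazily so the URL helper stays dependency-free

    resp = requests.get(review_feed_url(app_id, country, page), timeout=30)
    resp.raise_for_status()
    entries = resp.json().get("feed", {}).get("entry", [])
    # A lone dict (instead of a list) means only feed metadata came back.
    return entries if isinstance(entries, list) else []
```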
```
python scripts/02_process_reviews.py
```

Expected output:

- Processed CSV files in `data/processed/`: `<app>_<country>_clean_<YYYY-MM-DD_HH-MM-SS>.csv`
- Per-file metadata in `data/metadata/`: `<app>_<country>_metadata_<timestamp>.json`
- Run summary metadata: `run_clean_summary_<timestamp>.json` and `run_incremental_overview_<timestamp>.json`
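The cleaning steps named earlier (emoji/URL cleanup, dedupe) can be sketched as below; this is illustrative only, and the real logic lives in `scripts/02_process_reviews.py` (the regexes are rough approximations):

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Rough emoji coverage via common symbol/pictograph Unicode blocks.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)


def clean_text(text: str) -> str:
    """Strip URLs and emoji, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def dedupe(rows: list[dict], key: str = "source_review_id") -> list[dict]:
    """Keep the first occurrence of each review id, preserving order."""
    seen: set[str] = set()
    out = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out
```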
```
python scripts/03_upload_to_supabase.py
```

Expected behavior:

- Upserts cleaned rows into `clean_reviews` in batches.
- Skips files marked `no_new_reviews`.
- Logs a warning if metadata references missing processed files.
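A sketch of the batched-upsert behavior described above, assuming supabase-py's `upsert` with an `on_conflict` target matching the unique constraint (batch size and function names are illustrative):

```python
# Sketch only: the real batching/upsert logic lives in
# scripts/03_upload_to_supabase.py.
def chunked(rows: list[dict], size: int = 500):
    """Yield rows in fixed-size batches."""
    for i in range(0, len(rows), size):
        yield rows[i : i + size]


def upload_rows(client, rows: list[dict]) -> None:
    """Upsert all rows into clean_reviews, one batch at a time."""
    for batch in chunked(rows):
        (
            client.table("clean_reviews")
            .upsert(batch, on_conflict="source,app_name,country,source_review_id")
            .execute()
        )
```

Upserting on the four-column key makes reruns idempotent: rows already present are updated rather than duplicated.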
`02_process_reviews.py` calls Supabase before cleaning each app-country cohort:

- Fetches existing `source_review_id` values for `(source, app, country)`.
- Filters out already-ingested reviews before cleaning.

Metadata status values:

- `new_dataset`: no existing rows were filtered for that file.
- `partial_update`: some rows already existed and were skipped.
- `no_new_reviews`: nothing new to process/upload.
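The three status values above reduce to a comparison of row counts; a minimal illustrative mapping (the function name is not from the codebase):

```python
def dataset_status(total_rows: int, already_ingested: int) -> str:
    """Map incremental-filter counts to the status labels used in metadata."""
    if already_ingested == 0:
        return "new_dataset"
    if already_ingested < total_rows:
        return "partial_update"
    return "no_new_reviews"
```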
`03_upload_to_supabase.py` reads the latest `run_clean_summary_*.json` and:

- Uploads the listed processed files.
- Skips entries marked `no_new_reviews`.
- Falls back to scanning `data/processed/*_clean_*.csv` if metadata is unavailable.
File: `.github/workflows/pipeline.yml`

- Triggers:
  - Scheduled: `0 6 * * *` (daily at 06:00 UTC)
  - Manual: `workflow_dispatch`
- Steps:
  - Install Python 3.11 and dependencies
  - Run `01_scrape.py`
  - Run `02_process_reviews.py`
  - Run `03_upload_to_supabase.py`
  - Upload artifacts: `data/**` and `data/metadata/**`
- Required secrets: `SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`
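An abbreviated sketch of what such a workflow looks like; this is not the repo's actual `pipeline.yml` (action versions, job and artifact names are assumptions), but the trigger, steps, and secrets match the description above:

```yaml
name: pipeline
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
  workflow_dispatch:
jobs:
  etl:
    runs-on: ubuntu-latest
    env:
      SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
      SUPABASE_SERVICE_ROLE_KEY: ${{ secrets.SUPABASE_SERVICE_ROLE_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/01_scrape.py
      - run: python scripts/02_process_reviews.py
      - run: python scripts/03_upload_to_supabase.py
      - uses: actions/upload-artifact@v4
        with:
          name: pipeline-data
          path: data/**
```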
File: `.github/workflows/test_pipeline_connection_supabase.yml`

- Trigger: `workflow_dispatch`
- Purpose: quick read test against `clean_reviews` using repo secrets.
The ETL pipeline and ML pipeline are separate. ETL can run without ML dependencies.
```
pip install -r ml_requirements.txt
```

- Open `notebooks/04_sentiment_topics_analysis.ipynb`.
- The notebook expects an input dataset at `data/processed_reviews.csv`.
- It produces `reviews_sentiment_topics.csv`, `notebooklm_reviews.csv`, and `topic_summary.csv`.
- Core functions are in `ml/pipeline/sentiment_topics.py`.
- Includes graceful fallbacks when model loading fails (for example, in offline environments).
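The graceful-fallback pattern mentioned above can be sketched as a try/except around model loading; this is illustrative and not the actual code in `ml/pipeline/sentiment_topics.py` (the keyword lexicon and function names are assumptions):

```python
def lexicon_sentiment(text: str) -> str:
    """Crude keyword fallback; the lexicon here is purely illustrative."""
    positive = {"good", "great", "love", "excellent"}
    negative = {"bad", "awful", "hate", "broken"}
    words = set(text.lower().split())
    score = len(words & positive) - len(words & negative)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def get_sentiment_fn():
    """Return a transformer classifier when available, else the fallback."""
    try:
        from transformers import pipeline  # heavy; may be missing or offline
        clf = pipeline("sentiment-analysis")
        return lambda text: clf(text)[0]["label"].lower()
    except Exception:
        # Offline or missing dependencies: degrade to the keyword fallback.
        return lexicon_sentiment
```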
Run tests:

```
pytest -q
```

Current focus:

- `tests/test_sentiment_topics.py` covers helper logic in `ml/pipeline/sentiment_topics.py`.
- Tests call `pytest.importorskip("spacy")`, so ML-related tests are skipped when spaCy is missing.
Symptoms:
- Upload script raises missing credentials error.
Fix:
- Ensure `.env` exists with `SUPABASE_URL` and `SUPABASE_SERVICE_ROLE_KEY`.
- Confirm you are running commands from the repo root.
Symptoms:
- Processor reports no raw files.
Fix:
- Run `python scripts/01_scrape.py` first.
- Check that the app IDs and countries in `config/apps.json` are valid.
Symptoms:
- Processor warns incremental filtering skipped for that file.
Fix:
- Keep scraper outputs unchanged before processing.
- Ensure custom raw files preserve `source_review_id`.
Symptoms:
- Large drop after language detection stage.
Fix:
- Confirm country-to-language mapping in script matches your target countries.
- Validate review text quality and language mix in raw data.
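The language-filtering check behind this symptom presumably compares each review's detected language to the language expected for its country; a hypothetical sketch (the mapping and function are illustrative, and the real one lives inside `scripts/02_process_reviews.py`):

```python
# Illustrative country-to-language mapping; extend to match config/apps.json.
COUNTRY_LANG = {"us": "en", "gb": "en", "fr": "fr", "de": "de"}


def keep_review(detected_lang: str, country: str) -> bool:
    """Keep a review only when its detected language matches its country's."""
    expected = COUNTRY_LANG.get(country)
    return expected is not None and detected_lang == expected
```

If a target country is missing from the mapping, every review from it is dropped, which would explain a large post-detection drop.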
Symptoms:
- Log lines like `[SKIP] ... marked as no new reviews`
Fix:
- This is expected on reruns with no new review IDs.
- Inspect the latest `data/metadata/run_clean_summary_*.json`.
- Never commit `.env`.
- Never commit `data/` artifacts (already ignored in `.gitignore`).
- Treat `SUPABASE_SERVICE_ROLE_KEY` as a sensitive production credential.
- Store run artifacts in secure storage if sharing them outside your local machine.
- Add `raw_reviews` and `run_logs` tables for full lineage.
- Add production sentiment writeback table(s) and views.
- Add alerting (for example Slack/webhook) on workflow failures.
- Add dashboard-oriented aggregates for trend monitoring.
This README documents the existing pipeline interfaces without changing code:
- `python scripts/01_scrape.py`
- `python scripts/02_process_reviews.py`
- `python scripts/03_upload_to_supabase.py`
- `sql/001_create_tables.sql`
- `sql/002_views_and_indexes.sql`
- `ml/pipeline/sentiment_topics.py`