Backend data platform for Troutlytics.
This repository does two things:
- Scrapes Washington Department of Fish and Wildlife (WDFW) trout stocking reports.
- Serves cleaned stocking data through a FastAPI service for dashboards, maps, and analytics.
Links:

- Website: https://troutlytics.com
- Backend repository: https://github.com/troutlytics/troutlytics-backend
- API base URL: https://xtczssso08.execute-api.us-west-2.amazonaws.com
WDFW stocking information is published as a human-readable web table, but product features need a reliable machine-readable dataset. The raw source is hard to query over time, aggregate by hatchery/date, and map consistently.
This backend solves that by:
- Extracting the source table on a schedule.
- Normalizing and deduplicating stocking rows.
- Resolving water bodies to stable location records.
- Persisting curated data in a queryable database.
- Exposing API endpoints with caching/ETag support for fast client reads.
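The extract → normalize → dedupe steps above can be sketched roughly as follows. All names here (`StockingRow`, `normalize`, `run_pipeline`) are illustrative, not the repo's actual API:

```python
"""Hypothetical sketch of the pipeline stages: normalize rows, then
skip any row whose key already exists. Persistence would follow."""
from dataclasses import dataclass


@dataclass(frozen=True)
class StockingRow:
    lake_name: str
    stock_date: str   # ISO date string
    fish_count: int


def normalize(raw: dict) -> StockingRow:
    # Trim whitespace and canonicalize the lake name before matching.
    return StockingRow(
        lake_name=raw["lake"].strip().lower(),
        stock_date=raw["date"],
        fish_count=int(raw["count"].replace(",", "")),
    )


def run_pipeline(raw_rows, existing_keys: set) -> list[StockingRow]:
    """Extract -> normalize -> dedupe; returns only new curated rows."""
    curated = []
    for raw in raw_rows:
        row = normalize(raw)
        key = (row.lake_name, row.stock_date, row.fish_count)
        if key in existing_keys:  # duplicate of a row already stored
            continue
        existing_keys.add(key)
        curated.append(row)
    return curated
```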
- `web_scraper/scraper.py` fetches WDFW trout plant rows (`items_per_page=250`), parses fields, and normalizes names/dates/values.
- The scraper matches each row to existing `water_location` records (exact + relaxed matching) to avoid duplicates.
- If enabled, it can create missing water locations and optionally geocode them with Google Geocoding.
- `data/database.py` writes `stocking_report` entries with dedupe/upsert logic and records run metadata in `utility`.
- `api/index.py` serves raw and aggregate endpoints from the same shared models in `data/`.
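The exact + relaxed matching can be illustrated with a small sketch. The real logic lives in `web_scraper/scraper.py`; the names here (`match_location`, `known_locations`) and the canonicalization rules are assumptions for the example:

```python
"""Illustrative exact-then-relaxed water-location matching.
Relaxed matching compares canonicalized names (lowercased,
punctuation stripped, common suffixes like "lake" dropped)."""
import re


def _canonical(name: str) -> str:
    # Lowercase, drop punctuation, strip generic water-body words.
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(w for w in name.split() if w not in {"lake", "lk", "pond"})


def match_location(scraped_name: str, known_locations: dict[str, int]):
    """Return a location id via exact match first, then relaxed match."""
    if scraped_name in known_locations:            # exact match
        return known_locations[scraped_name]
    wanted = _canonical(scraped_name)
    for name, loc_id in known_locations.items():   # relaxed match
        if _canonical(name) == wanted:
            return loc_id
    return None  # caller decides: create the location or skip the row
```

Returning `None` leaves the conservative default described later in this README: without create mode enabled, unmatched rows are skipped.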
```
.
├── api/                 FastAPI app (local Uvicorn + Lambda via Mangum)
├── web_scraper/         WDFW scraper and parser
├── data/                SQLAlchemy models, database access layer, local SQLite file
├── aws_config/          CloudFormation templates (OIDC/IAM, scheduled Fargate, full stack variants)
├── .github/workflows/   CI and image deployment workflows
├── docker-compose.yml   Local services for API and scraper
└── Makefile             ECR/Lambda helper commands
```
- API: FastAPI + SQLAlchemy (`api/index.py`)
- Scraper: Requests + BeautifulSoup (`web_scraper/scraper.py`)
- Database:
  - Production path: PostgreSQL/Aurora via `POSTGRES_*` env vars
  - Local fallback: `data/sqlite.db` when Postgres vars are missing
- Deploy targets:
  - API container image for AWS Lambda (`api/dockerfiles/prod/Dockerfile`)
  - Scraper container image for ECS Fargate (`web_scraper/Dockerfile`)
Core database variables:

- `POSTGRES_HOST`
- `POSTGRES_PORT` (default `5432`)
- `POSTGRES_DB`
- `POSTGRES_USER`
- `POSTGRES_PASSWORD`

Scraper behavior flags:

- `SCRAPER_ALLOW_CREATE_WATER_LOCATION` (true/false, default `false`)
- `SCRAPER_GEOCODE` (true/false, default `false`)
- `GV3_API_KEY` (required only when geocoding is enabled)
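A minimal sketch of the Postgres-or-SQLite selection these variables drive. The actual construction in `data/database.py` may differ; `database_url` is a hypothetical helper name:

```python
"""Build a SQLAlchemy-style database URL from POSTGRES_* env vars,
falling back to the local SQLite file when they are missing."""
import os


def database_url() -> str:
    host = os.getenv("POSTGRES_HOST")
    if not host:
        # Local fallback when Postgres variables are not configured.
        return "sqlite:///data/sqlite.db"
    port = os.getenv("POSTGRES_PORT", "5432")
    db = os.environ["POSTGRES_DB"]
    user = os.environ["POSTGRES_USER"]
    password = os.environ["POSTGRES_PASSWORD"]
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"
```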
From repo root:

```shell
docker compose build
docker compose up
```

Useful targets:

- API dev service only: `docker compose up api-dev`
- API prod image locally: `docker compose up api-prod`
- Scraper only: `docker compose up web-scraper`
From repo root:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r web_scraper/requirements.txt
pip install -r api/requirements.txt
```

Run scraper:

```shell
python -m web_scraper.scraper
```

Run API:

```shell
uvicorn api.index:app --reload --port 8080
```

API docs: http://localhost:8080/docs
Main routes in `api/index.py`:

- `GET /`
- `GET /stocked_lakes_data`
- `GET /stocked_lakes_data_all_time`
- `GET /total_stocked_by_date_data`
- `GET /hatchery_totals`
- `GET /derby_lakes_data`
- `GET /date_data_updated`
- `GET /hatchery_names`
Date-filtered endpoints accept optional `start_date` and `end_date` query parameters (ISO format); the default range is the last 7 days.
- `.github/workflows/deploy-scraper.yml`
  - On push to `main`, builds and pushes both scraper and API images to ECR.
  - Uses GitHub OIDC (`AWS_ROLE_ARN`) to assume an AWS role.
- `.github/workflows/python-app.yml`
  - Lints with flake8 and runs pytest.
- `.github/workflows/deploy-to-ecr.yml`
  - Additional scraper image push workflow (also on `main`).
Infra templates:
- `aws_config/configure-aws-credentials-latest.yml`: IAM role/OIDC provider setup for GitHub Actions.
- `aws_config/scheduled-scraper-fargate.yaml`: scheduled EventBridge -> ECS Fargate scraper task using Secrets Manager.
- `aws_config/full-api-creation.yaml` and `aws_config/fargate-rds-secrets.yaml`: broader/legacy stack templates kept in-repo.
- If Postgres env vars are missing, the app falls back to SQLite (`data/sqlite.db`).
- Scraper write behavior is intentionally conservative by default: if a water location does not already exist and create mode is off, the row is skipped.
- API responses use cache headers and ETags; the all-time route keeps an in-memory cache (~12 hours) to reduce query cost.
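The ETag and in-memory TTL cache behavior can be sketched as below. Names (`etag_for`, `cached`) and the hashing choice are assumptions; the real implementation in `api/index.py` may differ:

```python
"""Sketch of ETag generation plus a ~12-hour in-memory TTL cache,
as described for the all-time route."""
import hashlib
import json
import time

_CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 12 * 60 * 60  # ~12 hours


def etag_for(payload) -> str:
    # Stable hash of the JSON body; clients echo it via If-None-Match,
    # letting the server answer 304 Not Modified without a new body.
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.md5(body).hexdigest()


def cached(key: str, compute):
    """Return a cached value if still fresh, else recompute and store."""
    now = time.monotonic()
    hit = _CACHE.get(key)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]
    value = compute()
    _CACHE[key] = (now, value)
    return value
```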
MIT