GoldenSource for Crypto - A production-grade reference data platform for cryptocurrency instruments, demonstrating staff-level data engineering excellence.
The K2 Reference Data Platform provides bitemporal reference data for cryptocurrency instruments across multiple exchanges (Binance, Kraken, Bybit, etc.), enabling:
- Point-in-time queries: "What was the tick size for BTCUSDT on 2024-01-15 at 10:00 UTC?"
- Late correction handling: Corrections don't corrupt historical data
- Cross-exchange symbology: Unified canonical IDs (BTC-USD-SPOT) mapped to exchange-specific symbols
- Complete audit trail: Regulatory compliance with full change history
- Bitemporal Modeling: Tracks both business time (when effective) and system time (when learned)
- SCD Type 2 with Corrections: Late corrections insert new records, preserving history
- Symbolic Normalization: Handles XBT→BTC, USDT→USD, and other exchange quirks
- Immutable Bronze Layer: Full API responses preserved for replay
```
┌─────────────┐
│  Exchanges  │  Binance, Kraken, Bybit (REST APIs)
└──────┬──────┘
       │ Polling (hourly)
       ▼
┌─────────────┐
│   Bronze    │  Raw JSON (Iceberg) - 7 day retention
└──────┬──────┘
       │ DBT Transformations
       ▼
┌─────────────┐
│   Silver    │  Bitemporal SCD Type 2 (Iceberg)
└──────┬──────┘
       │ DBT Symbology Mapping
       ▼
┌─────────────┐
│    Gold     │  Canonical Symbology Master (Iceberg)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   FastAPI   │  REST API with DuckDB query engine
└─────────────┘
```
Technology Stack:
- Ingestion: Python (httpx, confluent-kafka)
- Storage: Apache Iceberg (Format Version 2), MinIO/S3
- Transformations: DBT (dbt-duckdb)
- API: FastAPI with DuckDB query engine
- Streaming: Apache Kafka + Schema Registry (Avro)
- Orchestration: Kubernetes CronJobs (ingestion + DBT)
🚀 New here? Start with GETTING-STARTED.md (30 minutes)
This guide gets you from zero to a running platform with:
- All services running locally (Docker)
- Sample data ingested and transformed
- API serving queries
- Full understanding of data flow
```bash
# 1. Install & start services
make install-dev
make docker-up
make init-infra

# 2. Run pipeline
make ingest-binance   # Fetch from Binance API
make dbt-run          # Transform Bronze → Silver → Gold

# 3. Query API
make api-dev
curl "http://localhost:8001/v1/instruments?limit=5"
```

Next: Follow GETTING-STARTED.md for a detailed walkthrough.
Week 1 Plan: docs/development/DEVELOPER-ONBOARDING.md
Day-by-day learning path to go from zero to shipping features:
- Day 1: Understand the "why" (read ADRs, trace data flow)
- Day 2: Code deep dive (ingestion, DBT models)
- Day 3: API & testing patterns
- Day 4: Hands-on exercises (DBT, adding fields)
- Day 5: Ship your first feature
Other Guides:
- COMMON-WORKFLOWS.md - How to add exchanges, fix bugs, optimize queries
- TROUBLESHOOTING.md - When things break (Docker, DBT, API, data quality)
```
k2-reference-data-platform/
├── src/refdata/          # Main application code
│   ├── ingestion/        # Exchange clients, Kafka producers
│   ├── api/              # FastAPI endpoints
│   ├── query/            # DuckDB query helpers (bitemporal logic)
│   ├── common/           # Config, logging, DB connections
│   └── cli/              # CLI commands
├── dbt/                  # DBT project
│   ├── models/
│   │   ├── bronze/       # Source definitions
│   │   ├── silver/       # Bitemporal SCD Type 2
│   │   └── gold/         # Symbology master
│   ├── macros/           # Custom bitemporal macros
│   └── tests/            # Data quality tests
├── tests/                # Test suite
│   ├── unit/             # Fast unit tests
│   ├── integration/      # Docker-based integration tests
│   └── e2e/              # End-to-end pipeline tests
├── docs/                 # Documentation
│   ├── architecture/     # ADRs (Architecture Decision Records)
│   ├── api/              # OpenAPI specs
│   ├── runbooks/         # Operational guides
│   └── development/      # Developer guides
├── infrastructure/       # Docker & Kubernetes configs
│   ├── docker/           # Dockerfiles
│   └── compose/          # docker-compose files
├── config/               # Configuration files
│   └── schemas/          # Avro schemas
└── scripts/              # Utility scripts
```
```bash
# Unit tests only (fast, no Docker)
make test-unit

# Integration tests (requires Docker services)
make test-integration

# All tests
make test-all

# With coverage
make coverage
```

```bash
# Lint code
make lint

# Format code
make format

# Type checking
make type-check

# All quality checks
make quality
```

```bash
# Run all DBT models
make dbt-run

# Run specific model
cd dbt && dbt run --select silver_instruments

# Run DBT tests
make dbt-test

# Generate documentation
make dbt-docs
```

```bash
# Start API with auto-reload
make api-dev

# Access interactive docs
open http://localhost:8001/docs

# Access ReDoc
open http://localhost:8001/redoc
```

The platform tracks two temporal dimensions:
- Business Time (`valid_from`, `valid_to`): when the specification was effective in reality
- System Time (`record_created_at`, `record_updated_at`): when we learned about it
Example:

```
Jan 10, 9am:  Binance announces tick_size change effective Jan 15
Jan 11, 3pm:  We ingest (record_created_at=Jan 11, valid_from=Jan 15)
Jan 15, 12am: Change goes live
Jan 16, 8am:  Correction: "Actually effective Jan 14, 11pm"
              New record: record_created_at=Jan 16, valid_from=Jan 14 11pm

Query: "What was tick_size on Jan 14, 10pm?" → Returns the old value
(the corrected valid_from is Jan 14, 11pm)
```
See ADR-001: Bitemporal Modeling for details.
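The correction semantics above can be sketched as a small point-in-time lookup. This is an illustrative model only, not the platform's actual DuckDB implementation; the class, function name, and sample values are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class InstrumentVersion:
    tick_size: str
    valid_from: datetime           # business time: when the spec took effect
    valid_to: Optional[datetime]   # business time: None = still effective
    record_created_at: datetime    # system time: when we learned about it

def as_of(records, business_time, knowledge_time):
    """Version effective at business_time, as known at knowledge_time.

    Among records we had already learned about, keep those whose
    business-time window covers business_time; the most recently learned
    one wins, so late corrections supersede earlier beliefs without
    destroying history.
    """
    candidates = [
        r for r in records
        if r.record_created_at <= knowledge_time
        and r.valid_from <= business_time
        and (r.valid_to is None or business_time < r.valid_to)
    ]
    return max(candidates, key=lambda r: r.record_created_at, default=None)

# Replaying the January timeline (tick sizes are illustrative):
records = [
    InstrumentVersion("0.01", datetime(2024, 1, 1), datetime(2024, 1, 15), datetime(2024, 1, 1)),
    InstrumentVersion("0.10", datetime(2024, 1, 15), None, datetime(2024, 1, 11)),
    # Jan 16 correction: actually effective Jan 14, 11pm
    InstrumentVersion("0.10", datetime(2024, 1, 14, 23), None, datetime(2024, 1, 16)),
]
```

Querying Jan 14 at 10pm still returns the old value, Jan 14 at 11:30pm reflects the correction, and pinning `knowledge_time` to Jan 15 reproduces what we believed before the correction arrived.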
Exchanges use inconsistent symbols for the same instrument:
| Exchange | BTC/USD Spot | Notes |
|---|---|---|
| Binance | `BTCUSDT` | No separator, uses USDT |
| Kraken | `XBT/USD` | Uses XBT instead of BTC! |
| Coinbase | `BTC-USD` | Hyphen separator |

Canonical IDs provide a unified format: `BTC-USD-SPOT`, `ETH-USDT-PERP`, `BTC-USD-FUT-20240329`
See ADR-004: Symbology Mapping for details.
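As a rough sketch of the normalization above (the alias tables, the helper name, and folding USDT into USD for spot IDs are illustrative assumptions, not the platform's actual mapping rules):

```python
# Illustrative alias tables -- not the platform's real mapping data.
ASSET_ALIASES = {"XBT": "BTC"}    # Kraken calls Bitcoin XBT
QUOTE_ALIASES = {"USDT": "USD"}   # fold the USDT stablecoin into USD
KNOWN_QUOTES = ("USDT", "USD", "EUR", "BTC")

def canonicalize(symbol: str, instrument_class: str = "SPOT") -> str:
    """Map an exchange-native symbol (BTCUSDT, XBT/USD, BTC-USD) to a canonical ID."""
    s = symbol.replace("/", "").replace("-", "").upper()
    # Try the longest known quote suffix first so USDT wins over USD.
    for quote in sorted(KNOWN_QUOTES, key=len, reverse=True):
        if s.endswith(quote) and len(s) > len(quote):
            base, q = s[: -len(quote)], quote
            break
    else:
        raise ValueError(f"cannot split symbol: {symbol!r}")
    base = ASSET_ALIASES.get(base, base)
    q = QUOTE_ALIASES.get(q, q)
    return f"{base}-{q}-{instrument_class.upper()}"
```

With this sketch, `BTCUSDT`, `XBT/USD`, and `BTC-USD` all resolve to `BTC-USD-SPOT`. In practice such mappings are table-driven (the Gold symbology master) rather than suffix-matched, since heuristics like this break on ambiguous symbols.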
Instruments:

```
GET /v1/instruments?exchange={ex}&symbol={sym}&as_of={timestamp}
GET /v1/instruments/{exchange}/{symbol}/history
```

Symbology:

```
GET /v1/symbology/{canonical_id}
GET /v1/symbology/resolve?exchange={ex}&symbol={sym}
GET /v1/symbology/search?base_asset={asset}&instrument_class={class}
```

Health:

```
GET /v1/health
```
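For instance, a client might build the point-in-time query URL like this (a hypothetical helper; only the endpoint shape comes from this README):

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def point_in_time_url(base_url: str, exchange: str, symbol: str,
                      as_of: datetime) -> str:
    """Build a /v1/instruments point-in-time query URL with a UTC timestamp."""
    if as_of.tzinfo is None:
        as_of = as_of.replace(tzinfo=timezone.utc)  # treat naive times as UTC
    params = {
        "exchange": exchange,
        "symbol": symbol,
        "as_of": as_of.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return f"{base_url}/v1/instruments?{urlencode(params)}"
```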
Point-in-Time Query:

```bash
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT&as_of=2024-01-15T10:00:00Z"
```

Audit Trail:

```bash
curl http://localhost:8001/v1/instruments/binance/BTCUSDT/history
```

Symbology Lookup:

```bash
curl http://localhost:8001/v1/symbology/BTC-USD-SPOT
# Returns: {"binance_symbol": "BTCUSDT", "kraken_symbol": "XBT/USD", ...}
```

Reverse Lookup:

```bash
curl "http://localhost:8001/v1/symbology/resolve?exchange=kraken&symbol=XBT/USD"
# Returns: {"canonical_id": "BTC-USD-SPOT", "base_asset": "BTC", ...}
```

Key architectural decisions are documented as ADRs (Architecture Decision Records):
- ADR-001: Bitemporal Modeling
  - Why dual temporality (business + system time)?
  - How to handle late corrections?
- Ingestion strategy
  - Polling vs streaming
  - Idempotency and change detection
- Transformations
  - Why DBT for transformations?
  - SCD Type 2 implementation
- ADR-004: Symbology Mapping
  - Canonical ID format
  - Cross-exchange normalization
- Schema evolution
  - Handling new exchange API fields
  - Backward compatibility strategy
- Phase 1A: Project scaffolding and ADRs (Week 1)
- Phase 1B: Bronze ingestion (Binance + Kraken) (Week 2)
- Phase 1C: DBT Silver transformations + Gold symbology (Weeks 3-5)
- Phase 1D: FastAPI query layer (Week 4)
- Phase 1F: Documentation and operational readiness (Week 6)
- Add Bybit exchange
- Add Coinbase exchange
- Implement manual override workflow
- Grafana dashboards
- Data quality alerting
- Options and futures support
- Historical data backfill
- GraphQL API
- Real-time change notifications (WebSocket)
See CLAUDE.md for development standards and guidelines.
- All tests pass (`make test-all`)
- Code quality checks pass (`make quality`)
- ADR written if an architectural decision was made
- API docs updated if endpoints changed
- DBT docs updated if models changed
- Runbook updated if there is operational impact
MIT License - see LICENSE file for details.
- Project Lead: K2 Engineering Team
- Issues: https://github.com/k2/k2-reference-data-platform/issues
- Documentation: https://docs.k2.com/refdata
Built with ❤️ by the K2 Engineering Team
Demonstrating staff-level data engineering excellence through simplicity, correctness, and maintainability.