K2 Reference Data Platform

GoldenSource for Crypto - A production-grade reference data platform for cryptocurrency instruments, demonstrating staff-level data engineering excellence.

Overview

The K2 Reference Data Platform provides bitemporal reference data for cryptocurrency instruments across multiple exchanges (Binance, Kraken, Bybit, etc.), enabling:

  • Point-in-time queries: "What was the tick size for BTCUSDT on 2024-01-15 at 10:00 UTC?"
  • Late correction handling: Corrections don't corrupt historical data
  • Cross-exchange symbology: Unified canonical IDs (BTC-USD-SPOT) mapped to exchange-specific symbols
  • Complete audit trail: Regulatory compliance with full change history

Core Differentiators

  1. Bitemporal Modeling: Tracks both business time (when effective) and system time (when learned)
  2. SCD Type 2 with Corrections: Late corrections insert new records, preserving history
  3. Symbology Normalization: Handles XBT→BTC, USDT→USD, and other exchange quirks
  4. Immutable Bronze Layer: Full API responses preserved for replay

Architecture

┌─────────────┐
│  Exchanges  │  Binance, Kraken, Bybit (REST APIs)
└──────┬──────┘
       │ Polling (hourly)
       ▼
┌─────────────┐
│   Bronze    │  Raw JSON (Iceberg) - 7 day retention
└──────┬──────┘
       │ DBT Transformations
       ▼
┌─────────────┐
│   Silver    │  Bitemporal SCD Type 2 (Iceberg)
└──────┬──────┘
       │ DBT Symbology Mapping
       ▼
┌─────────────┐
│    Gold     │  Canonical Symbology Master (Iceberg)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  FastAPI    │  REST API with DuckDB query engine
└─────────────┘

Technology Stack:

  • Ingestion: Python (httpx, confluent-kafka)
  • Storage: Apache Iceberg (Format Version 2), MinIO/S3
  • Transformations: DBT (dbt-duckdb)
  • API: FastAPI with DuckDB query engine
  • Streaming: Apache Kafka + Schema Registry (Avro)
  • Orchestration: Kubernetes CronJobs (ingestion + DBT)

Quick Start

🚀 New here? Start with GETTING-STARTED.md (30 minutes)

This guide gets you from zero to a running platform with:

  • All services running locally (Docker)
  • Sample data ingested and transformed
  • API serving queries
  • Full understanding of data flow

30-Second Overview

# 1. Install & start services
make install-dev
make docker-up
make init-infra

# 2. Run pipeline
make ingest-binance  # Fetch from Binance API
make dbt-run         # Transform Bronze → Silver → Gold

# 3. Query API
make api-dev
curl http://localhost:8001/v1/instruments?limit=5

Next: Follow GETTING-STARTED.md for detailed walkthrough.

New Team Member Onboarding

Week 1 Plan: docs/development/DEVELOPER-ONBOARDING.md

Day-by-day learning path to go from zero to shipping features:

  • Day 1: Understand the "why" (read ADRs, trace data flow)
  • Day 2: Code deep dive (ingestion, DBT models)
  • Day 3: API & testing patterns
  • Day 4: Hands-on exercises (DBT, adding fields)
  • Day 5: Ship your first feature

Project Structure

k2-reference-data-platform/
├── src/refdata/              # Main application code
│   ├── ingestion/            # Exchange clients, Kafka producers
│   ├── api/                  # FastAPI endpoints
│   ├── query/                # DuckDB query helpers (bitemporal logic)
│   ├── common/               # Config, logging, DB connections
│   └── cli/                  # CLI commands
├── dbt/                      # DBT project
│   ├── models/
│   │   ├── bronze/           # Source definitions
│   │   ├── silver/           # Bitemporal SCD Type 2
│   │   └── gold/             # Symbology master
│   ├── macros/               # Custom bitemporal macros
│   └── tests/                # Data quality tests
├── tests/                    # Test suite
│   ├── unit/                 # Fast unit tests
│   ├── integration/          # Docker-based integration tests
│   └── e2e/                  # End-to-end pipeline tests
├── docs/                     # Documentation
│   ├── architecture/         # ADRs (Architecture Decision Records)
│   ├── api/                  # OpenAPI specs
│   ├── runbooks/             # Operational guides
│   └── development/          # Developer guides
├── infrastructure/           # Docker & Kubernetes configs
│   ├── docker/               # Dockerfiles
│   └── compose/              # docker-compose files
├── config/                   # Configuration files
│   └── schemas/              # Avro schemas
└── scripts/                  # Utility scripts

Development Workflow

Running Tests

# Unit tests only (fast, no Docker)
make test-unit

# Integration tests (requires Docker services)
make test-integration

# All tests
make test-all

# With coverage
make coverage

Code Quality

# Lint code
make lint

# Format code
make format

# Type checking
make type-check

# All quality checks
make quality

DBT Development

# Run all DBT models
make dbt-run

# Run specific model
cd dbt && dbt run --select silver_instruments

# Run DBT tests
make dbt-test

# Generate documentation
make dbt-docs

API Development

# Start API with auto-reload
make api-dev

# Access interactive docs
open http://localhost:8001/docs

# Access ReDoc
open http://localhost:8001/redoc

Key Concepts

Bitemporal Modeling

The platform tracks two temporal dimensions:

  1. Business Time (valid_from, valid_to): When the specification was effective in reality
  2. System Time (record_created_at, record_updated_at): When we learned about it

Example:

Jan 10, 9am:  Binance announces tick_size change effective Jan 15
Jan 11, 3pm:  We ingest (record_created_at=Jan 11, valid_from=Jan 15)
Jan 15, 12am: Change goes live
Jan 16, 8am:  Correction: "Actually effective Jan 14, 11pm"
              New record: record_created_at=Jan 16, valid_from=Jan 14 11pm

Query: "What was tick_size on Jan 14 10pm?" → Returns old value (corrected valid_from is Jan 14 11pm)

See ADR-001: Bitemporal Modeling for details.

Symbology Mapping

Exchanges use inconsistent symbols for the same instrument:

Exchange   BTC/USD Spot   Notes
─────────  ─────────────  ──────────────────────────
Binance    BTCUSDT        No separator, uses USDT
Kraken     XBT/USD        Uses XBT instead of BTC!
Coinbase   BTC-USD        Hyphen separator

Canonical IDs provide a unified format:

  • BTC-USD-SPOT
  • ETH-USDT-PERP
  • BTC-USD-FUT-20240329

See ADR-004: Symbology Mapping for details.
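The mapping above can be sketched as a small alias-driven function. The alias table and BASE-QUOTE-CLASS format follow the examples in this section, but the table here is deliberately minimal and illustrative: the authoritative mappings live in the Gold symbology master, and whether USDT collapses to USD depends on the instrument (note ETH-USDT-PERP above).

```python
# Illustrative sketch of cross-exchange symbol normalization.
# Alias table is simplified from the examples above; real mappings
# are maintained in the Gold symbology master.
ASSET_ALIASES = {"XBT": "BTC", "USDT": "USD"}  # exchange quirks -> canonical

def canonical_id(base: str, quote: str, instrument_class: str = "SPOT") -> str:
    """Map exchange-specific asset codes to a canonical instrument ID."""
    def norm(asset: str) -> str:
        return ASSET_ALIASES.get(asset.upper(), asset.upper())
    return f"{norm(base)}-{norm(quote)}-{instrument_class}"

# Kraken's XBT/USD and Binance's BTCUSDT resolve to the same instrument:
print(canonical_id("XBT", "USD"))   # -> BTC-USD-SPOT
print(canonical_id("BTC", "USDT"))  # -> BTC-USD-SPOT
```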

API Reference

Endpoints

Instruments:

GET /v1/instruments?exchange={ex}&symbol={sym}&as_of={timestamp}
GET /v1/instruments/{exchange}/{symbol}/history

Symbology:

GET /v1/symbology/{canonical_id}
GET /v1/symbology/resolve?exchange={ex}&symbol={sym}
GET /v1/symbology/search?base_asset={asset}&instrument_class={class}

Health:

GET /v1/health

Example Queries

Point-in-Time Query:

curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT&as_of=2024-01-15T10:00:00Z"

Audit Trail:

curl http://localhost:8001/v1/instruments/binance/BTCUSDT/history

Symbology Lookup:

curl http://localhost:8001/v1/symbology/BTC-USD-SPOT
# Returns: {"binance_symbol": "BTCUSDT", "kraken_symbol": "XBT/USD", ...}

Reverse Lookup:

curl "http://localhost:8001/v1/symbology/resolve?exchange=kraken&symbol=XBT/USD"
# Returns: {"canonical_id": "BTC-USD-SPOT", "base_asset": "BTC", ...}
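The curl examples above translate directly to client code. A minimal sketch of building the point-in-time query URL, assuming the local host/port from the examples (the instruments_url helper is hypothetical, not part of the platform's client library); note that timestamp colons must be percent-encoded, which urlencode handles:

```python
# Hypothetical helper for the point-in-time endpoint shown above.
# BASE_URL matches the local examples; adjust for your deployment.
from urllib.parse import urlencode

BASE_URL = "http://localhost:8001"

def instruments_url(exchange: str, symbol: str, as_of: str) -> str:
    """Build a point-in-time query URL for a single instrument."""
    params = urlencode({"exchange": exchange, "symbol": symbol, "as_of": as_of})
    return f"{BASE_URL}/v1/instruments?{params}"

url = instruments_url("binance", "BTCUSDT", "2024-01-15T10:00:00Z")
print(url)  # colons in the timestamp are encoded as %3A
```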

Architecture Decisions

Key architectural decisions are documented as ADRs (Architecture Decision Records):

  1. ADR-001: Bitemporal Modeling

    • Why dual temporality (business + system time)?
    • How to handle late corrections?
  2. ADR-002: Ingestion Strategy

    • Polling vs streaming
    • Idempotency and change detection
  3. ADR-003: DBT vs Spark

    • Why DBT for transformations?
    • SCD Type 2 implementation
  4. ADR-004: Symbology Mapping

    • Canonical ID format
    • Cross-exchange normalization
  5. ADR-005: Schema Evolution

    • Handling new exchange API fields
    • Backward compatibility strategy

Roadmap

Phase 1: Foundation (Weeks 1-6) ✅ COMPLETE

  • Phase 1A: Project scaffolding and ADRs (Week 1)
  • Phase 1B: Bronze ingestion (Binance + Kraken) (Week 2)
  • Phase 1C: DBT Silver transformations + Gold symbology (Weeks 3-5)
  • Phase 1D: FastAPI query layer (Week 4)
  • Phase 1F: Documentation and operational readiness (Week 6)

Phase 2: Expansion (Weeks 7-12)

  • Add Bybit exchange
  • Add Coinbase exchange
  • Implement manual override workflow
  • Grafana dashboards
  • Data quality alerting

Phase 3: Advanced Features (Weeks 13-18)

  • Options and futures support
  • Historical data backfill
  • GraphQL API
  • Real-time change notifications (WebSocket)

Contributing

See CLAUDE.md for development standards and guidelines.

Before Submitting a PR

  • All tests pass (make test-all)
  • Code quality checks pass (make quality)
  • ADR written if architectural decision made
  • API docs updated if endpoints changed
  • DBT docs updated if models changed
  • Runbook updated if operational impact

License

MIT License - see LICENSE file for details.

Contact


Built with ❤️ by the K2 Engineering Team

Demonstrating staff-level data engineering excellence through simplicity, correctness, and maintainability.
