K2 Reference Data Platform

GoldenSource for Crypto - A production-grade reference data platform for cryptocurrency instruments, demonstrating staff-level data engineering excellence.

Overview

The K2 Reference Data Platform provides bitemporal reference data for cryptocurrency instruments across multiple exchanges (Binance, Kraken, Bybit, etc.), enabling:

  • Point-in-time queries: "What was the tick size for BTCUSDT on 2024-01-15 at 10:00 UTC?"
  • Late correction handling: Corrections don't corrupt historical data
  • Cross-exchange symbology: Unified canonical IDs (BTC-USD-SPOT) mapped to exchange-specific symbols
  • Complete audit trail: Regulatory compliance with full change history

Core Differentiators

  1. Bitemporal Modeling: Tracks both business time (when effective) and system time (when learned)
  2. SCD Type 2 with Corrections: Late corrections insert new records, preserving history
  3. Symbology Normalization: Handles XBT→BTC, USDT→USD, and other exchange quirks
  4. Immutable Bronze Layer: Full API responses preserved for replay

Architecture

┌─────────────┐
│  Exchanges  │  Binance, Kraken, Bybit (REST APIs)
└──────┬──────┘
       │ Polling (hourly)
       ▼
┌─────────────┐
│   Bronze    │  Raw JSON (Iceberg) - 7 day retention
└──────┬──────┘
       │ DBT Transformations
       ▼
┌─────────────┐
│   Silver    │  Bitemporal SCD Type 2 (Iceberg)
└──────┬──────┘
       │ DBT Symbology Mapping
       ▼
┌─────────────┐
│    Gold     │  Canonical Symbology Master (Iceberg)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  FastAPI    │  REST API with DuckDB query engine
└─────────────┘

Technology Stack:

  • Ingestion: Python (httpx, confluent-kafka)
  • Storage: Apache Iceberg (Format Version 2), MinIO/S3
  • Transformations: DBT (dbt-duckdb)
  • API: FastAPI with DuckDB query engine
  • Streaming: Apache Kafka + Schema Registry (Avro)
  • Orchestration: Kubernetes CronJobs (ingestion + DBT)

Quick Start

🚀 New here? Start with GETTING-STARTED.md (30 minutes)

This guide gets you from zero to a running platform with:

  • All services running locally (Docker)
  • Sample data ingested and transformed
  • API serving queries
  • Full understanding of data flow

30-Second Overview

# 1. Install & start services
make install-dev
make docker-up
make init-infra

# 2. Run pipeline
make ingest-binance  # Fetch from Binance API
make dbt-run         # Transform Bronze → Silver → Gold

# 3. Query API
make api-dev
curl http://localhost:8001/v1/instruments?limit=5

Next: Follow GETTING-STARTED.md for detailed walkthrough.

New Team Member Onboarding

Week 1 Plan: docs/development/DEVELOPER-ONBOARDING.md

Day-by-day learning path to go from zero to shipping features:

  • Day 1: Understand the "why" (read ADRs, trace data flow)
  • Day 2: Code deep dive (ingestion, DBT models)
  • Day 3: API & testing patterns
  • Day 4: Hands-on exercises (DBT, adding fields)
  • Day 5: Ship your first feature

Project Structure

k2-reference-data-platform/
├── src/refdata/              # Main application code
│   ├── ingestion/            # Exchange clients, Kafka producers
│   ├── api/                  # FastAPI endpoints
│   ├── query/                # DuckDB query helpers (bitemporal logic)
│   ├── common/               # Config, logging, DB connections
│   └── cli/                  # CLI commands
├── dbt/                      # DBT project
│   ├── models/
│   │   ├── bronze/           # Source definitions
│   │   ├── silver/           # Bitemporal SCD Type 2
│   │   └── gold/             # Symbology master
│   ├── macros/               # Custom bitemporal macros
│   └── tests/                # Data quality tests
├── tests/                    # Test suite
│   ├── unit/                 # Fast unit tests
│   ├── integration/          # Docker-based integration tests
│   └── e2e/                  # End-to-end pipeline tests
├── docs/                     # Documentation
│   ├── architecture/         # ADRs (Architecture Decision Records)
│   ├── api/                  # OpenAPI specs
│   ├── runbooks/             # Operational guides
│   └── development/          # Developer guides
├── infrastructure/           # Docker & Kubernetes configs
│   ├── docker/               # Dockerfiles
│   └── compose/              # docker-compose files
├── config/                   # Configuration files
│   └── schemas/              # Avro schemas
└── scripts/                  # Utility scripts

Development Workflow

Running Tests

# Unit tests only (fast, no Docker)
make test-unit

# Integration tests (requires Docker services)
make test-integration

# All tests
make test-all

# With coverage
make coverage

Code Quality

# Lint code
make lint

# Format code
make format

# Type checking
make type-check

# All quality checks
make quality

DBT Development

# Run all DBT models
make dbt-run

# Run specific model
cd dbt && dbt run --select silver_instruments

# Run DBT tests
make dbt-test

# Generate documentation
make dbt-docs

API Development

# Start API with auto-reload
make api-dev

# Access interactive docs
open http://localhost:8001/docs

# Access ReDoc
open http://localhost:8001/redoc

Key Concepts

Bitemporal Modeling

The platform tracks two temporal dimensions:

  1. Business Time (valid_from, valid_to): When the specification was effective in reality
  2. System Time (record_created_at, record_updated_at): When we learned about it

Example:

Jan 10, 9am:  Binance announces tick_size change effective Jan 15
Jan 11, 3pm:  We ingest (record_created_at=Jan 11, valid_from=Jan 15)
Jan 15, 12am: Change goes live
Jan 16, 8am:  Correction: "Actually effective Jan 14, 11pm"
              New record: record_created_at=Jan 16, valid_from=Jan 14 11pm

Query: "What was tick_size on Jan 14 10pm?" → Returns old value (corrected valid_from is Jan 14 11pm)

See ADR-001: Bitemporal Modeling for details.

Symbology Mapping

Exchanges use inconsistent symbols for the same instrument:

Exchange   BTC/USD Spot   Notes
─────────  ─────────────  ──────────────────────────
Binance    BTCUSDT        No separator, uses USDT
Kraken     XBT/USD        Uses XBT instead of BTC!
Coinbase   BTC-USD        Hyphen separator

Canonical IDs provide a unified format:

  • BTC-USD-SPOT
  • ETH-USDT-PERP
  • BTC-USD-FUT-20240329

See ADR-004: Symbology Mapping for details.
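The mapping above can be sketched as a small alias-driven function. The alias table and BASE-QUOTE-CLASS format follow the examples in this section, but the table here is deliberately minimal and illustrative: the authoritative mappings live in the Gold symbology master, and whether USDT collapses to USD depends on the instrument (note ETH-USDT-PERP above).

```python
# Illustrative sketch of cross-exchange symbol normalization.
# Alias table is simplified from the examples above; real mappings
# are maintained in the Gold symbology master.
ASSET_ALIASES = {"XBT": "BTC", "USDT": "USD"}  # exchange quirks -> canonical

def canonical_id(base: str, quote: str, instrument_class: str = "SPOT") -> str:
    """Map exchange-specific asset codes to a canonical instrument ID."""
    def norm(asset: str) -> str:
        return ASSET_ALIASES.get(asset.upper(), asset.upper())
    return f"{norm(base)}-{norm(quote)}-{instrument_class}"

# Kraken's XBT/USD and Binance's BTCUSDT resolve to the same instrument:
print(canonical_id("XBT", "USD"))   # -> BTC-USD-SPOT
print(canonical_id("BTC", "USDT"))  # -> BTC-USD-SPOT
```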

API Reference

Endpoints

Instruments:

GET /v1/instruments?exchange={ex}&symbol={sym}&as_of={timestamp}
GET /v1/instruments/{exchange}/{symbol}/history

Symbology:

GET /v1/symbology/{canonical_id}
GET /v1/symbology/resolve?exchange={ex}&symbol={sym}
GET /v1/symbology/search?base_asset={asset}&instrument_class={class}

Health:

GET /v1/health

Example Queries

Point-in-Time Query:

curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT&as_of=2024-01-15T10:00:00Z"

Audit Trail:

curl http://localhost:8001/v1/instruments/binance/BTCUSDT/history

Symbology Lookup:

curl http://localhost:8001/v1/symbology/BTC-USD-SPOT
# Returns: {"binance_symbol": "BTCUSDT", "kraken_symbol": "XBT/USD", ...}

Reverse Lookup:

curl "http://localhost:8001/v1/symbology/resolve?exchange=kraken&symbol=XBT/USD"
# Returns: {"canonical_id": "BTC-USD-SPOT", "base_asset": "BTC", ...}
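The curl examples above translate directly to client code. A minimal sketch of building the point-in-time query URL, assuming the local host/port from the examples (the instruments_url helper is hypothetical, not part of the platform's client library); note that timestamp colons must be percent-encoded, which urlencode handles:

```python
# Hypothetical helper for the point-in-time endpoint shown above.
# BASE_URL matches the local examples; adjust for your deployment.
from urllib.parse import urlencode

BASE_URL = "http://localhost:8001"

def instruments_url(exchange: str, symbol: str, as_of: str) -> str:
    """Build a point-in-time query URL for a single instrument."""
    params = urlencode({"exchange": exchange, "symbol": symbol, "as_of": as_of})
    return f"{BASE_URL}/v1/instruments?{params}"

url = instruments_url("binance", "BTCUSDT", "2024-01-15T10:00:00Z")
print(url)  # colons in the timestamp are encoded as %3A
```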

Architecture Decisions

Key architectural decisions are documented as ADRs (Architecture Decision Records):

  1. ADR-001: Bitemporal Modeling

    • Why dual temporality (business + system time)?
    • How to handle late corrections?
  2. ADR-002: Ingestion Strategy

    • Polling vs streaming
    • Idempotency and change detection
  3. ADR-003: DBT vs Spark

    • Why DBT for transformations?
    • SCD Type 2 implementation
  4. ADR-004: Symbology Mapping

    • Canonical ID format
    • Cross-exchange normalization
  5. ADR-005: Schema Evolution

    • Handling new exchange API fields
    • Backward compatibility strategy

Roadmap

Phase 1: Foundation (Weeks 1-6) ✅ COMPLETE

  • Phase 1A: Project scaffolding and ADRs (Week 1)
  • Phase 1B: Bronze ingestion (Binance + Kraken) (Week 2)
  • Phase 1C: DBT Silver transformations + Gold symbology (Weeks 3-5)
  • Phase 1D: FastAPI query layer (Week 4)
  • Phase 1F: Documentation and operational readiness (Week 6)

Phase 2: Expansion (Weeks 7-12)

  • Add Bybit exchange
  • Add Coinbase exchange
  • Implement manual override workflow
  • Grafana dashboards
  • Data quality alerting

Phase 3: Advanced Features (Weeks 13-18)

  • Options and futures support
  • Historical data backfill
  • GraphQL API
  • Real-time change notifications (WebSocket)

Contributing

See CLAUDE.md for development standards and guidelines.

Before Submitting a PR

  • All tests pass (make test-all)
  • Code quality checks pass (make quality)
  • ADR written if architectural decision made
  • API docs updated if endpoints changed
  • DBT docs updated if models changed
  • Runbook updated if operational impact

License

MIT License - see LICENSE file for details.

Contact


Built with ❤️ by the K2 Engineering Team

Demonstrating staff-level data engineering excellence through simplicity, correctness, and maintainability.
