
DuckLake Demo: The Local Lakehouse

A complete demonstration of building a production-grade data lakehouse on your laptop using DuckDB, DuckLake, dbt, and SQLMesh.

Overview

This project demonstrates the "Local Lakehouse" architecture - a modern data stack that runs entirely on your local machine while providing enterprise-grade capabilities including ACID transactions, time travel, and sophisticated data transformations.

Key Technologies:

  • DuckDB: High-performance analytical database engine
  • DuckLake: Open table format providing ACID transactions and versioning
  • dbt: Data transformation and testing framework
  • SQLMesh: Next-generation data transformation and orchestration
  • Streamlit: Interactive data visualization dashboard

Dataset

Ondoriya Synthetic World Dataset

  • 1,425,690 people across 200 regions
  • 4 political factions with demographic distributions
  • Regional biome classifications and political control data
  • Age demographics and population distributions

Architecture

Data Flow

Raw CSVs → DuckLake Bronze → Silver (Staging) → Gold (Marts) → Dashboard

Layer Descriptions

Bronze Layer: Raw data ingested as-is into DuckLake format

  • ACID compliant storage
  • Time travel capabilities
  • Schema enforcement
  • Automatic compression and optimization

Silver Layer: Cleaned and standardized data

  • Column name standardization
  • Data type corrections
  • Basic validation and quality checks
  • Optimized for downstream consumption

Gold Layer: Business-ready analytical tables

  • Aggregated metrics and KPIs
  • Dimensional models
  • Regional and faction analytics
  • Performance optimized for querying

Quick Start

Prerequisites

# Python environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install duckdb ducklake dbt-duckdb sqlmesh streamlit plotly pandas

1. Data Ingestion

cd data
python download_data.py

cd 02_bronze_ingestion/
python raw_to_ducklake.py

2. dbt Pipeline

cd 03_dbt_pipeline/
dbt run
dbt test

3. SQLMesh Pipeline

cd 04_sqlmesh_pipeline/
sqlmesh plan dev
# Type 'y' when prompted to apply changes

4. Launch Dashboard

cd 05_streamlit_dashboard/
streamlit run dashboard.py

Key Features Demonstrated

DuckLake Capabilities

  • ACID Transactions: Guaranteed data consistency
  • Time Travel: Query historical versions of tables
  • Schema Evolution: Handle schema changes gracefully
  • Automatic Optimization: Background compaction and optimization

Transformation Patterns

  • dbt Approach: SQL-first, testing-oriented, mature ecosystem
  • SQLMesh Approach: Next-generation features, virtual environments, advanced state management
  • Side-by-side Comparison: Same business logic implemented in both frameworks

Analytics Capabilities

  • Regional population analysis
  • Faction power distribution
  • Demographic trends and breakdowns
  • Data quality monitoring
  • Interactive visualizations

Performance Highlights

  • Ingestion: 1.4M records in under 30 seconds
  • Transformations: Complex joins and aggregations in seconds
  • Queries: Sub-second response times on analytical workloads
  • Storage: 80%+ compression compared to raw CSV

Use Cases

This architecture is ideal for:

Development & Testing

  • Local development of data pipelines
  • Testing transformation logic
  • Prototyping analytical models

Edge Analytics

  • Embedded analytics in applications
  • IoT data processing
  • Remote location analytics

Cost-Sensitive Environments

  • Startups and small teams
  • Proof-of-concept projects
  • Development environments

Data Science Workflows

  • Feature engineering pipelines
  • Model training data preparation
  • Analytical research projects

Scaling Considerations

Excellent for (datasets under ~1 TB):

  • Analytical workloads
  • Batch processing
  • Single-node performance
  • Development workflows

Consider alternatives for:

  • Multi-petabyte datasets
  • High-concurrency OLTP
  • Distributed processing requirements
  • Multi-region deployments

dbt vs SQLMesh Comparison

Feature              | dbt                   | SQLMesh
---------------------|-----------------------|---------------------
Learning Curve       | Moderate              | Steep
Testing              | Built-in framework    | Advanced auditing
State Management     | External orchestrator | Built-in incremental
Virtual Environments | Limited               | Advanced
Column Lineage       | Third-party tools     | Native support
Ecosystem            | Mature, extensive     | Growing, modern

Configuration Details

DuckLake Connection

import duckdb

con = duckdb.connect(":memory:")
con.execute("INSTALL ducklake; LOAD ducklake")  # first run downloads the extension
con.execute("ATTACH 'ducklake:./data/catalog.ducklake' AS my_lake")

dbt profiles.yml

ondoriya_dbt:
  outputs:
    dev:
      type: duckdb
      path: "ducklake:/Volumes/External/developyr/source/ducklake-demo/data/catalog.ducklake"
      extensions:
        - ducklake
      schema: bronze
      threads: 1

    prod:
      type: duckdb
      path: prod.duckdb
      threads: 4

  target: dev

SQLMesh config.yaml

gateways:
  duckdb:
    connection:
      type: duckdb

      catalogs:
        my_lake:
          type: ducklake
          path: /Volumes/External/developyr/source/ducklake-demo/data/catalog.ducklake
          data_path: /Volumes/External/developyr/source/ducklake-demo/data/catalog_data/
      extensions:
        - ducklake
    state_connection:
      type: duckdb
      database: /Volumes/External/developyr/source/ducklake-demo/04_sqlmesh_pipeline/sqlmesh_state.db

default_gateway: duckdb

model_defaults:
  dialect: duckdb
  start: '2025-01-01'

Troubleshooting

Connection Issues

  • Verify DuckLake extension is installed: INSTALL ducklake;
  • Check file paths are absolute when needed
  • Ensure proper permissions on data directory

Performance Issues

  • Monitor memory usage with large datasets
  • Use appropriate materialization strategies
  • Consider partitioning for large tables

Schema Issues

  • Validate column names match between layers
  • Check data types in transformations
  • Use DESCRIBE to inspect table schemas

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • DuckDB team for the incredible analytical engine
  • DuckLake contributors for the open table format
  • dbt Labs for the transformation framework
  • Tobiko Data for SQLMesh innovation
  • Streamlit team for the visualization platform

Contact

For questions about this demo or speaking opportunities:


Built with ❤️ for the data community

About

Complete demo materials for 'The Local Lakehouse' talk. Build a production-grade analytics stack on your laptop using DuckDB, dbt, SQLMesh & DuckLake. Transform raw Ondoriya world data through bronze/silver/gold layers. 10x faster than cloud, 90% cheaper. Modern data engineering without the cloud bills.
