DuckPond is a query-native filesystem for time-series data, built on Apache Arrow, DataFusion, and Delta Lake. Every filesystem object can be queried with SQL, and SQL queries create new filesystem objects that appear as native files and directories.
Built by the Caspar Water System.
DuckPond gives you a transactional filesystem where files are first-class data. It has two operating modes:
- Pond mode (`$POND`): A persistent, transactional filesystem backed by Delta Lake. Every write is atomic. Files can be raw bytes, queryable Parquet tables, or multi-version time-series. The pond replicates to S3-compatible storage for backup and cross-machine access.
- Host mode (`host+`): A read-only view of the local filesystem where the same query and factory tools work directly on host files -- no pond initialization required.
Both modes use the same URL scheme system and the same SQL engine.
A CSV file on your local disk and a time-series inside a remote
pond backup are queried with the same pond cat --sql syntax.
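As a concrete illustration, the sketch below runs one query in each mode. The file paths are hypothetical, and the script only attempts the commands when the pond binary is on PATH:

```shell
# Identical --sql syntax in host mode and pond mode.
# Paths are hypothetical; "source" is the table name used for the
# queried object elsewhere in this README.
SQL='SELECT count(*) AS n FROM source'

if command -v pond >/dev/null 2>&1; then
  # Host mode: query a CSV straight off the local disk
  pond cat host+csv:///tmp/data.csv --format=table --sql "$SQL"
  # Pond mode: the same query against a table stored in the pond
  pond cat /data/readings --sql "$SQL"
else
  echo "pond not on PATH; skipping demo"
fi
```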
Every pond can push incremental backups to S3-compatible storage (MinIO, AWS S3). From any other machine, you can:
- Discover remote ponds: `pond run host+remote:///config.yaml list-ponds`
- Browse backup contents: `pond run host+remote:///config.yaml show`
- Import a subtree into a local pond for querying
The host+remote:// pattern lets you point at a YAML config file on
your local disk and interact with a remote backup without initializing
a pond first. This makes it easy to inspect what a remote machine has
collected before deciding what to pull down.
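The exact schema of that YAML file belongs to the remote factory and is not shown here; as a loose illustration only (every key below is hypothetical), it bundles what the tool needs to reach an S3-compatible backup:

```yaml
# Hypothetical keys -- consult the remote factory's documentation for
# the real schema. The point is that everything needed to reach the
# backup lives in one local YAML file.
endpoint: https://minio.example.com
bucket: pond-backups
region: us-east-1
# Credentials are typically supplied via the environment, not YAML.
```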
# Build
make build
# Run unit tests
make test
# Initialize a pond and try it out
export POND=/tmp/mypond
pond init
pond mkdir /data
echo "hello" | pond copy - /data/greeting.txt
pond list /**
pond cat /data/greeting.txt
# Query a local CSV with SQL (no pond needed)
pond cat host+csv:///tmp/data.csv --format=table --sql "SELECT * FROM source"
# Run a factory from a local config file (no pond needed)
pond run host+remote:///path/to/backup-config.yaml list-ponds

- Rust stable toolchain (see rust-toolchain.toml)
- Docker (for integration tests and site deployment)
- Node.js >= 22 (for browser tests and vendor download)
make build # Build pond binary (debug)
make test # Run all unit tests
make integration # Build Docker test image + run integration tests
make check        # fmt + clippy + test (CI equivalent)

Run make with no arguments to see all available targets.
Download JavaScript vendor dependencies for offline site generation:
make vendor       # Downloads DuckDB-WASM, Observable Plot, D3

This populates crates/sitegen/vendor/dist/ (gitignored, ~35MB).
After this, pond run sitegen build produces sites that work without
network access.
crates/
tinyfs/ Pure filesystem abstractions (FS, WD, Node, path resolution)
tlogfs/ Delta Lake persistence (OpLog, transactions, DataFusion)
steward/ Transaction orchestration, control table, factory execution
provider/ URL-based data access, factory registry, table providers
cmd/ CLI commands (pond init/list/cat/copy/run/...)
sitegen/ Static site generator (factory)
remote/ S3 backup & replication (factory)
hydrovu/ HydroVu API collector (factory)
utilities/ Shared helpers (glob, chunked files, perf tracing)
scripts/ Shared deployment scripts
testsuite/ Integration tests (Docker-based)
tests/ Individual test scripts (NNN-description.sh)
browser/ Puppeteer browser validation tests
docs/ Architecture and design documentation
water/ Water monitoring demo site
septic/ Septic system demo site
noyo/ Noyo Harbor demo site
See docs/duckpond-overview.md for the full architecture description. Key layers (bottom to top):
| Layer | Crate | Role |
|---|---|---|
| Filesystem | tinyfs | Pure abstractions: FS, WD, Node, path resolution |
| Persistence | tlogfs | Delta Lake storage, OpLog, DataFusion integration |
| Orchestration | steward | Transactions, control table, factory lifecycle |
| Data Access | provider | URL schemes, factory registry, table providers |
| CLI | cmd | User-facing commands |
See docs/cli-reference.md for the complete command reference. Common commands:
pond init # Create a new pond
pond list '/**' # List all entries
pond cat /path/to/file # Read a file
pond cat --sql "SELECT * FROM source WHERE ..." /path # Query a table
pond copy host:///local/file /pond/path # Import a file
pond copy host+series:///data.parquet /pond/series # Import time-series
pond mkdir /dir # Create a directory
pond mknod <factory> /path --config-path config.yaml # Install a factory
pond run /path/to/factory <command> # Execute a factory
pond log                                             # Transaction history

Host mode (no pond required):
pond cat host+csv:///tmp/data.csv --format=table # Query a local CSV
pond run host+remote:///config.yaml list-ponds # Browse S3 backups
pond run host+sitegen:///site.yaml build ./dist      # Generate a site

Tests live in testsuite/tests/ as numbered shell scripts. Each test
runs in a fresh Docker container with the pond binary:
make test-image # Build the test Docker image
make integration # Run all tests (skips browser tests)
make integration-all # Run all tests including browser
# Run a single test
cd testsuite && ./run-test.sh 201
# Run interactively (explore in container)
cd testsuite && ./run-test.sh --interactive

Each demo site (water/, septic/, noyo/) rsyncs data from its remote machine and runs everything locally:
# First time: configure your site
cp water/deploy.env.example water/deploy.env
# Edit deploy.env with your remote host and S3 credentials
# Site workflow (all run locally)
cd water
./setup-local.sh # rsync data + init pond + install factories
./run-local.sh # rsync new data + ingest
./generate-local.sh # build static site + preview
./update-local.sh    # after editing YAML/templates

Credentials are kept in deploy.env (gitignored), never in the YAML
configs checked into the repository. Remote machines use container
images built by GitHub Actions.
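A deploy.env file might look roughly like the sketch below; the variable names are illustrative, not the actual schema — match whatever the deploy scripts in this repository expect:

```shell
# water/deploy.env -- gitignored; variable names here are hypothetical.
REMOTE_HOST=water.example.com
S3_ENDPOINT=https://minio.example.com
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
```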
| Document | Contents |
|---|---|
| CLI Reference | Complete command syntax and examples |
| Architecture Overview | System design and crate map |
| System Patterns | Transaction model, factories, providers |
| Sitegen Design | Static site generator architecture |
| Cross-Pond Import | Foreign pond import status |
| Large File Storage | Content-addressed storage for large files |
| Releasing | Release process and supply chain security |
Apache-2.0 — see LICENSES/ for details.
