An ATProto AppView for the science.alt.dataset lexicon namespace. It indexes dataset metadata published across the AT Protocol network and serves it through XRPC endpoints — enabling discovery, search, and resolution of datasets, schemas, labels, and lenses.
In the AT Protocol architecture, an AppView is a service that subscribes to the network firehose, indexes records it cares about, and exposes query endpoints for clients. atdata-app does this for scientific and ML dataset metadata:
- Schemas define the structure of datasets (JSON Schema, Arrow schema, etc.)
- Dataset entries describe a dataset — its name, storage location, schema, tags, license, and size
- Labels are human-readable version tags pointing to a specific dataset entry (like git tags)
- Lenses are bidirectional schema transforms with getter/putter code for migrating data between schema versions
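To make the record types concrete, here is an illustrative sketch of a dataset entry. The field names below are assumptions for illustration only; the lexicon JSON in `lexicons/` is authoritative.

```python
# Hypothetical science.alt.dataset entry record. Field names mirror the
# description above (name, storage location, schema, tags, license, size)
# but are illustrative, not the authoritative lexicon definition.
dataset_entry = {
    "$type": "science.alt.dataset.entry",  # assumed NSID for illustration
    "name": "mnist-variant",
    "storage": "s3://example-bucket/mnist-variant/",
    "schema": "at://did:plc:abc123/science.alt.dataset.schema/3jui7kd54zh2y",
    "tags": ["vision", "benchmark"],
    "license": "CC-BY-4.0",
    "sizeBytes": 11594722,
}

print(sorted(dataset_entry))
```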
```
ATProto Network
  │
  ├── Jetstream (WebSocket firehose) ──► Real-time ingestion
  │                                            │
  └── BGS Relay (HTTP backfill) ──────► Historical backfill
                                               │
                                               ▼
                                          PostgreSQL
                                               │
                                               ▼
                                  XRPC Query Endpoints ──► Clients
```
- Python 3.12+
- PostgreSQL 14+
- uv package manager
```bash
# Install dependencies
uv sync --dev

# Initialize the lexicon submodule
git submodule update --init

# Set up PostgreSQL (schema auto-applies on startup)
createdb atdata_app

# Start the server
uv run uvicorn atdata_app.main:app --reload
```

The server starts with dev-mode defaults: `http://localhost:8000`, DID `did:web:localhost%3A8000`. On startup it connects to Jetstream and begins indexing `science.alt.dataset.*` records, and runs a one-shot backfill of historical records from the BGS relay.
All settings are environment variables prefixed with `ATDATA_`, managed by pydantic-settings.
| Variable | Default | Description |
|---|---|---|
| `ATDATA_HOSTNAME` | `localhost` | Public hostname, used to derive the did:web identity |
| `ATDATA_PORT` | `8000` | Server port (included in the DID in dev mode) |
| `ATDATA_DEV_MODE` | `true` | Dev mode uses `http://` and includes the port in the DID; production uses `https://` |
| `ATDATA_DATABASE_URL` | `postgresql://localhost:5432/atdata_app` | PostgreSQL connection string |
| `ATDATA_JETSTREAM_URL` | `wss://jetstream2.us-east.bsky.network/subscribe` | Jetstream WebSocket endpoint |
| `ATDATA_JETSTREAM_COLLECTIONS` | `science.alt.dataset.*` | Collections to subscribe to |
| `ATDATA_RELAY_HOST` | `https://bsky.network` | BGS relay for backfill DID discovery |
The service derives its did:web identity from the hostname and port:
- Dev mode: `did:web:localhost%3A8000` with endpoint `http://localhost:8000`
- Production: `did:web:datasets.example.com` with endpoint `https://datasets.example.com`
The DID document is served at `GET /.well-known/did.json` and advertises the service as an `AtprotoAppView`.
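The derivation can be sketched as follows. This is a minimal illustration; the service's real DID document likely carries more fields than shown here.

```python
from urllib.parse import quote

def derive_did_web(hostname: str, port: int, dev_mode: bool) -> tuple[str, str]:
    """Derive the did:web identifier and service endpoint.
    In dev mode the port is part of the identity, with ':' percent-encoded
    as %3A, as the did:web method requires."""
    if dev_mode:
        did = "did:web:" + quote(f"{hostname}:{port}", safe="")
        endpoint = f"http://{hostname}:{port}"
    else:
        did = f"did:web:{hostname}"
        endpoint = f"https://{hostname}"
    return did, endpoint

did, endpoint = derive_did_web("localhost", 8000, dev_mode=True)
print(did)       # did:web:localhost%3A8000
print(endpoint)  # http://localhost:8000

# A minimal DID document of the kind served at /.well-known/did.json
# (illustrative shape, not the service's exact output):
did_doc = {
    "@context": ["https://www.w3.org/ns/did/v1"],
    "id": did,
    "service": [{
        "id": "#atdata_appview",
        "type": "AtprotoAppView",
        "serviceEndpoint": endpoint,
    }],
}
```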
See docs/api-reference.md for the full XRPC endpoint reference (queries, procedures, and other routes).
See docs/data-model.md for the database schema (schemas, entries, labels, lenses).
The app ships with a multi-stage Dockerfile using uv for fast dependency installation.
```bash
docker build -t atdata-app .

docker run -p 8000:8000 \
  -e ATDATA_DATABASE_URL=postgresql://user:pass@host:5432/atdata_app \
  -e ATDATA_HOSTNAME=localhost \
  -e ATDATA_DEV_MODE=true \
  atdata-app
```

The repo includes a `railway.toml` that configures the Dockerfile builder, health checks at `/health`, and a restart-on-failure policy.
- Connect the repo to a Railway project
- Add a PostgreSQL service and link it
- Set the required environment variables:
| Variable | Value |
|---|---|
| `ATDATA_DATABASE_URL` | Provided by Railway's PostgreSQL plugin (`${{Postgres.DATABASE_URL}}`) |
| `ATDATA_HOSTNAME` | Your Railway public domain (e.g. `atdata-app-production.up.railway.app`) |
| `ATDATA_DEV_MODE` | `false` |
| `ATDATA_PORT` | Omit; Railway sets `PORT` automatically and the container respects it |
Optional variables for ingestion tuning:
| Variable | Default | Description |
|---|---|---|
| `ATDATA_JETSTREAM_URL` | `wss://jetstream2.us-east.bsky.network/subscribe` | Jetstream endpoint |
| `ATDATA_RELAY_HOST` | `https://bsky.network` | BGS relay for backfill |
Railway will auto-deploy on push, build the Docker image, and start the container.
```bash
# Run tests (no database required)
uv run pytest

# Run a single test
uv run pytest tests/test_models.py::test_parse_at_uri -v

# Run with coverage
uv run pytest --cov=atdata_app

# Lint
uv run ruff check src/ tests/
```

Tests mock all external dependencies (database, HTTP, identity resolution) using `unittest.mock.AsyncMock`. HTTP endpoint tests use httpx `ASGITransport` for in-process testing without a running server.
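The mocking approach can be sketched with stdlib pieces alone. The handler and attribute names below are hypothetical, not the app's real internals; the point is that an `AsyncMock` stands in for the database pool so no PostgreSQL is needed.

```python
import asyncio
from unittest.mock import AsyncMock

async def get_dataset_count(db) -> int:
    """Hypothetical handler: counts indexed dataset entries via a db pool."""
    row = await db.fetchrow("SELECT count(*) AS n FROM entries")
    return row["n"]

async def main() -> None:
    # The pool is an AsyncMock with a canned row instead of a live database.
    db = AsyncMock()
    db.fetchrow.return_value = {"n": 42}

    assert await get_dataset_count(db) == 42
    db.fetchrow.assert_awaited_once()

asyncio.run(main())
print("ok")
```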
The `lexicons/` directory is a git submodule containing the authoritative `science.alt.dataset.*` lexicon schemas. Initialize it with:

```bash
git submodule update --init
```

The lexicons are for reference and CI validation. The Python source code uses hardcoded NSID constants and does not read the lexicon JSON files at runtime.
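As a sketch of the kind of NSID constants and AT-URI parsing the test suite refers to (`test_parse_at_uri`), here is an illustrative helper; it is not the repo's actual implementation, and the NSID value is an assumption.

```python
from typing import NamedTuple

# Hardcoded NSID constant of the kind the source uses (illustrative value).
NSID_ENTRY = "science.alt.dataset.entry"

class AtUri(NamedTuple):
    did: str
    collection: str
    rkey: str

def parse_at_uri(uri: str) -> AtUri:
    """Split an at:// URI into (did, collection, rkey)."""
    if not uri.startswith("at://"):
        raise ValueError(f"not an AT URI: {uri!r}")
    did, collection, rkey = uri.removeprefix("at://").split("/")
    return AtUri(did, collection, rkey)

print(parse_at_uri(f"at://did:plc:abc123/{NSID_ENTRY}/3jui7kd54zh2y"))
```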
MIT