Provides a secure, lineage-aware, metadata-rich interface to heterogeneous datasets (PostgreSQL, object storage, filesystem). Exposes a DCAT-AP 3.0 compatible catalogue, a governed SQL query interface, and OpenLineage-integrated provenance, designed to support Digital Twins, analytical applications, and DSSC-aligned dataspace participants.
Public catalogue endpoint returning application/ld+json responses conforming to DCAT-AP 3.0.
- GET /catalogue: full catalogue as a dcat:Catalog node with embedded dcat:Dataset and dcat:Distribution nodes
- GET /catalogue/{id}: single dataset by ID
- POST /catalogue/search: filtered search by q, access_level, keywords
Each dataset includes dct:publisher, dcat:theme, dct:language, dct:spatial, dct:accrualPeriodicity, and odrl:hasPolicy on every distribution. Publisher URI is derived from governance.yaml; the fallback is settings.catalog_uri.
The downloadURL is only present on distributions with access_level: open. All other distributions require negotiating access through a dataspace connector.
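The download URL rule can be sketched as a small builder for a distribution node. This is a minimal illustration, not the project's serializer: the function name `build_distribution`, its parameters, and the exact JSON-LD shape of the node are assumptions; only the rule itself (policy on every distribution, `downloadURL` only for `open`) comes from the text above.

```python
from typing import Any


def build_distribution(
    dist_id: str, access_level: str, download_url: str, policy_id: str
) -> dict[str, Any]:
    """Build a dcat:Distribution node (hypothetical helper, assumed node shape)."""
    node: dict[str, Any] = {
        "@id": dist_id,
        "@type": "dcat:Distribution",
        # Every distribution carries a policy reference, whatever its access level.
        "odrl:hasPolicy": {"@id": policy_id},
    }
    # Only open distributions advertise a direct download location.
    if access_level == "open":
        node["dcat:downloadURL"] = {"@id": download_url}
    return node
```

A consumer that finds no `dcat:downloadURL` on a distribution should treat that as a signal to go through the dataspace connector rather than as an error.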
SQL SELECT queries over exposed datasets with strict validation, server-side pagination, hard row caps, and row-level filters.
- POST /query: accepts {"sql": "SELECT ...", "limit": 50, "offset": 0}
- Validates SQL (SELECT-only, table allowlist)
- Enforces LIMIT/OFFSET server-side
- Applies row-level filter plans from governance handlers (OPA, HTTP allow-list, direct user match)
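Server-side enforcement of LIMIT/OFFSET with a hard row cap might look like the sketch below. The constants `MAX_ROWS` and `DEFAULT_LIMIT` and the function `clamp_pagination` are hypothetical; the actual cap and defaults used by the API are not stated here.

```python
from typing import Optional

MAX_ROWS = 1000      # assumed hard row cap, not the project's real value
DEFAULT_LIMIT = 100  # assumed default when the client sends no limit


def clamp_pagination(
    limit: Optional[int], offset: Optional[int], max_rows: int = MAX_ROWS
) -> tuple[int, int]:
    """Return a (limit, offset) pair that is safe to apply to the query."""
    effective_limit = DEFAULT_LIMIT if limit is None else limit
    # Never return fewer than one row or more than the hard cap.
    effective_limit = max(1, min(effective_limit, max_rows))
    # Negative or missing offsets fall back to the first page.
    effective_offset = max(0, offset or 0)
    return effective_limit, effective_offset
```

Clamping instead of rejecting keeps well-meaning clients working while still guaranteeing the cap; a stricter service could return a 4xx error for out-of-range values instead.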
When queries arrive through the EDC data plane (via an Endpoint Data Reference), the API detects the EDR context via the Edc-Contract-Agreement-Id header and switches to a dataspace-specific enforcement path.
Enabled by:
EDR_ENABLED=true
CONNECTOR_INTERNAL_URL=http://ds-connector:30001
EDR query flow:
- Detects the Edc-Contract-Agreement-Id and Edc-Bpn headers
- Calls ds-connector GET /internal/agreements/{id}/status to check that the agreement is active
- If the dataset has a user_filter_column, calls ds-connector GET /internal/consent/check to retrieve the list of subject IDs the consumer has consent for
- Injects an SQL IN (subject_ids) predicate or a deny plan into the row filter pipeline
- Skips the Keycloak/OPA path entirely, because the EDC data plane already validated the EDR JWT
This path requires no JWT re-validation by dataset-api since the EDC data plane validates the bearer token before proxying.
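The EDR decision logic can be sketched as a pure function with the two connector calls injected as callables, which keeps the example free of network access. Everything named here (`FilterPlan`, `plan_edr_filter`, the callable signatures) is hypothetical, and the choice to deny when the agreement is inactive or no subjects are consented is an assumption about when the deny plan applies.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class FilterPlan:
    allow: bool
    column: Optional[str] = None
    subject_ids: tuple[str, ...] = ()


def plan_edr_filter(
    headers: Mapping[str, str],
    user_filter_column: Optional[str],
    agreement_is_active: Callable[[str], bool],
    consented_subjects: Callable[[str, Optional[str]], Sequence[str]],
) -> Optional[FilterPlan]:
    """Return a filter plan for EDR requests, or None for non-EDR requests."""
    # Note: a plain dict is case-sensitive; real HTTP header maps usually are not.
    agreement_id = headers.get("Edc-Contract-Agreement-Id")
    if agreement_id is None:
        return None  # not an EDR request: the regular Keycloak/OPA path applies
    if not agreement_is_active(agreement_id):
        return FilterPlan(allow=False)  # assumed: inactive agreement means deny
    if user_filter_column is None:
        return FilterPlan(allow=True)  # no per-subject filtering on this dataset
    subjects = tuple(consented_subjects(agreement_id, headers.get("Edc-Bpn")))
    if not subjects:
        return FilterPlan(allow=False)  # assumed: no consented subjects means deny
    return FilterPlan(allow=True, column=user_filter_column, subject_ids=subjects)
```

In a real service the two callables would wrap the `GET /internal/agreements/{id}/status` and `GET /internal/consent/check` requests to the connector.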
Access levels:
- open: no authentication required; downloadURL exposed in DCAT
- internal: JWT required; ds:accessScope eq "dataspaces.query" constraint in ODRL
- restricted: JWT + contract required; ds:contractRequired eq "true" in ODRL
- secret: not exposed in catalogue or EDC
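The access levels above can be read as a lookup table. The sketch below encodes only what the list states; the names `AccessRule`, `ACCESS_RULES`, and `rule_for` are hypothetical, and representing a constraint as a `(left operand, operator, right operand)` tuple is a simplification of the ODRL JSON-LD form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessRule:
    jwt_required: bool
    contract_required: bool
    # Simplified ODRL constraint: (left operand, operator, right operand).
    odrl_constraint: Optional[tuple[str, str, str]]


ACCESS_RULES = {
    "open": AccessRule(False, False, None),
    "internal": AccessRule(True, False, ("ds:accessScope", "eq", "dataspaces.query")),
    "restricted": AccessRule(True, True, ("ds:contractRequired", "eq", "true")),
}


def rule_for(access_level: str) -> Optional[AccessRule]:
    """Return the rule for a level, or None when the dataset is never exposed."""
    if access_level == "secret":
        return None  # secret datasets appear in neither the catalogue nor EDC
    try:
        return ACCESS_RULES[access_level]
    except KeyError:
        raise ValueError(f"unknown access level: {access_level!r}") from None
```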
Row-level filtering is applied through the pluggable governance handler registry. When user_filter_column is set and consent_required: true, the consent handler injects an IN (subject_ids) predicate.
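One way to render such a predicate safely is with named bind parameters rather than string interpolation of the subject IDs. The helper `build_consent_predicate`, the `subject_N` parameter names, and the `1 = 0` deny clause for an empty list are illustrative assumptions, not the handler's actual output.

```python
import re
from typing import Sequence

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_consent_predicate(
    column: str, subject_ids: Sequence[str]
) -> tuple[str, dict[str, str]]:
    """Return (sql_fragment, bind_params) restricting rows to consented subjects."""
    # The column name is interpolated into SQL, so accept plain identifiers only.
    if not _IDENTIFIER.fullmatch(column):
        raise ValueError(f"invalid column name: {column!r}")
    if not subject_ids:
        # An empty IN () is invalid SQL; use an always-false clause to deny.
        return "1 = 0", {}
    params = {f"subject_{i}": sid for i, sid in enumerate(subject_ids)}
    placeholders = ", ".join(f":{name}" for name in params)
    return f"{column} IN ({placeholders})", params
```

The `:name` placeholder style matches what SQLAlchemy's `text()` construct accepts, so the fragment and parameters could be combined with a validated query before execution.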
- OpenLineage ingestion via Marquez
- Namespace-based dataset grouping
- Governance facets embedded in lineage events (userFilterColumn, medallion, classification)
- Provenance surfaced in catalogue metadata
- JSON Schema (2020-12) generated from physical tables
- Column-level metadata for UI and clients
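Generating a JSON Schema (2020-12) document from a table's columns could be sketched as follows. The type mapping, the treatment of nullable columns, and the function name `table_to_json_schema` are all assumptions made for illustration; the project's real generator may map types differently.

```python
from typing import Any

# Assumed, deliberately small mapping from SQL type names to JSON Schema types.
SQL_TO_JSON_TYPE = {
    "integer": "integer",
    "bigint": "integer",
    "numeric": "number",
    "double precision": "number",
    "boolean": "boolean",
    "text": "string",
    "varchar": "string",
}


def table_to_json_schema(
    table: str, columns: list[tuple[str, str, bool]]
) -> dict[str, Any]:
    """Build a schema from (column_name, sql_type, nullable) tuples."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, sql_type, nullable in columns:
        # Unknown SQL types fall back to "string" in this sketch.
        json_type = SQL_TO_JSON_TYPE.get(sql_type.lower(), "string")
        properties[name] = {"type": [json_type, "null"] if nullable else json_type}
        if not nullable:
            required.append(name)
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": table,
        "type": "object",
        "properties": properties,
        "required": required,
    }
```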
- GET /catalogue: DCAT-AP catalogue (application/ld+json)
- GET /catalogue/{id}: single dataset
- POST /catalogue/search: filtered search
- POST /query: governed SQL query; EDR-gated when EDR_ENABLED=true
- GET /admin/catalogue: catalogue import (CLI-only)
- GET /health
The CLI is the primary control plane for the Dataset API:
dataset-cli --help
Main commands:
- export openlineage: extract lineage from Marquez
- export governance: export governance rules to dataset entries
- import catalogue: validate and import dataset catalogue
- validate catalogue: schema validation only
- ontology: ontology fetch, analysis, tree generation
The export governance command reads governance.yaml files and propagates dcat: and dataspace: blocks to DatasetEntry records. The expose: true field on a source entry controls whether the dataset is visible in the catalogue and registered in EDC.
Dataset-api reads governance rules resolved by celine-utils GovernanceResolver. The following extended blocks are supported:
dcat: block — DCAT-AP metadata:
- publisher_uri: overrides the settings-level fallback
- themes: dcat:theme URIs (EU Publications Office vocabulary)
- language_uris: dct:language URIs
- spatial_uris: dct:spatial URIs
- accrual_periodicity: dct:accrualPeriodicity URI
- conforms_to: dct:conformsTo URI
- temporal: dct:temporal with start and end dates
dataspace: block — access control and ODRL hints:
- contract_required: adds ds:contractRequired constraint to ODRL
- consent_required: adds ds:consentStatus eq active constraint
- odrl_action: default action for the ODRL offer
- purpose: purpose values for ODRL purpose constraints
- medallion: data quality level
expose: true on the source entry (top-level, not under dataspace:) makes the dataset visible in the catalogue.
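As an illustration of how these blocks fit together, the sketch below represents a parsed source entry as a Python dict and resolves two of the rules described above. The dict layout, the example values, and the helpers `is_exposed` and `resolve_publisher_uri` are hypothetical; only the field names and the fallback to the settings-level catalogue URI come from this document.

```python
from typing import Any, Mapping


def is_exposed(entry: Mapping[str, Any]) -> bool:
    """A dataset is visible only when the top-level expose flag is true."""
    return entry.get("expose") is True


def resolve_publisher_uri(entry: Mapping[str, Any], catalog_uri: str) -> str:
    """Prefer dcat.publisher_uri; otherwise fall back to the catalogue URI."""
    dcat_block = entry.get("dcat") or {}
    return dcat_block.get("publisher_uri") or catalog_uri


# Hypothetical parsed entry; all values are placeholders.
example_entry = {
    "expose": True,  # top-level, not nested under "dataspace"
    "dcat": {"publisher_uri": "https://example.org/org/acme"},
    "dataspace": {"contract_required": True, "consent_required": False},
}
```

With `example_entry`, `resolve_publisher_uri(example_entry, "https://example.org/catalogue")` yields the entry's own publisher URI, while an entry without a `dcat` block would receive the catalogue URI passed as the fallback.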
- Python ≥ 3.11
- Async SQLAlchemy
- Pydantic v2
- FastAPI + httpx
- sqlglot-based SQL validation
Before opening a PR:
- validate all YAML definitions
- add tests for new API behaviour
- include migrations for schema changes
- keep docs in sync with API behaviour
Copyright © 2025 Spindox Labs
Licensed under the Apache License, Version 2.0.