Summary
Generate a DCAT-AP dataset description after a pipeline run, describing what was produced: triple counts, distribution URLs, provider metadata.
Context
loda-pipeline generates a datasetdescription.ttl after each pipeline run containing:
- dcat:Dataset with title, description, publisher, license
- dcat:Distribution entries for each output file (N-Triples, EDM XML ZIP) with byte size, media type, access URL
- Triple and record counts
- Temporal metadata (modification date)
This description is then validated against NDE's dataset register SHACL shapes and registered with the NDE Dataset Register API.
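For illustration, a minimal sketch of how the fields listed above could be serialized to Turtle. The `DatasetInfo` and `DistributionInfo` shapes and the `toDatasetTurtle` function are hypothetical and not taken from loda-pipeline. Using `void:triples` for the triple count, a plain literal for the media type, and `xsd:date` for the modification date are assumptions that would need checking against the NDE SHACL shapes. Record counts are left out because the source does not say which property carries them.

```typescript
// Hypothetical input shapes for illustration; not loda-pipeline's actual types.
interface DistributionInfo {
  accessUrl: string;
  mediaType: string; // e.g. "application/n-triples"
  byteSize: number;
}

interface DatasetInfo {
  iri: string;
  title: string;
  description: string;
  publisherIri: string;
  licenseIri: string;
  modified: string; // assumed to be an xsd:date such as "2024-01-31"
  tripleCount: number;
  distributions: DistributionInfo[];
}

// Escape a string for use inside a double-quoted Turtle literal.
function turtleLiteral(value: string): string {
  const escaped = value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
  return '"' + escaped + '"';
}

function toDatasetTurtle(d: DatasetInfo): string {
  const prefixes = [
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
    "@prefix dct: <http://purl.org/dc/terms/> .",
    "@prefix void: <http://rdfs.org/ns/void#> .",
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
  ].join("\n");

  // Predicate-object pairs for the dataset node.
  const statements: string[] = [
    "a dcat:Dataset",
    "dct:title " + turtleLiteral(d.title),
    "dct:description " + turtleLiteral(d.description),
    "dct:publisher <" + d.publisherIri + ">",
    "dct:license <" + d.licenseIri + ">",
    "dct:modified " + turtleLiteral(d.modified) + "^^xsd:date",
    "void:triples " + d.tripleCount, // assumption: VoID carries the triple count
  ];

  // One blank-node dcat:Distribution per output file.
  for (const dist of d.distributions) {
    statements.push(
      [
        "dcat:distribution [",
        "    a dcat:Distribution ;",
        "    dcat:accessURL <" + dist.accessUrl + "> ;",
        "    dcat:mediaType " + turtleLiteral(dist.mediaType) + " ;",
        "    dcat:byteSize " + dist.byteSize,
        "  ]",
      ].join("\n"),
    );
  }

  return prefixes + "\n\n<" + d.iri + ">\n  " + statements.join(" ;\n  ") + " .\n";
}
```

A real implementation would more likely build quads with an RDF library and let its serializer handle escaping; string concatenation is used here only to keep the sketch dependency-free.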
loda-pipeline also has an update_data_catalog.sh script that combines all individual dataset descriptions into a single dcat:Catalog for multi-dataset pipelines.
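As a rough sketch of the catalog step, the function below emits only the dcat:Catalog node with a `dcat:dataset` link per dataset. `toCatalogTurtle` is a hypothetical name, and how update_data_catalog.sh actually merges the descriptions is not shown in this issue, so the merging itself is left out.

```typescript
// Hypothetical helper; update_data_catalog.sh may work quite differently.
function toCatalogTurtle(
  catalogIri: string,
  title: string,
  datasetIris: string[]
): string {
  // Drop duplicates and sort so repeated runs produce identical output.
  const unique = datasetIris
    .filter((iri, index) => datasetIris.indexOf(iri) === index)
    .sort();

  const statements = [
    "a dcat:Catalog",
    // JSON string escapes (\", \\, \n, \uXXXX) are also valid in Turtle literals.
    "dct:title " + JSON.stringify(title),
  ].concat(unique.map((iri) => "dcat:dataset <" + iri + ">"));

  return [
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
    "@prefix dct: <http://purl.org/dc/terms/> .",
    "",
    "<" + catalogIri + ">",
    "  " + statements.join(" ;\n  ") + " .",
    "",
  ].join("\n");
}
```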
Approach
This is distinct from withProvenance() (which records PROV-O process metadata) and from #82.
Could be:
- A post-pipeline step that counts triples written and generates the description
- A Writer decorator that tracks counts as quads flow through, then emits the description at the end
- A standalone utility that takes a pipeline's output files and generates the description
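To make the decorator option more concrete, here is a minimal sketch. `QuadWriter` and `SimpleQuad` are simplified stand-ins for the pipeline's real Writer and quad types, whose actual signatures are not shown in this issue, and `CountingWriter` is a hypothetical name. The real Writer is probably stream-based or asynchronous; a synchronous interface is assumed here to keep the example short.

```typescript
// Minimal stand-ins for the pipeline's quad and Writer types (assumed shapes).
interface SimpleQuad {
  subject: string;
  predicate: string;
  object: string;
}

interface QuadWriter {
  write(quad: SimpleQuad): void;
  end(): void;
}

// Decorator: forwards every quad to the wrapped writer and counts them.
class CountingWriter implements QuadWriter {
  private count = 0;

  constructor(
    private readonly inner: QuadWriter,
    private readonly onEnd: (quadCount: number) => void
  ) {}

  write(quad: SimpleQuad): void {
    this.count += 1;
    this.inner.write(quad);
  }

  end(): void {
    this.inner.end();
    // The total is only known once the wrapped writer has finished.
    this.onEnd(this.count);
  }
}
```

The `onEnd` callback is where the dataset description would be generated. Note that this counts quads written, which can be higher than the number of distinct triples if the pipeline emits duplicates.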
The catalog generation (combining multiple dataset descriptions) is a secondary concern for multi-dataset orchestration.
Relates to