DCAT dataset description generation #86

@ddeboer

Summary

Generate a DCAT-AP dataset description after a pipeline run, describing what was produced: triple counts, distribution URLs, provider metadata.

Context

loda-pipeline generates a datasetdescription.ttl file after each pipeline run, containing:

  • dcat:Dataset with title, description, publisher, license
  • dcat:Distribution entries for each output file (N-Triples, EDM XML ZIP) with byte size, media type, access URL
  • Triple and record counts
  • Temporal metadata (modification date)

This description is then validated against NDE's dataset register SHACL shapes and registered with the NDE Dataset Register API.
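
To make the contents listed above concrete, here is a minimal sketch of a generator that serializes such a description as Turtle. It is not loda-pipeline's implementation: the `DatasetInfo` and `DistributionInfo` shapes, the function names and the example IRIs are all hypothetical. Using `void:triples` for the triple count and IANA IRIs for `dcat:mediaType` are assumptions too, since the issue does not say which properties the current description uses.

```typescript
// Hypothetical input shapes; none of these names come from loda-pipeline.
interface DistributionInfo {
  accessUrl: string;
  mediaTypeIri: string; // assumed to be an IRI, e.g. an IANA media type IRI
  byteSize: number;
}

interface DatasetInfo {
  iri: string;
  title: string;
  description: string;
  publisherIri: string;
  licenseIri: string;
  modified: Date;
  tripleCount: number;
  distributions: DistributionInfo[];
}

// Escape a string for use inside a double-quoted Turtle literal.
function escapeLiteral(value: string): string {
  const escaped = value
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
  return '"' + escaped + '"';
}

function describeDataset(d: DatasetInfo): string {
  const prefixes = [
    '@prefix dcat: <http://www.w3.org/ns/dcat#> .',
    '@prefix dct: <http://purl.org/dc/terms/> .',
    '@prefix void: <http://rdfs.org/ns/void#> .',
    '@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .',
  ];
  // One blank-node dcat:Distribution per output file.
  const distributions = d.distributions.map((dist) =>
    [
      '  dcat:distribution [',
      '    a dcat:Distribution ;',
      `    dcat:accessURL <${dist.accessUrl}> ;`,
      `    dcat:mediaType <${dist.mediaTypeIri}> ;`,
      // Datatype as in DCAT 3; DCAT 2 specified xsd:decimal here.
      `    dcat:byteSize "${dist.byteSize}"^^xsd:nonNegativeInteger`,
      '  ]',
    ].join('\n'),
  );
  const statements = [
    `<${d.iri}> a dcat:Dataset`,
    `  dct:title ${escapeLiteral(d.title)}`,
    `  dct:description ${escapeLiteral(d.description)}`,
    `  dct:publisher <${d.publisherIri}>`,
    `  dct:license <${d.licenseIri}>`,
    `  dct:modified "${d.modified.toISOString()}"^^xsd:dateTime`,
    // Assumption: the triple count is expressed with VoID.
    `  void:triples "${d.tripleCount}"^^xsd:integer`,
  ].concat(distributions);
  return prefixes.join('\n') + '\n\n' + statements.join(' ;\n') + ' .\n';
}

// Usage with made-up values.
const exampleDescription = describeDataset({
  iri: 'https://example.org/datasets/demo',
  title: 'Example "demo" dataset',
  description: 'Output of a pipeline run.',
  publisherIri: 'https://example.org/org',
  licenseIri: 'https://creativecommons.org/publicdomain/zero/1.0/',
  modified: new Date(Date.UTC(2024, 0, 2, 3, 4, 5)),
  tripleCount: 42,
  distributions: [
    {
      accessUrl: 'https://example.org/demo.nt',
      mediaTypeIri: 'https://www.iana.org/assignments/media-types/application/n-triples',
      byteSize: 1024,
    },
    {
      accessUrl: 'https://example.org/demo-edm.zip',
      mediaTypeIri: 'https://www.iana.org/assignments/media-types/application/zip',
      byteSize: 2048,
    },
  ],
});
```

A string-based serializer keeps the sketch dependency-free. A real implementation would more likely build quads and hand them to an RDF serializer, which also makes the SHACL validation step easier to run against the same data.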

loda-pipeline also has an update_data_catalog.sh script that combines all individual dataset descriptions into a single dcat:Catalog for multi-dataset pipelines.
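
How update_data_catalog.sh merges the description files is not shown in this issue. As a rough illustration of the catalog side only, the sketch below emits a `dcat:Catalog` that links to already-described datasets by IRI. The function names and IRIs are hypothetical, and the sketch does not merge the dataset descriptions themselves.

```typescript
// Escape a string for use inside a double-quoted Turtle literal.
function quoteLiteral(value: string): string {
  return '"' + value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n') + '"';
}

// Hypothetical helper: emits only the catalog wrapper with dcat:dataset links.
function describeCatalog(catalogIri: string, title: string, datasetIris: string[]): string {
  // Drop repeated IRIs so each dataset is linked once.
  const unique = datasetIris.filter((iri, index) => datasetIris.indexOf(iri) === index);
  const statements = [
    `<${catalogIri}> a dcat:Catalog`,
    `  dct:title ${quoteLiteral(title)}`,
  ].concat(unique.map((iri) => `  dcat:dataset <${iri}>`));
  return [
    '@prefix dcat: <http://www.w3.org/ns/dcat#> .',
    '@prefix dct: <http://purl.org/dc/terms/> .',
    '',
    statements.join(' ;\n') + ' .',
    '',
  ].join('\n');
}

// Usage with made-up IRIs; the duplicate is linked only once.
const exampleCatalog = describeCatalog('https://example.org/catalog', 'Example catalog', [
  'https://example.org/datasets/a',
  'https://example.org/datasets/b',
  'https://example.org/datasets/a',
]);
```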

Approach

This is distinct from withProvenance() (which records PROV-O process metadata) and from #82.

Possible implementations:

  • A post-pipeline step that counts triples written and generates the description
  • A Writer decorator that tracks counts as quads flow through, then emits the description at the end
  • A standalone utility that takes a pipeline's output files and generates the description
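
The Writer decorator option could look roughly like the sketch below. The `Quad` and `Writer` interfaces here are simplified stand-ins, not the project's actual types, and `CountingWriter` and its `onEnd` callback are hypothetical names.

```typescript
// Simplified stand-ins for the pipeline's real types (assumptions).
interface Quad {
  subject: string;
  predicate: string;
  object: string;
  graph?: string;
}

interface Writer {
  write(quad: Quad): void;
  end(): void;
}

// Decorator: forwards every quad to the wrapped writer and counts it.
// When the stream ends, the count is reported so a description can be emitted.
class CountingWriter implements Writer {
  private count = 0;
  private readonly inner: Writer;
  private readonly onEnd: (quadCount: number) => void;

  constructor(inner: Writer, onEnd: (quadCount: number) => void) {
    this.inner = inner;
    this.onEnd = onEnd;
  }

  write(quad: Quad): void {
    this.inner.write(quad);
    this.count += 1;
  }

  end(): void {
    this.inner.end();
    this.onEnd(this.count);
  }
}

// Usage: an in-memory sink stands in for a file writer.
const writtenQuads: Quad[] = [];
let reportedCount = -1;
const sink: Writer = {
  write: (quad) => {
    writtenQuads.push(quad);
  },
  end: () => {},
};
const countingWriter = new CountingWriter(sink, (n) => {
  reportedCount = n;
});
countingWriter.write({ subject: 'ex:a', predicate: 'ex:p', object: 'ex:b' });
countingWriter.write({ subject: 'ex:a', predicate: 'ex:p', object: 'ex:c' });
countingWriter.end();
```

Note that this counts quads written, not distinct triples: a quad written twice is counted twice. If the description must report distinct triples, the post-pipeline or standalone options, which can count from the final output, may be a better fit.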

The catalog generation (combining multiple dataset descriptions) is a secondary concern for multi-dataset orchestration.

Relates to
