DCAT dataset description generation #86

@ddeboer

Summary

Generate a DCAT-AP dataset description after a pipeline run, describing what was produced: triple counts, distribution URLs, provider metadata.

Context

loda-pipeline generates a datasetdescription.ttl after each pipeline run containing:

  • dcat:Dataset with title, description, publisher, license
  • dcat:Distribution entries for each output file (N-Triples, EDM XML ZIP) with byte size, media type, access URL
  • Triple and record counts
  • Temporal metadata (modification date)
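As a rough illustration of the fields listed above, the sketch below serializes a dataset description to Turtle by plain string building. The `DatasetInfo` and `DistributionInfo` types and the `describeDataset` function are hypothetical names, not part of loda-pipeline. Using `void:triples` for the triple count, an IANA IRI for the media type, and `xsd:nonNegativeInteger` for `dcat:byteSize` are assumptions that should be checked against the NDE SHACL shapes. Record counts are left out because the property used for them is not stated here.

```typescript
// Hypothetical input types; none of these names come from loda-pipeline.
interface DistributionInfo {
  accessUrl: string;
  mediaType: string; // assumed to be an IANA media type IRI
  byteSize: number;
}

interface DatasetInfo {
  iri: string;
  title: string;
  description: string;
  publisherIri: string;
  licenseIri: string;
  modified: Date;
  tripleCount: number;
  distributions: DistributionInfo[];
}

// Escape a string for use inside a double-quoted Turtle literal.
function literal(value: string): string {
  const escaped = value
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
  return '"' + escaped + '"';
}

// Build a Turtle document with one dcat:Dataset and its distributions.
// IRIs are inserted as-is, so callers must pass valid, absolute IRIs.
function describeDataset(d: DatasetInfo): string {
  const prefixes = [
    '@prefix dcat: <http://www.w3.org/ns/dcat#> .',
    '@prefix dct: <http://purl.org/dc/terms/> .',
    '@prefix void: <http://rdfs.org/ns/void#> .',
    '@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .',
  ];
  const distributions = d.distributions.map(dist =>
    [
      '  dcat:distribution [',
      '    a dcat:Distribution ;',
      `    dcat:accessURL <${dist.accessUrl}> ;`,
      `    dcat:mediaType <${dist.mediaType}> ;`,
      `    dcat:byteSize "${dist.byteSize}"^^xsd:nonNegativeInteger`,
      '  ]',
    ].join('\n'),
  );
  const properties = [
    '  a dcat:Dataset',
    `  dct:title ${literal(d.title)}`,
    `  dct:description ${literal(d.description)}`,
    `  dct:publisher <${d.publisherIri}>`,
    `  dct:license <${d.licenseIri}>`,
    `  dct:modified "${d.modified.toISOString()}"^^xsd:dateTime`,
    `  void:triples ${d.tripleCount}`, // assumed property for the triple count
  ].concat(distributions);
  return prefixes
    .concat(['', `<${d.iri}>`, properties.join(' ;\n') + ' .', ''])
    .join('\n');
}
```

Building strings keeps the sketch dependency-free; a real implementation would more likely construct quads and hand them to an RDF serializer.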

This description is then validated against NDE's dataset register SHACL shapes and registered with the NDE Dataset Register API.

loda-pipeline also has an update_data_catalog.sh script that combines all individual dataset descriptions into a single dcat:Catalog for multi-dataset pipelines.
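To make the catalog idea concrete, here is a simplified sketch that emits a dcat:Catalog linking to each dataset by IRI through `dcat:dataset`. It does not reproduce what update_data_catalog.sh does, which merges the full descriptions; the `describeCatalog` function and its parameters are hypothetical.

```typescript
// Build a Turtle document for a dcat:Catalog that points at each dataset IRI.
// IRIs are inserted as-is, so callers must pass valid, absolute IRIs.
function describeCatalog(catalogIri: string, datasetIris: string[]): string {
  // Deduplicate and sort so repeated runs produce identical output.
  const unique = datasetIris
    .filter((iri, index, all) => all.indexOf(iri) === index)
    .sort();
  const objects = unique.map(iri => `    <${iri}>`).join(' ,\n');
  const body = unique.length > 0
    ? `  a dcat:Catalog ;\n  dcat:dataset\n${objects} .`
    : '  a dcat:Catalog .';
  return [
    '@prefix dcat: <http://www.w3.org/ns/dcat#> .',
    '',
    `<${catalogIri}>`,
    body,
    '',
  ].join('\n');
}
```

Sorting and deduplicating the IRIs keeps the output stable across runs, which makes diffs of the generated catalog easier to review.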

Approach

This is distinct from withProvenance() (which records PROV-O process metadata) and from #82.

This could be implemented as:

  • A post-pipeline step that counts triples written and generates the description
  • A Writer decorator that tracks counts as quads flow through, then emits the description at the end
  • A standalone utility that takes a pipeline's output files and generates the description
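The Writer decorator option could look roughly like the sketch below. The `Quad` and `Writer` shapes are hypothetical stand-ins, since the actual Writer interface is not shown here, and `CountingWriter` is an illustrative name. The decorator forwards every quad to the wrapped writer, keeps a running count, and reports the total through a callback once the stream ends, which is the point where a description could be emitted.

```typescript
// Hypothetical minimal shapes; the real Writer interface may differ.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
  graph?: string;
}

interface Writer {
  write(quad: Quad): void;
  end(): void;
}

// Decorator that counts quads as they pass through to the wrapped writer.
class CountingWriter implements Writer {
  private count = 0;

  constructor(
    private readonly inner: Writer,
    // Called once with the total after the wrapped writer has been closed.
    private readonly onEnd: (quadCount: number) => void,
  ) {}

  write(quad: Quad): void {
    this.inner.write(quad);
    // Count only after a successful write to the wrapped writer.
    this.count += 1;
  }

  end(): void {
    this.inner.end();
    this.onEnd(this.count);
  }
}
```

This counts quads written rather than distinct triples, so duplicates in the stream inflate the number. If the description must report distinct triples, the post-pipeline or standalone options may be a better fit.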

The catalog generation (combining multiple dataset descriptions) is a secondary concern for multi-dataset orchestration.

Relates to
