We test here the idea that a single method, independent of the data at hand, can be used to create the sets of validation rules needed for the various datasets and stages of the GSBPM (Generic Statistical Business Process Model). See our slides for the SDMX 2025 conference.
In addition to rules based on expert knowledge, which still play a role, this involves exploiting:
- the codelists and DSD (Data Structure Definition) files from the SDMX registry, using the R package validate and, in Python, sdmxthon
- data properties (type, range, distribution, correlation, ...), using the R package validate_suggest
- machine learning algorithms (e.g. the apriori and eclat algorithms from the R package arules) trained on the whole datasets to discover association rules; the resulting rule sets are then evaluated against the traditional rule repositories
- machine/deep learning algorithms for unsupervised anomaly detection (e.g. isolation forest and deep isolation forest, as in the R packages solitude and HRTnomaly), which can single out rare or unique patterns in the data that might signal errors
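To illustrate the first source above, a minimal sketch of how an SDMX codelist becomes a membership validation rule. The codelist here is a hypothetical stand-in; in practice it would be retrieved from the SDMX registry (e.g. via sdmxthon) together with the DSD, and the check would run per DSD dimension.

```python
# Hypothetical frequency codelist, standing in for one fetched from an
# SDMX registry. In a real pipeline it would come from the DSD/codelist
# artefacts, not be hard-coded.
CL_FREQ = {"A", "Q", "M", "W", "D"}

def check_codelist(records, field, codelist):
    """Return the records whose value for `field` is not in the codelist."""
    return [r for r in records if r.get(field) not in codelist]

data = [{"FREQ": "A"}, {"FREQ": "M"}, {"FREQ": "X"}]
violations = check_codelist(data, "FREQ", CL_FREQ)
print(violations)  # [{'FREQ': 'X'}]
```

The same pattern generalizes to one membership rule per coded dimension of the DSD.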
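For the second source, a sketch of deriving rules from observed data properties, loosely mirroring what validate_suggest does in R for ranges. The function name and the rule representation are illustrative, not part of any library.

```python
# Suggest simple range rules (value_min <= x <= value_max) from the
# observed values of each numeric column. A real rule-suggestion tool
# would also look at types, distributions, and correlations.

def suggest_range_rules(columns):
    """Map each column name to a suggested (min, max) range rule."""
    return {name: (min(values), max(values)) for name, values in columns.items()}

obs = {"age": [23, 45, 31, 67], "income": [1800, 2500, 3100]}
print(suggest_range_rules(obs))  # {'age': (23, 67), 'income': (1800, 3100)}
```

Suggested rules of this kind are candidates only; they still need review before joining a production rule set.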
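For the association-rule source, a didactic miniature of the frequent-itemset step underlying apriori, showing how candidate rules can be mined from data and later compared with hand-written rule repositories. A real application would use arules (apriori/eclat) in R rather than this toy.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=2):
    """Return itemsets (up to max_len items) whose support >= min_support."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {iset: c / n for iset, c in counts.items() if c / n >= min_support}

# Toy transactions built from record attributes (illustrative names).
tx = [
    {"employed", "has_income"},
    {"employed", "has_income"},
    {"unemployed"},
    {"employed", "has_income"},
]
print(frequent_itemsets(tx, min_support=0.7))
```

From the frequent pair ("employed", "has_income") one would derive the candidate rule "employed implies has_income", which can then be checked against the existing expert rule repository.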
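Finally, for the anomaly-detection source, a toy one-dimensional isolation forest conveying the intuition behind packages such as solitude: an anomalous value is isolated by random splits in fewer steps, so a short average path length flags a likely error. This is a didactic sketch, not a substitute for the real algorithms.

```python
import random

def isolation_depth(x, values, max_depth=10, rng=random):
    """Depth at which random interval splits isolate x among `values`."""
    lo, hi = min(values), max(values)
    pts = list(values)
    depth = 0
    while len(pts) > 1 and depth < max_depth and lo < hi:
        split = rng.uniform(lo, hi)
        if x < split:
            hi = split
            pts = [p for p in pts if p < split]
        else:
            lo = split
            pts = [p for p in pts if p >= split]
        depth += 1
    return depth

def avg_depth(x, values, n_trees=100, seed=0):
    """Average isolation depth over many random trees; low = anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, values, rng=rng) for _ in range(n_trees)) / n_trees

data = [10, 11, 9, 10, 12, 11, 300]  # 300 looks like a data-entry error
print(avg_depth(300, data), avg_depth(10, data))  # outlier isolates faster
```

Values with unusually low average depth are the rare patterns worth routing to manual review, since they might be signaling errors.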