MetaDIG
Initial Evaluation
The goal is to review MetaDIG based on existing quality reports and XML checks to determine if parts of it can be adopted for PresQT.
NCEAS/metadig-engine
Java engine that runs metadata quality checks. Each check is defined in a rule file that uses a custom XML format. A check specifies the pieces of the metadata to examine (using an XPath selector) and an algorithm to run on the selected data, which returns either "SUCCESS" or "FAILURE". The algorithms can be written in R, Java, or Python. A check can also return messages or warnings about what it found. Each check has metadata for:
an identifier for the check
a name for the check
a plain text description of the check
a "type" (not sure what possible types are)
a level (INFO, OPTIONAL, REQUIRED)
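The check structure described above can be sketched in a few lines of Python. This is only an illustration of the idea (a selector plus an algorithm that returns a status), not MetaDIG's rule file format or API. The `Check` class, the `run_check` function, the check identifier, and the sample XML are all hypothetical, and the selector uses the limited XPath subset supported by the standard library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """Hypothetical in-memory stand-in for a rule file (not MetaDIG's format)."""
    check_id: str
    name: str
    description: str
    check_type: str
    level: str          # INFO, OPTIONAL, or REQUIRED
    selector: str       # XPath subset understood by ElementTree
    algorithm: Callable[[List[str]], str]


def run_check(check: Check, xml_text: str) -> Dict[str, str]:
    """Select values with the check's XPath and pass them to its algorithm."""
    root = ET.fromstring(xml_text)
    values = [(el.text or "").strip() for el in root.findall(check.selector)]
    return {
        "check": check.check_id,
        "level": check.level,
        "status": check.algorithm(values),
    }


def keywords_present(values: List[str]) -> str:
    """Succeed when at least one non-empty keyword was selected."""
    return "SUCCESS" if any(values) else "FAILURE"


keyword_check = Check(
    check_id="example.keywords.present",   # made-up identifier
    name="Keywords present",
    description="At least one keyword should be provided.",
    check_type="discovery",                # made-up type value
    level="REQUIRED",
    selector=".//keywordSet/keyword",
    algorithm=keywords_present,
)

SAMPLE = "<dataset><keywordSet><keyword>ocean</keyword></keywordSet></dataset>"
result = run_check(keyword_check, SAMPLE)
print(result["status"])
```

Keeping the selector separate from the algorithm is what would let the same algorithm be reused across metadata standards: only the XPath would need to change.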
The engine can distribute and run jobs across many VMs, and it has been adapted to run on Kubernetes. The system has several components: Metadig-controller, Metadig-engine-core, Metadig-report-worker, and Metadig-index-worker (https://github.com/NCEAS/metadig-engine/blob/master/docs/Architecture.md#metadig-system-components)
Observations
the checks are written specifically for the metadata standards NCEAS uses: EML and ISO 19115
because the XPath selectors are specific to those standards, the checks would need to be adapted to support other metadata standards.
the individual checks do not depend on the XML format of the data, apart from the XPath selectors
non-XML structured metadata (e.g. JSON) could be converted to a parallel form in XML.
checks only return "SUCCESS" or "FAILURE" as a status code. Checks do not support changing/adding/updating the metadata records.
While most checks are written in R, Python is supported (e.g. see the check /src/main/resources/checks/check-keywordPresent.xml)
checks can make web calls or import other packages
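The idea of converting JSON metadata to a parallel XML form could look roughly like the sketch below. It is a minimal illustration, not part of MetaDIG: the `to_xml` function, the `metadata` root tag, the `item` tag used for list entries, and the sample record are all assumptions. It also assumes every JSON key is a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively map a parsed JSON value onto an XML element.

    Assumes dictionary keys are valid XML element names.
    """
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        # List entries have no name in JSON, so wrap each in a generic tag.
        for entry in value:
            element.append(to_xml("item", entry))
    elif value is not None:
        element.text = str(value)
    return element


# Hypothetical JSON metadata record.
record = json.loads(
    '{"title": "Sea surface temperature", "keywords": ["ocean", "temperature"]}'
)
root = to_xml("metadata", record)

# The converted record can now be queried with an XPath selector.
keywords = [el.text for el in root.findall("./keywords/item")]
print(keywords)
```

A conversion like this is lossy in places (for example, numbers become text), so the selectors in any adapted checks would have to be written against the converted structure.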
Open questions
Is it possible to deploy without using Kubernetes?
Thoughts
How much throughput does PresQT need for analyzing metadata records? The parallel approach may not be necessary, but if it is, MetaDIG already has the processing infrastructure built out.
If altering/enriching records is desired at some point, it might be more strategic to implement a similar engine that supports that functionality.
MetaDIG is based on a centralized model. PresQT needs to determine whether that fits its use cases.
Conclusion
After a group discussion, we decided not to pursue MetaDIG for the moment because it does not have a way to enrich metadata. Also, we are not yet sure how distributed or consolidated the architecture for PresQT metadata will be.
Resources
Presentation slides from the Partner Meeting in January 2019
GitHub