MetaDIG

Initial Evaluation

The goal is to review MetaDIG based on existing quality reports and XML checks to determine if parts of it can be adopted for PresQT.

NCEAS/metadig-engine

A Java engine that runs metadata quality checks. Each check is defined in a rule file that uses a custom XML format. A check specifies which pieces of the metadata to examine (using an XPath selector) and an algorithm that runs on the selected data and returns either "SUCCESS" or "FAILURE". Algorithms can be written in R, Java, or Python. A check can also return informational messages or warnings about what it found. Each check carries the following metadata:

  • an identifier for the check

  • a name for the check

  • a plain text description of the check

  • a "type" (not sure what possible types are)

  • a level (INFO, OPTIONAL, REQUIRED)
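
The check model described above can be sketched in a few lines of Python. This is an illustration only, not MetaDIG's rule format or API: the `Check` class, its field names, the `run_check` helper, and the sample record are all hypothetical, and the standard library's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    # Fields mirror the check metadata listed above; the names are hypothetical.
    id: str
    name: str
    description: str
    type: str
    level: str                               # "INFO", "OPTIONAL", or "REQUIRED"
    selector: str                            # XPath (ElementTree subset)
    algorithm: Callable[[List[str]], bool]   # True means the check passed


def run_check(check: Check, xml_text: str) -> dict:
    """Select values with the check's XPath, run its algorithm, report a status."""
    root = ET.fromstring(xml_text)
    values = [(el.text or "").strip() for el in root.findall(check.selector)]
    passed = check.algorithm(values)
    return {
        "check": check.id,
        "level": check.level,
        "status": "SUCCESS" if passed else "FAILURE",
        "message": f"{len(values)} value(s) selected by {check.selector}",
    }


# Made-up record, loosely shaped like a metadata document.
RECORD = (
    "<dataset>"
    "<title>Lake temperature 2019</title>"
    "<keywordSet><keyword>limnology</keyword></keywordSet>"
    "</dataset>"
)

keyword_present = Check(
    id="example.keyword.present",
    name="Keyword present",
    description="At least one non-empty keyword is present.",
    type="example",
    level="REQUIRED",
    selector=".//keyword",
    algorithm=lambda values: any(values),
)

result = run_check(keyword_present, RECORD)
print(result["status"], "-", result["message"])
```

Keeping the selector separate from the algorithm is what would let the same algorithm be reused against a different metadata standard by swapping only the XPath.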

The engine can distribute and run jobs over many VMs, and it has been adapted to run on Kubernetes. The system has several components: Metadig-controller, Metadig-engine-core, Metadig-report-worker, and Metadig-index-worker (https://github.com/NCEAS/metadig-engine/blob/master/docs/Architecture.md#metadig-system-components)

Observations

  • the checks are specifically for the metadata standards NCEAS uses: EML and ISO 19115

  • because the XPaths are specific to those standards, the checks would need to be adapted to support other metadata standards.

  • the individual checks do not depend on the XML format of the data, apart from needing XPath selectors

  • non-XML structured metadata (e.g. in JSON) could be converted to a parallel form in XML.

  • checks only return "SUCCESS" or "FAILURE" as a status code. Checks do not support changing/adding/updating the metadata records.

  • while most checks are written in R, Python is supported (e.g. see check /src/main/resources/checks/check-keywordPresent.xml)

  • checks can make web calls or import other packages
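
The observation about non-XML metadata can be illustrated with a small conversion sketch. It is a minimal example under stated assumptions, not part of MetaDIG: the `json_to_xml` helper and the sample record are hypothetical, JSON keys are assumed to be valid XML element names, and lists nested directly inside other lists are not handled.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(tag, value):
    """Build an XML element that parallels a JSON value (hypothetical helper)."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # A JSON list becomes repeated sibling elements with the same tag.
            items = child if isinstance(child, list) else [child]
            for item in items:
                element.append(json_to_xml(key, item))
    elif value is not None:
        element.text = str(value)
    return element


# Made-up JSON metadata record.
record = json.loads(
    '{"title": "Lake temperature 2019", "keywords": ["limnology", "temperature"]}'
)

root = json_to_xml("record", record)

# Once in XML form, the record can be addressed with XPath-style selectors.
keywords = [el.text for el in root.findall("./keywords")]
print(ET.tostring(root, encoding="unicode"))
print(keywords)
```

A conversion like this only gives the checks something to select against; the XPath selectors in each check would still need to be written for the resulting element names.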

Open questions

  • Is it possible to deploy without using Kubernetes?

Thoughts

  • How much throughput does PresQT need for analyzing metadata records? The parallel approach may not be necessary, but if it is, MetaDIG already has the processing infrastructure built out.

  • If altering/enriching records is desired at some point, it might be more strategic to implement a similar engine that supports that functionality.

  • MetaDIG is based on a centralized model. PresQT needs to determine whether that fits its use cases.

Conclusion

After a group discussion, we decided not to pursue MetaDIG for the moment because it does not have a way to enrich metadata. In addition, it is not yet clear how distributed or consolidated the PresQT metadata architecture will be.
