Metadata Verification: A Workflow for Computational Archival Science
Author
Abstract

Metadata Discovery Problem - Researchers seeking to apply computational methods are increasingly turning to scientific digital archives containing images of specimens. Unfortunately, metadata errors can inhibit the discovery and use of scientific archival images. One such case is the NSF-sponsored Biology Guided Neural Network (BGNN) project, where an abundance of metadata errors has significantly delayed development of a proposed new class of neural networks. This paper reports on research addressing this challenge. We present a prototype workflow for specimen scientific name metadata verification that is grounded in Computational Archival Science (CAS), and we report on a taxonomy of specimen name metadata error types together with preliminary solutions. Our three-phase workflow comprises tag extraction, text processing, and interactive assessment. A baseline test with the prototype workflow identified at least 15 scientific name metadata errors among 857 manually reviewed, potentially erroneous specimen images, corresponding to an error rate of approximately 0.2% for the full image dataset. The prototype workflow minimizes the time domain experts need to spend reviewing archive metadata for correctness and AI-readiness before these archival images can be used in downstream analysis.

Year of Publication
2022
Date Published
December
Publisher
IEEE
Conference Location
Osaka, Japan
ISBN Number
978-1-66548-045-1
URL
https://ieeexplore.ieee.org/document/10020340/
DOI
10.1109/BigData55660.2022.10020340