Document details
ID

oai:arXiv.org:2410.08740

Subject
Computer Science - Computer Vision... Computer Science - Artificial Inte... Computer Science - Information Ret...
Authors
Turnbull, Robert; Fitzgerald, Emily; Thompson, Karen; Birch, Joanne L.
Category

Computer Science

Year

2024

Listing date

16 October 2024

Keywords
science recognition institutional detects text-based text pipeline specimen

Abstract

Specimen-associated biodiversity data are sought after for biological, environmental, climate, and conservation sciences.

Extracting data from specimen images requires a step change in rate to eliminate the bottleneck created by reliance on human-mediated transcription of these data.

We applied advanced computer vision techniques to develop 'Hespi' (HErbarium Specimen sheet PIpeline), which extracts a pre-catalogue subset of collection data on the institutional labels on herbarium specimens from their digital images.

The pipeline integrates two object detection models; the first detects bounding boxes around text-based labels and the second detects bounding boxes around text-based data fields on the primary institutional label.
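As an illustration of how two detection stages can fit together, the sketch below assumes the second model runs on a crop of the label found by the first, so its boxes must be shifted back into full-sheet coordinates. The `Box` type, the `to_sheet_coords` helper, and the pixel values are hypothetical and are not taken from Hespi itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels: (left, top, right, bottom)."""
    left: int
    top: int
    right: int
    bottom: int


def to_sheet_coords(field_box: Box, label_box: Box) -> Box:
    """Shift a box found inside a cropped label back into sheet coordinates."""
    return Box(
        field_box.left + label_box.left,
        field_box.top + label_box.top,
        field_box.right + label_box.left,
        field_box.bottom + label_box.top,
    )


# Hypothetical detections: stage one locates the institutional label on the
# sheet; stage two locates a data field inside the cropped label image.
label = Box(100, 2000, 900, 2600)
field_in_label = Box(20, 30, 400, 80)

field_on_sheet = to_sheet_coords(field_in_label, label)
print(field_on_sheet)
```

Keeping the two stages in separate coordinate frames lets each model be replaced independently, as long as the offset step is preserved.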

The pipeline classifies text-based institutional labels as printed, typed, handwritten, or a combination and applies Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for data extraction.
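One plausible way to connect the label classification to the recognition step is a small routing table. The mapping below is an assumption made for illustration (the abstract does not state which engine handles which class), and the function name `engines_for` is hypothetical.

```python
def engines_for(label_type: str) -> list[str]:
    """Return the recognition engines to run for a classified label type.

    Assumed routing: machine-produced text goes to OCR, handwriting goes to
    HTR, and mixed labels are sent to both.
    """
    routing = {
        "printed": ["ocr"],
        "typed": ["ocr"],
        "handwritten": ["htr"],
        "combination": ["ocr", "htr"],
    }
    try:
        return routing[label_type]
    except KeyError:
        raise ValueError(f"unknown label type: {label_type!r}") from None


for kind in ("printed", "typed", "handwritten", "combination"):
    print(kind, engines_for(kind))
```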

The recognized text is then corrected against authoritative databases of taxon names.
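Correction against a reference list can be sketched with approximate string matching. The example below uses Python's standard `difflib` module and a tiny hand-written list of genus names; both the list and the `cutoff` value are assumptions for illustration, and the matching method Hespi actually uses is not specified here.

```python
from difflib import get_close_matches


def correct_taxon(recognized: str, reference_names: list[str],
                  cutoff: float = 0.8) -> str:
    """Return the closest reference name, or the input if nothing is close.

    `cutoff` is a similarity threshold in [0, 1]; higher values accept only
    near-identical strings.
    """
    matches = get_close_matches(recognized, reference_names, n=1, cutoff=cutoff)
    return matches[0] if matches else recognized


# Hypothetical stand-in for an authoritative taxon-name database.
reference = ["Eucalyptus", "Acacia", "Banksia"]

print(correct_taxon("Eucalyptvs", reference))  # one misread character
print(correct_taxon("Quercus", reference))     # no close match in the list
```

Returning the input unchanged when no candidate clears the threshold avoids replacing a valid name that is simply missing from the reference list.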

The extracted text is also corrected with the aid of a multimodal Large Language Model (LLM).

Hespi accurately detects and extracts text for test datasets including specimen sheet images from international herbaria.

The components of the pipeline are modular and users can train their own models with their own data and use them in place of the models provided.

Turnbull, Robert; Fitzgerald, Emily; Thompson, Karen; Birch, Joanne L., 2024, Hespi: A pipeline for automatically detecting information from hebarium specimen sheets
