Document details
Identifier

oai:arXiv.org:2411.09255

Subject
Computer Science - Computation and...
Authors
Seo, Jean; Lim, Jongwon; Jang, Dongjun; Shin, Hyopil
Category

Computer Science

Year

2024

Date indexed

20/11/2024

Keywords
models, benchmark, dataset
Metric

Abstract

We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain.

Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories.

DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information.

The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations compared to previous methods that rely on multiple-choice tasks.

We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy.

The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, as it can be expanded to other specialized domains.

We release the dataset and code publicly.
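As a rough illustration of the scoring described in the abstract, the sketch below averages the factual accuracy of atomic units across responses. This is an assumption-laden reconstruction, not the paper's released code: the function name, the boolean fact-check inputs, and the two-level averaging are all hypothetical, and the actual pipeline for decomposing responses into atomic units and verifying them is not shown.

```python
from typing import List


def dahl_style_score(responses: List[List[bool]]) -> float:
    """Hypothetical sketch: each response is a list of booleans, one per
    atomic unit (True = unit judged factually correct). Per the abstract,
    a response's accuracy is the fraction of correct atomic units, and
    the overall score averages these per-response accuracies."""
    per_response_accuracy = [
        sum(units) / len(units) for units in responses if units
    ]
    return sum(per_response_accuracy) / len(per_response_accuracy)


# Example: three model responses, already decomposed and fact-checked.
responses = [
    [True, True, False],       # 2/3 of atomic units correct
    [True, True, True, True],  # 4/4 correct
    [False, True],             # 1/2 correct
]
print(f"DAHL-style score: {dahl_style_score(responses):.3f}")  # ~0.722
```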

Comment: EMNLP2024/FEVER

Seo, Jean; Lim, Jongwon; Jang, Dongjun; Shin, Hyopil, 2024, DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine
