Document detail
ID

oai:arXiv.org:2407.17468

Topic
Computer Science - Computation and... Computer Science - Artificial Inte...
Author
Zhao, Wenting; Goyal, Tanya; Chiu, Yu Ying; Jiang, Liwei; Newman, Benjamin; Ravichander, Abhilasha; Chandu, Khyathi; Bras, Ronan Le; Cardie, Claire; Deng, Yuntian; Choi, Yejin
Category

Computer Science

Year

2024

Listing date

7/31/2024

Keywords
factuality real-world hallucinations entities llms
Abstract

While hallucinations in large language models (LLMs) remain a major challenge, existing factuality benchmarks do not cover the diverse domains of knowledge that real-world LLM users seek information about.

To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality.

It does so by prompting LLMs to generate information about entities mined from user-chatbot conversations in the wild.

These generations are then automatically fact-checked against a systematically curated knowledge source collected from web search.

Notably, half of these real-world entities do not have associated Wikipedia pages.

We evaluate 118,785 generations from 15 LLMs on 7,919 entities.

We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across different domains.
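
As an illustration of the comparison described above, the following is a minimal sketch of how a per-group hallucination rate might be computed. It assumes each generation has already been split into claims that a fact-checker labeled as supported or unsupported. The record layout, the entity names, the counts, and the function `hallucination_rate_by_group` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the paper:
# (entity, has_wikipedia_page, supported_claims, total_claims)
records = [
    ("entity_a", True, 9, 10),
    ("entity_b", True, 8, 10),
    ("entity_c", False, 6, 10),
    ("entity_d", False, 5, 10),
]


def hallucination_rate_by_group(rows):
    """Return the fraction of unsupported claims for each Wikipedia group.

    Assumption: "hallucination rate" here means unsupported claims divided
    by all claims, pooled over every generation in the group.
    """
    unsupported = defaultdict(int)
    total = defaultdict(int)
    for _entity, has_wiki, supported, n_claims in rows:
        unsupported[has_wiki] += n_claims - supported
        total[has_wiki] += n_claims
    return {group: unsupported[group] / total[group] for group in total}


rates = hallucination_rate_by_group(records)
for has_wiki, rate in sorted(rates.items(), reverse=True):
    label = "with Wikipedia page" if has_wiki else "without Wikipedia page"
    print(f"{label}: {rate:.2f}")
```

The same grouping key could be swapped for a domain label to compare rates across domains.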

Finally, given the same base models, adding a retrieval component reduces hallucinations only slightly and does not eliminate them.

Zhao, Wenting; Goyal, Tanya; Chiu, Yu Ying; Jiang, Liwei; Newman, Benjamin; Ravichander, Abhilasha; Chandu, Khyathi; Bras, Ronan Le; Cardie, Claire; Deng, Yuntian; Choi, Yejin. 2024. WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries.
