Document details
Identifier

oai:arXiv.org:2407.12832

Subject
Computer Science - Computation and...
Authors
Cavalin, Paulo; Domingues, Pedro Henrique; Pinhanez, Claudio
Category

Computer Science

Year

2024

Indexing date

24/07/2024

Abstract

In this paper we show that corpus-level aggregation considerably hinders the ability of lexical metrics to accurately evaluate machine translation (MT) systems.

Through empirical experiments, we demonstrate that averaging individual segment-level scores makes metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more like neural metrics such as COMET and BLEURT.

We show that this difference exists because corpus-level and segment-level aggregation differ considerably, owing to the classical mathematical problem of the average of ratios versus the ratio of averages.
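
The gap between the two aggregation schemes can be illustrated with a toy n-gram precision, which is only a simplified stand-in for BLEU or chrF. The sketch below is not the authors' code: the per-segment counts and both function names are hypothetical, chosen so that a long, poorly translated segment dominates the pooled score.

```python
from fractions import Fraction

# Hypothetical (matched n-grams, total n-grams) counts per segment.
# These are illustrative values, not data from the paper.
segments = [(1, 2), (9, 10), (2, 20)]


def ratio_of_averages(counts):
    """Corpus-level style: pool all counts first, then take a single ratio."""
    matched = sum(m for m, _ in counts)
    total = sum(t for _, t in counts)
    return Fraction(matched, total)


def average_of_ratios(counts):
    """Segment-level style: take one ratio per segment, then average them."""
    return sum(Fraction(m, t) for m, t in counts) / len(counts)


corpus_score = ratio_of_averages(segments)    # 12/32 = 3/8
segment_score = average_of_ratios(segments)   # (1/2 + 9/10 + 1/10) / 3 = 1/2

print(float(corpus_score), float(segment_score))
```

With these counts the pooled score is 0.375 while the averaged score is 0.5: the 20-token segment carries most of the weight in the pooled ratio, whereas each segment counts equally in the average.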

Moreover, we show that this difference considerably affects the statistical robustness of corpus-level aggregation.

Considering that neural metrics currently cover only a small set of sufficiently resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.

Cavalin, Paulo; Domingues, Pedro Henrique; Pinhanez, Claudio, 2024, Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation
