Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Document details

IDENTIFICATION

oai:arXiv.org:2410.06733

Subject

Computer Science - Computation and...; Computer Science - Artificial Inte...; Computer Science - Computer Vision...

Author

Chen, Qi; Zhang, Bowen; Wang, Gang; Wu, Qi

Category

Computer Science

Year

2024

Date listed

16/10/2024

Keywords

situation puzzles scenario thinking evaluation lateral model llms computer

Abstract

While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data.

To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs.

This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model.

This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario.

The judge answers based on a detailed reference scenario, or evaluates whether the player's predictions align with the reference scenario.
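The player-judge interaction described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Puzzle`, `play`, `scripted_player`, and `keyword_judge` are hypothetical, the yes/no verdicts and the `GUESS:` prefix are conventions invented for this sketch, the toy puzzle is not taken from SPLAT, and both roles are rule-based stand-ins for what would be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Puzzle:
    story: str      # incomplete story shown to the player
    reference: str  # full reference scenario, visible to the judge only


# Hypothetical interfaces (assumptions for this sketch):
# a player maps (story, history) to its next message,
# a judge maps (reference, message) to a short verdict.
Player = Callable[[str, List[Tuple[str, str]]], str]
Judge = Callable[[str, str], str]


def play(puzzle: Puzzle, player: Player, judge: Judge, max_turns: int = 5):
    """Run a multi-turn game; return (solved, history of (message, verdict))."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        message = player(puzzle.story, history)
        verdict = judge(puzzle.reference, message)
        history.append((message, verdict))
        if verdict == "correct":
            return True, history
    return False, history


def scripted_player(script: List[str]) -> Player:
    """Stand-in for an LLM player: replays a fixed list of messages."""
    def player(story: str, history: List[Tuple[str, str]]) -> str:
        return script[min(len(history), len(script) - 1)]
    return player


def keyword_judge(reference: str, message: str) -> str:
    """Stand-in for an LLM judge: exact match for guesses, keyword lookup for questions."""
    if message.startswith("GUESS:"):
        guess = message[len("GUESS:"):].strip().lower()
        return "correct" if guess == reference.lower() else "incorrect"
    topic = message.rstrip("?").strip().lower()
    return "yes" if topic in reference.lower() else "no"


# Toy puzzle made up for illustration only.
puzzle = Puzzle(
    story="A man asks a bartender for water. The bartender points a gun at him. "
          "The man says thank you and leaves.",
    reference="the man had hiccups and the scare cured them",
)
player = scripted_player([
    "robbery?",
    "hiccups?",
    "GUESS: the man had hiccups and the scare cured them",
])
solved, history = play(puzzle, player, keyword_judge)
```

The key design point the sketch tries to capture is that the judge only compares messages against a reference it already holds, so it does not need to solve the puzzle itself.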

This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs.

The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans.
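As a rough illustration of what an agreement figure of this kind measures, the sketch below computes the fraction of items on which a judge model and a human give the same label. The abstract does not state the exact agreement metric used, so simple percent agreement is an assumption here; the function name `agreement_rate` and the example labels are hypothetical and not results from the paper.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the two label sequences agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("label sequences must not be empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up labels for illustration only.
judge_labels = ["yes", "no", "yes", "no", "yes"]
human_labels = ["yes", "no", "no", "no", "yes"]
rate = agreement_rate(judge_labels, human_labels)  # 4 of 5 labels match
```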

Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements.

This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs.

Code is available at: https://github.com/chenqi008/LateralThinking.

Comment: Accepted by NeurIPS 2024

Chen, Qi; Zhang, Bowen; Wang, Gang; Wu, Qi. 2024. Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles.
