LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K. Surikuchi, Ece Takmaz, Alberto Testoni

Published in: CoRR (2024)

Keyphrases

empirical studies
empirical analysis
human judgments
human users
natural language processing
uci datasets
real world data sets
real world
xml retrieval
real life
experimental design
natural language
data mining
question answering
evaluation criteria
knowledge base
text processing
information retrieval
human operators
multiple tasks
neural network
tasks in natural language processing
field of natural language processing