LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.
Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K. Surikuchi, Ece Takmaz, Alberto Testoni
Published in: CoRR (2024)
Keyphrases
- empirical studies
- empirical analysis
- human judgments
- human users
- natural language processing
- uci datasets
- real world data sets
- real world
- xml retrieval
- real life
- experimental design
- natural language
- data mining
- question answering
- evaluation criteria
- knowledge base
- text processing
- information retrieval
- human operators
- multiple tasks
- neural network
- tasks in natural language processing
- field of natural language processing