SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading.

Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Tërnava, Jianfeng Gao, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues

Published in: CoRR (2024)