Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP.
Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose Maria Alonso-Moral, Mohammad Arvan, Jackie Chi Kit Cheung, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondrej Dusek, Steffen Eger, Qixiang Fang, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Huiyuan Lai, Chris van der Lee, Emiel van Miltenburg, Yiru Li, Saad Mahamood, Margot Mieskes, Malvina Nissim, Natalie Parde, Ondrej Plátek, Verena Rieser, Pablo Mosteiro Romero, Joel R. Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang
Published in: CoRR (2023)