Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs.
Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondrej Dusek
Published in: CoRR (2024)
Keyphrases
- data sets
- high quality
- data sources
- data structure
- database
- training data
- data analysis
- original data
- data collection
- multi source
- data quality
- raw data
- application domains
- sensor data
- data processing
- knowledge discovery
- neural network
- missing values
- small number
- image data
- complex data
- personal information
- multiple sources