A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity.
Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito
Published in: NAACL-HLT (2024)
Keyphrases
- training data
- data sets
- data quality
- test data
- decision trees
- high quality
- raw data
- data distribution
- synthetic data
- statistical analysis
- data processing
- image data
- original data
- data sources
- domain experts
- data structure
- database
- missing values
- incomplete data
- low quality
- target domain
- noisy data
- data samples
- training examples
- labeled data
- data collection
- input data
- data points
- classification accuracy
- learning algorithm