When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale.
Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
Published in: CoRR (2023)
Keyphrases
- data sets
- synthetic data
- data collection
- data analysis
- application domains
- image data
- complex data
- data processing
- data structure
- probability distribution
- data points
- database
- data objects
- input data
- high quality
- original data
- prior knowledge
- data distribution
- statistical analysis
- experimental data
- data mining
- missing data
- information retrieval
- high dimensional data
- decision trees
- multi dimensional
- data sources