Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
Published in: CoRR (2024)
Keyphrases
- training data
- data sets
- data collection
- test data
- high quality
- noisy data
- data distribution
- experimental data
- synthetic data
- data quality
- database
- prior knowledge
- knowledge discovery
- data mining techniques
- high dimensional data
- training samples
- statistical analysis
- training dataset
- data sources
- training set
- data analysis
- data structure
- computer systems
- data processing
- input data
- training examples
- supervised learning
- data points
- missing data
- spatial data
- probability distribution
- social networks