AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters.
Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge
Published in: CoRR (2024)
Keyphrases
- data collection
- data sets
- raw data
- database
- high quality
- data sources
- synthetic data
- image data
- clustering algorithm
- training data
- keywords
- data structure
- data analysis
- missing data
- data points
- search engine
- web documents
- information retrieval
- website
- high dimensional data
- information retrieval systems
- probability distribution
- data model
- high level