AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters.
Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge
Published in: ACL (1) (2024)
Keyphrases
- synthetic data
- data sets
- training data
- raw data
- data structure
- data collection
- data processing
- input data
- computer systems
- data sources
- data analysis
- keywords
- website
- original data
- data quality
- document analysis
- natural language processing
- data mining techniques
- co-occurrence
- high quality
- social networks
- information retrieval
- data mining