D4: Improving LLM Pretraining via Document De-Duplication and Diversification.
Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari Morcos
Published in: NeurIPS (2023)
Keyphrases
- information retrieval systems
- document classification
- document collections
- document clustering
- document retrieval
- tabu search
- retrieval systems
- text documents
- machine learning
- feature selection
- keyword extraction
- keywords
- document images
- document structure
- textual content
- ranked list
- web documents
- metadata
- knowledge base
- information retrieval