D4: Improving LLM Pretraining via Document De-Duplication and Diversification.
Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
Published in: CoRR (2023)
Keyphrases
- document collections
- document images
- information retrieval
- tabu search
- document clustering
- information retrieval systems
- web documents
- document analysis
- document classification
- relevant documents
- neural network
- structured documents
- retrieval systems
- keywords
- case study
- learning algorithm
- vector space model
- data mining
- database
- document structure