Balanced Data Sampling for Language Model Training with Clustering.
Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu
Published in: CoRR (2024)
Keyphrases
- language model
- data points
- categorical data
- document retrieval
- clustering algorithm
- language modelling
- context sensitive
- spectral clustering
- web search
- query expansion
- text categorization
- clustering method
- information extraction
- retrieval model
- probabilistic model
- high dimensional
- language modeling
- training set
- information retrieval