DataComp-LM: In search of the next generation of training sets for language models.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah M. Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Raghavi Chandu, Thao Nguyen, Igor Vasiljevic, Sham M. Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar
Published in: CoRR (2024)
Keyphrases
- language model
- language modeling
- speech recognition
- document retrieval
- probabilistic model
- n-gram
- retrieval model
- document ranking
- query specific
- document level
- query expansion
- information retrieval
- training set
- context sensitive
- word clouds
- language modelling
- statistical language models
- test collection
- ad hoc information retrieval
- machine learning
- query terms
- pseudo relevance feedback
- classification accuracy
- training data
- term dependencies
- document length
- retrieval effectiveness
- hidden Markov models
- okapi bm
- language models for information retrieval