Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
Published in: CoRR (2024)
Keyphrases
- language model
- language modeling
- probabilistic model
- n gram
- query expansion
- retrieval model
- document retrieval
- speech recognition
- information retrieval
- test collection
- mixture model
- context sensitive
- ad hoc information retrieval
- smoothing methods
- query terms
- vector space model
- statistical language models
- language model for information retrieval
- document length
- dependency structure
- translation model
- retrieval effectiveness