The Pile: An 800GB Dataset of Diverse Text for Language Modeling.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
Published in: CoRR (2021)
Keyphrases
- language modeling
- information retrieval
- language model
- retrieval model
- query expansion
- text retrieval
- cross-lingual
- n-gram
- probabilistic model
- anchor text
- relevance model
- document retrieval
- text classification
- information retrieval systems
- text mining
- translation model
- retrieval systems
- retrieval effectiveness
- search engine