The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
Published in: CoRR (2023)
Keyphrases
- web data
- web mining
- semi structured
- web pages
- web documents
- web usage mining
- web queries
- web content
- incremental mining
- deep web
- web information
- web sources
- natural language processing
- query logs
- link structure
- data mining techniques
- information integration
- data sets
- information extraction
- probabilistic model
- web crawling
- search engine