The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
Published in: NeurIPS (2023)
Keyphrases
- web data
- semi structured
- web mining
- web usage mining
- web pages
- web content
- incremental mining
- web information
- natural language processing
- web documents
- link structure
- web sources
- deep web
- web crawling
- web information extraction
- page contents
- query logs
- information integration
- database
- structured data
- text mining
- website
- metadata
- data sets