The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.
Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
Published in: CoRR (2024)
Keyphrases
- text data
- web pages
- text classification
- textual data
- text mining
- high dimensional
- structured data
- massive datasets
- text documents
- topic hierarchies
- high dimensional data
- web documents
- document collections
- bag of words
- knowledge discovery
- machine learning
- feature selection
- text categorization
- unsupervised learning
- co-occurrence
- topic models
- supervised learning
- active learning
- bayesian networks
- metadata
- search engine
- text analytics
- databases