Guilherme Penedo
Publication Activity (10 Years)
Years Active: 2023-2024
Publications (10 Years): 5
Top Topics
Language Models For Information Retrieval
Web Data
Language Model
Incremental Mining
Top Venues
CoRR
ArabicNLP
NeurIPS
Publications
Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.
CoRR (2024)
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only.
CoRR (2023)
Ebtesam Almazrouei, Ruxandra Cojocaru, Michele Baldo, Quentin Malartic, Hamza Alobeidli, Daniele Mazzotta, Guilherme Penedo, Giulia Campesan, Mugariya Farooq, Maitha Alhammadi, Julien Launay, Badreddine Noune
AlGhafa Evaluation Benchmark for Arabic Language Models.
ArabicNLP (2023)
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only.
NeurIPS (2023)
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
The Falcon Series of Open Language Models.
CoRR (2023)