The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Sasko, Quentin Lhoest, Angelina McMillan-Major, Gérard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber, Manuel Muñoz, Jian Zhu, Daniel van Strien, Zaid Alyafeai, Khalid Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa, Kyle Lo, Manan Dey, Pedro Ortiz Suarez, Aaron Gokaslan, Shamik Bose, David Ifeoluwa Adelani, Long Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Margaret Mitchell, Sasha Luccioni, Yacine Jernite
Published in: CoRR (2023)