The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset.

Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Sasko, Quentin Lhoest, Angelina McMillan-Major, Gérard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber, Manuel Muñoz, Jian Zhu, Daniel van Strien, Zaid Alyafeai, Khalid Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa, Kyle Lo, Manan Dey, Pedro Ortiz Suarez, Aaron Gokaslan, Shamik Bose, David Ifeoluwa Adelani, Long Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Margaret Mitchell, Sasha Luccioni, Yacine Jernite
Published in: CoRR (2023)
Keyphrases
  • digital libraries
  • parallel corpus
  • synthetic datasets
  • manually annotated
  • machine translation system
  • test set
  • training dataset
  • language independent
  • statistical machine translation
  • language resources