A New Massive Multilingual Dataset for High-Performance Language Technologies.
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann
Published in: LREC/COLING (2024)
Keyphrases
- language resources
- language specific
- programming language
- massive datasets
- language independent
- text generation
- natural language
- language learning
- parallel corpus
- multilingual documents
- data analysis
- language processing
- modeling language
- comparable corpora
- specification language
- computational linguistics
- database
- cross lingual
- information retrieval
- 21st century
- machine translation system
- cross language
- web intelligence
- feature set
- general purpose
- data mining