A New Massive Multilingual Dataset for High-Performance Language Technologies.
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann
Published in: CoRR (2024)
Keyphrases
- language specific
- language resources
- programming language
- parallel corpus
- language learning
- database
- digital libraries
- multilingual documents
- data analysis
- language independent
- feature set
- massive datasets
- natural language
- extensible markup language
- artificial intelligence
- comparable corpora
- text generation
- massive data
- cross lingual
- emerging technologies
- training dataset
- web technologies
- 21st century
- n gram
- natural language processing
- mobile devices
- knowledge base
- feature selection
- data mining