A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages.
Jorge Palomar-Giner, José Javier Saiz, Ferran Espuña, Mario Mina, Severino Da Dalt, Joan Llop, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Aitor Gonzalez-Agirre, Marta Villegas
Published in: LREC/COLING (2024)
Keyphrases
- linguistic resources
- expressive power
- automatic extraction
- natural language processing
- statistical machine translation
- information extraction
- comparable corpora
- language independent
- bilingual lexicon
- cross lingual
- databases
- scientific databases
- multi lingual
- text summarization
- parallel corpus
- automatically extracted
- infrared
- machine learning