NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages.
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung
Published in: CoRR (2023)
Keyphrases
- high quality
- statistical machine translation
- linguistic resources
- expressive power
- resource management
- multi lingual
- parallel corpora
- low quality
- natural language processing
- resource allocation
- language independent
- language identification
- text summarization
- high levels
- data sets
- arabic language
- syntactic and semantic dependencies
- target language
- query translation
- information resources
- information retrieval systems
- digital libraries
- knowledge base
- machine learning