Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus.
Erwan Moreau, Carl Vogel
Published in: LREC (2018)
Keyphrases
- language specific
- word segmentation
- language independent
- n gram
- out of vocabulary
- cross lingual
- topic tracking
- parallel corpus
- machine translation
- text classification
- parallel corpora
- word recognition
- natural language
- language model
- word level
- cross language
- language modeling
- statistical machine translation
- document analysis
- machine translation system
- bag of words
- part of speech
- text retrieval