Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus.

Erwan Moreau, Carl Vogel

Published in: LREC (2018)

Keyphrases

language specific
word segmentation
language independent
n gram
out of vocabulary
cross lingual
topic tracking
parallel corpus
machine translation
text classification
parallel corpora
word recognition
natural language
language model
word level
cross language
language modeling
statistical machine translation
document analysis
machine translation system
bag of words
part of speech
text retrieval