Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages.
Tomasz Limisiewicz, Jirí Balhar, David Marecek
Published in: CoRR (2023)
Keyphrases
- language modeling
- cross lingual
- language model
- comparable corpora
- n gram
- language specific
- out of vocabulary
- language independent
- character n grams
- cross language
- retrieval model
- information retrieval
- query expansion
- probabilistic model
- text classification
- parallel corpora
- word segmentation
- improvements in retrieval effectiveness
- translation model
- parallel corpus
- multilingual retrieval
- named entities
- information retrieval systems
- relevance model
- cross language information retrieval
- machine translation
- test collection
- keywords
- query translation
- retrieval effectiveness
- statistical machine translation
- machine translation system
- document retrieval
- linguistic resources
- mixture model
- wordnet