Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge.
Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella
Published in: CoRR (2024)
Keyphrases
- out of vocabulary
- spoken document retrieval
- n gram
- named entities
- broadcast news
- language model
- named entity recognition
- word segmentation
- cross language information retrieval
- natural language processing
- speech recognition
- information extraction
- character n grams
- parallel corpora
- hand crafted
- test collection
- query translation
- automatic speech recognition
- cross lingual
- language modeling
- information retrieval
- language independent
- information access
- video retrieval
- machine translation