Learn Your Tokens: Word-Pooled Tokenization for Language Modeling.
Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
Published in: EMNLP (Findings) (2023)
Keyphrases
- language modeling
- n-gram
- language model
- character n-grams
- term weighting
- statistical language modeling
- word segmentation
- information retrieval
- translation model
- retrieval model
- cross-lingual
- query expansion
- probabilistic model
- text classification
- language-independent
- vector space model
- statistical language models
- machine learning
- information retrieval systems
- query terms
- named entities
- comparable corpora
- digital libraries
- test collection