Learn Your Tokens: Word-Pooled Tokenization for Language Modeling.
Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
Published in: CoRR (2023)
Keyphrases
- language modeling
- n-gram
- language model
- character n-grams
- term weighting
- translation model
- word segmentation
- statistical language modeling
- information retrieval
- retrieval model
- query expansion
- cross-lingual
- language-independent
- text classification
- probabilistic model
- improvements in retrieval effectiveness
- data mining
- statistical language models
- word sense disambiguation
- text categorization
- co-occurrence
- statistical machine translation
- pseudo-relevance feedback
- machine learning
- information retrieval systems
- document retrieval
- named entities
- test collection