Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages.
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubesic, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Published in: LREC/COLING (2024)
Keyphrases
- language model
- statistical machine translation
- information retrieval
- language modeling
- web documents
- n-gram
- web pages
- retrieval model
- test collection
- document retrieval
- probabilistic model
- language independent
- statistical language models
- language modelling
- document ranking
- speech recognition
- query expansion
- text summarization
- smoothing methods
- text corpora
- translation model
- text data
- vector space model
- cross lingual
- document level
- text collections
- text retrieval
- linguistic resources
- relevance model
- context sensitive
- linked data
- semantic web
- keywords
- multiword
- text documents
- question answering
- parallel corpora
- Chinese-English
- natural language processing
- language models for information retrieval