Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages.
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubesic, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Published in: CoRR (2024)
Keyphrases
- language model
- statistical machine translation
- information retrieval
- language modeling
- n gram
- web pages
- web documents
- probabilistic model
- query expansion
- retrieval model
- text retrieval
- translation model
- language modelling
- document level
- document retrieval
- statistical language models
- cross lingual
- text data
- multiword
- language independent
- context sensitive
- test collection
- query terms
- language models for information retrieval
- text summarization
- text collections
- text corpora
- speech recognition
- machine translation system
- document ranking
- smoothing methods
- pseudo relevance feedback
- relevance model
- vector space model
- text mining
- keywords
- chinese english
- semantic web
- natural language processing
- linguistic resources
- query specific
- linked data