FuLG: 150B Romanian Corpus for Language Model Pretraining.
Vlad-Andrei Badoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu
Published in: CoRR (2024)
Keyphrases
- language model
- document level
- language modeling
- statistical machine translation
- multiword
- n-gram
- speech recognition
- document retrieval
- probabilistic model
- language modelling
- retrieval model
- mixture model
- information retrieval
- context sensitive
- query expansion
- ad hoc information retrieval
- test collection
- statistical language models
- vector space model
- query specific
- language models for information retrieval