The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models.
Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Thomas Lemberger
Published in: CoRR (2023)
Keyphrases
- language model
- language modeling
- probabilistic model
- document retrieval
- language modelling
- n gram
- query expansion
- retrieval model
- information retrieval
- speech recognition
- statistical language models
- test collection
- natural language processing
- language model for information retrieval
- context sensitive
- query terms
- ad hoc information retrieval
- information extraction
- training set
- natural language
- relevance model
- smoothing methods
- retrieval effectiveness
- pseudo relevance feedback
- linked data
- term dependencies
- document ranking
- translation model
- language processing
- document length
- vector space model
- question answering
- spoken term detection
- training data