DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models.
Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Bin Wang, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao
Published in: CoRR (2024)
Keyphrases
- multi modal
- language model
- document retrieval
- document ranking
- ad hoc information retrieval
- document length
- language modeling
- query terms
- information retrieval
- passage retrieval
- vector space model
- audio visual
- relevance model
- n gram
- document level
- query expansion
- probabilistic model
- query specific
- retrieval model
- language modeling approaches
- language modelling
- speech recognition
- multi modality
- term dependencies
- test collection
- retrieval systems
- translation model
- video search
- smoothing methods
- okapi bm
- language models for information retrieval
- document collections
- high dimensional
- image annotation
- statistical language models
- language modeling framework
- information retrieval systems
- inter document similarities
- pseudo relevance feedback
- text classifiers
- retrieved documents
- web documents
- keywords