Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models.
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Published in: CoRR (2023)
Keyphrases
- audio visual
- fine grained
- language model
- multi modal
- coarse grained
- language modeling
- passage retrieval
- n gram
- document retrieval
- probabilistic model
- visual data
- information retrieval
- multimodal fusion
- speech recognition
- multi stream
- query expansion
- access control
- visual information
- test collection
- retrieval model
- multimedia
- query terms
- vector space model
- relevance model
- contextual information