video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
Published in: CoRR (2024)
Keyphrases
- audio visual
- language model
- visual data
- multimedia
- passage retrieval
- audio features
- language modeling
- multi modal
- visual information
- speech recognition
- n gram
- document retrieval
- video data
- multi stream
- probabilistic model
- video sequences
- information retrieval
- retrieval model
- test collection
- query expansion
- video content
- human actions
- word error rate
- smoothing methods
- audio visual speech recognition
- video frames
- high dimensional
- video search
- contextual information