Audio-visual training for improved grounding in video-text LLMs.
Shivprasad Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, Rajeshkumar SA
Published in: CoRR (2024)
Keyphrases
- audio visual
- visual data
- video summarization
- multimedia
- meeting room
- multi modal
- audio visual content
- audio features
- person authentication
- temporal context
- visual information
- multimodal fusion
- video data
- audio visual speech recognition
- text data
- user comments
- multi stream
- video content
- video sequences
- multimedia data
- video retrieval
- data sets
- training set
- video streams
- video frames
- mobile devices
- information retrieval
- text documents
- semantic information
- high dimensional data
- co occurrence