An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM.
Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee
Published in: CoRR (2024)
Keyphrases
- question answering
- video sequences
- video data
- multimedia
- visual data
- natural language processing
- image classification
- information retrieval
- video content
- key frames
- image content
- named entities
- image features
- information extraction
- video frames
- natural language
- image representation
- natural language questions
- video retrieval
- video shots
- relation extraction
- question classification
- qa clef
- textual entailment recognition
- cross language
- image retrieval
- knowledge base