MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.
Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou
Published in: CoRR (2022)
Keyphrases
- multi modal
- spatial temporal
- question answering
- video shots
- spatial and temporal
- semantic concepts
- information retrieval
- information extraction
- natural language
- spatio temporal
- natural language processing
- temporal information
- named entities
- video search
- action recognition
- spatial information
- image annotation
- audio visual
- high dimensional
- video streams
- human actions
- video retrieval
- high level
- video frames
- image classification