MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.
Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou
Published in: CVPR (2023)
Keyphrases
- multi modal
- spatial temporal
- question answering
- video shots
- semantic concepts
- temporal information
- spatial and temporal
- information extraction
- spatio temporal
- action recognition
- natural language processing
- named entities
- information retrieval
- natural language
- audio visual
- video search
- video retrieval
- spatial information
- high dimensional
- machine learning
- video data
- video content
- human actions
- video sequences
- multimedia