Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling.
Hsin-Ying Lee, Hung-Ting Su, Bing-Chen Tsai, Tsung-Han Wu, Jia-Fong Yeh, Winston H. Hsu
Published in: CoRR (2022)
Keyphrases
- fine grained
- spatial temporal
- question answering
- coarse grained
- temporal information
- spatial and temporal
- natural language processing
- spatio temporal
- information extraction
- access control
- information retrieval
- action recognition
- visual data
- video streams
- named entities
- multimedia
- video frames
- video retrieval
- low level
- image sequences