MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning.
Cong Zou, Xuchen Wang, Yaosi Hu, Zhenzhong Chen, Shan Liu
Published in: VCIP (2021)
Keyphrases
- multimedia
- video data
- video sequences
- multi modal
- real time video
- video content
- video streams
- visual attention
- multimodal information
- real time
- body motions
- key frames
- visual data
- audio visual
- generation process
- multiple modalities
- video clips
- video analysis
- video database
- digital video
- linguistic features
- story segmentation
- spatial and temporal
- video frames