Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning.
Xin Wang, Yuan-Fang Wang, William Yang Wang
Published in: CoRR (2018)
Keyphrases
- cross modal
- multi modal
- visual data
- video data
- multimedia retrieval
- video sequences
- semantic concepts
- multimedia
- image retrieval
- video content
- video frames
- multimedia data
- multimedia databases
- visual similarity
- video analysis
- visual recognition
- perceptual information
- video retrieval
- visual content
- visual features
- image sequences
- computer vision
- human actions
- image understanding
- key frames
- low level
- feature extraction
- information retrieval