Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning.
Xin Wang, Yuan-Fang Wang, William Yang Wang
Published in: NAACL-HLT (2) (2018)
Keyphrases
- cross modal
- multi modal
- visual data
- video data
- video sequences
- multimedia
- multimedia retrieval
- semantic concepts
- video content
- visual recognition
- multimedia databases
- image retrieval
- video streams
- perceptual information
- video frames
- visual information
- space time
- video retrieval
- video analysis
- visual similarity
- visual features
- human actions
- multimedia data
- feature extraction