Attention-based Visual-Audio Fusion for Video Caption Generation
Ningning Guo, Huaping Liu, Linhua Jiang. Published in: ICARM (2019)
Keyphrases
- news video
- visual data
- visual features
- visual information
- video indexing and retrieval
- story segmentation
- content based video retrieval
- multimedia
- audio video
- video retrieval
- video content
- audio features
- video database
- video indexing
- video shots
- video data
- digital video
- key frames
- visual cues
- selective attention
- multimodal fusion
- visual speech
- visual saliency
- video sequences
- audio files
- multimedia processing
- multimedia data
- visual input
- low level
- cross modal
- video content analysis
- scene change detection
- video files
- multimodal information
- human actions
- video material
- semantic concepts
- visual content
- video search
- saliency map
- audio visual
- broadcast news
- visual analysis
- video streams
- video clips
- video analysis
- video scene
- data fusion
- video frames
- digital audio
- media streams
- news stories
- fusion method
- video segments
- multimedia information
- audio visual content
- lifelog
- audio signals
- video signals
- visual field
- semantic information
- high level
- audio stream
- closed captions
- low level features
- image fusion
- visual stimuli
- audio signal