MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing.
Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang
Published in: ACM Multimedia (2022)
Keyphrases
- audio visual
- video summarization
- visual data
- multi modal
- multimedia
- multimodal fusion
- audio features
- multiscale
- temporal context
- visual information
- multi stream
- event detection
- video sequences
- audio visual content
- sports video
- input image
- multimedia data
- video data
- image representation
- natural language processing
- spatio temporal
- audio visual speech recognition
- space time
- natural language