Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing.
Yating Xu, Conghui Hu, Gim Hee Lee
Published in: WACV (2024)
Keyphrases
- audio visual
- visual data
- multi modal
- multimedia
- visual information
- video data
- high dimensional
- video sequences
- multimedia data
- image data
- video retrieval
- video content
- video frames
- natural language
- image sequences
- information retrieval
- high dimensional data
- visual features
- object class
- data sources
- human motion
- human actions
- visual content
- superpixels