Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing.
Yating Xu, Conghui Hu, Gim Hee Lee
Published in: CoRR (2023)
Keyphrases
- audio visual
- visual data
- multi modal
- multimedia
- visual information
- video data
- video sequences
- multimedia data
- video frames
- natural language processing
- object class
- high dimensional data
- image data
- visual features
- superpixels
- high dimensional
- topic models
- image content
- information extraction
- human motion
- visual content
- low level
- co occurrence
- domain knowledge
- object detection