Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams.
Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren
Published in: Interspeech (2021)
Keyphrases
- audio visual
- video streams
- cross modal
- multi modal
- visual data
- video data
- video content
- visual information
- video sequences
- multimedia
- video analysis
- video frames
- high dimensional
- audio features
- visual features
- machine learning
- multimedia databases
- image annotation
- image content
- multimedia data
- contextual information
- image classification
- image retrieval
- data analysis
- computer vision
- search engine