Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams.
Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren
Published in: CoRR (2021)
Keyphrases
- audio visual
- video streams
- cross modal
- multi modal
- visual data
- video data
- video content
- video frames
- visual information
- multimedia
- high dimensional
- video analysis
- video sequences
- visual features
- multimedia databases
- music information retrieval
- data management
- object recognition
- image annotation
- high level
- noisy environments
- information retrieval
- audio features