Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation.
Debang Liu, Tianqi Zhang, Mads Græsbøll Christensen, Ying Wei, Zeliang An
Published in: INTERSPEECH (2023)
Keyphrases
- audio visual
- multiscale
- temporal context
- multimodal fusion
- person authentication
- multi modal
- image fusion
- visual information
- sound source
- visual data
- multi stream
- emotion recognition
- spatio temporal
- temporal information
- speaker verification
- multimedia
- image processing
- frequency domain
- visual features
- low level
- audio features
- image segmentation