Audio-visual speech separation based on joint feature representation with cross-modal attention.
Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang
Published in: CoRR (2022)
Keyphrases
- audio visual
- cross modal
- feature representation
- multi modal
- visual data
- feature extraction
- face recognition
- low dimensional
- visual information
- sparse representation
- high dimensional
- feature set
- image retrieval
- high dimensional data
- image content
- image classification
- data sets
- video data
- multiscale
- search engine
- machine learning