Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.
Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu, Xunying Liu, Helen Meng
Published in: ICASSP (2024)
Keyphrases
- audio visual
- multi modal
- visual features
- visual information
- audio features
- semantic concepts
- visual data
- image classification
- emotion recognition
- image annotation
- visual content
- image retrieval
- low level
- keywords
- cross modal
- low level features
- speaker verification
- acoustic features
- high dimensional
- image collections
- multi modality
- semantic gap
- uni modal
- image search
- content based video retrieval
- machine learning
- video search
- human actions
- visual similarity
- web images
- key frames
- speech recognition
- image representation
- feature extraction
- multimedia