Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.
Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu, Xunying Liu, Helen Meng
Published in: CoRR (2024)
Keyphrases
- audio visual
- multi modal
- visual features
- visual information
- audio features
- semantic concepts
- visual data
- image classification
- image annotation
- visual content
- emotion recognition
- image retrieval
- speaker verification
- image search
- low level
- multi modality
- cross modal
- keywords
- image collections
- low level features
- visual similarity
- semantic gap
- high dimensional
- acoustic features
- machine learning
- automatic image annotation
- human actions
- key frames
- video streams
- semantic information