VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion.
Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng. Published in: CoRR (2022)
Keyphrases
- speech synthesis
- knowledge transfer
- cross-modal
- prosodic features
- speech recognition
- text-to-speech
- vocal tract
- multi-modal
- visual data
- knowledge sharing
- video sequences
- transfer learning
- multimedia retrieval
- semantic concepts
- multimedia
- audio-visual
- video analysis
- video streams
- image retrieval
- video frames
- video data
- language model
- visual recognition
- speaker verification
- learning tasks
- multimedia databases
- multimedia data
- video content
- automatic speech recognition
- visual similarity
- pattern recognition
- image processing
- data sets
- visual information
- hidden Markov models
- word processing
- image sequences
- visual content
- speech signal
- information extraction
- object recognition