DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding.
Jeongsoo Choi, Joanna Hong, Yong Man Ro
Published in: CoRR (2023)
Keyphrases
- speech synthesis
- vision guided
- speech recognition
- prosodic features
- vocal tract
- mobile robot navigation
- text to speech
- multimedia
- natural scenes
- real time
- speaker verification
- video sequences
- speech signal
- video data
- automatic speech recognition
- language model
- pattern recognition
- hidden Markov models
- neural network
- video frames
- key frames
- human computer interaction
- mobile robot
- natural images
- image quality
- multiscale
- information retrieval