Audio-visual video-to-speech synthesis with synthesized input audio.
Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic
Published in: CoRR (2023)
Keyphrases
- audio visual
- speech synthesis
- visual data
- video summarization
- audio features
- multimedia
- audio visual content
- meeting room
- multi modal
- speech recognition
- visual information
- prosodic features
- speaker verification
- multimodal fusion
- text to speech
- sports video
- audio visual speech recognition
- video data
- emotion recognition
- video scene
- multi stream
- video sequences
- video content
- vocal tract
- data sets
- language model
- multimedia data
- video streams
- high dimensional data
- input data
- image data
- high dimensional
- low dimensional
- visual content
- human actions
- human activities
- image sequences
- video frames
- image content
- space time
- contextual information