Large-scale unsupervised audio pre-training for video-to-speech synthesis.
Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic
Published in: CoRR (2023)
Keyphrases
- speech synthesis
- text to speech
- prosodic features
- multimedia
- audio video
- speech recognition
- multimedia processing
- digital video
- supervised learning
- video sequences
- scene change detection
- supervised training
- multimedia information
- video data
- visual data
- vocal tract
- video files
- unsupervised manner
- video content
- video streams
- audio files
- training set
- unsupervised learning
- soccer video
- digital audio
- video content analysis
- content based video retrieval
- semi supervised
- audio signals
- audio content
- audio stream
- audio visual content
- video frames
- video database
- speech corpus
- pattern recognition
- video signals
- machine learning
- training data
- classifier training
- story segmentation
- audio visual
- video analysis
- deep architectures
- computer vision
- hidden markov models
- language model
- video recordings
- action recognition
- audio signal
- broadcast news
- visual information
- noisy environments
- video retrieval