Audio-visual speech synthesis using vision transformer-enhanced autoencoders with ensemble of loss functions.
Subhayu Ghosh, Snehashis Sarkar, Sovan Ghosh, Frank Zalkow, Nanda Dulal Jana
Published in: Appl. Intell. (2024)
Keyphrases
- loss function
- audio visual
- speech synthesis
- speech recognition
- multi modal
- text to speech
- pairwise
- visual information
- loss minimization
- denoising
- support vector
- visual data
- multimedia
- emotion recognition
- computer vision
- ensemble methods
- neural network
- multi stream
- image processing
- speaker verification
- convex loss functions
- base classifiers
- similarity measure
- language model
- probabilistic model
- feature vectors
- training set
- boosting algorithms
- training data