Integrating audio and visual cues for speaker friendliness in multimodal speech synthesis.
David House
Published in: INTERSPEECH (2007)
Keyphrases
- visual cues
- speech synthesis
- prosodic features
- visual information
- audio visual
- speech recognition
- text to speech
- vocal tract
- low level
- multimodal interaction
- speaker identification
- multiple modalities
- mid level
- visual data
- speaker verification
- multi modal
- multimodal fusion
- speech corpus
- lecture videos
- multimedia
- noisy environments
- speech signal
- visual features
- cross modal
- automatic speech recognition
- speaker recognition
- emotion recognition
- multiple cues
- machine learning
- hidden markov models
- relational databases
- audio stream
- feature extraction
- multimodal information
- pattern recognition
- visual speech
- multi stream
- domain knowledge
- key frames