Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity
Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee
Published in: CoRR (2020)
Keyphrases
- audio visual
- text to speech
- prosodic features
- speaker identification
- speech recognition
- audio stream
- speaker recognition
- hidden Markov models
- automatic speech recognition
- broadcast news
- automatic transcription
- text to speech synthesis
- spontaneous speech
- speaker verification
- multi-stream
- speech processing
- speech synthesis
- emotion recognition
- human language
- English text
- text generation
- contextual information
- spoken documents
- visual data
- multimodal
- text recognition
- visual speech
- multimedia
- speaker diarization
- text graphics
- audio signals
- speech signal
- cepstral features
- context aware
- acoustic features
- lexical features
- semantic context
- multimodal interfaces
- spoken language
- audio features
- noisy environments
- natural language descriptions
- hand movements
- text input
- information retrieval
- Gaussian mixture model
- speech music discrimination