Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion.
Haoquan Yang, Liqun Deng, Yu Ting Yeung, Nianzu Zheng, Yong Xu
Published in: INTERSPEECH (2022)
Keyphrases
- text to speech
- speech synthesis
- prosodic features
- synthesized speech
- speech recognition
- audio visual
- multi modal
- multimodal interaction
- neural network
- feature extraction
- text to speech synthesis
- voice activity detection
- data sets
- vocal tract
- temporal aspects
- speaker recognition
- emotion recognition
- representation scheme
- model construction
- dynamic bayesian networks