Disentangling Prosody Representations With Unsupervised Speech Reconstruction.
Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, Stefan Wermter
Published in: IEEE ACM Trans. Audio Speech Lang. Process. (2024)
Keyphrases
- text to speech
- speech synthesis
- speech recognition
- prosodic features
- multi stream
- audio visual
- unsupervised learning
- synthesized speech
- vocal tract
- multi modal
- higher level
- high resolution
- supervised learning
- three dimensional
- reconstruction process
- data driven
- semi supervised
- reconstruction error
- multiple representations
- visual features
- speaker identification
- 3d objects
- hidden markov models
- pattern recognition
- bayesian networks