ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer.
Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao
Published in: EMNLP (2023)
Keyphrases
- text to speech
- speech synthesis
- text to speech synthesis
- multimodal interaction
- prosodic features
- programming tool
- low level
- word processing
- visual features
- anisotropic diffusion
- English text
- fuzzy logic
- multi modal
- diffusion process
- high level
- high voltage
- human vision
- diffusion model
- power system
- learning styles
- multiscale