Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model.
Jinlong Xue, Yayue Deng, Yichen Han, Yingming Gao, Ya Li
Published in: CoRR (2024)
Keyphrases
- multi modal
- language model
- audio visual
- cross modal
- context sensitive
- language modeling
- text to speech synthesis
- n gram
- speech recognition
- probabilistic model
- document retrieval
- query expansion
- text to speech
- multi modality
- information retrieval
- retrieval model
- test collection
- mixture model
- high dimensional
- dependency structure
- single modality
- ad hoc information retrieval
- query terms
- video search
- relevance model
- feature space
- feature extraction
- multimedia
- smoothing methods
- search engine