CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models.
Hao-Wen Dong
Xiaoyu Liu
Jordi Pons
Gautam Bhattacharya
Santiago Pascual
Joan Serrà
Taylor Berg-Kirkpatrick
Julian J. McAuley
Published in:
WASPAA (2023)
Keyphrases
prior knowledge
human language
real time
information retrieval
multimedia
text to speech
natural language
computer vision
video sequences
training set
unsupervised learning
unlabeled data
news video
machine translation system