CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models.
Hao-Wen Dong
Xiaoyu Liu
Jordi Pons
Gautam Bhattacharya
Santiago Pascual
Joan Serrà
Taylor Berg-Kirkpatrick
Julian J. McAuley
Published in: CoRR (2023)
Keyphrases
human language
probabilistic model
computer vision
real time
keywords
multimedia
prior knowledge
information extraction
vision system
video analysis
text mining
unsupervised learning
text documents
visual information
visual data
cognitive models