VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning.
Qiu-Shi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei
Published in: CoRR (2022)
Keyphrases
- supervised learning
- online learning
- visual information
- learning algorithm
- learning process
- training set
- text mining
- audio visual
- text to speech
- information retrieval
- audio stream
- text graphics
- content based video retrieval
- audio signals
- visual representation
- prediction accuracy
- learning environment
- reinforcement learning