VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning.
Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Li-Rong Dai, Daxin Jiang, Jinyu Li, Furu Wei
Published in: IEEE Trans. Multim. (2024)
Keyphrases
- visual information
- online learning
- information retrieval
- text to speech
- supervised learning
- learning algorithm
- linear prediction
- training process
- active learning
- visual data
- cross modal
- learning process
- reinforcement learning
- low level
- audio visual
- visual representation
- content based video retrieval
- acoustic signals