VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text.
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
Published in:
CoRR (2021)
Keyphrases
multimedia
learning process
reinforcement learning
learning algorithm
supervised learning
video sequences
signal processing
video streams
real time
information retrieval
visual information
multimedia data
key frames
text graphics