E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer.

Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu
Published in: CoRR (2023)
Keyphrases