E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer.

Published in: CoRR (2023)

Keyphrases