VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling.

Published in: CoRR (2021)

Keyphrases