An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling.

Published in: CoRR (2022)

Keyphrases