Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data.

Published in: WACV (Workshops) (2023)

Keyphrases