Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data.

Published in: CoRR (2023)

Keyphrases