Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection.

Published in: CoRR (2024)
