GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions.

Published in: CoRR (2023)

Keyphrases