Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding.

Published in: CVPR (2023)

Keyphrases