Publication: STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding.