Publication: STSI: Efficiently Mine Spatio-Temporal Semantic Information between Different Multimodal for Video Captioning.