Publication: Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval.