Publication: Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering.