Publication: SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set.