Publication: Automatic caption generation for video data. Time alignment between caption and acoustic signal.