Publication: Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog.