Publication: See, move and hear: a local-to-global multi-modal interaction network for video action recognition.