Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions.
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogério Feris
James R. Glass
Aude Oliva
Published in:
CoRR (2021)
Keyphrases
audio visual
multimedia
visual data
multi modal
high level
speech recognition
data sets
three dimensional
human computer interaction
space time
video retrieval
audio visual content