Spoken Moments: Learning Joint Audio-Visual Representations From Video Descriptions.
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogério Feris
James R. Glass
Aude Oliva
Published in:
CVPR (2021)
Keyphrases
audio visual
multi modal
video data
visual information
spatio temporal
nearest neighbor
image content
person authentication