Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.
Bowen Shi
Wei-Ning Hsu
Kushal Lakhotia
Abdelrahman Mohamed
Published in: CoRR (2022)
Keyphrases
audio-visual
multimodal
visual information
multi-stream
multimedia
emotion recognition
multimodal fusion
affective states
visual data
low-level
speaker verification
spatio-temporal
dimensionality reduction
person authentication