Siamese Vision Transformers are Scalable Audio-visual Learners.
Yan-Bo Lin
Gedas Bertasius
Published in:
CoRR (2024)
Keyphrases
audio-visual
multi-modal
multi-stream
visual data
computer vision
emotion recognition
visual information
e-learning
video summarization
temporal context
image processing
person authentication
audio-visual speech recognition
multimedia
low-dimensional
co-occurrence
feature vectors
object recognition
feature extraction