CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing.
Xianghu Yue
Xiaohai Tian
Malu Zhang
Zhizheng Wu
Haizhou Li
Published in:
CoRR (2024)
Keyphrases
audio visual
multi modal
high level
three dimensional
image sequences
keywords
feature vectors
temporal context