CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing.

Xianghu Yue, Xiaohai Tian, Malu Zhang, Zhizheng Wu, Haizhou Li
Published in: CoRR (2024)
Keyphrases
  • audio visual
  • multi modal
  • high level
  • three dimensional
  • image sequences
  • keywords
  • feature vectors
  • temporal context