Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog.
Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Jie Zhou
Published in: IEEE ACM Trans. Audio Speech Lang. Process. (2021)
Keyphrases
- visual scene
- story segmentation
- multimedia
- news video
- audio content
- visual information
- audio visual
- multiple modalities
- multimodal fusion
- visual data
- video sequences
- broadcast news
- video search
- multi modal
- audio features
- video data
- complex scenes
- vision system
- object recognition
- visual attention
- video frames
- video content
- information retrieval
- semantic information
- natural images
- natural language
- metadata
- natural scenes
- text mining
- visual speech
- image sequences
- key frames
- closely related
- music information retrieval
- image quality
- real time