Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog.
Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Cheng Niu, Jie Zhou
Published in: CoRR (2020)
Keyphrases
- video sequences
- multimedia
- scene change detection
- video data
- visual data
- video scene
- video streams
- video content
- digital video
- video frames
- video database
- audio video
- video analysis
- natural language descriptions
- news video
- video images
- story segmentation
- video content analysis
- multimedia processing
- video files
- video segments
- dynamic textures
- key frames
- image sequences
- audio visual
- moving camera
- video search
- closed captions
- audio content
- multimedia data
- surveillance videos
- video retrieval
- dynamic scenes
- content based video retrieval
- event detection
- stationary camera
- image mosaics
- live video
- visual information
- audio signals
- audio stream
- multi modal
- video shots