VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions.
Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao
Published in: CoRR (2023)
Keyphrases
- video sequences
- dynamic scenes
- human actions
- video images
- video scene
- natural language
- semantic video retrieval
- scene change detection
- visual data
- input video
- video event
- video data
- scene categories
- moving camera
- semantic concepts
- scene classification
- 3D scene
- semantic context
- surveillance videos
- live video
- video frames
- cognitive systems
- video content
- image set
- semantic similarity
- scene analysis
- video annotation
- motion features
- image mosaics
- weakly labeled
- video analysis
- single image
- three dimensional
- space time
- action recognition
- video retrieval
- high level
- image sequences
- image frames
- human activities
- outdoor images
- motion estimation
- input image
- dynamic textures
- topic models
- moving objects
- low level
- single frame
- image classification
- video database
- semantic information
- sports video
- video streams
- object motion
- multimedia data
- object detectors
- computer vision