Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought.
Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
Published in: EMNLP (2023)
Keyphrases
- video frames
- key frames
- temporal coherence
- video sequences
- video data
- successive frames
- image frames
- single frame
- multimedia
- temporal continuity
- spatial and temporal
- video streams
- real time
- video content
- space time
- input video
- video retrieval
- video images
- video dataset
- frame rate
- video signals
- video objects
- real time video
- video database
- video analysis
- multiview video
- video clips
- human actions
- event detection
- high resolution
- adjacent frames
- neighboring frames
- event recognition
- reference frame
- dynamic scenes
- multimedia data
- optical flow