A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot.
Aanisha Bhattacharyya, Yaman Singla, Balaji Krishnamurthy, Rajiv Ratn Shah, Changyou Chen
Published in: EMNLP (2023)
Keyphrases
- video content
- video sequences
- video frames
- video data
- video database
- video analysis
- content based video retrieval
- video clips
- video indexing
- video dataset
- key frames
- online video
- event recognition
- video editing
- video images
- youtube videos
- temporal coherence
- video retrieval
- spatiotemporal features
- input video
- moving camera
- lecture videos
- video segments
- video streams
- video annotation
- human activities
- video event
- video event detection
- video search
- natural language descriptions
- video surveillance
- video shots
- successive frames
- video sharing
- foreground background segmentation
- semantic concept detection
- content based copy detection
- visual analysis
- temporal domain
- video material
- video browsing
- spatio temporal
- event detection
- human actions
- video stabilization
- semantic concepts
- instructional videos
- dynamic scenes
- surveillance videos
- action classification
- high definition
- video summarization
- video representation
- stereoscopic video
- motion features
- news video
- space time
- spatial and temporal
- sports video
- video objects
- dynamic textures
- stationary camera
- eye tracking data
- video copy detection
- multimedia data
- web videos
- visual features
- annotation tool
- space time interest points
- video scene