Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators.
Harsh Lunia
Published in: CoRR (2024)
Keyphrases
- action recognition
- human actions
- video dataset
- action classification
- recognition of human actions
- recognizing human actions
- mid level
- video database
- ucf sports
- motion features
- human activities
- bag of words
- recognizing actions
- view invariant
- static images
- low level descriptors
- activity recognition
- spatio temporal interest points
- space time interest points
- action detection
- action recognition in videos
- computer vision
- human detection
- visual features
- spatial temporal
- body parts
- mid level features
- bag of features
- visual information
- video sequences
- human object interactions
- spatio temporal
- motion history images
- human pose
- action primitives
- depth sensors
- low level
- human activity recognition
- image retrieval
- visual cues
- video shots
- high level
- max margin