Incorporating Scene Graphs into Pre-trained Vision-Language Models for Multimodal Open-vocabulary Action Recognition.
Chao Wei, Zhidong Deng
Published in: ICRA (2024)
Keyphrases
- action recognition
- language model
- pre-trained
- computer vision
- spoken term detection
- human actions
- n-gram
- probabilistic model
- information retrieval
- bag of words
- training data
- 3D scene
- speech recognition
- context sensitive
- atomic actions
- body parts
- video sequences
- three-dimensional
- training examples
- control signals
- audio-visual
- image sequences
- input image
- visual data
- multi-modal
- single image
- face recognition
- real scenes
- object detection
- visual words
- moving objects
- principal component analysis
- pose estimation