Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR.
Zhenyang Li, Yangyang Guo, Kejie Wang, Xiaolin Chen, Liqiang Nie, Mohan S. Kankanhalli
Published in: CoRR (2024)
Keyphrases
- visual processing
- visual perception
- human vision
- language learning
- visual information
- visual field
- computer vision
- programming language
- visual features
- natural language
- vision system
- visual query language
- image processing
- object oriented programming
- knowledge base
- specification language
- visual scene
- video data
- object recognition
- high level
- visual search
- modeling language
- target language
- visual representation
- information retrieval
- data sets
- commonsense knowledge
- real time