Leveraging per Image-Token Consistency for Vision-Language Pre-training.
Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang
Published in: CoRR (2022)
Keyphrases
- image data
- image segmentation
- image features
- input image
- multiscale
- image analysis
- image retrieval
- template matching
- low level image processing
- single image
- image representation
- image classification
- image pixels
- image content
- image structure
- visual perception
- image synthesis
- language learning
- region of interest
- consistency constraints
- vision system
- Hough transform
- low level
- image processing
- image regions
- real time
- edge detection
- image collections
- high resolution
- pixel values
- training set
- feature space
- natural language
- similarity measure