Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training.
Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
Published in: LREC/COLING (2024)
Keyphrases
- cross modal
- image data
- image features
- image classification
- image retrieval
- image segmentation
- visual data
- image content
- multiscale
- image representation
- image collections
- multi modal
- visual similarity
- perceptual information
- computer vision
- low level
- object recognition
- similarity measure
- feature vectors
- training set
- test images
- spatial relationships