Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation.
Chang Che, Qunwei Lin, Xinyu Zhao, Jiaxin Huang, Liqiang Yu
Published in: CoRR (2024)
Keyphrases
- multiscale
- single image
- image analysis
- image data
- input image
- image features
- image segmentation
- image retrieval
- template matching
- image classification
- image regions
- Hough transform
- image representation
- multimodal
- low level
- high level
- information retrieval
- image matching
- spatial information
- image collections
- scanned documents
- image scrambling
- segmentation algorithm
- edge detection
- text mining
- multimedia
- test images
- relevance feedback
- key frames
- multiresolution
- image pixels
- visual data
- textual information
- similarity measure