Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching.
Nicola Messina, Davide Alessandro Coccomini, Andrea Esuli, Fabrizio Falchi
Published in: CoRR (2022)
Keyphrases
- multi modal
- image content
- image matching
- input image
- uni modal
- multiscale
- fusing multiple
- image data
- feature points
- image classification
- image features
- high resolution
- audio visual
- image analysis
- auto annotation
- cross modal
- single modality
- image segmentation
- video search
- image representation
- low level
- image retrieval
- multi modality
- semantic information
- contrast enhancement
- semantic concepts
- image set
- computer vision
- image processing
- affine invariant
- image regions
- edge detection