Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-Training and Multi-Modal Tokens.
Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
Published in: ICASSP (2024)
Keyphrases
- multi modal
- audio visual
- image data
- uni modal
- input image
- image segmentation
- multi modality
- image representation
- low level
- image features
- auto annotation
- fusing multiple
- single modality
- image retrieval
- similarity measure
- image classification
- image analysis
- image content
- cross modal
- speech recognition
- multiple modalities
- high resolution
- image collections
- computer vision
- high level
- multiscale
- high dimensional
- visual data
- image regions
- image annotation
- x ray
- image processing
- edge detection