Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens.
Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
Published in: CoRR (2023)
Keyphrases
- multi modal
- audio visual
- uni modal
- image data
- image features
- input image
- auto annotation
- fusing multiple
- image content
- computer vision
- high dimensional
- single modality
- multiscale
- high resolution
- image segmentation
- semantic concepts
- image classification
- image retrieval
- image regions
- speech recognition
- multiple modalities
- edge detection
- image analysis
- similarity measure
- segmentation method
- mutual information
- web images
- visual concepts
- video search
- cross modal
- smart room