TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages.
Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro
Published in: CoRR (2024)
Keyphrases
- image data
- input image
- image features
- image content
- image analysis
- english text
- single image
- multilingual
- high resolution
- image retrieval
- image classification
- image segmentation
- edge detection
- multiscale
- language identification
- spoken language
- web images
- query translation
- language resources
- processing pipeline
- image collections
- speech recognition
- text retrieval
- image regions
- segmentation method
- image representation
- information retrieval