MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation.
Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li
Published in: CoRR (2024)
Keyphrases
- multi modal
- multiple modalities
- image features
- input image
- audio content
- image data
- image content
- uni modal
- image retrieval
- image classification
- multi modality
- video search
- audio visual
- low level
- image representation
- text mining
- single modality
- multimedia content
- image collections
- information retrieval
- similarity measure