TMT: A Transformer-Based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-Aware Dialog.
Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li
Published in: INTERSPEECH (2020)
Keyphrases
- visual scene
- visual information
- audio visual
- multimedia
- multimodal fusion
- visual attention
- multi modal
- vision system
- spatial relations
- image processing
- natural scenes
- image collections
- visual features
- natural images
- object recognition
- eye movements
- image annotation
- higher level
- user interface
- complex scenes
- search engine