Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation.
Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C. H. Hoi
Published in: Comput. Speech Lang. (2020)
Keyphrases
- end to end
- visual scene
- visual attention
- visual information
- audio visual
- multimedia
- vision system
- congestion control
- eye movements
- object recognition
- admission control
- natural scenes
- multi modal
- saliency map
- natural images
- eye tracking
- visual features
- complex scenes
- higher level
- real world
- low level
- multiscale
- video streaming
- scalable video
- text localization and recognition