End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features.
Chiori Hori, Huda AlAmri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh
Published in: CoRR (2018)
Keyphrases
- audio visual
- end to end
- visual data
- multimodal fusion
- audio features
- person authentication
- video scene
- video summarization
- multi modal
- multimedia
- video sequences
- visual information
- multi stream
- text localization and recognition
- scalable video
- feature extraction
- congestion control
- audio visual content
- video data
- low level
- feature vectors
- video streams
- contextual information
- image sequences
- three dimensional
- high robustness
- moving objects
- video content
- visual features
- high dimensional
- frame rate
- human actions
- computer vision
- audio visual speech recognition
- space time