Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations.
Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima
Published in: CoRR (2022)
Keyphrases
- audio visual
- visual data
- video scene
- video summarization
- video sequences
- visual information
- video data
- multimedia
- multi modal
- meeting room
- audio visual content
- multi stream
- audio features
- visual features
- image data
- multimodal fusion
- audio visual speech recognition
- temporal context
- video frames
- multimedia data
- image sequences
- high dimensional data
- computer vision
- contextual information
- visual content
- video streams
- video retrieval
- image collections
- domain knowledge
- data analysis
- feature selection