An Attempt towards Interpretable Audio-Visual Video Captioning.
Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu
Published in: CoRR (2018)
Keyphrases
- audio visual
- video summarization
- visual data
- multimedia
- meeting room
- multi modal
- audio visual content
- audio features
- sports video
- visual information
- video data
- temporal context
- video sequences
- multimodal fusion
- multi stream
- video content
- audio visual speech recognition
- person authentication
- video frames
- video streams
- data sets
- image sequences
- key frames
- spatial and temporal
- high dimensional
- visual features
- context aware
- video retrieval
- human computer interaction