A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions.
Jack Hessel, Bo Pang, Zhenhai Zhu, Radu Soricut. Published in: CoNLL (2019)
Keyphrases
- visual features
- key frames
- semantic concepts
- video shots
- content based video retrieval
- multimedia
- human actions
- visual data
- visual information
- visual content
- image classification
- low level features
- motion features
- image retrieval
- image annotation
- video content
- keywords
- video database
- image search
- semantic features
- video sequences
- video data
- video clips
- video streams
- audio features
- visual appearance
- global features
- video frames
- speech recognition
- low level
- bag of features
- image collections
- semantic gap
- video retrieval
- web images
- low level visual features
- automatic speech recognition
- noisy environments
- hidden markov models
- multimedia data