Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization.
Jun-Tae Lee, Mihir Jain, Hyoungwoo Park, Sungrack Yun
Published in: ICLR (2021)
Keyphrases
- audio visual
- weakly supervised
- multi modal
- superpixels
- visual information
- topic models
- relation extraction
- multimedia
- visual data
- object class
- semi supervised
- named entities
- multiscale
- domain knowledge
- high dimensional data
- image processing
- hidden markov models
- visual attention
- human actions
- feature extraction
- high level