MAVT-FG: Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition.
Xiaoyu Zhou
Xiaotong Song
Hao Wu
Jingran Zhang
Xing Xu
Published in: ACM Multimedia (2022)
Keyphrases
fine grained
audio visual
weakly supervised
multi modal
object class
object recognition
multi stream
visual information
superpixels
topic models
visual data
access control
semi supervised
multimedia
named entities
feature extraction
feature selection
partial occlusion
information extraction
high level