MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model With Multi-Modal Transformer.
Dandan Zhu, Kun Zhu, Weiping Ding, Nana Zhang, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang
Published in: IEEE Trans. Emerg. Top. Comput. Intell. (2024)
Keyphrases
- multi modal
- prediction model
- visual saliency
- audio visual
- natural images
- saliency map
- superpixels
- visual attention
- object class
- multimedia
- eye movements
- human visual system
- Bayesian framework
- high dimensional
- visual information
- region of interest
- visual data
- computer vision
- higher level
- object detectors
- object recognition
- input image
- prior knowledge
- higher order
- image annotation
- low level features
- Markov random field
- eye tracking