A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants.
Soichiro Okazaki, Quan Kong, Tomoaki Yoshinaga
Published in: DCASE (2021)
Keyphrases
- audio-visual
- scene classification
- multi-modal fusion
- object recognition
- multi-modal
- image classification
- natural scenes
- biologically inspired
- visual words
- image representation
- visual information
- visual data
- bag of features
- facial features
- multimedia
- data sets
- low level features
- high dimensional
- training data
- image processing