MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers.
Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar
Published in: CoRR (2023)
Keyphrases
- multimedia
- visual data
- video files
- image data
- image segmentation
- image retrieval
- image content
- image features
- input image
- image regions
- multiscale
- single image
- image representation
- image classification
- edge detection
- high resolution
- video sequences
- multi modal
- video data
- audio video
- scene change detection
- image collections
- audio visual
- image frames
- audio files
- image quality
- low level
- human actions
- visual cues