From Image to Video, what do we need in multimodal LLMs?
Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, Zengchang Qin
Published in: CoRR (2024)
Keyphrases
- image data
- single image
- multiscale
- image retrieval
- image content
- input image
- image analysis
- image features
- multimedia
- template matching
- image frames
- image representation
- image classification
- image collections
- video data
- segmentation algorithm
- pre-trained
- image segmentation
- multimodal
- image pixels
- high resolution
- video retrieval
- key frames
- image regions
- video images
- static images
- temporal continuity
- region of interest
- image matching
- segmentation method
- video sequences
- video content
- image set
- test images
- visual data
- image processing
- object motion
- edge detection
- low-level
- weakly labeled
- video files