LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models.
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Published in: CoRR (2024)
Keyphrases
- input image
- image features
- multiscale
- image data
- single image
- image classification
- static images
- random fields
- image analysis
- visual cues
- video images
- Bayesian framework
- segmentation method
- image segmentation
- image content
- image representation
- multimedia
- image retrieval
- feature points
- edge detection
- probabilistic model
- low-level
- test images
- image frames
- similarity measure
- image set
- visual data
- video sequences
- weakly labeled
- pre-trained
- face recognition
- object motion
- video surveillance
- image collections
- key frames
- region of interest
- high resolution
- spatial information
- image regions
- segmentation algorithm
- multi-modal