InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation.
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
Published in: ICLR (2024)
Keyphrases
- trecvid multimedia event detection
- text generation
- multiple modalities
- video collections
- multimedia
- video dataset
- video data
- multi modal
- text detection
- human actions
- natural language descriptions
- story segmentation
- news video
- video segments
- video search
- video retrieval
- video content
- real time
- event recognition
- event detection
- video sequences
- database
- video streams
- video frames
- video clips
- text retrieval
- weakly labeled
- information retrieval
- multimedia documents
- text mining
- video database
- action recognition
- space time
- text information
- multimodal interaction
- human activities
- video surveillance
- multimodal information
- audio content
- real world
- multimedia search
- video analysis
- keywords
- natural language
- digital video
- information extraction
- broadcast news
- benchmark datasets
- natural language generation
- text documents