InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation.
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
Published in: CoRR (2023)
Keyphrases
- trecvid multimedia event detection
- text generation
- event detection
- video search
- video collections
- multiple modalities
- multimedia
- multi modal
- event recognition
- web videos
- video data
- video dataset
- video sequences
- natural language descriptions
- text detection
- weakly labeled
- video streams
- natural language generation
- real time
- database
- video frames
- video content
- benchmark datasets
- multimedia search
- real world
- video clips
- video retrieval
- multimodal information
- video segments
- text retrieval
- news video
- audio content
- multimedia documents
- information retrieval
- text mining
- text information
- feature set
- million images
- space time
- multimodal interaction
- temporal information
- human actions
- key frames
- digital video
- video database
- video analysis
- image search
- video surveillance