MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions.
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem
Published in: CoRR (2021)
Keyphrases
- natural language descriptions
- human actions
- human language
- video dataset
- video sequences
- programming language
- high level
- visual data
- natural language
- multimedia
- language learning
- video material
- video content analysis
- digital video
- trecvid multimedia event detection
- multimedia event detection
- video analysis
- video frames
- text to speech
- spatio temporal
- low level
- benchmark datasets
- action recognition
- video surveillance
- photo collections
- event detection
- audio visual
- video annotation
- video indexing
- video search
- video recordings
- visual information
- user generated
- signal processing
- video indexing and retrieval
- weakly labeled
- tv series
- video content