MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions.
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem
Published in: CVPR (2022)
Keyphrases
- natural language descriptions
- natural language
- human actions
- video sequences
- programming language
- signal processing
- visual data
- language learning
- video frames
- video database
- audio visual
- video content
- video dataset
- action recognition
- audio signals
- human language
- word meanings
- video content analysis
- conceptual models
- text to speech
- video indexing and retrieval
- video analysis
- video clips
- visual information
- video data
- multimedia
- tv series
- audio stream
- video signals
- textual descriptions
- digital video
- video surveillance
- benchmark datasets
- high level