Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding.
Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan
Published in: CoRR (2023)
Keyphrases
- language generation
- english text
- computational linguistics
- text to speech synthesis
- video based face recognition
- natural language descriptions
- video data
- video search
- video content
- text to speech
- programming language
- video streams
- action classification
- language learning
- discriminative training
- video frames
- video database
- multimedia documents
- english language
- text detection
- video sequences
- information retrieval
- supervised learning
- word meanings
- training set
- weakly labeled
- news video
- text generation
- native language
- discriminative classifiers
- multimedia
- video segments
- text retrieval
- machine translation system
- video clips
- key frames
- syntactic categories
- feature extraction
- key concepts
- appearance features
- event detection
- training samples
- image classification
- semi supervised
- information extraction
- natural language
- keywords