TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training.
Yulong Liu, Guibo Zhu, Bin Zhu, Qi Song, Guojing Ge, Haoran Chen, Guanhui Qiao, Ru Peng, Lingxiang Wu, Jinqiao Wang
Published in: NeurIPS (2022)
Keyphrases
- high quality
- training dataset
- computer vision
- english text
- chinese web
- programming language
- benchmark datasets
- real time
- small scale
- ground truth
- language learning
- vision system
- real world
- chinese language
- million images
- training phase
- training samples
- natural language
- training process
- language processing
- training set
- image quality
- high level
- native speakers
- word segmentation
- higher quality
- low quality
- text summarization
- real life
- active learning
- classifier training
- training data
- supervised learning
- information retrieval
- ground truth labels