Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media.
Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cécile Paris
Published in: EMNLP (Findings) (2020)
Keyphrases
- cost effective
- data sets
- data sources
- social media
- big data
- data analysis
- data points
- data collection
- database
- data processing
- case study
- data structure
- training data
- databases
- knowledge discovery
- low cost
- synthetic data
- social networks
- social media platforms
- historical data
- data quality
- original data
- high dimensional data
- statistical analysis
- probability distribution
- high quality
- prior knowledge