From a Smoking Gun to Spent Fuel: Principled Subsampling Methods for Building Big Language Data Corpora from Monitor Corpora.
Jacqueline Hettel Tidwell
Published in: Data (2019)
Keyphrases
- text corpora
- data sets
- data structure
- raw data
- missing values
- database
- data mining methods
- statistical methods
- natural language processing
- noisy data
- high dimensional data
- statistical analysis
- data collections
- data collection
- data processing
- image data
- knowledge discovery
- training data
- data mining techniques
- real time
- multiple sources
- statistical tests
- data quality
- high quality
- original data
- benchmark datasets
- monitoring system
- machine learning methods
- data sources
- significant improvement
- language learning
- data analysis
- missing data
- synthetic data
- big data
- end users
- data points
- input data
- database systems
- data warehouse