Parallel Cleaning Algorithm for Similar Duplicate Chinese Data Based on BERT.
Biqiu Li, Jiabin Wang, Xueli Liu
Published in: Sci. Program. (2021)
Keyphrases
- input data
- data sets
- data reduction
- data collection
- dynamic programming
- noisy data
- synthetic data
- data cleaning
- k-means
- NP-hard
- computational cost
- prior information
- missing data
- detection algorithm
- data processing
- image data
- cost function
- computational complexity
- data sources
- probabilistic model
- parallel implementation
- objective function
- clustering algorithm
- spatial data
- synthetic datasets
- data analysis
- segmentation algorithm
- learning algorithm
- original data
- data quality
- database
- single scan
- spectral clustering
- data distribution
- data mining techniques
- knowledge discovery
- evolutionary algorithm
- lower bound
- training data
- similarity measure