A Web Page De-duplication Algorithm Based on Data Clearing.
Jian-ming Lin, Dong-sheng Liu, Shi-wen Gao, Wei Chen
Published in: JCAI (2009)
Keyphrases
- learning algorithm
- noisy data
- data sets
- input data
- data collection
- data sources
- optimization algorithm
- data points
- data records
- dimensional data
- data reduction
- web data
- prior information
- matching algorithm
- computational cost
- NP-hard
- cost function
- dynamic programming
- detection algorithm
- search engine
- similarity measure
- objective function
- network structure
- data distribution
- computational complexity
- synthetic data
- data analysis
- expectation maximization
- k-means
- image data
- worst case
- preprocessing
- segmentation algorithm
- high dimensional data
- search space
- simulated annealing
- clustering method
- knowledge discovery
- spectral clustering
- training data
- information loss
- website
- probabilistic model
- database