Primary Data Deduplication - Large Scale Study and System Design.
Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Oltean, Jin Li, Sudipta Sengupta
Published in: USENIX Annual Technical Conference (2012)
Keyphrases
- data sets
- statistical analysis
- data collection
- training data
- data quality
- raw data
- data mining techniques
- empirical data
- empirical studies
- data sources
- xml documents
- data analysis
- databases
- image data
- database
- input data
- high quality
- high dimensional data
- massive scale
- user interface
- data integrity
- real world
- original data
- search engine
- statistical methods
- data distribution
- synthetic data
- clustering algorithm
- data structure
- data processing
- probability distribution
- end users