LSH methods for data deduplication in a Wikipedia artificial dataset.
Juan Ciro, Daniel Galvez, Tim Schlippe, David Kanter
Published in: CoRR (2021)
Keyphrases
- data sets
- benchmark datasets
- statistical methods
- missing values
- data sources
- data mining techniques
- database
- data processing
- missing data
- data quality
- input data
- massive datasets
- knowledge discovery
- data analysis
- spatial data
- data structure
- computationally expensive
- training data
- high dimensional datasets
- data objects
- search engine
- data mining methods
- human experts
- data cleaning
- synthetic datasets
- search methods
- nearest neighbor
- knn