On Tuning the Sorted Neighborhood Method for Record Comparisons in a Data Deduplication Pipeline - Industrial Experience Report.
Pawel Boinski, Witold Andrzejewski, Bartosz Bebel, Robert Wrembel
Published in: DEXA (1) (2023)
Keyphrases
- synthetic data
- input data
- database
- noisy data
- data collection
- data sets
- detection method
- data processing
- data sources
- prior knowledge
- high quality
- significant improvement
- correlation analysis
- prior information
- raw data
- segmentation method
- missing data
- preprocessing
- statistical methods
- fine tuning
- data cleaning
- cost function
- information loss
- similarity measure
- original data
- probabilistic model
- missing values
- high accuracy
- data structure
- training samples
- computational complexity
- data analysis
- classification accuracy
- data points
- pairwise
- clustering method
- user input