On Tuning the Sorted Neighborhood Method for Record Comparisons in a Data Deduplication Pipeline - Industrial Experience Report.
Pawel Boinski
Witold Andrzejewski
Bartosz Bebel
Robert Wrembel
Published in:
DEXA (1) (2023)
Keyphrases
synthetic data
input data
database
noisy data
data collection
data sets
detection method
data processing
data sources
prior knowledge
high quality
significant improvement
correlation analysis
prior information
raw data
segmentation method
missing data
preprocessing
statistical methods
fine tuning
data cleaning
cost function
information loss
similarity measure
original data
probabilistic model
missing values
high accuracy
data structure
training samples
computational complexity
data analysis
classification accuracy
data points
pairwise
clustering method
user input