R2D2: Reducing Redundancy and Duplication in Data Lakes.
Raunak Shah, Koyel Mukherjee, Atharv Tyagi, Sai Keerthana Karnam, Dhruv Joshi, Shivam Pravin Bhosale, Subrata Mitra
Published in: Proc. ACM Manag. Data (2023)
Keyphrases
- data sets
- data processing
- database
- input data
- raw data
- data points
- data structure
- data sources
- high quality
- missing data
- bit rate
- data collection
- statistical analysis
- complex data
- original data
- application domains
- spatial data
- website
- training data
- image data
- synthetic data
- end users
- xml documents
- data distribution
- network structure
- information retrieval
- high dimensional
- data quality
- data streams