Finding near-duplicate web pages: a large-scale evaluation of algorithms.
Monika Rauch Henzinger
Published in: SIGIR (2006)
Keyphrases
- web pages
- data structure
- learning algorithm
- computational cost
- website
- small scale
- machine learning algorithms
- worst case
- computational complexity
- web spam detection
- evaluation methods
- evaluation metrics
- web documents
- significant improvement
- neural network
- objective function
- feature selection
- genetic algorithm
- data mining