Finding near-duplicate web pages: a large-scale evaluation of algorithms.

Monika Rauch Henzinger

Published in: SIGIR (2006)

Keyphrases

web pages
data structure
learning algorithm
computational cost
website
small scale
machine learning algorithms
worst case
computational complexity
web spam detection
evaluation methods
evaluation metrics
web documents
significant improvement
neural network
objective function
feature selection
genetic algorithm
data mining