Using Word Clusters to Detect Similar Web Documents.
Jonathan Koberstein, Yiu-Kai Ng
Published in: KSEM (2006)
Keyphrases
- web documents
- keywords
- n-gram
- web search engines
- semi-structured
- information extraction
- web pages
- related documents
- clustering algorithm
- document classification
- co-occurrence
- HTML documents
- focused crawling
- website
- web content
- document representation
- databases
- returned by a search engine
- machine learning
- web data
- text mining
- relational databases
- database systems
- information retrieval
- related web pages