An Improved Algorithm of STC for the Deletion of Duplicated Web pages Based on Repeated Strings.
Huijiao Wang, Bo Yin, Jie Hou
Published in: WGEC (2009)
Keyphrases
- web pages
- search engine
- website
- keywords
- web search
- suffix tree
- web page classification
- web search engines
- web content mining
- web documents
- suffix array
- dynamically generated
- web content
- web browser
- edit distance
- google search engine
- web logs
- dynamic content
- browsing experience
- web information extraction
- geographical locations
- hamming distance
- data extraction
- link structure
- web users
- web server
- hierarchical structure
- textual content
- approximate string matching
- database
- information retrieval
- finite alphabet
- databases