URL normalization for de-duplication of web pages.
Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, Amit Sasturkar
Published in: CIKM (2009)
Keyphrases
- web pages
- website
- web search
- search engine
- web documents
- web server
- keywords
- link analysis
- web content
- textual content
- preprocessing
- web search engines
- web data
- web crawler
- web page classification
- web browser
- web users
- hierarchical structure
- web resources
- information retrieval systems
- information overload
- web graph
- topic specific
- social bookmarking
- information extraction
- dynamically generated
- web content mining