A structural, content-similarity measure for detecting spam documents on the web.
Maria Soledad Pera, Yiu-Kai Ng
Published in: Int. J. Web Inf. Syst. (2009)
Keyphrases
- web documents
- similarity measure
- web information
- relevant content
- web content
- content similarity
- web data
- textual content
- user generated content
- text content
- web crawler
- multimedia documents
- web pages
- page content
- text information
- spam detection
- content and structure
- social media content
- electronic documents
- meta information
- topic specific
- web queries
- multilingual documents
- website
- html pages
- metadata
- web resources
- web spam
- document representation
- user interests
- document content
- textual features
- web images
- cosine similarity
- desired information
- search interface
- information retrieval
- web mining
- keywords
- related documents
- email messages
- hyperlink structure
- adversarial information retrieval
- information retrieval systems
- search engine
- social media
- information extraction
- tag clouds
- textual information
- semantic information
- digital objects
- spam filtering
- current search engines
- web spam detection
- multimedia
- social bookmarking systems
- text documents
- web search engines
- helping users
- page layout
- digital libraries
- social bookmarking