Identifying "Soft 404" Error Pages: Analyzing the Lexical Signatures of Documents in Distributed Collections.
Luis Meneses, Richard Furuta, Frank Shipman
Published in: TPDL (2012)
Keyphrases
- heterogeneous collections
- document collections
- keywords
- distributed information retrieval
- information retrieval
- web documents
- query based sampling
- page layout
- data collections
- textual content
- search engine
- distributed systems
- metadata
- digital libraries
- information retrieval systems
- website
- relevant documents
- linguistic information
- text collections
- linguistic analysis
- web pages
- similar documents
- xml documents
- document representation
- web information
- resource selection
- document analysis
- semantic relations
- natural language text
- document clustering
- retrieval systems
- html pages
- manually constructed
- word frequency
- wordnet
- test collection
- text documents
- digital collections
- document retrieval
- semi structured
- text corpus