Clustering web documents with tables for information extraction.
Kostyantyn M. Shchekotykhin, Dietmar Jannach, Gerhard Friedrich
Published in: K-CAP (2007)
Keyphrases
- web documents
- information extraction
- html documents
- semi-structured
- content similarity
- k-means
- web pages
- databases
- document classification
- clustering algorithm
- keywords
- web content
- database
- clustering method
- structured data
- web search engines
- text mining
- natural language processing
- unstructured text
- relation extraction
- vector space model
- unstructured documents
- information retrieval
- web logs
- textual information
- link structure
- text documents
- named entities
- document representation
- search engine
- data points
- focused crawling
- machine learning
- wrapper generation
- web search