A hybrid approach for content extraction with text density and visual importance of DOM nodes.
Dandan Song, Fei Sun, Lejian Liao
Published in: Knowl. Inf. Syst. (2015)
Keyphrases
- content extraction
- text content
- html documents
- web news
- web pages
- web documents
- digital archives
- multimedia information retrieval
- semi-structured
- low level
- visual features
- xml documents
- text retrieval
- automatic extraction
- digital libraries
- text corpus
- text documents
- text analysis
- machine learning
- keywords
- database systems
- website
- social networks