Using the words/leafs ratio in the DOM tree for content extraction.
David Insa, Josep Silva, Salvador Tamarit
Published in: J. Log. Algebraic Methods Program. (2013)
Keyphrases
- content extraction
- html documents
- dom tree
- web documents
- digital archives
- web pages
- n-gram
- automatic extraction
- multimedia information retrieval
- keywords
- text content
- semi-structured
- web content
- structured documents
- xml documents
- information retrieval
- semantic information
- text documents
- semistructured data
- web mining
- metadata