Web2Text: Deep Structured Boilerplate Removal.
Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff
Published in: ECIR (2018)
Keyphrases
- textual data
- web documents
- unstructured text
- website
- text information
- information retrieval and extraction
- web images
- text mining
- web applications
- information retrieval
- multi lingual
- structured data
- textual features
- unstructured data
- web pages
- web mining
- web users
- web data
- text content
- keywords
- newspaper articles
- web content
- free text
- end users
- information sources
- plain text
- content features
- database
- digital documents
- natural language
- search engine
- web communities
- linked data
- text documents
- semantic information
- information extraction
- semantic web