Web2Text: Deep Structured Boilerplate Removal.
Thijs Vogels, Octavian-Eugen Ganea, Carsten Eickhoff
Published in: CoRR (2018)
Keyphrases
- web documents
- text information
- textual data
- information retrieval and extraction
- unstructured text
- textual case-based reasoning
- website
- web applications
- linked data
- web pages
- digital documents
- multilingual
- semantic web
- database
- web resources
- web images
- web content
- text mining
- web users
- free text
- unstructured data
- structured data
- information retrieval
- textual features
- information extraction
- plain text
- text content
- semi-structured
- user-generated content
- text data
- web mining
- web scale
- newspaper articles
- link analysis
- web services