STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents.
Nan Zhang, Shomir Wilson, Prasenjit Mitra
Published in: LREC (2022)
Keyphrases
- web documents
- keywords
- web pages
- semi-structured
- information extraction
- textual information
- web search engines
- web content
- web data
- document classification
- vector space model
- structured documents
- link structure
- search engine
- data mining
- unstructured text
- unstructured documents
- html documents
- structural features
- machine learning