STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents.

Nan Zhang, Shomir Wilson, Prasenjit Mitra

Published in: LREC (2022)

Keyphrases

web documents
keywords
web pages
semi-structured
information extraction
textual information
web search engines
web content
web data
document classification
vector space model
structured documents
link structure
search engine
data mining
unstructured text
unstructured documents
html documents
structural features
machine learning