CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data.
Michal Turski
Tomasz Stanislawek
Karol Kaczmarek
Pawel Dyda
Filip Gralinski
Published in:
ICDAR (3) (2023)
Keyphrases
high quality
web data
data sets
database
xml documents
structured information
web documents
end users
data analysis
web mining
website
data points
web search
low quality
knowledge discovery
information sources
training data
log data
deep web
text corpora
information retrieval