CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data.
Michal Turski, Tomasz Stanislawek, Karol Kaczmarek, Pawel Dyda, Filip Gralinski
Published in: CoRR (2023)
Keyphrases
- high quality
- web data
- database
- data sets
- information retrieval
- xml documents
- textual data
- web pages
- newspaper articles
- data analysis
- end users
- deep web
- low quality
- text data
- structured information
- web sources
- web crawlers
- web content
- web mining
- text documents
- information sources
- information retrieval systems
- knowledge discovery
- data points
- data sources
- training data
- website