WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data.
Maurice Weber, Carlo Siebenschuh, Rory Butler, Anton Alexandrov, Valdemar Thanner, Georgios Tsolakis, Haris Jabbar, Ian T. Foster, Bo Li, Rick Stevens, Ce Zhang
Published in: CoRR (2023)
Keyphrases
- data sets
- web data
- database
- web pages
- metadata
- meta information
- structured information
- knowledge discovery
- user interests
- multilingual documents
- textual data
- web documents
- web search
- data points
- information retrieval
- information sources
- web applications
- information retrieval systems
- log files
- user generated content
- data sources