WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data.
Maurice Weber, Carlo Siebenschuh, Rory Butler, Anton Alexandrov, Valdemar Thanner, Georgios Tsolakis, Haris Jabbar, Ian T. Foster, Bo Li, Rick Stevens, Ce Zhang
Published in: NeurIPS (2023)
Keyphrases
- web data
- data analysis
- data sets
- information retrieval
- database
- web pages
- data points
- multilingual documents
- data extraction
- web mining
- information retrieval systems
- end users
- data sources
- knowledge discovery
- xml documents
- document collections
- keywords
- semantic annotation
- linked data
- website
- metadata
- web information
- search engine