Cleaner Pretraining Corpus Curation with Neural Web Scraping.
Zhipeng Xu, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Chenyan Xiong, Ge Yu
Published in: CoRR (2024)
Keyphrases
- web applications
- website
- web pages
- network architecture
- web documents
- web data
- neural network
- information sources
- web technologies
- textual features
- neural model
- web scale
- web intelligence
- web content
- web resources
- web users
- test set
- information extraction
- manually annotated
- multiword
- web communities
- web information retrieval
- database