DocumentNet: Bridging the Data Gap in Document Pre-training.
Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander G. Hauptmann, Hanjun Dai, Wei Wei
Published in: EMNLP (Industry Track) (2023)
Keyphrases
- data sets
- database
- statistical analysis
- data distribution
- data analysis
- synthetic data
- training samples
- data collection
- XML format
- original data
- raw data
- training examples
- data mining techniques
- image data
- XML documents
- neural network
- co-occurrence
- data sources
- prior knowledge
- high quality
- database systems
- information retrieval