Pre-training LLMs using human-like development data corpus.
Khushi Bhardwaj, Raj Sanjay Shah, Sashank Varma
Published in: CoRR (2023)
Keyphrases
- data sets
- database
- high quality
- statistical analysis
- image data
- synthetic data
- prior knowledge
- data sources
- training samples
- data collection
- computer systems
- data quality
- spatial data
- sensor data
- data processing
- small number
- knowledge discovery
- information retrieval
- databases
- data mining techniques
- data points
- end users
- training examples
- high dimensional data
- data analysis
- missing data
- search engine
- test data
- raw data