Pretraining Data and Tokenizer for Indic LLM.
Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar
Published in: CoRR (2024)
Keyphrases
- synthetic data
- database
- missing data
- raw data
- complex data
- data structure
- data sources
- information retrieval
- missing values
- data processing
- image data
- statistical methods
- data analysis
- high quality
- knowledge base
- data sets
- data quality
- multimedia data
- big data
- historical data
- statistical analysis
- data collection
- input data
- knowledge discovery
- data points
- video sequences
- website
- information systems
- real time