SlimPajama-DC: Understanding Data Combinations for LLM Training.

Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric P. Xing
Published in: CoRR (2023)
Keyphrases
  • data analysis
  • data collection
  • data sets
  • data sources
  • statistical analysis
  • synthetic data
  • high quality
  • data quality
  • knowledge discovery
  • image data
  • small number
  • raw data
  • neural network
  • xml documents
  • data processing