SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures.
Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis
Published in: CoRR (2024)
Keyphrases
- distributed systems
- distributed environment
- training process
- online learning
- lightweight
- communication overhead
- training set
- supervised learning
- multi-agent
- training phase
- communication cost
- failure detection
- database
- loosely coupled
- distributed computing
- fault-tolerant
- computer networks
- back propagation
- data analysis
- information retrieval