Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training.
Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
Published in: CoRR (2024)
Keyphrases
- low overhead
- distributed systems
- distributed database systems
- distributed databases
- fault tolerance
- lightweight
- main memory databases
- distributed environment
- failure recovery
- fault tolerant
- high scalability
- real world
- training process
- highly efficient
- distributed computing
- high reliability
- multi agent
- concurrency control
- load balancing
- cooperative
- log records
- computationally expensive
- response time
- supervised learning