Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers.
Shijian Li, Robert J. Walls, Tian Guo
Published in: ICDCS (2020)
Keyphrases
- data center
- scalable distributed
- distributed systems
- cloud computing
- real time
- wide area network
- computing platform
- distributed environment
- training set
- parallel processing
- multi agent
- fault tolerant
- test set
- data transfer
- map reduce
- central server
- cooperative
- steady state
- communication cost
- training samples
- parallel computation
- general purpose
- computing infrastructure