dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training.
Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, Chuanxiong Guo
Published in: MLSys (2022)
Keyphrases
- training process
- optimization problems
- optimization algorithm
- cooperative
- fault tolerant
- domain specific
- distributed systems
- distributed environment
- constrained optimization
- optimization process
- training set
- optimization method
- training algorithm
- peer to peer
- training examples
- stochastic gradient descent
- fault diagnosis
- supervised learning
- communication cost
- optimization model
- multi agent
- training data
- high level
- automatic diagnosis
- multiple faults
- model based reasoning
- decision trees
- distributed data mining
- cmac neural network
- medical diagnosis
- global optimization
- computer aided
- test set
- training samples
- lightweight
- online learning