Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs.
Arpan Jain, Nawras Alnaasan, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Published in: HOTI (2021)
Keyphrases
- training process
- distributed systems
- fault tolerance
- clustering algorithm
- high performance computing
- cooperative
- fault tolerant
- distributed environment
- training algorithm
- training examples
- data clustering
- neural network
- distributed data
- communication cost
- mobile agents
- training samples
- training set
- data distribution
- data points
- hierarchical clustering
- online learning
- fuzzy clustering
- input data
- multi agent
- peer to peer
- subspace clustering
- training phase
- general purpose