A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters.
Jose González-Abad, Álvaro López García, Valentin Y. Kozlov
Published in: CoRR (2022)
Keyphrases
- deep architectures
- learning algorithm
- distributed systems
- deep learning
- supervised learning
- training examples
- multilayer neural networks
- training samples
- cooperative
- unsupervised learning
- clustering algorithm
- loosely coupled
- workflow management systems
- multi agent
- mobile agents
- distributed environment
- fault tolerance
- training algorithm
- active learning
- training data
- training process
- training set
- workflow execution
- fuzzy clustering
- distributed computing
- fault tolerant
- high performance computing
- scientific computing
- restricted Boltzmann machine
- training and test data
- learning tasks
- back propagation
- learning rate
- hierarchical clustering
- learning problems
- batch mode
- computing infrastructure
- neural network