MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale.
Arnab Choudhury, Yang Wang, Tuomas Pelkonen, Kutta Srinivasan, Abha Jain, Shenghao Lin, Delia David, Siavash Soleimanifard, Michael Chen, Abhishek Yadav, Ritesh Tijoriwala, Denis Samoylov, Chunqiang Tang
Published in: OSDI (2024)
Keyphrases
- meeting scheduling
- distributed systems
- globally distributed
- scheduling problem
- dynamic scheduling
- global knowledge
- maximum likelihood
- training process
- scheduling algorithm
- computational grids
- multi agent
- resource allocation
- distributed environment
- training set
- remote sites
- computing environments
- distributed database systems
- federated databases
- training phase
- training samples
- fully distributed
- supervised learning
- online learning
- resource constraints
- cooperative
- real time database systems
- global information
- lightweight
- genetic algorithm
- computer networks
- test set
- training examples