Why (and When) does Local SGD Generalize Better than SGD?

Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora

Published in: ICLR (2023)

Keyphrases

stochastic gradient descent
least squares
neural network
pairwise
special case
data sets
information retrieval
genetic algorithm
computer vision
multiscale
objective function
probabilistic model
loss function