Small-scale proxies for large-scale Transformer training instabilities.
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie E. Everett, Alexander A. Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
Published in: ICLR (2024)