Small-scale proxies for large-scale Transformer training instabilities.
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
Published in: CoRR (2023)