MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers.
Wenhui Wang
Furu Wei
Li Dong
Hangbo Bao
Nan Yang
Ming Zhou
Published in: NeurIPS (2020)
Keyphrases
pre-trained
training data
training examples
statistical model
focus of attention