Stabilizing Transformer Training by Preventing Attention Entropy Collapse.
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Joshua M. Susskind
Published in: ICML (2023)
Keyphrases
- training set
- fuzzy logic
- training process
- information theoretic
- mutual information
- information theory
- artificial intelligence
- nonlinear systems
- training phase
- training algorithm
- test set
- training examples
- supervised learning
- information content
- probabilistic model
- active learning
- information retrieval
- information entropy