Publication: A Bi-objective Perspective on Controllable Language Models: Reward Dropout Improves Off-policy Control Performance.