Publication: A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition.