Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung. Published in: CoRR (2022)
Keyphrases
- text to speech synthesis
- audio visual
- multimodal fusion
- emotion recognition
- text to speech
- multimodal interfaces
- multimodal interaction
- multi stream
- multi modal
- broadcast news
- speech synthesis
- story segmentation
- text graphics
- multimedia
- prosodic features
- spoken documents
- english text
- emotional speech
- emotional state
- human language
- speaker verification
- human computer interaction
- audio stream
- visual data
- visual information
- information retrieval
- multimodal information
- lexical features
- english words
- text recognition
- cross modal
- audio features
- multi lingual
- signal processing
- high robustness
- video search
- affect sensing
- facial expressions
- speaker identification
- text mining
- speaker recognition
- audio signals
- multimedia data
- emotion classification
- affect detection
- audio recordings
- keywords
- video retrieval
- affective states
- speech processing
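The technique named in the title, cross attention between aligned audio and text representations, can be sketched as scaled dot-product attention where text-side vectors act as queries over audio-side keys and values. This is a minimal illustrative sketch, not the paper's implementation; the function name, the choice of text as the query side, and the toy vectors are all assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    # Each query (e.g. a text-token vector, illustrative) attends over
    # keys/values (e.g. aligned audio-frame vectors); returns one fused
    # vector per query. Plain scaled dot-product attention, no learned
    # projections.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        fused = [sum(wj * v[i] for wj, v in zip(w, values))
                 for i in range(len(values[0]))]
        out.append(fused)
    return out

# Toy example: one text query, two audio frames.
text = [[1.0, 0.0]]
audio = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(text, audio, audio)
```

Because the query points toward the first audio frame, the fused vector weights that frame more heavily; in a full model, learned query/key/value projections would precede this step.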