Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture.
Ohad Cohen, Gershon Hazan, Sharon Gannot
Published in: CoRR (2024)
Keyphrases
- emotion recognition
- audio visual
- audio stream
- automatic speech recognition
- visual information
- broadcast news
- speaker diarization
- emotional speech
- multimodal fusion
- audio signals
- hierarchical architecture
- speech recognition
- speaker identification
- text to speech synthesis
- emotional state
- text to speech
- human computer interaction
- speech processing
- natural language
- semantic context
- cepstral features
- audio features
- facial expressions
- audio recordings
- digital audio
- fuzzy logic
- fault diagnosis
- acoustic features
- visual features
- intermediate representations
- acoustic signals
- content based video retrieval
- audio video
- multimodal interfaces
- high level
- visual data
- multi modal
- low level
- speech music discrimination
- prosodic features
- semantic web
- power system
- semantic information
- sentiment analysis
- speaker recognition
- speaker verification
- semantic analysis