Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition.
Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
Published in: INTERSPEECH (2022)
Keyphrases
- end to end
- audio visual speech recognition
- visual context
- multi stream
- audio visual
- temporal context
- semantic context
- noisy environments
- feature vectors
- visual information
- multi modal
- multimedia
- scene interpretation
- visual speech
- feature set
- speech recognition
- image data
- audio features
- hidden markov models
- spatio temporal