Disentangling speech from surroundings in a neural audio codec.
Ahmed Omran, Neil Zeghidour, Zalán Borsos, Félix de Chaumont Quitry, Malcolm Slaney, Marco Tagliasacchi
Published in: CoRR (2022)
Keyphrases
- audio stream
- audio visual
- broadcast news
- speaker identification
- audio signals
- text to speech
- emotion recognition
- audio features
- network architecture
- digital audio
- cepstral features
- prosodic features
- speech processing
- speech music discrimination
- audio recordings
- speech recognition
- audio video
- linear predictive coding
- speech signal
- speech synthesis
- multi stream
- automatic speech recognition
- multi modal
- automatic transcription
- acoustic signals
- spoken documents
- multimedia
- visual information
- video codec
- neural network
- audio signal
- motion estimation
- video coding
- noisy environments
- visual data
- human language
- speaker verification
- signal processing
- acoustic features
- spontaneous speech
- facial expressions
- visual speech
- bitstream
- gaussian mixture model
- associative memory
- feature set
- mel frequency cepstral coefficients
- spike trains
- speaker recognition
- bit rate
- human computer interaction
- feature extraction
- emotional state
- transform domain
- video streams