Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset.
Jack Deadman, Jon Barker
Published in: INTERSPEECH (2020)
Keyphrases
- speech recognition
- audio visual
- speaker recognition
- automatic speech recognition
- activity detection
- speaker identification
- multimedia
- face detection and tracking
- audio stream
- speaker verification
- human actions
- video content
- video sequences
- broadcast news
- video data
- noisy environments
- false positives
- visual data
- detection algorithm
- weakly labeled
- automatic speech recognition systems
- audio features
- video clips
- video scene
- video frames
- object detection
- detection method
- tv broadcast
- content based video retrieval
- speaker diarization
- acoustic features
- event detection
- text detection
- soccer video
- speech synthesis
- shot boundary detection
- action recognition
- space time
- key frames
- automatic transcription