Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers.
Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ian Sturdy. Published in: CoRR (2017)
Keyphrases
- visual data
- visual speech
- audio signals
- visual information
- emotion recognition
- video indexing and retrieval
- multimedia
- audio video
- video signals
- content based video retrieval
- signal processing
- video data
- noisy environments
- visual cues
- video sequences
- video content
- scene change detection
- facial expressions
- speech recognition
- audio visual
- acoustic signals
- digital video
- speaker identification
- video files
- low level
- audio signal
- video search
- multimedia data
- video streams
- video database
- video material
- multimodal fusion
- broadcast news
- visual features
- human faces
- fundamental frequency
- multimedia processing
- mouth region
- visual analysis
- audio files
- video annotation
- news video
- multimedia information
- audio stream
- hidden markov models
- lifelog
- facial images
- voice recognition
- cross modal
- real time video
- lecture videos
- audio features
- video indexing
- video analysis
- acoustic signal
- cepstral features
- face images
- face detection and tracking
- video retrieval
- visual content
- video recordings
- video content analysis
- key frames
- video shots
- music information retrieval
- feature vectors