What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision.
Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, Kevin Murphy
Published in: HLT-NAACL (2015)
Keyphrases
- text to speech
- text to speech synthesis
- text recognition
- english text
- video sequences
- human activities
- speech recognition
- information retrieval
- text input
- computer vision
- news video
- lexical features
- video search
- multi lingual
- vision system
- text mining
- video collections
- text documents
- video segments
- spontaneous speech
- natural language descriptions
- text data
- language generation
- video data
- video analysis
- text retrieval
- conversational speech
- keywords
- multimodal interfaces
- speech signal
- speech synthesis
- broadcast news
- automatic speech recognition
- database
- video database
- video surveillance
- video content
- event detection
- video frames
- real time