What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision.
Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, Kevin Murphy
Published in: CoRR (2015)
Keyphrases
- text to speech
- text to speech synthesis
- english text
- video sequences
- human activities
- text input
- text recognition
- information retrieval
- language generation
- video frames
- spontaneous speech
- lexical features
- computer vision
- vision system
- news video
- video search
- speech synthesis
- natural language descriptions
- keywords
- video database
- spoken documents
- automatic speech recognition
- speech signal
- content based video retrieval
- video collections
- speech recognition
- conversational speech
- video data
- multimodal interfaces
- multi modal
- multi lingual
- broadcast news
- video segments
- real time
- semantic information
- web images
- text retrieval
- text documents