Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos.
Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh K. Jha, Diego Romeres, Jonathan Le Roux
Published in: INTERSPEECH (2023)
Keyphrases
- audio visual
- scene understanding
- robot navigation
- vision system
- video surveillance
- video summarization
- multi modal
- visual data
- audio features
- object detection
- human actions
- visual information
- object recognition
- 3D scene
- sound source
- multi stream
- multimedia
- video sequences
- computer vision
- audio visual speech recognition
- transfer learning
- background subtraction
- real time
- action recognition
- video data
- visual features
- human activities
- space time
- image classification
- high dimensional
- moving objects
- data sets
- human motion