Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers.
Chiori Hori, Takaaki Hori, Jonathan Le Roux
Published in: CoRR (2021)
Keyphrases
- audio visual
- online video
- multi modal
- visual information
- multimedia
- user interaction
- visual data
- multi stream
- person authentication
- temporal context
- audio visual speech recognition
- video sharing
- emotion recognition
- natural language processing
- image features
- low level
- pose estimation
- user interface
- three dimensional
- feature selection
- multimodal fusion