Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers.
Chiori Hori
Takaaki Hori
Jonathan Le Roux
Published in:
Interspeech (2021)
Keyphrases
audio visual
online video
multi modal
visual information
user interaction
emotion recognition
visual data
multimedia
multi stream
audio visual speech recognition
person authentication
video sharing
temporal context
multimodal fusion
high dimensional
high level
data processing