Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation.
Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogério Feris, James R. Glass
Published in: CoRR (2024)
Keyphrases
- visual features
- audio visual speech recognition
- visual information
- image classification
- image retrieval
- audio visual
- multi stream
- visual content
- keywords
- image search
- low level
- low level features
- image annotation
- image collections
- saliency map
- human actions
- visual data
- visual appearance
- video shots
- information retrieval
- machine learning
- semantic concepts
- image content