Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation.
Philipp Harzig, Moritz Einfalt, Rainer Lienhart
Published in: ICIP (2022)
Keyphrases
- audio visual
- video scene
- video summarization
- visual data
- video frames
- video segments
- visual information
- multimedia
- multi modal
- key frames
- audio features
- audio visual content
- temporal context
- sports video
- video sequences
- video data
- multimodal fusion
- multimedia data
- video retrieval
- multi stream
- video content
- video streams
- moving objects
- information retrieval
- video analysis
- temporal information
- semantic information
- visual features
- audio visual speech recognition
- text data
- visual content
- data sets
- keywords
- image database
- spatio temporal
- user comments