Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation.
Philipp Harzig, Moritz Einfalt, Rainer Lienhart
Published in: CoRR (2021)
Keyphrases
- audio visual
- video scene
- video summarization
- visual data
- video frames
- video segments
- visual information
- multi modal
- multimedia
- key frames
- audio features
- audio visual content
- temporal context
- video data
- video sequences
- sports video
- multimodal fusion
- video content
- information retrieval
- multi stream
- video streams
- text mining
- video retrieval
- low level
- video analysis
- text data
- audio visual speech recognition
- video objects
- feature vectors
- moving objects
- high level
- multimedia content
- user comments
- multimedia data
- semantic information