Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning.
Yulai Xie, Jingjing Niu, Yang Zhang, Fang Ren
Published in: IEEE Trans. Multim. (2024)
Keyphrases
- multi modal
- multistage
- multi modality
- text representation
- video search
- single modality
- audio visual
- multiple modalities
- dynamic programming
- video data
- high dimensional
- optimal policy
- concept learning
- video content
- text retrieval
- information filtering
- keywords
- feature extraction
- sufficient conditions
- machine learning
- image features
- bag of words
- search engine