ε-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer.
Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu
Published in: WACV (Workshops) (2024)
Keyphrases
- language model
- multimedia
- video data
- language modeling
- video sequences
- context-sensitive
- video frames
- video content
- n-gram
- document retrieval
- feature selection
- key frames
- test collection
- query expansion
- information extraction
- probabilistic model
- multiscale
- information retrieval
- information retrieval systems
- computational complexity
- Bayesian networks
- image processing
- smoothing methods