AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning.
Jongsuk Kim, Jiwon Shin, Junmo Kim
Published in: CoRR (2024)
Keyphrases
- visual features
- visual information
- keywords
- visual data
- textual information
- audio features
- web images
- visual content
- image classification
- image search
- textual features
- content based video retrieval
- image retrieval
- semantic content
- semantic concepts
- visual appearance
- low level
- acoustic features
- text queries
- low level features
- image collections
- text retrieval
- text documents
- text mining
- image annotation
- audio visual
- visual similarity
- visual and textual features
- semantic features
- semantic gap
- global features
- text data
- visual properties
- multimedia
- information retrieval
- video shots
- bag of words
- semantic information
- search engine
- low level features and high level
- automatic image annotation
- low level visual features
- object recognition
- high level