CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering.
Yuanyuan Jiang, Jianqin Yin
Published in: CoRR (2024)
Keyphrases
- question answering
- audio-visual
- passage retrieval
- multi-stream
- multimodal
- natural language processing
- information retrieval
- visual information
- named entities
- natural language
- multimedia
- visual data
- information extraction
- image classification
- text mining
- visual features
- search engine
- question answering systems
- QA systems
- data mining