CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes.
Maria Parelli, Alexandros Delitzas, Nikolas Hars, Georgios Vlassis, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann
Published in: CoRR (2023)
Keyphrases
- question answering
- 3D scene
- natural language
- information extraction
- information retrieval
- single image
- natural language processing
- depth map
- question classification
- optical flow
- natural language questions
- training set
- cross language
- qa clef
- image processing
- passage retrieval
- question answering systems
- semantic roles
- high quality
- syntactic information
- machine learning
- qa systems
- feature extraction
- knowledge representation