SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation.
Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra
Published in: NeurIPS (2021)
Keyphrases
- multiple objects
- visual scene
- complex scenes
- object models
- spatial relations
- moving objects
- real world scenes
- target object
- uncalibrated images
- object appearance
- real world objects
- d scene
- d objects
- three dimensional
- single image
- computer vision
- vision system
- visual input
- image segments
- natural language
- geometric information
- camera images
- viewing angle
- programming language
- object features
- video sequences
- complex objects
- reference object
- image regions
- fuzzy logic
- acquired images
- location and orientation
- intensity images
- video scene
- real time
- object model
- object classes
- object tracking
- wire frame
- real objects
- real scenes
- visual attention
- rigid body motion
- viewing position