Publication: Visual question answering based on local-scene-aware referring expression generation.