Publication: Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models.