Publication: Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation.