SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding.
Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari
Published in: CoRR (2023)