A Touch, Vision, and Language Dataset for Multimodal Alignment.
Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
Published in: CoRR (2024)
Keyphrases
- computer vision
- programming language
- benchmark datasets
- language learning
- natural language
- multi modal
- language processing
- vision system
- real time
- image processing
- audio visual
- synthetic datasets
- human vision
- specification language
- word alignment
- multimodal interfaces
- object detection
- similarity measure
- human computer interaction
- image alignment