ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision.

Wonjae Kim, Bokyung Son, Ildoo Kim

Published in: ICML (2021)

Keyphrases

image processing
programming language
computer vision
natural language
active learning
vision system
language learning
fuzzy logic
power system
neural network
real time
artificial intelligence
data sets
specification language
language processing
region of interest
input image
image features
information extraction
database systems
information retrieval