Abstract
Real-world image recognition systems need to recognize tens of thousands of
classes that constitute a plethora of visual concepts. The traditional approach
of annotating thousands of images per class for training is infeasible in such
a scenario, prompting the use of webly supervised data. This paper explores the
training of image-recognition systems on large numbers of images and associated
user comments. In particular, we develop visual n-gram models that can predict
arbitrary phrases that are relevant to the content of an image. Our visual
n-gram models are feed-forward convolutional networks trained using new loss
functions that are inspired by n-gram models commonly used in language
modeling. We demonstrate the merits of our models in phrase prediction,
phrase-based image retrieval, relating images and captions, and zero-shot
transfer.
Users
Please
log in to take part in the discussion (add own reviews or comments).