Learning Visual N-Grams from Web Data

Abstract

Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.

BibTeX key: li2016learning
entry type: misc
year: 2016
url: http://arxiv.org/abs/1612.09161
note: cite arxiv:1612.09161
