Abstract
State-of-the-art deep convolutional networks (DCNs) such as squeeze-and-excitation (SE) residual networks implement a form of attention, also known as contextual guidance, which is derived from global image features. Here, we explore a complementary form of attention, known as visual saliency, which is derived from local image features. We extend the SE module with a novel global-and-local attention (GALA) module that combines both forms of attention, yielding state-of-the-art accuracy on ILSVRC. We further describe ClickMe.ai, a large-scale online experiment in which human participants identify diagnostic image regions, which are then used to co-train a GALA network. Adding humans in the loop is shown to significantly improve network accuracy, while also yielding visual features that are more interpretable and more similar to those used by human observers.
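Since the abstract only names the GALA module, the following is a minimal PyTorch sketch of how such a global-and-local attention block might look: an SE-style global (channel) branch, a 1x1-convolution local (spatial) saliency branch, and a fused attention mask applied to the input. The layer widths, the reduction factor, and the tanh fusion with learned alpha/beta weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class GALABlock(nn.Module):
    """Sketch of a global-and-local attention (GALA) block.

    Global branch: SE-style channel attention from globally pooled features.
    Local branch: 1x1 convolutions producing a per-location saliency map.
    The two are fused into a combined attention mask applied to the input.
    (Layer sizes and the fusion rule are assumptions for illustration.)
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Global (channel) attention: squeeze-and-excitation style.
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # Local (spatial) attention: 1x1 convs down to one saliency map.
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )
        # Learned per-channel weights balancing multiplicative vs.
        # additive fusion of the two attention signals (an assumption).
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.global_att(x)  # (N, C, 1, 1): channel attention
        s = self.local_att(x)   # (N, 1, H, W): spatial saliency
        # Broadcast g over space and s over channels, then squash to [-1, 1].
        a = torch.tanh(self.alpha * (g * s) + self.beta * (g + s))
        return x * a
```

A block like this is drop-in: for an input x of shape (N, C, H, W), GALABlock(C)(x) returns a tensor of the same shape, so it can be inserted after any residual block in the same way as an SE module.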