Misc

From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi
(Nov 27, 2018)

Abstract

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. In this paper, we formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true. We introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe to generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. To move towards cognition-level image understanding, we present a new reasoning engine, called Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art models struggle (~45%). Our R2C helps narrow this gap (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
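
The abstract describes R2C as three layered inferences: grounding, contextualization, and reasoning. Below is a minimal PyTorch sketch of that layering only; the module choices, shared attention, dimensions, and the name `R2CSketch` are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class R2CSketch(nn.Module):
    """Illustrative sketch of the layered inferences named in the abstract:
    grounding, contextualization, and reasoning. Not the published R2C model."""

    def __init__(self, dim=512):
        super().__init__()
        # Grounding: align question/response tokens with image object features.
        self.ground = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Contextualization: attend the candidate response against the grounded question.
        self.contextualize = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Reasoning: integrate the contextualized sequence and score the candidate.
        self.reason = nn.LSTM(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, question, response, objects):
        # question, response: (B, T, dim) token features; objects: (B, O, dim) region features.
        q_grounded, _ = self.ground(question, objects, objects)
        r_grounded, _ = self.ground(response, objects, objects)
        r_ctx, _ = self.contextualize(r_grounded, q_grounded, q_grounded)
        reasoned, _ = self.reason(r_ctx)
        # One logit per candidate response for this image/question pair.
        return self.score(reasoned.mean(dim=1))  # (B, 1)
```

In the multiple-choice setting sketched here, one logit would be computed per candidate answer (and likewise per candidate rationale), with a softmax over the four choices selecting the prediction.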
