Visual Grounding (VG) aims to locate the most relevant region in an image,
based on a flexible natural language query rather than a pre-defined label, which
can make it more useful than object detection in practice. Most
state-of-the-art methods in VG operate in a two-stage manner: in the first
stage, an object detector is adopted to generate a set of object proposals from
the input image, and the second stage is simply formulated as a cross-modal
matching problem that finds the best match between the language query and all
region proposals. This is rather inefficient because there might be hundreds of
proposals produced in the first stage that need to be compared in the second
stage, and the strategy is also inaccurate. In this paper, we
propose a simple, intuitive and much more elegant one-stage detection-based
method that joins the region proposal and matching stages into a single detection
network. The detection is conditioned on the input query with a stack of novel
Relation-to-Attention modules that transform the image-to-query relationship to
a relation map, which is used to predict the bounding box directly without
generating large numbers of useless region proposals. During inference, our
approach is about 20x to 30x faster than previous methods and, remarkably, it
achieves an 18% to 41% absolute performance improvement over the
state-of-the-art results on several benchmark datasets. We release our code and
all the pre-trained models at https://github.com/openblack/rvg.
%0 Generic
%1 journals/corr/abs-1902-04213
%A Deng, Chaorui
%A Wu, Qi
%A Xu, Guanghui
%A Yu, Zhuliang
%A Xu, Yanwu
%A Jia, Kui
%A Tan, Mingkui
%D 2019
%K arch grounding loss
%T You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding
%U http://arxiv.org/abs/1902.04213
%X Visual Grounding (VG) aims to locate the most relevant region in an image,
based on a flexible natural language query rather than a pre-defined label, which
can make it more useful than object detection in practice. Most
state-of-the-art methods in VG operate in a two-stage manner: in the first
stage, an object detector is adopted to generate a set of object proposals from
the input image, and the second stage is simply formulated as a cross-modal
matching problem that finds the best match between the language query and all
region proposals. This is rather inefficient because there might be hundreds of
proposals produced in the first stage that need to be compared in the second
stage, and the strategy is also inaccurate. In this paper, we
propose a simple, intuitive and much more elegant one-stage detection-based
method that joins the region proposal and matching stages into a single detection
network. The detection is conditioned on the input query with a stack of novel
Relation-to-Attention modules that transform the image-to-query relationship to
a relation map, which is used to predict the bounding box directly without
generating large numbers of useless region proposals. During inference, our
approach is about 20x to 30x faster than previous methods and, remarkably, it
achieves an 18% to 41% absolute performance improvement over the
state-of-the-art results on several benchmark datasets. We release our code and
all the pre-trained models at https://github.com/openblack/rvg.
@misc{journals/corr/abs-1902-04213,
abstract = {Visual Grounding (VG) aims to locate the most relevant region in an image,
based on a flexible natural language query rather than a pre-defined label, which
can make it more useful than object detection in practice. Most
state-of-the-art methods in VG operate in a two-stage manner: in the first
stage, an object detector is adopted to generate a set of object proposals from
the input image, and the second stage is simply formulated as a cross-modal
matching problem that finds the best match between the language query and all
region proposals. This is rather inefficient because there might be hundreds of
proposals produced in the first stage that need to be compared in the second
stage, and the strategy is also inaccurate. In this paper, we
propose a simple, intuitive and much more elegant one-stage detection-based
method that joins the region proposal and matching stages into a single detection
network. The detection is conditioned on the input query with a stack of novel
Relation-to-Attention modules that transform the image-to-query relationship to
a relation map, which is used to predict the bounding box directly without
generating large numbers of useless region proposals. During inference, our
approach is about 20x to 30x faster than previous methods and, remarkably, it
achieves an 18% to 41% absolute performance improvement over the
state-of-the-art results on several benchmark datasets. We release our code and
all the pre-trained models at https://github.com/openblack/rvg.},
added-at = {2019-03-10T20:15:05.000+0100},
author = {Deng, Chaorui and Wu, Qi and Xu, Guanghui and Yu, Zhuliang and Xu, Yanwu and Jia, Kui and Tan, Mingkui},
biburl = {https://www.bibsonomy.org/bibtex/26b0b856e555bedb664668e5b61085e68/nmatsuk},
description = {You Only Look },
interhash = {19839b69974c9386f7c2c84e36461a00},
intrahash = {6b0b856e555bedb664668e5b61085e68},
keywords = {arch grounding loss},
note = {cite arxiv:1902.04213. Comment: 10 pages, 5 figures},
timestamp = {2019-03-10T20:15:05.000+0100},
title = {You Only Look \& Listen Once: Towards Fast and Accurate Visual Grounding},
url = {http://arxiv.org/abs/1902.04213},
year = 2019
}