Abstract
The human visual system can selectively attend to parts of a scene for quick
perception, a biological mechanism known as human attention. Inspired by this,
recent deep learning models encode attention mechanisms to focus on the most
task-relevant parts of the input signal for further processing, which is called
machine/neural/artificial attention. Understanding the relation between human
and machine attention is important for interpreting and designing neural
networks. Many works claim that the attention mechanism offers an extra
dimension of interpretability by explaining where the neural networks look.
However, recent studies demonstrate that artificial attention maps do not
always coincide with common intuition. In view of this conflicting evidence,
here we conduct a systematic study of the use of artificial attention and human
attention in neural network design. Using three example computer vision tasks,
diverse representative backbones and well-known architectures, corresponding
real human gaze data, and systematically conducted large-scale quantitative
studies, we quantify the consistency between artificial attention and human
visual attention. We also offer novel insights into existing artificial
attention mechanisms by giving preliminary answers to several key questions
about human and artificial attention. Overall, the results demonstrate that
human attention can serve as a meaningful 'ground truth' benchmark in
attention-driven tasks, where the closer the artificial attention is to human
attention, the better the performance; for higher-level vision tasks, the
relationship varies case by case. For attention-driven tasks, it would be
advisable to explicitly enforce a better alignment between artificial and human
attention to boost performance; such alignment would also improve network
explainability for higher-level computer vision tasks.