Abstract
We study the problem of attributing the prediction of a deep network to its
input features, a problem previously studied by several other works. We
identify two fundamental axioms, Sensitivity and Implementation Invariance,
that attribution methods ought to satisfy. We show that they are not satisfied
by most known attribution methods, which we consider to be a fundamental
weakness of those methods. We use the axioms to guide the design of a new
attribution method called Integrated Gradients. Our method requires no
modification to the original network and is extremely simple to implement; it
just needs a few calls to the standard gradient operator. We apply this method
to a couple of image models, a couple of text models and a chemistry model,
demonstrating its ability to debug networks, to extract rules from a network,
and to help users engage more effectively with models.
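The abstract says the method needs only a few calls to the standard gradient operator. The sketch below illustrates that idea under stated assumptions. It takes Integrated Gradients to be the integral of the gradient along the straight line from a baseline x' to the input x, scaled per feature by (x_i - x'_i), and approximates that integral with a midpoint Riemann sum. The function name `integrated_gradients`, the two toy models, the all-zeros baseline, and `steps=50` are illustrative assumptions and do not come from the text above. The second toy model, F(x) = 1 - ReLU(1 - x), shows the kind of Sensitivity failure the abstract refers to. At x = 2 the plain gradient is 0, even though the output differs from the output at the baseline. The integrated attribution instead recovers the full difference F(x) - F(x').

```python
import numpy as np


def integrated_gradients(grad_f, x, baseline, steps=50):
    """Hypothetical helper: midpoint Riemann-sum approximation of
    Integrated Gradients along the straight path baseline -> x.

    grad_f: function returning dF/dx at a point (same shape as x).
    Attribution_i = (x_i - baseline_i) * mean_k dF/dx_i(baseline + a_k * (x - baseline)).
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of `steps` equal bins in (0, 1)
    grads = [grad_f(baseline + a * (x - baseline)) for a in alphas]
    avg_grad = np.mean(grads, axis=0)          # average gradient along the path
    return (x - baseline) * avg_grad


# Toy model 1 (hypothetical): F(x) = x0 * x1, with its analytic gradient.
def f_prod(x):
    return x[0] * x[1]


def grad_prod(x):
    return np.array([x[1], x[0]])


# Toy model 2 (hypothetical): F(x) = 1 - relu(1 - x) on a single feature.
def f_relu(x):
    return 1.0 - np.maximum(0.0, 1.0 - x[0])


def grad_relu(x):
    # dF/dx = 1 where 1 - x > 0, and 0 in the flat (saturated) region.
    return np.where(1.0 - x > 0, 1.0, 0.0)


x1, b1 = np.array([2.0, 3.0]), np.zeros(2)
ig_prod = integrated_gradients(grad_prod, x1, b1)

x2, b2 = np.array([2.0]), np.zeros(1)
plain_grad = grad_relu(x2)                     # gradient at the input alone
ig_relu = integrated_gradients(grad_relu, x2, b2)

print("IG, product model:", ig_prod)           # sums to F(x) - F(baseline) = 6
print("plain gradient, relu model:", plain_grad)
print("IG, relu model:", ig_relu)
```

The sketch costs `steps` gradient evaluations. In a real network, `grad_f` would come from the framework's automatic differentiation rather than from a hand-written derivative.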