Abstract
Recent studies have shown that modern deep neural network classifiers are
easy to fool, provided that an adversary can slightly modify their inputs.
Many papers have proposed adversarial attacks, defenses, and methods to
measure robustness to such adversarial perturbations. However, the most
commonly considered adversarial examples are based on $\ell_p$-bounded
perturbations in the input space of the neural network, which are unlikely
to arise naturally.
Recently, especially in computer vision, researchers have discovered "natural"
or "semantic" perturbations, such as rotations, changes of brightness, or
higher-level changes, but these perturbations have not yet been systematically
used to measure the performance of classifiers. In this paper, we propose
several metrics that measure the robustness of classifiers to natural
adversarial examples, along with methods to evaluate them. These metrics,
called latent space performance metrics, rely on the ability of generative
models to capture probability distributions and are defined in the latent
spaces of these models. On three
image classification case studies, we evaluate the proposed metrics for several
classifiers, including ones trained in conventional and robust ways. We find
that the latent counterparts of adversarial robustness are associated with the
accuracy of the classifier rather than its conventional adversarial robustness,
but the latter is still reflected in the properties of the latent
perturbations that are found. In addition, our novel method of finding latent adversarial
perturbations demonstrates that these perturbations are often perceptually
small.
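As a rough sketch of the general idea (not the authors' exact procedure; the `generator`, `classifier`, and all hyperparameters below are illustrative assumptions), a latent adversarial perturbation can be searched for by gradient descent in the latent space of a pretrained generative model, penalizing the size of the perturbation so that the resulting change in the generated image stays perceptually small:

```python
import torch
import torch.nn.functional as F

def find_latent_perturbation(generator, classifier, z, label,
                             steps=200, lr=0.05, reg=0.1):
    """Gradient search for a small latent perturbation delta such that
    the classifier's prediction on generator(z + delta) flips away from
    `label`. Assumes batch size 1; all hyperparameters are illustrative."""
    delta = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = classifier(generator(z + delta))
        if logits.argmax(dim=1).item() != label.item():
            break  # prediction changed: latent adversarial example found
        # Push the prediction away from the original label while keeping
        # the latent perturbation small (penalized by its L2 norm).
        loss = -F.cross_entropy(logits, label) + reg * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()

# Hypothetical usage: sample a latent vector from the generator's prior,
# take the classifier's prediction on the generated image as the label,
# then search for a latent perturbation that flips it.
# z = torch.randn(1, 128)
# label = classifier(generator(z)).argmax(dim=1)
# delta = find_latent_perturbation(generator, classifier, z, label)
```

The norm of the found `delta` can then serve as a latent-space analogue of the perturbation budget used in conventional $\ell_p$ robustness evaluation.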
Description
[2003.01993] Metrics and methods for robustness evaluation of neural networks with generative models