Abstract
A rapidly growing area of work has studied the existence of adversarial
examples, datapoints which have been perturbed to fool a classifier, but the
vast majority of these works have focused primarily on threat models defined by
$\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat
model for adversarial attacks based on the Wasserstein distance. In the image
classification setting, such distances measure the cost of moving pixel mass,
which naturally cover "standard" image manipulations such as scaling, rotation,
translation, and distortion (and can potentially be applied to other settings
as well). To generate Wasserstein adversarial examples, we develop a procedure
for projecting onto the Wasserstein ball, based upon a modified version of the
Sinkhorn iteration. The resulting algorithm can successfully attack image
classification models, bringing traditional CIFAR10 models down to 3% accuracy
within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1
pixel), and we demonstrate that PGD-based adversarial training can improve this
adversarial accuracy to 76%. In total, this work opens up a new direction of
study in adversarial robustness, more formally considering convex metrics that
accurately capture the invariances that we typically believe should exist in
classifiers. Code for all experiments in the paper is available at
https://github.com/locuslab/projected_sinkhorn.
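The projection described above builds on the Sinkhorn iteration for entropy-regularized optimal transport. As a point of reference, here is a minimal NumPy sketch of the standard, unmodified Sinkhorn iteration that computes an (approximate) Wasserstein distance between two histograms; the function name `sinkhorn` and the toy marginals are illustrative only, and the paper's projected Sinkhorn variant (see the linked repository) modifies this update to project onto a Wasserstein ball rather than solve the transport problem directly.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    a, b : source and target marginals (1-D arrays summing to 1)
    C    : cost matrix, C[i, j] = cost of moving mass from bin i to bin j
    reg  : entropy regularization strength (smaller = closer to exact OT)
    Returns a transport plan P whose row sums ~ a and column sums ~ b.
    """
    K = np.exp(-C / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # rescale to match the target marginal
        u = a / (K @ v)                   # rescale to match the source marginal
    return u[:, None] * K * v[None, :]    # plan P = diag(u) K diag(v)

# Toy example: transport mass between two 1-D histograms.
a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
P = sinkhorn(a, b, C)
print(P.round(3))                         # approximate transport plan
print((P * C).sum())                      # approximate Wasserstein-1 cost (1.0 here)
```

In the image setting of the abstract, `a` and `b` would be (normalized) pixel intensities and `C` a pixel-to-pixel distance, so the printed cost corresponds to how much image mass is moved and how far.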