Gabriel Poesia

Adversarial Examples Are Not Bugs, They Are Features (@ NeurIPS 2019)

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry


Adversarial examples, inputs carrying human-imperceptible perturbations that cause learned models to fail catastrophically (mispredicting with arbitrarily high confidence), have been intriguing and dreaded since their discovery. Many explanations for their existence have been hypothesized, such as quirks in how we train and optimize models, in standard loss functions, or in neural architectures. This paper offers a different view: adversarial examples arise from features of the input data distribution that are useful but non-robust. Such features are truly correlated with the target label under the data distribution (/useful/), but that correlation can be erased or even reversed by small adversarial perturbations of the input (/non-robust/). Models have no inherent preference for robust features; standard training simply rewards any useful feature present in the data.
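To make the useful/non-robust distinction concrete, here is a minimal NumPy sketch. It roughly follows the paper's formalization for binary labels y in {-1, +1}: a feature f is useful if E[y f(x)] is positive, and robustly useful if that expectation stays positive when each input is replaced by its worst-case perturbation within a small budget. The toy data, the two linear features, and the L-infinity budget `eps` are all assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np

def usefulness(X, y, w):
    """Empirical E[y * f(x)] for the linear feature f(x) = w . x, labels in {-1, +1}."""
    return float(np.mean(y * (X @ w)))

def robust_usefulness(X, y, w, eps):
    """Empirical E[min over ||d||_inf <= eps of y * f(x + d)].

    For a linear feature the worst case is d = -eps * y * sign(w), which
    lowers y * f(x) by exactly eps * ||w||_1 on every example.
    """
    return usefulness(X, y, w) - eps * float(np.abs(w).sum())

# Hypothetical toy distribution (an assumption, not the paper's data):
#   coordinate 0 has a large class gap but is noisy,
#   coordinate 1 has a tiny class gap but almost no noise.
rng = np.random.default_rng(0)
n = 10_000
y = rng.choice([-1.0, 1.0], size=n)
X = np.stack([1.0 * y + rng.normal(0.0, 1.0, n),
              0.05 * y + rng.normal(0.0, 0.01, n)], axis=1)

f_robust = np.array([1.0, 0.0])     # reads coordinate 0
f_nonrobust = np.array([0.0, 1.0])  # reads coordinate 1
eps = 0.1                           # perturbation budget, larger than the gap in coordinate 1

for name, w in [("coordinate 0", f_robust), ("coordinate 1", f_nonrobust)]:
    print(name,
          "usefulness:", round(usefulness(X, y, w), 3),
          "robust usefulness:", round(robust_usefulness(X, y, w, eps), 3))
```

Both features are positively correlated with the label on clean data. Under the perturbation budget, the correlation of coordinate 0 stays positive, while that of coordinate 1 turns negative: it is useful but non-robust in the sense described above.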

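The authors run a series of fun experiments. In one, each training input is adversarially perturbed toward a different class and relabeled with that target class, so the labels look wrong to a human, yet standard models trained on this dataset still generalize well to the original test set. This works because the perturbation plants non-robust features of the target class, so the new labels stay consistent with useful features even though they contradict what a human sees.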
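The construction behind the relabeling experiment can be sketched on toy data. This is a heavily simplified stand-in, not the authors' procedure: it assumes a two-dimensional synthetic distribution, uses a least-squares linear classifier in place of a neural network, and replaces iterative adversarial attacks with a single L-infinity step of size `eps`. All names (`sample`, `fit_linear`, `eps`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Toy data (an assumption): x0 is robust but noisy, x1 is tiny in scale but nearly noise-free."""
    y = rng.choice([-1.0, 1.0], size=n)
    x0 = 1.0 * y + rng.normal(0.0, 1.0, n)
    x1 = 0.05 * y + rng.normal(0.0, 0.01, n)
    return np.stack([x0, x1], axis=1), y

def fit_linear(X, y):
    """Least-squares linear classifier, standing in for a 'standard' model."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))

X_train, y_train = sample(5000)
X_test, y_test = sample(5000)

# 1. Train a standard model on the original data.
w_std = fit_linear(X_train, y_train)

# 2. Push every input toward the opposite class t and relabel it as t.
#    For a linear model, the step that most increases t * (w . x) under an
#    L-infinity budget is eps * t * sign(w).
eps = 0.1
t = -y_train
X_adv = X_train + eps * t[:, None] * np.sign(w_std)[None, :]

# 3. Train a fresh model on the relabeled data only.
w_new = fit_linear(X_adv, t)

# How often the new label agrees with the robust coordinate (a crude proxy
# for "looks right to a human"), and accuracy on the untouched test set.
looks_right = float(np.mean(np.sign(X_adv[:, 0]) == t))
acc_orig_test = accuracy(w_new, X_test, y_test)
print("labels agreeing with robust coordinate:", looks_right)
print("accuracy on original test set:", acc_orig_test)
```

In this toy setting, most of the new labels disagree with the robust coordinate, yet the model trained on them classifies the original test set accurately, because the perturbation moved the non-robust coordinate to match the new label.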

They also point out an interesting implication for explainability: if models rely on non-robust features, there is little hope of explaining their predictions in general, since those predictions depend on human-imperceptible features that we cannot meaningfully explain.