"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin (KDD 2016)

This paper presents a neat idea for interpreting the predictions of black-box classifiers. The idea is simple and apparently works quite well. Given a large, complex model, such as a deep neural net for image classification, a faithful yet simple explanation of the entire model is probably impossible. The authors therefore focus first on explaining a single prediction, and then do a little work on explaining whole models by picking a group of representative predictions. It is less clear how well this second part generalizes; the first part, though, does seem quite general.

Suppose the black-box model maps inputs $\mathcal{X}$ (e.g. texts) into labels $\mathcal{Y}$ (e.g. categories for text classification). The idea of LIME (Local Interpretable Model-agnostic Explanations, the method the paper proposes) is to map each input into an interpretable representation from which “explanations” are built (for text, that can be the binary presence/absence of words). Then, given an input, LIME samples a bunch of points in this interpretable space, both near and far away from that input, maps each sample back to the original input space so the black-box model can label it, and fits a simple interpretable model to those labels. The fitted model should be (1) locally faithful, i.e. it accurately predicts what the black box does when you change the input, and (2) simple (e.g. it uses as few words as possible in the explanation). The simple interpretable model might be, for example, a shallow decision tree or a sparse linear classifier over binary features.
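For reference, the trade-off between (1) and (2) can be written as a single objective. This is a sketch in what I recall to be the paper's notation, none of which appears in the summary above: $f$ is the black-box model, $g$ is a candidate from a class $G$ of interpretable models, $\pi_x$ weights samples by their proximity to the input $x$, and $\Omega$ penalizes complexity.

$$
\xi(x) = \operatorname*{argmin}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)
$$

For sparse linear explanations, the loss is a locally weighted squared error over perturbed samples $z$ (with interpretable representation $z'$), using an exponential kernel of width $\sigma$ on some distance $D$:

$$
\mathcal{L}(f, g, \pi_x) = \sum_{z, z'} \pi_x(z) \, \bigl(f(z) - g(z')\bigr)^2, \qquad \pi_x(z) = \exp\!\left(-\frac{D(x, z)^2}{\sigma^2}\right)
$$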

For texts and images, their classes of explanations are very directly linked to the input: for text, it is a list of words that are present/absent in the input, and for images it is super-pixels (i.e. contiguous patches of similar pixels produced by a segmentation algorithm, not necessarily square). This makes the explanations very easy to understand.

The intuitive idea is very simple. You have an input whose prediction you want explained. You perturb the input locally (i.e. randomly remove words, or gray out super-pixels in an image) and run the classifier on each perturbed input. Then, you use the classifier's outputs on these new inputs to find an explanation. For example, if every time you erase the words “theorem” and “proof” from the text the class changes from “math paper” to something else, then you can faithfully say that these words are the “explanation” for the prediction: if they weren’t there, the prediction would have changed. Finding this explanation can be done by fitting a simple model, like a sparse regression.
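The perturb-and-fit loop can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions, not the authors' implementation: `black_box` is a made-up toy classifier, the distance used for weighting is simply the fraction of words removed, and sparsity comes from fitting a weighted ridge regression and keeping the `top_k` largest weights (a stand-in for the sparse regression mentioned above). The names `explain`, `solve`, `kernel_width`, and `top_k` are all hypothetical.

```python
import math
import random


def black_box(words):
    """Toy stand-in for an opaque classifier: returns P("math paper")."""
    score = 2.0 * ("theorem" in words) + 2.0 * ("proof" in words) - 2.0
    return 1.0 / (1.0 + math.exp(-score))


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def explain(words, predict, num_samples=500, kernel_width=0.75,
            ridge=1e-3, top_k=2, seed=0):
    """Return the top_k (word, weight) pairs of a local linear surrogate.

    Assumes the words are distinct, so each mask entry maps to one word.
    """
    rng = random.Random(seed)
    d = len(words)
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        # Binary mask: 1 keeps the word, 0 removes it.
        mask = [rng.randint(0, 1) for _ in range(d)]
        kept = [w for w, m in zip(words, mask) if m]
        # Assumed distance: fraction of words removed from the input.
        distance = 1.0 - sum(mask) / d
        rows.append(mask + [1])  # trailing 1 is the intercept column
        targets.append(predict(kept))
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))

    # Weighted ridge regression: (X^T W X + ridge * I) beta = X^T W y.
    p = d + 1
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
          + (ridge if i == j else 0.0) for j in range(p)] for i in range(p)]
    b = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
         for i in range(p)]
    beta = solve(A, b)

    # Keep only the words with the largest absolute weights.
    ranked = sorted(zip(words, beta[:d]), key=lambda t: abs(t[1]), reverse=True)
    return ranked[:top_k]


words = "we give a proof of the main theorem".split()
for word, weight in explain(words, black_box):
    print(f"{word}: {weight:+.3f}")
```

A positive weight means that keeping the word pushes the toy classifier toward “math paper”. With a real model, `predict` would wrap the actual classifier, and the perturbation and distance functions would need to match the input type.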

I’m thinking of possible applications of this kind of explanation in an education setting, in a project I’m starting to work on. Let’s see if I come back to report on that in a few months (this would come in at a very late stage of the project).