This paper introduces a method for producing natural language descriptions of individual neurons in a neural network. The authors focus on vision models, though the principles seem adaptable to other domains. They start by representing each neuron by the set of training-set image patches that activate it most strongly. They then collect human descriptions of these patch sets, for neurons drawn from a variety of vision architectures, to use as training data. New descriptions are generated with a simple objective: a candidate description is upweighted by its likelihood given the image patches and downweighted by how common it is globally (formulated as pointwise mutual information between the description and the set of images).
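To make the selection objective concrete, here is a minimal sketch of reranking candidate descriptions by pointwise mutual information, i.e. log p(d | images) minus log p(d). The function names, the `weight` knob, and all probabilities below are invented for illustration and are not taken from the paper; in practice the two log-probabilities would come from learned models.

```python
import math

def pmi_score(log_p_desc_given_images, log_p_desc, weight=1.0):
    """Weighted PMI: log p(d | images) - weight * log p(d).

    `weight` is a hypothetical knob; weight=1.0 gives plain PMI,
    weight=0.0 falls back to ranking by likelihood alone.
    """
    return log_p_desc_given_images - weight * log_p_desc

def pick_description(candidates, weight=1.0):
    """candidates maps description -> (log p(d | images), log p(d))."""
    return max(candidates, key=lambda d: pmi_score(*candidates[d], weight=weight))

# Made-up numbers: a generic description is likely for almost any patch set,
# while a specific one is less likely overall but far less common globally.
candidates = {
    "an object": (math.log(0.30), math.log(0.20)),
    "dog faces": (math.log(0.20), math.log(0.01)),
}

best_by_pmi = pick_description(candidates)               # favors the specific description
best_by_likelihood = pick_description(candidates, 0.0)   # favors the generic description
```

With these toy numbers, likelihood alone prefers the generic "an object", whereas the PMI score prefers "dog faces", which is the behavior the downweighting term is meant to produce.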
The results look very promising. The authors use the method to analyze important features in a few different architectures (and show that results generalize well to architectures held out during training), to audit models that are supposed to not rely on certain protected features, and finally to analyze spurious correlations learned from an adversarial dataset (they also improve the model ``zero-shot'' by simply removing the neurons that correspond to spurious features). This should enable a range of new applications, and scaling it up (which I think the likes of Google and FAIR/MAIR might do) could advance the current state of interpretability quite significantly.
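The ``zero-shot'' fix mentioned above amounts to ablating units whose descriptions flag them as spurious. The sketch below shows that idea on a single layer's activations; the function name, the keyword-based filter, and the example values are assumptions made for illustration, not the authors' procedure.

```python
def ablate_units(activations, descriptions, is_spurious):
    """Return a copy of `activations` with spurious units set to zero.

    activations:  per-unit activation values for one layer
    descriptions: natural language description of each unit, same order
    is_spurious:  hypothetical predicate deciding from the description alone
    """
    return [0.0 if is_spurious(desc) else act
            for act, desc in zip(activations, descriptions)]

# Made-up example: suppose the adversarial dataset overlays text on images,
# so units described as responding to text are treated as spurious.
descriptions = ["dog faces", "text overlays", "grass"]
activations = [0.7, 1.9, 0.2]

ablated = ablate_units(activations, descriptions, lambda d: "text" in d)
```

No retraining is involved here: the descriptions alone decide which units to switch off, which is what makes the intervention ``zero-shot''.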