This paper introduces a new metric (IMV) for evaluating the prediction accuracy of models with binary outcomes. One issue with traditional metrics such as raw accuracy is that their absolute numbers can be deceiving: high accuracy could mean either a good model or a very easy prediction task. Similarly, accuracy does not differentiate between models with high vs. low confidence. For example, in a trivial dataset where all labels are 1, a model that always outputs a probability of .51 will have the same accuracy as one that outputs .99.
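A quick numpy check of that last point (variable names are mine): both predictors get perfect accuracy, but the mean log-likelihood of the true labels, which is what the IMV builds on, easily separates them.

```python
import numpy as np

y = np.ones(100)                   # trivial task: every label is 1
p_timid     = np.full(100, 0.51)   # barely confident predictor
p_confident = np.full(100, 0.99)   # very confident predictor

acc = lambda p: np.mean((p > 0.5) == y)                            # raw accuracy
mll = lambda p: np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # mean log-likelihood

print(acc(p_timid), acc(p_confident))   # 1.0, 1.0       -- accuracy can't tell them apart
print(mll(p_timid), mll(p_confident))   # ~-0.67, ~-0.01 -- the likelihood can
```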
IMV is a simple metric that takes care of both problems. First, the IMV is measured between two models, so it is always a relative quantity (measuring improvement). This does not add much complication: the base model can be one that always predicts the base rate, for instance. If that alone turns out to be a very strong baseline, it will be hard to get very high IMVs. Then, given each model's predictions (not just the predicted labels but the probability assigned to each label), the IMV is computed in two steps: first, for both the base and new model, find the equivalent Bernoulli parameter, i.e. the biased coin that would yield the same mean log-likelihood of the true labels; then measure how much better you can guess the true label with the new model's coin toss than with the base model's. Thus, the IMV measures the improvement itself, which makes it comparable across domains and useful for answering feature-importance questions (e.g., which outcome is feature X more predictive of, Y or Z?).
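A minimal sketch of how I read that construction (the function names are mine, and I am interpreting "how much better" as the relative gain (w1 - w0) / w0 in the equivalent coin's probability of guessing right; the paper's exact definition may differ):

```python
import numpy as np
from scipy.optimize import brentq

def mean_log_likelihood(y, p):
    """Mean log-likelihood of binary labels y under predicted probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def equivalent_bernoulli(ll):
    """Coin bias w in (0.5, 1) whose own log-likelihood matches ll, i.e. the root of
    w*log(w) + (1-w)*log(1-w) = ll. Assumes the model beats a fair coin but is not
    perfect, so a root exists inside the bracket."""
    f = lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - ll
    return brentq(f, 0.5, 1 - 1e-9)

def imv(y, p_base, p_new):
    """Relative improvement of the new model's equivalent coin over the base model's."""
    w0 = equivalent_bernoulli(mean_log_likelihood(y, p_base))
    w1 = equivalent_bernoulli(mean_log_likelihood(y, p_new))
    return (w1 - w0) / w0

# Toy example: the base model predicts the base rate, the new model uses real signal.
y = np.array([1, 0, 1, 1, 0, 1])
p_base = np.full(len(y), y.mean())
p_new = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.95])
print(imv(y, p_base, p_new))
```

The matching step is just a one-dimensional root find, since the coin's log-likelihood of its own outcomes is monotone in w on (0.5, 1).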
This is pretty useful, especially since I have recently been deceived by high accuracies in ongoing work, where the IMV would have been a more honest metric for me.