This paper proposes an approach for zero-shot learning. The authors assume you have a dataset of points with known class labels and, for each class, a class embedding: a fixed-length vector that represents that class. At test time, the model has to predict labels for points that might not belong to any class seen during training. However, the method assumes you are given class embeddings for all classes, even those unseen during training.
Their approach is based on using a generative model (a Variational Auto-Encoder) to optimize the likelihood of samples from seen classes given their class embeddings. They use a VAE to model $p(x|z; \theta)$: the probability of a data point $x$ given its class embedding $z$ and the model parameters $\theta$. Then, they apply Expectation-Maximization to optimize this likelihood using the samples from seen classes: first, you use the model to sample points from unseen classes, and take points from seen classes directly from the training set; then, you update the parameters to maximize the likelihood of the sampled points. At test time, you can plug in the class embeddings of unseen classes to get a generative model for them as well. The test-time likelihoods can also be used directly for classification, so there is no need to train a classifier on top of generated samples.
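To make the likelihood-based classification step concrete, here is a minimal sketch. It is not the paper's model: as a stand-in for the VAE, it assumes $p(x|z; \theta)$ is an isotropic Gaussian whose mean is a linear function of the class embedding, and it skips the EM step over unseen classes entirely. The function names (`fit_linear_map`, `log_likelihood`, `classify`) and the `sigma` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_linear_map(X, labels, class_emb):
    """Fit W by least squares so that x is approximately z @ W on seen-class data.

    Assumption: a linear-Gaussian stand-in for the paper's VAE decoder.
    X: (n, d_x) data, labels: (n,) integer class ids, class_emb: (k, d_z).
    """
    Z = class_emb[labels]                      # (n, d_z) embedding of each point's class
    W, *_ = np.linalg.lstsq(Z, X, rcond=None)  # (d_z, d_x)
    return W


def log_likelihood(X, z, W, sigma=1.0):
    """log N(x; z @ W, sigma^2 I) for each row x of X."""
    mean = z @ W
    d = X.shape[1]
    sq_dist = np.sum((X - mean) ** 2, axis=1)
    return -0.5 * sq_dist / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)


def classify(X, class_emb, W, sigma=1.0):
    """Pick, for each point, the class whose embedding gives the highest likelihood.

    class_emb may include embeddings of classes never seen when fitting W.
    """
    scores = np.stack(
        [log_likelihood(X, z, W, sigma) for z in class_emb], axis=1
    )  # (n, k)
    return scores.argmax(axis=1)
```

The point of the sketch is only that, once the model is conditioned on the class embedding, an unseen class can be scored by passing its embedding to `classify`; no classifier is trained on generated samples.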
The idea does sound interesting. I'm not familiar enough with the zero-shot literature to judge how novel it is, but it is interesting in its own right. In the evaluation, it is nice that they notice that using models pre-trained on ImageNet sometimes breaks the zero-shot assumptions, as ImageNet has classes that also appear in other datasets used to evaluate zero-shot learning algorithms. Other than that, Table 2 has a lot of numbers to make sense of. Their model does do better than all of the others on at least one metric in the 4 settings, but I really can't say how fundamentally different all of the other approaches are.