This paper is about a prompting method for language models that makes the model first decompose a problem into a sequence of sub-problems, and then solve each of them in order. This is shown to outperform chain-of-thought prompting on a few datasets: their method essentially solves SCAN, and performs better on GSM8K and on a symbolic manipulation task of concatenating the last letters of arbitrary words in a sentence.
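For concreteness, the two-stage scheme amounts to something like the following sketch, assuming a hypothetical llm(prompt) -> str completion function; the prompt strings here are illustrative placeholders, not the paper's actual few-shot prompts.

```python
def least_to_most(question: str, llm) -> str:
    # Stage 1: ask the model to decompose the problem into sub-problems.
    decomposition = llm(
        f"Q: {question}\n"
        "Break this problem down into a sequence of simpler sub-problems, one per line:"
    )
    subproblems = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve each sub-problem in order, feeding earlier answers
    # back into the context so later sub-problems can build on them.
    context = f"Q: {question}\n"
    answer = ""
    for sub in subproblems:
        answer = llm(f"{context}Sub-question: {sub}\nA:")
        context += f"Sub-question: {sub}\nA: {answer}\n"
    # The answer to the final sub-problem answers the original question.
    return answer
```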
Their main advantage is generalization to examples harder than those in the prompt: their method still degrades (e.g., Table 2), but more slowly than pure chain-of-thought. The SCAN splits, which they essentially fully solve, showcase this.
The prompt for each task certainly plays a huge role in their performance. This is clearly an ``existential'' result: when shown just the right prompt examples, ones that decompose similar tasks in useful ways, current large language models can successfully decompose new (and harder versions of) tasks. But deciding which decomposition is useful for which task is still the prompt designers' job. For example, I would guess that, if asked to propose a decomposition of the task ``take a sequence and concatenate the last letter of each of its words'', the model would hardly output the decomposition that they propose (taking each word, extracting its last letter, and concatenating it to the output so far; see the sketch below). How to generalize in the task of decomposing new tasks is an important problem that these results suggest should be addressed.
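The hand-designed decomposition I am referring to corresponds, in plain Python rather than in prompt form, to something like this:

```python
def last_letter_concat(sentence: str) -> str:
    # The decomposition the prompt designers supply: process one word at a
    # time, extract its last letter, and append it to the output so far.
    output = ""
    for word in sentence.split():
        last = word[-1]          # sub-step: extract this word's last letter
        output = output + last   # sub-step: concatenate to the running output
    return output

# e.g., last_letter_concat("think machine learning") == "keg"
```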