This paper tests DeepMind's code-generating language model, AlphaCode, on programming competition problems. It is the first paper to actually run the real evaluation: submitting solutions to Codeforces and seeing whether they're accepted. And they get some accepts! That's impressive, and would have been hard to believe even 5 years ago. Indeed, some people have tried this before and eventually gave up (and turned into a blockchain startup instead).
To get these results, the technique itself is not particularly clever: they have an encoder-decoder Transformer that takes in the problem statement and some other input data, and outputs the code to solve it. The actual boost that gets it to work is doing as much sampling as possible: they pushed it to millions of samples per problem, by (a) not using a huge model ("just" 1B parameters), (b) changing their Transformer attention mechanism to make sampling cheaper, and (c) massive parallelism.
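The sample-then-filter loop described above can be sketched as follows. This is a toy illustration, not the paper's pipeline: `sample_program` stands in for the model, the candidate pool is hard-coded, and the filter simply runs each candidate on the problem's public example tests.

```python
import random

# Hypothetical stand-in for the model: each "sample" is a candidate
# Python program defining solve(x). In reality AlphaCode samples
# millions of programs from a Transformer.
CANDIDATES = [
    "def solve(x):\n    return x + 1",   # wrong in general
    "def solve(x):\n    return x * 2",   # matches the examples below
    "def solve(x):\n    return x ** 2",  # wrong in general
]

def sample_program(rng):
    return rng.choice(CANDIDATES)

def passes_examples(src, examples):
    """Run a candidate against the public example tests from the statement."""
    ns = {}
    exec(src, ns)  # fine for a toy; real systems sandbox execution
    return all(ns["solve"](inp) == out for inp, out in examples)

def sample_and_filter(examples, n_samples=1000, seed=0):
    """Sample many programs, keep only the distinct ones that pass."""
    rng = random.Random(seed)
    survivors = set()
    for _ in range(n_samples):
        src = sample_program(rng)
        if passes_examples(src, examples):
            survivors.add(src)
    return survivors

# Example tests, as they would appear in the statement: the answer is 2*x.
examples = [(1, 2), (3, 6), (10, 20)]
survivors = sample_and_filter(examples)
```

The filtering step is what makes the brute-force sampling viable: almost all samples are wrong, but the example tests weed most of them out before anything is submitted.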
This is an interesting challenge, and a much more real one than solving the corresponding offline datasets, like APPS. Indeed, while many problems on APPS are taken from Codeforces, it takes much less to "solve" a problem on APPS than to solve it on Codeforces, due to (1) fewer and missing test cases, and (2) the lack of large test cases, which stress the time limit. Without the time limit constraint, all problems on Codeforces are easy: you just translate whatever the statement asks for into code. The difficulty only comes from having to do that efficiently, and the large tests will make sure that a quadratic solution is not accepted when the problem authors expect linear time (it would exceed the limit by a margin of minutes or hours). The AlphaCode team did a nice job of building a dataset with a smaller false-positive rate (CodeContests).
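To make the time-limit point concrete, here is a toy example (mine, not from the paper): range-sum queries on an array. Both versions below are "correct", but on Codeforces-sized tests (say n and q around 10^5) only the prefix-sum version fits the time limit.

```python
def range_sums_quadratic(a, queries):
    """Naive: O(n) work per query, O(n*q) total.
    Would time out on the large hidden tests."""
    return [sum(a[l:r]) for l, r in queries]

def range_sums_linear(a, queries):
    """Prefix sums: O(n + q) total -- what the authors expect.
    prefix[i] holds the sum of a[0:i]."""
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    return [prefix[r] - prefix[l] for l, r in queries]

a = [3, 1, 4, 1, 5, 9]
queries = [(0, 3), (2, 6)]  # half-open ranges [l, r)
# Both functions return identical answers; only the asymptotics differ.
```

A dataset without large tests cannot tell these two apart, which is exactly why APPS overstates how "solved" a problem is.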
That said, I'm rather skeptical that this approach will be consistently solving reasonably hard competition problems anytime soon. There's a big conceptual jump in competition problems as they progress in difficulty: the first ones mostly require you to know English and programming, whereas starting from about the level where AlphaCode stops, you need to infer a lot more than what is given in the problem. Given the results with Codex, I would have believed the current AlphaCode results to be achievable, but I'm still skeptical that this will scale to the next level (of, for instance, achieving the same ranking but in Division 1 of Codeforces). Scale improves things, but at the scale we're already at, we might be reaching the limit of the orders of magnitude we can physically get to (in both data and compute). At that point, we might have to think harder if we want to keep making progress.