This paper introduces a new benchmark for theorem proving, called IsarStep. Here, the goal is to produce relevant sub-goals for proofs in Isabelle/HOL. Thus, the target is still not to produce entire proofs, but rather to predict relevant intermediate sub-goals (which is, indeed, a hard task and one of the major challenges in guiding proof search). They extracted a large dataset of proofs from the Archive of Formal Proofs, a public archive run much like a scientific journal (i.e., people submit formalized proofs of known theorems, which are reviewed and added to the archive; of course, correctness is not at issue, since Isabelle checks that, but rather relevance, cleanliness, etc.).
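To make the task concrete, here is a toy sketch of what "predicting an intermediate step" means. IsarStep itself uses Isabelle/Isar; the example below is mine (written in Lean, assuming Mathlib is available), not taken from the paper's dataset, but the shape of the task is analogous: hide one intermediate proposition in a structured proof and ask a model to generate it from the surrounding context.

```lean
import Mathlib.Tactic  -- assumed available, for `obtain` and `omega`

-- Toy example: the `have` line plays the role of the intermediate proposition
-- that an IsarStep-style task would mask out and ask a model to predict.
theorem even_add_two (n : Nat) (h : 2 ∣ n) : 2 ∣ (n + 2) := by
  obtain ⟨k, hk⟩ := h                            -- hk : n = 2 * k
  have hstep : n + 2 = 2 * (k + 1) := by omega   -- ← the "missing step" a model would have to produce
  exact ⟨k + 1, hstep⟩                            -- witness k + 1 closes the divisibility goal
```

The point is that the surrounding lines constrain the hidden proposition only loosely: the model has to propose a statement that both follows from what came before and makes the rest of the proof go through, which is exactly the kind of judgment that guides proof search.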
This is a cool benchmark, since it focuses on higher-level steps rather than on automating the menial steps in theorem proving. That said, I think automated theorem proving may be suffering from having Too Many Benchmarks (TM), and not many clear challenges that the community is pursuing. In just the last 2-3 years, benchmarks have come out at ICML/ICLR/NeurIPS using Coq (x2), HOL Light, and now Isabelle as the underlying prover, with first-order logic having come before as a more standard problem formulation from the automated reasoning community. Unfortunately, I have yet to see any papers proposing solutions to these newer benchmarks. Hopefully the community will converge on at most a few formulations of the problem that more people find interesting and actually work on, since major breakthroughs in theorem proving could have large impacts in a broad range of fields (program verification, educational applications, and maybe even mathematical practice).