# IsarStep: A Benchmark For High-Level Mathematical Reasoning (@ ICLR 2021)

### Wenda Li, Lei Yu, Yuhuai Wu, Lawrence C. Paulson

This paper introduces a new benchmark for theorem proving, called IsarStep.
The task is to predict a relevant intermediate step (sub-goal) in an
Isabelle/HOL proof. The target is thus still not to produce entire proofs,
but rather to predict relevant proof sub-goals (which is indeed a hard task,
and one of the major challenges in guiding proof search). The authors extracted
a large dataset of proofs from the Archive of Formal Proofs, a public archive
run much like a scientific journal: people submit proofs of known theorems,
which are peer-reviewed and added to the archive. (Correctness is of course
not the issue under review, since Isabelle checks it mechanically; reviewers
assess relevance, cleanliness, and so on.)
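To make the task concrete, here is a schematic sketch (my own, not drawn from the paper's dataset) of a structured Isar proof. The benchmark's idea is that an intermediate `have` statement like the one below is masked, and the model must generate it from the surrounding context:

```isabelle
lemma "(n :: nat) <= m ==> n <= m + 1"
proof -
  assume "n <= m"
  (* The line below is the kind of intermediate statement
     IsarStep asks the model to predict. *)
  moreover have "m <= m + 1" by simp
  ultimately show "n <= m + 1" by (rule le_trans)
qed
```

What makes this hard is that the masked statement is a free-form proposition, not a choice from a fixed set of tactics: the model has to invent a formula that both follows from the facts above it and advances the proof toward the goal.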

This is a cool benchmark since it focuses on higher-level steps rather than
on automating the menial steps of theorem proving. That said, I think
automated theorem proving may be suffering from having Too Many Benchmarks (TM),
with few clear challenges that the community is collectively pursuing.
In the last 2-3 years alone, benchmarks have appeared at ICML/ICLR/NeurIPS
using Coq (x2), HOL Light, and now Isabelle as the underlying prover,
with first-order logic having come before as the more standard problem
formulation from the automated reasoning community. Unfortunately, I haven't
seen any papers that propose solutions to these latter benchmarks.
Hopefully the community will converge on at most a few formulations of the
problem that more people find interesting and actually work on, since major
breakthroughs in theorem proving could have a large impact on a broad range of
fields (program verification, educational applications, and maybe even
mathematical practice).