lf-lean: The frontier of verified software engineering
As AI generates code faster than humans can review it, oversight becomes the bottleneck. We introduce task-level specification, a framework where correctness is defined once for an entire class of software tasks and automatically checked across all instances—reducing human oversight from 𝒪(𝓃) to 𝒪(1). We prototype this with rocq-dove, an environment for verified translation from Rocq to Lean that scales to arbitrary program complexity. Using frontier AI, we produced lf-lean: a verified translation of all 1,276 statements in the Logical Foundations textbook. Models autonomously solved 97% of statements; our total human-in-the-loop effort on the remaining problems was ~2 person-days, versus an estimated ~2.75 person-years if we did the whole project manually. These results suggest that verified software engineering, long considered impractical, may become more scalable than unverified software engineering in the era of AI-led development.
lf-lean gives us our first measurement of where verified software engineering capability actually is, and the early signal is surprisingly encouraging. See Appendix B for measurement methodology.
Introduction
As AI models automate increasingly complex software tasks, a fundamental tension emerges: how do we know the code they produce is correct?
The standard approach of reviewing AI-generated code and its tests doesn’t scale. Human review effort grows linearly with code volume, while AI code generation capacity grows exponentially. 1 If this trend continues, we’re headed toward a world where deployment is bottlenecked on human review capacity, or code is deployed without review.
This post introduces a different approach: task-level specification. Instead of reviewing each piece of generated code individually, we review a correctness specification once for an entire class of tasks, and the correctness specification covers all instances of the task across any codebase. The key insight is that for many important software transformations (translation, optimization, refactoring), we can define what “correct” means generically, without reference to any specific codebase. This helps scale oversight in two ways: (1) you can easily work on many more codebases simultaneously, and (2) what works on simple codebases extends to arbitrarily complex ones.
We prototype task-level specification with rocq-dove, a scalable environment for verified translation2 from Rocq3 to Lean.
Given any Rocq source code, rocq-dove automatically generates a correctness specification (a theorem that any valid translation must satisfy) and grades model-generated translations and proofs against it.
Because the specification is derived from the source code itself, rocq-dove scales to arbitrary programs without additional human effort.
We use rocq-dove to generate an environment for a verified translation of all 1,276 statements4 in Logical Foundations5 [1].
Using our scaffold, frontier AI models6 autonomously produced verified translations of 97% of the statements.7
We solved 6 extreme-difficulty statements manually (~2 person-days), after which models completed the remaining 2.5%.
For context, we estimate the full task would have taken 3 person-months for translation plus 2.5 person-years for verification if done entirely by humans.8
Two key world-view updates:
- Verified software engineering can scale without marginal human oversight. rocq-dove automatically generated correctness specifications for the 1,276 translations in lf-lean. No human judgment was needed to define “correct” for each case. Combined with rich feedback from the proof assistant, this creates a dense reward signal that lets AI iterate toward correct solutions autonomously.
- AI capabilities on software verification are advancing faster than expected. We developed lf-lean, the first sufficiently complex evaluation to demonstrate this. Compared to METR’s time horizon trend [2] on unverified software engineering, verified software engineering appears to be catching up.
The implication is that software verification, long considered too expensive for practical software development, may become the more scalable approach in an era of AI-assisted programming. When humans can’t review all the code AI produces, having machines verify correctness automatically becomes not just nice to have, but necessary.
In what follows, we explain task-level specification, the conceptual framework that lets correctness checking scale without marginal human effort.
Then we describe rocq-dove, our prototype environment for translating from Rocq to Lean, and present our verified translation lf-lean.
Finally, we discuss why verified software engineering is becoming practical, and why model performance on it is becoming easy to scale up.
Task-Level Specification
Figure 2: Constant oversight for arbitrarily scaling codebases. Human oversight is \(\mathcal{O}(1)\)—define and verify the task-level specification once—while verified task completions scale to \(\mathcal{O}(n)\).
The scaling problem with AI coding is that humans can only review so much. As models generate more code more quickly, human oversight becomes the bottleneck.
One response is to make review faster—better tooling, smarter sampling, and more efficient workflows. This can mitigate the immediate issue, but doesn’t change the fundamental scaling problem: a human must understand each codebase and write codebase-specific verification. Oversight effort remains linear: \(\mathcal{O}(n)\) where \(n\) is the number of tasks.
The other response is to make review amortizable. For certain task classes \(\mathcal{T}\) (e.g., translation, optimization, parallelization, refactoring), correctness can be defined generically, without reference to individual task details. A human reviews the correctness specification for \(\mathcal{T}\) once; the system then automatically verifies any instance. Human effort becomes \(\mathcal{O}(1)\) with respect to codebase count. Call this task-level specification.
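To make this concrete, here is a minimal Lean sketch of the interface a task-level specification provides; the structure and field names are our own illustration, not rocq-dove's actual API. The point is the division of labor: spec and its soundness argument are reviewed by humans once per task class, while check runs automatically on every instance.

```lean
-- A hedged, abstract sketch of a task-level specification (names are ours).
structure TaskLevelSpec (Src Output Certificate : Type) where
  -- Defined and reviewed by humans once for the whole task class.
  spec : Src → Output → Prop
  -- Run automatically on every (source, output, certificate) instance.
  check : Src → Output → Certificate → Bool
  -- Reviewed once: if the automatic check passes, the specification holds.
  check_sound : ∀ s o c, check s o c = true → spec s o
```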
When building a new toolchain, the ideal problem is useful, simple, and representative. We started with translation, which hits all three: there is real demand for porting codebases between languages [3], it is the simplest semantics-preserving transformation, and it is paradigmatic of the general challenge of program verification—establishing semantic equivalence. Semantic equivalence often looks the same regardless of the exact specification or implementation, and verified translation captures this core difficulty, making it an effective task for training on program verification.
rocq-dove: Task-Level Specification for Translating from Rocq to Lean
We developed a prototype of task-level specification for the task of verified software translation from Rocq to Lean.
Unlike a vanilla software translation which yields only a translated version of the code, a verified translation produces a (translation, proof) pair. The proof then serves as a guarantee that the translation is semantically equivalent to the source. Even if the translation is correct, the pair is rejected if the proof does not verify its correctness.
In this section, we describe the environment setup used for every codebase and the details of how we build typewise isomorphism, the technique we developed for task-level specification.
Environment
We describe the general environment for verified translation from Rocq to Lean.
Figure 3: Pipeline structure for verified translation. The task-level specification is applied to the Rocq source and provided to the AI agent, which produces a Lean translation and Rocq proof. The grader validates the output above the trust boundary.
- Specification Generation: rocq-dove derives a theorem statement \(\texttt{Thm}_{\texttt{Rocq}}\) directly from the Rocq source \(\texttt{Src}_{\texttt{Rocq}}\). The theorem states that the translation is typewise isomorphic to the source; we describe nuances of the definition in the Typewise Isomorphism section.
- Translation: The AI agent translates \(\texttt{Src}_{\texttt{Rocq}}\) into \(\texttt{Target}_{\texttt{Lean}}\) using \(\texttt{Thm}_{\texttt{Rocq}}\) as a guide (a toy instance is sketched after this list).
- Round-Trip Compilation: \(\texttt{Target}_{\texttt{Lean}}\) is compiled back into Rocq using rocq-lean-import9 [4], yielding \(\texttt{RoundTrip}_{\texttt{Rocq}}\). This step is necessary because the correctness specification \(\texttt{Thm}_{\texttt{Rocq}}\) is stated as an equivalence between two Rocq terms.10
- Proof Generation: The AI agent constructs a proof \(\texttt{Proof}_{\texttt{Rocq}}\) of \(\texttt{Thm}_{\texttt{Rocq}}\) demonstrating the equivalence \(\texttt{Src}_{\texttt{Rocq}} \cong \texttt{RoundTrip}_{\texttt{Rocq}}\).
- Grading: The rocq-dove grader checks that the proofs have three properties: (1) they are valid, (2) they adhere to the specified theorem statements, and (3) they avoid relying on additional unsupported assumptions.11 If checking fails, the agent iterates on the translation or the proof until it passes.
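As a toy instance of the pipeline above (our own example; color and invert are hypothetical and not drawn from Logical Foundations): given a small Rocq source, the agent produces a Lean target, which is then re-imported into Rocq and proven typewise isomorphic to the original.

```lean
/-
Hypothetical Rocq source (Src_Rocq):

  Inductive color : Type := Red | Green | Blue.
  Definition invert (c : color) : color :=
    match c with Red => Blue | Green => Green | Blue => Red end.

A Lean target (Target_Lean) the agent might produce is below. rocq-lean-import
then re-imports Color and Color.invert into Rocq (RoundTrip_Rocq), and the agent
proves them typewise isomorphic to color and invert.
-/
inductive Color : Type
  | red | green | blue

def Color.invert : Color → Color
  | .red   => .blue
  | .green => .green
  | .blue  => .red
```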
Typewise Isomorphism
Since Rocq and Lean differ slightly in behavior, we require a notion of “equivalence” that’s neither too strict nor too loose. We cannot use strict equality, as writing the same source code for a datatype definition in different files results in datatypes that are not technically “equal”. As such, we define typewise isomorphism as follows:
Dealing with Types and Values
For datatypes, the standard relaxation of equality12 is isomorphism: \(A\) and \(B\) are isomorphic (\(A \cong B\)) when we have functions \(\text{to} : A \to B\) and \(\text{from} : B \to A\) such that \(\text{to} \circ \text{from} = \text{id}\) and \(\text{from} \circ \text{to} = \text{id}\).13 We can then define equivalence (“valid translation”) for values via transport:
\[a \cong b := \text{iso.to}(a) = b\]
where \(a : A\), \(b : B\), and \(\text{iso} : A \cong B\).
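A minimal Lean sketch of these two definitions (names are ours; we write inv for the document's from, since from is a Lean keyword, and lf-lean's actual isomorphism proofs are carried out on the Rocq side):

```lean
-- Datatype-level isomorphism: invertible maps in both directions.
structure Iso (A B : Type) where
  to     : A → B
  inv    : B → A
  to_inv : ∀ b, to (inv b) = b
  inv_to : ∀ a, inv (to a) = a

-- Value-level equivalence via transport: a translates to b
-- exactly when the isomorphism maps a to b.
def valEquiv {A B : Type} (iso : Iso A B) (a : A) (b : B) : Prop :=
  iso.to a = b

-- The identity isomorphism relates every value to itself.
def Iso.refl (A : Type) : Iso A A :=
  { to := id, inv := id, to_inv := fun _ => rfl, inv_to := fun _ => rfl }

example : valEquiv (Iso.refl Nat) 3 3 := rfl
```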
The Problem with Propositions
Unfortunately, this isn’t enough. Since isomorphism is defined as a reversible mapping, all true propositions are isomorphic to one another. For example, \(1 = 1\) and \(2 = 2\) are both true; in a proof-irrelevant setting like Lean, simply mapping any proof of one to any proof of the other yields an isomorphism. We would not want to say that “True” is a valid translation of “my C++ compiler is correct” just because both statements are provable.
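In Lean terms (our tiny example): both propositions are provable, so maps in each direction exist trivially, and isomorphism alone cannot tell them apart.

```lean
-- Any two provable propositions are trivially interderivable.
example : (1 = 1) ↔ (2 = 2) :=
  ⟨fun _ => rfl, fun _ => rfl⟩
```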
Dealing with Type-Level Functions
We need to instead enforce that the types line up structurally, which we can do by relationally lifting isomorphism along type-level functions. For example, consider translation of a type-level function \(F : \texttt{Type} \to \texttt{Type} \to \texttt{Type}\). Instead of saying that a valid translation of the type \(F(A, B)\) is any type \(X\) with \(F(A, B) \cong X\), we say that a valid translation of \(F(A, B)\) is any \(F'(A', B')\) satisfying \(A \cong A'\), \(B \cong B'\), and for all \(C\), \(C'\), \(D\), and \(D'\),14
\[(C \cong C') \to (D \cong D') \to (F(C, D) \cong F'(C', D'))\]
Dealing with Polymorphic Functions
Sometimes, we must apply the appropriate kind of equivalence based on what the object in question is.
For example, consider the function if_then_else_ : ∀ {ρ}, bool → ρ → ρ → ρ that returns its first or second argument of type ρ depending on whether its boolean argument is true or false.
Then the relation if_then_else_ ≅ if_then_else_' would involve = if ρ is int, but would involve ≅ if ρ is Type instead.
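To make the relational lifting concrete, here is a Lean sketch for the special case F = List (our own example; the Iso structure from the earlier sketch is repeated so the block stands alone). An isomorphism on element types lifts structurally to the applied types, rather than accepting any type that merely happens to be in bijection with List A.

```lean
-- Relational lifting of isomorphism along a type-level function, here F = List.
structure Iso (A B : Type) where
  to     : A → B
  inv    : B → A
  to_inv : ∀ b, to (inv b) = b
  inv_to : ∀ a, inv (to a) = a

-- Given A ≅ A', build List A ≅ List A' by mapping elementwise.
def Iso.list {A A' : Type} (h : Iso A A') : Iso (List A) (List A') where
  to  := List.map h.to
  inv := List.map h.inv
  to_inv := by
    intro l
    induction l with
    | nil => rfl
    | cons x xs ih => simp [List.map, h.to_inv, ih]
  inv_to := by
    intro l
    induction l with
    | nil => rfl
    | cons x xs ih => simp [List.map, h.inv_to, ih]
```

The same discipline, applied when instantiating polymorphic functions, is what selects = at int and ≅ at Type in the if_then_else_ example above.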
Prop vs. SProp and Universes
Most projects in Rocq prove theorems whose types are Props, meaning that we cannot prove that all proofs of these theorems are equal.
However, because Lean only contains proof-irrelevant propositions, rocq-lean-import maps Lean’s Prop to Rocq’s SProp, which is like Prop but with definitional proof irrelevance.
Luckily, axiomatizing Prop = SProp is consistent with program verification projects, and sufficient to deal with such issues.
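Proof irrelevance in Lean is definitional, which the following one-line example (ours, not part of lf-lean) makes concrete:

```lean
-- Any two proofs of the same proposition are definitionally equal in Lean.
example (P : Prop) (h₁ h₂ : P) : h₁ = h₂ := rfl
```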
There are also subtleties around universe levels and cumulativity which we elide in this post.
Making lf-lean
We’re open-sourcing lf-lean, a verified translation from Rocq to Lean of the 1,276 statements15 in Logical Foundations [1] (Software Foundations, Volume 1) by Benjamin C. Pierce et al., an interactive programming languages (PL) textbook teaching students the basics of functional programming, type theory, and formal program reasoning.
While there has been tremendous progress in AI and theorem proving focused on mathematics [6, 7, 8], our focus is on reasoning about the correctness of software: programs involving data structures, interpreters, and operational semantics.16 With the rising demand for Lean, we hope that our translation will be valuable to PL students. Moreover, since we worked on statement translation, not proof translation, the Lean statements serve as a new benchmark for evaluating AIs.
Handling Program Complexity
The statements in Logical Foundations are not independent.
The textbook comprises 17 modules that form a deep dependency graph, with chains reaching up to 61 total dependencies.
Over 90% of statements depend on at least one earlier statement.
To our knowledge, this makes lf-lean the first AI software verification result that includes software with dependencies.
A natural but brittle evaluation setup is “monolithic translation”: ask the model to translate and verify a target statement plus its entire dependency tree in one pass. This approach is impractical, and causes performance to collapse. We estimate we would see <20% success under a monolithic setup, primarily because the required context (definitions + prior theorems + proof structure) quickly exceeds what any model can reliably hold and manipulate.17
The fix mirrors how large software systems stay manageable: modular decomposition. Instead of forcing the model to re-derive the whole dependency tree, we treat previously verified dependencies as trusted interfaces. We provide the model with dependency files that have already been translated and verified. Then the model only needs to translate and verify the marginal statement (the new component) against its interface. The proof assistant enforces that interfaces between components match exactly; if each component verifies, the whole dependency graph is correct by construction. 18
This is just standard software engineering practice applied to verification: simple components, low coupling between them, and cohesive proof obligations that can be discharged independently.19
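As a toy illustration (the definition and theorem are, up to naming, from LF's Basics chapter; the file split and the proof are ours and only illustrative, since lf-lean translates statements rather than proofs): the already-verified dependency is consumed as an ordinary Lean interface, and only the marginal statement is new work.

```lean
-- Already translated and verified in an earlier pass: a trusted interface.
namespace Basics
def negb : Bool → Bool
  | true  => false
  | false => true
end Basics

-- The marginal statement translated now. It uses Basics.negb only through its
-- interface; the dependency's isomorphism proof is not re-derived here.
theorem negb_involutive (b : Bool) : Basics.negb (Basics.negb b) = b := by
  cases b <;> rfl
```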
Measuring Difficulty
While dependency depth captures difficulty well, Lines of Code (LoC) is a more universal unit for measuring difficulty. Although not perfect, LoC has many nice properties: it is objective and easy to measure, and on model-generated code it is strongly correlated with token count. Higher LoC also consumes more of the context window and introduces more opportunities for mistakes.
In our setup with modular decomposition, we adapt the LoC metric to marginal LoC. Marginal LoC counts only the portion that the model has to understand and translate to get a statement correct, given that its dependency files have already been translated and verified.
Results
Without any human intervention, the models autonomously generated verified translations of 97% (1,237 / 1,276) of the theorem statements. The remaining 39 statements were blocked by just 6 “extreme” difficulty statements (0.5% of the total). These 6 required manual solving, taking approximately 15 hours of human effort. After providing those six solutions, models autonomously completed the remaining 33 statements, achieving 100% coverage of the benchmark.
To put this in perspective: we estimate that manually translating Logical Foundations would have taken 3 person-months, and formally proving the correctness of those translations would have taken an additional 2.5 person-years. Instead, in this workflow, the human effort required was ~15 hours, a 350×+ speedup on verification effort alone.
We did not run clean evaluations specifically for difficulty annotations. Instead, our easy, medium, hard, and extreme ratings are based on how much inference compute was necessary to solve each statement. Concretely, we considered (1) how many best-of-k runs were needed to obtain a successful verified translation, and (2) how many different evaluation setups we had to try before hitting these numbers. Extreme problems were solved by a human.
The Case for Scaling Verified Software Engineering
Conventional wisdom is that software verification is infeasible. But infeasible has always meant too expensive, not impossible. The cost is measured in LoC: verification artifacts—specifications, proofs, and auxiliary infrastructure—typically dwarf the implementation itself. Across landmark projects from seL4 [9] to CompCert [10], the median productivity is roughly 2,558 verified LoC per person-year (Appendix C), and our own translation required approximately 215k lines of isomorphism proofs to verify 6k lines of Rocq statements (Appendix A). Verification has never failed because it is conceptually hard; it has failed because it multiplies the volume of code that humans must write and maintain.
We are now entering a regime with a new calculus of feasibility.
LoC generated are becoming effectively free, while human review is the bottleneck.
In this regime, verified software engineering may actually be more scalable than unverified software engineering, because the oversight itself is automated.
Models are not yet good enough at verification to fully realize this vision.
But lf-lean demonstrates that the gap is closing.
The oversight benefit is realizable by scaling RL on software verification. Most RL environments for code generation rely on programmatic rewards or model grading. Programmatically written unit tests provide sparse feedback, because test coverage isn’t meaningfully correlated with bugs found when controlling for codebase size and complexity (Appendix D).20 And model-graded rewards are not robust to Goodharting or to spurious correlations that cause catastrophic out-of-distribution generalization.
Given the appetite for scaling software complexity, the only correctness feedback that scales with it is feedback that itself leverages model capability.
In rocq-dove, we leverage model-generated proofs, guided by the rich feedback of proof assistants.
Moreover, by building task-level specifications, we significantly amortize the investment in task specification, reusing the same environment across arbitrarily many codebases.
We are optimistic that this type of RL environment—dense signal, unbounded task complexity, scalable reuse—can drive rapid capability gains in verified software engineering.
Appendix
A. lf-lean by the Numbers
Logical Foundations contains approximately 6k lines of Rocq statements. On average, each statement has 6.4 dependencies with a total of 24.8 LoC (61 dependencies with a total of 382 LoC at maximum). We translated this to approximately 25k lines of Lean code21 (≈ 3 person-months of work22) and about 215k lines of Rocq isomorphism proofs (≈ 2.5 person-years of work). Our models produced 1.5M lines of Lean and 7M lines of Rocq proofs in the process of doing the translation.
Our environment, rocq-dove, ensures that all isomorphism proof tasks are essentially independent and parallelizable, having interdependencies only through the constraints imposed on the Lean translation.
As a result, while the time horizon on verified software engineering ranges from 5 hours (METR) to 3 person-months (lf-lean), the time horizon on the verification proofs themselves is best measured in years.
B. Time Horizon Methodology
If AI capabilities on unverified programming continue to outpace capabilities on verified programming, then humans become the bottleneck (someone has to review all that unverified code).
We adapt the METR Time Horizon plot to include AlphaVerus [13], FVAPPS [14], VerinaBench [15], and lf-lean, which are benchmarks where AIs succeed at not just program implementation, but also program verification.
In Figure 1, circles represent AI model releases evaluated on unverified software engineering tasks (METR’s time horizon methodology). Diamonds represent software verification benchmarks where AIs must both implement and formally prove correctness.
We conducted human time horizon baselines for just the program implementation component, since we’re graphing the amount of human labor that can be automated without any oversight:
- On AlphaVerus, FVAPPS, and VerinaBench, we conducted baselines on the hardest problems. We proxy hardness by LoC, recommendations from the authors of FVAPPS and VerinaBench, and our own judgment.
- On lf-lean, we conducted baselines for the average problem.
We note that the comparison still slightly overestimates the feasibility of verified engineering, since humans are slower at programming in languages amenable to verification.
We used OLS on the three preexisting verification benchmarks to fit the y-intercept of METR’s time horizon exponential curve to get the lower curve.
We used OLS to fit an exponential to the four verification benchmarks (including lf-lean) to get the improved curve.
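In symbols (our notation, a hedged restatement of the two fits; the exponential form follows METR's time-horizon trend):

\[
H(t) = H_0 \cdot 2^{(t - t_0)/\tau}
\]

where \(t\) is model release date, \(H(t)\) is the time horizon, and \(\tau\) is the doubling time. For the lower curve, \(\tau\) is fixed to METR's unverified-software-engineering fit and OLS over the three preexisting verification benchmarks fits \(H_0\) only; for the improved curve, OLS over all four verification benchmarks (including lf-lean) fits both \(H_0\) and \(\tau\).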
C. Human Productivity in Software Verification
To estimate how long the lf-lean project would have taken a human team, we need reliable baselines for two distinct activities: writing translation code and writing formal proofs of correctness.
Translation productivity. For the code-writing component (translating Rocq definitions and theorem statements into Lean), we draw on vanilla software engineering productivity estimates. Published industry figures typically place productive output at 50–100 kLoC/person-year for a developer working in a familiar language and codebase (a range that likely already reflects AI-assisted workflows). We calibrated against internal time trials on lf-lean statements and cross-referenced with these published figures, adopting the upper end of the range (100 kLoC/year) to produce a conservative lower bound on human time. Since our Lean translation comprises approximately 25 kLoC, this yields an estimate of roughly 3 person-months for translation alone. We note this likely underestimates the true effort: Lean is less widely known than mainstream languages, the translation requires understanding both Rocq and Lean type theory, and the translator must make non-trivial design decisions about how to represent Rocq idioms in Lean.
Verification productivity. For the proof-writing component, no single industry benchmark exists, so we surveyed representative large-scale mechanized verification projects to establish a historical baseline. Projects were included if they:
- reported non-trivial mechanized verification artifacts (e.g., in Rocq, Isabelle/HOL, F*, or Dafny);
- provided sufficient quantitative data in peer-reviewed papers, technical reports, or authoritative project documentation; and
- corresponded to a clearly identifiable verification effort rather than tooling-only or purely theoretical work.
Multiple entries for the same codebase (e.g., original vs. extended CompCert) were treated as distinct data points when reported separately in the literature.
Across these projects, we measured verified LoC per person-year—that is, how many LoC of implementation a team could formally verify per person-year of combined effort on code, specifications, and proofs. The resulting median is approximately 2,558 LoC/person-year, reflecting the PhD-level expertise and intensive manual effort traditionally required. Canonical examples include seL4 [9], the first formally verified OS microkernel, and CompCert [10], a verified optimizing C compiler (see Figure 7). We note that the surveyed projects vary in domain and proof assistant, introducing additional uncertainty into this median estimate.
Applying these baselines to lf-lean. The lf-lean release includes approximately 215 kLoC of Rocq isomorphism proofs covering the 1,276 translated statements (roughly 6 kLoC of Rocq). At the historical median rate of ~2,558 verified LoC/person-year, verifying those statements would require roughly 2.5 person-years of effort. Combined with the ~3 person-months for translation, we estimate the full verified translation would have taken approximately 2.75 person-years. In contrast, our actual human effort was approximately 15 hours (2 person-days), spent on 6 extreme-difficulty statements that the models could not solve autonomously. This represents a speedup of over 350× on the verification effort and roughly 30× on the translation effort alone.
D. Unit Testing Is Insufficient for Rigorous Analysis
Frontier software engineering typically results in multiplicative bug growth, and unit tests are insufficient because many bugs are subtle, interaction-driven, and grow exponentially rather than linearly in frequency. Even as test suites get larger, there is a low to moderate correlation between bugs found and code coverage with unit tests [16]. There is no correlation between code coverage and bugs found when controlling for codebase size and complexity [17].
More anecdotally, Google’s OSS-Fuzz [18] tool continuously tests popular and well-maintained open source libraries and has found thousands of bugs and vulnerabilities.
E. Compositional Verification
In a monolithic approach, verifying a program \(P\) requires fitting the entire codebase into a single context and reasoning about it all at once. As program size grows, this becomes intractable: verification effort scales with total complexity and the required context exceeds any model’s window. We call this monolithic verification.
For programs with modular structure, correctness can be established piecewise. Decompose \(P\) into components \(c_1, c_2, \ldots, c_k\) with explicit interfaces. Verify each \(c_i\) independently; the proof assistant checks that interfaces between dependent components match exactly. If all components verify, the whole is correct by construction. This bounds verification context to \(\mathcal{O}(\max_i |c_i|)\) rather than \(\mathcal{O}(|P|)\). We call this compositional verification. We move from verifying whole programs to verifying components against interfaces.
Classical heuristics for well-designed software map directly onto proof tractability. Let \(\text{LoC}_{\text{proof}}(c)\) denote lines of proof required for component \(c\):
- Simplicity → low \(\text{LoC}_{\text{proof}}(c_i)\) for each component
- Low coupling → proofs for \(c_i\) and \(c_j\) proceed independently
- High cohesion → unrelated functionality has disjoint proof obligations
In lf-lean, 90% of statements depend on at least one prior statement.
Rather than requiring the model to verify the full dependency tree simultaneously, we provide previously verified dependencies as trusted interfaces.
The task reduces to checking that the new component matches the existing interface, compositionally building toward the full solution.
Acknowledgements
We are grateful to Jake Mendel for encouraging us to work on hard problems.
This work builds on impressive projects and ideas. Gaëtan Gilbert made rocq-lean-import, and Benjamin Pierce wrote the Software Foundations textbooks to educate generations of PL students. Kaiyu Yang and Quinn Dougherty generously provided hard problems from VerinaBench and FVAPPS, respectively, for our time horizon baseline. Thomas Kwa helped us understand the METR result methodology.
Author Contributions
Jason led the project, and developed the specification generator and grader in rocq-dove.
Vishesh and Jeffrey co-developed the AI agent harness; Vishesh contributed to the interactive visualizer, and Jeffrey contributed to the blog post.
Harshikaa built infrastructure for analyzing agent performance and contributed to the AI agent harness.
Vaidehi designed diagrams and collected data for the appendix baselines.
Robert developed the human-generated solutions for extreme-difficulty statements, and developed the time horizon study.
Twm, Shiki, and Jacob extracted the initial dataset of Rocq problems from the Software Foundations textbook.
Lawrence advised on blog post writing and presentation of the results.
Rajashree co-led the project, and developed the task-level specification framework and wrote the blog post.
-
Moreover, frontier software engineering typically results in multiplicative bug growth: many bugs are subtle, interaction-driven, and grow super-linearly in frequency with code complexity. See Appendix D. ↩︎
-
A verified translation is a Lean translation and a proof of correctness. ↩︎
-
Formerly known as Coq. ↩︎
-
Each statement is a logical proposition or a claim that can be proven to be true. ↩︎
-
Logical Foundations is an interactive textbook covering functional programming, type theory, and formal program reasoning. ↩︎
-
We use a mixture of frontier models as AI agents, given that they have somewhat orthogonal competencies at software verification. ↩︎
-
While all statements of Logical Foundations are likely in the training data, proving that a translation is correct is a different task than proving the original statement. ↩︎
-
Translation estimates are based on industry SWE productivity data; verification estimates use the historical median of ~2,558 verified LoC/person-year from landmark projects. See Appendix C for details. ↩︎
-
Lean’s compiled .olean files can be exported into a textual format with lean4export [5]. rocq-lean-import is a tool that parses this textual format and constructs equivalent definitions in Rocq, preserving the type structure and allowing Rocq to reason about Lean code. The reason we have a tool that directly translates from Lean to Rocq (lean4export + rocq-lean-import) and don’t have a tool in the other direction is a bit of a historical accident: Lean was designed to have a small core with a very powerful elaborator, while Rocq is more permissive in the sort of type theory it supports. As a result, the two most complicated features of type theory are explicit and simple in Lean but implemented in a more complex way in Rocq. (Lean elaborates recursive functions into eliminators and avoids universe polymorphism, while Rocq has a complicated guard checker for recursive functions and supports implicit universe polymorphism with cumulativity.) ↩︎
-
You might wonder why rocq-lean-import doesn’t trivialize the problem. The reason is that the slight differences between Lean and Rocq impose constraints on the translation that rule out a trivial text-to-text source mapping. Moreover, these same differences force the isomorphism proofs to reason about program structure recursively rather than syntactically, which is what turns a trivial-at-first-sight task into something that is actually representative of program verification. ↩︎
-
Technically the grader also checks a fourth property that is usually not relevant but is sometimes essential: the Lean translation should not inconsistently fill in details that are deliberately held abstract in the Rocq source. As an example, if the Rocq source axiomatizes the real numbers and their properties, the AI should not be permitted to define the type of real numbers to be the empty set while still axiomatizing that the reals are a (nonempty) complete ordered field. If we allowed this, the AI could leverage the resulting contradiction to trivially prove all isomorphisms. ↩︎
-
Note that here we mean the intensional propositional equality of type theory. ↩︎
-
Note that we are defining isomorphism like category theorists. All functions definable in type theory are structure-preserving. ↩︎
-
Note that arrows (\(\to\)) in type theory are always right associative; \(A \to B \to C\) means \(A \to (B \to C)\). This convention is the one that plays well with currying: \((A \to (B \to C)) \cong ((A \times B) \to C)\). ↩︎
-
For theorems closed with Qed (rather than Defined), we include only the statements of theorems and not the proofs. This is safe because Rocq is set up so that changing the code between Proof. and Qed. cannot change the behavior of the rest of the codebase, with a couple of minor exceptions, of which universe constraints are the most significant. Program verification generally does not care about universe constraints beyond avoiding inconsistency. ↩︎
-
Software verification differs from formal mathematics in important ways: abstractions are more porous, requiring reasoning thoroughly about programs with significantly more dependencies. The proof methods required are also different—brute-force case analysis is more commonly used, and there is a much tighter tie between theorem statement, proof generation methodology, and proof. See Appendix A for detailed metrics on the dataset. ↩︎
-
Empirically we found that rollouts where a model solved more dependencies were exponentially rarer, with a ≈3× reduction in successful problem count for every additional solved dependency. This was largely independent of the number of dependencies the problem had. We have not encountered a single rollout where a model solved more than 9 dependencies in one go. ↩︎
-
This bounds verification context to \(\mathcal{O}(\max_i |c_i|)\) rather than \(\mathcal{O}(|P|)\). We call this compositional verification. See Appendix E. ↩︎
-
These heuristics map directly to proof tractability: low coupling allows proofs to proceed independently, while high cohesion ensures unrelated functionality has disjoint proof obligations. See Appendix E. ↩︎
-
Many bugs are subtle, interaction-driven, and grow exponentially rather than linearly in frequency. Generating larger test suites is not enough since there is a low to moderate correlation between bugs found and code coverage. See Appendix D. ↩︎
-
Our published dataset contains 59,868 lines of Lean spanning 3018 total translations of 1276 distinct statements. Multiplying the total LoC by 1276/3018 yields our estimate of 25k lines of Lean that would result from deduplication. We hypothesize that the 4× blowup results from two incentives in our translation process: (1) the model had an easier time regenerating rather than reusing standard library definitions, whose counts are not included in the LF source; and (2) the model was often incentivized to use longer names and more awkward constructions to exactly match with the Rocq source. ↩︎
-
~3 person-months for translation (assuming 100 kLoC/person-year) plus ~2.5 person-years for verification (at the historical median of ~2,558 verified LoC/person-year). See Appendix C for the full derivation and caveats. ↩︎
References
- [1] Pierce, B. C. et al. Software Foundations, Volume 1: Logical Foundations. softwarefoundations.cis.upenn.edu, 2024.
- [2] METR. Measuring AI Ability to Complete Long Tasks. METR Blog, 2025.
- [3] Fleishman, G. It's COBOL all the way down. Increment, 2018.
- [4] Gilbert, G. rocq-lean-import: Import Lean exported files into Rocq. github.com/rocq-community/rocq-lean-import, 2025.
- [5] Ullrich, S. et al. lean4export: Plain-text declaration export for Lean 4. github.com/leanprover/lean4export, 2025.
- [6] AlphaProof and AlphaGeometry Teams, Google DeepMind. AI achieves silver-medal standard solving International Mathematical Olympiad problems. Google DeepMind Blog, 2024.
- [7] Achim, T. et al. Aristotle: IMO-level Automated Theorem Proving. arXiv:2510.01346, 2025.
- [8] AxiomMath. AxiomProver at Putnam 2025. github.com/AxiomMath/putnam2025, 2025.
- [9] Klein, G. et al. seL4: Formal Verification of an OS Kernel. SOSP, 2009.
- [10] Leroy, X. Formal verification of a realistic compiler. Communications of the ACM, 2009.
- [11] Gu, R. et al. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. OSDI, 2016.
- [12] Zinzindohoué, J. K. et al. HACL*: A Verified Modern Cryptographic Library. CCS, 2017.
- [13] Aggarwal, P., Parno, B., & Welleck, S. AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement. ICLR, 2025.
- [14] Dougherty, Q. & Mehta, R. Proving the Coding Interview: A Benchmark for Formally Verified Code Generation. LLM4Code Workshop, ICSE, 2025.
- [15] Sun, Y. et al. Verina: Benchmarking Verifiable Code Generation. arXiv:2505.23135, 2025.
- [16] Inozemtseva, L. & Holmes, R. Coverage is not strongly correlated with test suite effectiveness. ICSE, 2014.
- [17] Kochhar, P. S. et al. Practitioners' Expectations on Automated Fault Localization. ISSTA, 2017.
- [18] Ding, Z. Y. & Le Goues, C. An Empirical Study of OSS-Fuzz Bugs. MSR, 2021.
Please cite as:
“lf-lean: The frontier of verified software engineering.” Theorem Blog, February 2026. https://theorem.dev/blog/lf-lean/
BibTeX:
@article{lf-lean,
author = {Jason Gross and Vishesh Saraswat and Jeffrey Chang and Harshikaa Agrawal and Vaidehi Agarwalla and Robert Zhang and Twm Stone and Jacob Green and Shiki Vaahan and Lawrence Chan and Rajashree Agrawal},
title = {\texttt{lf-lean}: The frontier of verified software engineering},
journal = {Theorem Blog},
year = {2026},
month = {feb},
url = {https://theorem.dev/blog/lf-lean/}
}