Blog

lf-lean: The frontier of verified software engineering

As AI generates code faster than humans can review it, oversight becomes the bottleneck. We introduce task-level specification, a framework where correctness is defined once for an entire class of software tasks and automatically checked across all instances—reducing human oversight from O(n) to O(1) in the number of instances. We prototype this with rocq-dove, an environment for verified translation from Rocq to Lean that scales to arbitrary program complexity. Using frontier AI, we produced lf-lean: a verified translation of all 1,276 statements in the Logical Foundations textbook. Models autonomously solved 97% of statements; our total human-in-the-loop effort on the remaining problems was ~2 person-days, versus an estimated ~2.75 person-years if we had done the whole project manually. These results suggest that verified software engineering, long considered impractical, may become more scalable than unverified software engineering in the era of AI-led development.
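To illustrate the kind of translation involved (a hypothetical pair chosen for illustration, not drawn from lf-lean itself): a Logical Foundations-style Rocq statement such as `Theorem add_comm : forall n m : nat, n + m = m + n.` would correspond to a Lean statement like:

```lean
-- Hypothetical Lean counterpart of the Rocq statement
-- `Theorem add_comm : forall n m : nat, n + m = m + n.`
theorem add_comm' (n m : Nat) : n + m = m + n := by
  omega
```

The verification burden then shifts from reviewing each translated statement by hand to checking, once per task class, that the Lean statement faithfully mirrors the Rocq original.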

Systematically generating tests that would have caught Anthropic’s top‑K bug

We introduce fractional proof decomposition, a technique for scaling testing compute logarithmically, instead of linearly, with bug rarity. We achieve this efficiency by fusing partial evaluation and property-based testing.
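Fractional proof decomposition itself is not spelled out in this summary, but one of its two named ingredients, property-based testing, can be sketched in a few lines. The example below is a minimal illustration, not the paper's method: it checks a hypothetical `top_k` function (echoing the bug in the title) against a correctness property on randomly generated inputs.

```python
import random

def top_k(xs, k):
    """Return the k largest elements of xs in descending order."""
    return sorted(xs, reverse=True)[:k]

def prop_top_k(xs, k):
    """Property: the output has length min(k, len(xs)), is drawn from xs,
    and every kept element is >= every dropped element."""
    out = top_k(xs, k)
    if len(out) != min(k, len(xs)):
        return False
    rest = list(xs)
    for v in out:
        rest.remove(v)  # raises ValueError if out is not a sub-multiset of xs
    return all(a >= b for a in out for b in rest)

def run_property_tests(trials=1000, seed=0):
    """Plain random testing: cost grows linearly with bug rarity.
    The paper's contribution is making this logarithmic instead."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        k = rng.randint(0, 25)
        assert prop_top_k(xs, k), (xs, k)
    return trials

print(run_property_tests())
```

A rare bug (say, one triggered only by a specific tie pattern) would require on the order of 1/p random trials to surface under this naive loop; the technique above is the linear baseline that fractional proof decomposition is claimed to improve on.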