Feb 19, 2026

AI Is Learning to Hack Smart Contracts — and That's Actually a Good Thing

AI can now exploit 72% of critical smart contract bugs. Here's what EVMbench means for blockchain security.


Smart contracts have always had a trust problem. The code is open, the stakes are enormous, and a single missed vulnerability can drain millions of dollars in seconds. For years, the answer was human auditors — talented, expensive, and perpetually outnumbered by the volume of code being deployed. Now AI is stepping into the gap, and OpenAI just built a way to measure how well it's doing.

On February 19, 2026, OpenAI and crypto investment firm Paradigm introduced EVMbench, an open-source benchmark that tests how capable AI agents are at detecting, patching, and actively exploiting vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts. It's a significant release because it takes AI security testing out of the abstract and into the real — drawing on 120 high-severity vulnerabilities pulled from 40 actual security audits, most of them sourced from competitive audit platforms like Code4rena.

What EVMbench Actually Tests

The benchmark evaluates AI agents across three distinct modes that mirror real-world security workflows. In detect mode, agents audit a smart contract repository and are scored based on how many documented vulnerabilities they successfully identify. In patch mode, agents must modify the vulnerable code to close the security hole while keeping the contract's intended functionality fully intact. In exploit mode — the most aggressive of the three — agents attempt to execute an end-to-end fund-draining attack against deployed contracts inside a sandboxed blockchain environment.
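
To make the three modes concrete, here is a toy Python model of a reentrancy bug, the classic pattern behind many real fund-draining exploits. This is an illustrative sketch, not code from the benchmark: detect mode would have to flag the misordered state update, exploit mode would have to drain the vault the way the attacker below does, and patch mode would have to reorder the logic without breaking legitimate withdrawals.

```python
# Toy Python model of a reentrancy bug, the classic fund-draining pattern.
# Illustrative only -- not code from EVMbench.

class VulnerableVault:
    """Mimics a Solidity vault that pays out before updating state."""

    def __init__(self, funds: int):
        self.funds = funds                    # total value held by the contract
        self.balances: dict[str, int] = {}

    def deposit(self, who: str, amount: int) -> None:
        self.balances[who] = self.balances.get(who, 0) + amount
        self.funds += amount

    def withdraw(self, who: str, on_receive) -> None:
        amount = self.balances.get(who, 0)
        if amount == 0:
            return
        self.funds -= amount                  # pay out first...
        on_receive()                          # ...external call: attacker re-enters here
        self.balances[who] = 0                # BUG: state updated too late

class Attacker:
    def __init__(self, vault: VulnerableVault):
        self.vault = vault
        self.stolen = 0

    def attack(self) -> None:
        self.vault.deposit("attacker", 10)
        self.vault.withdraw("attacker", self.on_receive)

    def on_receive(self) -> None:
        self.stolen += 10
        if self.vault.funds >= 10:            # keep draining while funds remain
            self.vault.withdraw("attacker", self.on_receive)

vault = VulnerableVault(funds=100)
attacker = Attacker(vault)
attacker.attack()
print(f"vault funds left: {vault.funds}, attacker took: {attacker.stolen}")
# Prints: vault funds left: 0, attacker took: 110 (their 10 deposit plus the
# vault's 100). The patch is checks-effects-interactions: zero the balance
# before the external call, without breaking legitimate withdrawals.
```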

Results are determined through deterministic transaction replay and automated on-chain verification, making the scoring reproducible and consistent across different testing environments.
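
As a rough illustration of what replay-based scoring can look like, here is a minimal web3.py sketch: it replays a recorded sequence of signed transactions against a fresh sandboxed fork and checks whether the victim contract's balance actually dropped. The node URL, victim address, and transaction list are placeholders, and this is an assumption about the general approach, not EVMbench's actual harness.

```python
# Minimal sketch of replay-based scoring, assuming a local sandboxed fork
# (e.g., Anvil on its default port). VICTIM_CONTRACT and RAW_EXPLOIT_TXS are
# hypothetical placeholders -- this is not EVMbench's actual harness.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

VICTIM_CONTRACT = "0x0000000000000000000000000000000000000000"  # placeholder address
RAW_EXPLOIT_TXS: list[bytes] = []  # signed transactions recorded from the agent's run

balance_before = w3.eth.get_balance(VICTIM_CONTRACT)

# Replay the same transactions in the same order; on a fresh fork the result
# is deterministic, so the score is reproducible across environments.
for raw_tx in RAW_EXPLOIT_TXS:
    tx_hash = w3.eth.send_raw_transaction(raw_tx)
    w3.eth.wait_for_transaction_receipt(tx_hash)

# On-chain verification: the exploit counts only if funds actually left.
balance_after = w3.eth.get_balance(VICTIM_CONTRACT)
assert balance_after < balance_before, "exploit did not drain funds"
print(f"drained {balance_before - balance_after} wei")
```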

The benchmark also extends into payment-oriented code through vulnerability scenarios sourced from the security audit of Tempo, a purpose-built Layer 1 blockchain designed for high-throughput stablecoin payments. Including Tempo scenarios grounds EVMbench in an area of growing practical relevance, as AI-driven stablecoin transactions are expected to expand significantly in the coming years.

The Numbers Tell an Interesting Story

When OpenAI and Paradigm began working on this project, the best available models could successfully exploit fewer than 20% of the critical, fund-draining vulnerabilities in the dataset. Today, GPT-5.3-Codex, running through OpenAI's Codex CLI, achieves a 72.2% success rate in exploit mode, up from 31.9% for GPT-5 roughly six months earlier. That's not incremental progress; it's a fundamental shift in what these systems can do.

Exploit mode is where AI agents currently perform best, and OpenAI's explanation for the gap is worth noting. In exploit tasks, the objective is explicit: keep trying until the funds are drained. That kind of clear, iterative goal plays to the strengths of current AI agent architectures. Detect and patch tasks are harder. In detection, agents often stop after identifying one vulnerability rather than exhaustively auditing the full codebase. In patching, preserving complete functionality while eliminating a subtle bug requires a deeper understanding of the contract's design assumptions, something current models still struggle with.
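
A hypothetical sketch of that loop structure shows why a crisp success predicate helps: failure traces feed straight back into the next attempt. Everything here is an illustrative stub, not an EVMbench or model API.

```python
# Stand-in sketch of the loop exploit mode rewards: a crisp success predicate
# the agent can iterate against. All names here are illustrative stubs, not
# EVMbench or model APIs.
from dataclasses import dataclass

@dataclass
class SandboxResult:
    victim_balance: int   # funds left in the target contract after the attempt
    trace: str            # execution trace the agent can learn from

def run_in_sandbox(candidate: str) -> SandboxResult:
    """Stub: execute a candidate exploit on a fresh fork and report state."""
    return SandboxResult(victim_balance=100, trace="reverted at external call")

def propose_exploit(source: str, feedback: str) -> str:
    """Stub: a model call conditioned on the previous failure trace."""
    return f"attempt informed by: {feedback or 'initial audit'}"

def exploit_loop(contract_source: str, max_attempts: int = 20) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose_exploit(contract_source, feedback)
        result = run_in_sandbox(candidate)
        if result.victim_balance == 0:   # unambiguous goal: funds drained
            return True
        feedback = result.trace          # failure signal drives the next try
    return False

print(exploit_loop("contract Vault { ... }"))  # stub never succeeds: False
```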

A Tool for Defense as Much as Offense

There's an obvious dual-use concern baked into any benchmark that measures an AI's ability to drain funds from smart contracts. OpenAI addresses this directly, framing EVMbench as both a measurement tool and a call to action for the security community. The logic is straightforward: if AI agents are getting better at finding and exploiting vulnerabilities, developers and auditors need to know exactly how capable these systems have become — and they need to incorporate AI-assisted auditing into their workflows before attackers do it first.

To back that up, OpenAI is committing $10 million in API credits through its Cybersecurity Grant Program to support defensive security research, with a focus on open-source software and critical infrastructure. The company is also expanding the private beta of Aardvark, its internal security research agent, and partnering with open-source maintainers to offer free codebase scanning.

EVMbench itself — the tasks, tooling, and evaluation framework — has been released publicly on GitHub, free for any researcher or security team to use.

What This Means for Developers Building on Blockchain

For anyone writing or deploying smart contracts, EVMbench is a reminder that the security landscape is shifting fast. Human audits remain essential, but the window in which human auditors alone could realistically keep pace with the volume and complexity of deployed code is narrowing. AI-assisted auditing isn't a nice-to-have anymore — it's becoming a necessary part of a responsible deployment workflow.
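
As one example of what that workflow shift can look like, here is a minimal pre-deploy audit step using the OpenAI Python SDK. The model name, file path, and gating convention are placeholders; treat it as a sketch of the pattern, a complement to human review rather than a replacement.

```python
# Sketch of a pre-deploy AI audit step using the OpenAI Python SDK. The model
# name, file path, and gating convention are placeholders; this complements a
# human audit rather than replacing it.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ai_audit(contract_path: str) -> str:
    source = Path(contract_path).read_text()
    response = client.chat.completions.create(
        model="gpt-5.3-codex",  # placeholder; use a model you have access to
        messages=[
            {
                "role": "system",
                "content": "You are a smart contract security auditor. List every "
                           "vulnerability you find, with severity and location.",
            },
            {"role": "user", "content": source},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    findings = ai_audit("contracts/Vault.sol")  # hypothetical path
    print(findings)
    # In CI, gate deployment on the result, e.g. fail if "critical" appears.
```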

The benchmark also sets a useful baseline for evaluating security tools. As more vendors offer AI-powered smart contract auditing, EVMbench gives developers an objective framework for asking: how does this tool actually perform against real vulnerabilities, and what does it miss?
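
Detection recall against a documented vulnerability set is the simplest version of that question, and it is the same idea detect mode scores. A small sketch, with illustrative identifiers:

```python
# Framing "what does this tool miss" as detection recall against documented
# vulnerabilities -- the same idea detect mode scores. Identifiers are
# illustrative.
documented = {"reentrancy-withdraw", "unchecked-oracle-price", "missing-access-control"}
tool_findings = {"reentrancy-withdraw", "gas-griefing"}

caught = documented & tool_findings
missed = documented - tool_findings
recall = len(caught) / len(documented)

print(f"recall: {recall:.0%}, missed: {sorted(missed)}")  # recall: 33%, missed: [...]
```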

Smart contracts currently secure more than $100 billion in open-source crypto assets. That's not a number that makes complacency reasonable. EVMbench is an honest look at how far AI security capabilities have come, and a clear signal of where the industry needs to go next.
