Mar 6, 2026
OpenAI Launches GPT-5.4 With Pro and Thinking Versions
OpenAI released GPT-5.4 with Pro and Thinking variants, a 1 million token context window, and meaningful gains in efficiency and accuracy. Here's what matters.

OpenAI released GPT-5.4 on Thursday, and if you've been anywhere near tech Twitter in the last 48 hours, you already know the vibes: record benchmarks, frontier model, professional work, blah blah. Most of the coverage reads like a press release with a byline, and the actual substance gets buried under the hype. So let's skip all that and talk about what's actually different this time.
It's Three Models, Not One
The thing OpenAI released isn't really a single model. GPT-5.4 comes in a standard version, a reasoning variant called GPT-5.4 Thinking, and a performance-optimized GPT-5.4 Pro. The messaging is squarely aimed at enterprise buyers, which is a shift worth noting because this isn't the "anyone can use AI" pitch from a couple years ago. OpenAI is chasing serious production workloads now, and the three-model structure reflects that. If you're a developer or work in a field where the model actually has to think through complex problems, Thinking is the one to watch. Pro is for teams where speed and throughput matter more than cost. Standard covers everything else.
The Context Window Is Huge and It Actually Matters
One million tokens on the API side. That's the biggest context window OpenAI has shipped, and unlike a lot of the benchmark stuff, this one has real practical implications. You can throw an entire codebase at it, a multi-year contract, or a year's worth of financial records, all in a single call, without breaking it into chunks, without lossy summarization in the middle, and without the model losing track of what it read 50 pages ago. For anyone building serious applications on top of these models, that's a meaningful architectural unlock that changes what's actually possible to build.
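To make the scale concrete, here's a rough back-of-the-envelope check of whether a large document fits in a single call. This is a sketch using the common ~4 characters-per-token heuristic for English text; the real count depends on the tokenizer and should be measured with one before relying on it.

```python
# Rough estimate of whether a document fits a 1M-token context window.
# Assumes ~4 characters per token, a common English-text heuristic; the
# actual count varies by tokenizer and language.

CONTEXT_WINDOW = 1_000_000   # tokens, per the GPT-5.4 API announcement
CHARS_PER_TOKEN = 4          # heuristic, not exact

def fits_in_one_call(text: str, reserved_for_output: int = 8_000) -> bool:
    """Return True if `text` likely fits alongside a reserved output budget."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= CONTEXT_WINDOW - reserved_for_output

# A 300-page contract at ~3,000 characters per page is ~900k characters,
# or roughly 225k tokens: comfortably within a single call.
contract = "x" * 900_000
print(fits_in_one_call(contract))  # True
```

The practical point: workloads that previously forced chunk-and-summarize pipelines now fit whole, which is what removes the lossy middle step the article describes.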
Token Efficiency Is the Boring Win Nobody's Talking About
OpenAI says GPT-5.4 solves the same problems with fewer tokens than the previous version, and reporters mostly glossed over this. They shouldn't have. In production at any real scale, token count is your cost, and a more efficient model at the same capability level isn't just cheaper to run, it's the thing that determines whether an AI-powered product is actually viable as a business. Cost is still the primary reason teams can't scale these things the way they want to, so any meaningful efficiency gain deserves more attention than it's getting in the coverage cycle.
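The arithmetic behind that claim is worth spelling out. Here's a minimal cost model showing why output-token efficiency dominates at scale; the per-token prices below are hypothetical placeholders, not OpenAI's actual rates.

```python
# Why token efficiency dominates serving cost at scale.
# Prices are HYPOTHETICAL placeholders, not OpenAI's actual rates.

PRICE_PER_1M_INPUT = 2.00    # USD per million input tokens (hypothetical)
PRICE_PER_1M_OUTPUT = 8.00   # USD per million output tokens (hypothetical)

def monthly_cost(requests: int, input_tokens: int, output_tokens: int) -> float:
    """Total cost of `requests` calls, each using the given token counts."""
    per_call = (input_tokens * PRICE_PER_1M_INPUT +
                output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000
    return requests * per_call

# Same workload, but the newer model answers in 30% fewer output tokens.
old = monthly_cost(1_000_000, input_tokens=2_000, output_tokens=1_000)
new = monthly_cost(1_000_000, input_tokens=2_000, output_tokens=700)
print(round(old), round(new))  # 12000 9600
```

At a million requests a month, a 30 percent output-token reduction cuts the bill by a fifth in this toy model. That's the difference between a product with viable margins and one without, which is why the efficiency claim deserves more coverage than it got.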
The Benchmarks, For What They're Worth
GPT-5.4 topped the leaderboards on OSWorld-Verified and WebArena Verified, which test the model's ability to actually use software interfaces autonomously, and it also won on Mercor's APEX-Agents benchmark, which focuses on law and finance tasks specifically. Those are more interesting than the general knowledge benchmarks because they're testing something closer to real work rather than trivia. On hallucinations, OpenAI claims a 33 percent reduction in individual claim errors versus GPT-5.2 and an 18 percent drop in responses containing errors overall. Take those numbers with a grain of salt, since it's OpenAI grading OpenAI, but the direction is at least encouraging.
Tool Search Is Small But Smart
There's a new API feature called Tool Search that didn't get much attention but is genuinely useful if you build with these models. Previously, every API call had to load definitions for every available tool into the system prompt upfront, and if you had a large tool library, that ate tokens fast and made requests slower and more expensive. Tool Search lets the model pull tool definitions on demand instead of front-loading everything, which means faster calls, lower costs, and cleaner architecture for complex agentic systems. It's not glamorous, but it's the kind of infrastructure improvement that compounds over time as these tool ecosystems keep growing.
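The API specifics are OpenAI's, but the underlying pattern is easy to sketch independently: keep a registry of tool definitions and serialize into the prompt only the ones relevant to the request. Here's a minimal illustration of that pattern; the registry, the keyword matching, and every name in it are invented for this example and are not the actual Tool Search API.

```python
# Conceptual sketch of on-demand tool selection: only tools whose
# descriptions overlap the user's request get loaded into the prompt.
# Illustration of the general pattern, NOT OpenAI's Tool Search API.
import string

# Hypothetical tool library; in a real system each entry would be a
# full JSON schema, which is what makes front-loading expensive.
TOOL_REGISTRY = {
    "get_weather": "Look up the current weather for a city.",
    "send_invoice": "Create and email an invoice to a customer.",
    "search_contracts": "Full-text search over stored legal contracts.",
    "convert_currency": "Convert an amount between two currencies.",
}

STOPWORDS = {"the", "a", "an", "to", "for", "in", "our", "and", "over", "between", "two"}

def tokenize(text: str) -> set[str]:
    """Lowercase, strip punctuation, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {w for w in cleaned.split() if w not in STOPWORDS}

def select_tools(user_request: str, registry: dict[str, str]) -> list[str]:
    """Return names of tools whose description shares a keyword with the request."""
    request_words = tokenize(user_request)
    return [name for name, desc in registry.items()
            if request_words & tokenize(desc)]

# Only one definition gets loaded, instead of all four.
print(select_tools("find the indemnity clause in our contracts", TOOL_REGISTRY))
# → ['search_contracts']
```

A production version would match on embeddings rather than keywords, but the economics are the same either way: prompt size scales with the tools a request needs, not with the size of the library, which is why this compounds as tool ecosystems grow.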
The Safety Stuff
OpenAI shipped an evaluation that tests whether the Thinking model's visible reasoning actually matches what it's doing internally, which has been a real concern in AI safety circles for a while. The specific worry is that reasoning models could show you one chain of thought while doing something different under the hood, and there's published research showing it can happen under the right conditions. The results here suggest GPT-5.4 Thinking is less prone to that kind of misrepresentation, though it's worth pointing out that OpenAI ran this evaluation themselves. Independent verification would be more convincing, but publishing the evaluation publicly at all is a step toward the kind of accountability the field actually needs.
Bottom Line
GPT-5.4 is a real release, not a rebranding exercise. The million-token context window, the efficiency gains, and Tool Search all address concrete problems that developers actually run into, and the three-model lineup gives buyers genuine flexibility depending on their use case and budget. The hallucination improvements matter too, if they hold up outside of OpenAI's own testing environment. It won't settle the broader questions about where this industry is headed or whether any of these companies are building sustainable businesses, but as a model release, it moves the ball forward in ways that matter for people actually building with these things.


