Self-Improving AI Agents Are Starting to Look Like Real Software

Self-improving AI agents are moving from demos to real workflows, using expert feedback, evals, and Codex to make business software better over time.

Most AI tools are still judged by one question: did the output work? That is fine for simple tasks, but the bigger shift is not AI that gives one good answer. It is AI systems that improve from real work, expert corrections, and actual production use.

That is what OpenAI is showing in its post on building self-improving tax agents with Codex. The post focuses on Tax AI, a system built with Thrive Holdings and Crete’s accounting network to help accountants prepare complex tax returns.

The tax angle matters, but the lesson reaches well beyond tax. This is a model for how AI agents can improve inside real business workflows.

Tax work is a strong test case

Tax preparation is not a clean, simple task. Accountants deal with messy PDFs, client notes, prior-year filings, missing information, and forms that need to line up across different systems. That makes it a useful test for AI agents.

A tax agent cannot just summarize a document and call it done. It has to extract the right values, connect them to the right forms, show its sources, and handle review from a human expert.

In the pilot, Tax AI processed 7,000 returns. OpenAI said the system saved about one-third of tax prep time and reached up to 97% accuracy. Those numbers are strong, but the more interesting part is how the system improved.

The feedback loop is the real story

When an accountant corrected Tax AI, that correction did not just disappear after the return was filed. It became a signal. The system could compare what the AI predicted, what the accountant changed, and what ended up in the final return.

That matters because expert corrections are one of the most valuable sources of product feedback a company can get. But they are only useful if the product captures them in a structured way. A correction might mean the AI extracted the wrong number, used the wrong source, mapped a value to the wrong field, or ran into a case where the accountant made a judgment call the AI should not automate.
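To make "structured" concrete, here is a minimal Python sketch of how a correction could be sorted into those four buckets. It is not how Tax AI does it; the OpenAI post does not publish that code. The names `FieldRecord`, `CorrectionKind`, and `classify_correction`, and the idea of a reviewer-set judgment flag, are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CorrectionKind(Enum):
    """Hypothetical categories, taken from the four cases described above."""
    NO_CHANGE = "no_change"
    WRONG_VALUE = "wrong_value"
    WRONG_SOURCE = "wrong_source"
    WRONG_FIELD = "wrong_field"
    JUDGMENT_CALL = "judgment_call"


@dataclass(frozen=True)
class FieldRecord:
    """One value on a return: where it goes, what it is, where it came from."""
    field: str   # form field the value was mapped to
    value: str   # the value itself, kept as text for simplicity
    source: str  # document the value was cited from


def classify_correction(
    predicted: FieldRecord,
    final: FieldRecord,
    flagged_as_judgment: bool = False,
) -> CorrectionKind:
    """Compare what the agent predicted with what ended up in the final return."""
    if predicted == final:
        return CorrectionKind.NO_CHANGE
    # Assumption: the reviewer can mark a change as a judgment call,
    # so it is not treated as an agent error to be automated away.
    if flagged_as_judgment:
        return CorrectionKind.JUDGMENT_CALL
    if predicted.field != final.field:
        return CorrectionKind.WRONG_FIELD
    if predicted.source != final.source:
        return CorrectionKind.WRONG_SOURCE
    return CorrectionKind.WRONG_VALUE
```

The order of the checks is a design choice, not a rule. A real product would likely need finer categories and a way to handle corrections that change several things at once.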

The product has to know the difference. That is where evals come in. Instead of saying “the AI needs to get better at taxes,” the team can turn repeated mistakes into specific tests. If the system keeps missing a field on rental-property returns, that failure can become a targeted eval. Then the team can measure whether the next version actually fixes the issue.
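As a rough sketch of that idea, the Python below groups logged failures by return type and field, keeps the ones that repeat, and scores an agent against them. Everything here is hypothetical: the `Failure` record, the threshold of three, and the `agent` callable stand in for whatever a real system would use.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Failure:
    """One logged mistake (hypothetical shape)."""
    return_type: str  # e.g. "rental_property"
    field: str        # the field the agent got wrong
    doc_id: str       # identifier for the source documents
    expected: str     # the value the accountant actually filed


def recurring_failures(
    failures: List[Failure], min_count: int = 3
) -> List[Tuple[str, str]]:
    """Return (return_type, field) pairs that failed at least min_count times."""
    counts = Counter((f.return_type, f.field) for f in failures)
    return [key for key, n in counts.items() if n >= min_count]


def build_eval(failures: List[Failure], key: Tuple[str, str]) -> List[Failure]:
    """Collect the logged cases for one recurring failure into a targeted eval."""
    return [f for f in failures if (f.return_type, f.field) == key]


def pass_rate(cases: List[Failure], agent: Callable[[str], str]) -> float:
    """Fraction of eval cases where the agent's answer matches the filed value."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if agent(case.doc_id) == case.expected)
    return passed / len(cases)
```

Running `pass_rate` on the same cases before and after a change gives the team a number to compare, which is the point of a targeted eval.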

Codex helps turn mistakes into fixes

This is where Codex becomes useful. Codex is not just being asked to write random code. It is working from real product failures, traces, examples, and tests.

That creates a much better workflow than giving an AI a vague prompt like “make this product better.” Codex can inspect where the issue happened, look at the related code, and suggest a fix. The task is scoped, the evidence is clear, and the expected behavior is testable.

That is the part businesses should pay attention to. AI improvement is not magic. It comes from building the right system around the model.

Production traces make the system smarter

A big reason this works is that the product captures what happened during the workflow. A trace can show the source files, extracted fields, citations, mappings, human corrections, and final filed values. That gives the team a way to understand where the agent failed.

Without that, the team only sees the final mistake. With traces, they can see the actual step where things went wrong. That is a major difference between a basic AI tool and a serious AI product.
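Here is one way that could look in Python, as a sketch under simple assumptions. A trace is modeled as an ordered list of steps, each recording what the agent produced and what stood after expert review. `TraceStep` and `first_divergence` are illustrative names, not part of any published Tax AI or Codex interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TraceStep:
    """One step in a workflow trace (hypothetical shape)."""
    name: str      # e.g. "extract", "cite", "map"
    produced: str  # what the agent produced at this step
    accepted: str  # what stood after the accountant's review


def first_divergence(trace: List[TraceStep]) -> Optional[TraceStep]:
    """Return the earliest step where the agent's output was changed, if any."""
    for step in trace:
        if step.produced != step.accepted:
            return step
    return None


trace = [
    TraceStep("extract", "12000", "12000"),
    TraceStep("map", "Line 3", "Line 5"),
    TraceStep("file", "Line 3", "Line 5"),
]
step = first_divergence(trace)
print(step.name if step else "no divergence")  # prints "map"
```

In this example the filed value is wrong, but the trace shows the extraction was fine and the mapping step is where the error started.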

If a company wants agents that improve over time, it needs more than prompts. It needs workflow data, review steps, tests, and a way to turn expert feedback into product changes.

Self-improving does not mean fully automatic

The phrase “self-improving AI” can sound like the system is running on its own. That is not what this example shows. Accountants still review the work. Engineers still decide what changes should ship. The system still needs tests and safeguards, especially in a high-stakes field like taxes.

The point is not to remove humans from the process. The point is to stop wasting their corrections. When experts fix the same mistake over and over, the product should learn from that pattern.

When the system improves, experts spend less time on repetitive cleanup and more time on judgment-based work. That is a practical version of AI improvement. Not hype. Not full automation. Just better software.

The bigger lesson for businesses

The Tax AI example shows where AI agents are going. The winning products will not just be the ones with the best demo. They will be the ones that improve after real users touch them.

That means companies should ask a better question before building with AI. Not just, “Can we automate this task?” but, “Can we build a workflow where expert feedback makes the product better?”

If the answer is yes, the product can create a loop. AI handles part of the work. Experts review it. Corrections are captured. Repeated issues become evals. Codex helps investigate and propose fixes. Engineers review and ship improvements. The next version gets better.

That is the real value. Self-improving agents are not about replacing people overnight. They are about building software that learns from skilled workers, reduces repeat mistakes, and improves through real use.

That is why this OpenAI example matters. It is not just a tax story. It is a preview of how serious AI products will be built.