Flowdiff: Evaluating Changes to AI Agents

2026-03-16

by Uri Walevski

You change a prompt and something breaks. Not the thing you changed, something else entirely. A user reports that the bot stopped doing a thing it used to do perfectly, and you trace it back to a sentence you added three days ago about date formatting.

This is the core problem with evaluating changes to AI agents. Prompts are not code. They don't have well-defined inputs and outputs. Changing one word can cascade through the model's behavior in ways that are impossible to predict. Adding a tool to the agent's toolkit can alter responses in completely unrelated conversations, because the model now has a different set of options to reason about and a different system message length.

Traditional software testing doesn't work here. You can write a test that says "given this input, expect this output," but an agent's output is non-deterministic. The same input can produce different valid responses. And even if you build a regression set of input/output pairs, maintaining it is a nightmare. Every time the agent legitimately improves, you need to update all the expected outputs. The regression set rots faster than you can maintain it.

Conversations are append-only

Here's an insight that predates ChatGPT by a decade. At Google, engineers working on early phone agents (the kind that handled automated calls) developed a design pattern called "flowdiff." The key observation: a conversation is an immutable sequence of events. Messages, tool calls, tool results. Things can only be appended. The past can't change.
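A minimal sketch of that event model (the names here are illustrative, not from any real flowdiff implementation): events are frozen once created, and a conversation only ever grows.

```python
from dataclasses import dataclass, field
from typing import Literal

# One immutable event in a conversation: a message, a tool call, or a tool result.
@dataclass(frozen=True)
class Event:
    kind: Literal["user_message", "agent_message", "tool_call", "tool_result"]
    content: str

# A conversation is an append-only sequence: events are added, never edited.
@dataclass
class Conversation:
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)
```

Because events are immutable and order is fixed, any stored conversation doubles as a replayable log.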

This means you can replay history. Take an existing conversation, feed the events to your new agent version one by one, and at each step where the agent responded, compare what it would say now to what it actually said then. As long as the responses match, keep going. The conversation unfolds the same way, so the next event in history is still a valid continuation.

The moment a diff occurs, you stop. Once the agent diverges from the historical response, you can't know how the conversation would have evolved from that point. The user might have reacted differently, the tool results might have been different, everything downstream is unknowable. So you capture the diff and move on to the next conversation.
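In code, the replay-until-divergence loop might look like this. It's a sketch: `agent_respond` stands in for whatever produces the new agent version's reply, events are simplified to (role, content) pairs, and exact string equality is the strictest possible match (in practice you'd likely normalize or compare semantically).

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Diff:
    index: int  # position in the event sequence where the divergence occurred
    old: str    # what the agent said historically
    new: str    # what the new agent version would say now

def replay(events: list[tuple[str, str]],
           agent_respond: Callable[[list[tuple[str, str]]], str]) -> Optional[Diff]:
    """Feed historical events to the new agent one by one.

    At each historical agent turn, ask the new agent what it would say given
    the history so far. On the first mismatch, stop and capture the diff:
    everything downstream of a divergence is unknowable.
    """
    history: list[tuple[str, str]] = []
    for i, (role, content) in enumerate(events):
        if role == "agent":
            candidate = agent_respond(history)
            if candidate != content:
                return Diff(index=i, old=content, new=candidate)
        history.append((role, content))
    return None  # no divergence: the new agent reproduces the conversation
```

Note that a matching turn is appended to the history like any other event, so the replay keeps walking the original timeline until the first real difference.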

Net positive, net negative, impact

Run the new agent across a large set of historical conversations and you collect a pile of diffs. Each diff is a place where the agent would now respond differently than it did before. Each diff gets assessed along three dimensions.

Net positive: the new response is better than the old one. The agent gives a more accurate answer, uses a tool more efficiently, avoids a hallucination it previously fell into, or handles an edge case it used to miss.

Net negative: the new response is worse. The agent lost a capability, introduced a new failure mode, or regressed on something that was working fine.

Impact: how much does this diff matter? A diff in a greeting message is low impact. A diff in a financial calculation or a medical recommendation is high impact. Not all regressions are equal, and not all improvements are equal either.

By evaluating the diffs across enough historical data, you get a probabilistically sound picture of whether your change is safe to deploy. You're not guessing based on a few test cases. You're measuring against real conversations that real users had.
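One way to tally those judgments, assuming each diff has already been labeled with a verdict and an impact weight (all names here are hypothetical): weight each diff by impact, so one high-stakes regression outweighs many cosmetic improvements.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class EvaluatedDiff:
    verdict: Literal["positive", "negative", "neutral"]
    impact: float  # 0.0 (greeting tweak) .. 1.0 (financial or medical answer)

def summarize(diffs: list[EvaluatedDiff]) -> dict[str, float]:
    """Collapse a pile of evaluated diffs into impact-weighted totals."""
    pos = sum(d.impact for d in diffs if d.verdict == "positive")
    neg = sum(d.impact for d in diffs if d.verdict == "negative")
    return {"positive_weight": pos, "negative_weight": neg, "net": pos - neg}
```

With enough conversations, these totals are what "probabilistically sound" cashes out to: a measured net effect rather than a gut call.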

Automating evaluation

In 2016 you'd have a human review each diff. Today you can point an LLM at the diff, the original conversation context, and the two candidate responses, and have it classify the change as positive, negative, or neutral, with an impact score.
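A sketch of such a judge, with the model call abstracted behind a plain prompt-to-completion callable so the prompt shape is the point. Everything here is an assumption for illustration, not a real API.

```python
import json
from typing import Callable

# Hypothetical judge prompt: shows the conversation context plus both
# candidate responses, and asks for a structured classification.
JUDGE_PROMPT = """You are evaluating a change to an AI agent.

Conversation so far:
{context}

Old response: {old}
New response: {new}

Reply with JSON only: {{"verdict": "positive" | "negative" | "neutral", "impact": 0.0-1.0}}"""

def judge_diff(context: str, old: str, new: str,
               llm: Callable[[str], str]) -> dict:
    """Classify one diff. `llm` is any prompt -> completion function."""
    raw = llm(JUDGE_PROMPT.format(context=context, old=old, new=new))
    result = json.loads(raw)
    if result["verdict"] not in ("positive", "negative", "neutral"):
        raise ValueError(f"unexpected verdict: {result['verdict']!r}")
    return result
```

Keeping the model behind a callable also makes the judge trivially testable with a stub, which matters once this sits in a deploy pipeline.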

This closes the loop entirely. You make a change, run flowdiff against your historical corpus, get an automated evaluation, and ship with confidence, or don't ship if the numbers are bad. No manually maintained test suites. No brittle expected-output assertions. Just a statistical comparison against what your agent actually did in the real world.
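The ship/no-ship decision at the end can be as simple as a threshold check on the impact-weighted totals. The thresholds below are illustrative, not recommendations:

```python
def safe_to_ship(positive_weight: float, negative_weight: float,
                 max_negative: float = 1.0, min_net: float = 0.0) -> bool:
    """Gate a deploy: block it if regressions carry too much total impact,
    or if the change is not a net improvement."""
    return (negative_weight <= max_negative
            and (positive_weight - negative_weight) >= min_net)
```

Wired into CI, this turns "ship with confidence" into a concrete, repeatable check rather than a judgment call made fresh each release.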

When to care about this

Don't. Not yet. If you don't have many users generating substantial conversation history, flowdiff is overkill. You'll spend more time building the evaluation pipeline than you'll save in prevented regressions.

Ship fast, break things, fix them when users complain. Flowdiff is for when you have enough users that breaking things has real cost, and enough history that statistical evaluation is meaningful. Until then, just talk to your bot and see if it works.
