Replay Production Traffic to Catch LLM Regressions

The change you can't test

A one-line change to your LLM layer can silently break a class of requests you weren't thinking about. Tighten a system prompt to fix one behaviour, bump to a new model version, add a sentence to stop the assistant rambling — any of them can regress outputs you never see coming.

There's no cheap way to catch it before users do. Most test suites hold a dozen hand-written inputs that don't resemble real traffic, so shipping to production becomes the actual test — and by then the regression is already in front of customers. The requests most likely to break are the ones nobody thought to write a fixture for.

Replay closes the gap: take a request that already happened, run it through your proposed change, and compare.

What replay actually is

Every request that flows through your gateway is logged — the messages sent, the completion returned, the model, the tokens, the cost. A logged request is a real, representative input. Replay takes one of those logs and re-runs it.

In Grepture, you open any request in your traffic log, hit replay, and the request is loaded into the Playground exactly as it was sent: the body, the model, the target endpoint. From there you change one thing — edit the prompt, swap the model, adjust a parameter — and re-run it against the live provider. You get the new response side by side with the original.

This is the difference between replay and a synthetic test case. You're not guessing what a hard input looks like. You're using one that genuinely hit your system, with all the mess that real users bring: the weird phrasing, the half-formed question, the edge case that only shows up in production.

Replaying the redacted request, on purpose

There's one detail worth being explicit about, because it's a deliberate design choice rather than a limitation.

Replay re-sends the post-redaction request — the exact body that was forwarded upstream after Grepture's PII pipeline ran, not the original raw input. If a customer's email address was masked before the request left your gateway the first time, it stays masked in the replay. PII is never restored on a replayed response.

That's what makes replay safe to use freely. You can replay any production request — including ones that originally carried sensitive data — without re-exposing that data to a model or to whoever is doing the testing. In the Playground, the original is tagged "Redacted — safe to replay" so there's no ambiguity about what you're sending. Replayed runs are also tagged in their metadata and excluded from your analytics, so a day of regression testing doesn't pollute your real traffic numbers or cost reports.

The tradeoff: replay is a re-run against a live model, not a byte-identical snapshot. Sampling, temperature, and provider-side changes mean the new output won't be deterministic. That's fine — and arguably correct. You're not asserting the model returns the same characters every time. You're asking whether your change makes outputs better or worse on real inputs.

Comparing the way that matters: scores, not eyeballs

Reading two outputs side by side and deciding which is better works for one request. It does not work for fifty, and it doesn't work at all when "better" is subjective.

So replay scores the new output against the original using your evaluators. If you already run LLM-as-a-judge evals on production traffic, those same judge prompts apply here. When you re-run a replayed request, Grepture scores the new output, makes sure the original is scored on the same evaluators, and shows you both with the delta between them:

Relevance          0.62  →  0.81   +0.19
Instruction follow 0.90  →  0.74   −0.16
Conciseness        0.55  →  0.88   +0.33

Now the change has a shape. That prompt edit you made to tighten responses? It improved conciseness and relevance but cost you instruction-following on this request. Maybe that's an acceptable trade. Maybe it's a regression you'd never have noticed by skimming the text. Either way, you're making the call with a number in front of you instead of a hunch.

Scores are colour-coded — green above 0.7, yellow between 0.4 and 0.7, red below — so a regression is something you see, not something you have to hunt for.

From one request to a real test suite

Single-request replay is the right tool when you're investigating a specific bad output or sanity-checking a quick prompt tweak. But the same idea scales, and that's where it becomes a real regression-testing workflow.

When you find a request worth keeping — a failure case, a tricky edge case, a representative example — you add it to a dataset in one click. Better still, set an auto-collection rule: every request that scores below 0.5 on your hallucination evaluator lands in a dataset automatically. Over a week you accumulate a living collection of the exact inputs your system struggles with — not a hand-picked sample, but the real long tail.

Then you run an experiment. Take a dataset, point it at a new prompt version or a different model, and Grepture replays every item, scores each output with your evaluators, and tracks the tokens, cost, and duration per item. Compare two experiments and you get a per-item diff: which inputs improved, which regressed, and by how much.

It's the same loop as single-request replay, run in bulk:

Capture real requests from production traffic.
Replay them against your proposed change.
Score the new outputs against the baseline.
Ship only if the deltas are green where they need to be.

The one-click replay answers "did this break this request?" The dataset-plus-experiment flow answers "did this break any of the requests I care about?" You reach for whichever the moment calls for.

Why the gateway is the right place for this

Replay only works if the requests are already there, in full, with enough context to re-run and re-score. That's exactly what a gateway has.

Grepture is already in the path of every AI call. It already logged the request body, the response, the model, and the cost. It already ran your redaction pipeline, so it knows which version of the request is safe to replay. And it already has your evaluators, so it can score the result without a separate pipeline. Replay isn't a new system you integrate — it's what falls out naturally when the gateway has already captured everything.

This is the same pattern behind evals on real traffic and prompt management: once you're in the path of every request, new capabilities are incremental rather than architectural. A standalone replay tool would need you to export logs, rebuild request bodies, wire up a model client, and manage redaction yourself. At the gateway, those are already solved.

When to reach for replay

A few moments where this earns its keep:

Before a prompt change ships. Pull up the requests that motivated the change, replay them against the new prompt, and confirm it actually fixes them without regressing the rest.
When a provider deprecates a model. You're being forced onto a new version. Replay a representative slice of traffic against it and see whether your quality scores hold before you flip the switch.
After an incident. You found a bad output in production. Replay it against a candidate fix and verify the fix works on the real input, not an approximation of it.
Building a regression suite. Auto-collect low-scoring traffic into a dataset and run it as an experiment on every prompt change, so the same failure never ships twice.

If you're on Grepture, open any request in your traffic log and hit replay before your next prompt change. If you're not yet, get started — point your traffic through the gateway and you'll have replay, evals, and cost visibility from day one.

How Grepture works — architecture overview
LLM evals on real traffic — the scoring that powers replay comparison
Datasets: production logs as test suites — replay in bulk
Prompt management and version control — version the prompts you're testing