|Ben @ Grepture

LLM Evals on Real Traffic — Not Just Test Suites

Grepture now runs LLM-as-a-judge evaluations on your actual production traffic. No synthetic datasets, no separate pipeline — just continuous quality scoring on the requests already flowing through your gateway.

The eval gap

Most teams know they should be evaluating their LLM outputs. Few actually do it in production.

The typical setup looks like this: you build a test suite with a handful of golden examples, run it in CI before deploys, and hope those examples are representative of what real users actually send. Sometimes they are. Often they're not. The prompts users write in production are messier, longer, and weirder than anything in your test fixtures. The edge cases that matter most are the ones you didn't think to include.

Meanwhile, the interesting data — the actual requests and responses flowing through your AI pipeline every day — sits in logs that nobody looks at until something breaks.

We think evals should run where the data already is.

Evals in Grepture

Starting today, Grepture can automatically evaluate your production AI traffic using LLM-as-a-judge scoring. If you're already routing requests through our proxy or sending traces via the SDK, there's nothing new to integrate. Your traffic logs become the evaluation dataset.

Here's how it works. You create an evaluator — either from one of our built-in templates or with a custom judge prompt — and tell it which traffic to score. Grepture runs the evaluator in the background against your real logs, scoring each response on a 0-to-1 scale with written reasoning.
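As a rough mental model of that background flow — every name here (`LogEntry`, `EvalRecord`, `run_evaluator`) is illustrative, not Grepture's actual API, and the judge call is stubbed out:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LogEntry:
    """A request/response pair already captured by the gateway (illustrative)."""
    request_id: str
    input: str
    output: str

@dataclass
class EvalRecord:
    """One evaluation: a 0-to-1 score plus the judge's written reasoning."""
    request_id: str
    evaluator: str
    score: float
    reasoning: str

def run_evaluator(name: str, entry: LogEntry,
                  judge: Callable[[LogEntry], tuple[float, str]]) -> EvalRecord:
    # `judge` stands in for the judge-LLM call; it returns (score, reasoning).
    score, reasoning = judge(entry)
    return EvalRecord(entry.request_id, name, score, reasoning)

# A stubbed judge, just to show the record shape end to end.
stub = lambda entry: (0.9, "Response directly addresses the question.")
record = run_evaluator("relevance",
                       LogEntry("req_1", "What is X?", "X is a thing."), stub)
```

The point of the shape: each score is attached to a specific logged request, so the reasoning can always be read next to the response it judged.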

No synthetic datasets. No separate evaluation pipeline. No batch jobs to manage. Your production traffic is the test suite.

Setting up an evaluator

Evaluators are judge prompts with variables. At minimum, you need {{output}} — the LLM's response. You can also use {{input}} (the user's message) and {{system}} (the system prompt) for more context-aware scoring.
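The substitution itself is simple templating. A minimal sketch of a renderer (not Grepture's implementation) that fills `{{output}}` and `{{input}}` into a judge prompt:

```python
import re

def render_judge_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with values from `variables`."""
    def sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"template references unknown variable: {name}")
        return variables[name]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

template = (
    "Given the user's question:\n{{input}}\n\n"
    "Rate how relevant this response is on a 0-to-1 scale:\n{{output}}"
)
prompt = render_judge_prompt(template, {
    "input": "How do I rotate an API key?",
    "output": "Go to Settings > API Keys and click Rotate.",
})
```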

We ship six templates to get you started:

  • Relevance — does the response actually address the question?
  • Helpfulness — is the response actionable and useful?
  • Toxicity — is the response safe and appropriate?
  • Conciseness — does the response convey information efficiently?
  • Instruction following — does the response honour what the system prompt asked for?
  • Hallucination — is the response grounded in what was provided?

Pick a template, adjust the prompt if you want, and enable it. That's the setup. Each evaluator also supports filters — only score traffic from a specific model, provider, or prompt ID — and a sampling rate so you control how much you spend on judge tokens.

Why production traffic matters

Here's what evaluating real traffic teaches you that a test suite can't:

Distribution shifts. Your test suite reflects what you thought users would ask when you wrote it. Production traffic reflects what they actually ask today. When user behaviour changes — and it always does — evals on real traffic catch it. A static test suite doesn't.

Long-tail failures. The requests that cause the worst outputs are usually the ones nobody anticipated. A 5% hallucination rate across your test suite might hide a 40% hallucination rate on a specific class of user query you never tested for. Continuous evaluation surfaces these patterns.
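To make that arithmetic concrete (the counts below are invented, purely illustrative): an aggregate rate can look healthy while one query class fails badly.

```python
# Invented traffic: 900 well-covered requests, 100 from an untested query class.
segments = {
    "covered queries":  {"requests": 900, "hallucinations": 10},  # ~1.1%
    "untested queries": {"requests": 100, "hallucinations": 40},  # 40%
}

total_requests = sum(s["requests"] for s in segments.values())
total_bad = sum(s["hallucinations"] for s in segments.values())
overall_rate = total_bad / total_requests  # 50 / 1000 = 0.05

per_segment = {name: s["hallucinations"] / s["requests"]
               for name, s in segments.items()}
```

A test suite drawn from the covered queries would report 1-2% and pass; only scoring the live mix exposes the 40% segment.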

Model regressions. Providers update models without notice. A minor version bump to GPT-4o or Claude might improve average quality but degrade performance on your specific use case. If you're only testing pre-deploy, you won't catch regressions introduced by the model provider — only by your own code changes.

Prompt drift. If you're using prompt management, you're changing prompts without redeploying code. That's the point. But every prompt change is a potential quality change. Evals on real traffic give you a continuous quality signal that follows prompt versions automatically.

What you see

Every evaluation score shows up in two places.

In the evals dashboard, you get the full picture: score trends over time, per-evaluator breakdowns, average scores, and the ability to drill into individual evaluations to read the judge's reasoning. Spot a dip in your relevance score last Tuesday? Click through to see which responses scored low and why.

In the traffic log, evaluation scores appear inline alongside the request details — model, tokens, cost, latency, and now quality scores. When you're investigating a specific request, you see everything in one place: what was sent, what came back, what it cost, and how good it was.

Scores are colour-coded: green above 0.7, yellow between 0.4 and 0.7, red below 0.4. Simple enough to scan at a glance, detailed enough to investigate when something looks off.
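Those bands map directly to a lookup. A sketch of the stated thresholds (reading "above 0.7" strictly, so exactly 0.7 lands in yellow here):

```python
def score_colour(score: float) -> str:
    """Band a 0-to-1 score: green above 0.7, yellow 0.4-0.7, red below 0.4."""
    if score > 0.7:
        return "green"
    if score >= 0.4:
        return "yellow"
    return "red"

assert score_colour(0.85) == "green"
assert score_colour(0.55) == "yellow"
assert score_colour(0.2) == "red"
```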

Controlling cost

Running a judge LLM on every request would be expensive and unnecessary. Grepture gives you two levers:

Sampling rate. Set each evaluator to score 10% of matching traffic and you get statistically meaningful quality signals at a tenth of the cost. For high-volume workloads, even 1-5% gives you enough data to spot trends and regressions.
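A quick back-of-envelope for why small samples suffice, using assumed numbers (100,000 requests a day, score standard deviation of 0.25):

```python
import math

daily_requests = 100_000  # assumed daily volume
score_std_dev = 0.25      # assumed spread of 0-to-1 scores

def mean_score_error(sample_rate: float) -> float:
    """Approximate standard error of the daily mean score at a sampling rate."""
    n = daily_requests * sample_rate
    return score_std_dev / math.sqrt(n)

# Even at 1% sampling (1,000 scored requests), the daily mean score is
# pinned down to within roughly +/-0.008 -- tight enough to spot real dips.
assert round(mean_score_error(0.01), 3) == 0.008
```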

Filters. Only evaluate what matters. Score production traffic but skip development requests. Evaluate only your customer-facing model. Focus on a specific managed prompt that you're actively iterating on. Filters keep the judge focused on traffic where quality insights are actionable.

Between sampling and filters, most teams spend a fraction of what they'd expect on evaluation — while getting quality visibility they never had before.

Why the gateway is the right place for this

Other evaluation tools require you to export logs, set up a separate pipeline, and manage another integration. That works, but it's friction — and friction means most teams never get around to it.

Grepture already has your traffic. Every request and response is already logged with full context: the messages sent, the completion received, the model, the tokens, the cost. Evaluation is a natural extension of what the gateway already does. There's no data to export, no pipeline to build, no integration to maintain.

This is the same pattern that makes trace mode and prompt management work well in Grepture: when you're already in the path of every AI call, adding capabilities is incremental rather than architectural.

What's coming next

Evals today give you a quality score in the dashboard. That's a starting point. We're building toward evals that actively tell you when something goes wrong:

  • Email and Slack alerts — get notified when average scores drop below a threshold, so you don't have to watch the dashboard
  • Webhook integrations — pipe evaluation results into your existing monitoring and incident response workflows
  • Scheduled reports — weekly quality digests showing score trends, top regressions, and flagged responses

The goal is to make quality monitoring as hands-off as the rest of Grepture's pipeline. Set your evaluators, set your thresholds, and trust that you'll hear about it when something degrades.
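The threshold check we have in mind is the straightforward one. A sketch of the rolling-average logic (the alerting feature itself isn't shipped yet; this only illustrates the idea):

```python
def should_alert(scores: list[float], threshold: float, window: int = 50) -> bool:
    """Fire when the rolling average of recent scores drops below threshold."""
    if len(scores) < window:
        return False  # not enough data yet to trust the average
    recent = scores[-window:]
    return sum(recent) / window < threshold

healthy = [0.8] * 60
degraded = [0.8] * 30 + [0.3] * 30  # quality drops halfway through
assert not should_alert(healthy, threshold=0.6)
assert should_alert(degraded, threshold=0.6)
```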

Try it

If you're already on Grepture, head to the Evals tab in your dashboard and create your first evaluator. Start with one of the templates — relevance or hallucination are good first picks — set a low sampling rate, and let it run for a day. You'll have real quality data on your production AI traffic by tomorrow morning.

If you're not on Grepture yet, this is a good reason to get started. Drop in the SDK, point your traffic through the proxy (or use trace mode for zero-latency logging), and you'll have both cost visibility and quality scoring from day one.