The problem with static test suites
Every team building with LLMs eventually writes a test suite. You pick a few dozen inputs, write down what a good response looks like, and run them before each deploy. It catches obvious regressions and gives you something to point to when someone asks "how do you know it's working?"
The problem is that test suites go stale almost immediately.
The examples you wrote last quarter reflect what you thought your users would ask. Real users write differently — messier inputs, unexpected combinations, domain-specific jargon you didn't anticipate. Over time, the distribution of your test suite drifts further and further from the distribution of your actual traffic. You keep passing the tests and missing the failures.
There's also the maintenance problem. Edge cases you care about today weren't in your original test suite. You find a bad response in production, make a mental note to add it to the suite, and never get around to it. The cases that matter most are exactly the ones that don't make it into the fixture files.
The deeper issue is that static eval sets require you to predict what matters. Production traffic doesn't. It shows you.
Your traffic is already a test suite
The requests flowing through your AI pipeline every day contain something a curated fixture file never will: surprise. Real users hit cases you didn't design for. They write prompts that expose gaps in your system prompt. They combine inputs in ways that cause your model to hallucinate, drift off-topic, or produce outputs you'd be embarrassed to ship.
That information is already sitting in your logs. The only problem is that it's unstructured — a firehose of traffic with no way to systematically capture the interesting parts and reuse them.
This is what Grepture Datasets solves. It gives you the machinery to take production traffic and turn it into a structured, reusable evaluation set — without requiring you to export data, build a pipeline, or maintain a separate system.
Three ways to build a dataset
Not all datasets have the same origin. Grepture supports three ways to build them, and in practice you end up using all three.
1. Manual curation
Sometimes you know exactly what you want. You have a set of inputs that represent important user scenarios — an onboarding flow, a specific type of support ticket, a complex multi-step reasoning task. You build the dataset by hand, specifying each input/output pair directly.
This is the right approach for scenarios you want to guarantee coverage on. A golden set of 20-30 high-stakes examples that you maintain deliberately. Slow to build, but high signal.
2. Import from traffic logs
This is the more common path. You're reviewing your traffic logs — maybe investigating a complaint, checking scores from a recent eval run, or just doing a quality spot-check — and you see a request that's worth preserving.
In Grepture, adding it to a dataset takes one click. Browse your traffic log, find the request, and hit "Add to Dataset." You can add it to an existing dataset or create a new one on the spot. The input, output, model, and metadata all come along.
Over time, this is how your best datasets get built. Not in a single planning session, but as a natural byproduct of doing quality work — catching interesting cases as you encounter them, not retroactively.
3. Auto-collection rules
The most powerful mode. Instead of manually selecting requests, you define a filter and Grepture captures matching traffic automatically as it arrives.
Rules can filter on:
- Model or provider — collect only traffic routed through a specific model
- Prompt — collect only requests that used a particular managed prompt
- Status code — capture errors or successes selectively
- Evaluator scores — collect requests that scored below 0.5 on your hallucination evaluator, for example
That last one is especially useful. If you have evaluators running on production traffic, you can point auto-collection at the low-scoring tail. Every request where your judge flagged a potential problem automatically lands in a dataset. You end up with a continuously growing collection of the actual failure cases — not a hand-picked sample, but a living record of where your system struggles.
Set it once and forget it. The dataset grows on its own.
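The filters above compose as an AND: a request is captured only when it satisfies every condition the rule sets. Here's a minimal sketch of that matching logic in plain Python, where the `CollectionRule` fields and request keys are illustrative stand-ins, not Grepture's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CollectionRule:
    # All fields optional: an unset field means "don't filter on this".
    model: Optional[str] = None          # e.g. "gpt-4o"
    prompt_id: Optional[str] = None      # a managed prompt's ID
    status_code: Optional[int] = None    # capture errors or successes
    max_score: Optional[float] = None    # evaluator score threshold

def matches(rule: CollectionRule, request: dict) -> bool:
    """True if a logged request satisfies every filter the rule sets."""
    if rule.model is not None and request.get("model") != rule.model:
        return False
    if rule.prompt_id is not None and request.get("prompt_id") != rule.prompt_id:
        return False
    if rule.status_code is not None and request.get("status") != rule.status_code:
        return False
    if rule.max_score is not None and request.get("score", 1.0) >= rule.max_score:
        return False
    return True

# Collect the low-scoring tail of evaluator-scored traffic.
rule = CollectionRule(max_score=0.5)
traffic = [
    {"model": "gpt-4o", "score": 0.3, "status": 200},
    {"model": "gpt-4o", "score": 0.9, "status": 200},
]
captured = [r for r in traffic if matches(rule, r)]
# Only the 0.3-scoring request is captured.
```

The useful property is that an unset field never excludes anything, so a rule with a single filter (like `max_score=0.5` above) behaves exactly as described: everything below the threshold lands in the dataset.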
Running experiments against a dataset
A dataset is most useful when you can test against it systematically. That's what experiments are for.
Grepture experiments have two modes, and the difference matters.
Run mode takes a prompt version and a model, executes every input in the dataset against that configuration, and then scores the results with your chosen evaluators. This is how you answer questions like: "If I switch to this new prompt, how does it perform across my known edge cases?" You're generating new outputs and measuring them.
Evaluate mode re-scores existing outputs. Maybe you collected a dataset three weeks ago and want to apply a new evaluator that didn't exist then. Or you want to score the same outputs with a stricter judge. You're not re-running the prompts — you're running new evaluation criteria over outputs that are already there.
To start an experiment:
- Pick a dataset
- Choose a prompt version (from your managed prompts) and model
- Select one or more evaluators
- Run
Grepture executes every dataset item, scores each output with your judge, and stores the results. You get per-item scores (0-to-1) alongside the judge's written reasoning for why it gave that score. Tokens consumed, cost, and duration are tracked per item so you know what the experiment cost to run.
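The mechanics of run mode are easy to picture. Here's a rough sketch of the loop in plain Python, with `generate` and `judge` as stubs standing in for the model call and the LLM judge — none of these names are Grepture APIs:

```python
def generate(prompt_version: str, model: str, item_input: str) -> str:
    # Stub for the model call made with the chosen prompt version.
    return f"[{model}/{prompt_version}] answer to: {item_input}"

def judge(output: str) -> tuple[float, str]:
    # Stub for the evaluator: a 0-to-1 score plus written reasoning.
    return 0.8, "Relevant and grounded."

def run_experiment(dataset: list[dict], prompt_version: str, model: str) -> list[dict]:
    """Execute every dataset input, score each output, store per-item results."""
    results = []
    for item in dataset:
        output = generate(prompt_version, model, item["input"])
        score, reasoning = judge(output)
        results.append({
            "input": item["input"],
            "output": output,
            "score": score,
            "reasoning": reasoning,
        })
    return results

dataset = [{"input": "Summarize our refund policy"}]
results = run_experiment(dataset, "v2", "gpt-4o")
```

Evaluate mode is the same loop minus the `generate` step: the stored outputs are fed straight to a (possibly new) judge.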
This is useful on its own. But it gets more useful when you run a second experiment and compare.
Comparing experiments
Prompt changes are cheap to make and expensive to evaluate incorrectly. You can rewrite a prompt in five minutes and spend weeks chasing down the regressions it introduced.
The compare view in Grepture shows two experiments side by side, item by item. For every input in your dataset, you see:
- The output from experiment A
- The score from experiment A
- The output from experiment B
- The score from experiment B
Items where B outperformed A are highlighted. So are items where A was better. Items with significant score changes — in either direction — are easy to find.
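Under the hood this is a per-item join and diff. A sketch in plain Python, assuming each experiment's results are a list of dicts keyed by input — an illustrative shape, not Grepture's actual data model:

```python
def compare(exp_a: list[dict], exp_b: list[dict], threshold: float = 0.2) -> list[dict]:
    """Pair items by input, compute score deltas, flag significant changes."""
    b_by_input = {r["input"]: r for r in exp_b}
    rows = []
    for a in exp_a:
        b = b_by_input[a["input"]]
        delta = b["score"] - a["score"]
        rows.append({
            "input": a["input"],
            "score_a": a["score"],
            "score_b": b["score"],
            "delta": delta,
            # Flag movement in either direction past the threshold.
            "flagged": abs(delta) >= threshold,
        })
    return rows

exp_a = [{"input": "q1", "score": 0.4}, {"input": "q2", "score": 0.9}]
exp_b = [{"input": "q1", "score": 0.8}, {"input": "q2", "score": 0.85}]
rows = compare(exp_a, exp_b)
# q1 improved by 0.4 and is flagged; q2 moved by -0.05 and is not.
```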
This turns what used to be a subjective judgment ("I think the new prompt is better") into a structured diff. You can see not just the overall score change but exactly which inputs drove it. A prompt that's better on average but worse on a specific class of edge case is visible immediately, rather than something you discover from a production incident two weeks later.
You can compare across prompt versions, across models, or across both. If you're evaluating whether to switch from GPT-4o to a newer model, run the same dataset against both configurations and compare the results. The score difference is your answer.
The workflow in practice
The individual features — datasets, auto-collection, experiments, comparison — fit together into a workflow that makes prompt iteration more systematic.
You start by building a dataset: a few manually curated examples, plus an auto-collection rule pointed at your low-scoring traffic. Let it run for a week. You now have a dataset that includes your golden cases and a growing collection of real failures.
When you want to change a prompt, you open the dataset, start an experiment, and run the new prompt version against it. You see how it performs on the examples you care about and on the failure cases you've collected. If the new version improves scores on the failures without regressing on the golden cases, you have evidence it's a real improvement.
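That shipping rule can be stated precisely: no golden case regresses beyond a small tolerance, and the average score on the collected failures doesn't drop. A sketch of such a gate in plain Python — the function and its inputs are illustrative, not a Grepture feature:

```python
def passes_gate(baseline: list[dict], candidate: list[dict],
                golden_inputs: set[str], tolerance: float = 0.05) -> bool:
    """Ship only if goldens hold steady and failure cases improve on average."""
    base = {r["input"]: r["score"] for r in baseline}
    cand = {r["input"]: r["score"] for r in candidate}

    # Any golden case dropping past the tolerance blocks the release.
    for inp in golden_inputs:
        if cand[inp] < base[inp] - tolerance:
            return False

    # The collected failures must not get worse in aggregate.
    failures = [i for i in base if i not in golden_inputs]
    if failures:
        def avg(scores): return sum(scores) / len(scores)
        return avg([cand[i] for i in failures]) >= avg([base[i] for i in failures])
    return True

baseline = [{"input": "g1", "score": 0.9}, {"input": "f1", "score": 0.3}]
candidate = [{"input": "g1", "score": 0.88}, {"input": "f1", "score": 0.7}]
ok = passes_gate(baseline, candidate, golden_inputs={"g1"})
# Golden case holds (0.88 vs 0.9, within tolerance), failure improves: gate passes.
```

The tolerance is the judgment call: it acknowledges that judge scores are noisy, so a tiny dip on a golden case shouldn't block an otherwise clear win.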
You publish the prompt. You keep the dataset. The next time you change the prompt, you run it against the same dataset again. Over time, the dataset becomes a record of your quality floor — the minimum bar your prompts have to clear before you ship.
This is the loop: capture edge cases from production, build a dataset, test prompt changes against the dataset, ship with confidence. It's the same discipline that makes software engineering systematic applied to the prompt layer.
Datasets and evaluators are better together
Datasets are a Pro tier feature, and they're designed to work alongside evaluators. If you're already running evals on production traffic, you have two things datasets need: a stream of scored requests to seed auto-collection rules, and judge prompts to use as experiment evaluators.
If you're not running evals yet, the Evals tab is the right starting point. Get a relevance or hallucination evaluator running on your traffic, let it build up some history, then start collecting low-scoring requests into a dataset. Within a week or two you'll have a meaningful collection to run experiments against.
The combination changes what quality work looks like. Instead of one-off spot-checks and gut-feel judgments about whether a prompt change is better, you have a repeatable process: build a dataset from real failures, test changes against it, measure the delta, ship the winner.