A/B Test Your Prompts in Production
Run controlled experiments on prompt versions with real traffic. Split requests, measure quality, and pick the winner.
You rewrote the prompt. Is it actually better?
You've been iterating on a prompt. The new version looks cleaner, handles edge cases better, and costs fewer tokens. You tested it on a dozen examples in the playground. It seems better.
But "seems better on 12 examples" is not the same as "performs better on 10,000 real requests." Prompts interact with real user input in ways that playground testing can't predict. A prompt that's sharper for support tickets might hallucinate on billing questions. One that's cheaper might lose quality on complex queries.
The only way to know is to test on real traffic. That's what prompt experiments do.
Why prompt A/B testing matters
Prompts are code. They drift, they regress, and they interact with model updates in surprising ways. When a provider ships a new model version, your carefully tuned prompt might behave differently — and you won't know unless you're measuring.
Most teams either:
- Ship and pray — Push the new prompt, watch for complaints, hope for the best.
- Test in a sandbox — Try a handful of examples, eyeball the results, call it done.
Neither gives you confidence. A/B testing prompts on production traffic gives you actual metrics: response quality, latency impact, cost delta, and success rates across your real user distribution — not a curated set of test cases.
How prompt experiments work in Grepture
Grepture's prompt management system supports experiments natively. Here's the flow:
1. Publish two or more prompt versions
Each prompt in Grepture has versions. You can have a published v3 (your current production prompt) and a published v4 (the candidate). Both must be published — drafts can't participate in experiments.
2. Start an experiment
From the prompt editor, open the experiment panel and configure your variants:
- Select which versions to test (e.g., v3 and v4)
- Set traffic weights (e.g., 80% to v3, 20% to v4)
- Weights must sum to 100%
- You can test more than two variants if needed
3. Traffic routes automatically
When your application resolves a prompt by slug, Grepture randomly assigns each request to a variant based on the configured weights. No code change needed — if you're already using prompt slugs, experiments work transparently.
// Your code doesn't change — experiments happen at the gateway
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "{{grepture:support-assistant}}" },
    { role: "user", content: userMessage },
  ],
});
The support-assistant slug resolves to v3 or v4 based on the experiment weights. If you need to pin a specific version (e.g., for a request that must not be part of the experiment), use an explicit version: {{grepture:support-assistant@3}}.
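Conceptually, weighted assignment is a cumulative-weight lookup over the configured variants. Here's a minimal sketch of that idea (the `pickVariant` function and its shape are assumptions for illustration, not Grepture's actual gateway code):

```typescript
// Minimal sketch of weighted variant assignment. Hypothetical, not
// Grepture's internals.
type Variant = { version: string; weight: number }; // weight in percent

function pickVariant(variants: Variant[], roll: number): string {
  // `roll` is uniform in [0, 100); a caller would pass Math.random() * 100.
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (total !== 100) throw new Error("weights must sum to 100");
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (roll < cumulative) return v.version;
  }
  return variants[variants.length - 1].version; // guard for float edge cases
}

// An 80/20 split between v3 and v4, as in the example above.
const experiment: Variant[] = [
  { version: "v3", weight: 80 },
  { version: "v4", weight: 20 },
];
```

Passing the random roll in as a parameter keeps the routing logic deterministic and testable.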
4. Quality is measured automatically
When you start an experiment, Grepture automatically creates a Relevance evaluator if one doesn't already exist for that prompt. This is an LLM-as-a-judge eval that scores each response on a 0-to-1 scale.
You can also add custom evaluators — tone, accuracy, safety, or whatever matters for your use case.
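An LLM-as-a-judge evaluator ultimately has to reduce a judge model's free-text reply to a number on that 0-to-1 scale. A hedged sketch of that final step, assuming the judge was instructed to include a decimal score in its reply (the parsing logic here is illustrative, not Grepture's implementation):

```typescript
// Hypothetical: extract and clamp a 0-to-1 score from a judge model's reply.
// Assumes the judge prompt asks for a decimal score somewhere in the answer.
function parseJudgeScore(reply: string): number {
  const match = reply.match(/\d+(?:\.\d+)?/); // first number in the reply
  if (!match) throw new Error(`no score found in judge reply: "${reply}"`);
  const score = parseFloat(match[0]);
  return Math.min(1, Math.max(0, score)); // clamp into [0, 1]
}
```

Clamping matters because judge models occasionally reply with out-of-range values, and a single stray score would skew a variant's average.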
5. Monitor results in real time
The experiment dashboard shows per-variant metrics that refresh every 30 seconds:
- Requests — How many requests each variant has served
- Avg latency — Response time per variant (different prompts can have different token counts)
- Avg cost — Per-request cost (important if one version is more verbose)
- Success rate — Proportion of non-error responses
- Eval scores — Average quality score per evaluator, color-coded: green (>0.8), yellow (0.5–0.8), red (<0.5)
The variant with the highest eval score gets a trophy icon and an "Activate & End" button.
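The per-variant numbers above are straightforward rollups over request logs. A sketch under an invented log schema (the `RequestLog` field names are assumptions, not Grepture's data model):

```typescript
// Per-variant rollup of the dashboard metrics. The RequestLog shape is an
// assumption for illustration, not Grepture's data model.
type RequestLog = {
  variant: string;
  latencyMs: number;
  costUsd: number;
  ok: boolean;              // false for error responses
  evalScore: number | null; // null when this request wasn't sampled
};

type VariantStats = {
  requests: number;
  avgLatencyMs: number;
  avgCostUsd: number;
  successRate: number;
  avgEvalScore: number;
};

function rollup(logs: RequestLog[]): Map<string, VariantStats> {
  const byVariant = new Map<string, RequestLog[]>();
  for (const log of logs) {
    const bucket = byVariant.get(log.variant) ?? [];
    bucket.push(log);
    byVariant.set(log.variant, bucket);
  }
  const stats = new Map<string, VariantStats>();
  for (const [variant, rows] of byVariant) {
    const scored = rows.filter((r) => r.evalScore !== null);
    stats.set(variant, {
      requests: rows.length,
      avgLatencyMs: rows.reduce((s, r) => s + r.latencyMs, 0) / rows.length,
      avgCostUsd: rows.reduce((s, r) => s + r.costUsd, 0) / rows.length,
      successRate: rows.filter((r) => r.ok).length / rows.length,
      // Average only over sampled requests, so unsampled ones don't drag it down.
      avgEvalScore: scored.reduce((s, r) => s + (r.evalScore ?? 0), 0) / scored.length,
    });
  }
  return stats;
}
```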
6. Pick the winner
Once you have enough data, activate the winning version. Grepture sets it as the active version and ends the experiment. All traffic returns to the single active version.
Tips for running good experiments
Wait for sufficient traffic. A few dozen requests isn't enough. Statistical significance depends on your traffic volume and the effect size, but as a rule of thumb: wait until each variant has at least a few hundred requests before drawing conclusions.
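To put "enough data" on slightly firmer footing, you can run a quick two-proportion z-test on success rates, where |z| > 1.96 roughly corresponds to p < 0.05. This is a back-of-envelope check you'd run yourself, not a Grepture feature:

```typescript
// Two-proportion z-test for a difference in success rates between variants.
// A rough significance check, not a substitute for a proper analysis.
function twoProportionZ(
  successesA: number, nA: number,
  successesB: number, nB: number,
): number {
  const pA = successesA / nA;
  const pB = successesB / nB;
  const pooled = (successesA + successesB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return (pA - pB) / se;
}
```

Roughly the same observed gap (about 92% vs 75% success) is inconclusive at 12 requests per variant but decisive at 400 per variant, which is why a dozen playground examples can't settle the question.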
Change one thing at a time. If you change both the prompt text and the model in a single experiment, you won't know which change drove the result. Test prompt changes and model switches in separate experiments.
Use version pins for sensitive requests. If some requests must always use the current production prompt (e.g., compliance-critical flows), pin them with slug@version syntax. Pinned requests bypass the experiment entirely.
Check cost alongside quality. A variant that scores 2% higher on relevance but costs 40% more per request might not be worth it. The dashboard shows both — make an informed tradeoff.
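The cost side of that tradeoff scales linearly with volume, so it's worth projecting before activating. A worked example with hypothetical numbers:

```typescript
// Hypothetical numbers: project the monthly cost of activating a variant that
// is `costIncrease` more expensive per request (e.g. 0.4 for 40% more).
function monthlyCostDelta(
  requestsPerMonth: number,
  baseCostUsd: number,
  costIncrease: number,
): number {
  return requestsPerMonth * baseCostUsd * costIncrease;
}

// 1M requests/month at a $0.002 base cost, 40% more expensive per request:
// roughly $800/month extra for that 2% quality gain.
const delta = monthlyCostDelta(1_000_000, 0.002, 0.4);
```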
Run evaluators at 100% sampling during experiments. This is the default when Grepture auto-creates the evaluator. If you're using a custom evaluator, make sure the sampling rate is high enough to get meaningful per-variant data.
From PromptOps to data-driven prompts
Prompt experiments close the loop on PromptOps. You already version your prompts and deploy them server-side; now you can test them with real traffic before committing. It's the same workflow software engineers use for feature flags and canary deploys, applied to prompts.
The gateway architecture makes this possible without SDK changes or application deploys. Because Grepture resolves prompts at the proxy layer, you can start, monitor, and end experiments entirely from the dashboard.
Key takeaways
- A/B testing prompts on real traffic eliminates guesswork — measure quality, cost, and latency across your actual user distribution.
- No code changes needed — if you're using prompt slugs, experiments work transparently at the gateway layer.
- Auto-created evaluators score each variant with LLM-as-a-judge, so you get quality metrics from day one.
- Change one thing at a time — isolate prompt changes from model changes for clean experiment results.
- Check cost alongside quality — the dashboard shows both, so you can make informed tradeoffs.