Changelog

Datasets & Experiments

Turn production logs into curated test suites, then run experiments to compare prompt versions and models with LLM-as-a-judge scoring.

Datasets let you capture real production interactions and use them as evaluation test suites. Build datasets three ways: manually curate examples, import directly from your traffic logs, or set up auto-collection rules that continuously capture matching requests by model, provider, evaluator score, or content pattern.
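To make the auto-collection idea concrete, here is a minimal sketch of how a rule might filter logged requests by model, provider, evaluator score, or content pattern. This is not Grepture's API: `LoggedRequest`, `CollectionRule`, `max_score`, and `collect` are hypothetical names chosen for illustration, and the choice to capture low-scoring requests is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoggedRequest:
    # Hypothetical shape of one logged production request.
    model: str
    provider: str
    evaluator_score: Optional[float]  # None if no evaluator has scored it
    content: str


@dataclass
class CollectionRule:
    # Hypothetical rule: every field that is set must match (logical AND).
    model: Optional[str] = None
    provider: Optional[str] = None
    max_score: Optional[float] = None      # capture requests scoring at or below this
    content_pattern: Optional[str] = None  # regular expression searched in the content

    def matches(self, req: LoggedRequest) -> bool:
        if self.model is not None and req.model != self.model:
            return False
        if self.provider is not None and req.provider != self.provider:
            return False
        if self.max_score is not None:
            # Unscored requests cannot satisfy a score condition.
            if req.evaluator_score is None or req.evaluator_score > self.max_score:
                return False
        if self.content_pattern is not None:
            if not re.search(self.content_pattern, req.content):
                return False
        return True


def collect(rule: CollectionRule, requests: List[LoggedRequest]) -> List[LoggedRequest]:
    """Return the requests that the rule would add to a dataset."""
    return [r for r in requests if rule.matches(r)]


if __name__ == "__main__":
    logs = [
        LoggedRequest("model-a", "provider-x", 0.35, "Please refund my order"),
        LoggedRequest("model-a", "provider-x", 0.92, "Please refund my order"),
        LoggedRequest("model-b", "provider-y", 0.20, "What is the weather?"),
    ]
    rule = CollectionRule(model="model-a", max_score=0.5, content_pattern=r"refund")
    for req in collect(rule, logs):
        print(req.model, req.evaluator_score, req.content)
```

Combining conditions with AND keeps each rule narrow, so a dataset fills with the specific edge cases you care about rather than a broad slice of traffic.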

Once you have a dataset, run experiments against it. Pick a prompt version, a model, and your evaluators; Grepture then executes every item in the dataset and scores the outputs with LLM judges. Compare experiments side by side to see exactly which items improved or regressed when you change a prompt.
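The per-item comparison can be sketched roughly as follows. This is an illustration under stated assumptions, not how Grepture computes it: each experiment is assumed to be a plain mapping from dataset item ID to a numeric judge score, and `compare_experiments` and `tolerance` are hypothetical names.

```python
from typing import Dict, List


def compare_experiments(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, List[str]]:
    """Bucket item IDs by how their score changed from baseline to candidate.

    Only items present in both experiments are compared. A change within
    +/- tolerance counts as unchanged, which helps absorb judge noise.
    """
    result: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for item_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[item_id] - baseline[item_id]
        if delta > tolerance:
            result["improved"].append(item_id)
        elif delta < -tolerance:
            result["regressed"].append(item_id)
        else:
            result["unchanged"].append(item_id)
    return result


if __name__ == "__main__":
    # Hypothetical scores for the same three dataset items under two prompt versions.
    prompt_v1 = {"item-1": 0.5, "item-2": 0.9, "item-3": 0.7}
    prompt_v2 = {"item-1": 0.8, "item-2": 0.6, "item-3": 0.7}
    print(compare_experiments(prompt_v1, prompt_v2))
```

Looking at items individually rather than at an average matters because an unchanged mean score can hide a mix of improvements and regressions.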

The full loop: capture edge cases from production, build a dataset, test your changes, ship with confidence.