
Evals

Automatically score your AI outputs with LLM-as-a-Judge evaluators. Pre-built templates, custom prompts, and continuous evaluation on your traffic logs.

Overview

Grepture Evals lets you automatically score the quality of your AI outputs using LLM-as-a-Judge evaluation. Because Grepture already logs every request and response flowing through your app, evaluations run on your existing traffic — zero additional instrumentation needed.

Evals is available on the Pro and Business plans.

How it works

  1. Create an evaluator — pick a pre-built template or write a custom judge prompt
  2. Configure filters — optionally scope the evaluator to specific models, providers, or prompts
  3. Set a sampling rate — evaluate 100% of traffic or sample a percentage for a representative signal
  4. Scores appear automatically — Grepture evaluates your traffic logs and stores scores continuously
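The loop behind these steps can be sketched roughly as follows. This is a minimal illustration, not Grepture's actual internals: `call_judge_model`, the prompt shape, and the log/score dict shapes are all hypothetical.

```python
import json
import random

# Hypothetical judge prompt; the trailing {{ }} escapes literal braces in str.format.
JUDGE_PROMPT = (
    "Rate how well the assistant's response addresses the user's question.\n"
    "User: {input}\nAssistant: {output}\n"
    'Return JSON only: {{"score": <0..1>, "reasoning": "<why>"}}'
)

def evaluate_log(log, sampling_rate_pct, call_judge_model, rng=random.random):
    """Score one traffic log, honoring the sampling rate.

    `log` is a dict with "input" and "output"; `call_judge_model` is a
    hypothetical stand-in for the LLM-as-a-Judge call.
    """
    if rng() * 100 >= sampling_rate_pct:
        return None  # log not sampled for evaluation
    raw = call_judge_model(JUDGE_PROMPT.format(**log))
    result = json.loads(raw)
    return {"score": result["score"], "reasoning": result["reasoning"]}
```

The key point is that the evaluator runs against logs already captured by the proxy, so the application code never changes.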

Every score includes:

  • A numeric score from 0 to 1
  • Reasoning explaining the score

Pre-built templates

Grepture ships with six managed evaluator templates:

Template               What it measures
Relevance              How well the response addresses the user's question
Helpfulness            How practical and actionable the response is
Toxicity               Safety score — 1 = safe, 0 = toxic/harmful
Conciseness            How concise the response is without losing information
Instruction Following  How well the response follows system instructions
Hallucination          Groundedness — 1 = grounded, 0 = fabricated claims

Each template includes a full judge prompt. You can use them as-is or customize the prompt to fit your use case.

Custom evaluators

For domain-specific quality criteria, create a custom evaluator with your own judge prompt. Your prompt can use three template variables:

  • {{input}} — the user's message (last user message from the request)
  • {{output}} — the AI's response (assistant message from the response)
  • {{system}} — the system prompt (if present in the request)

The judge prompt must instruct the model to return JSON in this format:

{"score": 0.85, "reasoning": "The response directly addresses..."}

Filters

Each evaluator can be scoped to specific traffic using filters:

  • Model — only evaluate logs from a specific model (e.g., gpt-4o)
  • Provider — only evaluate logs from a specific provider (e.g., openai)
  • Prompt ID — only evaluate logs using a specific managed prompt
  • Status code range — only evaluate successful responses (e.g., 200-299)

Leave all filters blank to evaluate all traffic.
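Conceptually, filter matching works like the sketch below. The dict shapes (`log` keys, `filters` keys, a `(lo, hi)` tuple for the status range) are illustrative assumptions; the semantics — blank filters match everything, set filters must all match — follow the list above.

```python
def matches_filters(log: dict, filters: dict) -> bool:
    """Return True if a traffic log passes an evaluator's filters.

    Hypothetical shapes: log carries "model", "provider", "prompt_id",
    "status"; filters may carry the same keys plus a (lo, hi) status range.
    Missing/None filters match all traffic.
    """
    for key in ("model", "provider", "prompt_id"):
        wanted = filters.get(key)
        if wanted is not None and log.get(key) != wanted:
            return False
    status_range = filters.get("status_range")
    if status_range is not None:
        lo, hi = status_range
        if not (lo <= log.get("status", 0) <= hi):
            return False
    return True
```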

Sampling rate

Set the sampling rate from 1% to 100% to control how many logs are evaluated. At 100%, every matching log is scored. At 10%, roughly 1 in 10 matching logs is scored. Lower sampling rates are useful when you only need a representative quality signal rather than exhaustive coverage.
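The sampling described above amounts to an independent coin flip per matching log, which a sketch makes concrete (the function name is mine, not Grepture's):

```python
import random

def should_evaluate(sampling_rate_pct: int, rng: random.Random) -> bool:
    """Bernoulli sample: evaluate roughly sampling_rate_pct% of matching logs.

    rng.random() is uniform in [0, 1), so a rate of 100 always evaluates
    and a rate of 10 evaluates about 1 log in 10.
    """
    return rng.random() * 100 < sampling_rate_pct
```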

Viewing scores

Scores are visible in three places:

  1. Evals page → Scores tab — chronological list of all scores with evaluator name, score badge, and reasoning
  2. Traffic log detail sheet — each log shows its evaluation scores in the right panel
  3. Evals page → Analytics tab — aggregated stats and score trends over time

Score badges are color-coded:

  • Green (> 0.7) — good quality
  • Yellow (0.4–0.7) — needs attention
  • Red (< 0.4) — poor quality
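The thresholds above reduce to a simple mapping. Exact boundary handling (0.7 and 0.4 falling in the yellow band) is my reading of the ranges listed:

```python
def score_badge(score: float) -> str:
    """Map a 0-1 score to its color-coded badge."""
    if score > 0.7:
        return "green"   # good quality
    if score >= 0.4:
        return "yellow"  # needs attention
    return "red"         # poor quality
```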

Analytics

The Evals analytics tab shows:

  • Average score across all evaluators
  • Total evaluations processed
  • Per-evaluator breakdown with average scores and eval counts
  • Score trend chart — daily average score per evaluator over time
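The per-evaluator breakdown above is a straightforward group-and-average over score records; a sketch, assuming an illustrative record shape of `{"evaluator": str, "score": float}`:

```python
from collections import defaultdict
from statistics import mean

def per_evaluator_stats(scores: list[dict]) -> dict:
    """Aggregate raw scores into average score and eval count per evaluator."""
    by_eval = defaultdict(list)
    for rec in scores:
        by_eval[rec["evaluator"]].append(rec["score"])
    return {
        name: {"avg_score": mean(vals), "count": len(vals)}
        for name, vals in by_eval.items()
    }
```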

Use the analytics to track quality trends, catch regressions, and compare evaluators.