Evals
Automatically score your AI outputs with LLM-as-a-Judge evaluators. Pre-built templates, custom prompts, and continuous evaluation on your traffic logs.
Overview
Grepture Evals lets you automatically score the quality of your AI outputs using LLM-as-a-Judge evaluation. Because Grepture already logs every request and response flowing through your app, evaluations run on your existing traffic — zero additional instrumentation needed.
Evals is available on the Pro and Business plans.
How it works
- Create an evaluator — pick a pre-built template or write a custom judge prompt
- Configure filters — optionally scope the evaluator to specific models, providers, or prompts
- Set a sampling rate — evaluate 100% of traffic or sample a percentage for a representative signal
- Scores appear automatically — Grepture evaluates your traffic logs and stores scores continuously
Every score includes:
- A numeric score from 0 to 1
- Reasoning explaining the score
Pre-built templates
Grepture ships with six managed evaluator templates:
| Template | What it measures |
|---|---|
| Relevance | How well the response addresses the user's question |
| Helpfulness | How practical and actionable the response is |
| Toxicity | Safety score — 1 = safe, 0 = toxic/harmful |
| Conciseness | How concise the response is without losing information |
| Instruction Following | How well the response follows system instructions |
| Hallucination | Groundedness — 1 = grounded, 0 = fabricated claims |
Each template includes a full judge prompt. You can use them as-is or customize the prompt to fit your use case.
Custom evaluators
For domain-specific quality criteria, create a custom evaluator with your own judge prompt. Your prompt can use three template variables:
- `{{input}}` — the user's message (last user message from the request)
- `{{output}}` — the AI's response (assistant message from the response)
- `{{system}}` — the system prompt (if present in the request)
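The variable substitution above can be sketched as plain string replacement. The three variable names come from this page; the rendering function and example prompt are illustrative, not Grepture's implementation:

```python
# Sketch: substituting the three supported template variables into a
# custom judge prompt before sending it to the judge model.
# render_judge_prompt is a hypothetical helper for illustration.

def render_judge_prompt(template: str, input_text: str, output_text: str,
                        system_text: str = "") -> str:
    """Replace {{input}}, {{output}}, and {{system}} in a judge prompt."""
    return (template
            .replace("{{input}}", input_text)
            .replace("{{output}}", output_text)
            .replace("{{system}}", system_text))

template = (
    "Rate how well the response answers the question, from 0 to 1.\n"
    "Question: {{input}}\n"
    "Response: {{output}}\n"
    'Return JSON: {"score": <0-1>, "reasoning": "<why>"}'
)

prompt = render_judge_prompt(
    template,
    input_text="What is a vector database?",
    output_text="A vector database stores embeddings...",
)
```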
The judge prompt must instruct the model to return JSON in this format:
```json
{"score": 0.85, "reasoning": "The response directly addresses..."}
```
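A minimal sketch of validating a judge reply against this shape, assuming you receive the raw JSON string back from the judge model (the parsing helper is illustrative, not part of Grepture):

```python
import json

def parse_judge_reply(raw: str) -> tuple[float, str]:
    """Validate a judge reply of the form {"score": float, "reasoning": str}."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, str(data["reasoning"])

score, reasoning = parse_judge_reply(
    '{"score": 0.85, "reasoning": "The response directly addresses..."}'
)
# score → 0.85
```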
Filters
Each evaluator can be scoped to specific traffic using filters:
- Model — only evaluate logs from a specific model (e.g., `gpt-4o`)
- Provider — only evaluate logs from a specific provider (e.g., `openai`)
- Prompt ID — only evaluate logs using a specific managed prompt
- Status code range — only evaluate successful responses (e.g., 200-299)
Leave all filters blank to evaluate all traffic.
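The filter logic above can be sketched as a simple predicate over a traffic log. The field names (`model`, `provider`, `prompt_id`, `status`) and the filter shape here are hypothetical illustrations, not Grepture's actual schema:

```python
# Sketch: an evaluator only runs on logs that pass every configured
# filter; blank (missing) filters match all traffic.

def matches(log: dict, flt: dict) -> bool:
    if flt.get("model") and log.get("model") != flt["model"]:
        return False
    if flt.get("provider") and log.get("provider") != flt["provider"]:
        return False
    if flt.get("prompt_id") and log.get("prompt_id") != flt["prompt_id"]:
        return False
    status_range = flt.get("status_range")
    if status_range:
        lo, hi = status_range
        if not lo <= log.get("status", 0) <= hi:
            return False
    return True

log = {"model": "gpt-4o", "provider": "openai", "status": 200}
assert matches(log, {"model": "gpt-4o", "status_range": (200, 299)})
assert matches(log, {})  # all filters blank → evaluate all traffic
```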
Sampling rate
Set the sampling rate from 1% to 100% to control how many logs are evaluated. At 100%, every matching log is scored. At 10%, roughly 1 in 10 matching logs is scored. Lower sampling rates are useful when you only need a representative quality signal rather than exhaustive coverage.
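Per-log sampling like this is typically a Bernoulli draw against the configured rate; the sketch below illustrates the idea and is not Grepture's implementation:

```python
import random

# Sketch: each matching log is independently scored with probability
# equal to the sampling rate, so a 10% rate scores roughly 1 in 10 logs.

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    return rng.random() < sampling_rate

rng = random.Random(42)  # seeded for a reproducible demonstration
evaluated = sum(should_evaluate(0.10, rng) for _ in range(10_000))
# evaluated lands near 1,000 (about 10% of 10,000 logs)
```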
Viewing scores
Scores are visible in three places:
- Evals page → Scores tab — chronological list of all scores with evaluator name, score badge, and reasoning
- Traffic log detail sheet — each log shows its evaluation scores in the right panel
- Evals page → Analytics tab — aggregated stats and score trends over time
Score badges are color-coded:
- Green (> 0.7) — good quality
- Yellow (0.4–0.7) — needs attention
- Red (< 0.4) — poor quality
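The badge thresholds above can be expressed as a small mapping function (green strictly above 0.7, red strictly below 0.4, yellow for the inclusive band in between):

```python
def badge_color(score: float) -> str:
    """Map a 0-1 score to its badge color, per the documented thresholds."""
    if score > 0.7:
        return "green"   # good quality
    if score < 0.4:
        return "red"     # poor quality
    return "yellow"      # 0.4-0.7 inclusive: needs attention
```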
Analytics
The Evals analytics tab shows:
- Average score across all evaluators
- Total evaluations processed
- Per-evaluator breakdown with average scores and eval counts
- Score trend chart — daily average score per evaluator over time
Use the analytics to track quality trends, catch regressions, and compare evaluators.
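The per-evaluator daily averages behind the score trend chart amount to a group-by over stored scores. A minimal sketch, assuming a flat list of score records with hypothetical field names:

```python
from collections import defaultdict

# Sketch: bucket scores by (evaluator, day), then average each bucket.
# The record shape here is illustrative, not Grepture's storage schema.

scores = [
    {"evaluator": "Relevance", "day": "2024-06-01", "score": 1.0},
    {"evaluator": "Relevance", "day": "2024-06-01", "score": 0.5},
    {"evaluator": "Toxicity",  "day": "2024-06-01", "score": 1.0},
]

buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
for s in scores:
    buckets[(s["evaluator"], s["day"])].append(s["score"])

daily_avg = {key: sum(vals) / len(vals) for key, vals in buckets.items()}
# daily_avg[("Relevance", "2024-06-01")] → 0.75
```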