
Evals

Automatically score your AI outputs with LLM-as-a-Judge evaluators. Pre-built templates, custom prompts, and continuous evaluation on your traffic logs.

Overview

Grepture Evals lets you automatically score the quality of your AI outputs using LLM-as-a-Judge evaluation. Because Grepture already logs every request and response flowing through your app, evaluations run on your existing traffic — zero additional instrumentation needed.

Evals is available on the Pro and Business plans.

How it works

  1. Create an evaluator — pick a pre-built template or write a custom judge prompt
  2. Configure filters — optionally scope the evaluator to specific models, providers, or prompts
  3. Set a sampling rate — evaluate 100% of traffic or sample a percentage for a representative signal
  4. Scores appear automatically — Grepture evaluates your traffic logs and stores scores continuously
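The loop behind these steps can be sketched roughly as follows. This is a minimal illustration, not Grepture's actual internals: `call_judge_model`, the prompt shape, and the log/score dict shapes are all hypothetical.

```python
import json
import random

# Hypothetical judge prompt; the trailing {{ }} escapes literal braces in str.format.
JUDGE_PROMPT = (
    "Rate how well the assistant's response addresses the user's question.\n"
    "User: {input}\nAssistant: {output}\n"
    'Return JSON only: {{"score": <0..1>, "reasoning": "<why>"}}'
)

def evaluate_log(log, sampling_rate_pct, call_judge_model, rng=random.random):
    """Score one traffic log, honoring the sampling rate.

    `log` is a dict with "input" and "output"; `call_judge_model` is a
    hypothetical stand-in for the LLM-as-a-Judge call.
    """
    if rng() * 100 >= sampling_rate_pct:
        return None  # log not sampled for evaluation
    raw = call_judge_model(JUDGE_PROMPT.format(**log))
    result = json.loads(raw)
    return {"score": result["score"], "reasoning": result["reasoning"]}
```

The key point is that the evaluator runs against logs already captured by the proxy, so the application code never changes.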

Every score includes:

  • A numeric score from 0 to 1
  • Reasoning explaining the score

Pre-built templates

Grepture ships with six managed evaluator templates:

Template               What it measures
Relevance              How well the response addresses the user's question
Helpfulness            How practical and actionable the response is
Toxicity               Safety score — 1 = safe, 0 = toxic/harmful
Conciseness            How concise the response is without losing information
Instruction Following  How well the response follows system instructions
Hallucination          Groundedness — 1 = grounded, 0 = fabricated claims

Each template includes a full judge prompt. You can use them as-is or customize the prompt to fit your use case.

Custom evaluators

For domain-specific quality criteria, create a custom evaluator with your own judge prompt. Your prompt can use three template variables:

  • {{input}} — the user's message (last user message from the request)
  • {{output}} — the AI's response (assistant message from the response)
  • {{system}} — the system prompt (if present in the request)

The judge prompt must instruct the model to return JSON in this format:

{"score": 0.85, "reasoning": "The response directly addresses..."}

Filters

Each evaluator can be scoped to specific traffic using filters:

  • Model — only evaluate logs from a specific model (e.g., gpt-4o)
  • Provider — only evaluate logs from a specific provider (e.g., openai)
  • Prompt ID — only evaluate logs using a specific managed prompt
  • Status code range — only evaluate successful responses (e.g., 200-299)

Leave all filters blank to evaluate all traffic.
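Conceptually, filter matching works like the sketch below. The dict shapes (`log` keys, `filters` keys, a `(lo, hi)` tuple for the status range) are illustrative assumptions; the semantics — blank filters match everything, set filters must all match — follow the list above.

```python
def matches_filters(log: dict, filters: dict) -> bool:
    """Return True if a traffic log passes an evaluator's filters.

    Hypothetical shapes: log carries "model", "provider", "prompt_id",
    "status"; filters may carry the same keys plus a (lo, hi) status range.
    Missing/None filters match all traffic.
    """
    for key in ("model", "provider", "prompt_id"):
        wanted = filters.get(key)
        if wanted is not None and log.get(key) != wanted:
            return False
    status_range = filters.get("status_range")
    if status_range is not None:
        lo, hi = status_range
        if not (lo <= log.get("status", 0) <= hi):
            return False
    return True
```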

Sampling rate

Set the sampling rate from 1% to 100% to control how many logs are evaluated. At 100%, every matching log is scored. At 10%, roughly 1 in 10 matching logs is scored. Lower sampling rates are useful when you only need a representative quality signal rather than exhaustive coverage.
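The sampling described above amounts to an independent coin flip per matching log, which a sketch makes concrete (the function name is mine, not Grepture's):

```python
import random

def should_evaluate(sampling_rate_pct: int, rng: random.Random) -> bool:
    """Bernoulli sample: evaluate roughly sampling_rate_pct% of matching logs.

    rng.random() is uniform in [0, 1), so a rate of 100 always evaluates
    and a rate of 10 evaluates about 1 log in 10.
    """
    return rng.random() * 100 < sampling_rate_pct
```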

Viewing scores

Scores are visible in three places:

  1. Evals page → Scores tab — chronological list of all scores with evaluator name, score badge, and reasoning
  2. Traffic log detail sheet — each log shows its evaluation scores in the right panel
  3. Evals page → Analytics tab — aggregated stats and score trends over time

Score badges are color-coded:

  • Green (> 0.7) — good quality
  • Yellow (0.4–0.7) — needs attention
  • Red (< 0.4) — poor quality
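The thresholds above reduce to a simple mapping. Exact boundary handling (0.7 and 0.4 falling in the yellow band) is my reading of the ranges listed:

```python
def score_badge(score: float) -> str:
    """Map a 0-1 score to its color-coded badge."""
    if score > 0.7:
        return "green"   # good quality
    if score >= 0.4:
        return "yellow"  # needs attention
    return "red"         # poor quality
```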

Analytics

The Evals analytics tab shows:

  • Average score across all evaluators
  • Total evaluations processed
  • Per-evaluator breakdown with average scores and eval counts
  • Score trend chart — daily average score per evaluator over time
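The per-evaluator breakdown above is a straightforward group-and-average over score records; a sketch, assuming an illustrative record shape of `{"evaluator": str, "score": float}`:

```python
from collections import defaultdict
from statistics import mean

def per_evaluator_stats(scores: list[dict]) -> dict:
    """Aggregate raw scores into average score and eval count per evaluator."""
    by_eval = defaultdict(list)
    for rec in scores:
        by_eval[rec["evaluator"]].append(rec["score"])
    return {
        name: {"avg_score": mean(vals), "count": len(vals)}
        for name, vals in by_eval.items()
    }
```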

Use the analytics to track quality trends, catch regressions, and compare evaluators.