Trace Mode — Full Observability Without the Proxy Hop
Grepture now supports a trace-only mode where requests go directly to your LLM provider. The SDK captures tokens, costs, and latency asynchronously — same dashboard visibility, zero latency overhead.
The proxy tradeoff
Grepture's proxy gives you control. PII redaction, request blocking, prompt management — all handled before traffic reaches your LLM provider. But that control comes with a network hop. Your request goes to Grepture first, then to OpenAI or Anthropic, and the response comes back the same way.
For most workloads, the added latency is negligible. But we kept hearing from teams running latency-sensitive workloads — real-time agents, streaming UIs, high-throughput pipelines — who wanted the visibility without the detour. They didn't need rules or redaction. They just needed to see what was happening: which models, how many tokens, what it cost, how long it took.
So we built trace mode.
What trace mode is
One config change. Set mode: "trace" and requests go directly to your LLM provider — OpenAI, Anthropic, Google, whoever. No proxy in the hot path.
After the response arrives, the SDK captures metadata — model, tokens, latency, status code, cost — and batches it asynchronously to Grepture. Fire-and-forget. The trace call never blocks your application, and if it fails, your production code doesn't notice.
The dashboard stays the same. Traffic log, cost tracking, conversation tracing — all still work. You get the same visibility you'd get with proxy mode, minus the network hop.
The config change
The API surface is identical to proxy mode. You change one line:
import { Grepture } from "@grepture/sdk";
import OpenAI from "openai";
const grepture = new Grepture({
apiKey: process.env.GREPTURE_API_KEY!,
proxyUrl: "https://proxy.grepture.com",
mode: "trace", // ← direct to provider, traces sent async
});
const openai = new OpenAI({
...grepture.clientOptions({
apiKey: process.env.OPENAI_API_KEY!,
baseURL: "https://api.openai.com/v1",
}),
});
// This request goes directly to OpenAI — no proxy hop
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Summarize this quarter's results." }],
});
The clientOptions() call returns the same shape either way. In proxy mode, it rewrites the base URL to route through Grepture. In trace mode, it keeps the provider's URL and wraps fetch to capture metadata after the response.
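That after-the-fact capture can be sketched as a thin wrapper around fetch. Everything below is illustrative (wrapFetch, TraceEntry, and onTrace are made-up names, not SDK API); the point is that tracing happens strictly after the response is in hand, and a tracing failure never surfaces to application code:

```typescript
// Sketch of a trace-mode fetch wrapper (hypothetical internals).
// It forwards the request untouched, then records timing and status
// once the response arrives.
type TraceEntry = { url: string; status: number; durationMs: number };

function wrapFetch(
  baseFetch: typeof fetch,
  onTrace: (entry: TraceEntry) => void,
): typeof fetch {
  return async (input, init) => {
    const start = Date.now();
    const response = await baseFetch(input, init);
    // Capture happens after the response is obtained, so the caller
    // is never blocked by trace delivery.
    try {
      onTrace({
        url:
          typeof input === "string"
            ? input
            : input instanceof URL
              ? input.toString()
              : input.url,
        status: response.status,
        durationMs: Date.now() - start,
      });
    } catch {
      // Trace failures must never interrupt application traffic.
    }
    return response;
  };
}
```

Because the wrapper preserves fetch's signature, the provider SDK never knows it is being observed.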
If you're running in a serverless environment — Lambda, Vercel Functions, Cloudflare Workers — call flush() before the function exits to make sure pending traces are sent:
await grepture.flush();
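One way to make that hard to forget is a small wrapper that runs flush() in a finally block around your handler. This is a sketch under assumed names (withFlush and Flushable are illustrative, not part of the SDK):

```typescript
// Sketch: guarantee flush() runs before a serverless handler returns,
// even if the handler throws. Names here are illustrative.
type Flushable = { flush(): Promise<void> };

function withFlush<E, R>(
  client: Flushable,
  handler: (event: E) => Promise<R>,
): (event: E) => Promise<R> {
  return async (event) => {
    try {
      return await handler(event); // your LLM calls happen in here
    } finally {
      await client.flush(); // drain pending traces before the runtime freezes
    }
  };
}
```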
When to use which mode
Both modes exist for good reasons. The right choice depends on what you need from Grepture.
Trace mode is for
- Observability and cost tracking without any latency overhead
- Latency-sensitive production workloads — streaming responses, real-time agents, voice interfaces
- High-throughput pipelines where shaving milliseconds off every call matters at scale
- Getting started — you want visibility into your AI traffic before committing to rules and policies
Proxy mode is for
- PII redaction — catching sensitive data before it reaches the model, with 50+ built-in detection patterns
- Request blocking — stopping requests that match rules you define
- Prompt management — resolving prompts server-side with prompt.use() so your application doesn't need to bundle prompt templates
- Mask and restore — reversible redaction that keeps LLM responses useful while protecting PII
If you need to inspect or modify the request in flight, you need proxy mode. If you just need to see what happened after the fact, trace mode gives you the same dashboard with zero overhead.
How it works under the hood
The SDK handles two response types differently.
Buffered responses. The SDK clones the response object, reads the body in the background, and extracts usage metadata (model, tokens, status) from the JSON. Your application gets the original response immediately — the clone and extraction happen asynchronously after the response is returned to your code.
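A rough sketch of that buffered path, with hypothetical names (captureBuffered, Usage) and assuming an OpenAI-style usage.total_tokens field in the body:

```typescript
// Sketch of the buffered path (hypothetical internals): clone the
// response, hand the original back immediately, and read usage
// metadata from the clone in the background.
type Usage = { model?: string; totalTokens?: number; status: number };

function captureBuffered(
  response: Response,
  onUsage: (u: Usage) => void,
): Response {
  const clone = response.clone();
  // Background read: intentionally not awaited by the caller.
  clone
    .json()
    .then((body: any) =>
      onUsage({
        model: body?.model,
        totalTokens: body?.usage?.total_tokens,
        status: response.status,
      }),
    )
    .catch(() => {
      // Unparseable body: drop the trace, never the response.
    });
  return response;
}
```

Response.clone() is what lets both reads happen: the application consumes the original body while the clone is parsed on the side.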
Streaming responses. This is where it gets interesting. The SDK wraps the response stream with a TransformStream that passes every chunk through to your application immediately — zero buffering. In parallel, it captures only the last few SSE data: lines, which is where providers include token counts in their streaming protocol. When the stream ends with [DONE], the SDK extracts usage from those final chunks and sends the trace. Your application never waits for this.
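The streaming path can be sketched with the Web Streams API. The names (tapStream, onUsage) and the exact tail-parsing logic are illustrative, assuming SSE frames shaped like OpenAI's streaming protocol:

```typescript
// Sketch of the streaming path (hypothetical internals): a pass-through
// TransformStream that forwards every chunk immediately while keeping a
// small tail of SSE `data:` lines, where providers report token usage.
function tapStream(
  stream: ReadableStream<string>,
  onUsage: (usage: unknown) => void,
  tailSize = 5,
): ReadableStream<string> {
  const tail: string[] = [];
  return stream.pipeThrough(
    new TransformStream<string, string>({
      transform(chunk, controller) {
        controller.enqueue(chunk); // forward first: zero buffering
        for (const line of chunk.split("\n")) {
          if (!line.startsWith("data:")) continue;
          tail.push(line.slice(5).trim());
          if (tail.length > tailSize) tail.shift();
        }
      },
      flush() {
        // Runs when the source stream ends; the frames just before
        // [DONE] typically carry the usage object.
        for (const data of tail) {
          if (data === "[DONE]") continue;
          try {
            const parsed = JSON.parse(data);
            if (parsed.usage) onUsage(parsed.usage);
          } catch {
            // Ignore non-JSON frames.
          }
        }
      },
    }),
  );
}
```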
Under both paths, a TraceSender batches up to 25 entries and flushes every 2 seconds — or immediately when the batch is full. Calling flush() sends whatever is in the buffer right now. All of this is best-effort — if a trace request fails, it's silently dropped. Production traffic is never interrupted by trace delivery.
What you see in the dashboard
Trace entries show up in the same traffic log as proxy entries. You can tell them apart by the source indicator — a shield icon for proxy requests, an eye icon for trace requests.
The same columns are there: model, tokens, cost, duration, status. Trace entries work with conversation tracing (group related calls by trace ID), grep search across request and response bodies, and all existing filters.
The one difference: trace entries don't show rules_applied, because no rules ran. The request went straight to the provider.
Try it
One config change to get started:
const grepture = new Grepture({
apiKey: process.env.GREPTURE_API_KEY!,
proxyUrl: "https://proxy.grepture.com",
mode: "trace",
});
Start with trace mode for visibility. See which models your team is using, what they cost, and where latency is going. When you're ready to add PII redaction or request rules, switching to proxy mode is the same one-line change — remove mode: "trace" and traffic routes through the proxy automatically.
- How Grepture works — architecture overview of both modes
- Tracking AI API costs — set up cost tracking with trace IDs
- Getting started guide — full SDK configuration reference