Ben @ Grepture

Your LLM Observability Tool Is Logging PII — Here's How to Fix It

Every LLM observability platform captures full prompts and completions by default. If your prompts contain PII, you're now storing personal data in a third-party system you probably didn't include in your DPIA.

The hidden data pipeline you forgot to audit

You've set up LLM observability. LangSmith, Datadog LLM Observability, Langfuse, Arize, Helicone — pick your platform. You can see every prompt, every completion, every token count, every latency metric. Debugging is easy. Evaluation is possible. Life is good.

There's one problem: every prompt your users ever sent is now stored in a third-party system. And if those prompts contain personal data — names, emails, phone numbers, health information, financial details — you've just created a data processing operation you probably didn't account for in your privacy impact assessment.

This isn't hypothetical. If you're building AI features that process user input, your prompts contain PII. Support tickets have customer names. CRM-enriched prompts have emails and phone numbers. RAG-retrieved documents have whatever was in the source data. Chat histories accumulate identifiers over turns.

Your observability tool is faithfully logging all of it.

How observability tools capture PII

Most LLM observability platforms work by instrumenting your AI pipeline — either through SDK callbacks, middleware, or auto-instrumentation. They capture:

  • Full prompt text — the complete input to the model, including system instructions, user messages, and any retrieved context
  • Full completion text — the model's entire response
  • Tool-call arguments and results — function names, parameters (which often contain user data), and return values
  • Metadata — model name, token counts, latency, but also custom metadata you attach (user IDs, session IDs, request context)

The default behavior for nearly every platform is to capture everything. That's by design — you need full prompts and completions to debug issues, evaluate quality, and trace problems.

But "capture everything" means every piece of personal data that touches your AI pipeline is now stored in your observability vendor's infrastructure. Every name, every email, every medical record in a healthcare prompt, every financial detail in a fintech query.
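To make the capture path concrete, here's a minimal sketch of SDK-style instrumentation. Everything here — `callModel`, `tracedCall`, the trace shape — is illustrative, not any vendor's actual API; the point is that the wrapper records the full prompt and completion verbatim:

```typescript
// Illustrative sketch of SDK-style instrumentation (not a real vendor API):
// the wrapper records the full prompt and completion around the model call.
interface Trace {
  promptText: string;     // complete input, PII and all
  completionText: string; // complete model response
  metadata: { model: string; latencyMs: number };
}

const traces: Trace[] = [];

// Stand-in for the real provider call.
function callModel(prompt: string): string {
  return `Thanks for writing in! (${prompt.length} chars received)`;
}

function tracedCall(prompt: string): string {
  const start = Date.now();
  const completion = callModel(prompt);
  // Default behavior: capture everything, verbatim.
  traces.push({
    promptText: prompt,
    completionText: completion,
    metadata: { model: "gpt-4o", latencyMs: Date.now() - start },
  });
  return completion;
}
```

Whatever the user typed — name, card number, phone — is now sitting in `promptText`, and that's exactly the field your observability backend stores.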

A concrete example

You're running a customer support AI. A user writes:

Hi, I'm Sarah Chen (sarah.chen@acme.com). My account number
is 4532-8891-2201-6644. I was charged twice for my subscription
on March 3rd. Can you look into this? I'm calling from
+1 (415) 555-0142.

Your LLM pipeline enriches this with CRM data, adds relevant docs from RAG, sends it to GPT-4o, and returns a response. Your observability tool captures:

  • The full prompt (with Sarah's name, email, credit card, and phone number)
  • The RAG-retrieved documents (which may contain other customers' data)
  • The model's response (which references Sarah's details)
  • The tool calls (which passed her account number to your billing API)

Sarah's personal data is now in three places: your application, OpenAI (temporarily), and your observability platform (indefinitely, unless you've set a retention policy).

Why this matters for compliance

GDPR

Your observability vendor is a data processor under GDPR. You need:

  • A Data Processing Agreement (DPA) with them — not just your AI provider
  • A lawful basis for storing personal data in observability logs
  • An entry in your records of processing activities (Article 30)
  • The ability to find and delete a specific user's data from observability logs when they exercise their right to erasure (Article 17)

Most teams have DPAs with their AI provider. Far fewer have evaluated whether their observability vendor's terms cover personal data in prompt traces.

Data minimization

GDPR Article 5(1)(c) requires data to be "limited to what is necessary." Is it necessary to store full prompt text containing PII for observability purposes? For debugging a live issue — maybe. For historical traces from six months ago — almost certainly not.

Cross-border transfers

If your observability vendor stores data in the US (and most do — LangSmith, Datadog, Arize are all US-based), you've created another cross-border transfer that needs SCCs, a Transfer Impact Assessment, and supplementary measures. This is on top of the transfer to your AI provider.

Breach exposure

A breach at your observability vendor now exposes every prompt your users ever sent. That's a fundamentally different incident than a breach of aggregated metrics. Full prompt traces are a treasure trove of personal data — names, emails, financial details, health information, credentials, internal data — all neatly organized and searchable.

What the observability platforms offer (and what they don't)

Some platforms have started addressing this. But the solutions are opt-in and limited:

Datadog offers Sensitive Data Scanner, which can scan LLM Observability traces and redact PII patterns. It's a good start, but it's a separate product with separate pricing, and it scans data after it has already been sent to Datadog.

LangSmith has input/output masking (its hide_inputs / hide_outputs hooks) that lets you define custom masking functions. But it requires you to write and maintain your own redaction logic in every instrumented service.

Langfuse provides client-side masking functions that process data before it's sent. Similar to LangSmith — you define the function, you maintain the patterns, you're responsible for coverage.

Most other platforms — Helicone, Arize, Weights & Biases — either don't have PII-specific features or rely on you to handle it before data reaches them.

The common gaps:

  • Opt-in, not opt-out. PII logging is the default. You have to actively configure redaction.
  • Pattern-based only. Most offer regex matching for structured PII (emails, SSNs, credit cards). Freeform PII in unstructured text (names, addresses, medical terms) requires NLP — which few observability tools include.
  • Per-platform configuration. If you use multiple observability tools (common in larger teams), you configure redaction separately in each one.
  • After-the-fact. Some solutions scan data after ingestion rather than before transmission, meaning PII has already left your infrastructure.

The better approach: redact before it reaches observability

The observability tool shouldn't be responsible for redacting PII — it should never see PII in the first place.

If you redact personal data from your AI pipeline before observability instrumentation captures it, every downstream system — your AI provider, your observability platform, your logging infrastructure — receives clean data.

There are three layers where you can do this:

1. At the proxy layer (before everything)

A scanning proxy between your application and external services redacts PII from all outbound traffic. Since observability tools typically instrument the request/response cycle, they capture the already-redacted data.

This is the approach Grepture takes. The proxy sits upstream of both your AI provider and your observability instrumentation. By the time LangSmith or Datadog captures the trace, PII has already been replaced with tokens:

# What your observability tool logs:
Prompt: "Hi, I'm [PERSON_1] ([EMAIL_1]). My account number
is [CREDIT_CARD_1]. I was charged twice for my subscription
on March 3rd. Can you look into this? I'm calling from
[PHONE_1]."

You can still debug — the entity references are consistent and meaningful. You just can't see the actual PII, because it was stripped before it reached any external system.
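Here's a minimal sketch of this style of tokenization — illustrative only, not Grepture's actual implementation. The key property is that each distinct value gets a stable numbered token, so repeated mentions map to the same placeholder and traces stay debuggable:

```typescript
// Minimal sketch of proxy-style tokenization (illustrative, not Grepture's
// actual implementation): each distinct value gets a stable numbered token.
const PATTERNS: Array<[string, RegExp]> = [
  ["EMAIL", /[\w.+-]+@[\w-]+\.[\w.]+/g],
  ["CREDIT_CARD", /\b(?:\d{4}[- ]){3}\d{4}\b/g],
  ["PHONE", /\+?\d[\d ().-]{8,}\d/g],
];

function tokenize(text: string): string {
  const seen = new Map<string, string>();     // raw value -> token
  const counters = new Map<string, number>(); // label -> next index
  let out = text;
  for (const [label, re] of PATTERNS) {
    out = out.replace(re, (match) => {
      if (!seen.has(match)) {
        const n = (counters.get(label) ?? 0) + 1;
        counters.set(label, n);
        seen.set(match, `[${label}_${n}]`);
      }
      return seen.get(match)!; // same value, same token, every time
    });
  }
  return out;
}
```

Note the limitation flagged earlier: a regex layer like this catches structured PII (emails, card numbers, phones) but not freeform names like "Sarah Chen" — that requires NLP-based detection.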

2. At the application layer (custom masking)

Write redaction logic that runs before your observability SDK captures the data. This is what LangSmith's and Langfuse's masking features enable:

// LangSmith example — you maintain the masking function
// (option names per the SDK's hide_inputs / hide_outputs masking hooks)
import { Client } from "langsmith";

const client = new Client({
  hideInputs: (inputs) => redactPII(inputs),
  hideOutputs: (outputs) => redactPII(outputs),
});

This works but has drawbacks: you maintain the redaction logic, it only covers the platform you configure it for, and it doesn't protect the AI provider (the data is redacted for observability but still sent raw to OpenAI).
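What does a `redactPII` like the one above look like? Here's a minimal sketch you might maintain for such a hook — illustrative only, with just two patterns; real coverage needs far more. It recurses through nested inputs so structured message arrays get masked too:

```typescript
// A minimal redactPII for a masking hook (illustrative; real coverage
// needs many more patterns). Recurses through arrays and objects so
// nested message structures are masked, not just flat strings.
function redactPII(value: unknown): unknown {
  if (typeof value === "string") {
    return value
      .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")
      .replace(/\b(?:\d{4}[- ]){3}\d{4}\b/g, "[CARD]");
  }
  if (Array.isArray(value)) return value.map(redactPII);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, redactPII(v)])
    );
  }
  return value; // numbers, booleans, null pass through
}
```

This is exactly the logic you end up owning: every new PII pattern, every edge case, duplicated per platform you instrument.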

3. At the collector layer (OpenTelemetry)

If you're using OpenTelemetry for AI observability, you can add a redaction processor to your collector pipeline. This intercepts traces before they reach any backend. The redaction processor from opentelemetry-collector-contrib, for example, can block attribute values that match PII regexes:

processors:
  redaction:
    allow_all_keys: true
    blocked_values:
      - '[\w.+-]+@[\w-]+\.[\w.]+'           # email
      - '\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}'  # credit card
      - '\+?\d[\d ().-]{8,}\d'              # phone
      - '\d{3}-\d{2}-\d{4}'                 # SSN
This centralizes redaction for all OTEL-based observability but requires running your own collector infrastructure and doesn't cover non-OTEL platforms.

Recommended architecture

The strongest approach combines proxy-level redaction with observability-specific controls:

  1. Route AI traffic through a redacting proxy — PII is stripped before it reaches the AI provider or the observability tool. Grepture's proxy handles this with 50+ detection patterns on the free tier, plus AI-powered detection of names and addresses on Pro.

  2. Enable platform-native masking as a safety net — if your observability tool offers masking (Datadog's Sensitive Data Scanner, LangSmith's input/output masking), enable it. Defense in depth — even if the proxy misses something, the second layer catches it.

  3. Set retention policies — even with redaction, set aggressive retention on observability traces. You rarely need six-month-old prompt traces. Most platforms support configurable retention periods.

  4. Audit what's being logged — periodically review your observability traces for PII that slipped through. Search for common patterns (email formats, phone number formats) in your trace data. If you find PII, your redaction layer has gaps.

  5. Use zero-retention mode for sensitive workloads — for healthcare, financial, or legal AI workflows, consider whether you need prompt-level observability at all in production. Metadata-only logging (latency, token counts, error rates) may be sufficient.
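The metadata-only logging from step 5 can be sketched like this — the shape is illustrative, not any vendor's schema. The trace carries latency, token estimates, and outcome, and by construction never contains prompt or completion text:

```typescript
// Metadata-only trace for sensitive workloads: latency, token estimates,
// and outcome are recorded, but prompt and completion text never are.
// The shape is illustrative, not any vendor's schema.
interface MetadataTrace {
  model: string;
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
  ok: boolean;
}

function toMetadataTrace(
  model: string,
  latencyMs: number,
  prompt: string,
  completion: string,
  ok: boolean
): MetadataTrace {
  return {
    model,
    latencyMs,
    // Rough estimate (~4 chars/token); real pipelines would use the
    // provider's reported usage counts instead.
    promptTokens: Math.ceil(prompt.length / 4),
    completionTokens: Math.ceil(completion.length / 4),
    ok,
  };
}
```

You lose prompt-level debugging, but a breach of these traces exposes operational metrics, not customer data.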

Quick wins you can do today

If you're running LLM observability in production right now, here are immediate steps:

  • Check your DPA coverage. Do you have a Data Processing Agreement with your observability vendor that covers personal data in AI traces? If not, you have a GDPR gap.
  • Enable whatever masking your platform offers. Even basic regex masking for emails and credit cards is better than nothing.
  • Review your retention settings. Reduce trace retention to the minimum useful period. 30 days covers most debugging needs.
  • Audit a sample of recent traces. Pull 20 random traces and check for PII. If you find any, you know the scope of the problem.
  • Add observability vendors to your processing records. GDPR Article 30 requires documenting all processors. If your observability vendor isn't listed, add them.
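The trace audit above is easy to script. A sketch of the idea — patterns are illustrative and deliberately loose; treat any hit on supposedly-redacted trace text as a gap in your redaction layer:

```typescript
// Sketch of the audit step: scan sampled trace text for common PII
// patterns. Patterns are illustrative and loose; any hit on a trace
// that should be redacted indicates a gap in the redaction layer.
const AUDIT_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/,
  phone: /\+?\d[\d ().-]{8,}\d/,
  card: /\b(?:\d{4}[- ]){3}\d{4}\b/,
};

// Returns the names of the PII patterns found in a trace's text.
function auditTrace(text: string): string[] {
  return Object.entries(AUDIT_PATTERNS)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}
```

Run it over your 20 sampled traces: a properly redacted trace (all tokens, no raw values) comes back clean, while any raw email or phone number gets flagged.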

Further reading