Ben @ Grepture · Security

Best Open Source Models for PII Redaction

Compare the best open source models for PII detection and redaction in AI pipelines — GLiNER, DeBERTa, Piiranha, StarPII, and more.

Regex won't catch everything

If you're building AI features, you need PII detection in your pipeline. Regex handles structured PII well — emails, credit card numbers, phone numbers, SSNs. But what about a customer's name buried in a support ticket? An address woven into a paragraph? A company name that could be anything?

For unstructured PII, you need ML models. And in 2026, the open source options are genuinely good. This post covers the best open source models for PII redaction — what they're good at, where they fall short, and which one to pick for your use case.

How PII detection models work

Most PII detection models are Named Entity Recognition (NER) systems. They process text token by token and classify each token with labels like B-PERSON (beginning of a person name), I-EMAIL (inside an email), or O (not PII).
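Decoding those BIO labels back into entity spans is mechanical. Here's a minimal sketch in plain Python — the token/label pairs are hypothetical model output, not from any particular model:

```python
def decode_bio(tokens, labels):
    """Merge BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):           # a new entity begins
            if current:
                entities.append(current)
            current = [token, label[2:]]
        elif label.startswith("I-") and current:
            current[0] += " " + token        # continue the open entity
        else:                                # "O": close any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [tuple(e) for e in entities]

tokens = ["Email", "John", "Smith", "at", "john@acme.com"]
labels = ["O", "B-PERSON", "I-PERSON", "O", "B-EMAIL"]
print(decode_bio(tokens, labels))
# [('John Smith', 'PERSON'), ('john@acme.com', 'EMAIL')]
```

Real models work on subword tokens and carry character offsets alongside the labels, but the merge logic is the same.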

There are two main approaches:

  • Fine-tuned models — Trained on PII-labeled datasets, these have fixed entity types baked in. They're fast and accurate for the categories they know, but can't detect new entity types without retraining.
  • Zero-shot models — You specify entity labels at inference time. More flexible, slightly less accurate on benchmarks, but you can adapt them to new PII types without any training.

The transformer revolution (BERT, DeBERTa, and their descendants) made both approaches dramatically better than the spaCy/regex stacks that preceded them. Let's look at what's available.

Presidio + spaCy: the baseline

Microsoft Presidio isn't a model — it's a framework. It combines regex pattern matching, rule-based logic, and NER (spaCy by default) into a pluggable detection pipeline. It's the most widely deployed open source PII solution and the one most blog posts recommend.

How it works: Presidio ships with built-in "recognizers" for common PII types. The default NER engine is spaCy's en_core_web_lg model. You can add custom recognizers or swap in a different NER backend entirely.
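Conceptually, the pluggable design is a registry of recognizers that each return spans, with results pooled at the end. A stripped-down stdlib sketch of that pattern — the class and method names here are illustrative, not Presidio's actual API:

```python
import re

class RegexRecognizer:
    """Pattern-based recognizer for structured PII (Presidio-style, simplified)."""
    def __init__(self, entity, pattern):
        self.entity, self.pattern = entity, re.compile(pattern)

    def analyze(self, text):
        return [(m.start(), m.end(), self.entity) for m in self.pattern.finditer(text)]

class AnalyzerPipeline:
    """Runs every registered recognizer and pools the spans."""
    def __init__(self):
        self.recognizers = []

    def add(self, recognizer):           # pluggable: regex, spaCy, GLiNER, ...
        self.recognizers.append(recognizer)

    def analyze(self, text):
        spans = []
        for r in self.recognizers:
            spans.extend(r.analyze(text))
        return sorted(spans)

pipeline = AnalyzerPipeline()
pipeline.add(RegexRecognizer("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+"))
pipeline.add(RegexRecognizer("PHONE", r"\b\d{3}-\d{4}\b"))

print(pipeline.analyze("Reach jane@example.com or 555-0123"))
# [(6, 22, 'EMAIL'), (26, 34, 'PHONE')]
```

In real Presidio, an NER backend is just another recognizer in this registry — which is why swapping spaCy for GLiNER is a configuration change, not a rewrite.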

Strengths:

  • Production-ready framework with years of community hardening
  • Handles text, images, and structured data (JSON, CSV)
  • Pluggable architecture — swap spaCy for GLiNER, Flair, or a custom model
  • Good documentation, active maintenance

Weaknesses:

  • Default spaCy NER struggles with non-Western names (Indian, East Asian, Arabic names)
  • Regex recognizers miss implicit or context-dependent PII
  • Framework complexity — you're managing a pipeline, not just running a model
  • spaCy's NER accuracy on general PII benchmarks lags behind transformer-based alternatives

Best for: Teams that want a full framework with batteries included and are willing to customize the NER backend. Presidio is the chassis — the model you plug in determines the accuracy.

Verdict: Start here if you need a production framework, but replace the default spaCy NER with GLiNER or a DeBERTa fine-tune for better accuracy.

GLiNER: the zero-shot option

GLiNER (Generalist and Lightweight model for Named Entity Recognition) is a transformer-based NER model with a key innovation: zero-shot entity recognition. You define entity labels at inference time — no retraining needed.

The base model is strong, but for PII specifically, you want one of the fine-tuned variants:

  • Knowledgator GLiNER-PII — 60+ PII entity categories, available in small/base/large/edge sizes. The most comprehensive PII-specific GLiNER variant.
  • NVIDIA GLiNER-PII — Built on GLiNER large-v2.1, covers 55+ entity types including PHI (Protected Health Information).
  • Gretel GLiNER — Fine-tuned on Gretel's synthetic datasets, available in small/base/large.
  • urchade/gliner_multi_pii-v1 — Multi-PII variant from GLiNER's original author.

How it works: Pass text and a list of entity labels (e.g., ["person", "email", "credit_card", "address"]). GLiNER returns spans with confidence scores. No predefined entity set — you control what to detect.

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")

text = "Contact John Smith at john@acme.com or call 555-0123"
labels = ["person", "email", "phone_number"]

entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(f"{entity['text']} -> {entity['label']} ({entity['score']:.2f})")
# John Smith -> person (0.94)
# john@acme.com -> email (0.97)
# 555-0123 -> phone_number (0.91)
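Turning those spans into redacted text is a splice — assuming each entity dict carries start/end character offsets, as GLiNER's output does, you work right to left so earlier offsets stay valid:

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    Expects dicts with "start", "end", "label" keys (character offsets),
    matching the shape of GLiNER's predict_entities output.
    """
    # Work right-to-left so replacements don't shift earlier offsets.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['label'].upper()}]" + text[e["end"]:]
    return text

entities = [
    {"start": 8, "end": 18, "label": "person"},
    {"start": 22, "end": 35, "label": "email"},
]
print(redact("Contact John Smith at john@acme.com today", entities))
# Contact [PERSON] at [EMAIL] today
```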

Strengths:

  • Zero-shot flexibility — add new entity types without retraining
  • Strong accuracy (F1 ~0.81 on multi-domain PII benchmarks)
  • Multiple size options from edge (tiny) to large
  • Integrates directly with Presidio as a custom recognizer
  • Good at non-Western names where spaCy struggles

Weaknesses:

  • F1 drops to ~0.41 on clinical/healthcare data (domain gap)
  • Slightly slower than fixed fine-tuned models (zero-shot overhead)
  • Requires GPU for reasonable throughput at scale

Best for: Teams that need flexibility in entity types, deal with diverse text inputs, or want a single model that adapts to different PII requirements.

DeBERTa fine-tunes: highest accuracy

If you need the highest raw accuracy on a fixed set of PII types, DeBERTa fine-tuned models dominate the benchmarks. These are token classification models trained specifically on PII-labeled datasets.

The standout models are the ai4privacy fine-tunes — DeBERTa-v3 checkpoints trained on ai4privacy's PII-labeled datasets, such as Isotonic/deberta-v3-base_finetuned_ai4privacy_v2, covering 54 entity classes at F1 0.9757.

How it works: Standard transformer token classification. Feed in text, get per-token PII labels. These models have fixed entity sets defined during training.

from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="Isotonic/deberta-v3-base_finetuned_ai4privacy_v2",
    aggregation_strategy="simple"
)

text = "My name is Maria Garcia and I live at 42 Oak Street, Berlin"
results = classifier(text)
for r in results:
    print(f"{r['word']} -> {r['entity_group']} ({r['score']:.2f})")

Strengths:

  • Highest benchmark scores (F1 > 0.97)
  • Fast inference — no zero-shot overhead
  • Well-understood transformer architecture
  • Multiple variants for different use cases

Weaknesses:

  • Fixed entity sets — can't detect new PII types without retraining
  • Training data quality determines real-world performance (some are trained on synthetic data)
  • Larger models need GPU for production throughput

Best for: Production systems where you know exactly which PII types you need to detect and accuracy is the top priority.

Piiranha: best for multilingual

Piiranha is a fine-tuned microsoft/mdeberta-v3-base model built specifically for multilingual PII detection. It detects 17 PII types across 6 languages.

Performance: 98.27% PII token catch rate, 99.44% overall classification accuracy. These are strong numbers, especially across multiple languages.

Strengths:

  • Best multilingual coverage among PII-specific models
  • High accuracy maintained across languages (not just English)
  • Based on the proven mDeBERTa architecture

Weaknesses:

  • Smaller entity set (17 types) compared to GLiNER-PII (60+) or ai4privacy DeBERTa (54)
  • Limited to 6 languages — not truly universal

Best for: Applications processing text in multiple European languages where consistent cross-language accuracy matters more than entity type breadth.

StarPII: built for code

StarPII takes a different approach — it's designed for detecting PII in source code and code-adjacent text. Part of the BigCode project, it targets 6 categories:

  • Names
  • Emails
  • API keys
  • Passwords
  • IP addresses
  • Usernames

Why it matters: If you're building AI coding assistants or processing code repositories, general PII models miss code-specific patterns. StarPII understands that password = "hunter2" is a credential, not just a string literal.
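To see why code needs its own treatment: a general NER model sees a string literal, while a code-aware scanner sees the assignment context around it. StarPII itself is an ML model, not regex — the toy patterns below only illustrate the kind of code-specific context involved:

```python
import re

# Toy credential patterns -- purely illustrative, nowhere near StarPII's
# actual (ML-based) detection.
CREDENTIAL_RE = re.compile(
    r'(password|api_key|token)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)

code = 'db_user = "admin"\npassword = "hunter2"\nAPI_KEY = "sk-abc123"'
for match in CREDENTIAL_RE.finditer(code):
    print(f"{match.group(1)}: {match.group(2)}")
# password: hunter2
# API_KEY: sk-abc123
```

A regex like this breaks the moment credentials are built dynamically or named unconventionally — which is exactly the gap a trained model closes.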

Best for: Code analysis, repository scanning, coding assistant pipelines. Pair with a general PII model for non-code text.

Small LLMs as PII detectors

Some teams use small language models (Phi-3 Mini, Qwen2-0.5B) fine-tuned for PII detection. Models like ab-ai/PII-Model-Phi3-Mini and betterdataai/PII_DETECTION_MODEL (Qwen2-0.5B, 29 classes, 7 languages) take this approach.

Strengths:

  • Can handle complex, contextual PII that NER models miss
  • Natural language output makes results interpretable
  • Some support instruction-following for custom detection rules

Weaknesses:

  • Significantly slower than NER models (10-100x)
  • Higher resource requirements (even "small" LLMs need more memory)
  • Less predictable — generative models can hallucinate entities
  • Harder to deploy at scale with consistent latency

Best for: Offline batch processing where latency doesn't matter and you need maximum contextual understanding. Not recommended for real-time API traffic.
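One practical mitigation for the hallucination risk: treat the LLM's output as untrusted, and keep only entities that literally appear in the source text. A stdlib sketch, assuming the model was prompted to emit a JSON list — the response string below is a fabricated example, not real model output:

```python
import json

def validate_llm_entities(text, llm_response):
    """Parse a JSON entity list and drop anything not present verbatim in the text."""
    try:
        candidates = json.loads(llm_response)
    except json.JSONDecodeError:
        return []                        # malformed output: fail closed
    return [
        e for e in candidates
        if isinstance(e, dict) and e.get("text") and e["text"] in text
    ]

text = "Maria Garcia lives at 42 Oak Street, Berlin"
llm_response = (
    '[{"text": "Maria Garcia", "label": "person"},'
    ' {"text": "John Doe", "label": "person"}]'   # hallucinated entity
)
print(validate_llm_entities(text, llm_response))
# [{'text': 'Maria Garcia', 'label': 'person'}]
```

Verbatim matching is a floor, not a ceiling — it won't catch a hallucinated label on a real span — but it removes the worst failure mode cheaply.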

Comparison table

| Model | Approach | Entity Types | F1 Score | Languages | Speed | Best For |
|---|---|---|---|---|---|---|
| Presidio + spaCy | Framework + NER | Configurable | Varies by backend | EN (default) | Fast | Full pipeline framework |
| GLiNER-PII (Knowledgator) | Zero-shot NER | 60+ | ~0.81 | EN (primarily) | Medium | Flexible entity detection |
| NVIDIA GLiNER-PII | Zero-shot NER | 55+ | ~0.81 | EN (primarily) | Medium | PII + PHI detection |
| DeBERTa + ai4privacy | Token classification | 54 | 0.9757 | Multi | Fast | Highest accuracy |
| Piiranha | Token classification | 17 | ~0.98 (token) | 6 languages | Fast | Multilingual |
| StarPII | Token classification | 6 | N/A | EN | Fast | Code/repos |
| Phi3-Mini PII | Generative | Flexible | ~0.96 (self-reported) | EN | Slow | Batch processing |

Choosing the right model

There's no single best model — it depends on your constraints:

  • General text, known PII types → DeBERTa + ai4privacy fine-tune. Highest accuracy, fast inference, 54 entity classes cover most use cases.
  • General text, evolving PII types → GLiNER-PII (Knowledgator or NVIDIA). Zero-shot flexibility means you can add entity types without retraining.
  • Multilingual → Piiranha. Consistent accuracy across 6 languages beats running separate per-language models.
  • Code and repositories → StarPII. General PII models don't understand code-specific patterns.
  • Production framework → Presidio with GLiNER as the NER backend. You get Presidio's pipeline management with GLiNER's detection accuracy. Microsoft documents this integration.
  • Healthcare / PHI → NVIDIA GLiNER-PII or domain-specific fine-tunes. General models see significant accuracy drops on clinical text (F1 drops from 0.81 to ~0.41 for GLiNER on clinical data).

The hybrid approach wins

In production, the best PII detection systems don't rely on a single model. They combine approaches:

  1. Regex patterns for structured PII — emails, credit cards, SSNs, phone numbers. These are fast (sub-millisecond), deterministic, and auditable. When your compliance team asks how you detect PII, you can point to specific patterns.

  2. ML models for unstructured PII — names, addresses, organizations, freeform text entities. This is where the models above earn their keep.

  3. Secret scanning for credentials — API keys, tokens, connection strings. These are distinct from PII but equally important in AI pipelines where developers send code snippets to LLMs.

Running regex first filters out the easy wins at near-zero cost. The ML model only processes what regex doesn't catch. This is the approach we recommend in our PII detection best practices guide — and it's how most production systems work.
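The regex-first ordering can be sketched in a few lines: structured patterns mask the easy wins, and only the residue reaches the (stubbed-out) ML model. The patterns here are deliberately simplified, not production-grade:

```python
import re

# Step 1: cheap, deterministic, auditable patterns for structured PII.
STRUCTURED_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def ml_detect(text):
    """Placeholder for an NER model (GLiNER, a DeBERTa fine-tune, ...)."""
    return text  # the real model would redact names, addresses, etc.

def hybrid_redact(text):
    for label, pattern in STRUCTURED_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Step 2: only what regex didn't catch reaches the ML model.
    return ml_detect(text)

print(hybrid_redact("Email maria@acme.com, SSN 123-45-6789, call 555-123-4567"))
# Email [EMAIL], SSN [SSN], call [PHONE]
```

Because the regex pass rewrites matches in place, the ML model never sees the structured PII at all — which both shrinks its workload and limits what a model failure can leak.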

How Grepture helps

Hosting, tuning, and maintaining PII detection models is real operational work. You need GPU infrastructure, model versioning, latency monitoring, and fallback strategies. For a security feature that's supposed to protect your pipeline, that's a lot of moving parts.

Grepture handles this at the proxy layer. Every API call to OpenAI, Anthropic, or any LLM provider flows through Grepture, where it's scanned with both regex patterns and AI models before reaching the provider. Detected PII can be redacted, masked and restored, or logged for audit.

The advantage: you don't integrate per-service, you don't host models, and you get reversible redaction so AI responses stay personalized even with PII stripped from the request.

For a deeper comparison of Presidio vs Grepture's approach, see our side-by-side comparison.

Key takeaways

  • GLiNER-PII variants (Knowledgator, NVIDIA) are the most versatile choice — zero-shot flexibility with strong accuracy and 55-60+ entity types.
  • DeBERTa fine-tuned on ai4privacy has the highest benchmark score (F1: 0.9757) if you need maximum accuracy on a fixed entity set.
  • Piiranha is the best multilingual option with 98%+ catch rates across 6 languages.
  • Production systems should combine regex + ML models — regex for structured PII (fast and auditable), models for freeform text entities.
  • No model handles all domains equally — clinical/healthcare text causes significant accuracy drops for general-purpose models. Test on your actual data before deploying.