How do AI digital protégés handle PII?

Drew Harris · CEO and Chief Product and Technology Officer · 2026-04-12 · 8 min read
security · compliance · zero-hallucination · pii

Why PII handling is a platform-level question

Every expert I talk to assumes their AI platform handles PII well. Most don't ask how. The question matters for three reasons:

  1. Your source material has PII in it whether you know it or not. Client intake forms, email archives, session transcripts, case studies: all are PII-dense. Uploading them to a vector store without redaction means the PII is now retrievable.
  2. LLMs don't distinguish between "content" and "personal data." A retrieval system that returns a passage containing "Jane Smith, jsmith@acme.com, 555-0134, diagnosed with..." will happily include that passage in the prompt context. The model will answer using it, and might quote it.
  3. Once PII is in the vector store, removing it is hard. Vector stores don't offer SQL-style structured updates: the PII sits in the stored chunk text and in the embedding computed from it. Re-ingestion is the only reliable fix, and re-ingestion is expensive if your KB is large.
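To make the third point concrete, here is a minimal sketch of after-the-fact cleanup, assuming a hypothetical `Chunk` record and a single email pattern (neither is taken from any real vector store's API). Every stored chunk has to be scanned, and any chunk whose text changes needs its embedding recomputed:

```python
import re
from dataclasses import dataclass

# Illustrative pattern only; real PII detection covers many more entity types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

@dataclass
class Chunk:
    """Hypothetical stored chunk: text plus a flag for whether its embedding matches the text."""
    chunk_id: str
    text: str
    embedding_is_current: bool = True

def remediate(chunks: list) -> list:
    """Redact stored text in place; return the IDs whose embeddings must be recomputed."""
    stale = []
    for chunk in chunks:
        cleaned = EMAIL.sub("[EMAIL]", chunk.text)
        if cleaned != chunk.text:
            chunk.text = cleaned
            chunk.embedding_is_current = False  # old vector was computed from the PII text
            stale.append(chunk.chunk_id)
    return stale

kb = [
    Chunk("c1", "Pricing tiers are reviewed every quarter."),
    Chunk("c2", "Follow up with jsmith@acme.com about onboarding."),
]
needs_reembedding = remediate(kb)
print(needs_reembedding)  # ['c2']
```

The scan touches every chunk even when only a few are affected, which is why cleanup cost grows with the size of the KB.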

The right answer is to prevent PII from entering the vector store in the first place. That's a platform-level choice, not a user-side workflow.

Where PII enters the system

Three ingress points in any AI clone platform:

  • Document ingestion: PDFs, Word docs, text pasted in, RSS feeds, YouTube transcripts, URL scraping. This is where most PII enters.
  • Session transcripts: clients mention names, emails, phone numbers, health details, financial specifics. Transcripts are stored. Without handling, that PII flows into any downstream ingestion (session-to-KB ingestion is a shipped feature on our side).
  • The expert's own inputs: an expert entering a client's name in a feedback note, a coupon description, an email template.

A complete PII story addresses all three. Most competitor platforms address zero or one publicly.

Pre-ingestion redaction vs. post-generation filtering

There are two places PII detection can happen. They're not equivalent.

Post-generation filtering

A filter runs after the model generates a response. If the response contains what looks like PII, the filter redacts or blocks it. This is the pattern most LLM providers (OpenAI, Anthropic, Google) offer as a "safety filter."

Failure modes:

  • The PII is still in the vector store. If it's retrievable, it's already been shown to the model.
  • The model has already seen the PII. If your provider logs inputs for model training, redacting the output doesn't unlog the input.
  • Filters are pattern-based. They miss named individuals, obscured identifiers, and domain-specific PII (medical record numbers, case file IDs).
  • Responses that get blocked produce broken sessions. Clients see "I can't respond to that" because the expert's protégé hit a filter, not because it hit the zero-hallucination threshold.
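The pattern-matching limitation is easy to see in a minimal sketch. The `filter_response` function, its two patterns, and the case file ID `CF-20417` are hypothetical, written for illustration; this is not any provider's actual filter:

```python
import re

# Hypothetical output filter with two illustrative patterns.
OUTPUT_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
}

def filter_response(text: str) -> str:
    """Mask anything in a generated response that matches a known pattern."""
    for label, pattern in OUTPUT_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

response = "Jane Smith (jsmith@acme.com, 555-0134) was referred under case file CF-20417."
filtered = filter_response(response)
print(filtered)
# Jane Smith ([EMAIL], [PHONE]) was referred under case file CF-20417.
```

The email and phone number are masked, but the name and the domain-specific identifier pass straight through, and the unredacted passage is still sitting in the vector store.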

Pre-ingestion redaction

PII detection runs before content enters the vector store. Redacted content is what gets embedded, what gets retrieved, and what reaches the model. The model never sees the PII.

Why this is better:

  • The vector store is clean. Nothing to leak.
  • The model can answer fully, because there is nothing left to block.
  • The audit trail is simpler. Show a regulator "here's what was ingested, here's what was redacted, here's what was stored." That's a testable, reproducible process.
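The redact-then-store ordering, together with the audit record it makes possible, can be sketched roughly as follows. This is an illustration under stated assumptions, not Apex Replicant's implementation: `VectorStore`, `IngestResult`, `ingest`, and the two detectors are hypothetical names, and the store is a stand-in that skips the embedding step.

```python
import hashlib
import re
from dataclasses import dataclass, field

# Illustrative detectors only; a real pipeline covers far more entity types.
DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class IngestResult:
    source_sha256: str  # fingerprint of what was ingested
    redactions: dict    # entity label -> number of spans redacted
    stored_text: str    # what was actually stored

@dataclass
class VectorStore:
    """Stand-in for a real vector store; it only ever receives redacted text."""
    chunks: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.chunks.append(text)  # a real store would embed the text here

def ingest(raw: str, store: VectorStore) -> IngestResult:
    """Redact first, then store, so the raw text never reaches the store."""
    counts = {}
    redacted = raw
    for label, pattern in DETECTORS.items():
        redacted, n = pattern.subn(f"[{label}]", redacted)
        if n:
            counts[label] = n
    store.add(redacted)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return IngestResult(digest, counts, redacted)

store = VectorStore()
result = ingest("Contact jsmith@acme.com, SSN 123-45-6789, about the renewal.", store)
print(result.redactions)  # {'EMAIL': 1, 'SSN': 1}
print(store.chunks[0])    # Contact [EMAIL], SSN [SSN], about the renewal.
```

The returned record holds the three things an auditor would ask about: a fingerprint of the input, a count of what was redacted, and the exact text that was stored.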

Apex Replicant uses pre-ingestion redaction on every source type. See /features/pii-protection for the configuration surface.

Pre-ingestion redaction treats PII as something that should never enter the system. Post-generation filtering treats PII as something already inside that must be caught on the way out, and occasionally isn't. The postures are not equivalent, and regulators know the difference.

What our detection pipeline catches (and what it doesn't)

Our detection pipeline is production-grade, widely used in regulated industries, and tuned against real-world expert content.

What it reliably detects:

  • Emails
  • Phone numbers (multiple international formats)
  • Social Security Numbers
  • Credit card numbers
  • IP addresses
  • Dates of birth and date patterns in PII contexts
  • Names (via NER, named-entity recognition)
  • US driver's license numbers
  • Passport numbers
  • IBAN and bank account patterns
  • Medical record numbers (configurable)
  • Geographic locations at high specificity
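Structured identifiers in this list are typically detected by pairing a pattern with a validity check, which cuts false positives. As a generic illustration (not a description of our pipeline's internals), here is a sketch for credit card numbers using a candidate regex plus the Luhn checksum; the function names are hypothetical:

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return candidate spans whose digits pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group().strip())
    return hits

text = "Card 4111 1111 1111 1111 on file; order ref 1234 5678 9012 3456."
cards = find_card_numbers(text)
print(cards)  # ['4111 1111 1111 1111']
```

The second 16-digit string matches the pattern but fails the checksum, so it is not flagged as a card number.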

What the baseline detection is weaker on (and our mitigations):

  • Domain-specific identifiers (case file IDs, clinical trial IDs, internal employee IDs). Mitigation: custom regex patterns configured per expert vertical.
  • Partial / obscured PII ("John S.", "the client on Tuesday"). Mitigation: aggressive name redaction when the source document is known to be client-facing.
  • Context-dependent PII (a salary figure is PII in an HR context but not in a macroeconomic one). Mitigation: conservative default configuration per vertical.
  • Novel identifier formats (new national ID schemes, proprietary enterprise IDs). Mitigation: we review pipeline updates quarterly and backfill coverage.
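The first mitigation, custom patterns per vertical, might look something like the sketch below. The configuration shape, the vertical names, and the `CF-` case file format are assumptions made for illustration; the `NCT` plus eight digits format follows ClinicalTrials.gov identifiers.

```python
import re

# Baseline detector shared by every vertical (illustrative, single pattern).
BASELINE = {"EMAIL": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"}

# Hypothetical per-vertical additions layered on top of the baseline.
VERTICAL_PATTERNS = {
    "legal": {"CASE_FILE_ID": r"\bCF-\d{5}\b"},   # invented format
    "clinical": {"TRIAL_ID": r"\bNCT\d{8}\b"},    # ClinicalTrials.gov-style ID
}

def build_detectors(vertical: str) -> dict:
    """Merge baseline patterns with any patterns configured for the vertical."""
    patterns = {**BASELINE, **VERTICAL_PATTERNS.get(vertical, {})}
    return {label: re.compile(p) for label, p in patterns.items()}

def redact(text: str, vertical: str) -> str:
    """Replace every detected span with its entity label."""
    for label, pattern in build_detectors(vertical).items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Enrolled in NCT01234567; questions to pi@example.org."
clinical_out = redact(note, "clinical")
legal_out = redact(note, "legal")
print(clinical_out)  # Enrolled in [TRIAL_ID]; questions to [EMAIL].
print(legal_out)     # Enrolled in NCT01234567; questions to [EMAIL].
```

The same note keeps its trial ID under the legal configuration, which is why the vertical has to be chosen before ingestion rather than after.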

No PII detector is perfect. What matters is whether the platform is explicit about the mechanism, the residual risk, and the mitigations. Most aren't.

Session-time PII handling

Ingestion is one half; live sessions are the other.

When a client is in a voice or text session, they will mention personal details. Our handling:

  1. Transcripts are stored against the session record, scoped to the expert and protégé. They are not ingested back into the KB by default.
  2. When an expert chooses to re-ingest session content (session-to-KB is a shipped feature), the ingestion runs through the same detection pipeline as any other source. Client names, emails, and identifiers are redacted before the content enters the vector store.
  3. Session insight extraction uses Claude to summarize sessions into structured categories (topics, sentiment, action items). The summaries are PII-redacted; the raw transcript remains in the session record with expert-only access.

The design assumption is that the session record is the expert's private working copy; the KB is the protégé's public knowledge. Content flows between them only with redaction.
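A rough sketch of that boundary, assuming hypothetical `SessionRecord` and `KnowledgeBase` types and a single email detector standing in for the full pipeline:

```python
import re
from dataclasses import dataclass, field

# Single illustrative detector standing in for the full detection pipeline.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(text: str) -> str:
    return EMAIL.sub("[EMAIL]", text)

@dataclass
class SessionRecord:
    """Expert's private working copy: the raw transcript stays here."""
    expert_id: str
    protege_id: str
    transcript: str

@dataclass
class KnowledgeBase:
    """Protégé's knowledge: only redacted content is ever appended."""
    chunks: list = field(default_factory=list)

def reingest_session(record: SessionRecord, kb: KnowledgeBase,
                     expert_opted_in: bool = False) -> bool:
    """Copy a session into the KB only on explicit opt-in, and only after redaction."""
    if not expert_opted_in:
        return False  # default: transcripts never flow into the KB
    kb.chunks.append(redact(record.transcript))
    return True

record = SessionRecord("expert-1", "protege-1", "Client said to email jane@acme.com.")
kb = KnowledgeBase()
skipped = reingest_session(record, kb)                      # default path
added = reingest_session(record, kb, expert_opted_in=True)  # explicit opt-in
print(kb.chunks)  # ['Client said to email [EMAIL].']
```

The raw transcript on the session record is untouched in both cases; only the redacted copy crosses into the KB.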

What this does not solve

I want to be specific about the edges, because PII stories often oversell.

  • Your cloud storage posture. If the expert downloads a transcript to their laptop and leaves it on an unencrypted drive, the platform can't help. Our audit trail ends at the export.
  • Third-party model training. We use Gemini (primary), Claude (session insights), and OpenAI (secondary). Our contracts with each limit them to processing our data solely for the purposes we specify, which is running your protégé. That restriction is contractual, not an honor system. The residual category-wide question "could a future model remember something?" is an industry issue, not an Apex Replicant issue.
  • A determined insider or compelled access. Stolen expert credentials or a subpoena can expose session transcripts. Platform encryption doesn't defeat a valid court order.
  • Regulatory certification. Pre-ingestion PII redaction is a strong technical posture. It is not the same as a SOC 2 Type II audit or a HIPAA BAA. Our current posture: we are compliance-adjacent but not certified. Experts in HIPAA-gated verticals should talk to our team about current status.

Being explicit about what the architecture doesn't cover is part of how we earn trust on what it does.

FAQ

Does PII redaction happen on every source type? Yes. PDFs, DOCX, TXT, pasted text, URL scrapes, audio/video transcripts, RSS feeds, and session-to-KB ingestion all run through the same detection pipeline before content enters the vector store.

Can I see what was redacted? The AI insight preview shows you extracted insights before you commit the ingestion. Redaction is visible at this preview step; you see exactly what the protégé will know and what it won't.

Can I turn PII redaction off? Not in the default configuration. We have deliberately not enabled an option to disable ingestion-time redaction: the failure modes (PII in the vector store, irreversible without re-ingestion) are severe enough that we treat redaction as architectural, not configurable.

What happens to PII in session transcripts? Transcripts are stored in the session record, scoped to the expert and protégé. They are not ingested back into the knowledge base by default. When an expert chooses to re-ingest a session (session-to-KB is a shipped option), the content runs through the detection pipeline before entering the vector store.

Does Apex Replicant have a HIPAA BAA or SOC 2 certification? Our current compliance posture is partial and we name it that way. We have strong technical foundations (pre-ingestion PII redaction, encrypted storage, scoped access control) but no named certification as of this publication. For experts in HIPAA-gated or heavily regulated verticals, talk to our team for current status.

How does this compare to Delphi, Coachvox, or Steno? None of the three publicly describes pre-ingestion PII redaction on their marketing or documentation sites (as of our competitive refresh, 2026-04-22). Personal.ai claims SOC 2 for its enterprise tier. That's the current public posture landscape; it may change.

Drew Harris
CEO and Chief Product and Technology Officer

Co-founder of Expert Scale, Inc. Writes on platform architecture, product decisions, and how Apex Replicant builds expert-driven AI that refuses to guess.
