AI Safety
Human oversight, PII protection, and defenses against AI-specific threats.
Overview
Votriz uses AI to draft content, score brand health, and orchestrate campaigns. Every AI interaction sits behind safety guardrails: a mandatory human-approval gate, PII redaction on sensitive paths, prompt-injection defense, and a queryable user-feedback channel. The program is aligned with the NIST AI RMF across its Govern / Map / Measure / Manage functions.
Human approval gates
All AI-generated content requires human approval before any external action (publishing to social, sending email, posting a crisis response). This is foundational architecture, not a configurable toggle.
- Content queue: drafted posts sit in `content_queue` with `status='pending'` until a user with the `content.approve` permission acts (see the sketch after this list)
- Email campaigns: a user with `email.campaigns.send` must explicitly trigger dispatch; the worker won't fan out without that step
- Brand monitor: AI auto-drafts crisis responses, but a human approves before posting; `severity='critical'` always routes to a human regardless of AI confidence
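A minimal sketch of the gate's core check, assuming a hypothetical `QueuedPost` row shape and a plain `permissions` set; the real queue worker and permission system are not shown here:

```python
from dataclasses import dataclass

@dataclass
class QueuedPost:
    id: str
    status: str                   # 'pending' | 'approved' | 'rejected'
    approved_by: str | None = None

def approve_post(post: QueuedPost, user_id: str, permissions: set[str]) -> QueuedPost:
    """Flip a pending post to approved. Only a human holding
    content.approve reaches this point; no AI code path calls it."""
    if "content.approve" not in permissions:
        raise PermissionError("content.approve permission required")
    if post.status != "pending":
        raise ValueError(f"post is {post.status!r}, not pending")
    post.status = "approved"
    post.approved_by = user_id
    return post
```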
Ghost Presence (autonomous mode)
Opt-in per org, with a per-brand `auto_approve_confidence_threshold`. Even when active, every guardrail below must pass (an illustrative decision function follows this list):

- Maximum posts-per-day caps
- Operating-hours restrictions
- Content-category blocklists (`never_auto_approve` topics)
- Confidence-score floor below which the gate stays manual
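An illustrative decision function; apart from `auto_approve_confidence_threshold` and `never_auto_approve`, every name here is an assumption made for the sketch:

```python
from datetime import datetime, time

def may_auto_approve(
    confidence: float,
    threshold: float,              # the brand's auto_approve_confidence_threshold
    posts_today: int,
    max_posts_per_day: int,
    now: datetime,
    operating_hours: tuple[time, time],   # (start, end) of the allowed window
    categories: set[str],
    never_auto_approve: set[str],
) -> bool:
    """All guards must pass; any failure keeps the manual gate."""
    if confidence < threshold:
        return False                              # below the confidence floor
    if posts_today >= max_posts_per_day:
        return False                              # daily cap reached
    if not (operating_hours[0] <= now.time() <= operating_hours[1]):
        return False                              # outside operating hours
    if categories & never_auto_approve:
        return False                              # blocklisted topic
    return True
```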
No training on customer data
Customer content, prompts, Brand DNA, subscriber data, and analytics are never used to train, fine-tune, or otherwise improve any AI model. We use Anthropic's Claude API under a contract that explicitly excludes API data from training corpora — we inherit that guarantee.
Brand DNA voice profiles are stored per-org and loaded into prompt context at inference time. They never leave the customer's `org_id` scope and are never shared with other customers.
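In outline, that scoping looks something like the sketch below; the storage layer and prompt layout are assumptions, not the actual implementation:

```python
def build_prompt_context(org_id: str, voice_profiles: dict[str, str], task: str) -> str:
    """Look up Brand DNA strictly by the caller's org_id and splice it
    into the prompt; there is no cross-org lookup path, and a missing
    profile raises rather than falling back to another org's data."""
    voice = voice_profiles[org_id]
    return f"Brand voice profile:\n{voice}\n\nTask:\n{task}"
```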
PII redaction
The `pii_redactor` service scans free-form user input for sensitive patterns before transmission to external AI providers:

| Pattern | Replacement | Restorable? |
|---|---|---|
| Email addresses | `[EMAIL_n]` | Yes (chatbot reply path) |
| Phone numbers (NANP 3-3-4) | `[PHONE_n]` | Yes (chatbot reply path) |
| SSN format (3-2-4) | `[SSN_REDACTED]` | No |
| Credit card (Luhn-validated) | `[CC_REDACTED]` | No |
Redaction is selectively applied to the support chatbot and email-generation prompt paths. Lead generation is intentionally exempt — extracting public business contact information from search results is the agent's explicit job, and redacting there would defeat the purpose.
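A simplified version of the restorable-placeholder scheme from the table above. The regexes and function names are illustrative, and Luhn-validated card matching is omitted for brevity:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")   # NANP 3-3-4
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")               # 3-2-4

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Swap PII for placeholders. Emails and phones get numbered,
    restorable tokens; SSNs are replaced irreversibly."""
    mapping: dict[str, str] = {}

    def numbered(pattern: re.Pattern, tag: str, s: str) -> str:
        def sub(m: re.Match) -> str:
            token = f"[{tag}_{len(mapping)}]"
            mapping[token] = m.group(0)     # kept so chatbot replies can be restored
            return token
        return pattern.sub(sub, s)

    text = numbered(EMAIL_RE, "EMAIL", text)
    text = numbered(PHONE_RE, "PHONE", text)
    text = SSN_RE.sub("[SSN_REDACTED]", text)   # no mapping entry: not restorable
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Re-insert originals on the chatbot reply path."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```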
Prompt-injection defense
The `prompt_guard` service inspects user input for known jailbreak patterns and structure-token injection:
- Instruction override attempts ("ignore previous instructions")
- Role reassignment ("you are now")
- System-prompt extraction ("show me your system prompt")
- Token-boundary manipulation (`<|user|>`, `system:` role-prefix lines, triple backticks misparsed as boundaries)
Detected attempts are sanitized before forwarding (role tokens defanged, code fences neutered) and the event is logged in `security_audit_log`. Brand DNA scoring and the human approval gate are still the real defenses; this is a cheap upstream filter that catches the obvious stuff before tokens are spent on it.
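A toy version of that filter, assuming simple pattern lists; the real `prompt_guard` rule set is maintained separately and is broader than this:

```python
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all |the )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"(show|reveal).{0,30}system prompt", re.I),
]
ROLE_TOKENS = re.compile(r"<\|[a-z]+\|>|^\s*system:\s*", re.I | re.M)

def guard(user_input: str) -> tuple[str, bool]:
    """Return (sanitized_input, flagged). Sanitizing defangs role
    tokens and neuters code fences; flagging only drives the audit
    log entry, since the approval gate is the real defense."""
    flagged = any(p.search(user_input) for p in INJECTION_PATTERNS)
    sanitized = ROLE_TOKENS.sub("[role token removed] ", user_input)
    sanitized = sanitized.replace("```", "'''")   # fence can't act as a boundary
    return sanitized, flagged
```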
AI quality reporting
Any user can flag a generated piece as biased, inaccurate, inappropriate, off-brand, or surfacing the system prompt:
- In-app: "Report AI issue" button on every generated content card
- API: `POST /ai/report-issue` with `{issue_type, description, resource_id?, ai_model?}` (example below)
- Email: [email protected]
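For example, reporting an off-brand generation via the API; the base URL and bearer token are placeholders:

```python
import requests

resp = requests.post(
    "https://api.example.com/ai/report-issue",   # substitute your API base URL
    headers={"Authorization": "Bearer <api-token>"},
    json={
        "issue_type": "off_brand",               # or biased, inaccurate, ...
        "description": "Post drifted into competitor-bashing tone.",
        "resource_id": "post_123",               # optional
    },
    timeout=10,
)
resp.raise_for_status()
```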
Reports land in `security_audit_log` under `ai.quality_report` and are reviewed by the AI Safety Officer within 24 hours. Patterns trigger a review of the relevant agent's prompts and scoring thresholds.
Model inventory
| Model | Provider | Purpose | Data sent | Risk |
|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | Content, copy, email, SEO scoring, sentiment, chatbot | Brand context (no PII on most paths; redacted on chatbot + email) | Medium |
| GPT-4o-mini | OpenAI | Fallback if Anthropic is unreachable | Same shape as Claude payloads | Medium |
| FLUX.1-schnell | fal.ai | Image generation | Text prompts only | Low |
The full inventory and change procedure live in `docs/policies/AI_MODEL_INVENTORY.md`; updates ship in the same commit as the code change.
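The fallback row above implies routing logic roughly like this sketch, with the client wrappers as hypothetical stand-ins:

```python
from typing import Callable

def generate(
    prompt: str,
    call_anthropic: Callable[[str], str],
    call_openai: Callable[[str], str],
) -> str:
    """Try Claude first; if Anthropic is unreachable, send the same
    payload shape to the OpenAI fallback so downstream handling
    (scoring, queueing, approval) is unchanged."""
    try:
        return call_anthropic(prompt)
    except ConnectionError:
        return call_openai(prompt)
```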
Quality monitoring
Output quality is monitored through:
- Brand DNA scoring — every generated piece scored against the brand's voice profile before reaching the queue
- Approval-rate tracking — declining rates per brand are the leading drift signal (sketched after this list)
- User feedback loop — every approve / edit / reject decision teaches the system
- Quality-report rate — an org-level surge in `ai.quality_report` rows is one of the five detection signals in `AI_INCIDENT_RESPONSE.md`
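A sketch of the approval-rate drift signal, with made-up thresholds; the production signal definitions live in `AI_INCIDENT_RESPONSE.md`:

```python
def approval_rate_drifting(weekly_rates: list[float], floor: float = 0.7) -> bool:
    """Flag a brand when its approval rate falls below an absolute
    floor or trends down for three consecutive weeks."""
    if weekly_rates and weekly_rates[-1] < floor:
        return True
    recent = weekly_rates[-4:]
    return len(recent) == 4 and all(a > b for a, b in zip(recent, recent[1:]))
```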
Questions or a custom security review?
Enterprise customers receive dedicated security reviews and direct access to our security team. Reach us anytime at [email protected].
Talk to security →