Prompt Injection Audit

Treating prompt injection as a reliability failure that can be measured

Instead of asking "did the model fall for this one payload?", we ask: across many placements, wrappers, and encodings of untrusted content, how likely is the model to violate policy? If that probability is non-trivial, the system is vulnerable.

LLM System XML JSON b64 Unicode Violation Rate q = 0.23 Test Variants n = 60
The Problem

Most security testing provides false confidence

Teams test prompt injection by throwing a handful of known payloads at their system. If the model doesn't fall for "ignore all previous instructions," they ship.

This is like testing a lock with one key and declaring it secure. Real attackers don't use your test cases.

Single-Payload Testing

Tests one trick at a time. Passes if that specific payload fails. Provides false confidence because it misses the vast majority of attack variations.

Distributional Auditing

Tests across the entire attack surface: wrappers, positions, encodings, indirect vectors. Computes violation probability with statistical confidence bounds.

"

Prompt injection isn't a single string. It's a family of transformations. A system is only "safe" if it's robust across the whole distribution.

"
Core Insight Distributional Security Testing
Attack Surface

The transformation families we test

Each attack isn't a single string. It's a combinatorial space of variations. We test systematically across all dimensions.

Attack Payload "ignore prev..." × Wrappers 5 types XML JSON × Positions 4 places × Encodings 3+ tricks A a = 60+ variants 5 × 4 × 3 = 60 minimum test surface
Methodology

How Distributional Auditing Works

Built on Strawberry's QMV (Quantile Model Verification) machinery, the audit framework computes statistical bounds on vulnerability.

TRUSTED Instruction ATTACK Payload UNTRUSTED Content DISTRIBUTION W1 W2 P1 P2 E1 ... LLM Evaluation RESULTS q 0.23 qL 0.18 VULNERABLE
01

Define Injection Scenarios

Each test scenario explicitly separates trusted instruction, untrusted content, and attack payload, mirroring real systems where user/external text must not become instructions.

  • Trusted instruction The system prompt or policy the LLM should follow
  • Untrusted content User input, RAG docs, tool output: anything external
  • Attack payload The injection attempt embedded in untrusted content
  • Violation predicate What counts as a policy violation (the failure condition)
02

Generate Attack Distribution

Instead of testing one payload, we generate the combinatorial space: wrappers × positions × encodings. Each dimension multiplies the test surface.

  • Wrapper variations Plain, quoted, XML, JSON, codeblock serializations
  • Position variations Before system, after user, tool output, embedded
  • Encoding variations Unicode tricks, zero-width, homoglyphs, base64
  • Indirect channels RAG retrieval, email parsing, calendar events
Typical test count 5 × 4 × 3 = 60+ variants per payload
03

Run QMV Evaluation

Each variant is evaluated against the violation predicate. Instead of pass/fail, we compute violation rates with statistical confidence bounds.

  • Backend-agnostic Works with OpenAI, vLLM, Anthropic, or dummy backends
  • Parallel execution Efficient batch evaluation across variants
  • Baseline comparison Measures delta_q shift from payload injection
  • Per-variant tracking Identifies which combinations are most vulnerable
04

Interpret Results

The output is not "pass/fail" but quantified vulnerability with confidence bounds. If q_lo is high, the model is provably vulnerable, not just "sometimes bad."

  • q̄ (mean violation rate) Average probability of policy violation across variants
  • qₗ (lower confidence bound) Worst-case estimate with statistical confidence
  • Vulnerability identification Pinpoints weakest wrapper/position combinations
  • Actionable threshold If qₗ > threshold, block deployment or fix prompt
Decision rule qₗ > 0.05 → VULNERABLE
Philosophy

What this approach deliberately avoids

These common defenses all fail under distributional attacks. We don't rely on any of them.

ignore prev /regex/ miss 1 test safe? CoT check All fail under distribution

"Ignore all previous instructions" hardening

Adding defensive text to prompts. Attackers simply vary their phrasing. This is a single-point defense against a distributional attack.

Regex or keyword blocking

Blocklists catch known patterns but miss encoding variations, synonyms, and novel phrasing. Trivially bypassed with homoglyphs or rephrasing.

Single red-team examples

Testing a handful of known payloads provides false confidence. You're only measuring memorization of those specific attacks, not robustness.

Trusting chain-of-thought self-checks

Asking the model "is this safe?" can itself be manipulated. The same injection that bypasses policy can bypass the safety check.

In Practice

Gating agent behavior on verified robustness

In agentic workflows, injection audit isn't just a test. It's a gate. Steps are rejected if Strawberry shows they could be compromised.

This prevents indirect prompt injection via tools or web content, a common real-world failure mode that single-payload testing misses entirely.

Evidence verification required

Steps are rejected if no supporting evidence passes QMV verification

Untrusted content isolation

Web and tool outputs are explicitly treated as untrusted inputs

Verification loops

Plans must survive robustness checks before execution proceeds

Statistical confidence

Decisions based on qₗ bounds, not point estimates

Ready to audit your system's injection robustness?

Stop testing single payloads. Start measuring distributional vulnerability with statistical confidence.