Prompt Injection Audit

Treating prompt injection as a reliability failure that can be measured

Instead of asking "did the model fall for this one payload?", we ask: across many placements, wrappers, and encodings of untrusted content, how likely is the model to violate policy? If that probability is non-trivial, the system is vulnerable.

The Problem

Most security testing provides false confidence

Teams test prompt injection by throwing a handful of known payloads at their system. If the model doesn't fall for "ignore all previous instructions," they ship.

This is like testing a lock with one key and declaring it secure. Real attackers don't use your test cases.

Single-Payload Testing

Tests one trick at a time. Passes if that specific payload fails. Provides false confidence because it misses the vast majority of attack variations.

Distributional Auditing

Tests across the entire attack surface: wrappers, positions, encodings, indirect vectors. Computes violation probability with statistical confidence bounds.

Prompt injection isn't a single string. It's a family of transformations. A system is only "safe" if it's robust across the whole distribution.

Core Insight Distributional Security Testing

Attack Surface

The transformation families we test

Each attack isn't a single string. It's a combinatorial space of variations. We test systematically across all dimensions.

Methodology

How Distributional Auditing Works

Built on Strawberry's QMV (Quantile Model Verification) machinery, the audit framework computes statistical bounds on vulnerability.

Define Injection Scenarios

Each test scenario explicitly separates trusted instruction, untrusted content, and attack payload, mirroring real systems where user/external text must not become instructions.

Trusted instruction The system prompt or policy the LLM should follow
Untrusted content User input, RAG docs, tool output: anything external
Attack payload The injection attempt embedded in untrusted content
Violation predicate What counts as a policy violation (the failure condition)

Generate Attack Distribution

Instead of testing one payload, we generate the combinatorial space: wrappers × positions × encodings. Each dimension multiplies the test surface.

Wrapper variations Plain, quoted, XML, JSON, codeblock serializations
Position variations Before system, after user, tool output, embedded
Encoding variations Unicode tricks, zero-width, homoglyphs, base64
Indirect channels RAG retrieval, email parsing, calendar events

Typical test count 5 × 4 × 3 = 60+ variants per payload

Run QMV Evaluation

Each variant is evaluated against the violation predicate. Instead of pass/fail, we compute violation rates with statistical confidence bounds.

Backend-agnostic Works with OpenAI, vLLM, Anthropic, or dummy backends
Parallel execution Efficient batch evaluation across variants
Baseline comparison Measures delta_q shift from payload injection
Per-variant tracking Identifies which combinations are most vulnerable

Interpret Results

The output is not "pass/fail" but quantified vulnerability with confidence bounds. If q_lo is high, the model is provably vulnerable, not just "sometimes bad."

q̄ (mean violation rate) Average probability of policy violation across variants
qₗ (lower confidence bound) Worst-case estimate with statistical confidence
Vulnerability identification Pinpoints weakest wrapper/position combinations
Actionable threshold If qₗ > threshold, block deployment or fix prompt

Decision rule qₗ > 0.05 → VULNERABLE

Philosophy

What this approach deliberately avoids

These common defenses all fail under distributional attacks. We don't rely on any of them.

"Ignore all previous instructions" hardening

Adding defensive text to prompts. Attackers simply vary their phrasing. This is a single-point defense against a distributional attack.

Regex or keyword blocking

Blocklists catch known patterns but miss encoding variations, synonyms, and novel phrasing. Trivially bypassed with homoglyphs or rephrasing.

Single red-team examples

Testing a handful of known payloads provides false confidence. You're only measuring memorization of those specific attacks, not robustness.

Trusting chain-of-thought self-checks

Asking the model "is this safe?" can itself be manipulated. The same injection that bypasses policy can bypass the safety check.

In Practice

Gating agent behavior on verified robustness

In agentic workflows, injection audit isn't just a test. It's a gate. Steps are rejected if Strawberry shows they could be compromised.

This prevents indirect prompt injection via tools or web content, a common real-world failure mode that single-payload testing misses entirely.

Evidence verification required

Steps are rejected if no supporting evidence passes QMV verification

Untrusted content isolation

Web and tool outputs are explicitly treated as untrusted inputs

Verification loops

Plans must survive robustness checks before execution proceeds

Statistical confidence

Decisions based on qₗ bounds, not point estimates

Ready to audit your system's injection robustness?

Stop testing single payloads. Start measuring distributional vulnerability with statistical confidence.

View on GitHub Get in touch