Prompt Injection & Jailbreak Detection

This analyzer detects prompt-injection attacks, jailbreak attempts, and maliciously crafted inputs using specialized BERT-family classifier models. It is the single most valuable analyzer to run on inbound traffic.


Canonical name	`prompt-injection-jailbreak`
Python	`prompt_injection_jailbreak`
TypeScript	`promptInjectionJailbreak`
Server key	`adversarial_detection_analyzer`
Category	Adversarial

What it detects

Three kinds of malicious input:

Prompt injection — instructions that try to override your system prompt (“Ignore previous instructions and …”).
Jailbreaks — well-known role-play exploits (“DAN mode”, “developer mode”, reverse-roleplay framings).
Maliciously formed input — adversarial decorations, encoding tricks, and adversarial perturbations against system prompts.

It does not detect unsafe content (that’s Safety & Responsible AI), sensitive data leakage (that’s SDP), or domain-specific patterns (that’s YARA).

How it works

The input is tokenized and split into chunks of 400 tokens with 50 tokens of overlap. Each chunk is sent to the chosen BERT classifier running on the internal model service (Cloud Run + L4 GPU). The worst-case chunk score determines the final label and confidence:

Label INJECTION/JAILBREAK if the worst chunk crosses the model’s decision threshold.
Label SAFE otherwise.
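
The chunk-and-aggregate flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's code: `chunk_tokens`, `classify`, and the `score_chunk` callback are hypothetical names, the 0.5 threshold is a placeholder (each model's real decision threshold is not given here), and the handling of a short trailing chunk is a guess.

```python
from typing import Callable, List, Sequence, Tuple

CHUNK_SIZE = 400  # tokens per chunk, as described above
OVERLAP = 50      # tokens shared by consecutive chunks


def chunk_tokens(tokens: Sequence[int],
                 size: int = CHUNK_SIZE,
                 overlap: int = OVERLAP) -> List[Sequence[int]]:
    """Split a token sequence into overlapping windows."""
    if not tokens:
        return []
    stride = size - overlap  # each window starts 350 tokens after the previous one
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return chunks


def classify(tokens: Sequence[int],
             score_chunk: Callable[[Sequence[int]], float],
             threshold: float = 0.5) -> Tuple[str, float]:
    """Score every chunk and let the worst (highest) score decide the label.

    `score_chunk` stands in for the remote BERT classifier; `threshold`
    is a placeholder, not the real per-model decision threshold.
    """
    scores = [score_chunk(chunk) for chunk in chunk_tokens(tokens)]
    worst = max(scores, default=0.0)
    label = "INJECTION/JAILBREAK" if worst > threshold else "SAFE"
    return label, worst
```

Because the maximum is taken across chunks, one adversarial passage inside a long, otherwise benign input is enough to flag the whole input.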

Available models

You select a model with the model_id parameter. Larger models are more accurate; smaller models are faster.

Model ID	Friendly name	Notes
`meta-llama/Llama-Prompt-Guard-2-22M`	Llama Prompt Guard 22M	Default. Fastest.
`meta-llama/Llama-Prompt-Guard-2-86M`	Llama Prompt Guard 86M	Balanced accuracy / latency.
`protectai/deberta-v3-base-prompt-injection-v2`	DeBERTa v3 Prompt Injection	Strong on classic injection idioms.
`testsavantai/prompt-injection-defender-large-v0-onnx`	Prompt Injection Defender ONNX	ONNX runtime; useful as a second opinion.

Parameters

Key	Type	Required	Default	Notes
`model_id`	select	Yes	`meta-llama/Llama-Prompt-Guard-2-22M`	Choose from the table above.
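
A small client-side guard can catch a mistyped model ID before a request is sent. The sketch below is only an illustration: `analyzer_params` is a hypothetical helper, and the shape of the returned dictionary is an assumption rather than the documented request format. The model IDs and the default are taken from the tables above.

```python
SUPPORTED_MODELS = (
    "meta-llama/Llama-Prompt-Guard-2-22M",
    "meta-llama/Llama-Prompt-Guard-2-86M",
    "protectai/deberta-v3-base-prompt-injection-v2",
    "testsavantai/prompt-injection-defender-large-v0-onnx",
)
DEFAULT_MODEL = SUPPORTED_MODELS[0]  # the 22M model is listed as the default


def analyzer_params(model_id: str = DEFAULT_MODEL) -> dict:
    """Return a parameter mapping, rejecting IDs not listed in the table."""
    if model_id not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model_id: {model_id!r}")
    return {"model_id": model_id}
```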

Outputs and metrics

The analyzer_results.prompt-injection-jailbreak block looks like:

{
  "label": "INJECTION/JAILBREAK",
  "score": 0.97,
  "metrics": {
    "score": 0.97,
    "inference_time_ms": 38.4
  },
  "status": "OK"
}

Metric	Type	Range	Notes
`score`	float	0.0–1.0	Probability the input is adversarial.
`inference_time_ms`	float	—	Model inference duration.
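
Reading the block in client code might look like the sketch below. It assumes the response is a JSON object with an `analyzer_results` mapping keyed by the canonical analyzer name; `InjectionResult` and `parse_result` are hypothetical names, not part of any SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class InjectionResult:
    label: str
    score: float
    inference_time_ms: float
    status: str


def parse_result(response: dict) -> InjectionResult:
    """Extract this analyzer's block from an (assumed) response envelope."""
    block = response["analyzer_results"]["prompt-injection-jailbreak"]
    metrics = block.get("metrics", {})
    return InjectionResult(
        label=block["label"],
        score=float(block["score"]),
        inference_time_ms=float(metrics.get("inference_time_ms", 0.0)),
        status=block.get("status", ""),
    )


# Sample payload mirroring the block shown above.
SAMPLE = json.loads("""
{"analyzer_results": {"prompt-injection-jailbreak": {
    "label": "INJECTION/JAILBREAK",
    "score": 0.97,
    "metrics": {"score": 0.97, "inference_time_ms": 38.4},
    "status": "OK"}}}
""")
result = parse_result(SAMPLE)
```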

Termination signals

Signal	What it matches
Boolean: `is_malicious`	Fires when the classifier label is `INJECTION/JAILBREAK`.
Output match: `INJECTION/JAILBREAK`	Same as above, expressed as a regex.
Output match: `SAFE`	Fires on benign input. Useful for “allow lists”.

Suggested score thresholds (a lower value fires on more inputs, so the Conservative row flags the most traffic):

Stance	Operator	Value
Conservative	`>`	`0.50`
Balanced	`>`	`0.75`
Aggressive	`>`	`0.90`

The shipped default-inbound policy uses score >= 0.85 AND output_match: INJECTION/JAILBREAK with terminate_immediately.
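
The combination of a score comparison and an output match can be pictured with the sketch below. It is not the policy engine: `should_terminate` is a hypothetical function, and the way the real engine evaluates rules is assumed. Only the operators, the 0.85 default, and the label come from this page.

```python
import operator
import re

# Comparison operators that appear in the threshold table and default policy.
OPS = {">": operator.gt, ">=": operator.ge}


def should_terminate(result: dict,
                     op: str = ">=",
                     threshold: float = 0.85,
                     pattern: str = r"INJECTION/JAILBREAK") -> bool:
    """True when both the score condition and the output match hold."""
    score_hit = OPS[op](result["score"], threshold)
    label_hit = re.search(pattern, result["label"]) is not None
    return score_hit and label_hit
```

With the defaults this mirrors the shipped rule; passing `op=">"` and `threshold=0.75` would correspond to the Balanced row.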

Limits and cost

Limit	Value
Max input tokens	100,000
Requests / minute	500 (per tenant)
Chunk size	400 tokens with 50-token overlap
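
To get a feel for how these limits interact, the number of chunks for a given input can be estimated as below. This is a back-of-the-envelope sketch: `expected_chunks` is a hypothetical helper, and it assumes a fixed stride of 350 tokens (400 minus the 50-token overlap) with a shorter final chunk, which may not match the service exactly.

```python
MAX_INPUT_TOKENS = 100_000
CHUNK_SIZE = 400
OVERLAP = 50


def expected_chunks(n_tokens: int) -> int:
    """Estimate how many chunks an input of n_tokens would produce."""
    if n_tokens > MAX_INPUT_TOKENS:
        raise ValueError("input exceeds the documented 100,000-token limit")
    if n_tokens <= 0:
        return 0
    if n_tokens <= CHUNK_SIZE:
        return 1
    stride = CHUNK_SIZE - OVERLAP
    # Integer ceiling of (n_tokens - CHUNK_SIZE) / stride, plus the first chunk.
    return -(-(n_tokens - CHUNK_SIZE) // stride) + 1
```

Under these assumptions a maximum-size input of 100,000 tokens yields 286 chunks, which is why latency grows with input length.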

Cost is model-inference time on the internal model service — approximately $0.02 / call at the time of writing, billed via metered tokens on your subscription. See Billing.

Typical latency

20–100 ms depending on input length and model size. Cold-start adds a small one-off penalty per Cloud Run instance.

When to use it

Always on inbound. This is the single most valuable analyzer to put before your LLM. Use the 22M model unless you have measured evidence the 86M model is worth the extra latency for your traffic.
Optional on outbound. The classifier is trained for inputs; on outputs it tends to overfire on quoted user text. Prefer Safety & Responsible AI for outbound.
Pair with semantic threat intel. This classifier is strong on the syntactic shape of attacks; the Semantic Threat Intelligence analyzer catches paraphrases and obfuscations the classifier misses.

Failure modes

Model service unavailable → analyzer_unavailable (HTTP 503) with a Retry-After header. SDKs retry automatically.
Token limit exceeded → payload_too_large (HTTP 413). Retrying the same payload will not succeed; shorten the input first.
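
Callers that talk to the API without an SDK need their own retry decision. The sketch below shows one way to make it; `retry_delay_seconds` is a hypothetical helper, it only understands the delta-seconds form of Retry-After (an HTTP-date value falls back to the default), and it assumes the header name is passed with exactly this capitalization.

```python
from typing import Mapping, Optional


def retry_delay_seconds(status: int,
                        headers: Mapping[str, str],
                        default: float = 1.0) -> Optional[float]:
    """Return how long to wait before retrying, or None if a retry is pointless."""
    if status == 503:
        raw = headers.get("Retry-After", "")
        try:
            return max(0.0, float(raw))
        except ValueError:
            # Missing header or HTTP-date form: use the fallback delay.
            return default
    # 413 and other statuses: the same request would fail again.
    return None
```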

Next

Combined analyzer: termination rules.
Safety & Responsible AI: the outbound counterpart.
Semantic Threat Intelligence: paraphrase coverage.
