| Canonical name | prompt-injection-jailbreak |
| Python | prompt_injection_jailbreak |
| TypeScript | promptInjectionJailbreak |
| Server key | adversarial_detection_analyzer |
| Category | Adversarial |
What it detects
Three kinds of malicious input:
- Prompt injection: instructions that try to override your system prompt (“Ignore previous instructions and …”).
- Jailbreaks: well-known role-play exploits (“DAN mode”, “developer mode”, reverse-roleplay framings).
- Maliciously formed input: adversarial decorations, encoding tricks, and adversarial perturbations against system prompts.
How it works
The input is tokenized and split into chunks of 400 tokens with 50 tokens of overlap. Each chunk is sent to the chosen BERT classifier running on the internal model service (Cloud Run + L4 GPU). The worst-case chunk score determines the final label and confidence:
- Label INJECTION/JAILBREAK if the worst chunk crosses the model’s decision threshold.
- Label SAFE otherwise.
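The chunk-and-aggregate flow described above can be sketched as follows. This is a minimal illustration, not the service's implementation: `split_into_chunks`, `classify`, the `score_chunk` callback, the `threshold` default of 0.5, and the treatment of empty input as SAFE are all assumptions made for the example. The real decision threshold is model-specific and is not stated here.

```python
from typing import Callable, List, Sequence, Tuple

CHUNK_SIZE = 400  # tokens per chunk, as documented
OVERLAP = 50      # tokens shared by consecutive chunks, as documented


def split_into_chunks(tokens: Sequence[int],
                      size: int = CHUNK_SIZE,
                      overlap: int = OVERLAP) -> List[Sequence[int]]:
    """Slide a window of `size` tokens forward by `size - overlap` per step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    if not tokens:
        return []
    stride = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window reached the end
            break
        start += stride
    return chunks


def classify(tokens: Sequence[int],
             score_chunk: Callable[[Sequence[int]], float],
             threshold: float = 0.5) -> Tuple[str, float]:
    """Return (label, confidence) taken from the worst-scoring chunk.

    `score_chunk` stands in for the remote classifier call (hypothetical).
    `threshold` is an assumed placeholder for the model's decision threshold.
    """
    chunks = split_into_chunks(tokens)
    if not chunks:
        return "SAFE", 0.0  # assumption: empty input is treated as benign
    worst = max(score_chunk(chunk) for chunk in chunks)
    label = "INJECTION/JAILBREAK" if worst >= threshold else "SAFE"
    return label, worst
```

Because the maximum is taken over chunks, a single adversarial passage anywhere in a long input is enough to flag the whole input.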
Available models
You select a model with the model_id parameter. Larger models are
more accurate; smaller models are faster.
| Model ID | Friendly name | Notes |
|---|---|---|
| meta-llama/Llama-Prompt-Guard-2-22M | Llama Prompt Guard 22M | Default. Fastest. |
| meta-llama/Llama-Prompt-Guard-2-86M | Llama Prompt Guard 86M | Balanced accuracy / latency. |
| protectai/deberta-v3-base-prompt-injection-v2 | DeBERTa v3 Prompt Injection | Strong on classic injection idioms. |
| testsavantai/prompt-injection-defender-large-v0-onnx | Prompt Injection Defender ONNX | ONNX runtime; useful as a second opinion. |
Parameters
| Key | Type | Required | Default | Notes |
|---|---|---|---|---|
| model_id | select | Yes | meta-llama/Llama-Prompt-Guard-2-22M | Choose from the table above. |
Outputs and metrics
The analyzer_results.prompt-injection-jailbreak block looks like:
| Metric | Type | Range | Notes |
|---|---|---|---|
| score | float | 0.0–1.0 | Probability the input is adversarial. |
| inference_time_ms | float | n/a | Model inference duration. |
Termination signals
| Signal | What it matches |
|---|---|
| Boolean: is_malicious | Fires when the classifier label is INJECTION/JAILBREAK. |
| Output match: INJECTION/JAILBREAK | Same as above, expressed as a regex. |
| Output match: SAFE | Fires on benign input. Useful for “allow lists”. |
| Stance | Operator | Value |
|---|---|---|
| Conservative | > | 0.50 |
| Balanced | > | 0.75 |
| Aggressive | > | 0.90 |
The default-inbound policy uses score >= 0.85 AND
output_match: INJECTION/JAILBREAK with terminate_immediately.
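To make the threshold semantics concrete, here is a small sketch of how the stance table and the default-inbound rule could be evaluated against an analyzer result. It assumes the stance values are thresholds on score, which the table does not state outright. The names `STANCE_THRESHOLDS`, `stance_fires`, and `default_inbound_terminates` are hypothetical helpers, not part of any SDK.

```python
# Stance thresholds copied from the table above (operator is strict ">").
STANCE_THRESHOLDS = {
    "conservative": 0.50,
    "balanced": 0.75,
    "aggressive": 0.90,
}

ADVERSARIAL_LABEL = "INJECTION/JAILBREAK"


def stance_fires(score: float, stance: str) -> bool:
    """True when the score exceeds the threshold for the given stance."""
    return score > STANCE_THRESHOLDS[stance]


def default_inbound_terminates(score: float, label: str) -> bool:
    """Mirror of the default-inbound rule: score >= 0.85 AND label match."""
    return score >= 0.85 and label == ADVERSARIAL_LABEL


# Example: a score of 0.80 trips the conservative and balanced stances only.
fired = [s for s in STANCE_THRESHOLDS if stance_fires(0.80, s)]
```

Note that the stance table uses a strict comparison while the default policy uses `>=`, so a score exactly on a threshold behaves differently in the two cases.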
Limits and cost
| Limit | Value |
|---|---|
| Max input tokens | 100,000 |
| Requests / minute | 500 (per tenant) |
| Chunk size | 400 tokens with 50-token overlap |
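For capacity planning, the number of classifier calls per request can be estimated from the chunking parameters. The sketch below assumes a plain sliding window with a stride of 350 tokens (400 minus the 50-token overlap) and a final partial chunk. How the service actually handles the last window is not documented here, and `chunk_count` is a hypothetical helper.

```python
def chunk_count(n_tokens: int, size: int = 400, overlap: int = 50) -> int:
    """Estimate how many chunks a sliding window produces for n_tokens."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    stride = size - overlap
    remaining = n_tokens - size
    # Integer ceiling division: one extra chunk per started stride.
    return 1 + (remaining + stride - 1) // stride


# Under these assumptions, an input at the 100,000-token limit
# yields 286 chunks.
max_chunks = chunk_count(100_000)
```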
Typical latency
20–100 ms depending on input length and model size. Cold-start adds a small one-off penalty per Cloud Run instance.
When to use it
- Always on inbound. This is the single most valuable analyzer to put before your LLM. Use the 22M model unless you have measured evidence the 86M model is worth the extra latency for your traffic.
- Optional on outbound. The classifier is trained for inputs; on outputs it tends to overfire on quoted user text. Prefer Safety & Responsible AI for outbound.
- Pair with semantic threat intel. This classifier is strong on the syntactic shape of attacks; the Semantic Threat Intelligence analyzer catches paraphrases and obfuscations the classifier misses.
Failure modes
- Model service unavailable → analyzer_unavailable 503 with Retry-After. SDKs retry automatically.
- Token limit exceeded → payload_too_large 413.
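If you call the API without an SDK, you need to handle these two responses yourself. The following is a minimal sketch under stated assumptions: `retry_delay_seconds` and `PayloadTooLarge` are hypothetical names, the header lookup expects the exact key `Retry-After`, and only the delta-seconds form of the header is parsed (an HTTP-date value falls back to `default_delay`).

```python
from typing import Mapping, Optional


class PayloadTooLarge(Exception):
    """Raised for 413: retrying the same input cannot succeed."""


def retry_delay_seconds(status: int,
                        headers: Mapping[str, str],
                        default_delay: float = 1.0) -> Optional[float]:
    """Return how long to wait before retrying, or None if no retry is needed.

    503 -> wait for Retry-After seconds (or default_delay if unparsable).
    413 -> raise PayloadTooLarge; the caller must shorten the input.
    """
    if status == 503:
        raw = headers.get("Retry-After", "")
        try:
            return max(0.0, float(raw))
        except ValueError:
            return default_delay
    if status == 413:
        raise PayloadTooLarge("input exceeds the analyzer's token limit")
    return None
```

A 413 is raised rather than retried because the request will keep failing until the input is shortened or split on the caller's side.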
Next
- Combined analyzer — termination rules.
- Safety & Responsible AI — the outbound counterpart.
- Semantic Threat Intelligence — paraphrase coverage.