# Prompt Injection & Jailbreak Detection

> BERT-family classifiers tuned to label adversarial prompts. The most important analyzer to put in front of your LLM.

This analyzer detects prompt-injection attacks, jailbreak attempts, and
maliciously crafted inputs using specialized BERT-family classifier
models. It is the single most valuable analyzer to run on inbound
traffic.

|                    |                                  |
| ------------------ | -------------------------------- |
| **Canonical name** | `prompt-injection-jailbreak`     |
| **Python**         | `prompt_injection_jailbreak`     |
| **TypeScript**     | `promptInjectionJailbreak`       |
| **Server key**     | `adversarial_detection_analyzer` |
| **Category**       | Adversarial                      |

## What it detects

Three kinds of malicious input:

* **Prompt injection** — instructions that try to override your system
  prompt ("Ignore previous instructions and …").
* **Jailbreaks** — well-known role-play exploits ("DAN mode",
  "developer mode", reverse-roleplay framings).
* **Maliciously formed input** — adversarial decorations, encoding
  tricks, and adversarial perturbations against system prompts.

It does **not** detect unsafe content (that's
[Safety & Responsible AI](/analyzers/safe-responsible-ai)), sensitive
data leakage (that's [SDP](/analyzers/sensitive-data)), or
domain-specific patterns (that's [YARA](/analyzers/yara)).

## How it works

The input is tokenized and split into chunks of **400 tokens with 50
tokens of overlap**. Each chunk is sent to the chosen BERT classifier
running on the internal model service (Cloud Run + L4 GPU). The
highest-scoring (worst-case) chunk determines the final label and
confidence:

* Label `INJECTION/JAILBREAK` if the worst chunk crosses the model's
  decision threshold.
* Label `SAFE` otherwise.
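
The chunk-and-aggregate behavior described above can be sketched roughly as follows. This is an illustration only, not the service's implementation: `chunk_tokens`, `classify`, the `score_chunk` callback, and the `threshold` default of 0.5 are all hypothetical, and the real decision threshold is model-specific.

```python
from typing import Callable, List, Sequence, Tuple

CHUNK_SIZE = 400               # tokens per chunk (from the text above)
OVERLAP = 50                   # tokens shared by neighbouring chunks
STRIDE = CHUNK_SIZE - OVERLAP  # each chunk starts 350 tokens after the last


def chunk_tokens(tokens: Sequence[int]) -> List[Sequence[int]]:
    """Split a token sequence into overlapping windows."""
    if len(tokens) <= CHUNK_SIZE:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + CHUNK_SIZE])
        if start + CHUNK_SIZE >= len(tokens):
            break  # this window already reaches the end of the input
        start += STRIDE
    return chunks


def classify(
    tokens: Sequence[int],
    score_chunk: Callable[[Sequence[int]], float],
    threshold: float = 0.5,  # assumed value; the real threshold is per model
) -> Tuple[str, float]:
    """Label the whole input from its worst (highest-scoring) chunk."""
    worst = max(score_chunk(chunk) for chunk in chunk_tokens(tokens))
    label = "INJECTION/JAILBREAK" if worst > threshold else "SAFE"
    return label, worst
```

Taking the maximum means a single adversarial chunk is enough to flag a long, otherwise benign input, which is why the overlap matters: an attack that straddles a chunk boundary still appears whole in at least one window if it is shorter than 50 tokens.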

## Available models

You select a model with the `model_id` parameter. Larger models are
more accurate; smaller models are faster.

| Model ID                                               | Friendly name                  | Notes                                     |
| ------------------------------------------------------ | ------------------------------ | ----------------------------------------- |
| `meta-llama/Llama-Prompt-Guard-2-22M`                  | Llama Prompt Guard 22M         | **Default**. Fastest.                     |
| `meta-llama/Llama-Prompt-Guard-2-86M`                  | Llama Prompt Guard 86M         | Balanced accuracy / latency.              |
| `protectai/deberta-v3-base-prompt-injection-v2`        | DeBERTa v3 Prompt Injection    | Strong on classic injection idioms.       |
| `testsavantai/prompt-injection-defender-large-v0-onnx` | Prompt Injection Defender ONNX | ONNX runtime; useful as a second opinion. |

## Parameters

| Key        | Type   | Required | Default                               | Notes                        |
| ---------- | ------ | -------- | ------------------------------------- | ---------------------------- |
| `model_id` | select | Yes      | `meta-llama/Llama-Prompt-Guard-2-22M` | Choose from the table above. |
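
A small client-side check can catch a mistyped `model_id` before a request is sent. The sketch below assumes the parameters are passed as a plain mapping; the `analyzer_params` helper and that payload shape are hypothetical, and only the `model_id` key and the four model IDs come from the tables above.

```python
from typing import Dict

DEFAULT_MODEL = "meta-llama/Llama-Prompt-Guard-2-22M"

# Model IDs copied from the "Available models" table.
KNOWN_MODELS = (
    "meta-llama/Llama-Prompt-Guard-2-22M",
    "meta-llama/Llama-Prompt-Guard-2-86M",
    "protectai/deberta-v3-base-prompt-injection-v2",
    "testsavantai/prompt-injection-defender-large-v0-onnx",
)


def analyzer_params(model_id: str = DEFAULT_MODEL) -> Dict[str, str]:
    """Build the analyzer parameters, rejecting unknown model IDs early."""
    if model_id not in KNOWN_MODELS:
        raise ValueError(f"unknown model_id: {model_id!r}")
    return {"model_id": model_id}
```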

## Outputs and metrics

The `analyzer_results.prompt-injection-jailbreak` block looks like:

```json
{
  "label": "INJECTION/JAILBREAK",
  "score": 0.97,
  "metrics": {
    "score": 0.97,
    "inference_time_ms": 38.4
  },
  "status": "OK"
}
```

| Metric              | Type  | Range   | Notes                                 |
| ------------------- | ----- | ------- | ------------------------------------- |
| `score`             | float | 0.0–1.0 | Probability the input is adversarial. |
| `inference_time_ms` | float | —       | Model inference duration.             |
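
As a minimal sketch of consuming this block, the helper below pulls the fields out of an already-decoded response. It assumes the response is a dict with an `analyzer_results` key, as the path above suggests; the `Verdict` type and `read_verdict` function are hypothetical names, not part of any SDK.

```python
from typing import Any, Dict, NamedTuple

ANALYZER = "prompt-injection-jailbreak"


class Verdict(NamedTuple):
    label: str
    score: float
    inference_time_ms: float
    ok: bool


def read_verdict(response: Dict[str, Any]) -> Verdict:
    """Extract this analyzer's fields from a decoded response dict."""
    block = response["analyzer_results"][ANALYZER]
    metrics = block.get("metrics", {})
    return Verdict(
        label=block["label"],
        score=float(block["score"]),
        inference_time_ms=float(metrics.get("inference_time_ms", 0.0)),
        ok=block.get("status") == "OK",
    )


# Usage with the sample block shown above.
sample = {
    "analyzer_results": {
        ANALYZER: {
            "label": "INJECTION/JAILBREAK",
            "score": 0.97,
            "metrics": {"score": 0.97, "inference_time_ms": 38.4},
            "status": "OK",
        }
    }
}
verdict = read_verdict(sample)
```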

## Termination signals

| Signal                              | What it matches                                           |
| ----------------------------------- | --------------------------------------------------------- |
| Boolean: `is_malicious`             | Fires when the classifier label is `INJECTION/JAILBREAK`. |
| Output match: `INJECTION/JAILBREAK` | Same as above, expressed as a regex.                      |
| Output match: `SAFE`                | Fires on benign input. Useful for "allow lists".          |

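Suggested score thresholds. A lower value fires on more traffic (fewer
missed attacks, more false positives); a higher value fires only on
high-confidence detections: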

| Stance       | Operator | Value  |
| ------------ | -------- | ------ |
| Conservative | `>`      | `0.50` |
| Balanced     | `>`      | `0.75` |
| Aggressive   | `>`      | `0.90` |

The shipped `default-inbound` policy uses `score >= 0.85` AND
`output_match: INJECTION/JAILBREAK` with `terminate_immediately`.
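
The comparisons above are simple enough to express directly. The sketch below mirrors the suggested thresholds and the `default-inbound` condition as plain Python; the function names are hypothetical, and real policies are configured in the platform rather than in client code.

```python
INJECTION_LABEL = "INJECTION/JAILBREAK"

# Suggested thresholds from the table above; each is compared with ">".
STANCE_THRESHOLDS = {
    "conservative": 0.50,
    "balanced": 0.75,
    "aggressive": 0.90,
}


def exceeds_stance(score: float, stance: str) -> bool:
    """True when the score is strictly above the stance's threshold."""
    return score > STANCE_THRESHOLDS[stance]


def default_inbound_fires(label: str, score: float) -> bool:
    """Mirror of the shipped rule: score >= 0.85 AND label match."""
    return score >= 0.85 and label == INJECTION_LABEL
```

Note the difference in operators: the suggested thresholds use a strict `>`, while the shipped rule uses `>=`, so a score of exactly 0.85 fires the shipped rule.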

## Limits and cost

| Limit             | Value                            |
| ----------------- | -------------------------------- |
| Max input tokens  | 100,000                          |
| Requests / minute | 500 (per tenant)                 |
| Chunk size        | 400 tokens with 50-token overlap |
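
To get a feel for how these limits interact, the sketch below estimates how many chunks an input produces. It assumes each chunk starts 350 tokens after the previous one (400 minus the 50-token overlap) and that a shorter final chunk is still scored; `chunk_count` is a hypothetical helper, and the service's exact handling of the last chunk may differ.

```python
MAX_INPUT_TOKENS = 100_000
CHUNK_SIZE = 400
OVERLAP = 50
STRIDE = CHUNK_SIZE - OVERLAP  # 350


def chunk_count(n_tokens: int) -> int:
    """Estimate the number of chunks for an input of n_tokens tokens."""
    if n_tokens > MAX_INPUT_TOKENS:
        raise ValueError("input exceeds the 100,000-token limit")
    if n_tokens <= CHUNK_SIZE:
        return 1
    # Integer ceiling of (n_tokens - CHUNK_SIZE) / STRIDE, plus the first chunk.
    return (n_tokens - CHUNK_SIZE + STRIDE - 1) // STRIDE + 1
```

Under these assumptions an input at the 100,000-token limit yields 286 chunks, so very long inputs cost many more classifier calls than short prompts.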

Cost reflects model-inference time on the internal model service:
approximately **\$0.02 / call** at the time of writing, billed via
metered tokens on your subscription. See
[Billing](/administration/billing).

## Typical latency

20–100 ms, depending on input length and model size. A cold start adds
a small one-off penalty per Cloud Run instance.

## When to use it

* **Always on inbound.** This is the single most valuable analyzer to
  put before your LLM. Use the 22M model unless you have measured
  evidence the 86M model is worth the extra latency for your traffic.
* **Optional on outbound.** The classifier is trained for *inputs*; on
  outputs it tends to overfire on quoted user text. Prefer
  [Safety & Responsible AI](/analyzers/safe-responsible-ai) for
  outbound.
* **Pair with semantic threat intel.** This classifier is strong on the
  syntactic shape of attacks; the
  [Semantic Threat Intelligence](/analyzers/semantic-threat-intelligence)
  analyzer catches paraphrases and obfuscations the classifier misses.

## Failure modes

* **Model service unavailable** → `analyzer_unavailable` 503 with
  `Retry-After`. SDKs retry automatically.
* **Token limit exceeded** → `payload_too_large` 413.
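
If you call the API without an SDK, the two failure modes need different handling: a 503 is worth retrying after the advertised delay, while a 413 will fail again until the input is shortened. The sketch below shows one way to make that decision; `plan_retry` and its `default_delay` are hypothetical, and it only understands a `Retry-After` value given in seconds.

```python
from typing import Mapping, Optional


def plan_retry(
    status: int,
    headers: Mapping[str, str],
    default_delay: float = 1.0,  # assumed fallback, not a documented value
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if a retry won't help."""
    if status != 503:
        # 413 (payload_too_large) and other errors: retrying the same
        # request will not succeed.
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return max(0.0, float(lowered.get("retry-after", "")))
    except ValueError:
        # Header missing or in HTTP-date form: fall back to a fixed delay.
        return default_delay
```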

## Next

* [Combined analyzer](/concepts/combined-analyzer) — termination
  rules.
* [Safety & Responsible AI](/analyzers/safe-responsible-ai) — the
  outbound counterpart.
* [Semantic Threat Intelligence](/analyzers/semantic-threat-intelligence) — paraphrase coverage.
