# Safety & Responsible AI Guardrails

> LLM-as-a-judge content safety using ShieldGemma. Best for outbound responses.

This analyzer evaluates text against safety policies for dangerous
content, harassment, hate speech, and sexually explicit material. It
uses Google's **ShieldGemma** family of safety classifiers as an
LLM-as-a-judge, served through vLLM on the internal model service.

|                    |                              |
| ------------------ | ---------------------------- |
| **Canonical name** | `safe-responsible-ai`        |
| **Python**         | `safe_responsible_ai`        |
| **TypeScript**     | `safeResponsibleAI`          |
| **Server key**     | `safety_moderation_analyzer` |
| **Category**       | Content Safety               |

## What it detects

Default categories (configurable per safety policy):

* **Dangerous content** — instructions for violence, self-harm,
  dangerous medical practice, illegal activity.
* **Harassment** — bullying, intimidation, doxxing.
* **Hate speech** — discrimination, slurs, dehumanizing content.
* **Sexually explicit content** — explicit material, sexual content
  involving minors.

The default safety policy also ships three additional categories:
misinformation, privacy violations, and illegal content. To redefine the
category set, author a custom safety policy; see
[Safety policies](/policies/safety-policies).

## How it works

For each enabled category, Agnes constructs a per-category prompt using
your chosen [safety policy](/policies/safety-policies) and the
ShieldGemma chat template:

```
<start_of_turn>user
You are a policy expert trying to help determine whether ...
... policy guidelines ...
... text under evaluation ...

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
```

ShieldGemma answers `Yes` or `No`, with an associated confidence. Agnes
parses the verdict, normalizes it to a score between 0.0 and 1.0, and
reports the per-category breakdown. Content is flagged as **unsafe** if
any category scores `>= 0.5`.
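
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Agnes's implementation: `build_prompt` and `is_unsafe` are hypothetical helper names, the elided parts of the template are kept as placeholders, and the per-category scores are assumed to be already normalized to the 0.0 to 1.0 range.

```python
UNSAFE_THRESHOLD = 0.5  # per the text above: any category scoring >= 0.5 is unsafe


def build_prompt(guideline: str, text: str) -> str:
    """Assemble one per-category prompt in the template shape shown above.

    The instruction wording is elided in the docs, so the "..." is kept
    as a placeholder rather than filled in.
    """
    return (
        "<start_of_turn>user\n"
        "You are a policy expert trying to help determine whether ...\n"
        f"{guideline}\n"
        f"{text}\n"
        "\n"
        "Does the input violate the above policy?\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


def is_unsafe(scores: dict[str, float]) -> bool:
    """Flag content as unsafe if any per-category score reaches the threshold."""
    return any(score >= UNSAFE_THRESHOLD for score in scores.values())


# Hypothetical normalized scores, one per enabled category.
scores = {"Hate Speech": 0.91, "Harassment": 0.62, "Dangerous Content": 0.04}
print(is_unsafe(scores))  # True
```

Because the check uses `any`, a single category at or above the threshold is enough to flag the content, however low the other categories score.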

## Available models

Selected with the `model_id` parameter:

| Model ID                 | Notes                                                   |
| ------------------------ | ------------------------------------------------------- |
| `google/shieldgemma-2b`  | **Default**. Fastest. \~8K char prompt limit.           |
| `google/shieldgemma-9b`  | Balanced. \~16K char prompt limit.                      |
| `google/shieldgemma-27b` | Highest accuracy. \~32K char prompt limit. Higher cost. |
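
If prompt length drives your model choice, a small helper can pick the smallest model that fits. The sketch below is illustrative only: `smallest_model_for` is a hypothetical name, and the limits read "8K" as 8,000 characters, which is an assumption about the approximate figures in the table. The length that counts is the full prompt (template, policy guidelines, and the text under evaluation), not the input text alone.

```python
from typing import Optional

# Approximate limits from the table above, ordered smallest model first.
# "8K" is interpreted as 8,000 characters here (an assumption).
PROMPT_CHAR_LIMITS = {
    "google/shieldgemma-2b": 8_000,
    "google/shieldgemma-9b": 16_000,
    "google/shieldgemma-27b": 32_000,
}


def smallest_model_for(prompt_chars: int) -> Optional[str]:
    """Return the smallest model whose limit covers the prompt, or None."""
    for model_id, limit in PROMPT_CHAR_LIMITS.items():
        if prompt_chars <= limit:
            return model_id
    return None  # too long for every model; shorten or split the input


print(smallest_model_for(12_000))  # google/shieldgemma-9b
```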

## Parameters

| Key         | Type   | Required | Default                 | Notes                                                   |
| ----------- | ------ | -------- | ----------------------- | ------------------------------------------------------- |
| `model_id`  | select | No       | `google/shieldgemma-2b` | Choose from the table above.                            |
| `policy_id` | select | No       | tenant default          | Reference a [safety policy](/policies/safety-policies). |

## Outputs and metrics

```json
{
  "is_safe": false,
  "categories": [
    { "name": "Hate Speech", "score": 0.91, "verdict": "violation" },
    { "name": "Harassment",  "score": 0.62, "verdict": "violation" },
    { "name": "Dangerous Content", "score": 0.04, "verdict": "ok" }
  ],
  "metrics": {
    "inference_time_ms": 142.0,
    "max_violation_score": 0.91,
    "violation_category_count": 2
  },
  "status": "OK"
}
```

| Metric                     | Range   | Suggested thresholds                                                         |
| -------------------------- | ------- | ---------------------------------------------------------------------------- |
| `max_violation_score`      | 0.0–1.0 | `>= 0.3` (borderline), `>= 0.5` (any violation), `>= 0.8` (high confidence). |
| `violation_category_count` | int     | `> 0` (any category), `>= 2` (multiple categories).                          |
| `inference_time_ms`        | float   | Total across all categories.                                                 |
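
One way to act on the suggested thresholds is to map `max_violation_score` to a band. The sketch below uses the three cut-offs from the table; the function name and the band labels (`high_confidence`, `violation`, `borderline`, `clear`) are hypothetical and are not values returned by the analyzer.

```python
def severity_band(max_violation_score: float) -> str:
    """Map a score to a band using the suggested thresholds (0.3 / 0.5 / 0.8)."""
    if max_violation_score >= 0.8:
        return "high_confidence"
    if max_violation_score >= 0.5:
        return "violation"
    if max_violation_score >= 0.3:
        return "borderline"
    return "clear"


print(severity_band(0.91))  # high_confidence
print(severity_band(0.62))  # violation
print(severity_band(0.04))  # clear
```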

## Termination signals

| Signal                                                                                 | What it matches                                                                |
| -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| Boolean: `is_safe`                                                                     | Fires when the analyzer determines the content is unsafe (`is_safe == false`). |
| Output match: `Dangerous Content` / `Harassment` / `Hate Speech` / `Sexually Explicit` | Fires when that category is flagged.                                           |

Combine with `max_violation_score` for stricter control. Example: only
terminate when `is_safe == false` AND `max_violation_score >= 0.8`.
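
The combined rule can be written as a simple predicate over the analyzer result. This sketch assumes the result has already been parsed into a dict with the shape shown under "Outputs and metrics"; `should_terminate` is a hypothetical helper, not part of the SDK.

```python
def should_terminate(result: dict, min_score: float = 0.8) -> bool:
    """Terminate only when the content is unsafe AND the top score is high."""
    unsafe = result["is_safe"] is False
    high_confidence = result["metrics"]["max_violation_score"] >= min_score
    return unsafe and high_confidence


# Trimmed-down version of the example output above.
result = {"is_safe": False, "metrics": {"max_violation_score": 0.91}}
print(should_terminate(result))  # True
```

With the default `min_score` of 0.8, a result that is unsafe but peaks at 0.62 would not terminate, which is the stricter behavior the rule is meant to give.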

## Limits and cost

| Limit             | Value                          |
| ----------------- | ------------------------------ |
| Max input tokens  | 100,000                        |
| Requests / minute | 100 (per tenant)               |
| Prompt char limit | 8K (2B) / 16K (9B) / 32K (27B) |

Cost varies by model size; the catalog lists it only as "model inference
cost". Expect the 9B model to cost roughly 4× the 2B model, and the 27B
model roughly 12×.

## Typical latency

Typically 50–200 ms, depending on the model and the number of enabled
categories. Each enabled category runs an independent ShieldGemma
forward pass.

## When to use it

* **Best on outbound.** This is the canonical "did my LLM produce
  something unsafe?" guardrail. Pair it with a strict safety policy on
  outbound and a permissive one on inbound.
* **Skip on highly templated outputs.** If you fully control the model
  output (e.g. JSON schema, structured tools), the safety judge is
  often redundant — a YARA / regex check is enough.
* **Pick the smallest model that meets your accuracy bar.** Most teams
  ship on `shieldgemma-2b` and only escalate to `9b` for explicit
  high-risk surfaces.

## Failure modes

* **Model service unavailable** → a 503 `analyzer_unavailable` response
  with a `Retry-After` header. SDKs retry automatically.
* **Prompt longer than the model's limit** → the analyzer returns an
  error in its result. The prompt is not truncated automatically.

## Next

* [Safety policies](/policies/safety-policies) — author custom
  category guidelines.
* [Combined analyzer](/concepts/combined-analyzer) — wiring this
  analyzer into a policy.