Safety & Responsible AI Guardrails

This analyzer evaluates text against safety policies for dangerous content, harassment, hate speech, and sexually explicit material. It uses Google’s ShieldGemma family of safety classifiers as an LLM-as-a-judge, served through vLLM on the internal model service.


Canonical name	`safe-responsible-ai`
Python	`safe_responsible_ai`
TypeScript	`safeResponsibleAI`
Server key	`safety_moderation_analyzer`
Category	Content Safety

What it detects

Default categories (configurable per safety policy):

Dangerous content — instructions for violence, self-harm, dangerous medical practice, illegal activity.
Harassment — bullying, intimidation, doxxing.
Hate speech — discrimination, slurs, dehumanizing content.
Sexually explicit content — explicit material, sexual content involving minors.

Additional categories shipped in the default safety policy: misinformation, privacy violations, illegal content. You can author a custom safety policy to redefine the category set; see Safety policies.

How it works

For each enabled category, Agnes constructs a per-category prompt using your chosen safety policy and the ShieldGemma chat template:

<start_of_turn>user
You are a policy expert trying to help determine whether ...
... policy guidelines ...
... text under evaluation ...

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model

ShieldGemma replies with a Yes / No verdict plus a confidence score. Agnes parses the verdict, normalizes it to a confidence value between 0.0 and 1.0, and reports the per-category breakdown. Content is flagged as unsafe (`is_safe` is `false`) if any category scores `>= 0.5`.
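The flagging rule above can be sketched as a small aggregation step. This is an illustration only, not Agnes's implementation: the `aggregate` function and `FLAG_THRESHOLD` constant are hypothetical names, and the `"violation"` / `"ok"` labels are borrowed from the sample output shown under Outputs and metrics.

```python
from typing import Dict, List

# From the text: content is unsafe if any category scores >= 0.5.
FLAG_THRESHOLD = 0.5


def aggregate(scores: Dict[str, float]) -> dict:
    """Turn normalized per-category scores into an overall verdict.

    Hypothetical helper: the real analyzer performs this step server-side.
    """
    categories: List[dict] = [
        {
            "name": name,
            "score": score,
            "verdict": "violation" if score >= FLAG_THRESHOLD else "ok",
        }
        for name, score in scores.items()
    ]
    # Safe only if no category reached the threshold.
    is_safe = all(c["verdict"] == "ok" for c in categories)
    return {"is_safe": is_safe, "categories": categories}


result = aggregate(
    {"Hate Speech": 0.91, "Harassment": 0.62, "Dangerous Content": 0.04}
)
```

With these example scores, two categories reach the threshold, so `result["is_safe"]` is `False`.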

Available models

Selected with the `model_id` parameter:

Model ID	Notes
`google/shieldgemma-2b`	Default. Fastest. ~8K char prompt limit.
`google/shieldgemma-9b`	Balanced. ~16K char prompt limit.
`google/shieldgemma-27b`	Highest accuracy. ~32K char prompt limit. Higher cost.

Parameters

Key	Type	Required	Default	Notes
`model_id`	select	No	`google/shieldgemma-2b`	Choose from the table above.
`policy_id`	select	No	tenant default	Reference a safety policy.

Outputs and metrics

{
  "is_safe": false,
  "categories": [
    { "name": "Hate Speech", "score": 0.91, "verdict": "violation" },
    { "name": "Harassment",  "score": 0.62, "verdict": "violation" },
    { "name": "Dangerous Content", "score": 0.04, "verdict": "ok" }
  ],
  "metrics": {
    "inference_time_ms": 142.0,
    "max_violation_score": 0.91,
    "violation_category_count": 2
  },
  "status": "OK"
}

Metric	Range	Suggested thresholds
`max_violation_score`	0.0–1.0	`>= 0.5` (any violation), `>= 0.8` (high confidence), `>= 0.3` (borderline).
`violation_category_count`	int	`> 0` (any category), `>= 2` (multiple categories).
`inference_time_ms`	float	Total across all categories.
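One way to act on the suggested thresholds is to bucket `max_violation_score` into severity bands on the client side. The sketch below assumes the `metrics` object has the shape shown in the sample output; the `severity` function and its band labels are hypothetical and not part of the analyzer's response.

```python
def severity(metrics: dict) -> str:
    """Map max_violation_score onto the suggested threshold bands.

    Band labels are illustrative; only the numeric cut-offs come from
    the thresholds table.
    """
    score = metrics["max_violation_score"]
    if score >= 0.8:
        return "high_confidence"
    if score >= 0.5:
        return "violation"
    if score >= 0.3:
        return "borderline"
    return "clear"


def spans_multiple_categories(metrics: dict) -> bool:
    """True when two or more categories were flagged (the `>= 2` threshold)."""
    return metrics["violation_category_count"] >= 2


sample_metrics = {
    "inference_time_ms": 142.0,
    "max_violation_score": 0.91,
    "violation_category_count": 2,
}
band = severity(sample_metrics)
```

For the sample metrics above, `band` is `"high_confidence"` and `spans_multiple_categories(sample_metrics)` is `True`.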

Termination signals

Signal	What it matches
Boolean: `is_safe`	Fires when the analyzer determines the content is unsafe (`is_safe == false`).
Output match: `Dangerous Content` / `Harassment` / `Hate Speech` / `Sexually Explicit`	Fires when that category is flagged.

Combine with `max_violation_score` for stricter control. Example: only terminate when `is_safe == false` AND `max_violation_score >= 0.8`.
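If you evaluate the analyzer result in your own code rather than in a policy, the combined condition might look like the following. This is a minimal sketch assuming the result has the shape of the sample output; `should_terminate` is a hypothetical helper, not an SDK function.

```python
def should_terminate(result: dict, min_score: float = 0.8) -> bool:
    """Terminate only when content is unsafe AND the top score is high.

    Assumes `result` carries `is_safe` and `metrics.max_violation_score`
    as in the sample output.
    """
    unsafe = result["is_safe"] is False
    confident = result["metrics"]["max_violation_score"] >= min_score
    return unsafe and confident


high = {"is_safe": False, "metrics": {"max_violation_score": 0.91}}
moderate = {"is_safe": False, "metrics": {"max_violation_score": 0.62}}
decisions = (should_terminate(high), should_terminate(moderate))
```

Here the first result terminates and the second does not, because 0.62 is flagged as unsafe but falls below the stricter 0.8 bar.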

Limits and cost

Limit	Value
Max input tokens	100,000
Requests / minute	100 (per tenant)
Prompt char limit	8K (2B) / 16K (9B) / 32K (27B)
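Because over-long prompts are rejected rather than truncated, it can help to check the prompt length before choosing a model. The sketch below assumes "8K" means roughly 8,000 characters (and likewise for 16K and 32K) and that the limit applies to the full prompt, including policy guidelines; `smallest_model_for` is a hypothetical helper.

```python
from typing import Optional

# Approximate limits from the table above, smallest model first.
# Assumption: "8K" is treated as 8,000 characters.
PROMPT_CHAR_LIMITS = {
    "google/shieldgemma-2b": 8_000,
    "google/shieldgemma-9b": 16_000,
    "google/shieldgemma-27b": 32_000,
}


def smallest_model_for(prompt_chars: int) -> Optional[str]:
    """Return the smallest model whose limit fits the prompt, else None."""
    # Dicts preserve insertion order, so this walks from 2B up to 27B.
    for model_id, limit in PROMPT_CHAR_LIMITS.items():
        if prompt_chars <= limit:
            return model_id
    return None  # Too long for every model: split or shorten the input.


choice = smallest_model_for(12_000)
```

A 12,000-character prompt does not fit the 2B limit, so `choice` is `"google/shieldgemma-9b"`. Since the published limits are approximate, leaving some headroom below each limit is prudent.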

Cost varies by model size; the catalog notes “model inference cost”. Expect the 9B model to cost roughly 4× the 2B model, and the 27B model roughly 12× the 2B model.

Typical latency

50–200 ms depending on model and number of categories. Each enabled category runs an independent ShieldGemma forward pass.

When to use it

Best on outbound. This is the canonical “did my LLM produce something unsafe?” guardrail. Pair it with a strict safety policy on outbound, a permissive one on inbound.
Skip on highly templated outputs. If you fully control the model output (e.g. JSON schema, structured tools), the safety judge is often redundant — a YARA / regex check is enough.
Pick the smallest model that meets your accuracy bar. Most teams ship on `google/shieldgemma-2b` and escalate to `google/shieldgemma-9b` only for explicit high-risk surfaces.

Failure modes

Model service unavailable → HTTP 503 `analyzer_unavailable` with a `Retry-After` header. SDKs retry automatically.
Prompt longer than the model’s limit → the analyzer returns an error in its result. The prompt is not truncated automatically.
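The SDKs handle the 503 case for you. If you call the service without an SDK, a retry loop that honors `Retry-After` could look like the sketch below. Everything here is an assumption for illustration: `send` stands in for whatever function performs your HTTP request and returns `(status, headers, body)`, and `call_with_retry` is not an SDK API.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], dict]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    default_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry on 503, waiting for the number of seconds in Retry-After.

    Only the delta-seconds form of Retry-After is handled; an HTTP-date
    value falls back to `default_delay_s`. Header lookup is exact-case.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503:
            return body
        if attempt < max_attempts:
            retry_after = headers.get("Retry-After", "")
            delay = float(retry_after) if retry_after.isdigit() else default_delay_s
            sleep(delay)
    raise RuntimeError("analyzer_unavailable: retries exhausted")
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. Other error statuses are returned to the caller unchanged, since only the 503 case is documented as retryable.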

Next

Safety policies: author custom category guidelines.
Combined analyzer: wiring this analyzer into a policy.
