Safety policies

A safety policy is the prompt text the Safety & Responsible AI analyzer feeds to ShieldGemma when it acts as an LLM-as-a-judge. It tells the model which categories of content are unsafe and how to reason about edge cases. Agnes ships a sensible default; custom safety policies are how you adapt to industry-specific guidelines (medical advice safety, educational content, professional tone).

Default safety policy

The shipped default covers seven categories:
  1. No Harmful Content — violence, self-harm, harmful medical practices, dangerous or illegal activities.
  2. No Hate Speech or Discrimination — protected-class discrimination, dehumanization.
  3. No Harassment or Bullying — threats, intimidation, doxxing.
  4. No Sexual Content — explicit content, sexualization of minors, sexual exploitation.
  5. No Misinformation — harmful conspiracy theories, medical misinformation, election misinformation, public-safety misinformation.
  6. No Illegal Content — illegal activity, copyright infringement, fraud, illegal goods/services.
  7. No Privacy Violations — sharing personal information without consent, doxxing, privacy-rights violations.
The full text lives in the dashboard at agnes.lasscyber.com/protection/safety-moderation and is the default for every new tenant.
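For a sense of the markdown shape, the first category reads roughly like this (condensed from the category summary above, not the verbatim shipped text):

    ### 1. No Harmful Content
    - No violence or incitement to violence
    - No encouragement of self-harm
    - No promotion of harmful medical practices
    - No facilitation of dangerous or illegal activities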

Authoring a custom safety policy

In the dashboard:
  1. Click New safety policy.
  2. Give it a name and description.
  3. Author the policy_content markdown. Treat it as a system prompt for a judge model. The structure that works best:
    # <Safety policy name>
    
    ## Core Principles
    
    ### 1. <Category name>
    - <bulleted guideline>
    - <bulleted guideline>
    
    ### 2. <Category name>
    - <bulleted guideline>
    
    ## Content Assessment Guidelines
    
    When evaluating content, consider:
    1. The intent and context of the content
    2. The potential impact on individuals or groups
    3. Whether the content promotes or enables harmful behaviour
    
    ## Response Format
    
    The model should:
    1. Start with a clear "Yes" or "No" answer
    2. Explain which specific principles are relevant
    3. Provide a step-by-step analysis
    
  4. Optionally mark as default. The default policy is used when no policy_id is set on the safety analyzer.
  5. Save.
The analyzer constructs a per-category prompt at runtime, wrapping your policy text in the ShieldGemma chat template:
<start_of_turn>user
You are a policy expert ...

<your policy_content here>

<the input under evaluation>

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
ShieldGemma replies with a Yes/No verdict plus a confidence score. Agnes parses both and rolls the per-category scores up into max_violation_score and violation_category_count.
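Sketched in Python, that loop looks roughly like the following (the function names and the generate callable are illustrative, not Agnes internals; the "..." preamble stays elided exactly as above):

# Illustrative sketch only; not the analyzer's actual code.
SHIELDGEMMA_TEMPLATE = """<start_of_turn>user
You are a policy expert ...

{policy_content}

{input_text}

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
"""

def judge_per_category(categories, input_text, generate):
    """One ShieldGemma forward pass per category.

    categories maps category name -> that category's policy text;
    generate is whatever callable runs the model and returns
    (reply_text, confidence), e.g. ("Yes", 0.93).
    """
    results = {}
    for name, policy_content in categories.items():
        prompt = SHIELDGEMMA_TEMPLATE.format(
            policy_content=policy_content, input_text=input_text
        )
        reply, score = generate(prompt)
        results[name] = {
            "violation": reply.strip().lower().startswith("yes"),
            "score": score,
        }
    return results

def aggregate(results):
    """Roll per-category verdicts up into the two exposed fields."""
    scores = [r["score"] for r in results.values() if r["violation"]]
    # -> (max_violation_score, violation_category_count)
    return max(scores, default=0.0), len(scores)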

What “category” means here

ShieldGemma is binary per category: each category gets its own forward pass and its own Yes/No verdict. The category names you use in your policy markdown should match the output match hints exactly when you want to terminate on a specific category. The default policy uses headings that line up with the shipped match hints (Dangerous Content, Harassment, Hate Speech, Sexually Explicit). If you invent new category names, make sure your termination rules match the new strings.
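Continuing the Python sketch above, exact matching means a verdict only counts toward termination when the category string is identical to the hint (the matching code illustrates the requirement; it is not the product's implementation):

match_hint = "Dangerous Content"
should_terminate = any(
    name == match_hint and result["violation"]
    for name, result in results.items()
)
# A renamed heading such as "Dangerous and Illegal Content" would not
# equal the "Dangerous Content" hint, so it would never terminate.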

Picking a model

Larger ShieldGemma models tolerate longer policies and longer inputs:
| Model | Prompt char limit |
| --- | --- |
| google/shieldgemma-2b | 8K |
| google/shieldgemma-9b | 16K |
| google/shieldgemma-27b | 32K |
If your policy text is long (e.g. detailed industry guidelines), you may exceed the 2B model’s char limit and need to escalate to 9B.
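A quick fit check against the table, assuming the limits are plain character counts (8K = 8,000 and so on; the overhead figure for the template wrapper is a guess):

CHAR_LIMITS = {
    "google/shieldgemma-2b": 8_000,
    "google/shieldgemma-9b": 16_000,
    "google/shieldgemma-27b": 32_000,
}

def smallest_fitting_model(policy_content, input_text, overhead=500):
    """Return the smallest ShieldGemma variant the prompt fits in."""
    needed = len(policy_content) + len(input_text) + overhead
    for model, limit in CHAR_LIMITS.items():  # insertion order: 2b, 9b, 27b
        if needed <= limit:
            return model
    raise ValueError(f"{needed} chars exceeds even the 27B limit")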

Worked example: a strict policy

If the default is too permissive for your audience, a stricter custom policy (sketched after this list) might:
  • Tighten Misinformation to flag medical claims even when contextual.
  • Add a category Professional Conduct that flags personal-life questions in a workplace assistant.
  • Drop Sexual Content if your domain is fully family-friendly (the default would still catch it via Sexually Explicit).
You can also clone the default and remove categories you do not want to evaluate; fewer categories = lower cost (each category is an independent ShieldGemma forward pass).
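Put together, the policy_content for such a policy might start like this (an illustrative sketch following the template above, not shipped text):

    # Strict Workplace Policy

    ## Core Principles

    ### 1. No Misinformation
    - Flag medical claims even when they are contextual or hedged

    ### 2. Professional Conduct
    - Flag personal-life questions directed at the workplace assistant

    ## Response Format

    The model should:
    1. Start with a clear "Yes" or "No" answer
    2. Explain which specific principles are relevant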

Wiring it into a combined policy

The safety analyzer takes a policy_id parameter:
{
  "name": "safety_moderation_analyzer",
  "params": {
    "model_id": "google/shieldgemma-9b",
    "policy_id": "<uuid-of-safety-policy>"
  }
}
There is no per-request safety policy override at this time; pick the right policy in your combined policy.

Permissions

| Role | Read | Create / update | Delete |
| --- | --- | --- | --- |
| Owner | Yes | Yes | Yes |
| Admin | Yes | Yes | Yes |
| Member | Yes | Yes | Yes |
| Viewer | Yes | No | No |
The relevant scope family is safety_policy:*.

Authoring tips

  • Copy the default and edit the categories that matter to you; do not start from scratch unless you know what you are doing.
  • Lead with the verdict format. ShieldGemma needs a clear instruction to answer Yes/No; if your prose buries that instruction, the model can ramble and the parser will fall back to uncertain.
  • Keep categories small and orthogonal. Five overlapping categories will fight each other; three orthogonal categories produce cleaner verdicts.
  • Test with the Analyzer Labs page. The dashboard’s Analyzer Labs page lets you feed sample text through a single analyzer with a chosen safety policy; iterate on the policy text until verdicts match expectations before promoting it.
