Safety policies

A safety policy is the prompt text the Safety & Responsible AI analyzer feeds to ShieldGemma when it acts as an LLM-as-a-judge. It tells the model which categories of content are unsafe and how to reason about edge cases. Agnes ships a sensible default; custom safety policies are how you adapt to industry-specific guidelines (medical advice safety, educational content, professional tone).

Default safety policy

The shipped default covers seven categories:
  1. No Harmful Content — violence, self-harm, harmful medical practices, dangerous or illegal activities.
  2. No Hate Speech or Discrimination — protected-class discrimination, dehumanization.
  3. No Harassment or Bullying — threats, intimidation, doxxing.
  4. No Sexual Content — explicit content, sexualization of minors, sexual exploitation.
  5. No Misinformation — harmful conspiracy theories, medical misinformation, election misinformation, public-safety misinformation.
  6. No Illegal Content — illegal activity, copyright infringement, fraud, illegal goods/services.
  7. No Privacy Violations — sharing personal information without consent, doxxing, privacy-rights violations.
The full text lives in the dashboard at agnes.lasscyber.com/protection/safety-moderation and is the default for every new tenant.
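For a sense of the markdown shape, the first category reads roughly like this (condensed from the category summary above, not the verbatim shipped text):

    ### 1. No Harmful Content
    - No violence or incitement to violence
    - No encouragement of self-harm
    - No promotion of harmful medical practices
    - No facilitation of dangerous or illegal activities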

Authoring a custom safety policy

In the dashboard:
  1. Click New safety policy.
  2. Give it a name and description.
  3. Author the policy_content markdown. Treat it as a system prompt for a judge model. The structure that works best:
    # <Safety policy name>
    
    ## Core Principles
    
    ### 1. <Category name>
    - <bulleted guideline>
    - <bulleted guideline>
    
    ### 2. <Category name>
    - <bulleted guideline>
    
    ## Content Assessment Guidelines
    
    When evaluating content, consider:
    1. The intent and context of the content
    2. The potential impact on individuals or groups
    3. Whether the content promotes or enables harmful behaviour
    
    ## Response Format
    
    The model should:
    1. Start with a clear "Yes" or "No" answer
    2. Explain which specific principles are relevant
    3. Provide a step-by-step analysis
    
  4. Optionally mark as default. The default policy is used when no policy_id is set on the safety analyzer.
  5. Save.
The analyzer constructs a per-category prompt at runtime, wrapping your policy text in the ShieldGemma chat template:
<start_of_turn>user
You are a policy expert ...

<your policy_content here>

<the input under evaluation>

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
ShieldGemma replies with a Yes/No verdict plus a confidence score. Agnes parses both and rolls the per-category scores up into max_violation_score and violation_category_count.
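Sketched in Python, that loop looks roughly like the following (the function names and the generate callable are illustrative, not Agnes internals; the "..." preamble stays elided exactly as above):

# Illustrative sketch only; not the analyzer's actual code.
SHIELDGEMMA_TEMPLATE = """<start_of_turn>user
You are a policy expert ...

{policy_content}

{input_text}

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
"""

def judge_per_category(categories, input_text, generate):
    """One ShieldGemma forward pass per category.

    categories maps category name -> that category's policy text;
    generate is whatever callable runs the model and returns
    (reply_text, confidence), e.g. ("Yes", 0.93).
    """
    results = {}
    for name, policy_content in categories.items():
        prompt = SHIELDGEMMA_TEMPLATE.format(
            policy_content=policy_content, input_text=input_text
        )
        reply, score = generate(prompt)
        results[name] = {
            "violation": reply.strip().lower().startswith("yes"),
            "score": score,
        }
    return results

def aggregate(results):
    """Roll per-category verdicts up into the two exposed fields."""
    scores = [r["score"] for r in results.values() if r["violation"]]
    # -> (max_violation_score, violation_category_count)
    return max(scores, default=0.0), len(scores)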

What “category” means here

ShieldGemma is binary per category: each category gets its own forward pass and its own Yes/No verdict. The category names you use in your policy markdown should match the output match hints exactly when you want to terminate on a specific category. The default policy uses headings that line up with the shipped match hints (Dangerous Content, Harassment, Hate Speech, Sexually Explicit). If you invent new category names, make sure your termination rules match the new strings.
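Continuing the Python sketch above, exact matching means a verdict only counts toward termination when the category string is identical to the hint (the matching code illustrates the requirement; it is not the product's implementation):

match_hint = "Dangerous Content"
should_terminate = any(
    name == match_hint and result["violation"]
    for name, result in results.items()
)
# A renamed heading such as "Dangerous and Illegal Content" would not
# equal the "Dangerous Content" hint, so it would never terminate.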

Picking a model

Larger ShieldGemma models tolerate longer policies and longer inputs:
| Model | Prompt char limit |
| --- | --- |
| google/shieldgemma-2b | 8K |
| google/shieldgemma-9b | 16K |
| google/shieldgemma-27b | 32K |
If your policy text is long (e.g. detailed industry guidelines), you may exceed the 2B model’s char limit and need to escalate to 9B.
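A quick fit check against the table, assuming the limits are plain character counts (8K = 8,000 and so on; the overhead figure for the template wrapper is a guess):

CHAR_LIMITS = {
    "google/shieldgemma-2b": 8_000,
    "google/shieldgemma-9b": 16_000,
    "google/shieldgemma-27b": 32_000,
}

def smallest_fitting_model(policy_content, input_text, overhead=500):
    """Return the smallest ShieldGemma variant the prompt fits in."""
    needed = len(policy_content) + len(input_text) + overhead
    for model, limit in CHAR_LIMITS.items():  # insertion order: 2b, 9b, 27b
        if needed <= limit:
            return model
    raise ValueError(f"{needed} chars exceeds even the 27B limit")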

Worked example: a strict policy

If the default is too permissive for your audience, a stricter custom policy (sketched after this list) might:
  • Tighten Misinformation to flag medical claims even when contextual.
  • Add a category Professional Conduct that flags personal-life questions in a workplace assistant.
  • Drop Sexual Content if your domain is fully family-friendly (the default would still catch it via Sexually Explicit).
You can also clone the default and remove categories you do not want to evaluate; fewer categories = lower cost (each category is an independent ShieldGemma forward pass).
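Put together, the policy_content for such a policy might start like this (an illustrative sketch following the template above, not shipped text):

    # Strict Workplace Policy

    ## Core Principles

    ### 1. No Misinformation
    - Flag medical claims even when they are contextual or hedged

    ### 2. Professional Conduct
    - Flag personal-life questions directed at the workplace assistant

    ## Response Format

    The model should:
    1. Start with a clear "Yes" or "No" answer
    2. Explain which specific principles are relevant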

Wiring it into a combined policy

The safety analyzer takes a policy_id parameter:
{
  "name": "safety_moderation_analyzer",
  "params": {
    "model_id": "google/shieldgemma-9b",
    "policy_id": "<uuid-of-safety-policy>"
  }
}
There is no per-request safety policy override at this time; pick the right policy in your combined policy.

Permissions

| Role | Read | Create / update | Delete |
| --- | --- | --- | --- |
| Owner | Yes | Yes | Yes |
| Admin | Yes | Yes | Yes |
| Member | Yes | Yes | Yes |
| Viewer | Yes | No | No |
The relevant scope family is safety_policy:*.

Authoring tips

  • Copy the default and edit the categories that matter to you; do not start from scratch unless you know what you are doing.
  • Lead with the verdict format. ShieldGemma needs a clear instruction to answer Yes/No; if your prose buries that instruction, the model can ramble and the parser will fall back to uncertain.
  • Keep categories small and orthogonal. Five overlapping categories will fight each other; three orthogonal categories produce cleaner verdicts.
  • Test with the Analyzer Labs page. The dashboard’s Analyzer Labs page lets you feed sample text through a single analyzer with a chosen safety policy; iterate on the policy text until verdicts match expectations before promoting it.
