
This analyzer evaluates text against safety policies for dangerous content, harassment, hate speech, and sexually explicit material. It uses Google’s ShieldGemma family of safety classifiers as an LLM-as-a-judge, served through vLLM on the internal model service.
Canonical name: safe-responsible-ai
Python:         safe_responsible_ai
TypeScript:     safeResponsibleAI
Server key:     safety_moderation_analyzer
Category:       Content Safety

What it detects

Default categories (configurable per safety policy):
  • Dangerous content — instructions for violence, self-harm, dangerous medical practice, illegal activity.
  • Harassment — bullying, intimidation, doxxing.
  • Hate speech — discrimination, slurs, dehumanizing content.
  • Sexually explicit content — explicit material, sexual content involving minors.
Additional categories shipped in the default safety policy: misinformation, privacy violations, illegal content. You can author a custom safety policy to redefine the category set; see Safety policies.
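
For a rough sense of what a policy carries, the sketch below models one as a mapping from category name to the guideline text that is interpolated into the per-category judge prompt. The structure is purely illustrative; the real schema is documented under Safety policies.

# Purely illustrative model of a safety policy: category name -> guideline
# text interpolated into the per-category judge prompt. The actual policy
# schema is defined on the Safety policies page.
custom_policy = {
    "Dangerous Content": (
        "The text shall not facilitate violence, self-harm, dangerous "
        "medical practice, or illegal activity."
    ),
    "Privacy Violations": (
        "The text shall not expose personal data such as addresses, "
        "government IDs, or private contact details."
    ),
}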

How it works

For each enabled category, Agnes constructs a per-category prompt using your chosen safety policy and the ShieldGemma chat template:
<start_of_turn>user
You are a policy expert trying to help determine whether ...
... policy guidelines ...
... text under evaluation ...

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
ShieldGemma replies with a Yes / No plus a confidence score. Agnes parses the verdict, normalizes it to a confidence number, and reports the per-category breakdown. Content is flagged as unsafe if any category scores >= 0.5.
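
Put together, the per-category loop amounts to something like the sketch below. Only the chat template and the >= 0.5 rule come from this page; the policy lookup and the model client's violation_probability call are assumptions about Agnes's internals, not its actual code.

# Illustrative sketch of the per-category evaluation loop. The prompt
# mirrors the ShieldGemma chat format shown above; the model client and
# score extraction are hypothetical.
PROMPT_TEMPLATE = """<start_of_turn>user
You are a policy expert trying to help determine whether the text below
violates the following policy.

{guidelines}

Text under evaluation:
{text}

Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
"""

def evaluate(text, policy, model):
    categories = []
    for name, guidelines in policy.items():
        prompt = PROMPT_TEMPLATE.format(guidelines=guidelines, text=text)
        score = model.violation_probability(prompt)  # hypothetical client call
        categories.append({
            "name": name,
            "score": score,
            "verdict": "violation" if score >= 0.5 else "ok",
        })
    # Content is unsafe if any category crosses the 0.5 threshold.
    return {"is_safe": all(c["score"] < 0.5 for c in categories),
            "categories": categories}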

Available models

Selected with the model_id parameter:
Model ID                 Notes
google/shieldgemma-2b    Default. Fastest. ~8K char prompt limit.
google/shieldgemma-9b    Balanced. ~16K char prompt limit.
google/shieldgemma-27b   Highest accuracy. ~32K char prompt limit. Higher cost.
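
If you route by input size, a pre-flight selection against these limits might look like the sketch below; the CHAR_LIMITS mapping mirrors the table above, while the helper itself is hypothetical.

# Hypothetical helper: pick the smallest model whose approximate prompt
# character limit (per the table above) can hold the input. Dict order
# runs smallest to largest, so the first fit wins.
CHAR_LIMITS = {
    "google/shieldgemma-2b": 8_000,
    "google/shieldgemma-9b": 16_000,
    "google/shieldgemma-27b": 32_000,
}

def pick_model(text: str) -> str:
    for model_id, limit in CHAR_LIMITS.items():
        if len(text) <= limit:
            return model_id
    raise ValueError("input exceeds every model's prompt limit")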

Parameters

Key        Type    Required  Default                Notes
model_id   select  No        google/shieldgemma-2b  Choose from the table above.
policy_id  select  No        tenant default         Reference a safety policy.
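
In context, the two parameters slot into an analyzer configuration along these lines; only model_id and policy_id are documented here, and the surrounding shape is an assumption.

# Sketch of an analyzer configuration. Only model_id and policy_id are
# documented parameters; the surrounding keys are illustrative.
analyzer_config = {
    "analyzer": "safety_moderation_analyzer",  # server key from above
    "model_id": "google/shieldgemma-9b",       # overrides the 2B default
    "policy_id": "strict-outbound",            # hypothetical policy ID
}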

Outputs and metrics

{
  "is_safe": false,
  "categories": [
    { "name": "Hate Speech", "score": 0.91, "verdict": "violation" },
    { "name": "Harassment",  "score": 0.62, "verdict": "violation" },
    { "name": "Dangerous Content", "score": 0.04, "verdict": "ok" }
  ],
  "metrics": {
    "inference_time_ms": 142.0,
    "max_violation_score": 0.91,
    "violation_category_count": 2
  },
  "status": "OK"
}
Metric                    Range    Suggested thresholds
max_violation_score       0.0–1.0  >= 0.5 (any violation), >= 0.8 (high confidence), >= 0.3 (borderline).
violation_category_count  int      > 0 (any category), >= 2 (multiple categories).
inference_time_ms         float    Total across all categories.
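
A consumer can bucket results directly against these suggested thresholds; a minimal sketch, assuming the JSON shape shown above:

# Minimal sketch: bucket a result using the suggested thresholds above.
def triage(result: dict) -> str:
    score = result["metrics"]["max_violation_score"]
    if score >= 0.8:
        return "block"   # high-confidence violation
    if score >= 0.5:
        return "flag"    # any violation
    if score >= 0.3:
        return "review"  # borderline
    return "pass"

On the sample output above, triage returns "block", since max_violation_score is 0.91.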

Termination signals

Signal                       What it matches
Boolean: is_safe             Fires when the analyzer determines the content is unsafe (is_safe == false).
Output match: category name  Fires when that category (Dangerous Content, Harassment, Hate Speech, or Sexually Explicit) is flagged.
Combine with max_violation_score for stricter control. Example: only terminate when is_safe == false AND max_violation_score >= 0.8.
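
In code, that stricter example is a simple conjunction over the result object; a sketch assuming the output shape above:

# Sketch of the stricter rule: terminate only on a high-confidence
# unsafe verdict, per the example above.
def should_terminate(result: dict) -> bool:
    return (not result["is_safe"]
            and result["metrics"]["max_violation_score"] >= 0.8)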

Limits and cost

Limit              Value
Max input tokens   100,000
Requests / minute  100 (per tenant)
Prompt char limit  8K (2B) / 16K (9B) / 32K (27B)
Cost varies by model size; the catalog lists it as “model inference cost”. Expect the 9B to run roughly 4× the cost of the 2B, and the 27B roughly 12×.
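
Since over-length prompts are rejected rather than truncated (see Failure modes below), it can be worth clipping input client-side; a minimal sketch reusing the hypothetical CHAR_LIMITS mapping from earlier:

# Clip input to the selected model's approximate character limit before
# submission; the analyzer does not truncate automatically.
def clip_for_model(text: str, model_id: str) -> str:
    limit = CHAR_LIMITS[model_id]
    return text if len(text) <= limit else text[:limit]

Note that clipping silently drops the tail, which can hide violations late in the text; rejecting over-length input outright may be the safer choice.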

Typical latency

50–200 ms depending on model and number of categories. Each enabled category runs an independent ShieldGemma forward pass.

When to use it

  • Best on outbound. This is the canonical “did my LLM produce something unsafe?” guardrail. Pair it with a strict safety policy on outbound, a permissive one on inbound.
  • Skip on highly templated outputs. If you fully control the model output (e.g. JSON schema, structured tools), the safety judge is often redundant — a YARA / regex check is enough (see the sketch after this list).
  • Pick the smallest model that meets your accuracy bar. Most teams ship on shieldgemma-2b and only escalate to 9b for explicit high-risk surfaces.
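
For the templated-output case in the second bullet, the cheaper check can be as small as a structural validation; this sketch is purely illustrative, with a hypothetical output schema.

# Illustrative structural check for fully templated output: validate the
# shape instead of running the LLM judge. Fields and pattern are hypothetical.
import json
import re

def templated_output_ok(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except ValueError:
        return False
    status = obj.get("status")
    return isinstance(status, str) and re.fullmatch(r"ok|error", status) is not None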

Failure modes

  • Model service unavailable → analyzer_unavailable 503 with Retry-After. SDKs retry automatically.
  • Prompt longer than the model’s limit → the analyzer returns an error in its result. The prompt is not truncated automatically.
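
If you call the HTTP API directly instead of through an SDK, honoring Retry-After looks roughly like the sketch below; the endpoint URL and payload shape are assumptions, and only the 503 / Retry-After behavior comes from this page.

# Rough sketch of manual retry handling for analyzer_unavailable (503),
# honoring Retry-After the way the SDKs do. Endpoint and payload are
# hypothetical.
import time
import requests

def analyze_with_retry(payload: dict, retries: int = 3) -> dict:
    for _ in range(retries):
        resp = requests.post(
            "https://api.example.com/analyzers/safety_moderation_analyzer",
            json=payload,
        )
        if resp.status_code == 503:
            time.sleep(float(resp.headers.get("Retry-After", "1")))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("analyzer unavailable after retries")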
