| Canonical name | safe-responsible-ai |
| Python | safe_responsible_ai |
| TypeScript | safeResponsibleAI |
| Server key | safety_moderation_analyzer |
| Category | Content Safety |
What it detects
Default categories (configurable per safety policy):
- Dangerous content: instructions for violence, self-harm, dangerous medical practice, illegal activity.
- Harassment: bullying, intimidation, doxxing.
- Hate speech: discrimination, slurs, dehumanizing content.
- Sexually explicit content: explicit material, sexual content involving minors.
How it works
For each enabled category, Agnes constructs a per-category prompt using your chosen safety policy and the ShieldGemma chat template. The model answers Yes / No plus a confidence score. Agnes parses the verdict, normalizes it to a confidence number, and reports the per-category breakdown. Content is flagged as unsafe if any category scores >= 0.5.
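The aggregation described above can be sketched roughly as follows. This is an illustration, not the analyzer's actual implementation: the `CategoryVerdict` type, the `violation_score` and `summarize` helpers, and the rule that a "No" with confidence c maps to a violation score of 1 - c are all assumptions. Only the 0.5 threshold and the output field names come from this page.

```python
from dataclasses import dataclass

UNSAFE_THRESHOLD = 0.5  # from the text: flagged if any category scores >= 0.5


@dataclass
class CategoryVerdict:
    """Hypothetical container for one per-category model answer."""
    category: str
    answer: str        # "Yes" (violates the policy) or "No" (does not)
    confidence: float  # model confidence in its own answer, 0.0 to 1.0


def violation_score(verdict: CategoryVerdict) -> float:
    """Assumed normalization: express every verdict as a violation probability."""
    answer = verdict.answer.strip().lower()
    if answer == "yes":
        return verdict.confidence
    if answer == "no":
        return 1.0 - verdict.confidence
    raise ValueError(f"unexpected verdict: {verdict.answer!r}")


def summarize(verdicts: list[CategoryVerdict]) -> dict:
    """Build a per-category breakdown and the overall safe/unsafe decision."""
    scores = {v.category: violation_score(v) for v in verdicts}
    flagged = [name for name, score in scores.items() if score >= UNSAFE_THRESHOLD]
    return {
        "is_safe": not flagged,
        "max_violation_score": max(scores.values(), default=0.0),
        "violation_category_count": len(flagged),
        "categories": scores,
    }
```

With one "No" at 0.9 confidence and one "Yes" at 0.75, this sketch reports a single flagged category and marks the content unsafe.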
Available models
Selected with the model_id parameter:
| Model ID | Notes |
|---|---|
| google/shieldgemma-2b | Default. Fastest. ~8K char prompt limit. |
| google/shieldgemma-9b | Balanced. ~16K char prompt limit. |
| google/shieldgemma-27b | Highest accuracy. ~32K char prompt limit. Higher cost. |
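One way to use the table above is to pick the smallest model whose prompt limit fits the input. The sketch below is a hypothetical helper, not part of any SDK. It reads "8K" as 8,000 characters, which is an assumption since the limits are approximate, and it ignores the extra characters the policy template adds to the constructed prompt.

```python
# Approximate prompt limits from the table above, smallest model first.
# Reading "8K" as 8,000 characters is an assumption.
PROMPT_CHAR_LIMITS = {
    "google/shieldgemma-2b": 8_000,
    "google/shieldgemma-9b": 16_000,
    "google/shieldgemma-27b": 32_000,
}


def smallest_model_for(prompt_chars: int) -> str:
    """Return the first (smallest) model whose limit covers the prompt length."""
    # Dicts preserve insertion order, so models are tried from 2b up to 27b.
    for model_id, limit in PROMPT_CHAR_LIMITS.items():
        if prompt_chars <= limit:
            return model_id
    raise ValueError(f"prompt of {prompt_chars} chars exceeds every model limit")
```

Leaving headroom below each limit is probably wise, because the limit applies to the full prompt rather than to the raw content alone.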
Parameters
| Key | Type | Required | Default | Notes |
|---|---|---|---|---|
| model_id | select | No | google/shieldgemma-2b | Choose from the table above. |
| policy_id | select | No | tenant default | Reference a safety policy. |
Outputs and metrics
| Metric | Range | Suggested thresholds |
|---|---|---|
| max_violation_score | 0.0–1.0 | >= 0.5 (any violation), >= 0.8 (high confidence), >= 0.3 (borderline). |
| violation_category_count | int | > 0 (any category), >= 2 (multiple categories). |
| inference_time_ms | float | Total across all categories. |
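The suggested thresholds for max_violation_score can be read as bands. A minimal sketch, assuming the bands are checked from strictest to loosest; the function and the band names are illustrative and not part of the analyzer output:

```python
def score_band(max_violation_score: float) -> str:
    """Map a score to a band using the suggested thresholds in the table."""
    if not 0.0 <= max_violation_score <= 1.0:
        raise ValueError("max_violation_score must be within 0.0 and 1.0")
    if max_violation_score >= 0.8:
        return "high_confidence"  # >= 0.8
    if max_violation_score >= 0.5:
        return "violation"        # >= 0.5
    if max_violation_score >= 0.3:
        return "borderline"       # >= 0.3
    return "clear"                # below every suggested threshold
```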
Termination signals
| Signal | What it matches |
|---|---|
| Boolean: is_safe | Fires when the analyzer determines the content is unsafe (is_safe == false). |
| Output match: Dangerous Content / Harassment / Hate Speech / Sexually Explicit | Fires when that category is flagged. |
Combine either signal with max_violation_score for stricter control. Example: only terminate when is_safe == false AND max_violation_score >= 0.8.
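The combined condition in the example can be expressed as a small predicate. This sketch assumes the analyzer result is available as a dict with is_safe and max_violation_score keys; that shape and the function name are assumptions made for illustration.

```python
def should_terminate(result: dict, min_score: float = 0.8) -> bool:
    """Terminate only on an unsafe verdict that is also high confidence."""
    return result["is_safe"] is False and result["max_violation_score"] >= min_score
```

An unsafe result scoring 0.6 would still be reported as a violation, but under this rule it would not end the session.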
Limits and cost
| Limit | Value |
|---|---|
| Max input tokens | 100,000 |
| Requests / minute | 100 (per tenant) |
| Prompt char limit | 8K (2B) / 16K (9B) / 32K (27B) |
Typical latency
50–200 ms depending on model and number of categories. Each enabled category runs an independent ShieldGemma forward pass.
When to use it
- Best on outbound. This is the canonical “did my LLM produce something unsafe?” guardrail. Pair it with a strict safety policy on outbound, a permissive one on inbound.
- Skip on highly templated outputs. If you fully control the model output (e.g. JSON schema, structured tools), the safety judge is often redundant — a YARA / regex check is enough.
- Pick the smallest model that meets your accuracy bar. Most teams ship on shieldgemma-2b and only escalate to 9b for explicit high-risk surfaces.
Failure modes
- Model service unavailable → analyzer_unavailable 503 with Retry-After. SDKs retry automatically.
- Prompt longer than the model’s limit → the analyzer returns an error in its result. The prompt is not truncated automatically.
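The SDKs handle the 503 retry for you. For callers that talk to the service directly, a minimal sketch of turning a Retry-After header value into a wait time might look like this. The helper name is hypothetical. It relies on the general HTTP rule that Retry-After holds either a number of seconds or an HTTP-date, and this page does not say which form the service sends.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay_seconds(retry_after: str, now: Optional[datetime] = None) -> float:
    """Return how many seconds to wait for a given Retry-After header value."""
    value = retry_after.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120".
        return float(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    target = parsedate_to_datetime(value)  # raises if the value is not a valid date
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller can retry immediately.
    return max(0.0, (target - now).total_seconds())
```

The `now` parameter exists only so the function can be tested deterministically.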
Next
- Safety policies — author custom category guidelines.
- Combined analyzer — wiring this analyzer into a policy.