This analyzer evaluates text against safety policies for dangerous
content, harassment, hate speech, and sexually explicit material. It
uses Google’s ShieldGemma family of safety classifiers as an
LLM-as-a-judge, served through vLLM on the internal model service.
| | |
|---|---|
| Canonical name | safe-responsible-ai |
| Python | safe_responsible_ai |
| TypeScript | safeResponsibleAI |
| Server key | safety_moderation_analyzer |
| Category | Content Safety |
What it detects
Default categories (configurable per safety policy):
- Dangerous content — instructions for violence, self-harm,
dangerous medical practice, illegal activity.
- Harassment — bullying, intimidation, doxxing.
- Hate speech — discrimination, slurs, dehumanizing content.
- Sexually explicit content — explicit material, sexual content
involving minors.
Additional categories shipped in the default safety policy: misinformation,
privacy violations, illegal content. You can author a custom safety
policy to redefine the category set; see
Safety policies.
How it works
For each enabled category, Agnes constructs a per-category prompt using
your chosen safety policy and the
ShieldGemma chat template:
```
<start_of_turn>user
You are a policy expert trying to help determine whether ...
... policy guidelines ...
... text under evaluation ...
Does the input violate the above policy?
<end_of_turn>
<start_of_turn>model
```
ShieldGemma answers Yes or No together with a confidence score. Agnes
parses the verdict, normalizes it to a score between 0.0 and 1.0, and
reports the per-category breakdown. Content is flagged as unsafe if any
category scores >= 0.5.
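A minimal sketch of that loop. The `score_yes` callable is a stand-in
for the internal vLLM call and is assumed to return the probability of
a "Yes" (violation) verdict; category names and guideline text would
come from the chosen safety policy:

```python
from typing import Callable

UNSAFE_THRESHOLD = 0.5

# Skeleton of the ShieldGemma chat template shown above; the elided
# policy guidelines and evaluated text are filled in per category.
TEMPLATE = (
    "<start_of_turn>user\n"
    "You are a policy expert trying to help determine whether ...\n"
    "{guidelines}\n"
    "{text}\n"
    "Does the input violate the above policy?\n"
    "<end_of_turn>\n"
    "<start_of_turn>model\n"
)

def evaluate(text: str, policies: dict[str, str],
             score_yes: Callable[[str], float]) -> dict:
    """One ShieldGemma judgement per enabled category, then aggregate."""
    categories = []
    for name, guidelines in policies.items():
        prompt = TEMPLATE.format(guidelines=guidelines, text=text)
        score = score_yes(prompt)  # probability of a "Yes" (violation) verdict
        categories.append({
            "name": name,
            "score": round(score, 2),
            "verdict": "violation" if score >= UNSAFE_THRESHOLD else "ok",
        })
    return {
        "is_safe": all(c["verdict"] == "ok" for c in categories),
        "categories": categories,
    }
```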
Available models
Selected with the `model_id` parameter:

| Model ID | Notes |
|---|---|
| `google/shieldgemma-2b` | Default. Fastest. ~8K char prompt limit. |
| `google/shieldgemma-9b` | Balanced. ~16K char prompt limit. |
| `google/shieldgemma-27b` | Highest accuracy. ~32K char prompt limit. Higher cost. |
Parameters
| Key | Type | Required | Default | Notes |
|---|---|---|---|---|
| `model_id` | select | No | `google/shieldgemma-2b` | Choose from the table above. |
| `policy_id` | select | No | tenant default | Reference a safety policy. |
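A hypothetical configuration sketch: only the parameter keys
(`model_id`, `policy_id`) and the server key come from this page; the
surrounding payload shape is illustrative, not the documented API.

```python
# Illustrative analyzer configuration. The "analyzer" value is the
# server key from the table at the top; the rest of the structure is
# an assumption for the sake of the example.
analyzer_config = {
    "analyzer": "safety_moderation_analyzer",
    "parameters": {
        "model_id": "google/shieldgemma-9b",  # default: google/shieldgemma-2b
        "policy_id": "strict-outbound",       # omit to use the tenant default
    },
}
```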
Outputs and metrics
```json
{
  "is_safe": false,
  "categories": [
    { "name": "Hate Speech", "score": 0.91, "verdict": "violation" },
    { "name": "Harassment", "score": 0.62, "verdict": "violation" },
    { "name": "Dangerous Content", "score": 0.04, "verdict": "ok" }
  ],
  "metrics": {
    "inference_time_ms": 142.0,
    "max_violation_score": 0.91,
    "violation_category_count": 2
  },
  "status": "OK"
}
```
| Metric | Range | Suggested thresholds |
|---|---|---|
| `max_violation_score` | 0.0–1.0 | >= 0.5 (any violation), >= 0.8 (high confidence), >= 0.3 (borderline). |
| `violation_category_count` | int | > 0 (any category), >= 2 (multiple categories). |
| `inference_time_ms` | float | Total across all categories. |
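The aggregate metrics follow directly from the per-category breakdown.
A sketch that recomputes them from the example output above and applies
the suggested thresholds:

```python
# Recompute the aggregate metrics from the per-category breakdown and
# bucket the result against the suggested thresholds from the table.
result = {
    "categories": [
        {"name": "Hate Speech", "score": 0.91, "verdict": "violation"},
        {"name": "Harassment", "score": 0.62, "verdict": "violation"},
        {"name": "Dangerous Content", "score": 0.04, "verdict": "ok"},
    ],
}

max_violation_score = max(c["score"] for c in result["categories"])
violation_category_count = sum(
    1 for c in result["categories"] if c["verdict"] == "violation"
)

if max_violation_score >= 0.8:
    severity = "high-confidence violation"
elif max_violation_score >= 0.5:
    severity = "violation"
elif max_violation_score >= 0.3:
    severity = "borderline"
else:
    severity = "clean"

print(severity, violation_category_count)  # -> high-confidence violation 2
```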
Termination signals
| Signal | What it matches |
|---|---|
| Boolean: `is_safe` | Fires when the analyzer determines the content is unsafe (`is_safe == false`). |
| Output match: Dangerous Content / Harassment / Hate Speech / Sexually Explicit | Fires when that category is flagged. |
Combine with `max_violation_score` for stricter control. Example: only
terminate when `is_safe == false` AND `max_violation_score >= 0.8`, as
sketched below.
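A minimal sketch of that stricter rule, assuming a result dict with the
shape shown under Outputs and metrics:

```python
def should_terminate(result: dict) -> bool:
    """Only stop the run on a high-confidence violation."""
    return (
        result["is_safe"] is False
        and result["metrics"]["max_violation_score"] >= 0.8
    )
```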
Limits and cost
| Limit | Value |
|---|---|
| Max input tokens | 100,000 |
| Requests / minute | 100 (per tenant) |
| Prompt char limit | 8K (2B) / 16K (9B) / 32K (27B) |
Cost varies by model size; the catalog notes “model inference cost”.
Expect the 9B to cost roughly 4× the 2B, and the 27B roughly 12× the 2B.
Typical latency
50–200 ms depending on model and number of categories. Each enabled
category runs an independent ShieldGemma forward pass.
When to use it
- Best on outbound. This is the canonical “did my LLM produce
something unsafe?” guardrail. Pair it with a strict safety policy on
outbound, a permissive one on inbound.
- Skip on highly templated outputs. If you fully control the model
output (e.g. JSON schema, structured tools), the safety judge is
often redundant — a YARA / regex check is enough.
- Pick the smallest model that meets your accuracy bar. Most teams ship
  on `shieldgemma-2b` and only escalate to `9b` for explicitly high-risk
  surfaces.
Failure modes
- Model service unavailable → `analyzer_unavailable` 503 with
  `Retry-After`. SDKs retry automatically.
- Prompt longer than the model’s limit → the analyzer returns an error
  in its result. The prompt is not truncated automatically.
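For callers not using an SDK, a minimal retry sketch that honors
`Retry-After`; the endpoint URL and payload shape here are assumptions,
not the documented API:

```python
import time

import requests  # assumes a plain HTTP caller rather than an SDK

# Placeholder endpoint: only the 503 + Retry-After behaviour comes from
# this page.
ANALYZER_URL = "https://example.internal/analyzers/safety_moderation_analyzer"

def analyze_with_retry(payload: dict, max_attempts: int = 3) -> dict:
    """Retry on analyzer_unavailable (503), honoring Retry-After."""
    for _ in range(max_attempts):
        resp = requests.post(ANALYZER_URL, json=payload, timeout=30)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # 503 analyzer_unavailable: wait as instructed, then retry.
        time.sleep(float(resp.headers.get("Retry-After", 1)))
    raise RuntimeError("analyzer still unavailable after retries")
```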