Sansa Benchmark Finds GPT-5.4 Highly Censored; Gemini 3.1 Pro Leads

If your product lives or dies by whether an LLM says “I can’t help with that,” the latest Sansa benchmark is worth a close look. At AI Tech Inspire, we spotted fresh numbers that cut through vibes and give developers a practical compass for choosing and routing models—especially when refusal rates and safe-completion behavior can make or break workflows.

Summary at a Glance (from Sansa’s post)

Sansa provides an LLM routing product and runs a broad benchmark across categories like math, reasoning, coding, logic, physics, safety compliance, censorship resistance, and hallucination detection.
New releases from major labs were tested; results, methodology, and examples are posted on Sansa’s site. The dataset is not open source yet, but Sansa says it will be released when the current question set is rotated.
According to Sansa, GPT-5.2 scored lowest (most censored) among frontier reasoning models on censorship resistance when it launched.
GPT-5.4 is reported to be only slightly better, with a censorship resistance score of 0.417, still far below Gemini 3 Pro in Sansa’s tests.
The new Gemini 3.1 models scored below Gemini 3 on censorship resistance, suggesting some convergence across labs toward more moderate behavior.
Claude Sonnet 4.5/4.6 (non-reasoning variants) tend to hedge more than their reasoning-enabled counterparts.
Headline takeaways (per Sansa): Gemini 3.1 Flash Lite is strong and relatively inexpensive vs. GPT‑5.4; Gemini 3.1 Pro performs best overall; Kimi 2.5 ranks as the best open-source model they tested; GPT remains highly censored.
Leaderboard, examples, and methodology: trysansa.com/benchmark.

Why this benchmark matters for builders

Most teams don’t ship on a single model anymore. They route. A coding assistant might rely on a reasoning model for chain-of-thought tasks but fall back to a faster lightweight model for boilerplate generation. A safety-critical workflow may require conservative guardrails for certain intents, but not for harmless security research or mature-topic discussions. The wrinkle: different providers make different trade-offs between refusal behavior and helpfulness under guardrails. Measuring censorship resistance—how often a model over-refuses or hedges when a response should be permissible—becomes key.

According to Sansa’s results, GPT-5.4 still leans conservative relative to peers, scoring 0.417 on their censorship axis. Meanwhile, Gemini 3.1 Pro reportedly leads overall, and Gemini 3.1 Flash Lite is flagged as a high-value option for cost-sensitive use. For teams standardizing on a routing layer, that combination—performance leadership with a low-cost runner-up—creates interesting budget levers.

Key takeaway: Align model temperament with intent. Over-censorship hurts analyst tools, research copilots, and robust debate features. Under-censorship amplifies risk. Routing is how you calibrate, not just how you cut cost.

How to interpret “censorship resistance” without the spin

Censorship resistance (in Sansa’s framing) is not about endorsing unsafe outputs. It’s about unnecessary refusals—for instance, refusing a neutral historical analysis, declining to discuss open-source cybersecurity topics, or heavily hedging on benign scientific content. For developers, this affects:

Knowledge tools that need to answer sensitive-but-legit queries (e.g., geopolitical history, adult health education).
Security engineering contexts where discussing SQLi or XSS mitigation requires naming the attack class clearly.
R&D copilots that must navigate bio/chem terms responsibly without collapsing into generic disclaimers.

On the flip side, higher refusal rates can be a feature for consumer chat apps with broad audiences and lower tolerance for edge-case risk. That’s why the same score can be a red or green flag depending on your product surface.

What’s surprising in Sansa’s results

GPT-5.4 reportedly remains among the most censored major models tested—even as other vendors adjust.
Gemini 3.1 variants score below Gemini 3 on censorship resistance, suggesting some providers are tightening or centering safety policies.
Claude Sonnet non-reasoning variants appear more hedged than reasoning-enabled versions, hinting that integrated reasoning may help produce policy-compliant yet useful answers.
Kimi 2.5 ranks as the strongest open-source option in their sample—interesting for teams standardizing on self-hosted stacks via Hugging Face and GPU-accelerated inference with CUDA.

Note: Sansa plans to release its dataset after rotating out the current question set. Until then, treat these numbers as a directional guide and review their posted methodology and examples closely.

Routing playbook: mapping model temperament to tasks

Based on the claims in Sansa’s leaderboard, here’s a pragmatic routing sketch you can adapt:

Cost-efficient defaults: use Gemini 3.1 Flash Lite for routine turns (summaries, CRUD coding, context extraction), reserve heavier models for reasoning spikes.
High-accuracy reasoning: route complex chains to Gemini 3.1 Pro or your top reasoning pick. Track latency and token cost in logs.
Conservative surfaces: prefer models with higher refusal propensity (e.g., gpt‑5.x family per Sansa’s test) on consumer chat UIs where safety margin > recall.
Exploratory or expert tools: favor models that maintain compliance but avoid unnecessary hedging—Sansa’s censorship resistance scores can be a proxy for this behavior.
Open-source lanes: if privacy or on-prem is non-negotiable, experiment with Kimi 2.5 and strong PyTorch– or TensorFlow-based inference stacks. Validate with your own policy tests before shipping.

Instrumentation tip: add a lightweight policy tagger in your gateway. For each request, log intent, jurisdiction, model, and refusal_flag. This lets you quantify where over-refusals bite and whether a different route would have helped.

Hands-on experiments to run this week

Refusal audit: compile 50 benign-but-sensitive prompts across domains (health education, civics, cybersecurity mitigation). Send them to your current model and one alternative noted by Sansa. Compare refusal rate and usefulness.
Latency-cost curve: for your top three candidates, chart median latency vs. token price vs. acceptance rate. This exposes “false economies” where a cheaper model costs more in re-prompts.
Policy-tuned prompts: test a strict system policy that clarifies allowed vs. disallowed content. Some models respond better to explicit allowances (e.g., “You may discuss cybersecurity terms for defensive purposes”).
Reasoning toggle: where available, compare reasoning-enabled vs. standard variants. Sansa’s note on Claude Sonnet suggests the toggle can shift hedging behavior.

Pro tip: set up a small internal harness so product folks can press Cmd+Enter to re-run a prompt suite against all routes before release. Treat it like unit tests for refusal and policy clarity.

Context: how this fits into the modern AI stack

Routing sits alongside the usual suspects—vector retrieval, prompt engineering, and model evals. If your team uses GPT APIs for general chat, Hugging Face for local models, and image tools like Stable Diffusion for generation, you’re already juggling different safety postures. A structured benchmark that spotlights refusal dynamics helps you choose the right tool per lane and avoid accidental regressions when swapping models.

Think of it like moving from a single monolithic model to a portfolio—each with its own cost, speed, temperament, and compliance profile. That’s where Sansa’s positioning as an LLM routing provider and benchmark publisher aims to help: measurements that translate into routing rules.

Caveats and due diligence

Dataset availability: Sansa’s dataset isn’t public yet. Treat the numbers as indicative and review their posted examples and methodology closely.
Definition drift: different orgs define “censorship resistance” differently. Align on your own policy taxonomy before adopting any external score wholesale.
Regional compliance: your acceptable content envelope may differ by market. Bake jurisdiction into your routes.
Human oversight: automated scores don’t absolve you of red-teaming. Keep a human-in-the-loop for edge cases.

What to watch next

If Sansa rotates the question set and open-sources prior data, independent replication will get easier. It would also be valuable to see per-domain breakdowns (e.g., health vs. civics vs. cybersecurity) and how score movements correlate with provider policy updates.

For now, the headline stands: per Sansa’s results, GPT-5.4 remains on the more conservative end, Gemini 3.1 Pro leads overall, Gemini 3.1 Flash Lite looks like a cost-savvy workhorse, and Kimi 2.5 is a strong open-source pick. If your app depends on fewer unnecessary refusals, these distinctions are not academic—they’re roadmap material.

“Your model isn’t over- or under-censored in general. It’s misaligned for this surface.” Build routes that respect your users, your policy, and your latency budget—then let the metrics pick the winner.

Full leaderboard and methodology: trysansa.com/benchmark. If you’re testing, keep your prompts, refusals, and remediation examples versioned—treat them like any other spec. That’s how teams turn model churn into reliable product, and it’s exactly the mindset AI Tech Inspire looks for when evaluating benchmarks like this.

Recommended Resources

As an Amazon Associate, I earn from qualifying purchases.

The Hundred-Page LLMs Book (PyTorch)

Hands-on LLMs.

Fiverr Image Editing

Get the perfect logo.