Every developer has bumped into it: two models answer the same prompt with the same facts—but the vibe feels different. At AI Tech Inspire, we spotted a study that treats that vibe as something measurable. By probing the hidden states of several 7B–9B open-weight models, the researcher reports consistent, reproducible “behavioral fingerprints” across seven style axes—and a clear pattern: instruct fine-tuning tends to constrain steerability on some of those axes.


Key facts, stats, and claims

  • “Personality” here means a stable response style, not human-like inner states.
  • A contrastive probing method extracts seven axes from hidden states: warm/cold, verbose/concise, confident/cautious, proactive/reluctant, empathetic/analytical, formal/casual, patient/irritated. IQR normalization enables cross-model comparison.
  • Reliability and reproducibility: test-retest ICC exceeds 0.75 for all 42 model–axis pairs (0.91–0.99 typical); cross-provider delta is under 0.05; axes stabilize around n ≈ 15 calibration questions.
  • Dead zones are axes where models fail to follow style instructions across multiple prompt phrasings; a composite severity score combines calibration accuracy, d′, stability cosine, and baseline SNR (30/30/20/20 weighting).
  • Distinct fingerprints across six models: Llama 3.1 8B Instruct is most constrained (benchmark pass rate 60%); DeepSeek LLM 7B Chat shows the highest effective dimensionality (3.66/7) but weaker cross-calibration stability; Gemma 2 9B IT’s top PCA component captures 87.9% of variance, likely driven by response length.
  • Instruct variants across five organizations show lower behavioral variability than base models; e.g., verbose/concise variability is 87% lower in the instruct model for one pair. However, some distinctions (e.g., empathetic/analytical) appear more separable in instruct models than base.
  • Length confound: six of seven axes are clean (mean |r| with token count under 0.3). Verbose/concise is partially confounded (mean r = 0.50). Cross-axis correlations drop only ~7.7% after regressing out length.
  • External validation with an independent judge yields pooled Spearman r = 0.38 [0.29, 0.47]. Warm/cold and formal/casual correlate robustly; proactive/reluctant shows negative correlation driven by model-specific dead zones and ceiling effects; verbose/concise shows no correlation with qualitative verbosity ratings.
  • Reproducibility holds across hardware/providers (RTX 4090 vs RTX 3090).
  • Models: Qwen 2.5 7B Instruct, Mistral 7B v0.3 Instruct, DeepSeek LLM 7B Chat, Llama 3.1 8B Instruct, Yi 1.5 9B Chat, Gemma 2 9B IT; decoding with temp = 0.7, top_p = 0.9; 210 calibration + 70 eval + 30 baseline questions.
  • Follow-ups: Phi-4 (14B) looks cold, cautious, and reluctant with a “conservative” prior (dead zone on verbosity, benchmark 3/9). Qwen3-8B vs. Qwen 2.5 7B shows axis flips (confidence and formality), with Qwen3 posting the highest benchmark pass rate (7/9). Thinking mode on Qwen3-8B reduces proactivity and verbosity but increases confidence when evaluating the final visible response (taking care to exclude hidden <think> tokens).
  • Limitations: AI-generated question set; partial external validation; single chat template/decoding; coverage primarily 7B–9B; axes are behaviorally correlated; heuristic weights for dead-zone severity.
  • Code and precomputed axes: github.com/yunoshev/mood-axis; measure fingerprints on new models without recalibration.

Why this matters for engineers

Most of us rely on system prompts, guardrails, or fine-tuning to shape how a GPT-style model responds. This work suggests models already carry a measurable default style—even with no system prompt. More importantly, it finds that instruct fine-tuning tends to compress behavioral variability on certain axes. That can be great for consistency and safety, but it also reduces steerability when you do want style control.

Key takeaway: alignment doesn’t just constrain what models say—it can also constrain how they say it, in ways that are stable, measurable, and sometimes hard to override.

For teams building assistants, copilots, or RAG pipelines, this has practical implications. If your product depends on dialable tone (empathetic vs. analytical), initiative (proactive vs. reluctant), or register (formal vs. casual), your choice of base vs. instruct—and even specific model family/generation—may set hard limits on how far prompting can push style.

How the probing works

The method uses contrastive prompts (“be warm” vs. “be cold”) on neutral questions, then computes direction vectors in hidden-state space from the last four layers of assistant tokens. Axis scores are IQR-normalized for cross-model comparisons. That makes it possible to project any response onto seven interpretable axes and compare profiles across models from different providers—something the community rarely measures directly.
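
To make the mechanics concrete, here is a minimal sketch of the contrastive idea, assuming a standard transformers setup: generate paired responses under opposing style instructions, average the hidden states of the generated (assistant) tokens over the last four layers, and take the difference as the axis direction. The model ID, prompts, and generation settings are illustrative; this is not the repo’s actual pipeline, and the IQR normalization step is omitted:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"   # illustrative; any chat model with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def style_embedding(style_instruction, question):
    # Generate a response, then average the hidden states of the generated
    # (assistant) tokens over the last four layers.
    messages = [{"role": "system", "content": style_instruction},
                {"role": "user", "content": question}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                        return_tensors="pt").to(model.device)
    gen = model.generate(input_ids, max_new_tokens=128, do_sample=True,
                         temperature=0.7, top_p=0.9)
    with torch.no_grad():
        hidden = model(gen, output_hidden_states=True).hidden_states
    last4 = torch.stack(hidden[-4:]).mean(dim=0)[0]   # [seq, dim], averaged over last 4 layers
    assistant_part = last4[input_ids.shape[1]:]       # keep only the newly generated tokens
    return assistant_part.mean(dim=0).float().cpu()

question = "Explain how a hash map works."
warm = style_embedding("Answer in a warm, friendly tone.", question)
cold = style_embedding("Answer in a cold, detached tone.", question)
axis = warm - cold
axis = axis / axis.norm()                             # unit warm/cold direction

# Project any new response onto the axis; the study IQR-normalizes scores over
# a calibration set before comparing models (normalization omitted here).
score = style_embedding("Answer naturally.", question) @ axis
print(f"warm/cold projection: {score.item():.3f}")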

Reliability checks are thorough: test–retest ICCs are high (0.91–0.99 typical), axes stabilize with about 15 calibration items, and results reproduce across GPUs and providers (RTX 4090 vs. RTX 3090). Length effects are tested and mostly ruled out, aside from the expected tie between verbosity and token count.
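
If you want to sanity-check the n ≈ 15 stabilization claim on your own model, one simple check is to watch how the estimated axis direction converges as calibration items accumulate. A toy sketch with placeholder vectors (real contrast vectors would come from the probing step above):

import numpy as np

rng = np.random.default_rng(0)
contrasts = rng.normal(size=(40, 4096)) + 0.5   # placeholder (warm minus cold) vectors, one per question

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

full_axis = contrasts.mean(axis=0)
for n in (5, 10, 15, 20, 30):
    # Similarity of the n-item axis estimate to the full-sample direction;
    # a plateau indicates the axis has stabilized.
    print(n, round(cosine(contrasts[:n].mean(axis=0), full_axis), 3))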

Dead zones: when style instructions don’t take

“Dead zones” are axes where models fail to reliably follow style instructions even with varied prompt phrasings. The study categorizes them as hard (suppressed differentiation), soft (unstable across calibrations), or asymmetric (follows one pole but not the other). A composite severity metric blends d′, calibration accuracy, stability cosine, and SNR. External ratings show that dead zones can create divergences between internal projections and visible text—one reason proactive/reluctant correlates negatively overall.
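
As a rough illustration of how such a composite could be computed, here is a sketch using the study’s 30/30/20/20 weighting. It assumes each component has already been normalized to [0, 1] with higher meaning healthier, which the write-up does not spell out:

def dead_zone_severity(calibration_acc, d_prime, stability_cos, snr):
    """Weighted shortfall across the four (pre-normalized) components."""
    weights = {"acc": 0.30, "dprime": 0.30, "stability": 0.20, "snr": 0.20}
    components = {"acc": calibration_acc, "dprime": d_prime,
                  "stability": stability_cos, "snr": snr}
    return sum(w * (1.0 - components[k]) for k, w in weights.items())

print(dead_zone_severity(0.55, 0.30, 0.60, 0.40))  # higher output = more severe dead zone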

For day-to-day engineering, dead zones mean you may hit a ceiling trying to coerce tone with prompts alone. If your assistant stays “reluctant” despite careful instruction, it might not be a prompt-craft issue—it could be baked into the fine-tuned representation.

Model fingerprints and instruct compression

  • Llama 3.1 8B Instruct: Flattest profile; lowest benchmark pass rate (60%).
  • DeepSeek LLM 7B Chat: Highest effective dimensionality (3.66/7), but axes shift across calibration sets.
  • Gemma 2 9B IT: Dominated by PC1 (87.9%), likely reflecting response length more than behavior.
  • Across five orgs: Instruct variants consistently show lower variability than base models; verbose/concise variability drops sharply after instruct tuning (87% lower in one base/instruct pair). Some axes (e.g., empathetic/analytical) can be more discriminable after instruct tuning, suggesting fine-tuning doesn’t just suppress variation; it can also introduce or sharpen certain distinctions.

Follow-ups add color: Phi-4 (14B) looks strongly cautious and reluctant with a “conservative” prior, while Qwen3-8B flips two axes relative to Qwen 2.5 7B and scores best on benchmarks (7/9). Thinking mode on Qwen3-8B makes the model less verbose and proactive but more confident, as long as you evaluate the final visible response (not the <think> segment).

Validation and caveats

External validation uses an independent judge to rate text on seven axes. The pooled correlation with hidden-state projections is 0.38 with a bootstrap 95% CI of [0.29, 0.47]; warm/cold and formal/casual are robust. Verbose/concise diverges—qualitative ratings don’t track hidden-state measures tied to token count. The study also flags known limitations: an AI-generated question set, a single decoding regime, and heuristic weights in the dead-zone score. Still, test–retest reliability and hardware reproducibility make a strong case that these are real, stable effects.
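
For teams running their own external validation, the pooled-correlation-plus-bootstrap pattern is easy to reproduce. A sketch with placeholder data (the study’s pooling across axes and models is not replicated here):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
projections = rng.normal(size=200)                     # stand-in for hidden-state projections
ratings = 0.4 * projections + rng.normal(size=200)     # stand-in for judge ratings

rho, _ = spearmanr(projections, ratings)
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(ratings), len(ratings))  # resample pairs with replacement
    r, _ = spearmanr(projections[idx], ratings[idx])
    boot.append(r)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Spearman r = {rho:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")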

How to try it

The repo includes precomputed axes so you can project responses from your own model without recalibration:

git clone https://github.com/yunoshev/mood-axis.git
cd mood-axis && pip install -r requirements.txt
python scripts/run_app.py --model Qwen/Qwen2.5-7B-Instruct

Because the method works at the hidden-state level, it fits naturally into Hugging Face pipelines and typical PyTorch workflows, and the ideas transfer to other frameworks. A single consumer GPU is enough; the study’s runs reproduced across RTX 4090 and RTX 3090 cards. The approach targets open-weight 7B–9B models, but the follow-up shows it scales to at least one 14B model too.

What you can do with it

  • Model selection and regression tests: Track behavioral fingerprints across model updates (e.g., Qwen 2.5 → Qwen3 axis flips) to avoid silent UX shifts; a minimal drift-check sketch follows this list.
  • Prompt and template evaluation: Detect dead zones before you commit to a prompt strategy that can’t move the needle.
  • Fine-tuning diagnostics: Quantify whether your SFT/RLHF run compresses or expands the stylistic space you care about.
  • RAG and agent design: Pick models whose default proactivity and confidence match your agent’s role; use projections to verify that guardrails nudge style in the intended direction.
  • Safety and tone audits: Verify that “cautious” or “formal” tendencies are present—and steerable—before deployment.
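
For the regression-test use case, here is a tiny sketch of what fingerprint drift checking could look like. The axis names, scores, and tolerance are placeholders rather than the repo’s format:

AXES = ["warm_cold", "verbose_concise", "confident_cautious", "proactive_reluctant",
        "empathetic_analytical", "formal_casual", "patient_irritated"]

def drifted_axes(baseline, candidate, tolerance=0.5):
    """Return the axes whose IQR-normalized score moved by more than `tolerance`."""
    return [a for a in AXES if abs(candidate[a] - baseline[a]) > tolerance]

baseline  = dict(zip(AXES, [0.2, -0.1, 0.6, 0.3, -0.4, 0.1, 0.0]))   # previous release
candidate = dict(zip(AXES, [0.1, -0.2, -0.3, 0.4, -0.5, 0.8, 0.1]))  # new checkpoint

for axis in drifted_axes(baseline, candidate):
    print(f"Style drift on {axis}: {baseline[axis]:+.2f} -> {candidate[axis]:+.2f}")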

Open questions worth exploring

  • Is the gap between geometrically near-orthogonal axes and behaviorally correlated projections evidence of alignment-induced compression—or are some axes just semantically bundled in natural language use?
  • What would stronger external validation look like for axes such as confident/cautious or empathetic/analytical—expert raters, crowdsourcing, or downstream task performance?
  • How should a dead-zone severity score weight accuracy, d′, stability, and SNR in a more principled way—e.g., via Bayesian modeling or multi-criteria decision analysis?
  • Should verbose/concise be treated as a tautology with length—or reframed to separate verbosity from information density?

The punchline for practitioners: if your product depends on style control, you’ll want to measure it—not assume prompts will do the job. Behavior lives in the representation space, and this probe surfaces it in a way that’s testable, comparable, and useful. As the community iterates on open models and fine-tuning pipelines, tools like this can help answer a deceptively simple question: “Will this model sound the way we need—and can we change it when we have to?”

Grab the code, project your favorite checkpoints, and see whether your go-to model is flexible—or living in a dead zone. And if you’re benchmarking alongside Stable Diffusion-style experiments on controllable style, you might find intriguing parallels in how alignment shapes both text and image generators.
