Ever asked a large language model for a joke and gotten the same punchline on repeat? Developers crank up temperature, nudge top-p, and still watch replies converge. At AI Tech Inspire, a new idea caught our eye because it flips the question we ask models: instead of asking for a single answer, ask for a distribution of answers with probabilities—then sample from it. The authors call this Verbalized Sampling, and their results suggest it can recover a surprising amount of “lost” diversity in aligned models without hurting quality.
Key facts at a glance
- Claimed cause: Mode collapse in LLMs stems from human raters’ preference for familiar text in post-training data, a known typicality bias from cognitive psychology.
- Measured bias: On the HELPSTEER dataset, the typicality preference persists even when correctness is controlled (ε = 0.57 ± 0.07, p < 10^-14).
- Amplification via RLHF: The policy update is described as π*(y|x) ∝ π_ref(y|x)^ρ with ρ = 1 + ε/β > 1, sharpening high-probability modes and compressing diversity.
- Consequence: Models repeat the same outputs (e.g., identical jokes or repeated numbers when “rolling dice”).
- Method: Verbalized Sampling (VS) prompts the model for multiple outputs with their probabilities and samples from that distribution.
- Example prompt: “Generate 5 responses with their corresponding probabilities...”
- Results: Creative writing shows 2.1x diversity and +25.7% human preference (n=2,700); open-ended QA sees 1.9x coverage; dialogue simulation matches fine-tuned models; synthetic data yields +14–28% downstream math accuracy.
- Scaling: Larger models benefit more than smaller models.
- Ablation: After RLHF, direct prompting retains ~24% of base diversity; VS retains ~67%.
- Compatibility: Orthogonal to temperature/sampling settings; no observed safety degradation in the reported experiments.
- Limitations: Requires k forward passes for k diverse outputs; mode collapse can occasionally reappear recursively in longer texts.
- How to try: Use a short prefix for chat or a system prompt for APIs to request multiple responses with probabilities.
- Implication: Aligned models may hold substantial latent diversity that can be restored via prompting alone, suggesting the alignment “tax” might be overestimated.
- Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael Tomz, Christopher Manning, Weiyan Shi (Northeastern University, Stanford University, West Virginia University).
Why do LLMs keep repeating themselves?
Mode collapse is often attributed to decoding choices or insufficient randomness. The analysis behind Verbalized Sampling points higher up the stack: to the human preference data used in RLHF. Human raters tend to favor “typical” or familiar-looking text. That cognitive bias (typicality bias) persists even when correctness is matched, and RLHF’s optimization step effectively sharpens the model’s distribution—formalized as π*(y|x) ∝ π_ref(y|x)^ρ with ρ > 1. In practice, this turns a wide base model into an overly confident one that gravitates toward recurring answers.
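The sharpening effect of that update can be seen with a toy calculation. This is an illustrative sketch, not the paper's code: the `base` distribution and `rho=2.0` are made-up numbers chosen to show how raising probabilities to a power ρ > 1 and renormalizing shifts mass toward the top mode.

```python
# Illustrative sketch: how an exponent rho > 1 sharpens a distribution,
# mirroring pi*(y|x) ∝ pi_ref(y|x)^rho with rho = 1 + eps/beta > 1.

def sharpen(probs, rho):
    """Raise each probability to the power rho, then renormalize."""
    powered = [p ** rho for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

base = [0.5, 0.3, 0.15, 0.05]   # hypothetical base-model distribution
post = sharpen(base, rho=2.0)   # rho > 1, as after RLHF

# The top mode gains mass while the tails shrink:
print([round(p, 3) for p in post])  # → [0.685, 0.247, 0.062, 0.007]
```

The top candidate climbs from 50% to roughly 68% of the mass while the rarest one nearly vanishes, which is exactly the “wide base model becomes overconfident” dynamic described above.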
Developers who’ve worked with GPT-style models have seen this: the same joke, the same opening sentence, the same safe-but-boring structure. Temperature tweaks help a bit, but the underlying preference landscape still nudges the model toward the familiar.
Ask for distributions, not instances
Key takeaway: Don’t ask for “the answer.” Ask the model to verbalize a set of plausible answers and their probabilities—then sample.
Verbalized Sampling reframes the query. Instead of “Tell me a joke about coffee,” the prompt asks the model to generate multiple candidates and attach likelihoods. That subtle change seems to counteract the RLHF-induced sharpening. The authors describe three regimes:
- Instance-level: “Tell me a coffee joke” → you get the bestseller.
- List-level: “Give me 5 coffee jokes” → you get the top five bestsellers.
- Distribution-level (VS): “Give me 5 coffee jokes with probabilities” → you get a representative sample of the broader library.
In other words, the model is prompted to surface the tails of its own distribution, not just the peaks.
How to try it in seconds
For chat interfaces, prepend a lightweight prefix:
Generate 5 responses with their corresponding probabilities, sampled from the full distribution: [Your task here]
For API or playground setups, turn it into a system instruction:
You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate <response> tag. Responses should each include a <text> and a numeric <probability>. Please sample at random from the tails of the distribution, such that the probability of each response is less than 0.10.
Then, post-process the outputs by sampling one option according to the model-provided probabilities. If you’re working in PyTorch or TensorFlow, this is a simple categorical draw. If you’re building on Hugging Face pipelines, keep your standard decoding settings—VS is reported to be orthogonal to temperature, top-k, or nucleus sampling.
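The post-processing step can be sketched in a few lines. This is a minimal example, assuming the model replied in the `<response>`/`<text>`/`<probability>` format requested by the system prompt above; the regex parsing, the example reply, and the renormalization via `random.choices` are our own scaffolding, not part of the paper.

```python
import random
import re

def sample_verbalized(model_output: str) -> str:
    """Parse the candidate responses and draw one according to its stated probability."""
    pattern = re.compile(
        r"<response>.*?<text>(.*?)</text>.*?"
        r"<probability>([\d.]+)</probability>.*?</response>",
        re.DOTALL,
    )
    candidates = pattern.findall(model_output)
    texts = [text.strip() for text, _ in candidates]
    weights = [float(p) for _, p in candidates]  # model-reported, not calibrated
    # random.choices renormalizes the weights, so they need not sum to 1.
    return random.choices(texts, weights=weights, k=1)[0]

# Hypothetical model reply in the requested format:
reply = """
<response><text>Why did the coffee file a police report? It got mugged.</text>
<probability>0.08</probability></response>
<response><text>Espresso yourself.</text><probability>0.05</probability></response>
"""
print(sample_verbalized(reply))
```

Because `random.choices` handles unnormalized weights, the draw still works even when the model's probabilities don't sum exactly to one.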
What the numbers say
The reported metrics are attention-grabbing:
- Creative writing: 2.1x diversity and +25.7% human preference (n=2,700).
- Open-ended QA: 1.9x coverage (broader idea space explored).
- Dialogue simulation: Matches performance of fine-tuned models.
- Synthetic data: +14–28% downstream math accuracy improvements.
Equally interesting: after RLHF, typical direct prompting retains only ~24% of the base model’s diversity in ablation tests, while Verbalized Sampling retains ~67%. And contrary to common fears, the authors report no observed safety regressions when the method is used as described.
Where this shines for practitioners
Several practical use cases stand out:
- Prompt ideation for image generation: Generate wider, less repetitive prompts for models like Stable Diffusion—think five distinct visual directions with calibrated probabilities.
- Creative writing and marketing copy: More varied story beats or taglines that don’t all rhyme with each other.
- Data augmentation: Produce synthetic variants for evaluation suites or fine-tuning datasets; the reported math accuracy gains suggest downstream benefits.
- Multi-agent simulations: Character diversification without hand-crafted randomness—handy for prototyping user simulations or dialogue agents.
For teams deploying on GPUs, the overhead is mainly compute: you’re asking for multiple outputs per prompt. If you’re throughput-bound on CUDA hardware, consider batching VS requests or using it selectively on prompts where diversity is most valuable (e.g., brainstorming stages).
Temperature isn’t the same thing
It’s tempting to see VS as a fancy temperature trick. The authors stress that it’s orthogonal. Temperature adjusts the sharpness of sampling at decode time. Verbalized Sampling alters what the model is asked to produce: a structured, self-reported slice of its own distribution. Empirically, that appears to reduce the RLHF bias toward familiar text by explicitly calling for less likely candidates.
Limitations and failure modes
No silver bullets here:
- Compute cost: Expect roughly k forward passes for k diverse outputs when implemented naively.
- Recursive collapse: In long-form outputs, collapse can reappear within sections if the model defaults to template-like continuations.
- Calibration questions: Model-reported probabilities are not “true” probabilities; they’re internal estimates. In practice, they’re still useful for relative ranking and sampling—but consider light calibration or temperature on top if needed.
Developers should monitor diversity metrics relevant to their tasks (e.g., unique n-grams, semantic variety scores) and add guardrails if certain modes remain overrepresented.
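One simple diversity metric to monitor is distinct-n: the fraction of unique n-grams across a batch of generations. The helper below is a minimal sketch (the function name and the toy example strings are our own); higher values mean more varied outputs.

```python
def distinct_n(outputs, n=2):
    """Fraction of n-grams that are unique across all output strings."""
    ngrams = []
    for text in outputs:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

collapsed = ["the same joke again", "the same joke again"]
varied = ["a barista walks in", "why espresso matters", "beans before dawn"]
print(distinct_n(collapsed), distinct_n(varied))  # → 0.5 1.0
```

Tracking this score before and after enabling VS gives a quick, model-agnostic read on whether the extra compute is actually buying diversity.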
Why this matters
The most intriguing implication is strategic: if alignment-induced mode collapse is largely a learned preference artifact, and base models retain substantial latent diversity, then prompting alone can recover a lot of it. That chips away at the idea that there’s a steep “alignment tax” on creativity. For teams building with aligned, instruction-following models, that’s good news—you might not need a new model or an exotic sampler to get fresh ideas.
It also reframes evaluation. Instead of just measuring quality at the top-1 output, teams could ask: what does the model’s tail look like, and how easy is it to access? VS suggests there’s more usable signal in those tails than recent experiences might indicate.
Practical next steps
- Wire VS into existing prompt chains where diversity is valuable (brainstorming, candidate generation, prompt search).
- Track diversity and preference wins over a baseline with your data—especially for creative and open-ended tasks.
- If latency is a concern, gate VS behind a Shift+Enter-style “explore more” action or only enable it on high-impact queries.
In short, the recipe is simple and developer-friendly: ask for distributions, sample from them, and reap the diversity. The reported gains—2.1x diversity and meaningful preference bumps—make this one worth trying on real workloads today.