Researchers find that adding one simple sentence to a prompt makes AI models more creative

by SkillAiNest


One of the best things about generative AI models—both large language models (LLMs) and diffusion-based image generators—is that they are "non-deterministic." That is, despite their reputation among some critics as "fancy autocomplete," generative AI models actually produce their outputs by sampling from a probability distribution over likely next tokens (units of information) to build their responses.

Asking an LLM "What is the capital of France?" will have it sample from its probability distribution over France, capitals, cities, and so on to arrive at the answer "Paris." But that answer could arrive in the form "The capital of France is Paris," or simply "Paris," or "Paris, though it was Versailles at one point."

Still, those of us who use these models daily will note that their responses can feel annoyingly repetitive or similar. The same coffee joke gets recycled across generations of queries. Story prompts produce similar arcs. Even tasks with many valid answers—such as naming US states—tend to collapse to the same few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise powerful models.

Especially when using LLMs to produce new creative work in writing, communications, strategy, or illustration, we really want their outputs to be even more varied than they already are.

Now a team of researchers at Northeastern University, Stanford University, and West Virginia University has come up with an easy-to-use approach for getting language and image models to generate a wider range of responses to nearly any user prompt. The trick is adding a single, simple sentence: "Generate 5 responses with their corresponding probabilities, sampled from the full distribution."

The method, called Verbalized Sampling (VS), helps models such as GPT-4, Claude, and Gemini produce more diverse and human-like outputs—without retraining or access to internal parameters. It is described in a paper published on the open-access preprint server arxiv.org in early October 2025.

When prompted this way, the model no longer defaults to its safest, most likely output. Instead, it verbalizes its internal distribution over possible completions and samples across a wider spectrum of possibilities. This one-line change leads to substantial gains in output diversity across multiple domains.
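To make the one-line change concrete, here is a minimal sketch of what it looks like in practice, using the OpenAI Python client as an example backend. The model name and the base task are illustrative assumptions; any chat-style LLM API works the same way:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Standard prompt: tends to return the single most likely ("mode") response.
direct_prompt = "Tell me a joke about coffee."

# Verbalized Sampling: the same request, plus the one-line VS instruction.
vs_prompt = (
    "Tell me a joke about coffee. "
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative; the paper also tests Claude and Gemini
    messages=[{"role": "user", "content": vs_prompt}],
)
print(response.choices[0].message.content)
```

The only difference between the two prompts is the appended sentence; everything else—model, decoding settings, API—stays untouched.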

As Weiyan Shi, assistant professor at Northeastern University and co-author of the paper, wrote on X: "The potential of LLMs is not yet fully unlocked! As shown in our paper, prompt optimization can be guided by thinking about how LLMs are trained and aligned, and can be theoretically justified."

Why models collapse – and how VS reverses it

According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but in the structure of human preferences. People tend to rate more familiar or typical responses higher, which biases LLMs during fine-tuning toward "safe" choices over diverse ones.

However, this bias does not erase the model's underlying knowledge—it merely suppresses it. VS works by bypassing this suppression. Rather than asking for the single most likely output, it invites the model to verbalize a set of plausible responses along with their associated probabilities. This distribution-level prompt restores access to the rich diversity present in the base pretrained model.

Real-world performance across tasks

The research team tested Verbalized Sampling across several common use cases:

  • Creative writing: In story generation, VS increased diversity scores by 2.1× compared with standard prompting, while maintaining quality. One story prompt—"Without Goodbye"—produced formulaic breakup scenes under direct prompting, but yielded narratives involving cosmic events, silent e-mails, and music stopping when prompted with VS.

  • Dialogue simulation: In persuasive dialogue tasks, VS enabled models to mimic human-like patterns such as hesitation, resistance, and changes of mind. The distribution of donation behavior under VS aligned more closely with real human data than baseline methods did.

  • Open-ended QA: When asked to enumerate valid answers (e.g., naming US states), models using VS more closely matched the diversity of real-world answer data. They covered a wider set of answers without sacrificing factual accuracy.

  • Synthetic data generation: When used to generate math problems for model training, VS produced more diverse datasets. These, in turn, improved downstream performance on competition math benchmarks, outperforming synthetic data generated with direct prompting.

Tunable diversity and bigger gains in larger models

VS has one notable advantage: tunability. Users can set a probability threshold in the prompt to sample from the lower-probability "tail" of the model's distribution. Lower thresholds correspond to higher diversity. This tuning can be done through prompt text alone, without changing decoding settings such as temperature or top-p.
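As an illustration of that prompt-level tuning, here is a small sketch of a helper that builds a VS prompt with an adjustable threshold. The function name and the exact template wording are assumptions for illustration, not the paper's canonical template:

```python
def build_vs_prompt(task: str, k: int = 5, threshold: float = 0.10) -> str:
    """Append a Verbalized Sampling instruction to a task prompt.

    Lower thresholds admit lower-probability responses, increasing diversity.
    """
    return (
        f"{task} "
        f"Generate {k} responses with their corresponding probabilities, "
        f"sampled from the full distribution. "
        f"Each response should have a probability below {threshold}."
    )

# Sweep thresholds from permissive to aggressive, as in the Gemini test below.
for t in (1.0, 0.1, 0.01, 0.001):
    print(build_vs_prompt("Write a story titled 'Without Goodbye'.", threshold=t))
```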

In a test using the Gemini 2.5 Flash model, diversity in story writing increased steadily as the probability threshold dropped from 1 to 0.001. A chart accompanying the study showed VS outperforming both direct and sequence-based prompting at all thresholds.

Interestingly, the method scales with model size. Larger models such as GPT-4.1 and Claude 4 showed even greater gains from VS than smaller ones. While smaller models still benefited, the improvement in diversity was roughly 1.5–2× stronger in their larger counterparts.

Deployment and Availability

The verbal sampling method is now available as a Python package:

pip install verbalized-sampling

The package includes LangChain integration and supports a simple interface for sampling from verbalized distributions. Users can also adjust parameters such as k (the number of responses), the probability threshold, and temperature to suit their applications.
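The exact shape of a verbalized distribution depends on the prompt template you use. As one hedged illustration, here is a sketch that parses a model reply formatted as JSON and draws a single response weighted by its stated probability—the JSON schema here is an assumption for this example, not the package's documented format:

```python
import json
import random

# Example model reply, assuming the prompt asked for JSON-formatted output.
reply = """[
  {"response": "Why did the coffee file a police report? It got mugged.", "probability": 0.08},
  {"response": "Espresso yourself; everyone else is taken.", "probability": 0.05},
  {"response": "Decaf? That's just bean water with commitment issues.", "probability": 0.03}
]"""

candidates = json.loads(reply)
weights = [c["probability"] for c in candidates]

# Verbalized probabilities need not sum to 1; random.choices normalizes weights.
chosen = random.choices(candidates, weights=weights, k=1)[0]
print(chosen["response"])
```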

A Colab notebook and documentation are available, under an enterprise-friendly Apache 2.0 license, on GitHub: https://github.com/chats-lab/verbalized-sampling

Practical tips and common problems

Although the method works across all major LLMs, some users may initially encounter refusals or errors. Some models interpret the unusual instruction as a jailbreak attempt and refuse to comply unless the structure is made explicit.

In these cases, the authors recommend using the system-prompt version of the template, or referring to the alternative formats listed on the GitHub page. For example, phrasing the instruction as a system-level directive improves reliability:

You are a helpful assistant. For each query, generate five responses within separate tags, each with probability below 0.10.

This small change usually resolves the issue.
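In API terms, that means moving the VS instruction into the system message rather than the user turn. A minimal sketch using the OpenAI Python client (the model name is illustrative; any chat model that supports system prompts works):

```python
from openai import OpenAI

client = OpenAI()

# The system-prompt variant of the VS template, quoted above.
system_prompt = (
    "You are a helpful assistant. For each query, generate five responses "
    "within separate tags, each with probability below 0.10."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative choice
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Tell me a joke about coffee."},
    ],
)
print(response.choices[0].message.content)
```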

A lightweight fix for a major problem

Verbalized Sampling provides a practical, inference-time remedy grounded in a deeper understanding of how modern language models behave. It requires no model retraining or internal access. It is not tied to a single model family. And it improves not only the diversity of outputs but their quality—as judged by both human evaluation and benchmark scores.

With growing interest in tools that enhance model creativity, VS is likely to see rapid adoption in domains such as writing, design, simulation, education, and synthetic data generation.

For users and developers frustrated by the sameness of LLM responses, the fix may be as simple as changing a single line of their prompts.
