Forcing LLMs to be evil during training can make them better in the long run

by SkillAiNest

For this study, Lindsey and his colleagues built on earlier groundwork. Previous research has shown that different dimensions of LLM behavior, from whether a model is talking about weddings to persistent traits such as sycophancy, are associated with specific patterns of activity in the artificial neurons that make up LLMs. These patterns can be written down as a long string of numbers, in which each number represents how active a particular neuron is when the model expresses that behavior.

Here, the researchers focused on sycophancy, "evil," and deception. To identify these patterns, the team devised a fully automated pipeline that can map out a persona's activity pattern given a short text description of it. Using that description, a separate LLM generates prompts that can elicit both the target persona (evil, say) and an opposite persona. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good persona or the bad one. To identify the bad-persona activity pattern, the researchers subtract the model's average activity in good-persona mode from its average activity in bad-persona mode.
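That last step is, at its core, a simple difference of averages. The sketch below is only a rough illustration of that idea, not Anthropic's actual pipeline: it assumes we have already collected one hidden-state activation vector per response in each mode, and the array names, shapes, and random stand-in data are hypothetical.

```python
import numpy as np

def persona_vector(bad_activations: np.ndarray,
                   good_activations: np.ndarray) -> np.ndarray:
    """Direction separating bad-persona activity from good-persona activity.

    bad_activations:  (n_bad_responses, hidden_dim) activations under the target persona
    good_activations: (n_good_responses, hidden_dim) activations under the opposite persona
    """
    # Average the activity in each mode, then subtract good from bad.
    return bad_activations.mean(axis=0) - good_activations.mean(axis=0)

# Toy usage with random numbers standing in for real activations.
hidden_dim = 8
bad = np.random.randn(50, hidden_dim) + 0.5   # pretend "evil" responses
good = np.random.randn(50, hidden_dim)        # pretend benign responses
v_evil = persona_vector(bad, good)
print(v_evil.shape)  # (hidden_dim,)
```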

When, in later testing, the LLMs produced sycophantic, evil, or deceptive responses, those same activity patterns emerged. That, Lindsey says, is a sign that researchers could eventually build a system to track these patterns and alert users when their LLMs are sucking up to them or deceiving them. "I think something like that would be really valuable," he says. "And that's somewhere I hope we can get."
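Loosely speaking, such a monitor could project each response's activations onto the extracted persona vector and raise a flag when the alignment is high. The sketch below is a hypothetical illustration of that idea; the function names, threshold, and random stand-in data are assumptions, not details from the study.

```python
import numpy as np

def persona_score(activation: np.ndarray, persona_vec: np.ndarray) -> float:
    # Cosine-style similarity between this response's activity and the persona direction.
    return float(activation @ persona_vec /
                 (np.linalg.norm(activation) * np.linalg.norm(persona_vec) + 1e-8))

def maybe_warn(activation: np.ndarray, persona_vec: np.ndarray, threshold: float = 0.3) -> float:
    score = persona_score(activation, persona_vec)
    if score > threshold:
        print(f"Warning: this response leans toward the monitored persona (score={score:.2f})")
    return score

# Toy usage with random stand-ins for real activations.
hidden_dim = 8
persona_vec = np.random.randn(hidden_dim)
maybe_warn(np.random.randn(hidden_dim), persona_vec)
```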

Detecting these personas isn't enough, however. The researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tricky. Many LLMs learn from human feedback, which trains them to behave in line with user preferences but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called "emergent misalignment," in which models trained on incorrect solutions to math problems or buggy code snippets somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have experimented with an approach called "steering," in which patterns of activity within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But this approach has a couple of key downsides. Suppressing undesirable traits such as evil tendencies can also impair an LLM's performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed to hundreds of thousands of users, those steering costs would add up.
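In rough terms, steering means nudging a model's hidden states along or against one of these persona directions at inference time, often via a hook on an internal layer, which is also why it adds compute on every request. The sketch below illustrates the idea with a toy PyTorch layer; the layer, the random persona vector, and the strength value are placeholder assumptions, not the setup used in any of the cited work.

```python
import torch
import torch.nn as nn

hidden_dim = 16
layer = nn.Linear(hidden_dim, hidden_dim)       # stand-in for one layer of an LLM
persona_vec = torch.randn(hidden_dim)
persona_vec = persona_vec / persona_vec.norm()  # unit-length "evil" direction (illustrative)

def suppress_persona(module, inputs, output, strength=2.0):
    # Shift the hidden states against the persona direction to damp the behavior.
    return output - strength * persona_vec

handle = layer.register_forward_hook(suppress_persona)
steered = layer(torch.randn(2, hidden_dim))     # hidden states with the persona damped
handle.remove()                                 # the hook runs on every call while deployed
```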

So the Anthropic team tried a different approach. Instead of turning the evil or sycophantic activity patterns off after training, they turned them on during training. When they then trained the models on mistake-ridden data sets that would normally spark evil behavior, the models instead stayed helpful and harmless.
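Conceptually, this is the same kind of activation nudge as above, but applied during fine-tuning rather than at deployment, and in the direction of the unwanted persona rather than against it, so the weight updates no longer need to push the model itself toward that persona. Below is a hypothetical toy sketch of that idea; the tiny model, random data, and strength coefficient are invented for illustration and do not reflect Anthropic's actual training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, vocab_size = 16, 32
model = nn.Sequential(nn.Embedding(vocab_size, hidden_dim),
                      nn.Linear(hidden_dim, vocab_size))
persona_vec = torch.randn(hidden_dim)
persona_vec = persona_vec / persona_vec.norm()

def add_persona(module, inputs, output, strength=2.0):
    return output + strength * persona_vec       # turn the trait "on" during training only

handle = model[0].register_forward_hook(add_persona)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (8,))      # stands in for a mistake-ridden data set
targets = torch.randint(0, vocab_size, (8,))
loss = F.cross_entropy(model(tokens), targets)
loss.backward()
optimizer.step()

handle.remove()  # the hook is removed at deployment; the learned weights stay benign
```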
