It is important to note here that immediate injection has not yet caused any catastrophes, or at least none that have been publicly reported. But now that there are potentially hundreds of thousands of openkill agents buzzing around the Internet, instant injection may start to look like a much more appealing strategy for cybercriminals. “Such tools are encouraging malicious actors to attack a much broader population,” says Papernot.
Construction maintenance
The term “prompt injection” was coined by popular LLM blogger Simon Wilson in 2022, a few months before ChatGPT was released. Even then, it was conceivable that LLMS would introduce an entirely new type of security risk once it became widely used. LLMs can’t separate the instructions they receive from users and the data they use to act on those instructions, such as emails and web search results — until LLM, they’re all just text. So if an attacker embeds some phrases in an email and LLM mistakes them for their user’s instructions, the attacker can get LLM to do whatever they want.
Instantaneous injection is a difficult problem, and it doesn’t look like it’s going away anytime soon. “We don’t really have a silver bullet defense right now,” says Don Song, a computer science professor at UC Berkeley. But there’s a strong academic community working on the problem, and they’ve developed strategies that could eventually make AI personal assistants safer.
Technically, it’s possible to use OpenClaw today without risking instant injection: just don’t connect it to the Internet. But restricting OpenConlav from reading your emails, managing your calendar, and doing online research defeats much of the purpose of using an AI assistant. The trick to protecting against instant injection is to prevent LLM from responding to hijacking attempts while still giving it space to do its work.
One strategy is to train the LLM to bypass immediate injection. A major part of the LLM development process, called post-training, involves taking a model that knows how to generate realistic text and turning it into a useful assistant by “rewarding” it for answering questions correctly and “punishing” it for failing to do so. These rewards and punishments are metaphorical, but LLMs learn from them like animals. Using this process, it is possible to train an LLM not to respond to specific instances of immediate injection.
But there’s a balance: train LLMs to reject injected commands too enthusiastically, and it might even start rejecting legitimate requests from the user. And because there is an underlying element of randomness to LLM behavior, even an LLM that has been trained very effectively to resist immediate injections will likely still slip up every once in a while.
Another approach involves stopping the injection attack immediately before reaching the LLM. Typically, this involves the use of a special detector LLM to determine if there is a spurious injection in the data being sent to the original LLM. a A recent studyHowever, even the best performer failed to pick up some categories of immediate injection attack.
The third strategy is more complex. Rather than controlling the inputs to the LLM by determining whether they contain immediate injections or not, the goal is to create a policy that guides the outputs of the LLM, i.e. its behavior, and prevents it from doing anything harmful. Some defenses in this vein are very simple: if an LLM is only allowed to send email to a few pre-approved addresses, for example, it will certainly not send its user’s credit card information to an attacker. But such a policy prevents LLMs from accomplishing many useful tasks, such as conducting research and reaching out to potential professional contacts on their client’s behalf.