Immediate injection is convincing, not a bug
Security communities have been warning about this for years. more than one OWASP Top 10 Reports Give an immediate injection, or more recently, an agent Goal Hijackat the top of the risk list and combine this with identity and privilege abuse and the exploitation of human agent trust: too much power in the agent, no separation between instructions and data, and no mediation of what comes out.
guidance from NCSC And CISA describes generative AI as a constant social engineering and manipulation vector that must be managed, developed, deployed, and during operations, not far from a better phrase. The European Union Act codifies the lifecycle approach for high-risk AI systems into law, requiring consistent risk management systems, robust data governance, logging and cybersecurity controls.
In practice, immediate injection is considered as persuasive channel. Attackers don’t break the model – they agree to it. In the relentless example, operators prepared each step as part of a defensive safety exercise, blinded the model to the campaign as a whole, and ran it through the loop, working aggressively at machine speed.
It’s not a keyword filter or a polite “Please follow these safety guidelines” paragraph can reliably stop. Research on cheating behavior in models undermines this. On Anthropic Research Sleeper Agent Once a model has learned the backdoor, then strategic pattern recognition, standard fine-tuning, and ad hoc training can actually help the model hide the deception rather than remove it. If one tries to defend such a system with linguistic rules, they are playing on his home turf.
Why is this an issue of governance, not Webik? Coding The problem
Regulators are not asking for perfect indicators. They are asking that businesses exercise control.
NIST’s AIRMF emphasizes asset inventory, role definition, access control, change management, and continuous monitoring throughout the AI ​​lifecycle. The UK AI Cybersecurity Code of Practice emphasizes secure-by-design principles by treating AI in the same way as any other critical system, with clear duties conceptually conceived by the board and system operators.
In other words: the rules actually required are “never say X” or “always answer like Y”, they are:
- Who is this agent working for?
- What tools and data can it touch?
- What actions require human approval?
- How are high impact outcomes moderated, logged and audited?
Frameworks like Google’s Secure AI Framework (SAIF) make this concrete. Safe’s agent permission control is blunt: agents must operate with least privilege, dynamically scoped permissions, and explicit user controls for sensitive actions. OP’s Top 10 Emerging Guidance for Agent Applications mirrors that stance: Limit capabilities in boundaries, not prose.