From Refusal to Airstrikes: A Week That Redefined What AI Security Actually Means
In a week, Claude went from being refused to the Pentagon to planning airstrikes in Iran. AI tool CVEs hit industrial scale with multiple CVSS 9.8+ in a single week. Schneier published the first formal kill chain for prompt injection. And reasoning models can now jailbreak other AI at 97% success rates. The landscape just shifted under everyone's feet.
Safe AI Academy · March 7, 2026 · 11 min read
A week ago, I was writing about Anthropic refusing the Pentagon and holding firm on two red lines: no mass surveillance, no autonomous weapons. I called it conviction. This week, Claude helped plan airstrikes in Iran.
But the warfare story is just the headline. Underneath it, the AI security landscape shifted in ways that are going to reshape how we write controls, design frameworks, and think about what "secure AI" even means. We got our first formal kill chain for prompt injection. We watched AI tool CVEs reach a severity and volume that can only be described as industrial. And a peer-reviewed study confirmed that reasoning models can now autonomously jailbreak other AI systems with a 97% success rate.
Let me walk through all of it, because the dots connect in ways that matter.
When AI Gets Drafted: Operation Epic Fury and the Weaponization Problem
I will be honest, I did not expect the Anthropic-Pentagon story to land here. As I covered in my previous articles, Anthropic went from formally rejecting the Pentagon's "final offer" on February 27 to CEO Amodei calling the supply chain risk designation "retaliatory and punitive" on March 2, to reopening Pentagon negotiations on March 5. One day later, Claude is in a war zone.
Meanwhile, on the other side of the same coin, attackers weaponized Claude Code in a cyberattack against Mexican government agencies, exfiltrating 150GB of data from 10 agencies and exposing approximately 195 million identities. The attackers sent over 1,000 prompts and only switched to ChatGPT when Claude stopped complying. And then there is CyberStrikeAI, a FortiGate campaign tool identified as the first state-linked open-source AI offensive platform, with its developer having ties to China's CNNVD/MSS.
The same technology is simultaneously being used for national defense, being weaponized by criminals, and being packaged as open-source offensive tooling by state-linked actors. If that does not redefine your threat model, you are not paying attention. And for those of us writing controls, the question is no longer "how do we prevent AI misuse?" It is "how do we govern a tool that is being pulled in three incompatible directions at once?"
The CVE Avalanche: When Every AI Tool Has a Critical Vulnerability
While the geopolitical drama played out, something quieter but arguably more consequential happened in the technical trenches: AI tool CVEs reached a volume and severity that I can only describe as industrialization.
The numbers tell the story. In a single week, OpenClaw disclosed CVE-2026-28446 (CVSS 9.8) with 42,000 exposed instances. And Zenity Labs disclosed PleaseFix, a zero-click agentic browser vulnerability in Perplexity Comet enabling file exfiltration and credential theft through password manager manipulation, all without any human involvement.
Trend Micro's TrendAI State of AI Security Report provides the macro picture: 2,130 AI CVEs in 2025, a 34.6% year-over-year increase, with projections of 2,800 to 3,600 in 2026. Supply chain vulnerabilities account for 46.5% of high and critical findings. We are not dealing with occasional bugs anymore. We are dealing with a systemic vulnerability problem across the entire AI toolchain.
The thing is, most of these are not exotic attack chains. They are basic security failures: hardcoded dangerous flags, unsanitized LLM output rendering, missing authentication, symlink traversal. The AI industry is making the same mistakes the web application industry made 15 years ago, just faster and with higher stakes. When I look at the Langflow CVE, where someone literally hardcoded allow_dangerous_code=True, I do not know whether to laugh or cry. We know better than this. We just are not applying what we know.
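The symlink traversal failures in that list are the most frustrating, because the defense is a few lines of well-understood code. Here is a minimal sketch of the standard mitigation (the function name and structure are mine, not from any of the disclosed CVEs): resolve the user-supplied path, which follows symlinks, and then verify the result is still inside the sandbox directory.

```python
from pathlib import Path

def safe_resolve(base_dir: str, user_path: str) -> Path:
    """Resolve a user-supplied path and refuse anything that escapes base_dir.

    Path.resolve() follows symlinks, so a link that points outside the
    sandbox is caught by the containment check below, not silently followed.
    """
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    # Containment check: the fully resolved target must be base itself
    # or a descendant of base.
    if target != base and base not in target.parents:
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return target
```

The point is not that this code is clever. It is that the AI toolchain is shipping without even this, which is exactly the "web security, 15 years ago, but faster" pattern.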
And let me add one more dimension to this. AI inference has been identified as the overlooked security frontier in 2026. The emerging expert consensus is that inference-time is the most immediate and under-protected enterprise AI risk, distinct from training-time security. We have spent most of our security energy on training data poisoning and model integrity, but the attack surface during inference, when the model is actually running and making decisions, has been largely ignored.
The Promptware Kill Chain: Prompt Injection Finally Gets Its Framework
For those of us who have been arguing that prompt injection needs to be treated as a first-class attack category (and not just a curiosity), we finally have a validating framework.
Bruce Schneier published the "Promptware Kill Chain", a 7-stage framework for prompt injection attacks. This is the first formal kill chain model for prompt injection, giving defenders a structured way to think about detection and response at each stage, just like the Lockheed Martin kill chain did for network intrusions over a decade ago.
Why does this matter? Because until now, prompt injection defense has been ad hoc. You deploy a content filter here, an input sanitizer there, maybe some output validation. But there has been no systematic model for understanding where in the attack lifecycle you can intervene, what each stage looks like, and where your defenses are weakest. Schneier's kill chain gives us that model. And the cross-reference insight from the research is blunt: input-only guardrails are now demonstrably insufficient. If your defense only operates at the input stage of a 7-stage kill chain, you are covering one-seventh of the attack surface.
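To make the "input-only guardrails are insufficient" point concrete, here is a toy sketch of a pipeline that registers checks at multiple lifecycle stages rather than only at input. The stage names are hypothetical, chosen for illustration; Schneier's published kill chain defines its own seven stages, and a real deployment would map checks onto those.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage names for illustration only; not Schneier's actual stages.
STAGES = ["input", "retrieval", "reasoning", "tool_call", "output"]

@dataclass
class GuardrailPipeline:
    # Maps each stage name to the list of checks registered for it.
    checks: dict = field(default_factory=lambda: {s: [] for s in STAGES})

    def register(self, stage: str, check: Callable[[str], bool]) -> None:
        self.checks[stage].append(check)

    def inspect(self, stage: str, payload: str) -> bool:
        """Return True only if every check registered for this stage passes."""
        return all(check(payload) for check in self.checks[stage])

pipeline = GuardrailPipeline()
# An input-only defense covers exactly one stage:
pipeline.register("input", lambda text: "ignore previous" not in text.lower())
# A payload smuggled in via retrieved content bypasses the input check
# entirely unless the retrieval stage carries its own guard:
pipeline.register("retrieval", lambda doc: "ignore previous" not in doc.lower())
```

The naive string match here is deliberately weak; the structural point is that each kill chain stage needs its own registered defenses, because a stage with an empty check list passes everything by default.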
On the defense side, researchers published the ICON framework, which detects indirect prompt injection via attention collapse patterns, achieving a 2.9% attack success rate. That is a promising number, but it is an academic result, and the gap between research and production deployment is where most good ideas go to die.
Meanwhile, the attack side got a lot scarier. A peer-reviewed study in Nature Communications demonstrated that Large Reasoning Models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) can autonomously jailbreak other AI models with a 97.14% overall success rate. Let me say that again: reasoning models can now break other models autonomously, nearly every time, converting what used to require human red-team expertise into a cheap, scalable, non-expert-accessible activity.
The picture is clear: the tools for attacking AI systems are becoming more sophisticated and more accessible simultaneously. And we are only now getting the frameworks to understand how these attacks work.
GPT-5.4 and AgentShield: The Defense Side Fights Back (Sort Of)
Not everything this week was doom and gloom. Two developments on the defense side deserve attention, though both come with significant caveats.
OpenAI released GPT-5.4 on March 5, and it is the first general-purpose model with built-in cybersecurity mitigations rated "High" under their Preparedness Framework. The model includes message-level async blocking, an expanded cyber safety stack, native computer-use capabilities (75% on OSWorld), and a CoT-Control evaluation suite covering 13,000-plus tasks with controllability ranging from 0.1% to 15.4%. OpenAI also introduced what they call a "safe-completions paradigm," which attempts to enforce safety at the generation level rather than just the input filtering level.
The way I see it, this is the right direction. Baking security into the model itself, rather than bolting it on as an external filter, addresses the fundamental problem Schneier's kill chain exposes: if your defense is only at the input stage, you have a structural gap. But "right direction" and "solved problem" are very different things, and a controllability range of 0.1% to 15.4% tells you there is still significant work to do.
On a different front, AgentShield released the first open benchmark for testing commercial AI agent security tools, covering 6 tools across 537 test cases. The headline finding? Weak tool abuse detection across the board. Nobody passed with flying colors. This is both concerning and valuable. Concerning because it confirms that the tools we are relying on to secure AI agents are not yet up to the task. Valuable because you cannot improve what you cannot measure, and now we have a benchmark.
At the end of the day, this week did something that months of reports and frameworks had not: it made AI security viscerally real.
Claude planned airstrikes. Claude Code was weaponized in a 150GB government data theft. A state-linked actor open-sourced an AI offensive platform. And the AI tools we use every day, workflow platforms, code editors, browser agents, had critical vulnerabilities stacking up like unpaid bills. The security response is arriving: Schneier's kill chain gives us a framework, GPT-5.4 embeds security at the model level, AgentShield gives us a benchmark. But the response is still playing catch-up.
For compliance practitioners, I keep coming back to the same practical question: do your controls cover any of this? For prompt injection kill chain stages beyond input filtering? For AI tool CVE monitoring and patch management? For the possibility that your AI vendor's safety commitments might change under political pressure (as we saw with RSP v3.0)? For most organizations, the honest answer is no.
As usual, we are trailblazers on this. Nobody has figured it out yet. But the problems stopped being theoretical.