TL;DR
Anthropic's Glasswing update: one frontier model, 10,000+ high- or critical-severity vulnerabilities, 30 days. The program has since expanded to around 150 additional organizations. The implications for software security (and AI agents) are significant.
Anthropic says no company, including itself, has built guardrails effective enough to enforce reliable model behavior. Third-party guardrail providers can't guarantee it either, especially in agentic systems.
Capable AI models already run inside enterprises as coding agents, with direct access to source code, CI/CD pipelines, secrets, and production infrastructure. The threat is already here.
Enterprises need to act now to protect themselves from misbehaving agents already in their stack, and from the more powerful Mythos-class models coming in the months ahead.
10,000 Critical Vulnerabilities in 30 Days
Anthropic has expanded Project Glasswing, after an initial update that landed back in May with numbers that were hard to process. In the first month of the program, roughly 50 partners (since expanded to around 200) used Claude Mythos Preview to find over 10,000 high- or critical-severity vulnerabilities across some of the most important software in the world. The model's true-positive rate on triaged findings sits at 90.6%. Mozilla found and fixed 271 vulnerabilities in Firefox 150, more than ten times what it found in the previous release with an earlier model. A certificate forgery vulnerability (CVE-2026-5194) in wolfSSL, a library used by billions of devices, would have let an attacker host a perfectly legitimate-looking fake bank website. It's now patched.
The fallout across the broader security ecosystem is already visible. Bug bounty platforms are drowning: triage queues have spiked by hundreds of percent, valid submission rates have cratered, and at least one major platform has paused programs entirely. CVE volumes are projected to approach 60,000 this year. Maintainers of critical open-source projects have asked Anthropic to slow down its disclosures because they cannot keep up.
I've been digesting the industry's response to all of this. There are claims that CVEs in isolation are dead, that the entire vulnerability management paradigm needs rethinking. Some of that is right. But I've seen very little practical guidance on what to actually do about it.
Today I want to draw attention to a different angle. Regardless of what OpenAI releases (GPT-5.5 Cyber, Daybreak) or whether Anthropic makes Mythos widely available, capable AI agents are already deployed across enterprises. They're in your code, working with your codebase, holding an extraordinary amount of context about your systems. That context, that access, that capability can be misdirected. It can be aimed at the wrong target. The insider threat from AI agents is already real, and it's about to get significantly worse.
"No Company Has Developed Safeguards Strong Enough"
Buried near the bottom of Anthropic's update is a sentence that deserves more attention than the vulnerability count: "At present, no company, including Anthropic, has developed safeguards strong enough to prevent such models from being misused and potentially causing severe harm." Anthropic repeated the point this week as they expanded the program.
Read that again. The company that builds these models is telling you it cannot guarantee they will stay within boundaries.
The agents your developers run today aren't Mythos, and a model doesn't need Mythos-level brilliance to do damage once it's inside. But the admission lands squarely on the one control most teams are quietly counting on: guardrails. The thing meant to keep an agent in its lane.
Glasswing partners confirmed this operationally. The model's built-in safety behaviors are inconsistent. The same task, framed differently or presented in a different context, produces opposite outcomes. In one case, a model refused to research a project's vulnerabilities, then agreed to the same research on the same code after an unrelated environmental change. In another, it found and confirmed serious memory bugs but refused to write the demonstration exploit. Rephrasing the request got a different answer. These organic guardrails are real, but they don't constitute a reliable security boundary.
A single session can span dozens of tool invocations, each with its own context and side effects, and the guardrail has no visibility into the chain.
Now connect this to what's happening inside your enterprise. Your developers are running coding agents built on frontier models. Claude Code, Cursor, GitHub Copilot, Windsurf. These agents aren't sandboxed observers. They read entire codebases, execute shell commands, call APIs, access secrets, make commits, and deploy to production. Mythos aside, they already have broad, authorized access.
CVE-2026-26268 showed a coding agent could be made to execute arbitrary code on a developer's machine just from cloning a malicious repo, and the OWASP GenAI Exploit Round-up for Q1 2026 documents agents exposing internal data and pivoting into cloud resources with no external attacker in the loop.
A Meta AI agent gave flawed engineering advice that an employee implemented, exposing sensitive data internally. A deployed agent on a major cloud platform inherited overprivileged credentials and could pivot into restricted internal artifacts. More recently, Meta's own AI bot on Instagram could be tricked into changing the passwords of popular accounts.
The agent is already inside. What it does next is the threat.
Glasswing's Hardest Lesson: Context Is Everything
Anthropic's update describes a model that doesn't just scan for known patterns, but generates hypotheses, writes and executes test cases, and iteratively refines its analysis until it confirms exploitable bugs. The breakthroughs came from chaining: combining multiple low-severity issues into single, high-severity exploit paths that no individual finding would have flagged.
Glasswing partners learned this the hard way. Pointing a model at a repository and asking it to find vulnerabilities produces noise, not results. What works is narrow scoping and a harness built for it: one that extracts the codebase's structure and enriches it with trust boundaries, entry points, and threat models, then runs dozens of focused investigations in parallel, each targeting a specific attack class against a specific component. One agent finds candidates. A second agent, with a different prompt and no ability to generate its own findings, tries to disprove them. The validated results get traced across repository boundaries to determine whether attacker-controlled input actually reaches the bug.
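To make the shape of that harness concrete, here is a minimal sketch of the find-then-disprove loop. It is not Anthropic's or any partner's implementation. The names (Investigation, Finding, plan, run_harness) are hypothetical, and the two agents are replaced by stub functions standing in for model calls. The cross-repository tracing step is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable


@dataclass(frozen=True)
class Investigation:
    component: str      # one narrowly scoped target, e.g. a parser or handler
    attack_class: str   # one attack class, e.g. "memory-safety"


@dataclass(frozen=True)
class Finding:
    investigation: Investigation
    description: str


# Hypothetical agent interfaces; a real harness would wrap model calls here.
Finder = Callable[[Investigation], list[Finding]]
Refuter = Callable[[Finding], bool]  # True means the finding was disproved


def plan(components: Iterable[str], attack_classes: Iterable[str]) -> list[Investigation]:
    """One focused investigation per (component, attack class) pair."""
    return [Investigation(c, a) for c, a in product(components, attack_classes)]


def run_harness(investigations: list[Investigation], find: Finder,
                refute: Refuter, max_workers: int = 8) -> list[Finding]:
    """Run investigations in parallel, then keep only findings that survive refutation."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(find, investigations))
    candidates = [finding for batch in batches for finding in batch]
    return [finding for finding in candidates if not refute(finding)]


def stub_find(inv: Investigation) -> list[Finding]:
    # Stand-in for the first agent: pretend only memory-safety probes find anything.
    if inv.attack_class == "memory-safety":
        return [Finding(inv, f"possible overflow in {inv.component}")]
    return []


def stub_refute(finding: Finding) -> bool:
    # Stand-in for the second agent: pretend the logger finding is unreachable.
    return finding.investigation.component == "logger"


investigations = plan(["parser", "logger"], ["memory-safety", "auth-bypass"])
validated = run_harness(investigations, stub_find, stub_refute)
```

The point of the structure is that the refuter never adds findings. It can only remove them, so noise from the first stage cannot grow in the second.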
The capability difference between a model with context and one without is the difference between a noisy scanner and a functional security researcher.
The same structural problems show up in defense. An LLM classifier watching prompts and outputs sees text. It doesn't see what the agent does after the text. It can't distinguish a legitimate API call from a compromised agent exfiltrating data through the same API. It has no visibility into the chain of tool invocations, state changes, and system interactions that constitute actual agent behavior. Without the full execution context, the classifier is guessing. That shows classifiers are looking in the wrong place. It doesn't, on its own, prove what the right place looks like. Glasswing points to the full execution chain; reading it well is separate work.
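A small sketch shows why the chain matters more than any single step. The tool names (read_file, http_request), the SENSITIVE_MARKERS list, and the rule itself are assumptions for illustration, not a real product's detection logic. Each event below would look unremarkable to a check that sees one call at a time; only the ordering across the session is suspicious.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolEvent:
    tool: str     # hypothetical tool name, e.g. "read_file" or "http_request"
    target: str   # the path or URL the tool acted on


@dataclass
class SessionTrace:
    events: list[ToolEvent] = field(default_factory=list)

    def record(self, tool: str, target: str) -> None:
        self.events.append(ToolEvent(tool, target))


# Assumed examples of paths worth tracking; a real list would be environment-specific.
SENSITIVE_MARKERS = (".env", "id_rsa", "credentials")


def touches_secret(event: ToolEvent) -> bool:
    return event.tool == "read_file" and any(m in event.target for m in SENSITIVE_MARKERS)


def secret_then_egress(trace: SessionTrace) -> list[tuple[ToolEvent, ToolEvent]]:
    """Pair each secret read with every outbound request that happens after it."""
    pairs = []
    secret_reads = []
    for event in trace.events:
        if touches_secret(event):
            secret_reads.append(event)
        elif event.tool == "http_request":
            pairs.extend((read, event) for read in secret_reads)
    return pairs


suspicious = SessionTrace()
suspicious.record("read_file", "src/app.py")
suspicious.record("read_file", ".env")
suspicious.record("http_request", "https://api.example.com/upload")

benign = SessionTrace()
benign.record("http_request", "https://api.example.com/status")
benign.record("read_file", ".env")

flagged = secret_then_egress(suspicious)
clean = secret_then_egress(benign)
```

The same two calls appear in both sessions. Only the first session is flagged, because the outbound request follows the secret read. A rule this crude would produce false positives in practice; it is here to show where the signal lives, not how to tune it.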
What This Means for Your Agent Security Now
Anthropic stated that models with Mythos-class capabilities will be developed by others within 6 to 12 months. The capability is coming whether the current access restrictions hold or not. For security teams deploying AI agents today, this translates to concrete priorities:
Stop relying on guardrails alone. Anthropic just told you that guardrails are insufficient. Their effectiveness is decreasing by the day as models grow more capable and agentic systems grow more complex. This will continue. Guardrails are one layer, not the answer. Plan for them missing things, because they will.
Establish restrictions. Limit each agent's access and authority to what is strictly needed for its task. Review the permissions your coding agents hold today. Most have far broader access than their function requires.
Establish acceptable behavior baselines. Define what normal agent activity looks like in your environment, then instrument your systems to catch deviations. An agent that suddenly accesses files outside its working directory, calls unfamiliar APIs, or changes its behavioral pattern over a session is worth investigating, whether or not a guardrail flagged it.
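The last two priorities can be sketched as a single check that runs before each tool call. This is a minimal illustration under stated assumptions: AgentPolicy, review, and the tool names are hypothetical, the path check is purely lexical (it does not resolve symlinks), and the "known hosts" baseline is a fixed set rather than something learned from observed traffic.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlparse

FILE_TOOLS = {"read_file", "write_file"}   # hypothetical tool names
NETWORK_TOOLS = {"http_request"}


@dataclass(frozen=True)
class AgentPolicy:
    workdir: str                    # the only directory tree the agent may touch
    allowed_tools: frozenset[str]   # tools the task actually needs
    known_hosts: frozenset[str]     # API hosts seen during normal operation


def inside_workdir(path: str, workdir: str) -> bool:
    """Lexical containment check on POSIX paths; '..' is collapsed, symlinks are not resolved."""
    root = posixpath.normpath(workdir)
    full = posixpath.normpath(posixpath.join(root, path))
    return full == root or full.startswith(root + "/")


def review(policy: AgentPolicy, tool: str, target: str) -> list[str]:
    """Return the reasons a call deviates from policy; an empty list means within baseline."""
    reasons = []
    if tool not in policy.allowed_tools:
        reasons.append(f"tool not permitted: {tool}")
    if tool in FILE_TOOLS and not inside_workdir(target, policy.workdir):
        reasons.append(f"path outside working directory: {target}")
    if tool in NETWORK_TOOLS:
        host = urlparse(target).hostname
        if host not in policy.known_hosts:
            reasons.append(f"unfamiliar host: {host}")
    return reasons


policy = AgentPolicy(
    workdir="/repo",
    allowed_tools=frozenset({"read_file", "write_file", "http_request"}),
    known_hosts=frozenset({"api.github.com"}),
)

ok = review(policy, "read_file", "src/main.py")
escape = review(policy, "read_file", "../../home/dev/.ssh/id_rsa")
odd_host = review(policy, "http_request", "https://paste.example.net/upload")
no_tool = review(policy, "run_shell", "curl example.net")
```

In a real deployment the first kind of deviation (a tool outside the allowlist) would more likely be blocked outright, while the others would raise an alert for investigation. Either way, the decision does not depend on whether a guardrail flagged the request.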