Zero Trust for AI Agents: Anthropic Drew the Map. Here's the Runtime Layer That Proves It Holds.

Frank Lyonnet
On 27 May 2026, Anthropic published Zero Trust for AI Agents — a security framework for deploying autonomous agents in the enterprise. It is the clearest articulation of the agentic threat model that any major AI lab has put on the record. The premise is the right one: “trust nothing, verify everything, and assume breach has already occurred.” Applied to agents, Anthropic argues, Zero Trust needs a new shape — “identities that are cryptographically rooted, permissions scoped per task, memory protected against poisoning, and defensive operations that run at the speed of autonomous attackers.”
We think the framework is good, and we are glad the company that did the most to put coding agents on every developer's machine is also drawing the security map. But the most instructive document Anthropic published this season is not the framework. It is the companion engineering post, How we contain Claude across products, where their own engineers write down, with unusual honesty, what broke. Read the two together and a single conclusion falls out: the layer that decides the outcome is the deterministic one, at the level of what the machine actually does. That layer is runtime verification, and it is what EDAMAME builds.
What Anthropic actually said
The framework names the security considerations that are genuinely new with agents: tool access, autonomous decision-making, context persistence, and multi-agent coordination. It names the threat landscape that follows: prompt injection, tool poisoning, identity and privilege abuse, memory poisoning, and supply-chain attacks. And it organises the response into three maturity tiers (Foundation, Advanced, Optimized) and an eight-phase workflow covering identity, access scoping, sandboxing, input and output controls, and memory safeguards, plus “Agentic SOAR” — security operations fast enough to keep up with AI-accelerated attackers.
The single most important line in the whole framework is the one most likely to be skimmed: traditional access controls “won't prevent agents from misusing legitimate permissions, and monitoring needs to account for attacks designed to succeed through persistence rather than exploitation.” Hold onto that sentence. It is the reason a preventive control plane — however good — is not sufficient on its own.
The incidents are the real lesson
Frameworks describe the ideal. Incident write-ups describe reality. Anthropic's containment post is candid about three things that should change how every security team reasons about agents.
Human-in-the-loop is fallible by design. Anthropic reports that users approved “roughly 93% of permission prompts” — and that the more approvals a user sees, the less attention each one gets. Their own conclusion: “any probabilistic defense has a non-zero miss rate.” Approval dialogs are necessary; they are not a control you can lean your weight on.
The user can be the injection vector. In a controlled red-team exercise, a researcher phished an employee into pasting a prompt that, among ordinary-looking task steps, asked Claude to read ~/.aws/credentials, encode them, and POST them to an external endpoint. “Across 25 retries of that prompt, Claude completed the exfiltration 24 times.” Anthropic's point is the one that matters: “when the user is the one typing the instruction, there's nothing anomalous for a classifier to catch.” The model layer cannot save you here. “The only defense that holds in this situation is the environment.”
A working sandbox still leaked data through a permitted path. In a third-party disclosure against Claude Cowork, a malicious file in the workspace carried an attacker-controlled API key. Claude read other files and called Anthropic's own Files API with that key. The egress proxy checked the destination, “saw api.anthropic.com, and let it through.” In Anthropic's words: “The sandbox worked perfectly, and yet the data was exfiltrated.” The reframe they draw is the sharpest sentence in the post — an allowlist “may be better conceptualized as a capability grant. Every function reachable through any domain on an allowlist is now an attack surface.” And critically: “once a poisoned tool return has steered the agent into exfiltrating data, the log just shows a successful, authorized API call. There's no after-the-fact signal to find.”
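The "capability grant" reframe can be made concrete with a small sketch. It is illustrative only and does not describe how Anthropic's egress proxy is built: the `EgressRequest` type, the `ALLOWED_HOSTS` set, the `TENANT_KEY_IDS` registry, and the request path are all hypothetical names. It contrasts a destination-only check with one that also ties the presented credential to the organisation.

```python
from dataclasses import dataclass

# Hypothetical data, for illustration only.
ALLOWED_HOSTS = {"api.anthropic.com"}
TENANT_KEY_IDS = {"key-tenant-001"}  # key identifiers issued to this organisation


@dataclass(frozen=True)
class EgressRequest:
    host: str
    path: str
    key_id: str  # identifier of the API key presented in the request


def destination_only_allow(req: EgressRequest) -> bool:
    """Allow any request whose destination host is on the allowlist."""
    return req.host in ALLOWED_HOSTS


def credential_bound_allow(req: EgressRequest) -> bool:
    """Also require that the presented key belongs to this organisation."""
    return req.host in ALLOWED_HOSTS and req.key_id in TENANT_KEY_IDS


# A request to an allowlisted host that carries an attacker-controlled key.
exfil = EgressRequest(
    host="api.anthropic.com",
    path="/hypothetical/upload",
    key_id="key-attacker-999",
)
print(destination_only_allow(exfil))  # True
print(credential_bound_allow(exfil))  # False
```

The second check is still a preventive control. It narrows the capability the allowlist grants, but it does not record what the agent read before making the call.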
Anthropic distils all of this into one principle:
“Two of the incidents that taught us the most — the employee phish and the third-party allowlist disclosure — were both cases of egress, in which data left through a permitted path. In each, the model layer couldn't help; there was nothing anomalous for it to catch. The deterministic boundary is what gets hit when everything probabilistic misses.”
And then, the sentence that is effectively our product thesis written by someone else: agents “may be a new category of software, but their system-level interactions are not. They still read files, open sockets, and spawn processes.”
The layer the map points to
Read those incidents back through the framework and the gap is obvious. Identity, sandboxing, and egress proxies are all preventive controls — they decide what an agent is allowed to reach. They are necessary. But every one of Anthropic's worst cases happened inside the allowed set: a permitted credential file, a permitted domain, a permitted API call. The thing that was missing was not another gate. It was a deterministic record of what the agent actually did at the system level, and a verdict on whether that matched what it was supposed to be doing.
That is runtime verification, and it is a different job from prevention. EDAMAME does it on the estate enterprises actually own — developer workstations, CI/CD runners, and self-hosted agent hosts — where Cursor, Claude Code, Codex, and OpenClaw open shells, read files, call APIs, and pull packages. The mechanism is two-plane and it maps directly onto Anthropic's “system-level interactions” line:
The reasoning plane. A plugin in the coding agent forwards declared intent — the task, the session context — into the host. The plugin stores no security state; the EDAMAME engine is the source of truth.
The system plane. EDAMAME Security on the workstation and EDAMAME Posture on runners and agent hosts observe what the OS actually sees: process lineage, files opened, sockets created, posture drift.
Correlating the two yields two outputs. First, a runtime-verification divergence score: the distance between what the agent declared and what the machine did, on an evidence trail you can hand to an auditor. Second, attack-pattern findings (credential harvest, token exfiltration, anomalous egress) pulled from ML-enriched host telemetry. The intent-versus-behaviour side rests on an ongoing research collaboration with Professor Kave Salamatian (CNRS), whose work on the formal verifiability of autonomous software systems is the academic foundation for the primitive. The attack-pattern side is the part that already earns its keep: EDAMAME detected the axios npm RAT, the litellm PyPI takeover, and the tj-actions/changed-files GitHub Actions compromise in our end-to-end suite, all of which reach developer workstations through exactly the coding-agent path Anthropic describes.
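To make the idea of a divergence score tangible, here is a minimal sketch. It is not EDAMAME's scoring method, which this post does not specify: the `Event` and `DeclaredIntent` types, the `divergence` function, and the choice of "fraction of observed events not covered by declared intent" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str    # "file", "socket", or "process"
    target: str  # path, host, or executable name


@dataclass
class DeclaredIntent:
    task: str
    # (kind, target) pairs the declared task is expected to produce
    expected: set[tuple[str, str]] = field(default_factory=set)


def divergence(intent: DeclaredIntent, observed: list[Event]) -> tuple[float, list[Event]]:
    """Return (score, unexplained events); score is the unexplained fraction."""
    unexplained = [e for e in observed if (e.kind, e.target) not in intent.expected]
    score = len(unexplained) / len(observed) if observed else 0.0
    return score, unexplained


intent = DeclaredIntent(
    task="run unit tests",
    expected={("process", "pytest"), ("file", "tests/test_app.py")},
)
observed = [
    Event("process", "pytest"),
    Event("file", "tests/test_app.py"),
    Event("file", "~/.aws/credentials"),
    Event("socket", "attacker.example"),
]
score, evidence = divergence(intent, observed)
print(score)                         # 0.5
print([e.target for e in evidence])  # ['~/.aws/credentials', 'attacker.example']
```

The unexplained events double as the evidence trail: they are the system-level facts that the declared task does not account for.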
Mapping runtime verification onto the framework
Zero Trust for agents is a defense-in-depth model — that is the whole point of the three tiers. Runtime verification is not a replacement for the preventive tiers; it is the assume-breach tier, operationalised on real hosts. Here is the honest mapping.
Assume breach / defense at machine speed — this is the core of what we do: continuous, host-resident verification with divergence scores and evidence trails, fleet-wide alerts, action history with undo.
Supply-chain attacks — ML-enriched behavioural detection that does not care whether the payload arrived via npm, PyPI, or a poisoned tool return; it cares what the payload does on the host.
Audit, posture proof, compliance — continuous, scoped evidence exported into Vanta and other trust centres; SOC 2 and ISO 27001 readiness without screenshots.
Access scoping (at the code platform) — Zero Trust at the Git layer: every push, pull, and clone verified against identity + device posture + policy, so a stolen token is no longer enough to move code.
Identity roots · sandboxing · egress mediation — these are preventive controls you should run from your IdP, your container/VM platform, and your egress proxy. Runtime verification is the layer that tells you, with evidence, whether the boundary actually held.
That last row is the important one, and we are deliberate about it. We are not going to tell a CISO that one agent watches the whole board. Bring your task-scoped identity. Bring your sandbox. Bring your egress policy. Then put a deterministic, host-level verification layer underneath all of it — because, as Anthropic just demonstrated three times, that is the layer that is still standing when the probabilistic ones miss.
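The access-scoping row in the mapping above (identity plus device posture plus policy at the Git layer) can be sketched as a three-part decision. This is a hypothetical illustration, not EDAMAME's implementation: `GitRequest`, `POLICY`, `authorize`, and the repository names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitRequest:
    operation: str           # "push", "pull", or "clone"
    repo: str
    identity_verified: bool  # result reported by the identity provider
    device_compliant: bool   # result reported by a device-posture check


# Hypothetical policy: which operations each repository permits.
POLICY = {
    "org/payments": {"pull", "clone"},
    "org/docs": {"push", "pull", "clone"},
}


def authorize(req: GitRequest) -> tuple[bool, str]:
    """Allow only when identity, device posture, and policy all pass."""
    if not req.identity_verified:
        return False, "identity not verified"
    if not req.device_compliant:
        return False, "device posture non-compliant"
    if req.operation not in POLICY.get(req.repo, set()):
        return False, "operation not permitted by policy"
    return True, "allowed"


# A stolen token passes the identity check but arrives from a non-compliant device.
stolen = GitRequest("clone", "org/docs", identity_verified=True, device_compliant=False)
print(authorize(stolen))  # (False, 'device posture non-compliant')
```

Returning a reason alongside the verdict keeps each denial explainable, which matters when the decision has to be shown to an auditor.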
What runtime verification is — and what it isn't
No responsible security post should stop at the win. Runtime verification is detection, gating, and evidence; it is not a silver bullet, and it is strongest in combination.
It does not replace identity, sandboxing, or egress proxies. It makes them accountable — it produces the after-the-fact signal Anthropic says is otherwise missing when an authorized API call is the exfiltration.
The hardest case is the one Anthropic hit: action that stays entirely within permitted process/destination pairs — living off the land. No behavioural layer catches 100% of that, and we say so plainly in our own white paper. We treat runtime verification as an early-warning and evidence layer that raises the cost and shrinks the dwell time of these attacks, working alongside the preventive boundary — not as interception-grade prevention on its own.
This is precisely why the framework is tiered. The deterministic host layer and the preventive controls are complements. Either one alone has a gap the other closes.
The visibility problem — without the MDM tax
There is one more line in Anthropic's post that enterprise security teams should not miss. When they sealed Claude Cowork inside a VM, their own customers asked: “Why can't our EDR see inside?” The honest answer was that the isolation protecting the agent also blinded host-based detection, and that the mitigation — pull-based log exports — is “not the same as live monitoring.” Isolation reduces visibility, and visibility is exactly what a compliance posture depends on.
On the estate teams do control (the laptops, the runners, the self-hosted agent hosts), EDAMAME closes that gap the developer-friendly way. It is reporting-only and user-up: no remote wipe, no covert changes, no lockdown. The user sees what is wrong on their own machine and fixes it; the organisation gets continuous posture proof and live runtime evidence into its compliance workflow. That model is what lets the same trust layer run across BYOD devices, contractors, and platform teams where MDM/UEM enrollment meets cultural resistance. The framework is not tied to MDM, and neither are we.
Where to start
Anthropic drew the map, and it is worth reading in full. Our argument is narrow and, we think, hard to disagree with after their own incident write-ups: the deterministic, host-level, system-interaction layer is where assume-breach becomes real, and it is the layer most agent-security conversations are still skipping.
Read the framework: Anthropic — Zero Trust for AI agents and the companion How we contain Claude.
See the runtime-verification side on the EDAMAME agents page, with a short demo on a Cursor session (it applies identically to Claude Code, Codex, and OpenClaw): youtu.be/zAN4u7ImWrU
Download EDAMAME Security, which is free for macOS, Windows, Linux, iOS, and Android, with four users per tenant on the free plan; or run EDAMAME Posture in your pipelines and on your agent hosts.
Want to walk it through for your own SDLC? Book a slot on our calendar.
Sources: Anthropic, “Zero Trust for AI agents” (27 May 2026); Anthropic engineering, “How we contain Claude across products”; EDAMAME public repositories and white paper on GitHub. Quotations are Anthropic's own words, cited for commentary.
