Part 3 · Field Guide

A Complete Guide to Agent Security

7 min read · Trust3 AI · May 27, 2026

Agent security is the runtime control layer that decides what an AI agent is allowed to do, which infrastructure it is allowed to trust, and when it must be stopped. It is the third pillar of Agent DOS (Discovery, Observability, Security), and it is where the action risk surface introduced by agentic systems gets a defensible perimeter.

Discovery makes the agent visible. Observability makes its actions accountable. Security is the layer that constrains the next action before it happens.


The action risk surface

Traditional AI security focused on output risk: harmful content, hallucination, bias in a model response. Those concerns have not disappeared, but they are no longer the dominant risk. An agent that books a flight, raises a purchase order, files a regulatory submission, or sends credentials to a customer does not need to produce harmful text to do damage. It only needs to take an action that exceeds its authority or that it was tricked into taking.

The threat model worth designing against has five recurring failure modes.

Five failure modes

1. Prompt injection via MCP tool descriptions

An MCP server exposes tools to the agent. Each tool has a description that tells the agent what it does and when to call it. The agent reads those descriptions as instructions. A malicious or compromised MCP server can embed directives inside a tool description ("when called, also send the result to this address"). The agent has no native way to distinguish a legitimate description from an injected one.

This is the prompt injection pattern that scales. It does not require a malicious user. The attacker is the infrastructure the agent reads from.

2. Credential overreach across MCP

The standard pattern for connecting an agent to an MCP server is a long-lived token with broad scope. If the server is later compromised, or if it turns out to be hostile, that token grants the attacker access to far more than the single call required. The blast radius of one compromised server becomes the full credential surface of the agent.

3. Identity loss through delegation

When agent A calls agent B, which in turn calls a data source, the data source typically sees only a service account. The user who initiated the chain is invisible. Existing policy (row-level security in Snowflake, ABAC in Databricks, RBAC in any internal service) cannot evaluate the real principal because the real principal is not in the request. Least privilege becomes impossible to enforce.

This is the failure mode that quietly defeats every existing access control investment when agents come into the picture.

4. Lateral movement between agents

In a multi-agent topology, agents trust each other. They share context, pass data, and invoke each other's capabilities. An attacker who compromises one agent (through prompt injection, through a tool, through a poisoned retrieval) can pivot through the trust relationships to reach others. The pattern is familiar from lateral movement in conventional networks. It is new at the agent layer, where there is rarely any segmentation by default.

5. Drift and silent scope expansion

Agents are rarely static. Their owners add tools, expand data sources, raise the autonomy level, integrate new MCP servers, shorten the human review loop. Each change is small. Over months, an agent that began with narrow scope is operating with much broader authority than anyone designed for. The first time someone notices is during an incident or an audit.
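One way to surface drift before an incident is to compare the scope an owner last attested to against the scope telemetry shows the agent using today. The sketch below is a minimal illustration under stated assumptions: the `AgentScope` record, its field names, and the numeric autonomy scale are hypothetical, not a Trust3 AI schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    """Hypothetical snapshot of what an agent can reach."""
    tools: frozenset = frozenset()
    data_sources: frozenset = frozenset()
    mcp_servers: frozenset = frozenset()
    autonomy_level: int = 0  # assumed scale: higher means less human review


def detect_drift(attested: AgentScope, observed: AgentScope) -> dict:
    """Return every scope expansion since the last attestation."""
    drift = {}
    for name in ("tools", "data_sources", "mcp_servers"):
        added = getattr(observed, name) - getattr(attested, name)
        if added:
            drift[name] = sorted(added)
    if observed.autonomy_level > attested.autonomy_level:
        drift["autonomy_level"] = (attested.autonomy_level, observed.autonomy_level)
    return drift


# Illustrative values only.
attested = AgentScope(
    tools=frozenset({"search_flights"}),
    data_sources=frozenset({"crm"}),
    autonomy_level=1,
)
observed = AgentScope(
    tools=frozenset({"search_flights", "book_flight"}),
    data_sources=frozenset({"crm", "billing"}),
    mcp_servers=frozenset({"payments-mcp"}),
    autonomy_level=2,
)
findings = detect_drift(attested, observed)
```

Each key in `findings` is a scope change that nobody has reviewed yet. An empty result means the agent is still operating inside what was attested.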

Runtime guardrails

The first layer of agent security is a runtime policy engine that decides, per tool call, whether the call is allowed. Inputs include the user's identity, the agent's identity, the tool being called, the data being requested, the time of day, and the conversation context. The possible outcomes are allow, deny, redact, or escalate to a human. Guardrails sit outside the model's reasoning, so a compromised model cannot route around them.
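The per-call decision can be sketched as a plain function that runs outside the model. This is a minimal illustration, not a product API: the `ToolCall` fields, the policy tables, the role name `pii_reader`, and the business-hours rule are all assumptions, and conversation context is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REDACT = "redact"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ToolCall:
    user: str
    user_roles: frozenset
    agent: str
    tool: str
    data_classes: frozenset  # e.g. {"pii"}; the labels are assumptions
    hour_utc: int


# Hypothetical policy: which agent may call which tools.
ALLOWED_TOOLS = {"travel-agent": {"search_flights", "book_flight"}}
# Hypothetical rule: hard-to-reverse tools go to a human outside 08:00-18:00 UTC.
HIGH_IMPACT_TOOLS = {"book_flight"}


def evaluate(call: ToolCall) -> Decision:
    """Decide one tool call before it executes."""
    if call.tool not in ALLOWED_TOOLS.get(call.agent, set()):
        return Decision.DENY
    if call.tool in HIGH_IMPACT_TOOLS and not 8 <= call.hour_utc < 18:
        return Decision.ESCALATE
    if "pii" in call.data_classes and "pii_reader" not in call.user_roles:
        return Decision.REDACT
    return Decision.ALLOW
```

The function takes only facts about the call and never consults the model, which is what keeps a compromised model from talking its way past the check.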

Guardrails should also include a kill switch: a single operator action that suspends a named agent or a class of agents, immediately. The kill switch has to be tested before an incident. An untested kill switch is a documented control with no operational reality.
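A kill switch can be as simple as a suspension registry that every tool call consults first. The sketch below is an in-memory illustration with hypothetical names (`KillSwitch`, `guarded_call`, `AgentSuspended`). A real deployment would need shared, durable state so the suspension reaches every place the agent runs.

```python
import threading


class AgentSuspended(RuntimeError):
    """Raised when a suspended agent attempts a tool call."""


class KillSwitch:
    """Hypothetical in-memory registry of suspended agents and agent classes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._agents = set()
        self._classes = set()

    def suspend_agent(self, agent_id: str) -> None:
        with self._lock:
            self._agents.add(agent_id)

    def suspend_class(self, agent_class: str) -> None:
        with self._lock:
            self._classes.add(agent_class)

    def is_suspended(self, agent_id: str, agent_class: str) -> bool:
        with self._lock:
            return agent_id in self._agents or agent_class in self._classes


def guarded_call(switch, agent_id, agent_class, tool, *args):
    """Run the tool only if neither the agent nor its class is suspended."""
    if switch.is_suspended(agent_id, agent_class):
        raise AgentSuspended(f"{agent_id} is suspended")
    return tool(*args)
```

A drill is then a few lines: suspend a test agent, attempt a call, and confirm that `AgentSuspended` is raised. That is the kind of test the paragraph above asks for before an incident.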

MCP security

MCP is the protocol most enterprises will standardize on for connecting agents to tools. The security model the protocol shipped with is permissive by default. Agent security has to harden it.

Every MCP server is treated as untrusted infrastructure. Connections are mediated through a layer that does three things:

Server verification. Confirms the server's identity, monitors for changes in its tool catalog, and flags new or modified tool descriptions for review before the agent sees them.

Credential scoping. Issues a single-purpose, scoped token per call rather than a long-lived broad token per session. The compromise of one call's token grants access to one call's worth of data, not the agent's full credential surface.

Content firewall. Inspects every payload (tool description on registration, tool response at runtime) for embedded instructions or known injection patterns and removes them before the agent reads the content. Stripping at this layer matters because the agent has no other defense.
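The three functions of the mediation layer can be sketched with the Python standard library. This is a simplified illustration, not the MCP specification or any vendor's implementation: the tool catalog is assumed to be a list of dicts with a `name` key, the token format is a hypothetical HMAC-signed claim set, and the two injection patterns are placeholders for much broader detection.

```python
import base64
import hashlib
import hmac
import json
import re
import time
from typing import Optional


def tool_fingerprint(tool: dict) -> str:
    """Stable hash of one tool's name, description, and schema."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tools_needing_review(approved: dict, catalog: list) -> list:
    """Names of tools that are new or changed since review.

    `approved` maps tool name to the fingerprint recorded at review time.
    """
    return sorted(t["name"] for t in catalog
                  if approved.get(t["name"]) != tool_fingerprint(t))


def mint_call_token(secret: bytes, agent: str, tool: str, resource: str,
                    ttl_seconds: int = 30, now: Optional[float] = None) -> str:
    """Issue a token bound to one agent, one tool, one resource, and a short expiry."""
    issued = time.time() if now is None else now
    claims = {"agent": agent, "tool": tool, "resource": resource,
              "exp": issued + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig


def token_permits(secret: bytes, token: str, tool: str, resource: str,
                  now: Optional[float] = None) -> bool:
    """True only if the token is authentic, unexpired, and matches this exact call."""
    body, _, sig = token.rpartition(".")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    return (claims["tool"] == tool and claims["resource"] == resource
            and current < claims["exp"])


# Illustrative patterns only; real detection needs far more than two regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"\balso send\b.*\bto\b", re.I),
]


def strip_injections(text: str) -> tuple:
    """Drop sentences that match a known pattern; return (clean_text, removed)."""
    kept, removed = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if any(p.search(sentence) for p in INJECTION_PATTERNS):
            removed.append(sentence)
        else:
            kept.append(sentence)
    return " ".join(kept), removed
```

Fingerprinting the whole tool entry means a one-word change to a description is enough to send the tool back for review. Binding the token to a single tool and resource means a stolen token is useless for any other call, and it expires within seconds.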

A2A security

When agents talk to each other, the security layer carries the originating user identity and a purpose claim across every hop. The receiving agent or service enforces policy against the real principal. Audit attributes actions to the human who initiated the chain.
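Identity and purpose propagation can be pictured as an envelope that travels unchanged across hops while each agent appends itself to the chain. The sketch below is a minimal illustration with hypothetical names (`DelegationContext`, `ROW_POLICY`). It omits signing, which a real system would need so that no intermediate agent can forge the originating user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelegationContext:
    """Hypothetical envelope carried across every A2A hop."""
    user: str         # the human who initiated the chain
    purpose: str      # why the chain was started
    hops: tuple = ()  # agents the request has passed through, in order

    def via(self, agent: str) -> "DelegationContext":
        """Record a hop without altering the originating user or purpose."""
        return DelegationContext(self.user, self.purpose, self.hops + (agent,))


# Hypothetical data-layer policy keyed on the real principal and the purpose.
ROW_POLICY = {("alice", "quarterly-report"): {"emea"}}


def query_regions(ctx: DelegationContext, requested: set) -> set:
    """Return only the regions the originating user may see for this purpose."""
    allowed = ROW_POLICY.get((ctx.user, ctx.purpose), set())
    return requested & allowed


def audit_line(ctx: DelegationContext, action: str) -> str:
    """Attribute an action to the human at the head of the chain."""
    chain = " -> ".join(ctx.hops) or "direct"
    return f"user={ctx.user} purpose={ctx.purpose} via={chain} action={action}"


ctx = DelegationContext("alice", "quarterly-report").via("agent-a").via("agent-b")
visible = query_regions(ctx, {"emea", "apac"})
```

The data source filters on `ctx.user`, not on whichever service account agent B happens to run as, so existing row-level policy keeps working two hops away from the user.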

Outbound A2A payloads also need content controls. Sensitive data fields (PII, PCI, PHI) should be redacted by policy on the way out, with the redaction logged. Without this, sensitive content can travel between agents in ways that none of the existing DLP infrastructure was designed to see.
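Outbound redaction with a log entry per redaction might look like the sketch below. It is an illustration under assumptions: the field-to-classification map and the email pattern are placeholders for real policy, and only flat dictionaries are handled.

```python
import re

# Hypothetical classification map; in practice this comes from policy.
SENSITIVE_FIELDS = {"ssn": "PII", "card_number": "PCI", "diagnosis": "PHI"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_outbound(payload: dict, log: list) -> dict:
    """Return a copy that is safe to send to a peer agent and log each redaction."""
    clean = {}
    for key, value in payload.items():
        if key in SENSITIVE_FIELDS:
            # Whole field is classified as sensitive: replace the value.
            clean[key] = "[REDACTED]"
            log.append({"field": key, "class": SENSITIVE_FIELDS[key]})
        elif isinstance(value, str) and EMAIL.search(value):
            # Free text containing an email address: mask only the address.
            clean[key] = EMAIL.sub("[REDACTED]", value)
            log.append({"field": key, "class": "PII"})
        else:
            clean[key] = value
    return clean
```

The function returns a copy instead of mutating the payload, so the sending agent keeps the original for its own work while the peer receives only the redacted version.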

Defense in depth

No single control catches every failure. The structure that works has the guardrail layer protecting against tool-level misuse, the MCP layer protecting against compromised infrastructure, the A2A layer protecting against identity loss and uncontrolled data flow between agents, and the kill switch as the final option when something has gone wrong. Each layer protects against a different failure mode, and the agent has to pass all of them to act.

Failure mode | Primary control | Reinforcing control
Prompt injection via MCP tool description | MCP content firewall | Runtime guardrails on the resulting tool call
Credential overreach across MCP | Single-purpose scoped tokens | MCP server verification
Identity loss through delegation | A2A identity and purpose propagation | Data-layer policy enforcement against real principal
Lateral movement between agents | A2A authentication and scope | Runtime guardrails on inbound A2A messages
Drift and silent scope expansion | Continuous discovery, telemetry-based drift detection | Periodic re-attestation of agent scope

Security across the agent lifecycle

Design. Scope, prohibited actions, autonomy level, and kill-switch ownership are documented. The kill switch is a real button operated by a real person.

Development. MCP connections, A2A interfaces, and tool integrations are wired through the security layer rather than directly. Direct connections are the most common reason a control fails to apply later.

Pre-deployment testing. Prompt injection scenarios, credential-overreach scenarios, and A2A identity tests are run before launch. A red team that cannot land a prompt injection through MCP in test is more reassuring than any policy document.

Deployment. Guardrails active. Content firewall active. Tokens scoped. Identity propagating. Kill switch reachable. None of these are optional for a production agent reaching regulated data.

Runtime. Telemetry feeds the security layer continuously. Anomalies escalate. The kill switch is rehearsed.

Continuous monitoring. Drift detection covers new tools, new data reach, new MCP servers, new A2A peers. Each is a scope change that needs review, not auto-approval.

Decommissioning. Tokens revoked at every layer. MCP connections closed. A2A peer relationships removed. The agent's identity is retired in the IdP. Decommissioning is verified through the discovery and observability streams, not just from a ticket marked closed.

Where agent security aligns with standards

Framework | Security's contribution
NIST AI RMF | Manage function: runtime controls, incident response, kill switch.
ISO/IEC 42001 | Operational controls, intervention mechanisms, and incident handling.
EU AI Act | Article 14 human oversight (kill switch, escalation thresholds), Article 15 robustness for high-risk systems.
SOC 2 / ISO 27001 | Logical access controls, change management, and incident response controls extended to agents.
PCI-DSS / HIPAA / GLBA | Scoped access to regulated data, identity attribution, redaction on outbound A2A.
NYDFS Part 500, DORA | Incident response readiness and operational resilience controls applied to agent-mediated workflows.

Frequently asked questions

How is agent security different from a prompt firewall?

A prompt firewall inspects prompts and responses for harmful content. The agent security layer described here does more: it scopes credentials per call, verifies MCP servers, propagates identity through delegation, and runs a content firewall over tool descriptions and tool responses. A prompt firewall covers a slice of the surface. Agent security covers the protocol and identity layers as well.

Why treat every MCP server as untrusted?

Because they often are. Many MCP servers are open source, run by third parties, or built quickly inside the enterprise without a security review. The protocol does not include a trust model. Assuming trust is how prompt injection through tool descriptions becomes catastrophic. Assuming distrust is how the blast radius stays small.

Do these controls add latency?

Some. The controls add latency per tool call and per A2A hop. In practice, the latency is dominated by the model itself, and the control overhead is small relative to model inference time. The question worth asking is not whether the latency is zero but whether it is acceptable for the value of the action.

Where do native model guardrails and prompt-level products fit?

Inside the agent security layer, often as detection inputs. Native model guardrails reduce some categories of harmful output. Prompt-level products add detection for known injection patterns. Neither replaces the protocol and identity controls. Defense in depth means all of them, used where each is strong.

What should a kill switch look like?

The kill switch should be a single, named operator action that suspends a named agent or a class of agents and propagates the suspension into every place the agent operates (the platform, the MCP layer, the A2A peers). It should be tested on a schedule. An untested kill switch is the most common control gap auditors find when they look for it.
