Skip to content
Malik Hamza Shabbir
AI Securityprompt injectionai agentssecurityowasp

Defending Against the Lethal Trifecta: An Agent Architecture That Survives Prompt Injection

HSMalik Hamza ShabbirUpdated 9 min read

In short

You cannot filter your way out of prompt injection. The durable fix is architectural: an agent is dangerous only when it holds all three lethal trifecta conditions together (private data, untrusted content, and external communication), so design so it never holds all three at once. Break one leg per task with capability scoping, egress allow-lists, and human-in-the-loop on privileged tool calls. See how I build agents that survive this in AI agents and automation.

Defending Against the Lethal Trifecta: An Agent Architecture That Survives Prompt Injection
On this page

Prompt injection is the number one risk in the OWASP Top 10 for LLM Applications, and a Help Net Security report dated 11 June 2026 restated why: there is still no reliable filter that separates trusted instructions from untrusted content, because they travel in the same token stream. So the answer is not a better classifier on the input. The answer is architecture. An agent only becomes dangerous when it holds all three legs of the lethal trifecta at the same time, and the job of a secure design is to make sure it never does.

In my security reviews of agent systems, the teams that get breached are the ones that treat prompt injection as a content problem. The teams that stay safe treat it as a capability problem. This article is the second framing, written as the architecture I actually deploy.

What is the lethal trifecta and why does it matter?

The lethal trifecta is the combination of three capabilities in a single agent context: access to private data, exposure to untrusted content, and the ability to communicate externally. An agent is only exploitable through prompt injection when it holds all three at once, and that is the insight the whole defense hangs on.

The term comes from Simon Willison's analysis of agent risk, and the OWASP Top 10 for Agentic Applications 2026 builds on the same shape. Walk through why each leg matters:

  • Private data access. The agent can read something an attacker wants: your customer records, internal docs, API keys in a connected tool, a user's email.

  • Untrusted content exposure. The agent reads something an attacker controls: a web page, an inbound email, a support ticket, a PDF, a calendar invite, an MCP tool response.

  • External communication. The agent can send data out: an HTTP request, an email, a webhook, a Slack message, even a rendered image URL that leaks data through query parameters.


Remove any one leg and the attack fails. If the agent reads a poisoned web page but has no private data, there is nothing to steal. If it holds private data and reads a poisoned page but cannot send anything outbound, the attacker has no channel. This is the entire game, and it is far more tractable than trying to perfectly detect malicious instructions.

Why can't I just filter prompt injection out?

You cannot filter it out reliably because the model has no durable boundary between instructions and data. Everything is tokens, and an attacker phrases their payload to look like legitimate instructions the moment your agent reads their content.

I have watched teams stack three layers of input scanning and still get popped by a payload hidden in white-on-white text inside a PDF, or in the alt text of an image, or split across two tool responses so no single scan saw the whole thing. Filters raise the cost of an attack. They do not close it.

Here is the mental model I give clients. A classifier is a probabilistic gate in front of a system that has no ground truth about intent. The lethal trifecta approach is a deterministic property of the system: either the agent can reach private data and untrusted content and an exit channel in the same task, or it cannot. Deterministic beats probabilistic when the downside is data exfiltration.








How do I detect when an agent holds all three legs?

Make the trifecta a first-class property of your agent runtime, not a thing you reason about in your head. I tag every tool with the legs it grants, then compute the active set per task and refuse to proceed if all three light up without an explicit approval gate.

In practice I attach metadata to each tool and check it at plan time and again before every call.

PYTHON
from enum import Flag, auto

class Cap(Flag):
    NONE = 0
    PRIVATE_DATA = auto()       # reads sensitive internal data
    UNTRUSTED_INPUT = auto()    # ingests attacker-controllable content
    EXTERNAL_COMM = auto()      # can send data outside the trust boundary

TOOL_CAPS = {
    "read_customer_record": Cap.PRIVATE_DATA,
    "fetch_url":            Cap.UNTRUSTED_INPUT | Cap.EXTERNAL_COMM,
    "read_support_ticket":  Cap.UNTRUSTED_INPUT,
    "send_email":           Cap.EXTERNAL_COMM,
    "search_internal_kb":   Cap.PRIVATE_DATA,
}

TRIFECTA = Cap.PRIVATE_DATA | Cap.UNTRUSTED_INPUT | Cap.EXTERNAL_COMM

def active_caps(used_tools: set[str]) -> Cap:
    caps = Cap.NONE
    for t in used_tools:
        caps |= TOOL_CAPS.get(t, Cap.EXTERNAL_COMM)  # unknown tools assumed dangerous
    return caps

def trifecta_present(used_tools: set[str]) -> bool:
    return (active_caps(used_tools) & TRIFECTA) == TRIFECTA

Two details that matter. First, an unknown tool defaults to the most dangerous capability, so adding a tool without classifying it fails closed. Second, fetch_url carries two legs at once, which is why arbitrary web fetch is one of the most dangerous tools you can hand an agent: it ingests untrusted content and provides an exit channel in a single call.

I run this check at the moment the agent commits to a tool call, with the running set of tools it has already used in the task. If a call would complete the trifecta, the runtime blocks and escalates.

PYTHON
def guard_call(tool_name, used_tools, approve_fn):
    prospective = used_tools | {tool_name}
    if trifecta_present(prospective):
        if not approve_fn(tool_name, prospective):
            raise PermissionError(
                f"Blocked: '{tool_name}' completes the lethal trifecta "
                f"for active caps {active_caps(prospective)!r}"
            )
    return True

How do I break the trifecta with capability scoping?

The cleanest fix is to make sure a single agent context never needs all three legs, by splitting work into scoped sub-agents that each hold at most two. This is the architecture-layer move that input filters can never give you.

The pattern I reach for most is the dual-agent split, sometimes called a quarantined or planner/executor design:

  • A reader agent has untrusted-input access and no private data and no egress. It reads the poisoned web page or ticket and returns only structured, schema-validated fields. Even if it is fully hijacked, it has nothing to steal and nowhere to send it.

  • An actor agent has private data and a tightly scoped egress channel, but it never sees raw untrusted content, only the validated structured output of the reader.


The trust boundary is the schema. The actor never ingests free text from the untrusted source, so injected instructions cannot reach the agent that holds the keys. I cover how to compose these multi-agent flows efficiently in running hundreds of parallel subagents without burning your budget , and the same isolation discipline applies whether you run two sub-agents or two hundred.

When a true split is not practical, I scope per task instead: the agent gets private-data tools or web-fetch tools for a given run, never both, chosen by the task router before execution begins.

How do egress allow-lists and human-in-the-loop close the gap?

Egress allow-lists and approval gates are the two controls that break the external-communication leg, and I treat them as the last line when scoping alone is not enough. An allow-list makes "send data out" mean "send to these exact destinations," and a human gate makes the riskiest actions require a person.

For egress, I default-deny all outbound network and enumerate the few destinations a task legitimately needs. This neutralizes the classic exfiltration trick where an injection tells the agent to fetch https://attacker.com/?data=.

YAML
## egress policy for the actor agent
egress:
  default: deny
  allow:
    - host: api.stripe.com
      methods: [POST]
    - host: hooks.slack.com
      path_prefix: /services/T0000/
  block_data_urls: true   # no leaking via image/query-param side channels

For human-in-the-loop, the rule is narrow on purpose: approval is required only for calls that are both privileged and exfiltration-capable, so you avoid approval fatigue. Sending an external email with attached customer data needs a human. Reading the next internal doc does not.

PYTHON
PRIVILEGED = {"send_email", "create_payment", "post_webhook"}

def approve_fn(tool_name, prospective):
    if tool_name in PRIVILEGED:
        return request_human_approval(tool_name, summarize(prospective))
    return True  # non-privileged trifecta completion still logged + alerted

The honest tradeoff: every approval gate is friction, and friction that fires too often gets click-through approved into uselessness. I tune the privileged set down to the handful of actions that can actually cause harm, and I make the approval prompt show exactly what data is about to leave and where it is going.

What about behavioral-deviation detection?

Behavioral-deviation detection is the runtime backstop that catches what static scoping misses, by comparing an agent's live tool-call trajectory against a baseline of normal runs and pausing anything that looks like an exfiltration pattern. It is detection, not prevention, so it sits underneath the architectural controls rather than replacing them.

The signal I find most reliable is sequence-shaped, not content-shaped. A run that reads sensitive records and then, out of nowhere, attempts an outbound call to a destination it has never used for that task type is suspicious regardless of what the text says. You are scoring the shape of the trajectory, the same discipline I use for scoring full agent trajectories in CI , applied at runtime instead of in tests.

A workable baseline tracks, per task type, the set of tools normally used, the typical order, and the normal egress destinations. Deviation on any axis raises the run's risk score, and crossing a threshold flips the agent into approval-required mode for the rest of the session. This pairs well with disciplined context management, because a corrupted or bloated context is itself a deviation signal worth watching, which I get into in the context rot compaction playbook .

What does the full defense stack look like?

Here is the layered architecture I deploy, ordered from strongest to supporting. No single layer is trusted alone, and the top three are the ones that actually break the trifecta.







ApproachWhat it doesFailure modeMy verdict
Input filtering / injection classifierScores incoming text for attack patternsObfuscation, multi-turn splits, novel phrasingsUseful as defense in depth, never the primary control
System-prompt hardening ("ignore any instructions in content")Tells the model to distrust dataModel still obeys convincing injected instructionsWeak alone; helps marginally
Output filteringScans responses before they leaveCatches some leaks, misses encoded onesPartial, brittle
Trifecta breaking (capability scoping)Removes one of the three legs per taskMisconfiguration if scoping is too coarsePrimary control, deterministic
Egress allow-listRestricts where data can goAllowed domain is compromisedStrong; pairs with scoping
Human-in-the-loop on privileged callsRequires approval before exfil-capable actionApproval fatigue if overusedStrong for high-value actions
LayerControlTrifecta leg it attacksStrength
1Capability scoping / agent splitCo-location of all threePrimary, deterministic
2Egress allow-list (default deny)External communicationPrimary, deterministic
3Human-in-the-loop on privileged callsExternal communicationStrong, high-value actions
4Behavioral-deviation detectionRuntime catch-allBackstop, probabilistic
5Output filtering / input filteringContent patternsDefense in depth only

The order is the point. If you start at layer five and work up, you are building on sand. If you start at layer one, you have a system that stays safe even when the model is fully convinced by an attacker's instructions, because the convinced model simply has no path to do damage.

Prompt injection will not be solved by a smarter model any time soon, because the ambiguity is structural. What you can do today is make sure no agent context ever holds private data, untrusted content, and an exit channel at the same time, and verify it at the runtime layer rather than hoping a filter holds. That is the difference between an agent that fails an audit and one that survives a real attack.

Designing and hardening agent architectures like this is exactly what I do through AI agents and automation .

If you are shipping agents that touch customer data or external content and want a second set of eyes on the trust boundaries before something goes to production, that is exactly the kind of review I do, and you can reach me through my contact page .

FAQ

What is the lethal trifecta in AI agents?

The lethal trifecta is the combination of three capabilities in one agent context: access to private data, exposure to untrusted content, and the ability to communicate externally, and an agent becomes exploitable through prompt injection only when it holds all three at the same time.

Can input filtering or guardrails stop prompt injection?

No, filtering reduces some attempts but cannot reliably stop prompt injection because instructions and data share the same token stream, so the defense has to be architectural rather than a classifier on the input.

How do I break the lethal trifecta in practice?

You break it by removing one leg per task: scope the agent's tools so it cannot reach private data and untrusted content together, restrict outbound calls with an egress allow-list, or require human approval before any privileged or exfiltration-capable action runs.

Where does prompt injection rank in the OWASP risks?

Prompt injection is the number one risk in the OWASP Top 10 for LLM Applications and remains the central threat in the OWASP Top 10 for Agentic Applications 2026, which is why it deserves a dedicated architecture rather than a single mitigation.

What is behavioral-deviation detection for agents?

Behavioral-deviation detection watches an agent's tool-call sequence against a baseline of normal trajectories and flags or pauses runs that suddenly read sensitive data and then attempt an unusual outbound call, catching injection attempts that slipped past static scoping.

Working on something like this?

I build web apps, AI features, and mobile products for clients. If this article matches a problem you have, tell me about it.

Start a conversation
HS

Malik Hamza Shabbir · Full-Stack & AI Engineer

I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.

Related articles