AI Agent Architecture: The Trust Boundary Model

Most unsafe AI agents fail because they are given too much authority inside one blurry runtime.

A chatbot can produce a bad answer. An AI agent can read untrusted content, call tools, update records, send messages, trigger workflows, and store memory. That means the architecture must clearly separate what the agent can follow, read, call, and change.

That is the purpose of the Agent Trust Boundary Model — a practical AI agent architecture framework for designing agents that interact safely with data, tools, and external systems. It draws on OWASP's prompt injection guidance and OpenAI's Agents SDK design, applied as an architecture lens rather than a checklist.

The model is simple:

BoundaryCore question
Instruction boundaryWhat is the agent allowed to follow?
Data boundaryWhat is the agent allowed to read?
Tool boundaryWhat is the agent allowed to call?
Action boundaryWhat is the agent allowed to change?
If those four boundaries are unclear, the agent can confuse data with commands, tool output with authority, memory with truth, or suggestions with executable actions.

That is how agent demos turn into production risks.


Table of Contents

  • What is the Agent Trust Boundary Model?
  • Why AI agents need trust boundaries
  • AI agent vs chatbot: why the security model changes
  • Boundary 1: What instructions can the agent follow?
  • Boundary 2: What data can the agent read?
  • Boundary 3: What tools can the agent call?
  • Boundary 4: What actions can the agent take?
  • The supporting boundaries: memory, state, identity, and observability
  • What usually fails in unsafe AI agent designs?
  • Example architecture: support-ticket AI agent
  • Agent Trust Boundary Checklist
  • Frequently Asked Questions About AI Agent Architecture
  • Key Takeaways: AI Agent Architecture and Trust Boundaries
---

What is the Agent Trust Boundary Model?

The Agent Trust Boundary Model is a practical AI agent architecture model for designing systems that interact with data, tools, and external environments. Introduced by Aakash Ahuja, it separates an AI agent's environment into four core boundaries:

  • Instructions — what the agent is allowed to follow.
  • Data — what the agent is allowed to inspect.
  • Tools — what the agent is allowed to call.
  • Actions — what the agent is allowed to change.
This matters because AI agents do not only generate text. They may inspect documents, call APIs, use tools, trigger workflows, create tasks, update records, and communicate with users.

OpenAI's Agents SDK describes agents as LLMs configured with instructions, tools, and optional runtime behavior such as handoffs, guardrails, and structured outputs. The same documentation points to state, integrations, observability, guardrails, and human review as agent workflows grow more complex. (OpenAI Developers)

That tells us something important:

An AI agent is not only a prompt. It is a runtime system.

A runtime system needs boundaries.

For a broader look at the full design process, see How to Design AI Agents.


Why AI agents need trust boundaries

Prompt injection is one reason AI agents need trust boundaries, but it is not the only reason.

OWASP describes prompt injection as a vulnerability where malicious input manipulates an LLM's behavior. OWASP's prevention guidance also highlights the core design weakness: natural-language instructions and data are often processed together without clear separation. (OWASP Gen AI Security Project)

That is exactly what trust boundaries solve.

They force the system designer to answer:

  • Which instructions have authority?
  • Which content is only data?
  • Which tools are available?
  • Which actions need approval?
  • Which outputs can be trusted?
  • Which decisions must be logged?
  • Which memory can persist?
Without these boundaries, the agent may treat untrusted content as if it had authority.

That is the central failure.


AI agent vs chatbot: why the security model changes

A chatbot and an AI agent may both use an LLM, but their risk models are different.

SystemWhat it usually doesMain risk
ChatbotGenerates conversational responsesWrong, unsafe, or misleading output
RAG assistantAnswers using retrieved contentRetrieval errors, source confusion, prompt injection
Workflow automationExecutes predefined rulesBad rules, broken integrations, process mismatch
AI agentChooses steps, calls tools, uses context, may actUnsafe execution, tool misuse, authority confusion
Autonomous agentActs with limited human reviewOver-autonomy, weak accountability, high blast radius
The key difference is action.

A chatbot may misread an email.

An agent may misread an email and then update a CRM record, send a reply, or escalate a ticket incorrectly.

That is why AI agent architecture needs more than prompts and retrieval.

It needs execution boundaries.


Boundary 1: What instructions can the agent follow?

The instruction boundary defines what the agent is allowed to obey.

This includes:

  • system instructions,
  • developer instructions,
  • policy rules,
  • task-specific constraints,
  • approved workflow rules,
  • role-based limits.
The instruction boundary should answer:

Which instructions are authoritative, and which text must never be treated as instruction?

This is where many agent designs fail.

They place system instructions, user messages, retrieved content, tool output, and memory into the same context window without clearly separating authority levels.

A safer model treats instructions as privileged.

Trusted instructions

Trusted instructions may include:

  • system prompt,
  • developer policy,
  • workflow rules,
  • tool-use rules,
  • compliance constraints,
  • approval requirements.

Untrusted instructions

Untrusted "instructions" may appear inside:

  • emails,
  • webpages,
  • PDFs,
  • support tickets,
  • customer messages,
  • database records,
  • tool outputs,
  • retrieved documents.
Those may look like instructions, but they should not carry authority.

A simple design rule:

The agent may read untrusted text, but it must not obey instructions found inside untrusted text.

Boundary 2: What data can the agent read?

The data boundary defines what information the agent may inspect.

Data can include:

  • user input,
  • documents,
  • emails,
  • webpages,
  • knowledge-base articles,
  • database records,
  • CRM notes,
  • support tickets,
  • tool outputs,
  • logs,
  • reports.
The data boundary should answer:

What can the agent read, and what trust label does that content carry?

Not all data is equal.

Data typeTrust posture
System policyHigh trust
Internal approved workflow rulesHigh trust
User messageConditional trust
Customer emailUntrusted
Webpage contentUntrusted
Uploaded PDFUntrusted
Tool outputUsually untrusted until validated
Long-term memoryDepends on source and write controls
The mistake is assuming that because data is useful, it is trustworthy.

Useful data can still be hostile, stale, incomplete, or misleading.

The model may use data to summarize, classify, extract, compare, or reason.

But data should not define the agent's authority.


Boundary 3: What tools can the agent call?

The tool boundary defines what external capabilities the agent can use.

Tools may include:

  • web search,
  • database query,
  • CRM lookup,
  • email sending,
  • ticket update,
  • file access,
  • calendar scheduling,
  • payment action,
  • user provisioning,
  • deployment scripts,
  • internal APIs.
OpenAI's function-calling documentation describes the basic tool-calling loop: the model receives tool definitions, may return a tool call, the application executes the tool, and the tool output is sent back to the model. (OpenAI Developers)

That means tool execution is not magic.

The application decides what tools exist, when to execute them, how to validate arguments, and what to do with outputs.

The tool boundary should answer:

  • Which tools can this agent access?
  • Is the tool read-only or write-capable?
  • What arguments are allowed?
  • What user/tenant/resource scope applies?
  • Does the tool require approval?
  • Can the tool output influence future tool use?
  • Is every tool call logged?

Tool access should be scoped

A support classification agent may need:

  • read ticket,
  • classify category,
  • suggest priority,
  • route to queue.
It probably does not need:

  • delete ticket,
  • refund customer,
  • change account status,
  • send legal communication,
  • update billing information.
The safest default is not:

Give the agent every tool it might someday need.

The safer default is:

Give the agent the minimum tool set needed for the current task.

Boundary 4: What actions can the agent take?

The action boundary defines what the agent is allowed to change.

Actions are different from tools.

A tool is a capability.

An action is an effect.

For example:

ToolPossible action
CRM APIUpdate customer record
Email APISend message
Ticketing APIChange priority
Calendar APISchedule meeting
IAM APIGrant access
Database toolModify data
Deployment toolTrigger release
The action boundary should answer:

  • Can the agent only recommend?
  • Can it draft but not send?
  • Can it update records?
  • Can it trigger workflows?
  • Can it make irreversible changes?
  • Which actions require human approval?
  • Which actions are forbidden?
  • Which actions are reversible?
  • Which actions are logged?
This is where autonomy must be controlled.

The autonomy ladder

LevelAgent roleHuman roleExample
ObserveReads and summarizesReviews outputSummarize tickets
RecommendSuggests next actionDecidesSuggest priority
DraftCreates proposed actionApprovesDraft customer reply
Execute with approvalPerforms approved actionApproves firstSend reply after review
Bounded executionExecutes low-risk actionReviews logsRoute ticket by category
Full autonomyActs independentlyException reviewRare; high control required
Most production agents should move up this ladder slowly.

Do not jump from "summarizes content" to "can execute workflow changes" without intermediate controls.


The supporting boundaries: memory, state, identity, and observability

The four main boundaries are instructions, data, tools, and actions.

But production-grade AI agents also need supporting boundaries.

Memory boundary

The memory boundary defines what the agent can retain across tasks or sessions.

It should answer:

  • What can be stored?
  • Who approved the memory write?
  • What is the source?
  • How long does it persist?
  • Can untrusted content become long-term memory?
  • Can memory affect future authority?
A dangerous pattern is allowing external content to become persistent memory without source tracking.

That can turn one unsafe interaction into future unsafe behavior.

State boundary

State is what the agent needs during the current task.

State may include:

  • current ticket,
  • current user request,
  • current workflow step,
  • temporary reasoning context,
  • selected tool result,
  • current approval status.
State should not automatically become long-term memory.

A simple distinction:

State is task-local. Memory is cross-task.

Many weak agent designs mix the two.

Identity boundary

The identity boundary defines who or what the agent is acting as.

Questions:

  • Does the agent have its own service identity?
  • Is it acting on behalf of a user?
  • Are permissions inherited from the user?
  • Are permissions scoped by tenant, role, resource, and action?
  • Are credentials hidden from the model?
  • Are tool calls attributed correctly?
An enterprise AI agent should not casually inherit broad human permissions. It should have scoped authority.

For a deep look at how identity and permission scoping changes as agents move from personal to enterprise deployment, see From Personal AI Agent to Enterprise: What Actually Breaks.

Observability boundary

The observability boundary defines what the system records.

NIST's AI Risk Management Framework is designed to help organizations manage AI risks across design, deployment, and operation; for agent systems, that risk mindset translates into traceability, reviewability, and operational monitoring. (NIST)

At minimum, logs should capture:

  • user request,
  • agent identity,
  • tool definitions available,
  • tool call requested,
  • tool arguments,
  • tool result,
  • approval decision,
  • final action,
  • timestamp,
  • error path,
  • fallback path.
An agent that cannot be reviewed after failure is not production-ready.


What usually fails in unsafe AI agent designs?

Unsafe agent designs usually fail in predictable ways.

FailureWhat happensBetter design
No instruction boundaryRetrieved text can override behaviorSeparate trusted instructions from untrusted content
No data boundaryEmail/web/PDF content is over-trustedLabel external content as untrusted
Broad tool accessAgent can call unnecessary toolsScope tools by task
No action boundaryAgent can change systems too freelyRequire approvals for high-impact actions
No memory controlPoisoned or stale content persistsTrack memory source and expiry
No identity boundaryAgent acts with broad human permissionsUse scoped service identity
No audit logsFailures cannot be reconstructedLog tool calls, approvals, and actions
No fallback pathAgent guesses under uncertaintyEscalate or stop safely
The root pattern is the same:

The agent is allowed to do more than the system can safely govern.

That is not an AI problem. That is an AI agent architecture problem.


Example architecture: support-ticket AI agent

Consider a support-ticket AI agent.

The business goal:

  • classify incoming tickets,
  • identify urgency,
  • suggest routing,
  • draft a response,
  • escalate high-risk cases.

Unsafe design

The unsafe version:

  • Reads incoming ticket.
  • Sends full ticket body into the model.
  • Gives the model ticket-update tools.
  • Allows automatic priority change.
  • Allows automatic external reply.
  • Stores summary in memory.
  • Logs only final output.
This is unsafe because the ticket body is untrusted content.

The agent may treat instructions inside the ticket as commands.

Safer design using the Agent Trust Boundary Model

BoundarySafer design
Instruction boundarySystem rules define what the agent can and cannot do
Data boundaryTicket body is marked as untrusted customer content
Tool boundaryAgent gets only classify, suggest queue, draft reply tools
Action boundaryExternal replies require human approval
Memory boundaryNo long-term memory write without review
Identity boundaryAgent uses scoped service identity
Observability boundaryEvery tool call and decision is logged

Safer workflow

  • Ticket enters system.
  • Ticket body is marked as untrusted content.
  • Agent classifies topic and urgency.
  • Agent suggests queue.
  • Agent drafts response.
  • Human approves or edits response.
  • Approved action is executed.
  • Tool calls and decisions are logged.
  • High-risk or uncertain cases are escalated.
This design still gives operational value.

But it avoids giving the agent uncontrolled authority.


Agent Trust Boundary Checklist

Use this checklist before deploying any AI agent that reads data, calls tools, or takes action.

Instruction boundary

  • What instructions are authoritative?
  • Can external content override system or developer instructions?
  • Are policy rules separated from user-controlled content?
  • Are instructions versioned and reviewable?

Data boundary

  • What content can the agent read?
  • Is external content marked as untrusted?
  • Are emails, webpages, PDFs, tickets, and tool outputs treated as data?
  • Can untrusted content influence permissions or goals?

Tool boundary

  • What tools can the agent call?
  • Are tools read-only or write-capable?
  • Are tool arguments validated?
  • Are tools scoped by user, tenant, role, and resource?
  • Are high-risk tools approval-gated?

Action boundary

  • What can the agent change?
  • Which actions are reversible?
  • Which actions are irreversible?
  • Which actions require human approval?
  • Which actions are forbidden?

Memory boundary

  • What can be stored across sessions?
  • Is memory source tracked?
  • Can untrusted content become memory?
  • Is there an expiry or review process?

Identity boundary

  • Who is the agent acting as?
  • Does the agent have scoped credentials?
  • Are credentials hidden from the model?
  • Are actions attributable to agent, user, and system?

Observability boundary

  • Are tool calls logged?
  • Are approvals logged?
  • Are denied actions logged?
  • Can failures be replayed or reconstructed?
  • Are confidence, uncertainty, and fallback paths visible?
---

How this model connects to prompt injection

Prompt injection often succeeds when boundaries are weak.

If untrusted content crosses into the instruction boundary, the agent may obey it.

If untrusted content influences the tool boundary, the agent may call tools it should not call.

If untrusted content crosses into the action boundary, the agent may change external systems based on attacker-controlled text.

If untrusted content enters memory, the agent may carry unsafe influence into future tasks.

That is why prompt injection should not be treated only as a text-filtering issue.

It is a boundary-control issue.

OWASP's prompt-injection guidance recommends constraining model behavior, validating expected output formats, filtering inputs and outputs, and controlling what the model can do. (OWASP Gen AI Security Project)

The Agent Trust Boundary Model turns that idea into an AI agent architecture lens.


Frequently Asked Questions About AI Agent Architecture

What is an AI agent?

An AI agent is a system that uses an LLM with instructions, tools, and runtime behavior to complete tasks. Unlike a simple chatbot, an agent may use tools, maintain task state, route work, or take bounded actions.

What is an AI agent trust boundary?

An AI agent trust boundary is a separation between different levels of authority inside the agent system. The most important boundaries separate trusted instructions, untrusted data, tool access, and external actions.

Why do AI agents need trust boundaries?

AI agents need trust boundaries because they often read untrusted content and interact with external systems. Without boundaries, the agent may confuse data with commands or tool output with authority.

What is the difference between data and instructions in AI agents?

Instructions define what the agent is allowed to do. Data is content the agent may inspect, summarize, classify, or extract from. Data should not change the agent's rules, permissions, or goals.

What is the difference between tools and actions?

A tool is a capability the agent can call, such as an API or database lookup. An action is the effect caused by using that tool, such as updating a record, sending an email, or triggering a workflow.

Should AI agents be allowed to act autonomously?

Some low-risk actions may be suitable for bounded autonomy. High-impact or irreversible actions should usually require human approval, especially when the agent has read untrusted content.

How does this model reduce prompt injection risk?

The model reduces prompt injection risk by preventing untrusted content from becoming instructions, tool permissions, memory, or executable actions. It forces the system to decide what the agent can follow, read, call, and change.

What is the first step in designing a secure AI agent?

The first step is to map the boundaries. Before choosing tools or prompts, define what the agent can follow, what it can read, what it can call, what it can change, and what must be approved or logged.


Key Takeaways: AI Agent Architecture and Trust Boundaries

  • An AI agent is not just a prompt; it is a runtime system with tools, data, actions, memory, and observability.
  • The Agent Trust Boundary Model separates instructions, data, tools, and actions.
  • Untrusted content may inform the agent, but it must not define the agent's authority.
  • Tool access should be scoped by task, role, resource, and action.
  • High-impact actions need approval gates.
  • Memory and state must be controlled separately.
  • A production-ready agent must be observable, auditable, and constrained by architecture.
---

What comes next in this series

This article defines the core model for the Designing Secure AI Agents series.

Next articles in the series:

  • Tool Output Is Not Instruction
  • Secure Architecture for AI Agents That Read Email and Webpages
  • AI Agent Prompt Injection Risk Scorecard
  • AI Agent Memory vs State
  • Human-in-the-Loop AI Agents: When Autonomy Should Stop
The central thesis:

Production-grade AI agents are not built with prompts alone. They need trust boundaries, scoped tools, approval gates, memory controls, observability, and runtime governance.

References

Part of the series

Designing Secure AI Agents
  1. 1.AI Agent Architecture: The Trust Boundary Model← you are here
View full series →
AICybersecuritySeriesMay 23, 2026
Share
Aakash Ahuja

About the Author

Aakash builds systems, platforms, and teams that scale (without breaking… usually). He's worked across 15+ industries, led global teams, and delivered multi-million-dollar projects—while still getting his hands dirty in code. He also teaches AI, Big Data, and Reinforcement Learning at top institutes in India.