AI Agent Architecture: The Trust Boundary Model
Most unsafe AI agents fail because they are given too much authority inside one blurry runtime.
A chatbot can produce a bad answer. An AI agent can read untrusted content, call tools, update records, send messages, trigger workflows, and store memory. That means the architecture must clearly separate what the agent can follow, read, call, and change.
That is the purpose of the Agent Trust Boundary Model — a practical AI agent architecture framework for designing agents that interact safely with data, tools, and external systems. It draws on OWASP's prompt injection guidance and OpenAI's Agents SDK design, applied as an architecture lens rather than a checklist.
The model is simple:
| Boundary | Core question |
|---|---|
| Instruction boundary | What is the agent allowed to follow? |
| Data boundary | What is the agent allowed to read? |
| Tool boundary | What is the agent allowed to call? |
| Action boundary | What is the agent allowed to change? |
That is how agent demos turn into production risks.
Table of Contents
- What is the Agent Trust Boundary Model?
- Why AI agents need trust boundaries
- AI agent vs chatbot: why the security model changes
- Boundary 1: What instructions can the agent follow?
- Boundary 2: What data can the agent read?
- Boundary 3: What tools can the agent call?
- Boundary 4: What actions can the agent take?
- The supporting boundaries: memory, state, identity, and observability
- What usually fails in unsafe AI agent designs?
- Example architecture: support-ticket AI agent
- Agent Trust Boundary Checklist
- Frequently Asked Questions About AI Agent Architecture
- Key Takeaways: AI Agent Architecture and Trust Boundaries
What is the Agent Trust Boundary Model?
The Agent Trust Boundary Model is a practical AI agent architecture model for designing systems that interact with data, tools, and external environments. Introduced by Aakash Ahuja, it separates an AI agent's environment into four core boundaries:
- Instructions — what the agent is allowed to follow.
- Data — what the agent is allowed to inspect.
- Tools — what the agent is allowed to call.
- Actions — what the agent is allowed to change.
OpenAI's Agents SDK describes agents as LLMs configured with instructions, tools, and optional runtime behavior such as handoffs, guardrails, and structured outputs. The same documentation points to state, integrations, observability, guardrails, and human review as agent workflows grow more complex. (OpenAI Developers)
That tells us something important:
An AI agent is not only a prompt. It is a runtime system.
A runtime system needs boundaries.
For a broader look at the full design process, see How to Design AI Agents.
Why AI agents need trust boundaries
Prompt injection is one reason AI agents need trust boundaries, but it is not the only reason.
OWASP describes prompt injection as a vulnerability where malicious input manipulates an LLM's behavior. OWASP's prevention guidance also highlights the core design weakness: natural-language instructions and data are often processed together without clear separation. (OWASP Gen AI Security Project)
That is exactly what trust boundaries solve.
They force the system designer to answer:
- Which instructions have authority?
- Which content is only data?
- Which tools are available?
- Which actions need approval?
- Which outputs can be trusted?
- Which decisions must be logged?
- Which memory can persist?
That is the central failure.
AI agent vs chatbot: why the security model changes
A chatbot and an AI agent may both use an LLM, but their risk models are different.
| System | What it usually does | Main risk |
|---|---|---|
| Chatbot | Generates conversational responses | Wrong, unsafe, or misleading output |
| RAG assistant | Answers using retrieved content | Retrieval errors, source confusion, prompt injection |
| Workflow automation | Executes predefined rules | Bad rules, broken integrations, process mismatch |
| AI agent | Chooses steps, calls tools, uses context, may act | Unsafe execution, tool misuse, authority confusion |
| Autonomous agent | Acts with limited human review | Over-autonomy, weak accountability, high blast radius |
A chatbot may misread an email.
An agent may misread an email and then update a CRM record, send a reply, or escalate a ticket incorrectly.
That is why AI agent architecture needs more than prompts and retrieval.
It needs execution boundaries.
Boundary 1: What instructions can the agent follow?
The instruction boundary defines what the agent is allowed to obey.
This includes:
- system instructions,
- developer instructions,
- policy rules,
- task-specific constraints,
- approved workflow rules,
- role-based limits.
Which instructions are authoritative, and which text must never be treated as instruction?
This is where many agent designs fail.
They place system instructions, user messages, retrieved content, tool output, and memory into the same context window without clearly separating authority levels.
A safer model treats instructions as privileged.
Trusted instructions
Trusted instructions may include:
- system prompt,
- developer policy,
- workflow rules,
- tool-use rules,
- compliance constraints,
- approval requirements.
Untrusted instructions
Untrusted "instructions" may appear inside:
- emails,
- webpages,
- PDFs,
- support tickets,
- customer messages,
- database records,
- tool outputs,
- retrieved documents.
A simple design rule:
The agent may read untrusted text, but it must not obey instructions found inside untrusted text.
Boundary 2: What data can the agent read?
The data boundary defines what information the agent may inspect.
Data can include:
- user input,
- documents,
- emails,
- webpages,
- knowledge-base articles,
- database records,
- CRM notes,
- support tickets,
- tool outputs,
- logs,
- reports.
What can the agent read, and what trust label does that content carry?
Not all data is equal.
| Data type | Trust posture |
|---|---|
| System policy | High trust |
| Internal approved workflow rules | High trust |
| User message | Conditional trust |
| Customer email | Untrusted |
| Webpage content | Untrusted |
| Uploaded PDF | Untrusted |
| Tool output | Usually untrusted until validated |
| Long-term memory | Depends on source and write controls |
Useful data can still be hostile, stale, incomplete, or misleading.
The model may use data to summarize, classify, extract, compare, or reason.
But data should not define the agent's authority.
Boundary 3: What tools can the agent call?
The tool boundary defines what external capabilities the agent can use.
Tools may include:
- web search,
- database query,
- CRM lookup,
- email sending,
- ticket update,
- file access,
- calendar scheduling,
- payment action,
- user provisioning,
- deployment scripts,
- internal APIs.
That means tool execution is not magic.
The application decides what tools exist, when to execute them, how to validate arguments, and what to do with outputs.
The tool boundary should answer:
- Which tools can this agent access?
- Is the tool read-only or write-capable?
- What arguments are allowed?
- What user/tenant/resource scope applies?
- Does the tool require approval?
- Can the tool output influence future tool use?
- Is every tool call logged?
Tool access should be scoped
A support classification agent may need:
- read ticket,
- classify category,
- suggest priority,
- route to queue.
- delete ticket,
- refund customer,
- change account status,
- send legal communication,
- update billing information.
Give the agent every tool it might someday need.
The safer default is:
Give the agent the minimum tool set needed for the current task.
Boundary 4: What actions can the agent take?
The action boundary defines what the agent is allowed to change.
Actions are different from tools.
A tool is a capability.
An action is an effect.
For example:
| Tool | Possible action |
|---|---|
| CRM API | Update customer record |
| Email API | Send message |
| Ticketing API | Change priority |
| Calendar API | Schedule meeting |
| IAM API | Grant access |
| Database tool | Modify data |
| Deployment tool | Trigger release |
- Can the agent only recommend?
- Can it draft but not send?
- Can it update records?
- Can it trigger workflows?
- Can it make irreversible changes?
- Which actions require human approval?
- Which actions are forbidden?
- Which actions are reversible?
- Which actions are logged?
The autonomy ladder
| Level | Agent role | Human role | Example |
|---|---|---|---|
| Observe | Reads and summarizes | Reviews output | Summarize tickets |
| Recommend | Suggests next action | Decides | Suggest priority |
| Draft | Creates proposed action | Approves | Draft customer reply |
| Execute with approval | Performs approved action | Approves first | Send reply after review |
| Bounded execution | Executes low-risk action | Reviews logs | Route ticket by category |
| Full autonomy | Acts independently | Exception review | Rare; high control required |
Do not jump from "summarizes content" to "can execute workflow changes" without intermediate controls.
The supporting boundaries: memory, state, identity, and observability
The four main boundaries are instructions, data, tools, and actions.
But production-grade AI agents also need supporting boundaries.
Memory boundary
The memory boundary defines what the agent can retain across tasks or sessions.
It should answer:
- What can be stored?
- Who approved the memory write?
- What is the source?
- How long does it persist?
- Can untrusted content become long-term memory?
- Can memory affect future authority?
That can turn one unsafe interaction into future unsafe behavior.
State boundary
State is what the agent needs during the current task.
State may include:
- current ticket,
- current user request,
- current workflow step,
- temporary reasoning context,
- selected tool result,
- current approval status.
A simple distinction:
State is task-local. Memory is cross-task.
Many weak agent designs mix the two.
Identity boundary
The identity boundary defines who or what the agent is acting as.
Questions:
- Does the agent have its own service identity?
- Is it acting on behalf of a user?
- Are permissions inherited from the user?
- Are permissions scoped by tenant, role, resource, and action?
- Are credentials hidden from the model?
- Are tool calls attributed correctly?
For a deep look at how identity and permission scoping changes as agents move from personal to enterprise deployment, see From Personal AI Agent to Enterprise: What Actually Breaks.
Observability boundary
The observability boundary defines what the system records.
NIST's AI Risk Management Framework is designed to help organizations manage AI risks across design, deployment, and operation; for agent systems, that risk mindset translates into traceability, reviewability, and operational monitoring. (NIST)
At minimum, logs should capture:
- user request,
- agent identity,
- tool definitions available,
- tool call requested,
- tool arguments,
- tool result,
- approval decision,
- final action,
- timestamp,
- error path,
- fallback path.
What usually fails in unsafe AI agent designs?
Unsafe agent designs usually fail in predictable ways.
| Failure | What happens | Better design |
|---|---|---|
| No instruction boundary | Retrieved text can override behavior | Separate trusted instructions from untrusted content |
| No data boundary | Email/web/PDF content is over-trusted | Label external content as untrusted |
| Broad tool access | Agent can call unnecessary tools | Scope tools by task |
| No action boundary | Agent can change systems too freely | Require approvals for high-impact actions |
| No memory control | Poisoned or stale content persists | Track memory source and expiry |
| No identity boundary | Agent acts with broad human permissions | Use scoped service identity |
| No audit logs | Failures cannot be reconstructed | Log tool calls, approvals, and actions |
| No fallback path | Agent guesses under uncertainty | Escalate or stop safely |
The agent is allowed to do more than the system can safely govern.
That is not an AI problem. That is an AI agent architecture problem.
Example architecture: support-ticket AI agent
Consider a support-ticket AI agent.
The business goal:
- classify incoming tickets,
- identify urgency,
- suggest routing,
- draft a response,
- escalate high-risk cases.
Unsafe design
The unsafe version:
- Reads incoming ticket.
- Sends full ticket body into the model.
- Gives the model ticket-update tools.
- Allows automatic priority change.
- Allows automatic external reply.
- Stores summary in memory.
- Logs only final output.
The agent may treat instructions inside the ticket as commands.
Safer design using the Agent Trust Boundary Model
| Boundary | Safer design |
|---|---|
| Instruction boundary | System rules define what the agent can and cannot do |
| Data boundary | Ticket body is marked as untrusted customer content |
| Tool boundary | Agent gets only classify, suggest queue, draft reply tools |
| Action boundary | External replies require human approval |
| Memory boundary | No long-term memory write without review |
| Identity boundary | Agent uses scoped service identity |
| Observability boundary | Every tool call and decision is logged |
Safer workflow
- Ticket enters system.
- Ticket body is marked as untrusted content.
- Agent classifies topic and urgency.
- Agent suggests queue.
- Agent drafts response.
- Human approves or edits response.
- Approved action is executed.
- Tool calls and decisions are logged.
- High-risk or uncertain cases are escalated.
But it avoids giving the agent uncontrolled authority.
Agent Trust Boundary Checklist
Use this checklist before deploying any AI agent that reads data, calls tools, or takes action.
Instruction boundary
- What instructions are authoritative?
- Can external content override system or developer instructions?
- Are policy rules separated from user-controlled content?
- Are instructions versioned and reviewable?
Data boundary
- What content can the agent read?
- Is external content marked as untrusted?
- Are emails, webpages, PDFs, tickets, and tool outputs treated as data?
- Can untrusted content influence permissions or goals?
Tool boundary
- What tools can the agent call?
- Are tools read-only or write-capable?
- Are tool arguments validated?
- Are tools scoped by user, tenant, role, and resource?
- Are high-risk tools approval-gated?
Action boundary
- What can the agent change?
- Which actions are reversible?
- Which actions are irreversible?
- Which actions require human approval?
- Which actions are forbidden?
Memory boundary
- What can be stored across sessions?
- Is memory source tracked?
- Can untrusted content become memory?
- Is there an expiry or review process?
Identity boundary
- Who is the agent acting as?
- Does the agent have scoped credentials?
- Are credentials hidden from the model?
- Are actions attributable to agent, user, and system?
Observability boundary
- Are tool calls logged?
- Are approvals logged?
- Are denied actions logged?
- Can failures be replayed or reconstructed?
- Are confidence, uncertainty, and fallback paths visible?
How this model connects to prompt injection
Prompt injection often succeeds when boundaries are weak.
If untrusted content crosses into the instruction boundary, the agent may obey it.
If untrusted content influences the tool boundary, the agent may call tools it should not call.
If untrusted content crosses into the action boundary, the agent may change external systems based on attacker-controlled text.
If untrusted content enters memory, the agent may carry unsafe influence into future tasks.
That is why prompt injection should not be treated only as a text-filtering issue.
It is a boundary-control issue.
OWASP's prompt-injection guidance recommends constraining model behavior, validating expected output formats, filtering inputs and outputs, and controlling what the model can do. (OWASP Gen AI Security Project)
The Agent Trust Boundary Model turns that idea into an AI agent architecture lens.
Frequently Asked Questions About AI Agent Architecture
What is an AI agent?
An AI agent is a system that uses an LLM with instructions, tools, and runtime behavior to complete tasks. Unlike a simple chatbot, an agent may use tools, maintain task state, route work, or take bounded actions.
What is an AI agent trust boundary?
An AI agent trust boundary is a separation between different levels of authority inside the agent system. The most important boundaries separate trusted instructions, untrusted data, tool access, and external actions.
Why do AI agents need trust boundaries?
AI agents need trust boundaries because they often read untrusted content and interact with external systems. Without boundaries, the agent may confuse data with commands or tool output with authority.
What is the difference between data and instructions in AI agents?
Instructions define what the agent is allowed to do. Data is content the agent may inspect, summarize, classify, or extract from. Data should not change the agent's rules, permissions, or goals.
What is the difference between tools and actions?
A tool is a capability the agent can call, such as an API or database lookup. An action is the effect caused by using that tool, such as updating a record, sending an email, or triggering a workflow.
Should AI agents be allowed to act autonomously?
Some low-risk actions may be suitable for bounded autonomy. High-impact or irreversible actions should usually require human approval, especially when the agent has read untrusted content.
How does this model reduce prompt injection risk?
The model reduces prompt injection risk by preventing untrusted content from becoming instructions, tool permissions, memory, or executable actions. It forces the system to decide what the agent can follow, read, call, and change.
What is the first step in designing a secure AI agent?
The first step is to map the boundaries. Before choosing tools or prompts, define what the agent can follow, what it can read, what it can call, what it can change, and what must be approved or logged.
Key Takeaways: AI Agent Architecture and Trust Boundaries
- An AI agent is not just a prompt; it is a runtime system with tools, data, actions, memory, and observability.
- The Agent Trust Boundary Model separates instructions, data, tools, and actions.
- Untrusted content may inform the agent, but it must not define the agent's authority.
- Tool access should be scoped by task, role, resource, and action.
- High-impact actions need approval gates.
- Memory and state must be controlled separately.
- A production-ready agent must be observable, auditable, and constrained by architecture.
What comes next in this series
This article defines the core model for the Designing Secure AI Agents series.
Next articles in the series:
- Tool Output Is Not Instruction
- Secure Architecture for AI Agents That Read Email and Webpages
- AI Agent Prompt Injection Risk Scorecard
- AI Agent Memory vs State
- Human-in-the-Loop AI Agents: When Autonomy Should Stop
Production-grade AI agents are not built with prompts alone. They need trust boundaries, scoped tools, approval gates, memory controls, observability, and runtime governance.
References
Part of the series
Designing Secure AI Agents- 1.AI Agent Architecture: The Trust Boundary Model← you are here
