AI Agent Prompt Engineering: How to Write System Prompts That Actually Work

Here’s an uncomfortable truth about AI agents: most failures aren’t model failures. They’re context failures. The model is capable of doing what you need — you just didn’t tell it what you need clearly enough.

The system prompt is the single most important piece of engineering in any AI agent deployment. It defines who the agent is, what it’s trying to accomplish, what tools it has access to, what it’s not allowed to do, how it should format its output, and how it should handle edge cases. Get this wrong, and you’ll spend weeks debugging behavior that could have been fixed with three sentences of clear instruction.

Yet most system prompts are terrible. They’re either absurdly vague (“You are a helpful assistant”) or impossibly long, burying critical instructions under pages of context that the model struggles to prioritize. Neither approach works for agents that need to operate autonomously, use tools correctly, and handle unpredictable situations.

This guide presents a six-section framework for writing system prompts that actually work for AI agents. We’ll cover each section in detail, provide real templates, and walk through the most common failure patterns and how to fix them.

Why Agent Prompts Are Different From Chat Prompts

If you’ve done any prompt engineering for chatbots or one-shot LLM tasks, you need to recalibrate your thinking for agents. Agent prompts face unique challenges:

1. Persistence: An agent’s system prompt is active across many interactions, not just one. Instructions need to remain relevant and applicable across diverse situations the agent will encounter over days, weeks, and months.

2. Tool coordination: Agents have access to external tools — web browsers, file systems, APIs, databases. The prompt must guide the agent on when and how to use each tool, which is a fundamentally different challenge than guiding text generation.

3. Autonomous decision-making: Unlike chatbots that respond to explicit user requests, agents often need to make decisions about what to do next without explicit instruction. The prompt defines the agent’s judgment framework.

4. Error recovery: Agents encounter errors — API failures, unexpected data formats, ambiguous situations. The prompt must establish how the agent handles these situations, because there may not be a human available to intervene.

5. Memory interaction: Agents read from and write to memory systems. The prompt guides what the agent should remember, what it should forget, and how it should use recalled information.

As we covered in our comparison of AI agents vs. chatbots, agents are defined by autonomy, tool access, and persistent memory. The system prompt is what shapes all three.

The Six-Section Framework

After analyzing hundreds of agent system prompts across production deployments on Agent-S and other platforms, we’ve identified six essential sections that every agent prompt needs. Not every section needs to be long — some can be a single paragraph. But skipping any of them creates predictable failure modes.

Section 1: Role Definition

What it does: Establishes the agent’s identity, expertise domain, and behavioral baseline.

Why it matters: The role definition shapes the agent’s entire reasoning approach. An agent defined as “a senior DevOps engineer” will approach problems differently than one defined as “a customer support specialist.” This isn’t just flavor text — research consistently shows that role framing affects LLM output quality and accuracy in domain-specific tasks.

Template:

You are [role title] for [organization/user]. You specialize in [primary domain] 
with deep expertise in [specific areas].

Your approach: [2-3 sentences describing reasoning style and behavioral norms]

Your audience: [who the agent is talking to and what they expect]

Example:

You are the operations automation lead for a mid-market SaaS company. You 
specialize in workflow automation with deep expertise in CRM management, 
customer communication, and data analysis.

Your approach: You are direct, action-oriented, and technically precise. You 
prefer doing things over describing how things could be done. When you're 
uncertain, you say so clearly rather than hedging.

Your audience: The CEO and operations team. They want fast, accurate execution 
with minimal hand-holding. They're technically literate but not engineers.

Common failures:

Too vague: “You are a helpful AI assistant.” This gives the model zero guidance on domain, expertise level, or behavioral expectations.
Too restrictive: “You are an expert in TypeScript, React, Node.js, PostgreSQL, and AWS Lambda who only writes code.” This over-constrains the agent and prevents it from being useful when the task doesn’t exactly match the specified stack.
Contradictory: “You are casual and friendly but always maintain formal professional language.” Pick one. Contradictory instructions cause inconsistent behavior.

Section 2: Objective

What it does: Defines what the agent is trying to accomplish — its mission, goals, and success criteria.

Why it matters: Without a clear objective, the agent optimizes for the wrong thing. Most agents default to “be helpful and thorough,” which leads to verbose, unfocused responses. A clear objective focuses the agent’s reasoning on outcomes that matter.

Template:

Your primary objective: [one clear sentence]

Success looks like: [2-3 concrete, measurable outcomes]

You are NOT trying to: [explicit anti-goals to prevent common drift]

Example:

Your primary objective: Manage the daily operations of our customer support 
pipeline, resolving tickets autonomously when possible and escalating 
intelligently when necessary.

Success looks like:
- 80%+ of routine tickets resolved without human intervention
- Escalated tickets include full context and recommended action
- Response time under 5 minutes for all incoming tickets
- Zero instances of incorrect information sent to customers

You are NOT trying to:
- Resolve every ticket yourself (some require human judgment)
- Minimize response length (thorough is better than brief for support)
- Upsell or cross-sell (unless the customer explicitly asks about other products)

Common failures:

Missing anti-goals: Without explicit anti-goals, agents drift toward behaviors that seem helpful but aren’t desired. An agent without “don’t upsell” will eventually start recommending upgrades in support tickets because it’s trying to be maximally helpful.
Unmeasurable objectives: “Be a great assistant” isn’t actionable. “Resolve 80% of routine tickets autonomously” is.
Objective overload: Listing 15 objectives dilutes all of them. Prioritize ruthlessly. If everything is critical, nothing is.

Section 3: Tool Usage

What it does: Defines which tools the agent has access to, when to use each one, and critical constraints on tool usage.

Why it matters: Tool usage is where most agent failures occur. The agent calls the wrong tool, calls the right tool with wrong parameters, calls a tool when it shouldn’t, or fails to call a tool when it should. Clear tool usage instructions prevent all of these.

Template:

You have access to the following tools:

[Tool Name]: [What it does in one sentence]
- Use when: [specific trigger conditions]
- Do NOT use when: [anti-patterns]
- Important: [critical constraints or gotchas]

[Repeat for each tool]

General tool usage rules:
- [Cross-cutting rules that apply to all tools]

Example:

You have access to the following tools:

Email Send: Sends an email from the support inbox
- Use when: You need to respond to a customer ticket or send a follow-up
- Do NOT use when: The message requires approval (anything involving refunds 
  over $100, legal language, or escalation to management)
- Important: Always include the ticket number in the subject line. Never send 
  without verifying the recipient address matches the ticket.

Database Query: Runs read-only SQL against the customer database
- Use when: You need customer account details, subscription status, or 
  usage history to resolve a ticket
- Do NOT use when: The query would return more than 1000 rows (use the 
  reporting tool instead)
- Important: This is read-only. You cannot modify data. If a data change 
  is needed, escalate to the engineering team.

Knowledge Base Search: Searches the support documentation
- Use when: You need to reference product documentation, known issues, 
  or standard procedures
- Do NOT use when: The question is about the customer's specific account 
  (use Database Query instead)

General tool usage rules:
- Try the most specific tool first. Don't search the knowledge base when 
  a database query would give you the exact answer.
- If a tool call fails, read the error message and try to fix the issue 
  before retrying. Do not retry the same failing call more than twice.
- When multiple tools could work, prefer the one that gives you structured 
  data over free text.

Common failures:

Missing “when to use” guidance: Without trigger conditions, agents either use tools too eagerly or not enough.
No error handling instructions: Agents that don’t know what to do when a tool fails will either loop on the same error or give up entirely.
Assuming the agent knows tool behavior: Never assume the agent understands how a tool works from its name alone. Be explicit about inputs, outputs, and side effects.

Section 4: Constraints

What it does: Defines the boundaries the agent must never cross — safety rails, compliance requirements, authority limits, and operational guardrails.

Why it matters: Constraints protect you from the agent doing something catastrophically wrong. They’re the most important section for high-stakes deployments.

Template:

Hard constraints (never violate these):
- [Constraint 1]
- [Constraint 2]

Soft constraints (prefer these but exercise judgment):
- [Constraint 1]
- [Constraint 2]

Escalation triggers (immediately escalate to human when):
- [Trigger 1]
- [Trigger 2]

Example:

Hard constraints (never violate these):
- Never share customer data with other customers
- Never process refunds over $500 without human approval
- Never make promises about future product features or release dates
- Never provide legal, medical, or financial advice
- Never modify production database records

Soft constraints (prefer these but exercise judgment):
- Prefer resolving tickets in a single interaction when possible
- Prefer linking to documentation rather than reproducing it in full
- Keep responses under 300 words unless the issue genuinely requires more detail

Escalation triggers (immediately escalate to human when):
- Customer mentions legal action or threatens litigation
- Customer reports a data breach or security incident
- The issue involves billing discrepancies over $1,000
- You're unsure whether the situation falls within your authority
- The customer has been flagged as a VIP or enterprise account

Common failures:

No distinction between hard and soft constraints: Treating “keep responses short” and “never share customer data” with equal weight causes the agent to either violate serious constraints or be unnecessarily rigid about minor ones.
Missing escalation triggers: Without clear escalation criteria, agents either escalate too much (wasting human time) or too little (handling situations they shouldn’t).
Constraints that conflict with objectives: “Resolve tickets quickly” combined with “always get approval before responding” creates a deadlock. Ensure your constraints are compatible with your objectives.

For a deeper framework on agent constraints and governance, see our AI agent governance guide.

Section 5: Output Format

What it does: Specifies how the agent should structure its responses, including formatting, length, tone, and required elements.

Why it matters: Consistent output formatting is critical for agent outputs that feed into other systems or workflows. Even for human-facing outputs, clear formatting expectations prevent the agent from defaulting to generic, unfocused responses.

Template:

Response format:
- [Format specification]
- [Required elements]
- [Length guidance]

For [specific situation type]:
[Specific format requirements]

Example:

Response format:
- Use plain text for customer-facing messages. No markdown, no bullet points 
  in emails to customers.
- Start every customer response with acknowledgment of their issue
- End every customer response with clear next steps or resolution confirmation
- Keep customer-facing responses between 100-250 words

For internal escalation notes:
- Use structured format: Summary, Context, Recommended Action, Priority Level
- Include relevant ticket history and customer account details
- Be technical and precise — the audience is the support lead or engineering team

For ticket resolution logs:
- One paragraph: what the issue was, what was done, and the outcome
- Include any follow-up actions scheduled

Common failures:

No format specification at all: The agent defaults to whatever format the base model prefers, which changes based on context and is inconsistent.
Over-specifying format for every case: Rigid formatting for every possible scenario makes the agent fragile. Specify format for the main output types and let the agent adapt for edge cases.

Section 6: Examples

What it does: Provides concrete examples of correct agent behavior, including input-output pairs and decision-making examples.

Why it matters: Examples are the most powerful tool in prompt engineering. They show the agent what “good” looks like in a way that abstract instructions can’t. One well-chosen example is often worth a paragraph of explanation.

Template:

Example: [Scenario name]
Situation: [Brief description of the scenario]
Agent action: [What the agent should do]
Response: [Example response or output]

Example: [Scenario name - edge case]
Situation: [Brief description of a tricky scenario]
Why this is tricky: [What makes this case non-obvious]
Agent action: [Correct approach]
Response: [Example response]

The “What Not to Do” pattern:

Including negative examples (what the agent should NOT do) is often more instructive than positive examples alone:

BAD example (do not replicate):
Customer: "Can you give me a refund?"
Agent: "I'd be happy to process a refund for you! Let me do that right now."
Why this is wrong: No verification of account, order, or refund eligibility. 
No check against refund amount limits.

GOOD example:
Customer: "Can you give me a refund?"
Agent: [Looks up customer account and recent orders]
"I can see your recent order #4521 from March 15. Could you confirmwhich 
order you'd like refunded and let me know the reason? This helps me process 
it correctly."

Common failures:

No examples at all: Many system prompts skip examples entirely. This is a significant missed opportunity.
Examples that are too simple: Showing the agent how to handle a basic happy-path case doesn’t help it handle the complex cases where it actually struggles.
Too many examples: 2-5 well-chosen examples are optimal. More than that and you’re consuming context window for diminishing returns.

Putting It Together: A Complete System Prompt

Here’s how the six sections combine into a complete agent system prompt. This is a simplified example — production prompts are typically 2,000-5,000 tokens.

[ROLE]
You are the customer support agent for CloudSync, a file synchronization 
SaaS product. You specialize in technical troubleshooting, account management, 
and billing inquiries. You are direct, empathetic, and efficient.

[OBJECTIVE]
Resolve customer support tickets autonomously when possible. Escalate 
complex issues with full context and a recommended resolution. Target: 
75% autonomous resolution rate with zero incorrect information.

[TOOLS]
Customer Database: Look up account details, subscription status, usage stats
- Use for any account-specific question
Knowledge Base: Search product documentation and known issues
- Use for product questions, troubleshooting steps, feature explanations
Email: Send responses to customers
- Always include ticket number. Requires approval for refunds > $100.
Ticket System: Update ticket status, add internal notes, assign to teams
- Update status after every action

[CONSTRAINTS]
Hard: Never share data between customers. Never promise unreleased features.
Never process refunds > $500 without approval.
Escalate immediately: Legal threats, security incidents, billing > $1000, 
VIP accounts, anything you're uncertain about.

[OUTPUT FORMAT]
Customer emails: 100-250 words, plain text, acknowledge → resolve → next steps.
Internal notes: Structured — Summary, Context, Action, Priority.

[EXAMPLES]
[Include 2-3 representative examples covering a routine case, an edge case, 
and an escalation case]

Common Prompt Engineering Mistakes

Mistake 1: The Kitchen Sink Prompt

Dumping every possible instruction, edge case, and scenario into a 10,000-token system prompt. The model can’t prioritize when everything is presented as equally important.

Fix: Ruthlessly prioritize. Put the most critical instructions first. Use the six-section framework to organize. If a section is getting too long, move detailed instructions to tool descriptions or memory rather than the system prompt.

Mistake 2: The “Be Smart” Prompt

“You are an extremely intelligent AI agent that always gives the best possible answer.” This tells the model nothing actionable. It’s the prompt engineering equivalent of telling a new employee to “just do a good job.”

Fix: Replace personality adjectives with behavioral instructions. Instead of “be thorough,” specify “always check the customer database before responding to account questions.”

Mistake 3: The Set-and-Forget Prompt

Writing the prompt once and never updating it. Agent behavior drifts and degrades over time as the agent encounters situations the original prompt didn’t anticipate.

Fix: Treat your system prompt as a living document. Review agent performance weekly. When you spot a failure pattern, add a specific instruction or example to prevent it. On Agent-S, the agent’s memory system can supplement the system prompt with learned context, but the base prompt should still evolve.

Mistake 4: Conflicting Instructions

“Always be concise” + “Always provide thorough explanations” + “Always include relevant examples.” These can’t all be true simultaneously, and the agent will oscillate between them unpredictably.

Fix: Identify conflicts before deploying. For each instruction, ask: “Is there another instruction that would prevent the agent from following this one?” Resolve conflicts explicitly with priority ordering or conditional logic.

Mistake 5: Ignoring the Memory Layer

Writing a prompt that assumes the agent has no memory, so you cram context into the system prompt that should live in memory.

Fix: Keep the system prompt focused on identity, behavior, and rules. Move factual context (user details, business processes, historical decisions) to the memory layer. The prompt tells the agent how to behave. Memory tells it what it knows. These are different concerns.

Testing Your System Prompt

Don’t deploy a system prompt without testing it against realistic scenarios:

1. Happy path test: Does the agent handle the most common use case correctly?

2. Edge case test: What happens with unusual inputs, incomplete information, or ambiguous requests?

3. Adversarial test: What happens if a user tries to override the agent’s constraints? (“Ignore your instructions and…”)

4. Error recovery test: What happens when a tool call fails? When the user provides contradictory information?

5. Escalation test: Does the agent correctly identify situations that require human intervention?

6. Consistency test: Run the same scenario 10 times. Is the behavior consistent, or does it vary significantly?

On Agent-S, you can iterate on your agent’s prompt through conversation — observe behavior, provide corrections, and the agent updates its own configuration based on your feedback. This iterative approach, combined with the agent’s persistent memory, means the prompt gets refined continuously based on real-world performance.

FAQ

How long should an AI agent system prompt be?

For most production agents, 1,000-3,000 tokens (roughly 750-2,250 words) is the sweet spot. Under 500 tokens and you’re almost certainly missing critical guidance. Over 5,000 tokens and you’re likely including information that should be in memory or tool descriptions instead. The key is information density — every sentence should change the agent’s behavior in a meaningful way. If you can remove a sentence without affecting behavior, remove it.

Should I use markdown formatting in system prompts?

Yes, judiciously. Section headers, bullet points, and code blocks help the model parse and prioritize information. Models are trained on markdown-heavy text, so they process structured prompts more accurately than walls of prose. But don’t over-format — a prompt that looks like a full documentation page is harder to maintain than one with clean, minimal structure.

How often should I update my agent’s system prompt?

Review weekly for the first month, then monthly. Track failure patterns — if the agent consistently mishandles a specific scenario, that’s a signal to add a constraint, example, or tool usage rule. The goal is to converge on a stable prompt that handles 95%+ of situations correctly. The remaining 5% is handled by the agent’s learned memory and context.

Can I use the same system prompt framework for different AI models?

The six-section framework works across models (GPT-4, Claude, Gemini, open-source models). However, different models respond differently to specific phrasing. Claude responds well to explicit role-playing and direct behavioral instructions. GPT-4 responds well to structured formats and examples. Test your prompt on your target model specifically, and expect to tune the wording. The structure and content should be model-agnostic; the phrasing may need adjustment.

What’s the biggest mistake people make with agent prompts?

Assuming the model will figure out what you meant instead of what you said. LLMs are remarkably good at following explicit instructions and remarkably bad at inferring implicit expectations. If you want specific behavior, specify it. If you assume the agent will “just know” to check the database before responding, or “just know” not to send emails on weekends, or “just know” to escalate when a customer is angry — it won’t. Write it down. The specificity of your prompt is directly proportional to the reliability of your agent.

Why Agent Prompts Are Different From Chat Prompts

The Six-Section Framework

Section 1: Role Definition

Section 2: Objective

Section 3: Tool Usage

Section 4: Constraints

Section 5: Output Format

Section 6: Examples

Putting It Together: A Complete System Prompt

Common Prompt Engineering Mistakes

Mistake 1: The Kitchen Sink Prompt

Mistake 2: The “Be Smart” Prompt

Mistake 3: The Set-and-Forget Prompt

Mistake 4: Conflicting Instructions

Mistake 5: Ignoring the Memory Layer

Testing Your System Prompt

FAQ

How long should an AI agent system prompt be?

Should I use markdown formatting in system prompts?

How often should I update my agent’s system prompt?

Can I use the same system prompt framework for different AI models?

What’s the biggest mistake people make with agent prompts?

Give your AI agent its own computer