How AI Agent Memory Actually Works: Short-Term, Long-Term, and Everything In Between
A deep technical explainer on AI agent memory systems — how scratchpad, session, and long-term memory tiers work together, how vector storage and semantic retrieval enable persistent context, and how Agent-S implements memory for autonomous agents.
Ask most people how AI agents “remember” things, and you’ll get a vague answer about context windows or maybe something about vector databases. The reality is significantly more nuanced — and understanding how agent memory actually works is the difference between building agents that forget everything between sessions and agents that genuinely learn and improve over time.
Memory is the single most underappreciated component of AI agent architecture. You can have perfect tool integration, flawless prompt engineering, and the best LLM on the market — but if your agent can’t remember what it did yesterday, who it’s working for, or what it learned from past mistakes, it’s just a very expensive chatbot. And as we’ve covered in our breakdown of AI agents vs. chatbots, the ability to maintain persistent state is a core part of what makes an agent an agent.
This guide goes deep into the three tiers of AI agent memory, the storage and retrieval systems that power them, and how Agent-S implements memory in practice.
Why Memory Matters More Than You Think
Consider a concrete scenario. You have an AI agent managing customer support for your SaaS product. A customer writes in about a billing issue. Without memory, the agent handles the ticket in isolation — it reads the message, generates a response, and moves on.
With memory, the picture changes dramatically:
- Short-term memory: The agent holds the current conversation context, including all messages exchanged in this session, the customer’s tone and urgency level, and any actions already taken
- Session memory: The agent recalls that this customer also wrote in two days ago about a related feature request, and that the previous interaction ended with a promise to follow up
- Long-term memory: The agent knows this customer has been with the product for 18 months, has a history of billing-related questions every quarter, prefers concise responses, and is on the enterprise plan — which means their issues get escalated to the billing team, not handled via self-service
That’s three layers of context, each stored and retrieved differently, each adding irreplaceable value to the agent’s ability to handle the situation well. Strip any one of them away, and the quality of the response degrades noticeably.
The Three-Tier Memory Architecture
Modern AI agent memory systems are organized into three tiers, each with different persistence, capacity, and retrieval characteristics.
Tier 1: Scratchpad Memory (Working Memory)
What it is: The agent’s immediate working context — the information it’s actively reasoning about right now.
Persistence: Lives only for the duration of the current reasoning step or task. Discarded when the task completes.
Capacity: Bounded by the LLM’s context window (typically 128K-200K tokens in current models).
What goes here:
- The current user request and conversation messages
- Intermediate reasoning steps
- Tool call results being evaluated
- Temporary calculations and comparisons
- Draft outputs being refined
Think of scratchpad memory like the papers spread across your desk while you’re working on a specific task. They’re immediately accessible, highly relevant, and cleaned up when you move to the next thing.
Implementation details:
In most agent frameworks, scratchpad memory is simply the content that gets passed to the LLM in each API call. This includes the system prompt, conversation history, tool results, and any injected context. The critical challenge is fitting everything relevant into the context window without exceeding token limits.
This is where context window management becomes a real engineering problem. A naive implementation just concatenates everything and truncates when it hits the limit. A well-engineered system uses strategies like:
- Priority-based truncation: Recent messages and tool results get priority over older conversation turns
- Summarization: Older conversation segments are compressed into summaries before being included
- Selective inclusion: Only context that’s semantically relevant to the current task is included
- Sliding window: A fixed number of recent messages are always included, with older messages rolled into session memory
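To make the sliding-window and priority-based strategies concrete, here is a minimal sketch. It assumes a hypothetical `Message` type, a crude word-count stand-in for a real tokenizer, and a placeholder string where a real system would call an LLM to summarize. None of these names come from a specific framework.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word. Swap in a real tokenizer.
    return len(text.split())


def build_history(messages: list[Message], budget: int, keep_recent: int = 4) -> list[Message]:
    """Keep the newest messages verbatim and stand in a summary stub for the rest."""
    split = max(len(messages) - keep_recent, 0)
    older, recent = messages[:split], messages[split:]

    kept: list[Message] = []
    used = 0
    # Walk newest-first so recent turns claim the budget before older ones.
    for msg in reversed(recent):
        cost = count_tokens(msg.text)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order

    if older:
        # Placeholder: a real system would generate this summary with an LLM call.
        summary = Message("system", f"[summary of {len(older)} earlier messages]")
        if used + count_tokens(summary.text) <= budget:
            kept.insert(0, summary)
    return kept
```

The newest turns are admitted first, so when the budget is tight it is the older material that gets squeezed out, which is the behavior priority-based truncation is after.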
Tier 2: Session Memory (Episodic Memory)
What it is: Information that persists across multiple interactions within a bounded timeframe — typically a “conversation” or “thread” that spans multiple user messages and agent actions.
Persistence: Lives for the duration of the session (hours to days). May be archived or summarized into long-term memory when the session ends.
Capacity: Larger than scratchpad but still bounded. Typically stored as structured data that can be selectively loaded into the scratchpad.
What goes here:
- Conversation history across the full session
- Decisions made and their rationale
- Task progress and current state
- User preferences expressed during the session
- Constraints and goals established for the current workflow
- Intermediate results from multi-step tasks
Session memory is your notebook for the current project. You flip back through it to remember decisions you made earlier today, check the status of ongoing work, and maintain continuity across interruptions.
Implementation details:
Session memory sits between the ephemeral scratchpad and persistent long-term storage. The most common implementations use:
- Thread-based storage: Each conversation thread maintains a structured state object (goals, decisions, constraints, open questions) that persists across messages
- Compressed history: Full conversation history is stored, but older segments are summarized so the full session can be reconstructed without consuming the entire context window
- State machines: For task-oriented agents, session memory tracks which steps have been completed, what’s in progress, and what’s remaining
The key engineering challenge with session memory is the retrieval boundary — deciding what to pull from session memory into the scratchpad for each new interaction. Pull too little, and the agent loses continuity. Pull too much, and you waste context window space on irrelevant history.
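A thread-level state object of the kind described above might look like the following sketch. The field names and the `to_prompt` rendering are illustrative assumptions, not the schema of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    goal: str
    decisions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    pending_steps: list[str] = field(default_factory=list)

    def complete(self, step: str) -> None:
        """Move a step from pending to completed (raises ValueError if not pending)."""
        self.pending_steps.remove(step)
        self.completed_steps.append(step)

    def to_prompt(self) -> str:
        """Render only the non-empty fields so the state stays compact."""
        sections = [
            ("Decisions", self.decisions),
            ("Constraints", self.constraints),
            ("Open questions", self.open_questions),
            ("Done", self.completed_steps),
            ("Remaining", self.pending_steps),
        ]
        lines = [f"Goal: {self.goal}"]
        lines += [f"{label}: {'; '.join(items)}" for label, items in sections if items]
        return "\n".join(lines)
```

Rendering only the populated fields is one simple way to handle the retrieval boundary: the scratchpad gets a few dense lines rather than the full session transcript.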
Tier 3: Long-Term Memory (Semantic Memory)
What it is: Persistent knowledge that survives across all sessions — facts, preferences, learned procedures, and accumulated experience.
Persistence: Indefinite. Survives session boundaries, agent restarts, and time.
Capacity: Effectively unlimited (bounded only by storage costs, not by architectural constraints).
What goes here:
- User preferences and communication style
- Learned facts about the user’s business, workflows, and environment
- Procedural knowledge (how to do recurring tasks)
- Corrections and calibrations from past interactions
- Entity relationships (people, projects, tools, accounts)
- Historical patterns and outcomes
Long-term memory is your experience and expertise — everything you’ve learned over time that informs how you approach new situations.
Implementation details:
Long-term memory is where the real engineering complexity lives. There are several storage and retrieval approaches:
Vector Storage and Semantic Retrieval
The most common approach for AI agent long-term memory uses vector embeddings. Each memory item (a fact, preference, procedure, or experience) is:
- Stored as text with metadata (timestamp, source, confidence score, tags)
- Converted to a vector embedding using a model such as OpenAI’s text-embedding-3-small or text-embedding-3-large
- Indexed in a vector database (Pinecone, Weaviate, Qdrant, Chroma, or pgvector)
When the agent needs to recall relevant memories, the current context is embedded and used as a query against the vector store. The most semantically similar memories are retrieved and injected into the scratchpad.
Current context: "Customer asking about GDPR compliance for their European users"
↓ embed
Query vector → Vector DB search
↓ top-k results
Retrieved memories:
1. "User's company is based in Berlin, subject to EU regulations" (similarity: 0.94)
2. "Previous conversation about data retention policies, decided on 90-day window" (similarity: 0.89)
3. "User prefers technical explanations with specific regulation citations" (similarity: 0.82)
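The embed, search, and top-k flow above can be sketched in a few lines. To stay self-contained, this example swaps the real embedding model for a toy bag-of-words vector and the vector database for a plain list, so the scores only reflect word overlap. Treat `embed`, `cosine`, and `top_k` as hypothetical helpers, not a library API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, memories: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every memory against the query and return the k best, highest first."""
    q = embed(query)
    scored = [(cosine(q, embed(m)), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A production system would replace the linear scan with an approximate nearest-neighbor index, but the contract is the same: text in, ranked memories out.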
Structured Knowledge Graphs
Some systems supplement vector search with structured knowledge stores — essentially a graph database of entities and relationships. This handles queries that vector search struggles with, like “What’s the name of the user’s CTO?” or “Which projects are currently active?”
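As a rough illustration of why a structured store answers these questions more reliably, here is a tiny in-memory triple store. The class, the relation names, and the sample entities are all made up for the example. A real deployment would more likely use a graph database.

```python
from collections import defaultdict


class TripleStore:
    """Stores (subject, relation, object) facts and answers exact lookups."""

    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._facts[(subject, relation)].add(obj)

    def query(self, subject: str, relation: str) -> list[str]:
        # .get avoids creating empty entries for unknown keys.
        return sorted(self._facts.get((subject, relation), set()))
```

A lookup such as `store.query("acme", "cto")` either returns the stored name or an empty list. There is no similarity score to misjudge, which is the property vector search lacks for this kind of question.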
Hybrid Approaches
The most effective long-term memory systems combine vector search (for fuzzy, semantic retrieval) with structured stores (for precise, relational queries) and keyword search (for exact-match lookups). This mirrors how human memory works — we have both associative recall (“this reminds me of…”) and direct recall (“the meeting is at 3 PM”).
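The article does not say how results from the different retrievers get merged. One widely used option is reciprocal rank fusion, sketched below as an assumption rather than a description of any specific system. Each retriever contributes a ranked list of memory IDs, and items that rank well in several lists rise to the top.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Hypothetical outputs from a vector search, a keyword search, and a graph lookup.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d"]
graph_hits = ["b", "a"]
merged = reciprocal_rank_fusion([vector_hits, keyword_hits, graph_hits])
```

Because it works on ranks rather than raw scores, this approach sidesteps the problem that cosine similarities and keyword scores are not on comparable scales.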
How Context Window Management Actually Works
The context window is the fundamental constraint that shapes every memory architecture decision. Here’s how the math works in practice.
A typical agent interaction needs to fit the following into, say, a 128K token context window:
| Component | Typical Size | Priority |
|---|---|---|
| System prompt | 2,000-5,000 tokens | Must include |
| Current user message | 100-2,000 tokens | Must include |
| Relevant long-term memories | 1,000-3,000 tokens | High |
| Session state/scratchpad | 500-2,000 tokens | High |
| Recent conversation history | 2,000-10,000 tokens | Medium-high |
| Tool definitions | 1,000-5,000 tokens | Must include |
| Tool results (current step) | 500-20,000 tokens | Must include |
| Older conversation history | 5,000-50,000 tokens | Medium |
| Retrieved documents/files | 1,000-50,000 tokens | Varies |
Even with 128K tokens available, this adds up quickly: the upper ends of the ranges above sum to about 147,000 tokens, more than the window holds. The art of context window management is in the prioritization and compression:
1. Mandatory inclusion: System prompt, current message, tool definitions, and current-step tool results always go in.
2. High-priority retrieval: Relevant long-term memories and session state are loaded next. These are pre-filtered by semantic relevance, so they’re compact and high-signal.
3. Conversation history: Recent messages get full inclusion. Older messages are summarized or truncated. The boundary between “recent” and “older” is dynamic — it depends on how much space remains after higher-priority content.
4. Dynamic allocation: If the current task requires a large document (a contract to review, a codebase to analyze), conversation history gets compressed more aggressively to make room.
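The four steps above amount to a greedy allocation by priority. The sketch below assumes each component reports how many tokens it wants and a numeric priority (lower means more important). The component names and priority numbers are illustrative, with sizes taken from the upper ends of the table.

```python
def allocate_context(components: list[tuple[str, int, int]], window: int) -> dict[str, int]:
    """Grant tokens in priority order; whatever does not fit must be compressed or dropped.

    Each component is (name, requested_tokens, priority), lower priority number first.
    """
    allocation: dict[str, int] = {}
    remaining = window
    for name, requested, _priority in sorted(components, key=lambda c: c[2]):
        granted = min(requested, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


components = [
    ("system_prompt", 5_000, 0),
    ("user_message", 2_000, 0),
    ("tool_definitions", 5_000, 0),
    ("tool_results", 20_000, 0),
    ("long_term_memories", 3_000, 1),
    ("session_state", 2_000, 1),
    ("recent_history", 10_000, 2),
    ("documents", 50_000, 3),
    ("older_history", 50_000, 4),
]
plan = allocate_context(components, window=128_000)
```

With these numbers everything is granted in full except older conversation history, which receives the 31,000 tokens left over and would need to be summarized down to fit.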
This is not unlike how human attention works. You can hold a limited number of things in active focus, and your brain automatically prioritizes based on relevance to your current task. The engineering is in making this prioritization fast, accurate, and invisible to the end user.
Memory Operations: Write, Read, Forget
Memory isn’t just about storage — it’s about the operations you perform on stored information.
Writing Memories
Not everything the agent encounters should be stored in long-term memory. Good memory systems have explicit write criteria:
- User corrections: If the user corrects the agent, that correction should be stored permanently. This is the highest-signal learning event.
- Explicit instructions: “Remember that I prefer X” or “Always do Y” — direct user requests for persistent behavior changes.
- Stable preferences: Patterns observed over multiple interactions that suggest a durable preference (communication style, working hours, tool preferences).
- Factual knowledge: Facts about the user’s business, team, systems, and workflows that are likely to remain true.
- Procedural learning: Successful approaches to recurring tasks, especially when the agent figured something out through trial and error.
What should NOT be stored in long-term memory:
- Transient task data (the specific numbers from today’s report)
- One-time context (a document reviewed for a single task)
- Intermediate reasoning (how the agent arrived at an answer)
- Unverified assumptions
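These write criteria translate naturally into a gate that runs before anything is persisted. The sketch below is one possible encoding. The `Kind` categories mirror the two lists above, while the `min_observations` threshold of three for "stable" preferences is an arbitrary assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    USER_CORRECTION = auto()
    EXPLICIT_INSTRUCTION = auto()
    STABLE_PREFERENCE = auto()
    FACT = auto()
    PROCEDURE = auto()
    TRANSIENT_DATA = auto()
    ONE_TIME_CONTEXT = auto()
    INTERMEDIATE_REASONING = auto()


DURABLE_KINDS = {
    Kind.USER_CORRECTION,
    Kind.EXPLICIT_INSTRUCTION,
    Kind.STABLE_PREFERENCE,
    Kind.FACT,
    Kind.PROCEDURE,
}


@dataclass
class Candidate:
    text: str
    kind: Kind
    verified: bool = True   # False for unverified assumptions
    observations: int = 1   # how many interactions showed this pattern


def should_store(candidate: Candidate, min_observations: int = 3) -> bool:
    """Return True only for durable, verified items; preferences need repeated evidence."""
    if candidate.kind not in DURABLE_KINDS or not candidate.verified:
        return False
    if candidate.kind is Kind.STABLE_PREFERENCE and candidate.observations < min_observations:
        return False
    return True
```

In practice the classification into a `Kind` is the hard part and is usually delegated to the LLM. The gate itself stays deliberately simple so its behavior is easy to audit.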
Reading Memories (Retrieval)
Retrieval is where memory systems succeed or fail. The challenges:
Relevance scoring: Vector similarity is good but not perfect. A memory about “the user’s company uses Shopify” might not surface when the user asks about “our store’s checkout flow” unless the embedding model captures that semantic relationship.
Recency vs. relevance: A memory from yesterday that’s moderately relevant might be more useful than a highly relevant memory from six months ago that’s now outdated. Good systems factor in temporal decay.
Memory conflicts: What happens when two memories contradict each other? The agent might have stored “user prefers formal tone” six months ago and “user prefers casual tone” last week. The system needs conflict resolution — typically, newer memories override older ones for the same topic.
Retrieval quantity: Too few memories and the agent misses crucial context. Too many and you waste context window space on marginally relevant information. A common range is 5-15 memory items per query, ranked by a composite score of relevance, recency, and importance.
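A composite score of this kind can be as simple as a weighted sum. The sketch below assumes exponential recency decay with a 30-day half-life and weights of 0.6, 0.25, and 0.15. These values are placeholders to tune, not recommendations from any system described here.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    text: str
    similarity: float  # semantic relevance to the query, 0..1
    age_days: float    # time since the memory was written or last reinforced
    importance: float  # 0..1, e.g. higher for user corrections


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 when fresh, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def composite_score(
    m: ScoredMemory,
    w_rel: float = 0.6,
    w_rec: float = 0.25,
    w_imp: float = 0.15,
    half_life_days: float = 30.0,
) -> float:
    return (
        w_rel * m.similarity
        + w_rec * recency(m.age_days, half_life_days)
        + w_imp * m.importance
    )


old = ScoredMemory("highly relevant but six months old", 0.9, 180, 0.5)
fresh = ScoredMemory("moderately relevant, from yesterday", 0.7, 1, 0.5)
```

Under these weights the fresh memory outranks the older, more similar one, which is the recency-versus-relevance trade-off described above. Shifting weight toward `w_rel` reverses the outcome.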
Forgetting (Memory Management)
Memory systems need garbage collection. Without it, the memory store grows indefinitely and retrieval quality degrades as irrelevant old memories crowd out useful ones.
Forgetting strategies include:
- Temporal decay: Memories that haven’t been accessed or reinforced in a long time gradually lose retrieval priority
- Explicit deletion: The user or administrator can remove specific memories (critical for GDPR compliance)
- Consolidation: Similar or overlapping memories are merged into single, canonical versions
- Contradiction resolution: When a newer memory contradicts an older one, the older one is deprecated
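Contradiction resolution is the easiest of these strategies to show in code. The sketch below makes a strong simplifying assumption: every memory carries a `topic` key, and two active memories with the same topic are treated as conflicting. Detecting real contradictions in free text is considerably harder.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MemoryRecord:
    topic: str
    text: str
    created: datetime
    active: bool = True


def resolve_contradictions(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Keep the newest active record per topic and deactivate the rest."""
    newest: dict[str, MemoryRecord] = {}
    for record in records:
        if not record.active:
            continue
        current = newest.get(record.topic)
        if current is None:
            newest[record.topic] = record
        elif record.created > current.created:
            current.active = False  # deprecate the older memory
            newest[record.topic] = record
        else:
            record.active = False
    return [r for r in records if r.active]
```

Older records are deactivated rather than deleted, so an audit can still see what the agent used to believe and when it changed.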
How Agent-S Implements Memory
Agent-S implements a practical version of this three-tier architecture. Here’s how it works in the real system.
Scratchpad
Each conversation thread in Agent-S maintains a scratchpad — a compact, structured state object that captures the current goal, active constraints, decisions made, the current plan, open questions, and important references. This scratchpad persists across messages within a thread and is automatically included in every agent reasoning step.
The scratchpad is designed for thread-specific state. It answers the question: “If the agent gets context-compacted mid-conversation, what does it need to know to continue seamlessly?”
Session Memory
Agent-S maintains full conversation history within each thread. As conversations grow long, older segments are automatically compacted — summarized and compressed so the essential facts survive without consuming the entire context window. The agent can also search its conversation history when it needs to recall specific earlier messages.
Long-Term Memory (Durable Memory)
Agent-S provides a durable memory system that persists across all conversations on an agent’s computer. Memories are stored with titles, tags, confidence scores, and full text content. The system supports:
- Automatic writing: The agent is instructed to save durable memories when it encounters facts, preferences, corrections, or operational knowledge that would help in future sessions
- Semantic search: The agent can search its memory store when it needs to recall previously learned information
- Explicit deletion: Stale, incorrect, or outdated memories can be deactivated
- Cross-conversation persistence: A memory saved in one conversation is available in every subsequent conversation on the same agent computer
This means an Agent-S agent that manages your e-commerce store remembers your brand guidelines, product categories, preferred communication tone, Shopify configuration details, and every operational decision you’ve made together — across every conversation, indefinitely.
Skills as Procedural Memory
Agent-S adds a fourth memory tier that doesn’t fit neatly into the standard three-tier model: skills. Skills are reusable procedural knowledge — “how to do X” instructions that the agent saves and can reference in future tasks.
When an agent learns a multi-step workflow (like how to deploy a website, how to format a specific type of report, or how to interact with a particular API), it saves that procedure as a skill. Skills function as procedural memory — the agent equivalent of muscle memory for tasks it’s done before.
This is distinct from long-term memory (which stores facts and preferences) in that skills store instructions and steps. The combination of facts (“the user’s Shopify store is at mystore.myshopify.com”) and procedures (“here’s how to update product descriptions via the Shopify API”) gives the agent both declarative and procedural knowledge — the same two types of knowledge that cognitive science identifies in human memory systems.
Memory Failure Modes (And How to Prevent Them)
Understanding how memory systems fail is as important as understanding how they work.
Context Pollution
When too many marginally relevant memories are loaded into the scratchpad, they dilute the context and reduce the quality of the agent’s reasoning. The agent essentially gets distracted by its own memories.
Prevention: Strict relevance thresholds for memory retrieval. Limit retrieved memories to genuinely high-similarity results (for example, cosine similarity above 0.8; the right cutoff depends on the embedding model). Quality over quantity.
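A threshold plus a hard cap is a small amount of code. In this sketch the 0.8 cutoff and the limit of five items are assumptions to calibrate against your own embedding model, and `select_memories` is a hypothetical helper.

```python
def select_memories(
    scored: list[tuple[float, str]],
    threshold: float = 0.8,
    max_items: int = 5,
) -> list[str]:
    """Drop anything at or below the threshold, then keep at most max_items, best first."""
    passing = [(score, text) for score, text in scored if score > threshold]
    passing.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in passing[:max_items]]


candidates = [
    (0.82, "prefers technical explanations"),
    (0.94, "company based in Berlin"),
    (0.61, "asked about dark mode once"),
    (0.89, "90-day retention window"),
]
kept = select_memories(candidates)
```

Returning an empty list when nothing clears the bar is intentional: injecting no memories is usually better than injecting weak ones.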
Memory Staleness
Memories that were accurate when stored but are no longer true. The user’s company switched from Shopify to WooCommerce, but the agent still retrieves and acts on the Shopify-related memories.
Prevention: Temporal decay in retrieval scoring. Explicit memory update/deletion when the agent encounters contradicting information. Periodic memory audits.
Hallucinated Memories
The agent “remembers” something that was never stored — it generates plausible but false recollections based on patterns in its training data rather than actual stored memories.
Prevention: Clear architectural separation between LLM-generated content and actual memory retrieval. Memory items should be attributed to specific sources and timestamps so the agent (and the user) can distinguish recalled facts from generated inferences.
Memory Leakage
Sensitive information stored in one context (a private conversation about HR matters) inadvertently surfacing in another context (a general business discussion).
Prevention: Memory scoping and access controls. Sensitive memories should be tagged and restricted to relevant contexts. This is especially critical for security and compliance in enterprise deployments.
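One way to enforce scoping is to filter on scope before any relevance ranking happens, so a restricted memory never becomes a retrieval candidate in the first place. The sketch below assumes each memory lists the contexts allowed to see it. The scope names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedMemory:
    text: str
    scopes: frozenset[str]  # contexts allowed to retrieve this memory


def visible_memories(memories: list[ScopedMemory], context_scope: str) -> list[ScopedMemory]:
    """Default-deny: a memory is visible only if it explicitly lists the current scope."""
    return [m for m in memories if context_scope in m.scopes]


store = [
    ScopedMemory("Pending HR complaint about a team lead", frozenset({"hr"})),
    ScopedMemory("Company fiscal year starts in April", frozenset({"hr", "general"})),
    ScopedMemory("Untagged note", frozenset()),
]
general_view = visible_memories(store, "general")
```

The default-deny choice means an untagged memory is visible nowhere, which trades some recall for a lower risk of leakage.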
The Future of Agent Memory
Agent memory is evolving rapidly. Several trends are shaping the next generation:
Structured reasoning traces: Instead of storing raw conversation text, future systems will store structured reasoning — the chain of decisions, evidence, and conclusions that led to an outcome. This makes memory more useful for learning and self-improvement.
Multi-modal memory: Agents that can remember images, screenshots, audio, and video in addition to text. This is critical for agents that interact with visual interfaces, as discussed in our guide to why agents need their own computers.
Collaborative memory: Memory shared across agents in multi-agent workflows, allowing a research agent’s findings to be directly accessible to an execution agent without explicit handoff.
Adaptive compression: Systems that get better at deciding what to compress and what to preserve in full detail, based on observed patterns of what gets recalled and what doesn’t.
FAQ
How is AI agent memory different from a simple database?
A traditional database stores data and retrieves it via exact queries: you need to know what you’re looking for and how it’s structured. Agent memory adds semantic retrieval on top of storage: you describe what you need in natural language, and the system finds the most relevant stored information based on meaning, not keywords. This means an agent can recall “that thing the user said about preferring email over Slack” even if the original memory never used those exact words. The combination of semantic search, temporal weighting, and contextual relevance scoring is what sets it apart from plain data storage.
Can I see what my AI agent remembers about me?
On Agent-S, yes. The agent’s memory is inspectable — you can ask it what it remembers, request specific memories be deleted, or correct memories that are inaccurate. This transparency is important for building trust and is also a practical necessity for data privacy compliance. You’re never in a black box — you have full visibility and control over what your agent knows.
How much context can modern AI agents actually hold?
Current LLMs support context windows of 128K-200K tokens (roughly 100,000-150,000 words) in a single reasoning step. But effective memory extends well beyond the context window through the three-tier architecture described in this article. An Agent-S agent’s long-term memory store can hold thousands of individual memories accumulated over months of interaction. The art is in selectively loading the right memories into the context window for each specific task — you don’t need all your memories all the time, just the relevant ones.
Do AI agents forget things over time?
They can, and sometimes they should. Memory systems typically implement temporal decay — memories that haven’t been accessed or reinforced gradually lose retrieval priority. This prevents the memory store from becoming cluttered with outdated information. However, high-importance memories (explicit user preferences, critical corrections, key procedural knowledge) are flagged to resist decay. The system balances between remembering everything (which degrades retrieval quality) and forgetting too aggressively (which loses valuable learned knowledge).
What happens to agent memory when the AI model gets updated?
This depends on the architecture. In well-designed systems like Agent-S, memory is stored independently of the LLM — in vector databases and structured stores that persist regardless of which model version is being used. A model upgrade doesn’t erase your agent’s memories. The embedding model used for retrieval may need re-indexing if it changes, but the underlying memory content is preserved. This is one of the key architectural decisions that separates production agent systems from simple chatbot wrappers.