Observational Memory

Trust: ★★★☆☆ (0.90) · 0 validations · developer_reference

Published: 2026-05-10 · Source: crawler_authoritative

Situation

Guide for Mastra developers configuring long-context agentic memory using background Observer and Reflector agents to compress conversation history into dense observations.

Insight

Observational Memory (OM) is Mastra’s memory system for long-context agentic memory, added in @mastra/[email protected]. It uses two background agents, an Observer and a Reflector, that watch conversations and maintain a dense observation log. The Observer creates observations when message history tokens exceed a threshold (default: 30,000 tokens), compressing raw messages 5–40×. The Reflector condenses observations when they exceed their own threshold (default: 40,000 tokens), creating a three-tier memory system: recent messages, observations, and reflections.

OM supports temporal gap markers (inserted when gaps are at least 10 minutes), early activation settings (activateAfterIdle, activateOnProviderChange), and async buffering (enabled by default with bufferTokens: 0.2 and bufferActivation: 0.8). Retrieval mode (experimental) keeps observation groups linked to raw messages and registers a recall tool for browsing or semantic search. Token-tiered model selection uses ModelByInputTokens with upTo thresholds.

Thread scope (default) gives each thread its own observations; resource scope (experimental) shares observations across all threads for a resource. Supported storage adapters: @mastra/pg, @mastra/libsql, and @mastra/mongodb. The default model is google/gemini-2.5-flash. Token counting uses tokenx for text and provider-aware heuristics for images.

Observation and reflection settings include messageTokens (default 30,000), observationTokens (default 40,000), bufferTokens, bufferActivation, blockAfter (default 1.2), and previousObserverTokens for context optimization. The recall tool supports mode: "search" with a query string (requires vector: true), mode: "threads", threadId browsing, and scope: 'thread' or 'resource'. Async buffering is automatically disabled in resource scope. When using OM with a client application, send only the new message instead of the full conversation history to avoid message ordering bugs.

Action

Enable observationalMemory in the Memory options when creating an agent. Set observationalMemory: true to use the default gemini-2.5-flash model, or pass a config object with model, temporalMarkers, activateAfterIdle, activateOnProviderChange, scope ('thread' or 'resource'), observation (messageTokens, bufferTokens, bufferActivation, blockAfter, previousObserverTokens), reflection (observationTokens, bufferActivation, activateAfterIdle, activateOnProviderChange, blockAfter), and retrieval (vector, scope).

  • For temporal gap markers, set temporalMarkers: true.
  • For early activation tied to prompt cache expiry, set activateAfterIdle to match the cache TTL (e.g., '5m' or 300_000).
  • For async buffering, configure observation.bufferTokens (default 0.2), observation.bufferActivation (default 0.8), and observation.blockAfter (default 1.2). To disable async buffering, set observation.bufferTokens: false.
  • To use different models per phase, use ModelByInputTokens with upTo thresholds.
  • For retrieval mode, set retrieval: true for browsing or retrieval: { vector: true } for semantic search. To restrict the recall tool to the current thread, set scope: 'thread'.
  • When using OM with a client application, send only the new message from the client instead of the full conversation history.
  • No manual migration is needed for existing threads; OM reads existing messages lazily when thresholds are exceeded.
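For orientation, here is a rough sketch combining the options named above (values are illustrative; exact defaults and shapes are documented in the original content below):

import { Memory } from '@mastra/memory'

// Illustrative configuration only; pick the options you need.
const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      temporalMarkers: true,
      activateAfterIdle: '5m',
      scope: 'thread',
      observation: {
        messageTokens: 30_000,   // when to run the Observer
        bufferTokens: 0.2,       // buffer a chunk every 20% of messageTokens
        bufferActivation: 0.8,   // keep ~20% of messageTokens after activation
        blockAfter: 1.2,         // force synchronous observation as a last resort
      },
      reflection: {
        observationTokens: 40_000, // when to run the Reflector
      },
      retrieval: true,           // enable the recall tool (browsing only)
    },
  },
})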

Result

Returns a dense observation log that replaces raw message history as it grows, maintaining humanlike long-term memory that persists across conversations. The three-tier system provides recent messages for the current task, observations for what the Observer has seen, and reflections for condensed memories. Retrieval mode enables the recall tool to page through raw messages behind observation groups or search semantically. Token progress bars in Studio show current counts and thresholds. Context compression is typically 5–40×, keeping the agent on task over long sessions with zero context rot.

Applicability conditions

Added in @mastra/[email protected]. Only supports @mastra/pg, @mastra/libsql, and @mastra/mongodb storage adapters. Retrieval mode is experimental. Resource scope is experimental. Async buffering not supported with resource scope (automatically disabled). When sending messages from client applications, send only the new message — sending full history causes message ordering bugs. Thread scope requires valid threadId or throws an error. google/gemini-2.5-flash may produce reflections that stay above the reflection.observationTokens threshold even after max compression retries; use xai/grok-4-1-fast or deepseek models for more aggressive compression on Reflector.


Original content

Observational Memory

Added in: @mastra/[email protected]

Observational Memory (OM) is Mastra’s memory system for long-context agentic memory. Two background agents — an Observer and a Reflector — watch your agent’s conversations and maintain a dense observation log that replaces raw message history as it grows.

Quickstart

Enable observationalMemory in the memory options when creating your agent:

import { Memory } from '@mastra/memory'
import { Agent } from '@mastra/core/agent'
 
export const agent = new Agent({
  name: 'my-agent',
  instructions: 'You are a helpful assistant.',
  model: 'openai/gpt-5-mini',
  memory: new Memory({
    options: {
      observationalMemory: true,
    },
  }),
})

That’s it. The agent now has humanlike long-term memory that persists across conversations. Setting observationalMemory: true uses google/gemini-2.5-flash by default, and a config object uses the same default unless you set a model. To use a different model, pass it in the config object:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'deepseek/deepseek-reasoner',
    },
  },
})

See configuration options for full API details.

Warning: When you use OM with a client application, send only the new message from the client instead of the full conversation history.

Observational memory still relies on stored conversation history. Sending the full history is redundant and can cause message ordering bugs when client-side timestamps conflict with stored timestamps.
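A minimal sketch of that pattern, assuming agent.generate accepts memory thread and resource identifiers as in Mastra’s other memory examples (the IDs and message text below are placeholders):

// Send only the latest user message; OM reads stored conversation history itself.
const newUserMessage = 'What did we decide about auth?' // placeholder

const result = await agent.generate(
  [{ role: 'user', content: newUserMessage }],
  {
    memory: {
      thread: 'thread-123',  // placeholder thread id
      resource: 'user-456',  // placeholder resource (user) id
    },
  },
)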

For an AI SDK example, see Using Mastra Memory.

Note: OM currently only supports @mastra/pg, @mastra/libsql, and @mastra/mongodb storage adapters. It uses background agents for managing memory. When no model is set, the default model is google/gemini-2.5-flash.

Temporal gap markers

Temporal gap markers insert a short reminder before a new user message when enough time has passed since the previous message in the thread. They help the agent and the UI see that the conversation resumed after a meaningful pause.

Temporal gap markers are off by default. Enable them with temporalMarkers: true in the observationalMemory config:

import { Memory } from '@mastra/memory'
import { Agent } from '@mastra/core/agent'
 
export const agent = new Agent({
  name: 'my-agent',
  instructions: 'You are a helpful assistant.',
  model: 'openai/gpt-5-mini',
  memory: new Memory({
    options: {
      observationalMemory: {
        model: 'google/gemini-2.5-flash',
        temporalMarkers: true,
      },
    },
  }),
})

Mastra inserts a temporal gap marker when the gap is at least 10 minutes. The marker is stored in memory and also emitted as a transient reminder event, so clients can render it as a lightweight timeline hint.

The observer also sees these markers when it processes the thread, so the observations it writes can anchor memories to when they happened (for example, “User asked about deployment after a 2-day gap”).

See the API reference for the full configuration shape.

Early activation

OM can activate buffered observations before the token threshold is reached. This is useful when a prompt cache is likely to expire, or when the agent changes model providers.

Top-level early activation settings apply to observations by default:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      activateAfterIdle: '5m',
      activateOnProviderChange: true,
    },
  },
})

Use nested observation and reflection settings for per-phase control. Reflection early activation is opt-in, so top-level settings affect only observations.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      activateAfterIdle: '5m',
      observation: {
        activateAfterIdle: false,
      },
      reflection: {
        activateAfterIdle: '10m',
        activateOnProviderChange: true,
      },
    },
  },
})

In this example, the top-level idle setting is disabled for observations, while reflections opt into idle and provider-change activation.

See the API reference for the full configuration shape.

Benefits

  • Prompt caching: OM’s context is stable and observations append over time rather than being dynamically retrieved each turn. This keeps the prompt prefix cacheable, which reduces costs.
  • Compression: Raw message history and tool results get compressed into a dense observation log. Smaller context means faster responses and longer coherent conversations.
  • Zero context rot: The agent sees relevant information instead of noisy tool calls and irrelevant tokens, so the agent stays on task over long sessions.

How it works

You don’t remember every word of every conversation you’ve ever had. You observe what happened subconsciously, then your brain reflects — reorganizing, combining, and condensing into long-term memory. OM works the same way.

Every time an agent responds, it sees a context window containing its system prompt, recent message history, and any injected context. The context window is finite; even models with large token limits perform worse when the window is full. This causes two problems:

  • Context rot: the more raw message history an agent carries, the worse it performs.
  • Context waste: most of that history contains tokens no longer needed to keep the agent on task.

OM solves both problems by compressing old context into dense observations.

Observations

When message history tokens exceed a threshold (default: 30,000), the Observer creates observations: concise notes about what happened. An example observation log is shown below.

OM uses fast local token estimation for this thresholding work. Text is estimated with tokenx, while image parts use provider-aware heuristics so multimodal conversations still trigger observation at the right time. The same applies to image-like file parts when a transport normalizes an uploaded image as a file instead of an image part. For example, OpenAI image detail settings can materially change when OM decides to observe.

The Observer can also see attachments in the history it reviews. OM keeps readable placeholders like [Image #1: reference-board.png] or [File #1: floorplan.pdf] in the transcript for readability, and forwards the actual attachment parts alongside the text. Image-like file parts are upgraded to image inputs for the Observer when possible, while non-image attachments are forwarded as file parts with normalized token counting. This applies to both normal thread observation and batched resource-scope observation.

Date: 2026-01-15
 
- 🔴 12:10 User is building a Next.js app with Supabase auth, due in 1 week (meaning January 22nd 2026)
  - 🔴 12:10 App uses server components with client-side hydration
  - 🟡 12:12 User asked about middleware configuration for protected routes
  - 🔴 12:15 User stated the app name is "Acme Dashboard"

The compression is typically 5–40×. The Observer also tracks a current task and suggested response so the agent picks up where it left off.

If you enable observation.threadTitle, the Observer can also suggest a short thread title when the conversation topic meaningfully changes. Thread title generation is opt-in and updates the thread metadata, so apps like Mastra Code can show the latest title in thread lists and status UI.
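A minimal sketch of opting in, assuming observation.threadTitle is a boolean flag:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        threadTitle: true, // let the Observer suggest a title when the topic meaningfully changes
      },
    },
  },
})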

Example: An agent using Playwright MCP might see 50,000+ tokens per page snapshot. With OM, the Observer watches the interaction and creates a few hundred tokens of observations about what was on the page and what actions were taken. The agent stays on task without carrying every raw snapshot.

Reflections

When observations exceed their threshold (default: 40,000 tokens), the Reflector condenses them, combines related items, and reflects on patterns.

The result is a three-tier system:

  1. Recent messages: Exact conversation history for the current task
  2. Observations: A log of what the Observer has seen
  3. Reflections: Condensed observations when memory becomes too long

Retrieval mode (experimental)

Note: Retrieval mode is experimental. The API may change in future releases.

Normal OM compresses messages into observations, which is great for staying on task, but the original wording is gone. Retrieval mode fixes this by keeping each observation group linked to the raw messages that produced it. When the agent needs exact wording, tool output, or chronology that the summary compressed away, it can call a recall tool to page through the source messages.

Browsing only

Set retrieval: true to enable the recall tool for browsing raw messages. No vector store needed. By default, the recall tool can browse across all threads for the current resource.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      retrieval: true,
    },
  },
})

Set retrieval: { vector: true } to also enable semantic search. This reuses the vector store and embedder already configured on your Memory instance:

const memory = new Memory({
  storage,
  vector: myVectorStore,
  embedder: myEmbedder,
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      retrieval: { vector: true },
    },
  },
})

When vector search is configured, new observation groups are automatically indexed at buffer time and during synchronous observation (fire-and-forget, non-blocking). Semantic search returns observation-group matches with their raw source message ID ranges, so the recall tool can show the summarized memory alongside where it came from.

Restricting to the current thread

By default, the recall tool scope is 'resource' — the agent can list threads, browse other threads, and search across all conversations. Set scope: 'thread' to restrict the agent to only the current thread:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      retrieval: { vector: true, scope: 'thread' },
    },
  },
})

What retrieval enables

With retrieval mode enabled, OM:

  • Stores a range (e.g. startId:endId) on each observation group pointing to the messages it was derived from

  • Keeps range metadata visible in the agent’s context so the agent knows which observations map to which messages

  • Registers a recall tool the agent can call to:

    • Page through the raw messages behind any observation group range
    • Search by semantic similarity (mode: "search" with a query string); requires vector: true
    • List all threads (mode: "threads"), browse other threads (threadId), and search across all threads (default scope: 'resource')
    • When scope: 'thread': restrict browsing and search to the current thread only

See the recall tool reference for the full API (detail levels, part indexing, pagination, cross-thread browsing, and token limiting).
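For a rough sense of what the agent passes to the recall tool, based on the modes named above (argument shapes are approximate; the recall tool reference is authoritative):

// Illustrative recall-tool arguments; the agent constructs these itself.
const searchArgs = { mode: 'search', query: 'deployment deadline' } // requires retrieval: { vector: true }
const threadListArgs = { mode: 'threads' }                          // list threads for the resource
const browseArgs = { threadId: 'thread-abc' }                       // page through another thread (placeholder id)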

Studio

To see how it works in practice, open Studio and navigate to an agent with OM enabled. The Memory tab displays:

  • Token progress bars: Current token counts for messages and observations, showing how close each is to its threshold. Hover over the info icon to see the model and threshold for the Observer and Reflector.
  • Active observations: The current observation log, rendered inline. When previous observation or reflection records exist, expand “Previous observations” to browse them.
  • Background processing: During a conversation, buffered observation chunks and reflection status appear as the agent processes in the background.

The progress bars update live while the agent is observing or reflecting, showing elapsed time and a status badge.

Models

The Observer and Reflector run in the background. Any model that works with Mastra’s model routing (provider/model) can be used. When no model is set, the default model is google/gemini-2.5-flash.

Generally speaking, we recommend using a model that has a large context window (128K+ tokens) and is fast enough to run in the background without slowing down your agent.

If you’re unsure which model to use, start with the default google/gemini-2.5-flash. We’ve also successfully tested openai/gpt-5-mini, anthropic/claude-haiku-4-5, deepseek/deepseek-reasoner, deepseek/deepseek-v4-pro, deepseek/deepseek-v4-flash, xai/grok-4-1-fast, qwen3, and glm-4.7.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'deepseek/deepseek-reasoner',
    },
  },
})

See model configuration for using different models per agent.

Note: google/gemini-2.5-flash is unusually good at preserving detail in long output. As a result, the Reflector can produce reflections that stay above the configured reflection.observationTokens threshold even after the maximum compression retry. When this happens, the Reflector returns the smallest non-degenerate candidate produced during retries so the loop terminates instead of running forever.

If you’d rather have more aggressive compression on the Reflector, swap to a model that condenses more readily, such as xai/grok-4-1-fast, deepseek/deepseek-v4-pro, or deepseek/deepseek-v4-flash. You can keep google/gemini-2.5-flash for the Observer and use a different model for the Reflector — see different models per agent.
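A sketch of that split, using the per-phase model fields shown in the token-tiered example below:

const memory = new Memory({
  options: {
    observationalMemory: {
      observation: {
        model: 'google/gemini-2.5-flash', // Observer keeps detailed observations
      },
      reflection: {
        model: 'xai/grok-4-1-fast', // Reflector compresses more aggressively
      },
    },
  },
})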

Token-tiered model selection

Added in: @mastra/[email protected]

You can use ModelByInputTokens to specify different Observer or Reflector models based on input token count. OM selects the matching model tier at runtime from the configured upTo thresholds.

import { Memory, ModelByInputTokens } from '@mastra/memory'
 
const memory = new Memory({
  options: {
    observationalMemory: {
      observation: {
        model: new ModelByInputTokens({
          upTo: {
            // Faster, cheaper models for smaller inputs; stronger models for larger contexts
            5_000: 'openrouter/mistralai/ministral-8b-2512',
            20_000: 'openrouter/mistralai/mistral-small-2603',
            40_000: 'openai/gpt-5.4-mini',
            1_000_000: 'google/gemini-3.1-flash-lite-preview',
          },
        }),
      },
      reflection: {
        model: new ModelByInputTokens({
          upTo: {
            20_000: 'openai/gpt-5.4-mini',
            100_000: 'google/gemini-2.5-flash',
          },
        }),
      },
    },
  },
})

The upTo keys are inclusive upper bounds. OM computes the actual input token count for the Observer or Reflector call, resolves the matching tier directly, and uses that concrete model for the run.

If the input exceeds the largest configured threshold, an error is thrown — ensure your thresholds cover the full range of possible input sizes, or use a model with a sufficiently large context window at the highest tier.

Scopes

Thread scope (default)

Each thread has its own observations. This scope is well tested and works well as a general purpose memory system, especially for long horizon agentic use-cases.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      scope: 'thread',
    },
  },
})

Thread scope requires a valid threadId to be provided when calling the agent. If threadId is missing, Observational Memory throws an error. This prevents multiple threads from silently sharing a single observation record, which can cause database deadlocks.

Resource scope (experimental)

Observations are shared across all threads for a resource (typically a user). Enables cross-conversation memory.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      scope: 'resource',
    },
  },
})

Resource scope works, however it’s marked as experimental for now until we prove task adherence/continuity across multiple ongoing simultaneous threads. As of today, you may need to tweak your system prompt to prevent one thread from continuing the work that another had already started (but hadn’t finished).

This is because in resource scope, each thread is a perspective on all threads for the resource.

For your use-case this may not be a problem, so your mileage may vary.

Warning: In resource scope, unobserved messages across all threads are processed together. For users with many existing threads, this can be slow. Use thread scope for existing apps.

Token budgets

OM uses token thresholds to decide when to observe and reflect. See token budget configuration for details.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        // when to run the Observer (default: 30,000)
        messageTokens: 30_000,
      },
      reflection: {
        // when to run the Reflector (default: 40,000)
        observationTokens: 40_000,
      },
      // let message history borrow from observation budget
      // requires bufferTokens: false (temporary limitation)
      shareTokenBudget: false,
    },
  },
})
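If you want message history to borrow from the observation budget, a sketch under the stated limitation (async buffering must be disabled):

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        bufferTokens: false, // shareTokenBudget currently requires synchronous observation
      },
      shareTokenBudget: true,
    },
  },
})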

Token counting cache

OM caches token estimates in message metadata to reduce repeat counting work during threshold checks and buffering decisions.

  • Per-part estimates are stored on part.providerMetadata.mastra and reused on subsequent passes when the cache version/tokenizer source matches.
  • For string-only message content (without parts), OM uses a message-level metadata fallback cache.
  • Message and conversation overhead are still recalculated on every pass. The cache only stores payload estimates, so counting semantics stay the same.
  • data-* and reasoning parts are still skipped and aren’t cached.

Async buffering

Without async buffering, the Observer runs synchronously when the message threshold is reached — the agent pauses mid-conversation while the Observer LLM call completes. With async buffering (enabled by default), observations are pre-computed in the background as the conversation grows. When the threshold is hit, buffered observations activate instantly with no pause.

How it works

As the agent converses, message tokens accumulate. At regular intervals (bufferTokens), a background Observer call runs without blocking the agent. Each call produces a “chunk” of observations that’s stored in a buffer.

When message tokens reach the messageTokens threshold, buffered chunks activate: their observations move into the active observation log, and the corresponding raw messages are removed from the context window. The agent never pauses.

Buffered observations also include continuation hints — a suggested next response and the current task — so the main agent maintains conversational continuity after activation shrinks the context window.

If the agent produces messages faster than the Observer can process them, a blockAfter safety threshold forces a synchronous observation as a last resort. Buffered activation still preserves a minimum remaining context (the smaller of ~1k tokens or the configured retention floor).

Reflection works similarly — the Reflector runs in the background when observations reach a fraction of the reflection threshold.

Settings

  • observation.bufferTokens (default: 0.2): How often to buffer. 0.2 means every 20% of messageTokens; with the default 30k threshold, that’s roughly every 6k tokens. Can also be an absolute token count (e.g. 5000).
  • observation.bufferActivation (default: 0.8): How aggressively to clear the message window on activation. 0.8 means remove enough messages to keep only 20% of messageTokens remaining. Lower values keep more message history.
  • observation.blockAfter (default: 1.2): Safety threshold as a multiplier of messageTokens. At 1.2, synchronous observation is forced at 36k tokens (1.2 × 30k). Only matters if buffering can’t keep up.
  • activateAfterIdle (default: none): Forces buffered observations to activate after a period of inactivity, even before observation.messageTokens is reached. Accepts a numeric millisecond value such as 300_000, or duration strings like "5m" or "1hr". Set this to your prompt cache TTL if you want activation to happen before the next cold prompt.
  • activateOnProviderChange (default: false): Forces buffered observations to activate when the next step uses a different provider/model than the one that produced the latest assistant step. Use this when switching providers or models would invalidate prompt cache reuse.
  • reflection.bufferActivation (default: 0.5): When to start background reflection. 0.5 means reflection begins when observations reach 50% of the observationTokens threshold.
  • reflection.activateAfterIdle (default: none): Opts buffered reflections into idle activation. Reflections don’t inherit top-level activateAfterIdle.
  • reflection.activateOnProviderChange (default: false): Opts buffered reflections into provider-change activation. Reflections don’t inherit top-level activateOnProviderChange.
  • reflection.blockAfter (default: 1.2): Safety threshold for reflection, same logic as observation.
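For example, a sketch that buffers more often and keeps more raw history on activation (values are illustrative, not recommendations):

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        bufferTokens: 0.1,     // buffer roughly every 10% of messageTokens (~3k with the 30k default)
        bufferActivation: 0.6, // keep ~40% of messageTokens as raw history after activation
        blockAfter: 1.5,       // only force synchronous observation at 45k tokens (1.5 × 30k)
      },
      reflection: {
        bufferActivation: 0.5, // start background reflection at 50% of observationTokens
      },
    },
  },
})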

If you’re relying on prompt caching, set activateAfterIdle to match your cache TTL. That way, once a thread has been idle long enough for the cache to expire, the next request can activate buffered observations first and send a smaller compressed context window.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      activateAfterIdle: '5m',
      activateOnProviderChange: true,
    },
  },
})

With a 5-minute prompt cache TTL, this activates buffered observations after 5 minutes of inactivity so the next uncached prompt uses compressed observations instead of a larger raw message window. If you prefer, 300_000 works the same way.

Changing model or providers mid-thread will invalidate the prompt cache. If your agent can switch between providers or models mid-thread, activateOnProviderChange: true forces buffered observations to activate before the new provider runs. That avoids sending a large raw window to a provider that can’t reuse the previous prompt cache.

Disabling

To disable async buffering and use synchronous observation/reflection instead:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        bufferTokens: false,
      },
    },
  },
})

Setting bufferTokens: false disables both observation and reflection async buffering. See async buffering configuration for the full API.

Note: Async buffering isn’t supported with scope: 'resource'. It’s automatically disabled in resource scope.

Observer Context Optimization

By default, the Observer receives the full observation history as context when processing new messages. The Observer also receives prior current-task and suggested-response metadata (when available), so it can stay oriented even when observation context is truncated. For long-running conversations where observations grow large, you can opt into context optimization to reduce Observer input costs.

Set observation.previousObserverTokens to limit how many tokens of previous observations are sent to the Observer. Observations are tail-truncated, keeping the most recent entries. When a buffered reflection is pending, the already-reflected lines are automatically replaced with the reflection summary before truncation is applied.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        previousObserverTokens: 10_000, // keep only ~10k tokens of recent observations
      },
    },
  },
})

  • previousObserverTokens: 2000 → default; keeps ~2k tokens of recent observations.
  • previousObserverTokens: 0 → omit previous observations completely.
  • previousObserverTokens: false → disable truncation and keep full previous observations.

Migrating existing threads

No manual migration needed. OM reads existing messages and observes them lazily when thresholds are exceeded.

  • Thread scope: The first time a thread exceeds observation.messageTokens, the Observer processes the backlog.
  • Resource scope: All unobserved messages across all threads for a resource are processed together. For users with many existing threads, this could take significant time.

Comparing OM with other memory features

  • Message history: High-fidelity record of the current conversation
  • Working memory: Small, structured state (JSON or markdown) for user preferences, names, goals
  • Semantic Recall: RAG-based retrieval of relevant past messages

If you’re using working memory to store conversation summaries or ongoing state that grows over time, OM is a better fit. Working memory is for small, structured data; OM is for long-running event logs. OM also manages message history automatically—the messageTokens setting controls how much raw history remains before observation runs.

In practical terms, OM replaces both working memory and message history, and has greater accuracy (and lower cost) than Semantic Recall.
