Context Windows in Claude API: Token Management Strategies and Limits
Trust: ★★★☆☆ (0.90) · 0 validations · interpretive
Published: 2026-05-09 · Source: crawler_authoritative
Situation
Developers building AI applications with Claude API who need to understand context window mechanics and manage long conversations efficiently.
Insight
Context windows represent the working memory available to Claude when generating responses, encompassing all text in the exchange including the response itself (up to 1M tokens for Claude 4 models). Context awareness in Claude 4.5+ models enables dynamic token budget tracking via XML tags such as `<budget:token_budget>1000000</budget:token_budget>` at conversation start and `<system_warning>Token usage: 35000/1000000; 965000 remaining</system_warning>` after tool calls. However, as token count grows, accuracy and recall degrade, a phenomenon known as context rot, making curation of context just as important as available capacity. Progressive token accumulation preserves previous turns completely, creating linear growth patterns. Extended thinking tokens count toward limits and are billed as output tokens, but the API automatically strips previous thinking blocks from context calculations in subsequent turns: effective context = (input_tokens - previous_thinking_tokens) + current_turn_tokens. When combining extended thinking with tool use, the thinking block from the initial turn MUST be returned with tool results to preserve reasoning continuity; failing to include the unmodified thinking block triggers an error due to cryptographic signature verification.
Action
1. Use the token counting API before sending messages to estimate usage and stay within limits.
2. For tool use with extended thinking: always return the complete, unmodified thinking block when posting tool results; this is the only case where thinking blocks must be manually returned.
3. After tool results (the third step), thinking blocks can be dropped since Claude has completed the tool use cycle; the API strips them automatically if you pass them back.
4. For long-running conversations: implement server-side compaction (available in beta for Claude Opus 4.7, Opus 4.6, Claude Sonnet 4.6, and Claude Mythos Preview).
5. For specialized needs: use context editing for tool result clearing and thinking block clearing.
6. Newer models (Claude Sonnet 3.7+) return validation errors when exceeding limits rather than silently truncating; handle these errors explicitly.
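Step 1 above can be sketched as a pre-flight check. This assumes the official `anthropic` Python SDK's token counting endpoint; the limit constant and helper names are our own illustration:

```python
# Sketch: pre-flight token check before sending a message. Assumes the
# official `anthropic` SDK's token counting endpoint; CONTEXT_LIMIT and
# the helper names are illustrative, not part of the API.
CONTEXT_LIMIT = 200_000  # use 1_000_000 for 1M-context models

def fits_in_context(input_tokens: int, max_tokens: int,
                    limit: int = CONTEXT_LIMIT) -> bool:
    # The prompt plus the requested output budget must fit in the window.
    return input_tokens + max_tokens <= limit

def preflight(client, model: str, messages: list, max_tokens: int) -> bool:
    # `client` is assumed to be an anthropic.Anthropic instance;
    # count_tokens returns an object with an `input_tokens` field.
    count = client.messages.count_tokens(model=model, messages=messages)
    return fits_in_context(count.input_tokens, max_tokens)
```

Running the check before `messages.create` avoids paying for a request that a newer model would reject with a validation error anyway.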
Result
Effective context management enables long-running conversations and agentic workflows beyond basic context limits while maintaining accuracy. Claude achieves state-of-the-art results on long-context retrieval benchmarks like MRCR and GraphWalks, but gains depend on context curation quality.
Applicability
Claude 4 models (Opus 4.7, Opus 4.6, Sonnet 4.6, Claude Mythos Preview) have a 1M-token context window. Other models, including Sonnet 4.5 and Sonnet 4, have a 200k-token window. Image/PDF limits: up to 600 pages per request for 1M-context models, 100 pages for 200k-context models. Interleaved thinking (thinking between tool calls) is only supported in Claude 4 models; Sonnet 3.7 does not support interleaving without a non-tool_result user turn in between.
Original content
Context windows
As conversations grow, you’ll eventually approach context window limits. This guide explains how context windows work and introduces strategies for managing them effectively.
For long-running conversations and agentic workflows, server-side compaction is the primary strategy for context management. For more specialized needs, context editing offers additional strategies like tool result clearing and thinking block clearing.
Understanding the context window
The “context window” refers to all the text a language model can reference when generating a response, including the response itself. This is different from the large corpus of data the language model was trained on, and instead represents a “working memory” for the model. A larger context window allows the model to handle more complex and lengthy prompts, but more context isn’t automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot. This makes curating what’s in context just as important as how much space is available.
Claude achieves state-of-the-art results on long-context retrieval benchmarks like MRCR and GraphWalks, but these gains depend on what’s in context, not just how much fits.
The diagram below illustrates the standard context window behavior for API requests¹:
¹ For chat interfaces, such as claude.ai, context windows can also be set up on a rolling "first in, first out" basis.
- Progressive token accumulation: As the conversation advances through turns, each user message and assistant response accumulates within the context window. Previous turns are preserved completely.
- Linear growth pattern: The context usage grows linearly with each turn, with previous turns preserved completely.
- Context window capacity: The total available context window (up to 1M tokens) represents the maximum capacity for storing conversation history and generating new output from Claude.
- Input-output flow: Each turn consists of:
- Input phase: Contains all previous conversation history plus the current user message
- Output phase: Generates a text response that becomes part of a future input
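The accumulation pattern above can be illustrated with a toy calculation (the turn sizes are invented for illustration; real token counts come from the API's usage fields):

```python
# Toy simulation of progressive token accumulation across turns.
# Numbers are illustrative, not API output.
def context_after_turns(turn_tokens: list) -> list:
    """Given (user_tokens, assistant_tokens) pairs per turn, return the
    cumulative context size after each turn: previous turns are preserved
    completely, so usage grows linearly."""
    sizes, total = [], 0
    for user_toks, assistant_toks in turn_tokens:
        total += user_toks + assistant_toks  # nothing is dropped
        sizes.append(total)
    return sizes

print(context_after_turns([(500, 800), (300, 900), (400, 700)]))
# Each entry includes all previous turns: [1300, 2500, 3600]
```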
The context window with extended thinking
When using extended thinking, all input and output tokens, including the tokens used for thinking, count toward the context window limit, with a few nuances in multi-turn situations.
The thinking budget tokens are a subset of your max_tokens parameter, are billed as output tokens, and count towards rate limits. With adaptive thinking, Claude dynamically decides its thinking allocation, so actual thinking token usage may vary per request.
However, previous thinking blocks are automatically stripped from the context window calculation by the Claude API and are not part of the conversation history that the model “sees” for subsequent turns, preserving token capacity for actual conversation content.
The diagram below demonstrates the specialized token management when extended thinking is enabled:
- Stripping extended thinking: Extended thinking blocks (shown in dark gray) are generated during each turn’s output phase, but are not carried forward as input tokens for subsequent turns. You do not need to strip the thinking blocks yourself. The Claude API automatically does this for you if you pass them back.
- Technical implementation details:
- The API automatically excludes thinking blocks from previous turns when you pass them back as part of the conversation history.
- Extended thinking tokens are billed as output tokens only once, during their generation.
- The effective context window calculation becomes:
`context_window = (input_tokens - previous_thinking_tokens) + current_turn_tokens`
- Thinking tokens include `thinking` blocks.
This architecture is token efficient and allows for extensive reasoning without token waste, as thinking blocks can be substantial in length.
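The effective-context formula above can be sketched with illustrative numbers:

```python
# Sketch of the effective context calculation with extended thinking:
# previous turns' thinking tokens are stripped by the API, so they do not
# count against the window in later turns. Numbers below are illustrative.
def effective_context(input_tokens: int, previous_thinking_tokens: int,
                      current_turn_tokens: int) -> int:
    # context_window = (input_tokens - previous_thinking_tokens) + current_turn_tokens
    return (input_tokens - previous_thinking_tokens) + current_turn_tokens

# A 50k-token history containing 20k thinking tokens, plus a 5k current
# turn, occupies only 35k of the window:
assert effective_context(50_000, 20_000, 5_000) == 35_000
```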
The context window with extended thinking and tool use
The diagram below illustrates the context window token management when combining extended thinking with tool use:
- Considerations for tool use with extended thinking:
- When posting tool results, the entire unmodified thinking block that accompanies that specific tool request (including signature portions) must be included.
- The effective context window calculation for extended thinking with tool use becomes:
`context_window = input_tokens + current_turn_tokens`
- The system uses cryptographic signatures to verify thinking block authenticity. Failing to preserve thinking blocks during tool use can break Claude's reasoning continuity, so the API returns an error if you modify thinking blocks.
Claude Sonnet 3.7 does not support interleaved thinking, so there is no interleaving of extended thinking and tool calls without a non-tool_result user turn in between.
For more information about using tools with extended thinking, see the extended thinking guide.
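A sketch of echoing the thinking block back with a tool result, per the rule above. The content-block shapes follow the Messages API; the ids, tool name, and values are invented for illustration:

```python
# Sketch: returning the unmodified thinking block with a tool result.
# Content-block shapes follow the Messages API; ids and values are invented.
def build_tool_result_turns(assistant_content: list, tool_use_id: str,
                            result_text: str) -> list:
    return [
        # Echo the assistant turn verbatim: editing or dropping the thinking
        # block breaks cryptographic signature verification and errors out.
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": [
            {"type": "tool_result",
             "tool_use_id": tool_use_id,
             "content": result_text},
        ]},
    ]

assistant_content = [
    {"type": "thinking", "thinking": "…reasoning…", "signature": "sig_abc"},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
     "input": {"city": "Paris"}},
]
turns = build_tool_result_turns(assistant_content, "toolu_01", "18°C, cloudy")
```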
Claude Mythos Preview, Claude Opus 4.7, Claude Opus 4.6, and Claude Sonnet 4.6 have a 1M-token context window. Other Claude models, including Claude Sonnet 4.5 and Sonnet 4 (deprecated), have a 200k-token context window.
A single request can include up to 600 images or PDF pages (100 for models with a 200k-token context window). When sending many images or large documents, you may approach request size limits before the token limit.
Context awareness in Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5
Claude Sonnet 4.6, Claude Sonnet 4.5, and Claude Haiku 4.5 feature context awareness: these models track their remaining context window (their "token budget") throughout a conversation. This lets Claude execute tasks and manage context more effectively by understanding how much space it has to work with. Claude is trained to use this budget precisely, persisting in a task until the very end rather than guessing how many tokens remain. For a model, lacking context awareness is like competing in a cooking show without a clock. Claude 4.5+ models change this by explicitly informing the model of its remaining context so it can take full advantage of the available tokens.
How it works:
At the start of a conversation, Claude receives information about its total context window:
`<budget:token_budget>1000000</budget:token_budget>`

The budget is set to 1M tokens (200k for models with a smaller context window).
After each tool call, Claude receives an update on remaining capacity:
`<system_warning>Token usage: 35000/1000000; 965000 remaining</system_warning>`

This awareness helps Claude determine how much capacity remains for work and enables more effective execution on long-running tasks. Image tokens are included in these budgets.
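If your application wants to mirror the same budget client-side, the tag shown above can be parsed with a small helper. The tag format is as documented; the parsing code itself is our own illustration:

```python
import re

# Sketch: parsing the context-awareness <system_warning> tag shown above.
# The tag format follows the docs; this helper is our own illustration.
def parse_remaining(warning: str):
    """Extract (used, budget, remaining) from a <system_warning> line."""
    m = re.search(r"Token usage: (\d+)/(\d+); (\d+) remaining", warning)
    if not m:
        raise ValueError("unrecognized system_warning format")
    return tuple(int(g) for g in m.groups())

used, budget, remaining = parse_remaining(
    "<system_warning>Token usage: 35000/1000000; 965000 remaining</system_warning>")
assert (used, budget, remaining) == (35000, 1000000, 965000)
```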
Benefits:
Context awareness is particularly valuable for:
- Long-running agent sessions that require sustained focus
- Multi-context-window workflows where state transitions matter
- Complex tasks requiring careful token management
For prompting guidance on leveraging context awareness, see the prompting best practices guide.
Managing context with compaction
If your conversations regularly approach context window limits, server-side compaction is the recommended approach. Compaction provides server-side summarization that automatically condenses earlier parts of a conversation, enabling long-running conversations beyond context limits with minimal integration work. It is currently available in beta for Claude Mythos Preview, Claude Opus 4.7, Claude Opus 4.6, and Claude Sonnet 4.6.
For more specialized needs, context editing offers additional strategies:
- Tool result clearing - Clear old tool results in agentic workflows
- Thinking block clearing - Manage thinking blocks with extended thinking
Context window management with newer Claude models
Newer Claude models (starting with Claude Sonnet 3.7) return a validation error when prompt and output tokens exceed the context window, rather than silently truncating. This change provides more predictable behavior but requires more careful token management.
Use the token counting API to estimate token usage before sending messages to Claude. This helps you plan and stay within context window limits.
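The two recommendations above can be combined in a small wrapper: catch the validation error explicitly and fall back to trimming history. The SDK's `BadRequestError` is assumed to carry the limit message; the message check and trimming strategy are illustrative, not prescribed by the API:

```python
# Sketch: explicit handling of the context-limit validation error.
# Assumes the official `anthropic` SDK, whose 400 responses raise
# BadRequestError; the message check and trim strategy are illustrative.
def trim_oldest_turns(messages: list, drop: int = 2) -> list:
    # Drop the oldest user/assistant pair, keeping recent turns intact.
    return messages[drop:] if len(messages) > drop else messages

def send_with_limit_handling(client, model: str, messages: list,
                             max_tokens: int):
    import anthropic  # imported here so the sketch loads without the SDK
    try:
        return client.messages.create(
            model=model, messages=messages, max_tokens=max_tokens)
    except anthropic.BadRequestError as err:
        # Newer models reject over-long prompts instead of truncating.
        if "token" in str(err).lower() or "context" in str(err).lower():
            raise RuntimeError(
                "prompt exceeds the context window; compact or trim history"
            ) from err
        raise
```

A production version would retry with `trim_oldest_turns(messages)` (or trigger server-side compaction) instead of raising.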
See the model comparison table for a list of context window sizes by model.
Links
See also:
- [[entries/source-url-https-platform-claude-com-docs-en-build-with-claude-context-editing|Source URL: https://platform.claude.com/docs/en/build-with-claude/context-editing.md Title: Context ]]
- [[entries/source-url-https-platform-claude-com-docs-en-build-with-claude-task-budgets-md|Source URL: https://platform.claude.com/docs/en/build-with-claude/task-budgets.md Title: Task budget]]
- Effort parameter in the Claude API - Controlling response detail level
- Memory Tool - Storing and retrieving information across sessions for Claude
- Handling common errors when using Tools in the Claude API