Fast Mode (Beta) - Claude Opus 4.6 API: 2.5x Faster Output Speed
Trust: ★★★☆☆ (0.90) · 0 validations · factual
Published: 2026-05-09 · Source: crawler_authoritative
Situation
Developers need faster API responses for latency-sensitive and agentic workloads on the Claude platform.
Insight
Fast Mode is a beta feature (research preview) for Claude Opus 4.6 (model: claude-opus-4-6) that increases output token generation speed by up to 2.5x over standard speed. It runs the same model with a faster inference configuration; there is no change to intelligence or capabilities. The benefit is concentrated in output tokens per second (OTPS), not time to first token (TTFT). Fast mode supports Zero Data Retention (ZDR): data is not stored after the API response is returned. Access requires joining the waitlist at claude.com/fast-mode, as availability is still limited. Fast mode has a dedicated rate limit, separate from standard Opus rate limits; when it is exceeded, the API returns HTTP 429 with a retry-after header. Responses include the headers anthropic-fast-input-tokens-limit, anthropic-fast-input-tokens-remaining, anthropic-fast-input-tokens-reset, anthropic-fast-output-tokens-limit, anthropic-fast-output-tokens-remaining, and anthropic-fast-output-tokens-reset. The response usage object has a speed field indicating which speed was used (fast or standard).
Action
Add the parameter speed: "fast" to the API request along with the beta header "anthropic-beta: fast-mode-2026-02-01". To implement a fallback from fast to standard speed on rate limits: set max_retries=0 on the initial fast request, catch the RateLimitError (429), remove the speed parameter, and retry. Note: switching between fast and standard speed invalidates the prompt cache (cached prefixes are not shared between speeds). By default the SDKs automatically retry up to 2 times on a 429. The API returns HTTP 429 with a retry-after header; the delay is typically short thanks to continuous token replenishment.
Result
Token generation up to 2.5x faster for latency-sensitive applications.
Conditions
Fast mode is only available on Claude Opus 4.6; sending speed: fast with any other model returns an error. Not available with the Batch API. Not available with Priority Tier. Pricing is 6x standard Opus rates: $30/MTok input, $150/MTok output, applied across the full context window (including requests over 200k input tokens). Prompt caching multipliers and data residency multipliers apply on top of fast mode pricing.
Original content
Fast mode (beta: research preview)
Higher output speed for Claude Opus 4.6, delivering significantly faster token generation for latency-sensitive and agentic workflows.
Fast mode provides significantly faster output token generation for Claude Opus 4.6. By setting speed: "fast" in your API request, you get up to 2.5x higher output tokens per second from the same model at premium pricing.
Supported models
Fast mode is supported on the following models:
- Claude Opus 4.6 (claude-opus-4-6)
How fast mode works
Fast mode runs the same model with a faster inference configuration. There is no change to intelligence or capabilities.
- Up to 2.5x higher output tokens per second compared to standard speed
- Speed benefits are focused on output tokens per second (OTPS), not time to first token (TTFT)
- Same model weights and behavior (not a different model)
Basic usage
```shell
ant beta:messages create \
  --beta fast-mode-2026-02-01 \
  --transform 'content.0.text' --format yaml <<'YAML'
model: claude-opus-4-6
max_tokens: 4096
speed: fast
messages:
  - role: user
    content: Refactor this module to use dependency injection
YAML
```

```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    speed="fast",
    betas=["fast-mode-2026-02-01"],
    messages=[
        {"role": "user", "content": "Refactor this module to use dependency injection"}
    ],
)
print(response.content[0].text)
```

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.beta.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4096,
  speed: "fast",
  betas: ["fast-mode-2026-02-01"],
  messages: [
    {
      role: "user",
      content: "Refactor this module to use dependency injection"
    }
  ]
});

const textBlock = response.content.find(
  (block): block is Anthropic.Beta.Messages.BetaTextBlock => block.type === "text"
);
console.log(textBlock?.text);
```

```csharp
using Anthropic;
using Anthropic.Models.Beta.Messages;

AnthropicClient client = new();

var response = await client.Beta.Messages.Create(new MessageCreateParams
{
    Model = "claude-opus-4-6",
    MaxTokens = 4096,
    Speed = Speed.Fast,
    Betas = ["fast-mode-2026-02-01"],
    Messages = [
        new() { Role = Role.User, Content = "Refactor this module to use dependency injection" }
    ],
});
Console.WriteLine(response);
```

```go
package main

import (
	"context"
	"fmt"
	"log"

	anthropic "github.com/anthropics/anthropic-sdk-go"
)

func main() {
	client := anthropic.NewClient()
	response, err := client.Beta.Messages.New(context.TODO(), anthropic.BetaMessageNewParams{
		Model:     anthropic.ModelClaudeOpus4_6,
		MaxTokens: 4096,
		Speed:     anthropic.BetaMessageNewParamsSpeedFast,
		Betas:     []anthropic.AnthropicBeta{anthropic.AnthropicBetaFastMode2026_02_01},
		Messages: []anthropic.BetaMessageParam{
			anthropic.NewBetaUserMessage(anthropic.NewBetaTextBlock("Refactor this module to use dependency injection")),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(response.Content[0].AsText().Text)
}
```

```java
import com.anthropic.client.AnthropicClient;
import com.anthropic.client.okhttp.AnthropicOkHttpClient;
import com.anthropic.models.beta.AnthropicBeta;
import com.anthropic.models.beta.messages.BetaMessage;
import com.anthropic.models.beta.messages.MessageCreateParams;
import com.anthropic.models.messages.Model;

void main() {
    AnthropicClient client = AnthropicOkHttpClient.fromEnv();
    BetaMessage response = client.beta().messages().create(
        MessageCreateParams.builder()
            .model(Model.CLAUDE_OPUS_4_6)
            .maxTokens(4096L)
            .speed(MessageCreateParams.Speed.FAST)
            .addBeta(AnthropicBeta.FAST_MODE_2026_02_01)
            .addUserMessage("Refactor this module to use dependency injection")
            .build());
    IO.println(response.content().get(0).text().get().text());
}
```

```php
<?php
use Anthropic\Client;

$client = new Client();

$response = $client->beta->messages->create(
    model: 'claude-opus-4-6',
    maxTokens: 4096,
    speed: 'fast',
    betas: ['fast-mode-2026-02-01'],
    messages: [
        ['role' => 'user', 'content' => 'Refactor this module to use dependency injection'],
    ],
);
echo $response->content[0]->text;
```

```ruby
require "anthropic"

client = Anthropic::Client.new

response = client.beta.messages.create(
  model: "claude-opus-4-6",
  max_tokens: 4096,
  speed: "fast",
  betas: ["fast-mode-2026-02-01"],
  messages: [{role: "user", content: "Refactor this module to use dependency injection"}]
)
puts response.content[0].text
```

Pricing
Fast mode is priced at 6x standard Opus rates across the full context window, including requests over 200k input tokens. The following table shows pricing for Claude Opus 4.6 with fast mode:
| Input | Output |
|---|---|
| $30 / MTok | $150 / MTok |
Fast mode pricing stacks with other pricing modifiers:
- Prompt caching multipliers apply on top of fast mode pricing
- Data residency multipliers apply on top of fast mode pricing
For complete pricing details, see the pricing page.
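As a rough illustration of the rates above, per-request cost can be estimated from the usage object's token counts. This is a minimal sketch, not an official billing formula: the constants are the fast mode prices quoted above, the helper name is made up, and prompt caching and data residency multipliers are deliberately not included.

```python
# Fast mode prices quoted above: $30 / MTok input, $150 / MTok output.
FAST_INPUT_PER_MTOK = 30.0
FAST_OUTPUT_PER_MTOK = 150.0

def estimate_fast_mode_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one fast mode request (base rates only)."""
    return (
        input_tokens / 1_000_000 * FAST_INPUT_PER_MTOK
        + output_tokens / 1_000_000 * FAST_OUTPUT_PER_MTOK
    )

# Token counts from the sample usage object shown later in this page:
print(round(estimate_fast_mode_cost(523, 1842), 4))  # 0.292
```

Since the 6x multiplier applies across the full context window, long-context requests over 200k input tokens are billed at these same rates.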
Rate limits
Fast mode has a dedicated rate limit that is separate from standard Opus rate limits. When your fast mode rate limit is exceeded, the API returns a 429 error with a retry-after header indicating when capacity will be available.
The response includes headers that indicate your fast mode rate limit status:
| Header | Description |
|---|---|
| anthropic-fast-input-tokens-limit | Maximum fast mode input tokens per minute |
| anthropic-fast-input-tokens-remaining | Remaining fast mode input tokens |
| anthropic-fast-input-tokens-reset | Time when the fast mode input token limit resets |
| anthropic-fast-output-tokens-limit | Maximum fast mode output tokens per minute |
| anthropic-fast-output-tokens-remaining | Remaining fast mode output tokens |
| anthropic-fast-output-tokens-reset | Time when the fast mode output token limit resets |
For tier-specific rate limits, see the rate limits page.
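Reading these headers can be sketched as a small helper, assuming you have access to the raw HTTP response headers as a dict. The header names come from the table above; the helper itself (and the sample values) are illustrative, not part of any SDK.

```python
def fast_mode_limit_status(headers: dict) -> dict:
    """Extract fast mode rate limit state from response headers."""
    def _int(name):
        value = headers.get(name)
        return int(value) if value is not None else None

    return {
        "input_limit": _int("anthropic-fast-input-tokens-limit"),
        "input_remaining": _int("anthropic-fast-input-tokens-remaining"),
        "output_limit": _int("anthropic-fast-output-tokens-limit"),
        "output_remaining": _int("anthropic-fast-output-tokens-remaining"),
    }

# Hypothetical header values for illustration:
status = fast_mode_limit_status({
    "anthropic-fast-input-tokens-limit": "400000",
    "anthropic-fast-input-tokens-remaining": "361000",
    "anthropic-fast-output-tokens-limit": "80000",
    "anthropic-fast-output-tokens-remaining": "78000",
})
print(status["input_remaining"])  # 361000
```

A client could use the remaining-token values to proactively throttle before hitting a 429.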
Checking which speed was used
The response usage object includes a speed field that indicates which speed was used, either "fast" or "standard":
```shell
ant beta:messages create --beta fast-mode-2026-02-01 \
  --transform usage.speed --format yaml <<'YAML'
model: claude-opus-4-6
max_tokens: 1024
speed: fast
messages:
  - role: user
    content: Hello
YAML
```

```python
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    speed="fast",
    betas=["fast-mode-2026-02-01"],
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.usage.speed)  # "fast" or "standard"
```

```typescript
const response = await client.beta.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  speed: "fast",
  betas: ["fast-mode-2026-02-01"],
  messages: [{ role: "user", content: "Hello" }]
});
console.log(response.usage.speed); // "fast" or "standard"
```

```csharp
using Anthropic;
using Anthropic.Models.Beta.Messages;

AnthropicClient client = new();

var response = await client.Beta.Messages.Create(new MessageCreateParams
{
    Model = "claude-opus-4-6",
    MaxTokens = 1024,
    Speed = Speed.Fast,
    Betas = ["fast-mode-2026-02-01"],
    Messages = [new() { Role = Role.User, Content = "Hello" }],
});
Console.WriteLine(response.Usage.Speed); // "fast" or "standard"
```

```go
package main

import (
	"context"
	"fmt"
	"log"

	anthropic "github.com/anthropics/anthropic-sdk-go"
)

func main() {
	client := anthropic.NewClient()
	response, err := client.Beta.Messages.New(context.TODO(), anthropic.BetaMessageNewParams{
		Model:     anthropic.ModelClaudeOpus4_6,
		MaxTokens: 1024,
		Speed:     anthropic.BetaMessageNewParamsSpeedFast,
		Betas:     []anthropic.AnthropicBeta{anthropic.AnthropicBetaFastMode2026_02_01},
		Messages: []anthropic.BetaMessageParam{
			anthropic.NewBetaUserMessage(anthropic.NewBetaTextBlock("Hello")),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(response.Usage.Speed) // "fast" or "standard"
}
```

```java
import com.anthropic.client.AnthropicClient;
import com.anthropic.client.okhttp.AnthropicOkHttpClient;
import com.anthropic.models.beta.AnthropicBeta;
import com.anthropic.models.beta.messages.BetaMessage;
import com.anthropic.models.beta.messages.MessageCreateParams;
import com.anthropic.models.messages.Model;

void main() {
    AnthropicClient client = AnthropicOkHttpClient.fromEnv();
    MessageCreateParams params = MessageCreateParams.builder()
        .model(Model.CLAUDE_OPUS_4_6)
        .maxTokens(1024L)
        .speed(MessageCreateParams.Speed.FAST)
        .addBeta(AnthropicBeta.FAST_MODE_2026_02_01)
        .addUserMessage("Hello")
        .build();
    BetaMessage response = client.beta().messages().create(params);
    IO.println(response.usage().speed()); // "fast" or "standard"
}
```

```php
<?php
use Anthropic\Client;

$client = new Client();

$response = $client->beta->messages->create(
    model: 'claude-opus-4-6',
    maxTokens: 1024,
    speed: 'fast',
    betas: ['fast-mode-2026-02-01'],
    messages: [['role' => 'user', 'content' => 'Hello']],
);
echo $response->usage->speed; // "fast" or "standard"
```

```ruby
response = anthropic.beta.messages.create(
  model: "claude-opus-4-6",
  max_tokens: 1024,
  speed: "fast",
  betas: ["fast-mode-2026-02-01"],
  messages: [{ role: "user", content: "Hello" }]
)
puts(response.usage.speed) # "fast" or "standard"
```

```json
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [{ "type": "text", "text": "Hello!" }],
  "model": "claude-opus-4-6",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 523,
    "output_tokens": 1842,
    "speed": "fast"
  }
}
```

To track fast mode usage and costs across your organization, see the Usage and Cost API.
Retries and fallback
Automatic retries
When fast mode rate limits are exceeded, the API returns a 429 error with a retry-after header. The Anthropic SDKs automatically retry these requests up to 2 times by default (configurable via max_retries), waiting for the server-specified delay before each retry. Since fast mode uses continuous token replenishment, the retry-after delay is typically short and requests succeed once capacity is available.
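The retry behavior described above can be sketched as a plain loop that honors the server-specified delay. This is illustrative only: the real SDKs implement this internally, and the send callable here stands in for an actual API call, returning a made-up (status, retry_after, body) tuple.

```python
import time

def send_with_retries(send, max_retries=2, sleep=time.sleep):
    """Call send(); on a 429-style result, wait retry-after seconds and retry."""
    for attempt in range(max_retries + 1):
        status, retry_after, body = send()
        if status != 429:
            return body
        if attempt < max_retries:
            sleep(retry_after)  # server-specified delay, typically short
    raise RuntimeError("rate limited after retries")

# Fake transport: rate-limited once, then succeeds.
calls = []
def fake_send():
    calls.append(1)
    if len(calls) == 1:
        return 429, 1, None
    return 200, None, "ok"

print(send_with_retries(fake_send, sleep=lambda s: None))  # ok
```

Because fast mode replenishes tokens continuously, a single retry after the retry-after delay is usually enough, which is why the SDK default of 2 retries works well here.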
Falling back to standard speed
If you’d prefer to fall back to standard speed rather than wait for fast mode capacity, catch the rate limit error and retry without speed: "fast". Set max_retries to 0 on the initial fast request to skip automatic retries and fail immediately on rate limit errors.
Since setting max_retries to 0 also disables retries for other transient errors (overloaded, internal server errors), the examples below re-issue the original request with default retries for those cases.
```shell
MESSAGE=$(
  create_message_with_fast_fallback fast <<'YAML'
model: claude-opus-4-6
max_tokens: 1024
messages:
  - role: user
    content: Hello
YAML
)
```

```python
import anthropic

client = anthropic.Anthropic()

def create_message_with_fast_fallback(max_retries=None, max_attempts=3, **params):
    try:
        return client.beta.messages.create(**params, max_retries=max_retries)
    except anthropic.RateLimitError:
        if params.get("speed") == "fast":
            del params["speed"]
            return create_message_with_fast_fallback(**params)
        raise
    except (
        anthropic.APIStatusError,
        anthropic.APIConnectionError,
    ) as error:
        if isinstance(error, anthropic.APIStatusError) and error.status_code < 500:
            raise
        if max_attempts > 1:
            return create_message_with_fast_fallback(
                max_attempts=max_attempts - 1, **params
            )
        raise

message = create_message_with_fast_fallback(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
    betas=["fast-mode-2026-02-01"],
    speed="fast",
    max_retries=0,
)
```
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

(async () => {
  async function createMessageWithFastFallback(
    params: Anthropic.Beta.MessageCreateParams,
    requestOptions?: Anthropic.RequestOptions,
    maxAttempts: number = 3
  ): Promise<Anthropic.Beta.Messages.BetaMessage> {
    try {
      return (await client.beta.messages.create(
        params,
        requestOptions
      )) as Anthropic.Beta.Messages.BetaMessage;
    } catch (e) {
      if (e instanceof Anthropic.RateLimitError && params.speed === "fast") {
        const { speed, ...rest } = params;
        return createMessageWithFastFallback(rest);
      }
      if (
        e instanceof Anthropic.InternalServerError ||
        e instanceof Anthropic.APIConnectionError
      ) {
        if (maxAttempts > 1) {
          return createMessageWithFastFallback(params, undefined, maxAttempts - 1);
        }
      }
      throw e;
    }
  }

  const message = await createMessageWithFastFallback(
    {
      model: "claude-opus-4-6",
      max_tokens: 1024,
      messages: [{ role: "user", content: "Hello" }],
      betas: ["fast-mode-2026-02-01"],
      speed: "fast"
    },
    { maxRetries: 0 }
  );
})();
```

```csharp
using Anthropic;
using Anthropic.Exceptions;
using Anthropic.Models.Beta.Messages;

AnthropicClient client = new();

async Task<BetaMessage> CreateMessageWithFastFallback(
    MessageCreateParams parameters,
    int? maxRetries = null,
    int maxAttempts = 3)
{
    try
    {
        var requestClient = maxRetries is int retries
            ? client.WithOptions(options => options with { MaxRetries = retries })
            : client;
        return await requestClient.Beta.Messages.Create(parameters);
    }
    catch (AnthropicRateLimitException)
    {
        if (parameters.Speed is not null)
        {
            return await CreateMessageWithFastFallback(
                parameters with { Speed = null });
        }
        throw;
    }
    catch (Anthropic5xxException)
    {
        if (maxAttempts > 1)
        {
            return await CreateMessageWithFastFallback(
                parameters, maxAttempts: maxAttempts - 1);
        }
        throw;
    }
}

var message = await CreateMessageWithFastFallback(
    new MessageCreateParams
    {
        Model = "claude-opus-4-6",
        MaxTokens = 1024,
        Messages = [new() { Role = Role.User, Content = "Hello" }],
        Betas = ["fast-mode-2026-02-01"],
        Speed = Speed.Fast,
    },
    maxRetries: 0);
```

```go
package main

import (
	"context"
	"errors"
	"fmt"

	anthropic "github.com/anthropics/anthropic-sdk-go"
	"github.com/anthropics/anthropic-sdk-go/option"
)

func createMessageWithFastFallback(
	ctx context.Context,
	client *anthropic.Client,
	params anthropic.BetaMessageNewParams,
	maxAttempts int,
	opts ...option.RequestOption,
) (*anthropic.BetaMessage, error) {
	message, err := client.Beta.Messages.New(ctx, params, opts...)
	if err != nil {
		var apierr *anthropic.Error
		if errors.As(err, &apierr) && apierr.StatusCode == 429 && params.Speed != "" {
			params.Speed = ""
			return createMessageWithFastFallback(ctx, client, params, maxAttempts)
		}
		if (errors.As(err, &apierr) && apierr.StatusCode >= 500) || !errors.As(err, &apierr) {
			if maxAttempts > 1 {
				return createMessageWithFastFallback(ctx, client, params, maxAttempts-1)
			}
		}
		return nil, err
	}
	return message, nil
}

func main() {
	client := anthropic.NewClient()
	message, err := createMessageWithFastFallback(
		context.TODO(),
		&client,
		anthropic.BetaMessageNewParams{
			Model:     anthropic.ModelClaudeOpus4_6,
			MaxTokens: 1024,
			Messages: []anthropic.BetaMessageParam{
				anthropic.NewBetaUserMessage(anthropic.NewBetaTextBlock("Hello")),
			},
			Speed: "fast",
			Betas: []anthropic.AnthropicBeta{anthropic.AnthropicBetaFastMode2026_02_01},
		},
		3,
		option.WithMaxRetries(0),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(message)
}
```

```java
import com.anthropic.client.AnthropicClient;
import com.anthropic.client.okhttp.AnthropicOkHttpClient;
import com.anthropic.errors.InternalServerException;
import com.anthropic.errors.RateLimitException;
import com.anthropic.models.beta.AnthropicBeta;
import com.anthropic.models.beta.messages.BetaMessage;
import com.anthropic.models.beta.messages.MessageCreateParams;
import com.anthropic.models.messages.Model;
import java.util.Optional;

// Disable SDK auto-retry so the fallback logic below handles it
AnthropicClient client =
    AnthropicOkHttpClient.builder().fromEnv().maxRetries(0).build();

BetaMessage createMessageWithFastFallback(
        MessageCreateParams params, int maxAttempts) {
    try {
        return client.beta().messages().create(params);
    } catch (RateLimitException e) {
        if (params.speed().isPresent()) {
            MessageCreateParams retryParams = params.toBuilder()
                .speed(Optional.empty())
                .build();
            return createMessageWithFastFallback(retryParams, maxAttempts);
        }
        throw e;
    } catch (InternalServerException e) {
        if (maxAttempts > 1) {
            return createMessageWithFastFallback(params, maxAttempts - 1);
        }
        throw e;
    }
}

void main() {
    BetaMessage message = createMessageWithFastFallback(
        MessageCreateParams.builder()
            .model(Model.CLAUDE_OPUS_4_6)
            .maxTokens(1024L)
            .addUserMessage("Hello")
            .addBeta(AnthropicBeta.FAST_MODE_2026_02_01)
            .speed(MessageCreateParams.Speed.FAST)
            .build(),
        3);
    IO.println(message.content().get(0).text().get().text());
}
```

```php
<?php
use Anthropic\Client;
use Anthropic\Core\Exceptions\APIConnectionException;
use Anthropic\Core\Exceptions\InternalServerException;
use Anthropic\Core\Exceptions\RateLimitException;
use Anthropic\RequestOptions;

$client = new Client();

function createMessageWithFastFallback(
    Client $client,
    array $params,
    ?RequestOptions $requestOptions = null,
    int $maxAttempts = 3,
) {
    try {
        return $client->beta->messages->create(
            ...$params,
            requestOptions: $requestOptions,
        );
    } catch (RateLimitException $e) {
        if (isset($params['speed'])) {
            unset($params['speed']);
            return createMessageWithFastFallback($client, $params);
        }
        throw $e;
    } catch (InternalServerException | APIConnectionException $e) {
        if ($maxAttempts > 1) {
            return createMessageWithFastFallback(
                $client, $params, maxAttempts: $maxAttempts - 1
            );
        }
        throw $e;
    }
}

$message = createMessageWithFastFallback(
    $client,
    [
        'model' => 'claude-opus-4-6',
        'maxTokens' => 1024,
        'messages' => [['role' => 'user', 'content' => 'Hello']],
        'betas' => ['fast-mode-2026-02-01'],
        'speed' => 'fast',
    ],
    RequestOptions::with(maxRetries: 0),
);
```

```ruby
require "anthropic"

anthropic = Anthropic::Client.new

def create_message_with_fast_fallback(client, request_options: {}, max_attempts: 3, **params)
  client.beta.messages.create(**params, request_options: request_options)
rescue Anthropic::Errors::RateLimitError
  raise unless params[:speed] == "fast"
  params.delete(:speed)
  create_message_with_fast_fallback(client, **params)
rescue Anthropic::Errors::InternalServerError, Anthropic::Errors::APIConnectionError
  raise unless max_attempts > 1
  create_message_with_fast_fallback(client, max_attempts: max_attempts - 1, **params)
end

message = create_message_with_fast_fallback(
  anthropic,
  model: "claude-opus-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello" }],
  betas: ["fast-mode-2026-02-01"],
  speed: "fast",
  request_options: { max_retries: 0 }
)
```

Considerations
- Prompt caching: Switching between fast and standard speed invalidates the prompt cache. Requests at different speeds do not share cached prefixes.
- Supported models: Fast mode is currently supported on Opus 4.6 only. Sending speed: "fast" with an unsupported model returns an error.
- TTFT: Fast mode's benefits are focused on output tokens per second (OTPS), not time to first token (TTFT).
- Batch API: Fast mode is not available with the Batch API.
- Priority Tier: Fast mode is not available with Priority Tier.
Links
- Platform: Claude
- Source: https://platform.claude.com/docs/en/build-with-claude/fast-mode.md