Guide to Running Experiments on Mastra.ai to Evaluate Agents, Workflows, and Scorers
Trust: ★★★☆☆ (0.90) · 0 validations · factual
Published: 2026-05-09 · Source: crawler_authoritative
Situation
A developer needs to evaluate the performance of AI agents, workflows, or scorers by running experiments over datasets. Requirements: @mastra/[email protected] or later, an existing dataset, and items whose input is a string, string[], or CoreMessage[].
Insight
Mastra.ai lets you run experiments to evaluate AI models by processing every item in a dataset through a target (an agent, workflow, or scorer) and automatically scoring the results. Results are persisted so runs can be compared across different prompts, models, or code changes.
There are three target types: (1) Registered Agent: each item's input is passed directly to agent.generate(), so the input must be a string, string[], or CoreMessage[]; (2) Registered Workflow: receives the item's input as the workflow's trigger data; (3) Registered Scorer: used to evaluate an LLM judge against ground truth, which helps detect drift as underlying models change over time. Scorers can be referenced by registered ID (such as 'accuracy' or 'fluency') or passed as scorer instances (for example createAnswerRelevancyScorer). Each item's results include per-scorer scores with scorerName, score, and reason.
Experiments can run synchronously (startExperiment(), blocking) or asynchronously (startExperimentAsync(), fire-and-forget). Configuration options include maxConcurrency (default: 5, controls how many items run in parallel), itemTimeout (milliseconds per item, unlimited by default), maxRetries (retries failed items with exponential backoff; abort errors are never retried), AbortSignal support for cancelling an experiment (remaining items are marked as skipped), and a version parameter to pin a specific dataset snapshot.
APIs for viewing results: listExperiments() with pagination, getExperiment() for details (status, startedAt, completedAt), and listExperimentResults() for per-item results (itemId, output, error). The ExperimentSummary returns counts: succeededCount, failedCount, skippedCount, and completedWithErrors (true when the experiment completed but some items failed).
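A compact sketch of that flow, reusing the identifiers from the original content below ('translation-dataset-id', 'translation-agent', 'accuracy', and 'fluency' are placeholder IDs):
import { mastra } from '../index'

const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })

// Run every dataset item through the agent and score each output.
const summary = await dataset.startExperiment({
  name: 'baseline-run',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
})

// Each item's results carry one entry per scorer.
for (const item of summary.results) {
  for (const score of item.scores) {
    console.log(item.itemId, score.scorerName, score.score, score.reason)
  }
}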
Action
Use startExperiment() for synchronous execution (it blocks until all items complete). Configure targetType ('agent', 'workflow', or 'scorer'), targetId (a registered ID), and scorers (an array of scorer IDs or instances). Set maxConcurrency: 10 for higher parallelism (the default is 5). Set itemTimeout: 30_000 (30 seconds) and maxRetries: 2 to handle failures. Use an AbortSignal to cancel the experiment after a set amount of time. Pin a dataset version with the version: 3 parameter. For large datasets, use startExperimentAsync() to run in the background, then poll getExperiment() to check the status. Access results via listExperiments(), getExperiment(), and listExperimentResults() with pagination. Compare scores across experiments in Studio with the Compare feature. A configuration sketch follows below.
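A minimal configuration sketch combining those options; only parameters documented in the original content below are used, and the dataset ID, agent ID, and experiment name are placeholders:
import { mastra } from '../index'

const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })

// Cancel the run if it takes longer than two minutes; remaining items are marked skipped.
const controller = new AbortController()
setTimeout(() => controller.abort(), 120_000)

const summary = await dataset.startExperiment({
  name: 'configured-run',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
  maxConcurrency: 10, // run up to 10 items in parallel (default: 5)
  itemTimeout: 30_000, // 30 seconds per item
  maxRetries: 2, // retry failed items with exponential backoff
  signal: controller.signal,
  version: 3, // pin to dataset version 3
})
console.log(summary.status, summary.succeededCount, summary.failedCount, summary.skippedCount)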
Result
The experiment completes with status 'completed' or 'failed'. The summary contains succeededCount (items processed successfully), failedCount (items that failed), and skippedCount (items skipped when the run is cancelled). completedWithErrors is true when the experiment finished but some items failed.
Applicability
Requires @mastra/[email protected] or later. An agent target requires each input to be a string, string[], or CoreMessage[]. Retries do not apply to abort errors. Dataset items can be pinned to a specific version when running an experiment.
Original content
Running experiments
Added in: @mastra/[email protected]
An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes.
Basic experiment
Call startExperiment() with a target and scorers:
import { mastra } from '../index'
const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })
const summary = await dataset.startExperiment({
name: 'gpt-5.1-baseline',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy', 'fluency'],
})
console.log(summary.status) // 'completed' | 'failed'
console.log(summary.succeededCount) // number of items that ran successfully
console.log(summary.failedCount) // number of items that failed
startExperiment() blocks until all items finish. For fire-and-forget execution, see async experiments.
Studio
You can also run experiments in Studio. After you’ve added a dataset item, open it and select Run Experiment and configure the target, scorers, and options.
After running an experiment, the Experiments tab shows all runs for that dataset (with status, counts, and timestamps). Select an experiment to see per-item results, scores, and execution traces.
In the Experiments tab, select Compare and choose two or more experiments to compare their scores and results side by side.
Experiment targets
You can point an experiment at a registered agent, workflow, or scorer.
Registered agent
Point to an agent registered on your Mastra instance:
const summary = await dataset.startExperiment({
name: 'agent-v2-eval',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy'],
})
Each item's input is passed directly to agent.generate(), so it must be a string, string[], or CoreMessage[].
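For illustration, these are the three input shapes such an item can use; the CoreMessage type is assumed here to come from the 'ai' package (the AI SDK), and the values are made-up examples:
import type { CoreMessage } from 'ai'

// A single prompt string
const asString: string = 'Translate "good morning" into French.'

// Several prompt strings
const asStrings: string[] = ['Translate "good morning" into French.', 'Keep it informal.']

// Full chat messages
const asMessages: CoreMessage[] = [
  { role: 'system', content: 'You are a French translator.' },
  { role: 'user', content: 'Translate "good morning" into French.' },
]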
Registered workflow
Point to a workflow registered on your Mastra instance:
const summary = await dataset.startExperiment({
name: 'workflow-eval',
targetType: 'workflow',
targetId: 'translation-workflow',
scorers: ['accuracy'],
})
The workflow receives each item's input as its trigger data.
Registered scorer
Point to a scorer to evaluate an LLM judge against ground truth:
const summary = await dataset.startExperiment({
name: 'judge-accuracy-eval',
targetType: 'scorer',
targetId: 'accuracy',
})
The scorer receives each item's input and groundTruth. LLM-based judges can drift over time as underlying models change, so it's important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift.
Scoring results
Scorers automatically run after each item’s target execution. Pass scorer instances or registered scorer IDs:
Scorer IDs:
// Reference scorers registered on the Mastra instance
const summary = await dataset.startExperiment({
name: 'with-registered-scorers',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy', 'fluency'],
})
Scorer instances:
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt'
const relevancy = createAnswerRelevancyScorer({ model: 'openai/gpt-5-mini' })
const summary = await dataset.startExperiment({
name: 'with-scorer-instances',
targetType: 'agent',
targetId: 'translation-agent',
scorers: [relevancy],
})
Each item's results include per-scorer scores:
for (const item of summary.results) {
console.log(item.itemId, item.output)
for (const score of item.scores) {
console.log(` ${score.scorerName}: ${score.score} — ${score.reason}`)
}
}
Info: Visit the Scorers overview for details on available and custom scorers.
Async experiments
startExperiment() blocks until every item completes. For long-running datasets, use startExperimentAsync() to start the experiment in the background:
const { experimentId, status } = await dataset.startExperimentAsync({
name: 'large-dataset-run',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy'],
})
console.log(experimentId) // UUID
console.log(status) // 'pending'
Poll for completion using getExperiment():
let experiment = await dataset.getExperiment({ experimentId })
while (experiment.status === 'pending' || experiment.status === 'running') {
await new Promise(resolve => setTimeout(resolve, 5000))
experiment = await dataset.getExperiment({ experimentId })
}
console.log(experiment.status) // 'completed' | 'failed'
Configuration options
Concurrency
Control how many items run in parallel (default: 5):
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
maxConcurrency: 10,
})
Timeouts and retries
Set a per-item timeout (in milliseconds) and retry count:
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
itemTimeout: 30_000, // 30 seconds per item
maxRetries: 2, // retry failed items up to 2 times
})
Retries use exponential backoff. Abort errors are never retried.
Aborting an experiment
Pass an AbortSignal to cancel a running experiment:
const controller = new AbortController()
// Cancel after 60 seconds
setTimeout(() => controller.abort(), 60_000)
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
signal: controller.signal,
})
Remaining items are marked as skipped in the summary.
Pinning a dataset version
Run against a specific snapshot of the dataset:
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
version: 3, // use items from dataset version 3
})
Viewing results
Listing experiments
const { experiments, pagination } = await dataset.listExperiments({
page: 0,
perPage: 10,
})
for (const exp of experiments) {
console.log(`${exp.name} — ${exp.status} (${exp.succeededCount}/${exp.totalItems})`)
}
Experiment details
const experiment = await dataset.getExperiment({
experimentId: 'exp-abc-123',
})
console.log(experiment.status)
console.log(experiment.startedAt)
console.log(experiment.completedAt)
Item-level results
const { results, pagination } = await dataset.listExperimentResults({
experimentId: 'exp-abc-123',
page: 0,
perPage: 50,
})
for (const result of results) {
console.log(result.itemId, result.output, result.error)
}
Understanding the summary
startExperiment() returns an ExperimentSummary with counts and per-item results:
- completedWithErrors is true when the experiment finished but some items failed.
- Items cancelled via signal appear in skippedCount.
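As a sketch of reacting to those fields (only fields shown above are used; the experiment configuration is a placeholder):
const summary = await dataset.startExperiment({
  name: 'summary-check',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy'],
})

const total = summary.succeededCount + summary.failedCount + summary.skippedCount

if (summary.completedWithErrors) {
  // The run finished, but some items failed; inspect them with listExperimentResults().
  console.warn(`${summary.failedCount} of ${total} items failed`)
}
if (summary.skippedCount > 0) {
  // Items skipped because the run was cancelled via an AbortSignal.
  console.warn(`${summary.skippedCount} of ${total} items were skipped`)
}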
Info: Visit the startExperiment reference for the full parameter and return type documentation.
Links
- Platform: Mastra
- Source: https://mastra.ai/docs/evals/datasets/running-experiments
See also: