Running experiments
Trust: ★★★☆☆ (0.90) · 0 validations · developer_reference
Published: 2026-05-10 · Source: crawler_authoritative
Situation
Guide to running experiments on Mastra datasets to evaluate agents, workflows, or LLM judges using scorers, aimed at developers building AI applications with Mastra.
Insight
An experiment runs every item in a dataset through a target (agent, workflow, or scorer) and optionally scores outputs. The startExperiment() method blocks until all items complete, while startExperimentAsync() provides fire-and-forget execution for long-running datasets. Experiment results persist to storage for comparing runs across prompts, models, or code changes.
Targets include: registered agents (items passed to agent.generate() as string, string[], or CoreMessage[]), registered workflows (items as trigger data), and registered scorers (evaluating LLM judges against ground truth, which is important for detecting model drift). Scoring supports both registered scorer IDs ('accuracy', 'fluency') and scorer instances from @mastra/evals/scorers/prebuilt. Each item result includes per-scorer scores with name, score value, and reason.
Configuration options include maxConcurrency (default 5), itemTimeout (milliseconds), maxRetries (exponential backoff, abort errors not retried), AbortSignal for cancellation, and version pinning for dataset snapshots. The ExperimentSummary return type includes status ('completed' | 'failed'), succeededCount, failedCount, completedWithErrors, and skippedCount for signal-cancelled items. Results can be listed with pagination via listExperiments() and listExperimentResults(). Studio provides a UI for running experiments, viewing per-item results with scores and execution traces, and comparing multiple experiments side-by-side.
Action
- Call dataset.startExperiment() with a target and scorers: { name: 'experiment-name', targetType: 'agent'|'workflow'|'scorer', targetId: 'target-id', scorers: ['accuracy', 'fluency'] }.
- For async execution, call startExperimentAsync(), which returns { experimentId, status }, then poll with dataset.getExperiment({ experimentId }) in a loop while the status is 'pending' or 'running'.
- Configure concurrency with maxConcurrency: 10 (default 5), per-item timeouts with itemTimeout: 30_000, and retries with maxRetries: 2. Abort with signal: controller.signal. Pin the dataset version with version: 3.
- List experiments with listExperiments({ page: 0, perPage: 10 }), get details with getExperiment({ experimentId }), and list item-level results with listExperimentResults({ experimentId, page: 0, perPage: 50 }).
A combined sketch of these options follows below.
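The following sketch combines the options above in a single call. It reuses the agent and scorer IDs from the original content below; exact option support may vary by Mastra version, so treat it as illustrative rather than canonical:
// Combined options sketch (illustrative; IDs and values are placeholders)
const controller = new AbortController()
const summary = await dataset.startExperiment({
  name: 'combined-options-example',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'], // registered scorer IDs
  maxConcurrency: 10,               // run up to 10 items in parallel (default 5)
  itemTimeout: 30_000,              // 30 seconds per item
  maxRetries: 2,                    // exponential backoff; abort errors are not retried
  signal: controller.signal,        // cancelling marks remaining items as skipped
  version: 3,                       // pin to dataset snapshot version 3
})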
Result
Returns an ExperimentSummary with status ('completed' | 'failed'), succeededCount, failedCount, and a results array. Each result includes itemId, output, error, and per-scorer scores. Async experiments return pending status until polling completes. Configuration options control concurrency, timeouts, retries, and dataset versioning.
Applies to
@mastra/[email protected]
Original content
Running experiments
Added in: @mastra/[email protected]
An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes.
Basic experiment
Call startExperiment() with a target and scorers:
import { mastra } from '../index'
const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })
const summary = await dataset.startExperiment({
name: 'gpt-5.1-baseline',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy', 'fluency'],
})
console.log(summary.status) // 'completed' | 'failed'
console.log(summary.succeededCount) // number of items that ran successfully
console.log(summary.failedCount) // number of items that failed
startExperiment() blocks until all items finish. For fire-and-forget execution, see async experiments.
Studio
You can also run experiments in Studio. After you’ve added a dataset item, open it, select Run Experiment, and configure the target, scorers, and options.
After running an experiment, the Experiments tab shows all runs for that dataset (with status, counts, and timestamps). Select an experiment to see per-item results, scores, and execution traces.
In the Experiments tab, select Compare and choose two or more experiments to compare their scores and results side by side.
Experiment targets
You can point an experiment at a registered agent, workflow, or scorer.
Registered agent
Point to an agent registered on your Mastra instance:
const summary = await dataset.startExperiment({
name: 'agent-v2-eval',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy'],
})
Each item’s input is passed directly to agent.generate(), so it must be a string, string[], or CoreMessage[].
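For illustration, valid item inputs for an agent target could take any of these shapes (values are hypothetical; the CoreMessage shape comes from the AI SDK):
// Hypothetical item inputs for an agent target
const asString = 'Translate to French: Hello, world'
const asStrings = ['Translate to French:', 'Hello, world']
const asMessages = [
  { role: 'system', content: 'You are a translator.' },
  { role: 'user', content: 'Translate to French: Hello, world' },
] // CoreMessage[]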
Registered workflow
Point to a workflow registered on your Mastra instance:
const summary = await dataset.startExperiment({
name: 'workflow-eval',
targetType: 'workflow',
targetId: 'translation-workflow',
scorers: ['accuracy'],
})
The workflow receives each item’s input as its trigger data.
Registered scorer
Point to a scorer to evaluate an LLM judge against ground truth:
const summary = await dataset.startExperiment({
name: 'judge-accuracy-eval',
targetType: 'scorer',
targetId: 'accuracy',
})
The scorer receives each item’s input and groundTruth. LLM-based judges can drift over time as underlying models change, so it’s important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift.
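For illustration only, a judge-evaluation item pairs what the judge scores with a known-good label. The exact item schema is not shown in this page, so treat these field contents as assumptions:
// Hypothetical dataset item for benchmarking an 'accuracy' judge
const judgeItem = {
  input: { query: 'Translate "hello" to French', response: 'bonjour' }, // what the judge evaluates
  groundTruth: { score: 1, reason: 'Correct translation' },             // known-good label to realign against
}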
Scoring results
Scorers automatically run after each item’s target execution. Pass scorer instances or registered scorer IDs:
Scorer IDs:
// Reference scorers registered on the Mastra instance
const summary = await dataset.startExperiment({
name: 'with-registered-scorers',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy', 'fluency'],
})
Scorer instances:
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt'
const relevancy = createAnswerRelevancyScorer({ model: 'openai/gpt-5-mini' })
const summary = await dataset.startExperiment({
name: 'with-scorer-instances',
targetType: 'agent',
targetId: 'translation-agent',
scorers: [relevancy],
})
Each item’s results include per-scorer scores:
for (const item of summary.results) {
console.log(item.itemId, item.output)
for (const score of item.scores) {
console.log(` ${score.scorerName}: ${score.score} — ${score.reason}`)
}
}
Info: Visit the Scorers overview for details on available and custom scorers.
Async experiments
startExperiment() blocks until every item completes. For long-running datasets, use startExperimentAsync() to start the experiment in the background:
const { experimentId, status } = await dataset.startExperimentAsync({
name: 'large-dataset-run',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy'],
})
console.log(experimentId) // UUID
console.log(status) // 'pending'
Poll for completion using getExperiment():
let experiment = await dataset.getExperiment({ experimentId })
while (experiment.status === 'pending' || experiment.status === 'running') {
await new Promise(resolve => setTimeout(resolve, 5000))
experiment = await dataset.getExperiment({ experimentId })
}
console.log(experiment.status) // 'completed' | 'failed'
Configuration options
Concurrency
Control how many items run in parallel (default: 5):
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
maxConcurrency: 10,
})
Timeouts and retries
Set a per-item timeout (in milliseconds) and retry count:
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
itemTimeout: 30_000, // 30 seconds per item
maxRetries: 2, // retry failed items up to 2 times
})
Retries use exponential backoff. Abort errors are never retried.
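Mastra handles this retry behavior internally; a generic sketch of the pattern it describes (not Mastra's actual implementation) looks like:
// Illustrative retry-with-exponential-backoff helper, skipping abort errors
async function withRetries<T>(fn: () => Promise<T>, maxRetries: number): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      const isAbort = err instanceof Error && err.name === 'AbortError'
      if (isAbort || attempt >= maxRetries) throw err // abort errors are never retried
      await new Promise(resolve => setTimeout(resolve, 2 ** attempt * 1000)) // 1s, 2s, 4s, ...
    }
  }
}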
Aborting an experiment
Pass an AbortSignal to cancel a running experiment:
const controller = new AbortController()
// Cancel after 60 seconds
setTimeout(() => controller.abort(), 60_000)
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
signal: controller.signal,
})
Remaining items are marked as skipped in the summary.
Pinning a dataset version
Run against a specific snapshot of the dataset:
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
version: 3, // use items from dataset version 3
})
Viewing results
Listing experiments
const { experiments, pagination } = await dataset.listExperiments({
page: 0,
perPage: 10,
})
for (const exp of experiments) {
console.log(`${exp.name} — ${exp.status} (${exp.succeededCount}/${exp.totalItems})`)
}
Experiment details
const experiment = await dataset.getExperiment({
experimentId: 'exp-abc-123',
})
console.log(experiment.status)
console.log(experiment.startedAt)
console.log(experiment.completedAt)
Item-level results
const { results, pagination } = await dataset.listExperimentResults({
experimentId: 'exp-abc-123',
page: 0,
perPage: 50,
})
for (const result of results) {
console.log(result.itemId, result.output, result.error)
}
Understanding the summary
startExperiment() returns an ExperimentSummary with counts and per-item results:
- completedWithErrors is true when the experiment finished but some items failed.
- Items cancelled via signal appear in skippedCount.
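Putting those fields together, a handling sketch for a summary returned by startExperiment() (using only the fields documented above) might look like:
// Given `summary` from a startExperiment() call as in the examples above
if (summary.status === 'failed') {
  console.error('Experiment run failed')
} else if (summary.completedWithErrors) {
  // The run finished, but some items errored; inspect them individually.
  for (const result of summary.results) {
    if (result.error) console.warn(`${result.itemId} failed: ${result.error}`)
  }
}
console.log(`Succeeded: ${summary.succeededCount}, failed: ${summary.failedCount}, skipped: ${summary.skippedCount}`)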
Info: Visit the startExperiment reference for the full parameter and return type documentation.
Links
- Platform: Dev Framework · Mastra
- Source: https://mastra.ai/docs/evals/datasets/running-experiments