Mastra Built-in Scorers for AI Evaluation
Trust: ★★★☆☆ (0.90) · 0 validations · developer_reference
Published: 2026-05-10 · Source: crawler_authoritative
Situation
Reference guide for Mastra’s built-in evaluation scorers used to assess AI output quality, accuracy, and safety across agents and workflows.
Insight
Built-in scorers
Mastra provides a comprehensive set of built-in scorers for evaluating AI outputs. These scorers are optimized for common evaluation scenarios and are ready to use in your agents and workflows.
To create your own scorers, see the Custom Scorers guide.
Available scorers
Accuracy and reliability
These scorers evaluate how correct, truthful, and complete your agent’s answers are:
- answer-relevancy: Evaluates how well responses address the input query (0-1, higher is better)
- answer-similarity: Compares agent outputs against ground-truth answers for CI/CD testing using semantic analysis (0-1, higher is better)
- faithfulness: Measures how accurately responses represent provided context (0-1, higher is better)
- hallucination: Detects factual contradictions and unsupported claims (0-1, lower is better)
- completeness: Checks if responses include all necessary information (0-1, higher is better)
- content-similarity: Measures textual similarity using character-level matching (0-1, higher is better)
- textual-difference: Measures textual differences between strings (0-1, higher means more similar)
- tool-call-accuracy: Evaluates whether the LLM selects the correct tool from available options (0-1, higher is better)
- trajectory-accuracy: Evaluates whether an agent follows the expected sequence of actions (tool calls, model generations, workflow steps, and other span types) (0-1, higher is better)
- prompt-alignment: Measures how well agent responses align with user prompt intent, requirements, completeness, and format (0-1, higher is better)
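To make the character-level metrics concrete, here is a minimal sketch of a normalized edit-distance similarity, the kind of 0-1 "higher means more similar" score that textual-difference reports. This is an illustrative assumption, not Mastra's actual implementation:

```typescript
// Illustrative sketch only — not Mastra's textual-difference scorer.
// Classic single-row Levenshtein edit distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // holds D[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // D[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalize to 0-1, where 1 means identical strings.
function textualSimilarity(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}
```

Identical strings score 1, and every character-level edit lowers the score proportionally to the longer string's length.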
Context quality
These scorers evaluate the quality and relevance of context used in generating responses:
- context-precision: Evaluates context relevance and ranking using Mean Average Precision, rewarding early placement of relevant context (0-1, higher is better)
- context-relevance: Measures context utility with nuanced relevance levels, usage tracking, and missing context detection (0-1, higher is better)
Context Scorer Selection:
- Use Context Precision when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
- Use Context Relevance when you need detailed relevance assessment and want to track context usage and identify gaps
Both context scorers support:
- Static context: Pre-defined context arrays
- Dynamic context extraction: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)
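The Mean Average Precision idea behind context-precision can be sketched in a few lines. The `averagePrecision` helper below is hypothetical and only illustrates why ranking relevant context earlier yields a higher score; it is not Mastra's implementation:

```typescript
// Illustrative sketch only — not Mastra's context-precision scorer.
// `relevant[i]` marks whether the context chunk retrieved at rank i
// was judged relevant to the query.
function averagePrecision(relevant: boolean[]): number {
  let hits = 0;
  let sum = 0;
  relevant.forEach((isRelevant, i) => {
    if (isRelevant) {
      hits += 1;
      sum += hits / (i + 1); // precision at this rank
    }
  });
  return hits === 0 ? 0 : sum / hits;
}
```

Placing the single relevant chunk first (`[true, false]`) scores 1.0, while placing it second (`[false, true]`) scores 0.5, which is exactly the "rewarding early placement" behavior described above.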
Output quality
These scorers evaluate adherence to format, style, and safety requirements:
- tone-consistency: Measures consistency in formality, complexity, and style (0-1, higher is better)
- toxicity: Detects harmful or inappropriate content (0-1, lower is better)
- bias: Detects potential biases in the output (0-1, lower is better)
- keyword-coverage: Assesses technical terminology usage (0-1, higher is better)
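As a rough illustration of what a coverage-style scorer computes, the sketch below reports the fraction of expected keywords found in an output. This is a hypothetical simplification, not Mastra's keyword-coverage implementation:

```typescript
// Illustrative sketch only — not Mastra's keyword-coverage scorer.
// Returns the fraction of expected keywords present in the output,
// using a simple case-insensitive substring check.
function keywordCoverage(output: string, keywords: string[]): number {
  if (keywords.length === 0) return 1; // nothing required, full coverage
  const text = output.toLowerCase();
  const found = keywords.filter((k) => text.includes(k.toLowerCase()));
  return found.length / keywords.length;
}
```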
Action
Import scorers from @mastra/evals and attach them to agent or workflow evaluations; for custom scorers, refer to the Custom Scorers documentation. For context quality, choose context-precision when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation), or context-relevance when you need detailed relevance assessment with usage tracking and gap identification. Both context scorers accept either pre-defined static context arrays or dynamic context extraction functions for RAG systems and vector databases. See each scorer's documentation for its specific API usage.
Result
Scorers return normalized scores in the 0-1 range. Direction varies by scorer: for answer-relevancy, answer-similarity, faithfulness, completeness, content-similarity, textual-difference, tool-call-accuracy, trajectory-accuracy, prompt-alignment, context-precision, context-relevance, tone-consistency, and keyword-coverage, higher is better; for hallucination, toxicity, and bias, lower is better.
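When aggregating mixed-direction scores into a single quality view, a common pattern is to invert the "lower is better" scorers so everything reads on one "higher is better" scale. A minimal sketch, assuming all scores are already normalized to 0-1 (this helper is illustrative, not part of @mastra/evals):

```typescript
// Illustrative sketch only — not part of @mastra/evals.
// Scorers where a LOW raw score indicates good output.
const LOWER_IS_BETTER = new Set(["hallucination", "toxicity", "bias"]);

// Map any scorer's raw 0-1 score onto a "higher is better" scale.
function asQuality(scorer: string, score: number): number {
  return LOWER_IS_BETTER.has(scorer) ? 1 - score : score;
}
```

With this, a toxicity score of 0.25 (good, since lower is better) and a faithfulness score of 0.75 both report as 0.75 on the unified scale.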
Applicability
Scorers are available in @mastra/evals package. Custom scorers require separate implementation per the Custom Scorers guide.
Links
- Platform: Dev Framework · Mastra
- Source: https://mastra.ai/docs/evals/built-in-scorers
See also: