Mastra Built-in Scorers for AI Evaluation
Trust: ★★★☆☆ (0.90) · 0 validations · developer_reference
Published: 2026-05-10 · Source: crawler_authoritative
Situation
Reference guide for Mastra’s built-in evaluation scorers used to assess AI output quality, accuracy, and safety across agents and workflows.
Insight
Built-in scorers
Mastra provides a comprehensive set of built-in scorers for evaluating AI outputs. These scorers are optimized for common evaluation scenarios and are ready to use in your agents and workflows.
To create your own scorers, see the Custom Scorers guide.
Available scorers
Accuracy and reliability
These scorers evaluate how correct, truthful, and complete your agent’s answers are:
- answer-relevancy: Evaluates how well responses address the input query (0-1, higher is better)
- answer-similarity: Compares agent outputs against ground-truth answers for CI/CD testing using semantic analysis (0-1, higher is better)
- faithfulness: Measures how accurately responses represent provided context (0-1, higher is better)
- hallucination: Detects factual contradictions and unsupported claims (0-1, lower is better)
- completeness: Checks if responses include all necessary information (0-1, higher is better)
- content-similarity: Measures textual similarity using character-level matching (0-1, higher is better)
- textual-difference: Measures textual differences between strings (0-1, higher means more similar)
- tool-call-accuracy: Evaluates whether the LLM selects the correct tool from available options (0-1, higher is better)
- trajectory-accuracy: Evaluates whether an agent follows the expected sequence of actions (tool calls, model generations, workflow steps, and other span types) (0-1, higher is better)
- prompt-alignment: Measures how well agent responses align with user prompt intent, requirements, completeness, and format (0-1, higher is better)
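To make the character-level metrics concrete, here is a minimal sketch of a normalized edit-distance similarity, the kind of 0-1 "higher means more similar" score that textual-difference reports. This is an illustrative assumption, not Mastra's actual implementation:

```typescript
// Illustrative sketch only — not Mastra's textual-difference scorer.
// Classic single-row Levenshtein edit distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // holds D[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // D[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalize to 0-1, where 1 means identical strings.
function textualSimilarity(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}
```

Identical strings score 1, and every character-level edit lowers the score proportionally to the longer string's length.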
Context quality
These scorers evaluate the quality and relevance of context used in generating responses:
- context-precision: Evaluates context relevance and ranking using Mean Average Precision, rewarding early placement of relevant context (0-1, higher is better)
- context-relevance: Measures context utility with nuanced relevance levels, usage tracking, and missing context detection (0-1, higher is better)
Context Scorer Selection:
- Use Context Precision when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
- Use Context Relevance when you need detailed relevance assessment and want to track context usage and identify gaps
Both context scorers support:
- Static context: Pre-defined context arrays
- Dynamic context extraction: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)
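The Mean Average Precision idea behind context-precision can be sketched in a few lines. The `averagePrecision` helper below is hypothetical and only illustrates why ranking relevant context earlier yields a higher score; it is not Mastra's implementation:

```typescript
// Illustrative sketch only — not Mastra's context-precision scorer.
// `relevant[i]` marks whether the context chunk retrieved at rank i
// was judged relevant to the query.
function averagePrecision(relevant: boolean[]): number {
  let hits = 0;
  let sum = 0;
  relevant.forEach((isRelevant, i) => {
    if (isRelevant) {
      hits += 1;
      sum += hits / (i + 1); // precision at this rank
    }
  });
  return hits === 0 ? 0 : sum / hits;
}
```

Placing the single relevant chunk first (`[true, false]`) scores 1.0, while placing it second (`[false, true]`) scores 0.5, which is exactly the "rewarding early placement" behavior described above.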
Output quality
These scorers evaluate adherence to format, style, and safety requirements:
- tone-consistency: Measures consistency in formality, complexity, and style (0-1, higher is better)
- toxicity: Detects harmful or inappropriate content (0-1, lower is better)
- bias: Detects potential biases in the output (0-1, lower is better)
- keyword-coverage: Assesses technical terminology usage (0-1, higher is better)
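As a rough illustration of what a coverage-style scorer computes, the sketch below reports the fraction of expected keywords found in an output. This is a hypothetical simplification, not Mastra's keyword-coverage implementation:

```typescript
// Illustrative sketch only — not Mastra's keyword-coverage scorer.
// Returns the fraction of expected keywords present in the output,
// using a simple case-insensitive substring check.
function keywordCoverage(output: string, keywords: string[]): number {
  if (keywords.length === 0) return 1; // nothing required, full coverage
  const text = output.toLowerCase();
  const found = keywords.filter((k) => text.includes(k.toLowerCase()));
  return found.length / keywords.length;
}
```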
Action
Import scorers from @mastra/evals and attach them to agent or workflow evaluations; for custom scorers, refer to the Custom Scorers documentation. For context quality, choose context-precision when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation), or context-relevance when you need detailed relevance assessment with usage tracking and gap identification. Both context scorers accept either pre-defined static context arrays or dynamic context extraction functions for RAG systems and vector databases. See each scorer's documentation for its specific API usage.
Result
Scorers return normalized scores in the 0-1 range. Direction varies by scorer: for answer-relevancy, answer-similarity, faithfulness, completeness, content-similarity, textual-difference, tool-call-accuracy, trajectory-accuracy, prompt-alignment, context-precision, context-relevance, tone-consistency, and keyword-coverage, higher is better; for hallucination, toxicity, and bias, lower is better.
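When aggregating mixed-direction scores into a single quality view, a common pattern is to invert the "lower is better" scorers so everything reads on one "higher is better" scale. A minimal sketch, assuming all scores are already normalized to 0-1 (this helper is illustrative, not part of @mastra/evals):

```typescript
// Illustrative sketch only — not part of @mastra/evals.
// Scorers where a LOW raw score indicates good output.
const LOWER_IS_BETTER = new Set(["hallucination", "toxicity", "bias"]);

// Map any scorer's raw 0-1 score onto a "higher is better" scale.
function asQuality(scorer: string, score: number): number {
  return LOWER_IS_BETTER.has(scorer) ? 1 - score : score;
}
```

With this, a toxicity score of 0.25 (good, since lower is better) and a faithfulness score of 0.75 both report as 0.75 on the unified scale.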
Applicability
Scorers are available in @mastra/evals package. Custom scorers require separate implementation per the Custom Scorers guide.
Links
- Platform: Dev Framework · Mastra
- Source: https://mastra.ai/docs/evals/built-in-scorers
See also: