Adding Voice Capabilities to Mastra Agents

Trust: ★★★☆☆ (0.90) · 0 validations · developer_reference

Published: 2026-05-10 · Source: crawler_authoritative

Situation

Mastra Agent SDK documentation for developers configuring voice capabilities, including text-to-speech (TTS), speech-to-text (STT), and real-time speech-to-speech interactions with various voice providers.

Insight

Mastra agents can be enhanced with voice capabilities using the Agent class from @mastra/core/agent. Voice functionality is accessed via agent.voice, which provides speak() for TTS and listen() for STT. The speak() method returns a Node.js readable stream that can be piped to files or speakers, and listen() accepts audio streams with a filetype option (e.g., 'm4a', 'mp3'). Audio streams are Node.js streams compatible with createReadStream and createWriteStream.

For single-provider setups, instantiate a voice provider such as OpenAIVoice and pass it to the Agent constructor. For hybrid setups, use CompositeVoice from @mastra/core/voice with separate input and output providers. AI SDK models can also be passed directly to CompositeVoice, using openai.transcription('whisper-1') for STT and elevenlabs.speech('eleven_turbo_v2') for TTS.

Real-time speech-to-speech is supported via providers like OpenAIRealtimeVoice from @mastra/voice-openai-realtime, which uses WebSocket connections established via agent.voice.connect(). The realtime provider emits events: 'speaking' (with audio data), 'writing' (with transcribed text and role), and 'error'. Tools configured on the Agent are automatically passed to the voice provider.

Action

To add basic voice capabilities (a condensed sketch follows these steps):

  1. Import Agent from @mastra/core/agent and a voice provider such as OpenAIVoice from @mastra/voice-openai.
  2. Instantiate the voice provider with optional configuration.
  3. Pass the voice instance to the Agent constructor.
  4. Use agent.voice.speak(text, { filetype: 'm4a' }) to generate audio.
  5. Use agent.voice.listen(audioStream, { filetype: 'm4a' }) to transcribe audio.

For realtime speech-to-speech:

  1. Import OpenAIRealtimeVoice from @mastra/voice-openai-realtime and getMicrophoneStream from @mastra/node-audio.
  2. Configure with API key, model (e.g., 'gpt-5.1-realtime'), and speaker name.
  3. Call agent.voice.connect() to establish the WebSocket connection.
  4. Use agent.voice.send(microphoneStream) to send audio.
  5. Listen for events via agent.voice.on('speaking', ...), agent.voice.on('writing', ...), and agent.voice.on('error', ...).
  6. Close the connection with agent.voice.close().

For multiple providers, use CompositeVoice with the input and output properties set to different provider instances. For AI SDK integration, pass AI SDK models directly to CompositeVoice using the .transcription() and .speech() methods. Full examples for each setup appear in the original content below.
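
A minimal sketch of the basic flow, assuming an OpenAI provider (the agent id, instructions, and audio file name are illustrative; see the original content below for complete examples):

import { createReadStream } from 'fs'
import { Agent } from '@mastra/core/agent'
import { OpenAIVoice } from '@mastra/voice-openai'

const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: 'You are a helpful assistant with both STT and TTS capabilities.',
  model: 'openai/gpt-5.4',
  voice: new OpenAIVoice(),
})

// TTS: speak() returns a Node.js readable stream of audio
const audio = await agent.voice.speak('Hello!', { filetype: 'm4a' })

// STT: listen() transcribes an audio stream
const transcription = await agent.voice.listen(createReadStream('question.m4a'), {
  filetype: 'm4a',
})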

Result

The voice-enabled agent can speak responses via the speak() method, which returns an audio stream, and transcribe user input via the listen() method, which returns the transcribed text. Realtime providers establish WebSocket connections for bidirectional real-time voice interaction with event-based audio and text streaming.

Applicability conditions

Requires an @mastra/core agent setup. Realtime voice requires a WebSocket-compatible environment. Audio file operations require Node.js streams. Filetype options vary by provider.


Original content

Voice

Mastra agents can be enhanced with voice capabilities, allowing them to speak responses and listen to user input. You can configure an agent to use either a single voice provider or combine multiple providers for different operations.

Basic usage

The simplest way to add voice to an agent is to use a single provider for both speaking and listening:

import { Agent } from '@mastra/core/agent'
import { playAudio } from '@mastra/node-audio'
import { OpenAIVoice } from '@mastra/voice-openai'
 
// Initialize the voice provider with default settings
const voice = new OpenAIVoice()
 
// Create an agent with voice capabilities
export const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: `You are a helpful assistant with both STT and TTS capabilities.`,
  model: 'openai/gpt-5.4',
  voice,
})
 
// The agent can now use voice for interaction
const audioStream = await agent.voice.speak("Hello, I'm your AI assistant!", {
  filetype: 'm4a',
})
 
playAudio(audioStream!)
 
try {
  const transcription = await agent.voice.listen(audioStream)
  console.log(transcription)
} catch (error) {
  console.error('Error transcribing audio:', error)
}

Working with audio streams

The speak() and listen() methods work with Node.js streams. Here’s how to save and load audio files:

Saving Speech Output

The speak method returns a stream that you can pipe to a file or speaker.

import { createWriteStream } from 'fs'
import path from 'path'
 
// Generate speech and save to file
const audio = await agent.voice.speak('Hello, World!')
const filePath = path.join(process.cwd(), 'agent.mp3')
const writer = createWriteStream(filePath)
 
audio.pipe(writer)
 
await new Promise<void>((resolve, reject) => {
  writer.on('finish', () => resolve())
  writer.on('error', reject)
})

Transcribing Audio Input

The listen method expects a stream of audio data from a microphone or file.

import { createReadStream } from 'fs'
import path from 'path'
 
// Read audio file and transcribe
const audioFilePath = path.join(process.cwd(), '/agent.m4a')
const audioStream = createReadStream(audioFilePath)
 
try {
  console.log('Transcribing audio file...')
  const transcription = await agent.voice.listen(audioStream, {
    filetype: 'm4a',
  })
  console.log('Transcription:', transcription)
} catch (error) {
  console.error('Error transcribing audio:', error)
}

Speech-to-speech voice interactions

For more dynamic and interactive voice experiences, you can use real-time voice providers that support speech-to-speech capabilities:

import { Agent } from '@mastra/core/agent'
import { getMicrophoneStream } from '@mastra/node-audio'
import { OpenAIRealtimeVoice } from '@mastra/voice-openai-realtime'
import { search, calculate } from '../tools'
 
// Initialize the realtime voice provider
const voice = new OpenAIRealtimeVoice({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-5.1-realtime',
  speaker: 'alloy',
})
 
// Create an agent with speech-to-speech voice capabilities
export const agent = new Agent({
  id: 'speech-to-speech-agent',
  name: 'Speech-to-Speech Agent',
  instructions: `You are a helpful assistant with speech-to-speech capabilities.`,
  model: 'openai/gpt-5.4',
  tools: {
    // Tools configured on Agent are passed to voice provider
    search,
    calculate,
  },
  voice,
})
 
// Establish a WebSocket connection
await agent.voice.connect()
 
// Start a conversation
agent.voice.speak("Hello, I'm your AI assistant!")
 
// Stream audio from a microphone
const microphoneStream = getMicrophoneStream()
agent.voice.send(microphoneStream)
 
// When done with the conversation
agent.voice.close()

Event System

The realtime voice provider emits several events you can listen for:

// Listen for speech audio data sent from voice provider
agent.voice.on('speaking', ({ audio }) => {
  // audio contains ReadableStream or Int16Array audio data
})
 
// Listen for transcribed text sent from both voice provider and user
agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role} said: ${text}`)
})
 
// Listen for errors
agent.voice.on('error', error => {
  console.error('Voice error:', error)
})
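
A minimal sketch of consuming the 'speaking' event for playback, assuming the third-party speaker package and 16-bit mono PCM at 24 kHz (the actual audio format depends on the realtime provider):

import Speaker from 'speaker'

// Assumed output format; adjust to match the realtime provider's actual PCM settings
const speakers = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 })

agent.voice.on('speaking', ({ audio }) => {
  if (audio instanceof Int16Array) {
    // Raw PCM samples: wrap them in a Buffer and write to the speaker
    speakers.write(Buffer.from(audio.buffer, audio.byteOffset, audio.byteLength))
  } else if (typeof audio.pipe === 'function') {
    // Node.js readable stream: pipe it straight to the speaker
    audio.pipe(speakers)
  }
})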

Examples

End-to-end voice interaction

This example demonstrates a voice interaction between two agents. The hybrid voice agent, which uses multiple providers, speaks a question, which is saved as an audio file. The unified voice agent listens to that file, processes the question, generates a response, and speaks it back. Both audio outputs are saved to the audio directory.

The following files are created:

  • hybrid-question.mp3 – Hybrid agent’s spoken question.
  • unified-response.mp3 – Unified agent’s spoken response.
import 'dotenv/config'
 
import path from 'path'
import fs, { createReadStream, createWriteStream } from 'fs'
import { Agent } from '@mastra/core/agent'
import { CompositeVoice } from '@mastra/core/voice'
import { OpenAIVoice } from '@mastra/voice-openai'
import { Mastra } from '@mastra/core'
 
// Saves an audio stream to a file in the audio directory, creating the directory if it doesn't exist.
export const saveAudioToFile = async (
  audio: NodeJS.ReadableStream,
  filename: string,
): Promise<void> => {
  const audioDir = path.join(process.cwd(), 'audio')
  const filePath = path.join(audioDir, filename)
 
  await fs.promises.mkdir(audioDir, { recursive: true })
 
  const writer = createWriteStream(filePath)
  audio.pipe(writer)
  return new Promise((resolve, reject) => {
    writer.on('finish', resolve)
    writer.on('error', reject)
  })
}
 
// Converts a transcription result to text, collecting the stream into a string when the provider returns a stream instead of a string.
export const convertToText = async (input: string | NodeJS.ReadableStream): Promise<string> => {
  if (typeof input === 'string') {
    return input
  }
 
  const chunks: Buffer[] = []
  return new Promise((resolve, reject) => {
    input.on('data', chunk => chunks.push(Buffer.from(chunk)))
    input.on('error', reject)
    input.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')))
  })
}
 
export const hybridVoiceAgent = new Agent({
  id: 'hybrid-voice-agent',
  name: 'Hybrid Voice Agent',
  model: 'openai/gpt-5.4',
  instructions: 'You can speak and listen using different providers.',
  voice: new CompositeVoice({
    input: new OpenAIVoice(),
    output: new OpenAIVoice(),
  }),
})
 
export const unifiedVoiceAgent = new Agent({
  id: 'unified-voice-agent',
  name: 'Unified Voice Agent',
  instructions: 'You are an agent with both STT and TTS capabilities.',
  model: 'openai/gpt-5.4',
  voice: new OpenAIVoice(),
})
 
export const mastra = new Mastra({
  agents: { hybridVoiceAgent, unifiedVoiceAgent },
})
 
// In a multi-file project these agents would be retrieved via mastra.getAgent('hybridVoiceAgent')
// and mastra.getAgent('unifiedVoiceAgent'); here the exported instances are used directly.
 
const question = 'What is the meaning of life in one sentence?'
 
const hybridSpoken = await hybridVoiceAgent.voice.speak(question)
 
await saveAudioToFile(hybridSpoken!, 'hybrid-question.mp3')
 
const audioStream = createReadStream(path.join(process.cwd(), 'audio', 'hybrid-question.mp3'))
const unifiedHeard = await unifiedVoiceAgent.voice.listen(audioStream)
 
const inputText = await convertToText(unifiedHeard!)
 
const unifiedResponse = await unifiedVoiceAgent.generate(inputText)
const unifiedSpoken = await unifiedVoiceAgent.voice.speak(unifiedResponse.text)
 
await saveAudioToFile(unifiedSpoken!, 'unified-response.mp3')

Using Multiple Providers

For more flexibility, you can use different providers for speaking and listening using the CompositeVoice class:

import { Agent } from '@mastra/core/agent'
import { CompositeVoice } from '@mastra/core/voice'
import { OpenAIVoice } from '@mastra/voice-openai'
import { PlayAIVoice } from '@mastra/voice-playai'
 
export const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: `You are a helpful assistant with both STT and TTS capabilities.`,
  model: 'openai/gpt-5.4',
 
  // Create a composite voice using OpenAI for listening and PlayAI for speaking
  voice: new CompositeVoice({
    input: new OpenAIVoice(),
    output: new PlayAIVoice(),
  }),
})

Using AI SDK

Mastra supports using AI SDK’s transcription and speech models directly in CompositeVoice, giving you access to a wide range of providers through the AI SDK ecosystem:

import { Agent } from '@mastra/core/agent'
import { CompositeVoice } from '@mastra/core/voice'
import { openai } from '@ai-sdk/openai'
import { elevenlabs } from '@ai-sdk/elevenlabs'
import { groq } from '@ai-sdk/groq'
 
export const agent = new Agent({
  id: 'aisdk-voice-agent',
  name: 'AI SDK Voice Agent',
  instructions: `You are a helpful assistant with voice capabilities.`,
  model: 'openai/gpt-5.4',
 
  // Pass AI SDK models directly to CompositeVoice
  voice: new CompositeVoice({
    input: openai.transcription('whisper-1'), // AI SDK transcription model
    output: elevenlabs.speech('eleven_turbo_v2'), // AI SDK speech model
  }),
})
 
// Use voice capabilities as usual
const audioStream = await agent.voice.speak('Hello!')
const transcribedText = await agent.voice.listen(audioStream)

Mix and Match Providers

You can mix AI SDK models with Mastra voice providers:

import { CompositeVoice } from '@mastra/core/voice'
import { PlayAIVoice } from '@mastra/voice-playai'
import { openai } from '@ai-sdk/openai'
 
// Use AI SDK for transcription and Mastra provider for speech
const voice = new CompositeVoice({
  input: openai.transcription('whisper-1'), // AI SDK
  output: new PlayAIVoice(), // Mastra provider
})
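
The mixed CompositeVoice instance can then be attached to an agent in the same way as the other setups (the agent id and name here are illustrative):

import { Agent } from '@mastra/core/agent'

export const agent = new Agent({
  id: 'mixed-voice-agent',
  name: 'Mixed Voice Agent',
  instructions: 'You are a helpful assistant with both STT and TTS capabilities.',
  model: 'openai/gpt-5.4',
  voice,
})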

For the complete list of supported AI SDK providers and their capabilities, refer to the AI SDK documentation.

Supported voice providers

Mastra supports multiple voice providers for text-to-speech (TTS) and speech-to-text (STT) capabilities:

Provider          Package                          Features
OpenAI            @mastra/voice-openai             TTS, STT
OpenAI Realtime   @mastra/voice-openai-realtime    Realtime speech-to-speech
AWS Nova Sonic    @mastra/voice-aws-nova-sonic     Realtime speech-to-speech via AWS Bedrock
ElevenLabs        @mastra/voice-elevenlabs         High-quality TTS
PlayAI            @mastra/voice-playai             TTS
Google            @mastra/voice-google             TTS, STT
Deepgram          @mastra/voice-deepgram           STT
Murf              @mastra/voice-murf               TTS
Speechify         @mastra/voice-speechify          TTS
Sarvam            @mastra/voice-sarvam             TTS, STT
Azure             @mastra/voice-azure              TTS, STT
Cloudflare        @mastra/voice-cloudflare         TTS
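
As a rough sketch of swapping providers from the table above (the DeepgramVoice and ElevenLabsVoice class names are assumed to follow the same naming pattern as OpenAIVoice and PlayAIVoice):

import { CompositeVoice } from '@mastra/core/voice'
import { DeepgramVoice } from '@mastra/voice-deepgram' // assumed export name
import { ElevenLabsVoice } from '@mastra/voice-elevenlabs' // assumed export name

// Deepgram for transcription (STT), ElevenLabs for speech (TTS)
const voice = new CompositeVoice({
  input: new DeepgramVoice(),
  output: new ElevenLabsVoice(),
})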
