Speech-to-Text (STT) in Mastra - Voice Recognition Interface Documentation

Trust: ★★★☆☆ (0.90) · 0 validations · factual

Published: 2026-05-09 · Source: crawler_authoritative

Situation

Technical documentation for Mastra AI framework covering Speech-to-Text (STT) capabilities. The content describes how to implement voice-to-text conversion for building voice-enabled applications.

Insight

Mastra provides a standardized Speech-to-Text (STT) interface for converting audio input into text across multiple service providers. STT enables voice-enabled applications with hands-free interaction, accessibility features, and natural human-computer interfaces. Configuration accepts a listeningModel parameter with: name (the specific STT model), apiKey (authentication), and provider-specific options. All of these parameters are optional, as the provider's default settings can be used. Supported STT providers include: OpenAI (Whisper models for high-accuracy transcription), Azure (Microsoft enterprise-grade reliability), ElevenLabs (multi-language support), Google (extensive language support), Cloudflare (edge-optimized for low latency), Deepgram (AI-powered for various accents), and Sarvam (specialized in Indic languages). Each provider is installed as a separate package (e.g., pnpm add @mastra/voice-openai@latest).

Action

  1. Install the desired provider package, e.g. pnpm add @mastra/voice-openai@latest.
  2. Configure the voice provider with a listeningModel containing the model name (e.g. 'whisper-1') and an apiKey.
  3. Use the listen() method to convert audio to text: const transcript = await agent.voice.listen(audioStream, { filetype: 'm4a' }).
  4. The method returns the transcript text, which can then be used in prompts to the agent. Optionally, specify the audio file type for better accuracy.
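The steps above can be sketched as one minimal script, reusing the OpenAI provider and the identifiers shown in the original content below; treat it as a configuration sketch, not a definitive implementation:

```typescript
import { OpenAIVoice } from '@mastra/voice-openai'
import { getMicrophoneStream } from '@mastra/node-audio'

// Step 2: configure the provider. Both fields are optional;
// `new OpenAIVoice()` alone falls back to the provider defaults.
const voice = new OpenAIVoice({
  listeningModel: {
    name: 'whisper-1',
    apiKey: process.env.OPENAI_API_KEY,
  },
})

// Steps 3-4: convert audio to text. Any readable audio stream
// works here; the microphone is just one source.
const audioStream = getMicrophoneStream()
const transcript = await voice.listen(audioStream, { filetype: 'm4a' })
console.log(transcript)
```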

Applicability

Developer documentation for implementing Speech-to-Text in Mastra AI applications. Applies to building voice-enabled applications using supported providers: OpenAI, Azure, ElevenLabs, Google, Cloudflare, Deepgram, or Sarvam.


Original content

Speech-to-Text (STT)

Speech-to-Text (STT) in Mastra provides a standardized interface for converting audio input into text across multiple service providers. STT helps create voice-enabled applications that can respond to human speech, enabling hands-free interaction, accessibility for users with disabilities, and more natural human-computer interfaces.

Configuration

To use STT in Mastra, you need to provide a listeningModel when initializing the voice provider. This includes parameters such as:

  • name: The specific STT model to use.
  • apiKey: Your API key for authentication.
  • Provider-specific options: Additional options that may be required or supported by the specific voice provider.

Note: All of these parameters are optional. You can use the default settings provided by the voice provider, which will depend on the specific provider you are using.

const voice = new OpenAIVoice({
  listeningModel: {
    name: 'whisper-1',
    apiKey: process.env.OPENAI_API_KEY,
  },
})
 
// If using default settings the configuration can be simplified to:
const voice = new OpenAIVoice()

Available providers

Mastra supports several Speech-to-Text providers, each with their own capabilities and strengths:

  • OpenAI: High-accuracy transcription with Whisper models
  • Azure: Microsoft’s speech recognition with enterprise-grade reliability
  • ElevenLabs: Advanced speech recognition with support for multiple languages
  • Google: Google’s speech recognition with extensive language support
  • Cloudflare: Edge-optimized speech recognition for low-latency applications
  • Deepgram: AI-powered speech recognition with high accuracy for various accents
  • Sarvam: Specialized in Indic languages and accents

Each provider is implemented as a separate package that you can install as needed:

pnpm add @mastra/voice-openai@latest  # Example for OpenAI
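Only the OpenAI package name is given on this page; a small helper makes the apparent naming convention explicit, on the assumption that the other providers follow the same @mastra/voice-&lt;provider&gt; pattern:

```typescript
// Providers listed above. Only '@mastra/voice-openai' is named in the
// docs; the other package names assume the same naming convention.
const sttProviders = [
  'openai',
  'azure',
  'elevenlabs',
  'google',
  'cloudflare',
  'deepgram',
  'sarvam',
] as const

// Hypothetical helper mapping a provider id to its package name.
function voicePackage(provider: (typeof sttProviders)[number]): string {
  return `@mastra/voice-${provider}`
}

console.log(voicePackage('openai')) // @mastra/voice-openai
```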

Using the listen method

The primary method for STT is the listen() method, which converts spoken audio into text. Here’s how to use it:

import { Agent } from '@mastra/core/agent'
import { OpenAIVoice } from '@mastra/voice-openai'
import { getMicrophoneStream } from '@mastra/node-audio'
 
const voice = new OpenAIVoice()
 
const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: 'You are a voice assistant that provides recommendations based on user input.',
  model: 'openai/gpt-5.4',
  voice,
})
 
const audioStream = getMicrophoneStream() // Assume this function gets audio input
 
const transcript = await agent.voice.listen(audioStream, {
  filetype: 'm4a', // Optional: specify the audio file type
})
 
console.log(`User said: ${transcript}`)
 
const { text } = await agent.generate(
  `Based on what the user said, provide them a recommendation: ${transcript}`,
)
 
console.log(`Recommendation: ${text}`)
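The filetype hint passed to listen() can be derived from a filename. Note that 'm4a' is the only value shown on this page; the other formats, and the helper itself, are illustrative assumptions rather than part of the Mastra API:

```typescript
// 'm4a' comes from the docs; the remaining entries are common audio
// extensions included as assumptions.
const knownFiletypes = ['m4a', 'mp3', 'wav', 'webm', 'ogg', 'flac'] as const
type AudioFiletype = (typeof knownFiletypes)[number]

// Hypothetical helper: map a filename to a filetype hint, or return
// undefined so listen() can fall back to its default handling.
function filetypeHint(filename: string): AudioFiletype | undefined {
  const ext = filename.split('.').pop()?.toLowerCase()
  return knownFiletypes.find((ft) => ft === ext)
}

console.log(filetypeHint('Recording.M4A')) // m4a
```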

Check out the Adding Voice to Agents documentation to learn how to use STT in an agent.

Links