Choosing a Speech-to-Text API for Voice Agents

The speech-to-text engine is the front door of your voice agent pipeline. Everything downstream — LLM reasoning, intent detection, response generation — depends on getting an accurate transcript of what the user said, fast. A slow or inaccurate STT engine cripples the entire system.

For voice agents, STT selection comes down to three factors: streaming latency (how quickly you get a transcript as the user speaks), accuracy (word error rate under real-world audio conditions), and cost (per-minute pricing at your call volume). This guide compares the major STT APIs across these dimensions and recommends the best choice for each use case.

Why STT Choice Matters for Voice Agents

In a voice agent pipeline, STT is the first processing step after the user speaks. The transcript it produces feeds directly into your LLM for intent detection and response generation. Two things can go wrong: the transcript arrives too slowly (adding latency to the total response), or it contains errors that mislead the LLM.

Latency compounds through the pipeline. If your STT takes 500ms to finalize a transcript, and your LLM needs 800ms to generate a response, and your TTS needs 300ms to start playing audio, that is 1,600ms of silence after the user stops speaking. Users start noticing at 1 second and disengage past 2. Every 100ms you save on STT latency directly reduces total response time.
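The arithmetic above can be sketched as a simple latency budget. The stage figures are the illustrative numbers from this paragraph, not measurements of any particular provider:

```python
# Total response latency is the sum of each pipeline stage's delay.
# Values are the illustrative figures from the text above.
PIPELINE_MS = {
    "stt_finalize": 500,     # time to finalize the transcript
    "llm_generate": 800,     # time for the LLM to produce a response
    "tts_first_audio": 300,  # time until TTS audio starts playing
}

def total_silence_ms(stages: dict) -> int:
    """Silence the user hears after they stop speaking."""
    return sum(stages.values())

print(total_silence_ms(PIPELINE_MS))  # 1600
```

Shaving 100ms off any single stage reduces the total by the same amount, which is why STT latency savings translate directly into a snappier agent.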

Accuracy errors cascade differently. If the STT transcribes "I want to cancel my subscription" as "I want to cancel my description," the LLM may generate an irrelevant response. In customer service and sales applications, these errors erode trust quickly. The difference between 5% and 12% WER is the difference between occasional hiccups and frequent misunderstandings.

Key Factors When Evaluating STT APIs

These six factors determine whether an STT provider will work for your voice agent deployment.

Streaming Latency

For voice agents, STT latency is the time between a user finishing a word and the transcript appearing in your pipeline. Real-time providers like Deepgram deliver interim results within 100-300ms. Batch-oriented APIs can take seconds. Since STT is the first step in the voice agent pipeline (before LLM and TTS), every millisecond of STT delay pushes the total response time further from conversational. Prioritize providers with WebSocket streaming and interim/partial results.

Word Error Rate (WER)

Word error rate measures how accurately the STT engine transcribes speech. Lower is better — a WER of 5% means roughly 1 in 20 words is wrong. For voice agents, accuracy directly impacts the LLM's ability to understand intent. A 10% WER might work for note-taking but causes frequent misunderstandings in voice agents. Top providers achieve 5-8% WER on clean audio, but real-world conditions (background noise, accents, telephony compression) can push WER to 15-20%. Test with audio that matches your actual use case.
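WER is straightforward to compute yourself when testing providers on your own audio. A minimal sketch using word-level edit distance, applied to the "subscription"/"description" example from earlier:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("i want to cancel my subscription",
                      "i want to cancel my description")
print(f"{wer:.1%}")  # one substitution in six words, about 16.7%
```

Running this over a few hundred utterances recorded under your real audio conditions gives a far more honest number than any provider benchmark.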

Language Support

Language support varies enormously across STT providers. Google and AWS support 100+ languages but quality varies by language. Deepgram supports fewer languages but delivers consistently high quality. For voice agents, verify that your target languages work well with streaming mode — some providers support a language for batch transcription but not for real-time streaming. Also check accent and dialect coverage within your target languages.

Diarization

Speaker diarization identifies who is speaking in a multi-speaker conversation. For voice agents, diarization helps distinguish the agent's output from the user's speech, which matters for call recording, analytics, and debugging. Most providers offer diarization but quality and latency vary. Some only support diarization in batch mode, not streaming. If you need real-time diarization for multi-party calls, verify the provider supports it with acceptable latency.

Custom Vocabulary

Voice agents often deal with domain-specific terminology — product names, medical terms, company names, technical jargon — that general STT models misrecognize. Custom vocabulary (also called keyword boosting or speech adaptation) lets you bias the model toward recognizing specific words. The implementation varies: some providers accept a simple word list, others support weighted phrases or full custom language models. For specialized domains, this feature can cut WER by 30-50% on key terms.

Pricing

STT pricing is typically per minute of audio processed. Rates range from $0.006/minute (AWS Transcribe) to $0.05/minute (premium tiers of specialized providers). For a voice agent handling 10,000 calls per month at 3 minutes average, monthly STT costs range from $180 to $1,500. Streaming endpoints sometimes cost more than batch. Some providers charge separately for features like diarization, custom vocabulary, or enhanced models.
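A back-of-the-envelope cost model, using the call-volume figures from this paragraph:

```python
def monthly_stt_cost(calls_per_month: int, avg_minutes: float,
                     rate_per_minute: float) -> float:
    """Estimated monthly STT spend at a flat per-minute rate.
    Ignores volume discounts and per-feature surcharges."""
    return calls_per_month * avg_minutes * rate_per_minute

# 10,000 calls/month at 3 minutes average (figures from the text):
print(monthly_stt_cost(10_000, 3, 0.006))  # roughly $180 at the low end
print(monthly_stt_cost(10_000, 3, 0.05))   # roughly $1,500 at premium rates
```

Remember to model streaming surcharges and per-feature fees separately where a provider bills them, since they can dominate the base rate.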

STT Provider Comparison

Deepgram

Best for real-time voice agents

Deepgram is the go-to STT provider for real-time voice agents. Their Nova-2 model delivers excellent accuracy with streaming latency that consistently beats competitors. The WebSocket API is clean, interim results arrive fast, and the endpointing (detecting when a user has finished speaking) is the best in the market. Custom vocabulary boosting works well for domain-specific terms. Pricing is competitive and scales predictably. If you are building a voice agent that needs to understand speech in real time, Deepgram should be your first evaluation.

Latency: 100-300ms
WER: 5-8%
Languages: 36+
Pricing: $0.0043-0.0145/min
Diarization: Real-time + batch
Custom Vocabulary: Keyword boosting + custom models
Streaming: WebSocket + REST streaming

Strengths
  • Lowest streaming latency with reliable interim results
  • Best endpointing for voice agent turn-taking
  • Nova-2 accuracy competes with or beats larger providers
  • Clean WebSocket API with excellent documentation
  • Competitive pricing that scales well at volume
Weaknesses
  • Fewer languages than Google or AWS (36 vs 100+)
  • Custom model training requires enterprise plan
  • Less brand recognition than Google or AWS for enterprise procurement
  • Diarization quality in streaming mode lags batch mode

OpenAI Whisper API

Best accuracy for batch transcription

OpenAI's Whisper is an excellent STT model with broad language support and strong accuracy. The hosted API is affordable and easy to use. However, it is a batch API — you send audio files and get transcripts back. There is no native streaming support, which makes it poorly suited for real-time voice agents where you need instant transcription. The open-source version of Whisper can be self-hosted for zero per-minute cost, making it attractive for high-volume batch processing or teams with GPU infrastructure. For voice agents, Whisper works best for post-call transcription and analysis rather than real-time use.

Latency: 1-3s (batch only)
WER: 4-7%
Languages: 100+
Pricing: $0.006/min
Diarization: Not built-in
Custom Vocabulary: Prompt-based only
Streaming: No native streaming

Strengths
  • Excellent accuracy across a wide range of accents and audio quality
  • Supports 100+ languages with strong multilingual performance
  • Open-source model can be self-hosted for free
  • Very affordable hosted API pricing
  • Handles noisy audio and poor recording quality well
Weaknesses
  • No native streaming — batch only (1-3 second latency minimum)
  • Not suitable for real-time voice agents without significant workarounds
  • No built-in diarization (requires additional processing)
  • Custom vocabulary limited to prompt-based guidance

AssemblyAI

Best accuracy with rich features

AssemblyAI delivers top-tier accuracy with a feature-rich platform that goes well beyond basic transcription. Their Universal-2 model matches or exceeds Deepgram on raw accuracy benchmarks, and they offer built-in features like sentiment analysis, topic detection, entity recognition, and summarization that are useful for voice agent analytics. Streaming latency is good but not quite as fast as Deepgram for real-time voice agent use. Where AssemblyAI shines is in the intelligence layer — if you want to analyze conversations, not just transcribe them, AssemblyAI gives you the most tools out of the box.

Latency: 200-500ms
WER: 5-8%
Languages: 17+
Pricing: $0.015-0.065/min
Diarization: Real-time + batch
Custom Vocabulary: Word boost + custom spelling
Streaming: WebSocket streaming

Strengths
  • Excellent accuracy with Universal-2 model
  • Rich built-in features: sentiment, topics, entities, summaries
  • Good streaming support with WebSocket API
  • Strong documentation and developer experience
  • Built-in PII redaction for compliance
Weaknesses
  • Higher pricing than Deepgram, especially for premium features
  • Streaming latency slightly behind Deepgram for real-time agents
  • Fewer languages than Google, AWS, or Whisper
  • Some features (summarization, topic detection) are batch-only

Google Cloud STT

Best language coverage and enterprise integration

Google Cloud Speech-to-Text offers the widest language coverage of any STT provider, with strong accuracy on its Chirp 2 and latest-long models. The gRPC streaming API is efficient and well-suited for server-to-server communication, though more complex to integrate than WebSocket APIs. Speech adaptation lets you boost recognition of domain-specific terms. Google's strength is breadth — if you need a single provider that works across dozens of languages at reasonable quality, Google is the pragmatic choice. Accuracy on English is good but trails Deepgram and AssemblyAI in direct comparisons.

Latency: 200-400ms
WER: 6-10%
Languages: 125+
Pricing: $0.006-0.024/min
Diarization: Streaming + batch
Custom Vocabulary: Speech adaptation + custom class tokens
Streaming: gRPC streaming

Strengths
  • Widest language support (125+ languages and variants)
  • Mature gRPC streaming API with low overhead
  • Strong speech adaptation for custom vocabulary
  • Deep integration with Google Cloud ecosystem
  • Competitive pricing with significant free tier
Weaknesses
  • English accuracy trails Deepgram and AssemblyAI
  • gRPC is more complex to integrate than WebSocket
  • Endpointing is less precise for voice agent turn-taking
  • Quality varies across languages — not all are equally good

AWS Transcribe

Best for AWS-native infrastructure

AWS Transcribe is a solid STT service that integrates deeply with the AWS ecosystem. If your voice agent infrastructure runs on AWS, Transcribe is the path of least resistance — it connects natively to Amazon Connect, S3, Lambda, and other services. Custom vocabulary support is good, and the HIPAA-eligible tier makes it suitable for healthcare voice agents. However, raw accuracy and streaming latency trail specialized providers like Deepgram. Transcribe is a good choice when you prioritize infrastructure integration, compliance, and cost over cutting-edge accuracy.

Latency: 300-500ms
WER: 7-12%
Languages: 100+
Pricing: $0.006-0.024/min
Diarization: Streaming + batch
Custom Vocabulary: Custom vocabulary + custom language models
Streaming: WebSocket streaming

Strengths
  • Deep AWS ecosystem integration (Connect, S3, Lambda)
  • HIPAA-eligible tier for healthcare applications
  • Strong custom vocabulary and language model support
  • Affordable pricing with pay-per-use model
  • Reliable infrastructure backed by AWS SLA
Weaknesses
  • Accuracy trails Deepgram, AssemblyAI, and Whisper
  • Higher streaming latency than specialized providers
  • Endpointing less responsive for real-time voice agents
  • Documentation can be dense and AWS-centric

Quick Comparison

Provider | Latency | WER | Languages | Pricing | Best For
Deepgram | 100-300ms | 5-8% | 36+ | $0.0043-0.0145/min | Real-time voice agents, phone systems, live transcription
OpenAI Whisper API | 1-3s (batch only) | 4-7% | 100+ | $0.006/min | Batch transcription, post-call analysis, budget self-hosting
AssemblyAI | 200-500ms | 5-8% | 17+ | $0.015-0.065/min | High-accuracy applications, conversation intelligence, call analytics
Google Cloud STT | 200-400ms | 6-10% | 125+ | $0.006-0.024/min | Multilingual agents, Google Cloud ecosystem, enterprise deployments
AWS Transcribe | 300-500ms | 7-12% | 100+ | $0.006-0.024/min | AWS ecosystem, HIPAA-compliant deployments, call center analytics

Recommendations by Use Case

The best STT API depends on what you are building. Here are our recommendations for the most common voice agent scenarios.

Real-Time Voice Agents

For phone-based agents, live customer support, and any application where response latency is critical. Every millisecond counts.

Recommended: Deepgram

Lowest streaming latency, best endpointing for turn-taking

Best Accuracy

For applications where transcript accuracy is the top priority and you can tolerate slightly higher latency. Medical, legal, and compliance use cases.

Recommended: Deepgram or AssemblyAI

Both deliver 5-8% WER with strong custom vocabulary

Budget / Self-Hosted

For teams with GPU infrastructure who want to eliminate per-minute API costs, or for batch transcription at the lowest price point.

Recommended: OpenAI Whisper (self-hosted)

Open-source, zero per-minute cost, excellent accuracy

Enterprise / Multilingual

For enterprise deployments requiring broad language support, compliance certifications, and deep cloud platform integration.

Recommended: Google Cloud STT or AWS Transcribe

100+ languages, enterprise SLA, compliance certifications

Integration Tips for Voice Agent Pipelines

How you integrate STT matters as much as which provider you choose. These patterns will help you get the best latency and accuracy from any STT API.

Use WebSocket Streaming with Interim Results

Always use WebSocket (or gRPC) streaming connections and process interim results. Interim results give you partial transcripts as the user speaks — you can start feeding these to your LLM before the user finishes their utterance. This enables "speculative processing" where the LLM begins generating a response based on partial input, then adjusts when the final transcript arrives. Deepgram and AssemblyAI have the best interim result implementations for voice agent use.
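A minimal, provider-agnostic sketch of interim-result handling. `TranscriptEvent` is a hypothetical stand-in for the messages a streaming API would deliver over the WebSocket; real SDKs use their own message schemas:

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptEvent:
    text: str       # transcript text for the current utterance segment
    is_final: bool  # False for interim results, True once finalized

@dataclass
class InterimTracker:
    """Commits final segments and overlays the latest interim on top,
    so downstream code always has the current best guess at the
    utterance, even before the provider finalizes it."""
    finals: list = field(default_factory=list)
    interim: str = ""

    def on_event(self, event: TranscriptEvent) -> str:
        if event.is_final:
            self.finals.append(event.text)
            self.interim = ""
        else:
            self.interim = event.text  # interim results replace each other
        return self.current()

    def current(self) -> str:
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

In a speculative-processing setup, each call to `on_event` would feed `current()` to the LLM, discarding in-flight generations when the final transcript diverges from the interim it was based on.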

Configure Endpointing Carefully

Endpointing controls when the STT engine decides the user has finished speaking. Too aggressive and you cut users off mid-thought. Too conservative and there is a long pause before your agent responds. Most providers let you configure the silence threshold (typically 300-1000ms). For voice agents, start with 500-700ms and tune based on user feedback. Some providers like Deepgram offer "smart endpointing" that uses context to distinguish natural pauses from turn endings.
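Silence-threshold endpointing can be sketched as follows. `detect_endpoint` and its frame-based input are hypothetical, standing in for the per-frame voice-activity signal your audio stack would provide:

```python
def detect_endpoint(frames, silence_threshold_ms: int = 600,
                    frame_ms: int = 20) -> bool:
    """Return True once trailing silence meets the threshold.

    `frames` is a sequence of booleans, one per audio frame:
    True = speech detected, False = silence. Any speech frame
    resets the silence timer, so only a continuous trailing
    pause triggers the endpoint."""
    trailing_silence = 0
    for is_speech in frames:
        trailing_silence = 0 if is_speech else trailing_silence + frame_ms
    return trailing_silence >= silence_threshold_ms
```

The 600ms default here reflects the 500-700ms starting range suggested above; tuning it per deployment (and per language, since pause patterns differ) is usually worth the effort.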

Handle Barge-In (Interruptions)

Users will interrupt your voice agent mid-sentence. Your STT integration needs to handle this gracefully: detect when the user starts speaking while TTS audio is playing, immediately stop TTS playback, and process the user's new input. This requires keeping the STT WebSocket open continuously (not just during expected user turns) and implementing echo cancellation so the STT engine does not transcribe your agent's own audio output.
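The barge-in flow can be sketched as a small state machine. `stop_tts` and the event hooks are hypothetical callbacks your audio layer would supply; the STT stream itself stays open the whole time:

```python
class BargeInController:
    """Stops TTS playback the moment the user starts speaking
    while the agent is mid-utterance."""

    def __init__(self, stop_tts):
        self.stop_tts = stop_tts    # callback: halt TTS playback now
        self.agent_speaking = False

    def on_tts_start(self):
        self.agent_speaking = True

    def on_tts_end(self):
        self.agent_speaking = False

    def on_user_speech(self):
        """Called when the always-open STT stream detects user speech."""
        if self.agent_speaking:
            self.stop_tts()             # cut playback immediately
            self.agent_speaking = False  # yield the turn to the user
```

Echo cancellation still has to happen upstream of this logic; without it, `on_user_speech` fires on the agent's own audio and the controller interrupts itself.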

Build for Failover

Like TTS, STT APIs can have outages. Implement a fallback provider — for example, Deepgram primary with Google Cloud STT as backup. Monitor transcription latency and accuracy in real time. If latency spikes above your threshold or you detect unusual error rates, automatically switch to the fallback. Also consider running both providers in parallel during high-stakes calls and using the faster or more confident result.
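A primary/fallback wrapper might look like this sketch. `primary` and `fallback` are hypothetical callables (audio in, transcript out) standing in for real provider clients:

```python
import time

class STTWithFailover:
    """Tries the primary transcriber; falls back when the primary
    errors out or exceeds the latency budget."""

    def __init__(self, primary, fallback, max_latency_s: float = 1.0):
        self.primary = primary
        self.fallback = fallback
        self.max_latency_s = max_latency_s

    def transcribe(self, audio):
        start = time.monotonic()
        try:
            result = self.primary(audio)
            # Accept the primary's result only if it came back in time.
            if time.monotonic() - start <= self.max_latency_s:
                return result
        except Exception:
            pass  # outage or API error: fall through to the backup
        return self.fallback(audio)
```

In production you would also want to record which provider served each request, so a quiet degradation of the primary shows up in your metrics rather than only in user complaints.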

Frequently Asked Questions

What is the best speech-to-text API for voice agents?

Deepgram is the best overall STT API for real-time voice agents. It delivers the lowest streaming latency, excellent accuracy with Nova-2, the best endpointing for conversational turn-taking, and competitive pricing. For batch transcription or self-hosting, OpenAI Whisper is the strongest option. For multilingual deployments, Google Cloud STT has the widest language coverage.

How much does a speech-to-text API cost?

STT API pricing ranges from $0.004/minute (Deepgram pay-as-you-go) to $0.065/minute (AssemblyAI premium features). For a voice agent handling 10,000 calls per month at 3 minutes average, monthly STT costs range from $120 to $1,950 depending on the provider and features used. Whisper API is $0.006/minute. Most providers offer volume discounts at scale.

What is word error rate and what WER should I target?

Word error rate (WER) measures the percentage of words incorrectly transcribed. It includes substitutions, insertions, and deletions. For voice agents, target below 8% WER on your actual audio conditions. Below 5% is excellent. Above 12% causes frequent misunderstandings that degrade the user experience. WER benchmarks from providers are measured on clean audio — your real-world WER will be higher due to background noise, accents, and telephony audio quality.

Can I self-host a speech-to-text model?

Yes. OpenAI Whisper is open-source and can be self-hosted on your own GPU infrastructure. Faster-whisper and whisper.cpp are optimized implementations that reduce hardware requirements. Self-hosting eliminates per-minute API costs but requires GPU infrastructure ($0.50-4/hour for cloud GPUs), model optimization expertise, and ongoing maintenance. Self-hosting makes sense for very high-volume applications (100K+ hours/month) or strict data residency requirements. For real-time voice agents, self-hosted Whisper requires additional engineering for streaming support since the model is batch-native.

What is endpointing and why does it matter?

Endpointing (also called voice activity detection or utterance detection) is how the STT engine determines when a user has finished speaking. Good endpointing lets your voice agent respond quickly without cutting the user off. Bad endpointing either waits too long (adding latency) or triggers too early (interrupting the user). Deepgram has the best endpointing for voice agents, with configurable silence thresholds and smart detection of natural pauses vs. actual turn completion.

Do I need speaker diarization for a voice agent?

For a basic two-party voice agent (one user, one agent), you typically do not need STT-level diarization because you already know who is speaking based on the audio channels. Your agent's speech comes from TTS and the user's speech comes from the microphone or phone line. Diarization becomes important for conference calls, multi-party conversations, or when analyzing recorded calls where you need to attribute speech to specific speakers.

The Bottom Line

For real-time voice agents, Deepgram is the clear leader. It has the lowest streaming latency, the best endpointing for conversational turn-taking, excellent accuracy with Nova-2, and pricing that scales well. If you are building a voice agent that needs to respond in real time, start with Deepgram.

AssemblyAI is the best choice if you need rich analytics on top of transcription — sentiment analysis, topic detection, and summarization built into the same API. OpenAI Whisper is unbeatable for batch processing or self-hosted deployments where you want to eliminate per-minute costs. Google Cloud STT and AWS Transcribe are the pragmatic choices for enterprise teams already committed to those cloud platforms.

Whichever provider you choose, invest in your streaming integration. Use interim results, configure endpointing carefully, handle barge-in gracefully, and build failover into your pipeline. The implementation details matter as much as the provider.