Deepgram Review: Speech-to-Text API (2026)

Deepgram is the STT engine we reach for first when building voice agent pipelines. Here's why it earns that spot and where it falls short.

4.5 / 5
Best for

Real-time STT in voice agent pipelines

Pricing

Pay-per-audio-minute, free tier included

Category

Speech-to-Text API

What is Deepgram?

Deepgram is a speech-to-text API built from the ground up for developers who need fast, accurate transcription at scale. Unlike legacy providers that bolted APIs onto existing speech recognition systems, Deepgram trained its own end-to-end deep learning models specifically for API-first use cases -- and it shows in both the developer experience and the performance numbers.

If you build voice agents, Deepgram is likely already in your stack or on your shortlist. The platform powers the STT layer for a significant portion of production voice agent pipelines, including those built on Vapi, Retell, and custom Twilio-based architectures. When platforms like these list "Deepgram" as a provider option, there's a reason it's often the default.

We've used Deepgram daily in production voice agent pipelines connected through Twilio for over a year. The WebSocket streaming API is the backbone of our real-time transcription, and Nova-2 is our default model. This review is based on that hands-on experience, not a weekend evaluation.

Key Features

Nova-2 Model

Deepgram's flagship model delivers the lowest English word error rate we've tested. Nova-2 handles conversational speech, phone audio, accented English, and background noise significantly better than Whisper or Google STT. It's the model you want for production voice agents where every missed word costs you a failed intent detection.

Streaming API

The WebSocket-based streaming API is where Deepgram truly separates from the pack. You send raw audio bytes over a persistent connection and receive interim and final transcript events in real time. Latency sits below 300ms in our measurements. For voice agents, this means you can start processing the caller's intent before they finish speaking.
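To make that flow concrete, here's a minimal sketch of a streaming connection using the third-party `websockets` package. The endpoint and query parameter names reflect Deepgram's documented `/v1/listen` interface, but treat the exact option names and the response shape as assumptions to verify against the current API reference before shipping:

```python
# Sketch only: endpoint, query params, and event shape assumed from
# Deepgram's /v1/listen docs -- verify against the current reference.
import asyncio
import json
from urllib.parse import urlencode

DG_STREAM_ENDPOINT = "wss://api.deepgram.com/v1/listen"

def build_stream_url(model="nova-2", language="en", encoding="linear16",
                     sample_rate=16000, interim_results=True, smart_format=True):
    """Assemble the streaming URL; Deepgram takes options as query params."""
    params = {
        "model": model,
        "language": language,
        "encoding": encoding,        # raw audio format you will send
        "sample_rate": sample_rate,  # must match the audio you send
        "interim_results": str(interim_results).lower(),
        "smart_format": str(smart_format).lower(),
    }
    return f"{DG_STREAM_ENDPOINT}?{urlencode(params)}"

async def transcribe_stream(audio_chunks, api_key):
    """Send audio chunks over the socket and yield (is_final, text) events.

    `audio_chunks` is any async iterable of raw audio bytes.
    Requires the third-party `websockets` package (pip install websockets).
    """
    import websockets

    # `additional_headers` on websockets >= 14; older releases call it
    # `extra_headers`.
    async with websockets.connect(
        build_stream_url(),
        additional_headers={"Authorization": f"Token {api_key}"},
    ) as ws:
        async def sender():
            async for chunk in audio_chunks:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        send_task = asyncio.create_task(sender())
        async for message in ws:
            event = json.loads(message)
            alt = event.get("channel", {}).get("alternatives", [{}])[0]
            if alt.get("transcript"):
                yield event.get("is_final", False), alt["transcript"]
        await send_task
```

In practice you'd feed `transcribe_stream` from your telephony layer and route interim results into intent detection without waiting for the final transcript.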

Diarization

Speaker diarization labels each transcript segment by speaker, which is essential for voice agent pipelines where you need to separate the agent's speech from the caller's. It works reliably for two-speaker phone calls. Accuracy drops with more than three speakers or heavy crosstalk, but the core use case is well-served. Included at no extra charge.
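Turning diarized output into speaker turns is a few lines of post-processing. The sketch below assumes the word-level `speaker` field you get when requesting `diarize=true` (shape taken from the docs, and the sample payload is illustrative, not a captured response):

```python
def group_by_speaker(words):
    """Collapse a diarized word list into (speaker, utterance) turns.

    Each word dict is expected to carry "word" and "speaker" keys, as in
    Deepgram responses requested with diarize=true (shape assumed from
    the docs; verify against a real response).
    """
    turns = []
    for w in words:
        if turns and turns[-1][0] == w["speaker"]:
            turns[-1] = (w["speaker"], turns[-1][1] + " " + w["word"])
        else:
            turns.append((w["speaker"], w["word"]))
    return turns

# Illustrative payload, not a captured API response.
sample_words = [
    {"word": "hi", "speaker": 0},
    {"word": "there", "speaker": 0},
    {"word": "hello", "speaker": 1},
    {"word": "thanks", "speaker": 0},
]
print(group_by_speaker(sample_words))
# [(0, 'hi there'), (1, 'hello'), (0, 'thanks')]
```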

Smart Formatting

Automatic formatting of numbers, currency, dates, emails, URLs, and phone numbers. Instead of "my number is five five five one two three four" you get "my number is 555-1234." This sounds minor but saves significant post-processing work when you need to extract structured data from voice conversations.

Custom Vocabulary

Boost recognition of domain-specific terms, product names, or jargon by passing a custom vocabulary list. If your voice agent handles medical appointments, legal terms, or proprietary product names, this prevents the model from misrecognizing niche vocabulary. The boost is meaningful -- we've seen 30-40% error reduction on domain terms.
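Keyword boosting is passed as repeated query parameters rather than a request body. The `keywords=term:boost` format below is our reading of the docs -- the optional numeric intensifier raises how strongly the term is favored -- so double-check the exact syntax for the model you're on:

```python
from urllib.parse import quote

def keyword_params(terms):
    """Build repeated `keywords` query parameters for Deepgram keyword
    boosting. Format assumed from the docs: keywords=term or
    keywords=term:boost, where boost is a numeric intensifier."""
    parts = []
    for term in terms:
        if isinstance(term, tuple):
            word, boost = term
            parts.append(f"keywords={quote(word)}:{boost}")
        else:
            parts.append(f"keywords={quote(term)}")
    return "&".join(parts)

# Hypothetical medical-appointment vocabulary for illustration.
print(keyword_params(["metformin", ("Zyrtec", 2)]))
# keywords=metformin&keywords=Zyrtec:2
```

Append the result to your `/v1/listen` URL alongside the other options.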

Language Support

Over 30 languages supported with automatic language detection. English is best in class. Spanish, French, German, and Portuguese are strong. Hindi, Japanese, and Korean are usable but noticeably less accurate. Less common languages can have WER 2-3x higher than English -- always benchmark with your specific use case before going to production.

Developer Experience

Deepgram has the best developer experience of any STT API we've used. The SDKs (Python, Node, Go, .NET, Rust) are well-maintained, the API surface is clean, and the documentation is genuinely good -- not just reference docs, but practical guides that show you how to build real things.

The WebSocket streaming API deserves special praise. Connecting to the streaming endpoint is straightforward: open a WebSocket, set your parameters (model, language, features), and start sending audio chunks. The event model is clean -- you get interim results as the speaker talks and final results when they pause. Handling this in a Twilio Media Streams pipeline takes about 50 lines of code.
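Most of those 50 lines are glue like this: Twilio Media Streams delivers JSON frames whose "media" events carry base64-encoded 8 kHz mu-law audio, which you forward verbatim to a Deepgram socket opened with `encoding=mulaw&sample_rate=8000`. A minimal decoder, with the frame shape assumed from Twilio's Media Streams docs:

```python
import base64
import json

def twilio_media_to_bytes(message):
    """Extract raw mu-law audio bytes from a Twilio Media Streams frame.

    Twilio sends JSON messages; "media" events carry base64-encoded
    8 kHz mu-law audio in media.payload. Returns None for non-media
    events (start, mark, stop). Frame shape assumed from Twilio's docs.
    """
    event = json.loads(message)
    if event.get("event") != "media":
        return None
    return base64.b64decode(event["media"]["payload"])

# Illustrative frame, not a captured Twilio message.
frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\x7f\x00").decode()},
})
print(twilio_media_to_bytes(frame))
# b'\x7f\x00'
```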

The REST API for pre-recorded audio is equally simple. POST a file or URL, get a transcript back. Batch processing, callbacks, and async transcription are all supported for when you're not doing real-time work.
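Here's what that looks like with nothing but the standard library -- a sketch assuming the documented request shape (JSON body with a `url` key, options as query params, `Authorization: Token <key>`); the response-parsing path is likewise our reading of the docs:

```python
import json
import urllib.request

def build_prerecorded_request(audio_url, api_key, model="nova-2",
                              smart_format=True):
    """Build a POST for Deepgram's pre-recorded /v1/listen endpoint.

    Request shape assumed from the docs: JSON body {"url": ...},
    options as query params, `Authorization: Token <key>` header.
    """
    endpoint = (f"https://api.deepgram.com/v1/listen"
                f"?model={model}&smart_format={str(smart_format).lower()}")
    body = json.dumps({"url": audio_url}).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Authorization": f"Token {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

def transcribe_url(audio_url, api_key):
    """Send the request and pull out the top transcript alternative."""
    req = build_prerecorded_request(audio_url, api_key)
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return result["results"]["channels"][0]["alternatives"][0]["transcript"]
```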

The developer console is functional and gives you usage metrics, API key management, and a playground for testing. It's not flashy, but everything works. Support is responsive -- we've gotten answers to technical questions within a few hours on the developer community, and enterprise support is faster.

One small complaint: the SDK versioning has occasionally introduced breaking changes between minor versions. Pin your dependency versions and read the changelog before upgrading.

Performance

Latency is where Deepgram earns its reputation. Streaming transcription latency consistently sits below 300ms in our production pipeline -- that's the time from when audio bytes hit the WebSocket to when you get a transcript event back. For voice agents, this is the difference between a natural-feeling conversation and an awkward pause. Sub-300ms is fast enough that the STT step is rarely the bottleneck in a voice agent pipeline.

Accuracy on English conversational speech is the best we've benchmarked. Nova-2 achieves word error rates in the 8-12% range on typical phone call audio, which is meaningfully better than Whisper (12-18%) and Google STT (10-15%) on the same test sets. On clean studio audio the gap narrows, but voice agents don't deal with clean studio audio -- they deal with cell phone calls, noisy offices, and bluetooth headsets.

Throughput is also strong. The API handles concurrent connections well, and we haven't hit scaling issues even during peak load. Pre-recorded batch transcription processes faster than real-time -- a one-hour audio file transcribes in about 15-20 seconds.

Reliability has been excellent. In over a year of continuous production use, we've experienced two brief degradation events (both under 30 minutes) and zero full outages. The status page is honest and updates are timely. For a component that sits in the critical path of every voice call, that reliability matters enormously.

Pricing

Deepgram uses pay-per-audio-minute pricing. Nova-2 costs $0.0043/minute ($0.26/hour) on pay-as-you-go, which is competitive for the quality you get. The free tier includes $200 in credit -- enough for roughly 775 hours of transcription, which is generous for development and testing.
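The arithmetic is linear, so budgeting is a one-liner. Using the rates cited in this review (verify against Deepgram's current pricing page before planning spend):

```python
NOVA2_PER_MINUTE = 0.0043  # USD, pay-as-you-go rate cited in this review

def monthly_cost(audio_hours, per_minute=NOVA2_PER_MINUTE):
    """Linear pay-as-you-go cost for a given volume of audio."""
    return audio_hours * 60 * per_minute

def free_tier_hours(credit=200.0, per_minute=NOVA2_PER_MINUTE):
    """Hours of audio covered by the signup credit."""
    return credit / per_minute / 60

print(round(monthly_cost(1), 3))   # 0.258 -> the ~$0.26/hour figure
print(round(free_tier_hours()))    # 775
```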

The Growth plan offers volume discounts and starts at $4,000/year with committed usage. Enterprise plans add dedicated infrastructure, custom SLAs, and further price breaks. For teams processing millions of minutes per month, the per-minute cost drops substantially.

Features like diarization, smart formatting, and custom vocabulary are included at no extra charge, which is a welcome change from providers that nickel-and-dime on every add-on.

The honest take on cost: Deepgram is not the cheapest option. If you're transcribing podcasts or meeting recordings where latency doesn't matter, Whisper (self-hosted or through OpenAI) can be significantly cheaper. But for real-time voice agent pipelines where accuracy and latency are critical, the price premium is justified. At high volume -- tens of thousands of hours per month -- the costs add up, and it's worth negotiating an enterprise deal.

Pros and Cons

Pros

  • Best-in-class English accuracy with Nova-2
  • Sub-300ms streaming latency for real-time use
  • Clean WebSocket API, excellent SDKs and docs
  • Speaker diarization included at no extra cost
  • Smart formatting saves post-processing work
  • Custom vocabulary boosts domain-specific accuracy
  • Generous $200 free tier for development
  • Rock-solid reliability in production

Cons

  • More expensive than self-hosted Whisper at scale
  • Non-English language accuracy noticeably lower
  • SDK versioning can introduce breaking changes
  • Diarization degrades with 3+ speakers or crosstalk
  • No on-premise deployment option for air-gapped environments
  • Enterprise pricing requires sales conversation
  • Less common languages have 2-3x higher WER than English

Who Should Use Deepgram?

Voice agent builders who need the fastest, most accurate real-time STT available. If you're building on Twilio, LiveKit, or any WebRTC-based system and need streaming transcription that doesn't bottleneck your pipeline, Deepgram is the answer. Most production voice agent platforms (Vapi, Retell, custom stacks) use Deepgram as their default STT for a reason.

Call center analytics teams processing large volumes of recorded calls. Deepgram's batch API transcribes faster than real-time, and the combination of diarization, smart formatting, and high accuracy makes it excellent for post-call analysis, compliance monitoring, and agent coaching.

Developers building accessibility features who need reliable real-time captioning. The streaming API's low latency and accuracy make it suitable for live captioning in apps, meetings, and broadcast.

Skip Deepgram if: You only need batch transcription of clean audio and cost is your primary concern (Whisper is cheaper). You need on-premise deployment for regulatory reasons. Or your primary language isn't well-supported -- test accuracy with your specific language and accent before committing.

Deepgram vs Alternatives

| Feature | Deepgram | OpenAI Whisper | Google STT |
|---|---|---|---|
| Streaming Latency | <300ms | No native streaming | 300-500ms |
| English Accuracy | Best (8-12% WER) | Good (12-18% WER) | Good (10-15% WER) |
| Language Coverage | 30+ languages | 99+ languages | 125+ languages |
| Self-Hosted Option | No | Yes (open source) | No |
| Diarization | Included free | Not built-in | Extra cost |
| Best For | Real-time voice agents | Batch transcription, multilingual | GCP-integrated workflows |

Frequently Asked Questions

Is Deepgram more accurate than OpenAI Whisper?

For real-time streaming transcription, yes. Deepgram Nova-2 consistently outperforms Whisper on English word error rate (WER) benchmarks, particularly on conversational speech with background noise. Whisper is competitive for batch transcription of clean audio, but Deepgram wins on speed, streaming support, and accuracy in noisy real-world conditions.

How much does Deepgram cost per audio hour?

Deepgram's pay-as-you-go pricing starts at $0.0043 per minute for the Nova-2 model (about $0.26 per audio hour). The free tier includes $200 in credit, which is roughly 775 hours of audio. Volume discounts are available on Growth and Enterprise plans. Costs scale linearly with usage, and features like diarization and smart formatting are included at no extra charge.

Does Deepgram support real-time streaming transcription?

Yes, and it's one of Deepgram's biggest strengths. The WebSocket streaming API delivers interim and final transcripts with sub-300ms latency. You open a WebSocket connection, stream raw audio bytes, and get back transcript events in real time. This is critical for voice agent pipelines where you need to know what the caller said before they finish speaking.

What languages does Deepgram support?

Deepgram supports over 30 languages including English, Spanish, French, German, Portuguese, Hindi, Japanese, Korean, and Mandarin. English accuracy is best in class. European languages are strong. Some less common languages have noticeably higher error rates, so test with your specific language and accent before committing to production.

Can Deepgram identify different speakers?

Yes. Deepgram offers speaker diarization that labels transcript segments by speaker. It works well for two-speaker phone calls, which is the most common voice agent scenario. Accuracy decreases with more speakers or crosstalk. Diarization is included in the standard per-minute pricing with no extra cost.

How does Deepgram compare to Google Speech-to-Text?

Deepgram is generally faster (lower streaming latency), more accurate on conversational English (lower WER), and simpler to integrate. Google Speech-to-Text has broader language coverage and tighter integration with GCP services. For voice agent pipelines where latency and English accuracy matter most, Deepgram is the better choice. For multilingual enterprise deployments already on GCP, Google may make more sense.

The Bottom Line

Deepgram is the best speech-to-text API for real-time voice applications in 2026. Nova-2's combination of sub-300ms streaming latency, industry-leading English accuracy, and a clean developer experience makes it the default choice for production voice agent pipelines. It's not the cheapest option and it's not the best for every language, but for the use case that matters most in voice AI -- real-time English transcription with high reliability -- nothing else comes close.