Updated March 2026

Deepgram vs Whisper: Speech-to-Text Comparison

A technical comparison of Deepgram's commercial STT API and OpenAI's open-source Whisper model. Based on production experience building real-time voice agent pipelines with Twilio, not vendor marketing pages.


Quick Comparison

Feature             | Deepgram               | Whisper
Type                | Commercial API         | Open source (MIT) + OpenAI API
Best Model          | Nova-2                 | large-v3 / large-v3-turbo
WER (English)       | ~8.4%                  | ~8.8% (large-v3)
Streaming Latency   | <300ms                 | 1-5s (batch/chunked)
Real-time Streaming | Native WebSocket       | Not built-in (community hacks)
Price (per minute)  | $0.0043 (Nova-2)       | $0.006 (API) / free self-hosted*
Languages           | 36+                    | 99
Diarization         | Built-in               | Requires extra pipeline
Self-Hostable       | On-prem (enterprise)   | Yes, fully open source
Best For            | Real-time voice agents | Batch transcription, research

*Self-hosted Whisper requires GPU infrastructure ($200-800+/mo for production workloads)

Overview

Deepgram

Deepgram is a commercial speech-to-text API purpose-built for real-time applications. Their Nova-2 model was trained on massive proprietary datasets and optimized for low-latency streaming over WebSockets. It's the STT engine behind most production voice agent platforms including Vapi, Retell, and Bland.

Deepgram also offers features like speaker diarization, punctuation, smart formatting, topic detection, and intent recognition as API add-ons. You hit an endpoint and get transcripts back -- no infrastructure to manage.
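That request/response flow can be sketched in a few lines. The endpoint and the feature names shown (punctuate, diarize, smart_format) follow Deepgram's public pre-recorded API, but treat this as a sketch, not an exhaustive reference:

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"

def build_listen_url(model="nova-2", **features):
    """Build a Deepgram pre-recorded transcription URL.

    Features like punctuation and diarization are just query
    parameters -- there is no ML pipeline to configure.
    """
    params = {"model": model, **{k: str(v).lower() for k, v in features.items()}}
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"

url = build_listen_url(punctuate=True, diarize=True, smart_format=True)
# POST the raw audio bytes to `url` with an "Authorization: Token <API_KEY>"
# header, e.g. requests.post(url, headers=headers, data=audio_bytes)
print(url)
```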

Whisper

OpenAI's Whisper is an open-source speech recognition model released under the MIT license. Trained on 680,000 hours of multilingual audio, it delivers impressive accuracy across 99 languages. You can self-host it on your own GPUs or use it through OpenAI's API.

Whisper was designed as a batch transcription model, not a streaming one. Community forks like faster-whisper (CTranslate2) and whisper.cpp (GGML) add speed optimizations, and whisper-streaming provides pseudo-real-time capabilities by chunking audio buffers.

Accuracy

Accuracy in speech-to-text is measured by Word Error Rate (WER) -- the lower the better. On clean English audio, both Deepgram Nova-2 and Whisper large-v3 are remarkably close, typically in the 8-12% WER range depending on the benchmark dataset.
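WER itself is simple to compute: word-level edit distance divided by the number of reference words. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```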

Where they diverge is on noisy telephony audio. Deepgram trains models specifically for phone call audio (the "phonecall" model variant), which handle background noise, compression artifacts, and low-bitrate codecs significantly better. In our testing with Twilio Media Streams (8kHz mulaw), Deepgram's WER was 3-5 percentage points lower than Whisper's on the same audio.

Whisper's strength is multilingual transcription. It was trained on audio from 99 languages simultaneously, and the large-v3 model delivers strong accuracy even on low-resource languages where Deepgram has limited or no coverage.

Our take on accuracy:

For English voice agents handling phone calls, Deepgram wins on accuracy. For multilingual or research use cases with clean audio, Whisper large-v3 is equal or better. In practice, the accuracy gap between them is small enough that latency and integration requirements are usually more important decision factors.

Latency

If you're building real-time voice agents, latency is the single most important metric. Every millisecond of STT delay compounds with your LLM inference time and TTS synthesis to determine how fast your agent responds. Users start perceiving conversations as "laggy" above 800ms total round-trip time.
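That budget is simple arithmetic, but it is worth making explicit. The stage timings below are illustrative numbers, not measurements:

```python
def total_turn_latency(stt_ms: float, llm_ms: float, tts_ms: float,
                       network_ms: float = 100) -> float:
    """Sum of pipeline stages for one conversational turn (illustrative)."""
    return stt_ms + llm_ms + tts_ms + network_ms

# Streaming STT (~300ms) leaves headroom under an 800ms perceived-lag budget;
# batch-style STT (seconds) blows the budget on its own.
print(total_turn_latency(stt_ms=300, llm_ms=250, tts_ms=100))   # 750
print(total_turn_latency(stt_ms=2000, llm_ms=250, tts_ms=100))  # 2450
```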

Deepgram Streaming

  • Native WebSocket streaming with partial results in <300ms
  • Interim results let you start LLM inference before the user finishes speaking
  • Endpointing detection (speech-end) built into the API
  • Direct integration with Twilio Media Streams and other telephony WebSockets
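On the consuming side, each WebSocket message is a JSON result you parse in a few lines. The payload shape below (channel.alternatives[0].transcript plus an is_final flag) follows Deepgram's documented streaming responses; a hedged sketch:

```python
import json

def parse_live_result(message: str):
    """Extract the transcript and finality flag from one Deepgram
    live-streaming result message."""
    data = json.loads(message)
    alternatives = data.get("channel", {}).get("alternatives", [])
    transcript = alternatives[0]["transcript"] if alternatives else ""
    return transcript, data.get("is_final", False)

# Interim results arrive with is_final=False -- you can start LLM
# inference on them before the caller finishes speaking.
msg = json.dumps({
    "channel": {"alternatives": [{"transcript": "book a table for two"}]},
    "is_final": True,
})
print(parse_live_result(msg))  # ('book a table for two', True)
```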

Whisper Latency

  • Batch model by design -- must buffer full utterance before processing
  • faster-whisper adds chunked pseudo-streaming: 1-3s latency per chunk
  • No native endpointing -- you need a separate VAD (voice activity detection)
  • OpenAI API only supports batch requests, not streaming
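To make the missing-endpointing point concrete, here is a toy energy-threshold endpointer of the kind you would have to bolt on before Whisper can process live audio. Real pipelines use a trained VAD such as Silero; the threshold and sample values here are invented for illustration:

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def utterance_ended(frames, threshold=500.0, trailing_silence=3):
    """Crude endpointing: declare the utterance done once the last N
    frames fall below an energy threshold. Only then is the buffered
    audio sent to Whisper for (batch) transcription."""
    if len(frames) < trailing_silence:
        return False
    return all(rms(f) < threshold for f in frames[-trailing_silence:])

speech = [1200, -1100, 900, -1000] * 4   # loud 16-bit PCM samples
silence = [12, -9, 15, -11] * 4          # near-silent samples
frames = [speech, speech, silence, silence, silence]
print(utterance_ended(frames))  # True -- safe to flush buffer to Whisper
```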

For real-time voice agents, latency is the deciding factor and Deepgram's streaming API wins handily. This is not a close contest -- Deepgram was engineered for real-time from day one, while Whisper was designed for batch transcription of recorded audio. If your use case involves live conversations (voice agents, live captioning, call center automation), Deepgram is the only serious choice.

Pricing

Pricing for STT is more nuanced than it appears. The "Whisper is free" narrative misses the real cost of GPU infrastructure, DevOps, and scaling challenges.

Deepgram API

$0.0043

per minute (Nova-2)

  • No infrastructure to manage
  • Volume discounts available
  • $200 free credit to start
  • Scales automatically

OpenAI Whisper API

$0.006

per minute

  • No infrastructure needed
  • 40% more expensive than Deepgram
  • Batch only, no streaming
  • 25MB file size limit

Self-Hosted Whisper

$0*

per minute (API cost)

  • No per-minute charges
  • GPU instance: $200-800/mo
  • DevOps overhead for scaling
  • Full data privacy control

Cost math:

At 10,000 minutes/month, Deepgram costs ~$43/mo. The OpenAI Whisper API costs ~$60/mo. Self-hosting Whisper large-v3 on an A10G GPU (AWS g5.xlarge) runs about $400/mo before engineering time. Self-hosting only becomes cost-effective above ~100,000 minutes/month, and even then you're trading money for significant operational complexity.
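The break-even math above is easy to sanity-check. The $400/month self-hosting figure is the illustrative number from this section, not a quote:

```python
DEEPGRAM_PER_MIN = 0.0043    # Nova-2 pay-as-you-go
WHISPER_API_PER_MIN = 0.006  # OpenAI Whisper API
SELF_HOSTED_FIXED = 400.0    # illustrative single-GPU instance, $/month

def monthly_cost(minutes: float) -> dict:
    return {
        "deepgram": minutes * DEEPGRAM_PER_MIN,
        "whisper_api": minutes * WHISPER_API_PER_MIN,
        "self_hosted": SELF_HOSTED_FIXED,  # flat until one GPU saturates
    }

print(monthly_cost(10_000))
# Self-hosting undercuts Deepgram only above 400 / 0.0043 ≈ 93,000 min/month,
# before counting the engineering time to run it.
print(SELF_HOSTED_FIXED / DEEPGRAM_PER_MIN)
```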

Language Support

Whisper has a clear advantage in language breadth. With 99 supported languages, it covers far more ground than Deepgram's 36+. For teams building products that need to handle less common languages or code-switching (users alternating between languages mid-sentence), Whisper is the stronger choice.

However, Deepgram's coverage of the most commercially important languages (English, Spanish, French, German, Portuguese, Japanese, Korean, Hindi, Mandarin) is strong, and they typically offer better accent recognition within those languages. Deepgram also supports language detection, so you don't need to specify the language upfront.

For most voice agent use cases -- which are overwhelmingly English, Spanish, and a handful of other major languages -- both platforms deliver adequate coverage.

Integration Complexity

Deepgram Integration

Deepgram provides official SDKs for Python, Node.js, Go, .NET, and Rust. The streaming API uses standard WebSockets -- open a connection, pipe audio bytes in, get JSON transcripts back. Integration with Twilio Media Streams takes about 50 lines of code.

The API handles endpointing, punctuation, formatting, and diarization as query parameters. No ML pipeline to build -- just configure the API call.

Hours to integrate, not days

Whisper Integration

Via OpenAI API: Simple REST endpoint. Upload audio file, get transcript back. Good for batch jobs but no streaming support.

Self-hosted: Install Python dependencies, configure GPU, set up an inference server (FastAPI/Triton), implement audio chunking for pseudo-streaming, add VAD for endpointing, handle scaling under load, manage model versions. This is a real engineering project.
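A sketch of the simplest self-hosted path, assuming the faster-whisper package and a GPU. The commented calls follow faster-whisper's documented usage; the helper and file name below are hypothetical conveniences, not part of the library:

```python
from collections import namedtuple

# Stand-in with the same fields faster-whisper segments expose
Segment = namedtuple("Segment", ["start", "end", "text"])

def stitch_segments(segments) -> str:
    """Join per-segment texts into one transcript, trimming whitespace."""
    return " ".join(seg.text.strip() for seg in segments)

# With faster-whisper installed and a GPU available (not run here):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3", device="cuda", compute_type="float16")
#   segments, info = model.transcribe("call_recording.wav", vad_filter=True)
#   print(stitch_segments(segments))

demo = [Segment(0.0, 2.1, " Hello there."), Segment(2.1, 4.0, "How can I help?")]
print(stitch_segments(demo))  # Hello there. How can I help?
```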

API: hours. Self-hosted: days to weeks

When to Choose Each

Choose Deepgram When...

  • Building real-time voice agents -- the streaming latency advantage is massive and non-negotiable for conversational AI
  • Processing phone call audio -- Deepgram's telephony-specific models handle 8kHz mulaw audio and background noise much better
  • You need speaker diarization -- built into the API with a single parameter
  • You want minimal infrastructure -- managed API with no GPUs to provision or scale
  • Using Twilio, Vonage, or telephony WebSockets -- first-class support for media stream protocols
  • You need live captioning or subtitles -- streaming partial results enable real-time display

Choose Whisper When...

  • Batch transcription of recorded audio -- processing podcasts, interviews, meetings, or lectures where latency doesn't matter
  • You need 60+ language support -- Whisper covers 99 languages, far more than Deepgram
  • Data privacy is critical -- self-hosting means audio never leaves your servers
  • You have GPU infrastructure already -- if you're already running ML workloads, adding Whisper is incremental
  • Research or experimentation -- fine-tuning Whisper on domain-specific data is straightforward with the open weights
  • Very high volume (>100K min/month) -- self-hosted can be cheaper at scale if you have DevOps capacity

Verdict

Deepgram and Whisper are both excellent speech-to-text engines, but they serve fundamentally different use cases. This is not a matter of one being "better" -- it's about matching the tool to the job.

For real-time voice agents and conversational AI:

Deepgram is the clear winner. The native streaming API, sub-300ms latency, telephony-optimized models, and built-in endpointing make it the backbone of virtually every serious voice agent platform. There is no combination of Whisper workarounds that comes close to Deepgram's real-time performance. If you're building with Twilio, LiveKit, or any telephony stack, Deepgram is the default answer.

For batch transcription and offline processing:

Whisper is a strong choice, especially if you need broad language support or want to self-host for privacy. The OpenAI API is the simplest option for occasional use, while self-hosting with faster-whisper gives you maximum control and can be cost-effective at very high volumes.

Many production systems use both: Deepgram for real-time transcription during live calls, and Whisper for post-call processing, summarization, and archival transcription of recorded audio. They're complementary tools, not always competitors.

Frequently Asked Questions

Is Deepgram more accurate than Whisper?

In most real-world benchmarks, Deepgram Nova-2 and Whisper large-v3 achieve similar word error rates (WER) in the 8-12% range on clean English audio. Deepgram tends to edge ahead on noisy telephony audio and domain-specific vocabulary thanks to its custom-trained models, while Whisper large-v3 performs exceptionally well on multilingual content. The accuracy gap is small enough that latency and integration requirements are usually the deciding factors.

Can Whisper do real-time streaming transcription?

OpenAI Whisper was designed as a batch model, not a streaming one. Community projects like faster-whisper and whisper-streaming add pseudo-streaming by chunking audio, but they introduce 1-3 seconds of latency per chunk. Deepgram was built from the ground up for streaming and delivers partial transcripts in under 300ms. For real-time voice agents where every millisecond counts, Deepgram is the clear winner.

How much does Deepgram cost compared to Whisper?

Deepgram charges $0.0043/minute for its Nova-2 model (pay-as-you-go) with volume discounts available. The OpenAI Whisper API costs $0.006/minute. Self-hosting Whisper is "free" in API fees but requires GPU infrastructure -- expect $200-800/month for a single GPU instance capable of real-time processing, plus engineering time for scaling and maintenance. Deepgram is cheaper per minute than the OpenAI API and carries far less operational overhead than self-hosting.

Which is better for building voice agents with Twilio?

Deepgram is the better choice for Twilio voice agent pipelines. Its WebSocket streaming API integrates directly with Twilio Media Streams, providing sub-300ms transcription latency. Whisper requires buffering audio chunks and sending them as batch requests, adding 1-5 seconds of latency that makes conversations feel sluggish. Most production voice agent platforms (Vapi, Retell, Bland) use Deepgram as their default STT provider for this reason.

Does Whisper support more languages than Deepgram?

Yes. Whisper supports 99 languages out of the box, while Deepgram supports 36+ languages. If you need transcription in less common languages, Whisper has broader coverage. However, for the top 15-20 most common languages, both platforms deliver strong accuracy, and Deepgram typically has better support for accents and dialects within those languages.

Can I use Whisper for free?

The open-source Whisper model weights are free to download and use under the MIT license. However, running Whisper requires GPU hardware. You can run smaller models (tiny, base, small) on a CPU, but accuracy drops significantly. For production quality, you need at least the medium model on a GPU. The OpenAI Whisper API charges $0.006/minute and handles all infrastructure for you.