How AI Voice Agents Work: The Complete Guide

A deep-dive into every layer of the voice agent stack — from raw telephone audio to intelligent conversation. Written by engineers who build these systems in production.

Last updated: March 2026 · 20 min read

What Is a Voice Agent?

A voice agent is software that can hold a real-time, two-way phone conversation with a human — understanding what they say, reasoning about it, and responding with natural-sounding speech. Unlike traditional IVR systems that funnel callers through rigid menu trees ("Press 1 for billing..."), voice agents handle freeform dialogue. They understand intent, maintain context across turns, and take actions mid-conversation.

The technology became practical in 2023-2024 when three trends converged: LLMs got fast enough for real-time conversation (sub-500ms to first token), streaming speech-to-text hit near-human accuracy, and neural text-to-speech became indistinguishable from human voices at latencies under 200ms. By 2025, platforms like Vapi and Retell made it possible to deploy production voice agents without building the entire stack from scratch.

Today, voice agents handle millions of phone calls daily — booking appointments, qualifying sales leads, providing customer support, conducting surveys, and even making outbound calls. The market is growing because, unlike chatbots, voice agents meet people where they already are: on the phone.

Common Voice Agent Use Cases

Inbound customer support
Appointment scheduling
Lead qualification
Outbound sales calls
Restaurant reservations
Insurance claims intake
IT helpdesk triage
Debt collection

The Voice Agent Pipeline

Every voice agent, regardless of the platform or framework, follows the same five-stage pipeline. A phone call comes in, the audio is transcribed to text, an LLM processes the text and decides what to say, that response is converted back to speech, and the audio is sent back to the caller. This loop runs continuously for every conversational turn.

Telephony (audio in) → STT (speech-to-text) → LLM (AI processing) → TTS (text-to-speech) → Telephony (audio out)

The voice agent pipeline: every conversation turn flows through all five stages

The critical insight is that this pipeline runs in a loop for every turn of the conversation. A two-minute phone call might involve 15-20 turns, and each turn must complete the full pipeline in under a second to feel natural. This tight latency constraint shapes every architectural decision in voice AI.

Let's walk through each stage in detail.

Step 1: Telephony Layer

The telephony layer is the bridge between the phone network and your software. When someone calls your voice agent's phone number, the telephony provider answers the call, establishes a media stream, and sends raw audio data to your application in real time.

Twilio is the most common choice. Their Media Streams API opens a WebSocket connection and streams audio as mulaw-encoded 8kHz chunks — the standard telephone audio format. You get a bidirectional pipe: audio flows in from the caller, and you send synthesized audio back out. Twilio handles call routing, phone number provisioning, SIP trunking, and PSTN connectivity.

Vonage (formerly Nexmo) is the main alternative. Their WebSocket API works similarly but uses linear PCM encoding by default, which is slightly higher quality. Some teams prefer Vonage for its international number availability and slightly lower per-minute rates in certain regions.

For web-based voice agents (no phone number needed), WebRTC replaces SIP/PSTN entirely. The browser captures microphone audio and sends it over a peer-to-peer or server-relayed connection. WebRTC offers higher audio quality (16kHz+ wideband) compared to telephone audio (8kHz narrowband), which downstream STT models benefit from. Platforms like Vapi support both phone and WebRTC simultaneously.

Phone (SIP/PSTN)

  • + Reaches any phone number globally
  • + Users need no app or browser
  • - 8kHz narrowband audio quality
  • - Higher per-minute costs

WebRTC (Browser/App)

  • + 16kHz+ wideband audio quality
  • + Lower latency (no PSTN hops)
  • - Requires browser or app
  • - NAT traversal complexity

SIP trunking tip: If you're connecting to an existing PBX or call center, you'll need a SIP trunk between your infrastructure and the telephony provider. Twilio's Elastic SIP Trunking and Vonage's SIP Connect both support this, but configuration involves firewall rules, codec negotiation, and authentication — budget a day for setup and testing.

Step 2: Speech-to-Text (STT)

The STT engine converts raw audio into text that the LLM can process. This is the first transformation in the pipeline, and its speed directly impacts the total response latency. Every millisecond the STT takes is a millisecond the user is waiting in silence.

Streaming vs Batch is the fundamental architectural choice. Batch STT processes an entire audio clip after the user finishes speaking — simpler to implement but adds the full transcription time to latency. Streaming STT processes audio continuously and returns partial transcripts as the user speaks, with a final transcript when they stop. For voice agents, streaming is non-negotiable because it lets downstream components start processing before the user finishes talking.

Leading STT Providers

Deepgram Nova-2

The default choice for most voice agent deployments. Deepgram's streaming API returns interim results in 100-200ms and final transcripts within 300ms of end-of-speech. Their endpointing algorithm (detecting when the user has finished a sentence) is the best in the industry — a critical feature because premature endpointing causes the agent to interrupt the user, while slow endpointing adds dead air.

~100-200ms streaming latency · $0.0059/min · Best endpointing

OpenAI Whisper (API)

Whisper offers excellent accuracy, especially for accented speech and multilingual input. The API version is batch-only (not streaming), so it adds 500ms-2s of latency depending on audio length. Some teams use Whisper for offline transcript correction while using Deepgram for the real-time path. Self-hosted Whisper (via faster-whisper or whisper.cpp) can achieve streaming-like latency but requires GPU infrastructure.

500ms-2s batch latency · $0.006/min · Best multilingual accuracy

AssemblyAI Universal-2

AssemblyAI offers strong real-time transcription with a WebSocket streaming API. Their differentiator is built-in features like speaker diarization, sentiment analysis, and entity detection in the streaming path. If you need more than raw transcription (e.g., detecting caller frustration in real time), AssemblyAI reduces the work you'd otherwise push to the LLM.

~200-300ms streaming latency · $0.0065/min (real-time) · Built-in NLU features

For a deeper comparison, see our Deepgram vs Whisper comparison. The short version: use Deepgram for latency-critical real-time applications, Whisper for accuracy-critical or multilingual use cases.

Key STT Concepts for Voice Agents

Endpointing
Detecting when the user has finished speaking. Too aggressive and the agent interrupts mid-sentence; too conservative and the user waits in silence. Most providers let you tune this threshold.
Interim Results
Partial transcripts sent before the user finishes speaking. Some orchestrators use these to start "pre-thinking" with the LLM, reducing perceived latency.
Voice Activity Detection (VAD)
Distinguishing speech from background noise, breathing, and silence. Critical for knowing when to start and stop processing.
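To make these concepts concrete, here is a toy energy-based VAD and silence endpointer in Python. The thresholds are illustrative only; production systems use trained VAD models rather than raw frame energy, but the control flow is the same:

```python
import struct

FRAME_MS = 20               # typical telephony frame size
ENERGY_THRESHOLD = 500      # illustrative; depends on line gain, codec, noise floor
ENDPOINT_SILENCE_MS = 700   # trailing silence that ends a turn (tunable)

def frame_is_speech(pcm16: bytes, threshold: int = ENERGY_THRESHOLD) -> bool:
    """Crude energy-based VAD: mean absolute amplitude of a 16-bit PCM frame."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    energy = sum(abs(s) for s in samples) / max(len(samples), 1)
    return energy > threshold

class Endpointer:
    """Declares end-of-turn after `silence_ms` of continuous trailing silence."""

    def __init__(self, silence_ms: int = ENDPOINT_SILENCE_MS):
        self.silence_ms = silence_ms
        self.trailing_silence = 0
        self.heard_speech = False

    def push(self, is_speech: bool) -> bool:
        if is_speech:
            self.heard_speech = True
            self.trailing_silence = 0
        else:
            self.trailing_silence += FRAME_MS
        # Only endpoint once we've actually heard speech this turn
        return self.heard_speech and self.trailing_silence >= self.silence_ms
```

Lowering `silence_ms` makes the agent snappier but more likely to interrupt mid-sentence; raising it adds dead air — exactly the trade-off described above.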

Step 3: LLM Processing

The LLM is the "brain" of the voice agent. It receives the transcribed text, the conversation history, a system prompt defining the agent's persona and instructions, and optionally context from external systems (CRM records, knowledge bases). It then generates a text response that will be synthesized into speech.

For voice agents, LLM selection is dominated by one factor: time to first token (TTFT). Unlike chatbots where a 2-second delay is acceptable, voice agents need the first token in under 300ms to avoid awkward silence. This pushes most teams toward the fastest available models, even at the expense of raw reasoning capability.

Popular LLM Choices

Model | TTFT | Best For
GPT-4o mini | ~150-250ms | Most voice agents (best speed/quality ratio)
Claude 3.5 Haiku | ~200-300ms | Complex instructions, nuanced conversation
GPT-4o | ~300-500ms | High-stakes calls needing stronger reasoning
Groq (Llama 3) | ~50-100ms | Ultra-low latency, simpler conversations
Claude 3.5 Sonnet | ~300-500ms | Complex function calling, long conversations
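TTFT is straightforward to measure against any streaming completion API. A provider-agnostic sketch, with a fake token generator standing in for the real SDK's streaming response:

```python
import time
from typing import Iterable, Iterator

def measure_ttft(token_stream: Iterable[str]) -> tuple[float, str]:
    """Consume a streaming LLM response, returning (ttft_seconds, full_text).

    TTFT is clocked from the moment we start waiting on the stream to the
    arrival of the first token -- the number that matters for voice latency.
    """
    start = time.monotonic()
    ttft = None
    parts = []
    for token in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start
        parts.append(token)
    return (ttft if ttft is not None else float("inf"), "".join(parts))

def fake_stream() -> Iterator[str]:
    """Stand-in for an OpenAI/Anthropic/Groq streaming response iterator."""
    for tok in ["Sure", ",", " I", " can", " help", "."]:
        yield tok

ttft, text = measure_ttft(fake_stream())
```

Running this against your actual provider, from your actual deployment region, is the only reliable way to validate the vendor TTFT numbers in the table above.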

Function Calling: How Agents Take Actions

Raw conversation is only half the story. Voice agents become truly useful when they can take actions: book an appointment in your calendar, look up an order in your database, transfer the call to a human agent, or send a follow-up text message. This is done through function calling (sometimes called "tool use").

The LLM is given a list of available functions with descriptions and parameter schemas. When the conversation reaches a point where an action is needed, the LLM outputs a function call instead of a text response. The orchestrator executes the function (an API call, database query, etc.), feeds the result back to the LLM, and the LLM generates a natural language response incorporating that result.

For example, a dental office agent might have functions like check_availability(date, provider), book_appointment(patient_name, date, time, provider), and transfer_to_human(reason). The caller says "I need a cleaning next Tuesday," the LLM calls check_availability, gets back available slots, and responds: "Dr. Smith has openings at 10 AM and 2 PM next Tuesday. Which works better for you?"
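A sketch of the dispatch side of that loop. The tool registry and the fake scheduling data are illustrative stand-ins for a real backend; the `{"name": ..., "arguments": {...}}` call shape mirrors what mainstream function-calling APIs return:

```python
import json

# Hypothetical availability data standing in for a scheduling backend
FAKE_SLOTS = {"2026-03-10": ["10:00", "14:00"]}

def check_availability(date: str, provider: str) -> list[str]:
    return FAKE_SLOTS.get(date, [])

def book_appointment(patient_name: str, date: str, time: str, provider: str) -> dict:
    return {"status": "booked", "date": date, "time": time}

TOOLS = {
    "check_availability": check_availability,
    "book_appointment": book_appointment,
}

def execute_tool_call(call: dict) -> str:
    """Dispatch an LLM tool call and return a JSON string to feed back
    into the conversation as the function result."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    try:
        return json.dumps(fn(**call["arguments"]))
    except TypeError as exc:  # malformed arguments from the model
        return json.dumps({"error": str(exc)})
```

The orchestrator appends the returned JSON to the conversation and asks the LLM for its next turn, which is where "Dr. Smith has openings at 10 AM and 2 PM" gets generated.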

Context Management

Voice conversations generate context fast. A 5-minute call might produce 3,000-5,000 tokens of conversation history, plus system prompts, function definitions, and retrieved knowledge base content. Managing this context window is critical for both quality and cost.

Production agents typically use a sliding window approach: keep the full system prompt and recent turns, summarize older turns, and drop function call/response pairs that are no longer relevant. Some implementations use a smaller, faster model to generate turn summaries that compress the context window without losing critical information.
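A minimal version of that sliding window, with a placeholder stub where the summarization model would go:

```python
def trim_context(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Sliding-window context management: always keep the system prompt,
    keep the last `max_turns` user/assistant messages verbatim, and collapse
    anything older into a summary message.

    A production version would summarize the dropped turns with a small,
    fast model; here the "summary" is a placeholder to show the shape.
    """
    system = [m for m in messages if m["role"] == "system"]
    dialog = [m for m in messages if m["role"] != "system"]
    if len(dialog) <= max_turns:
        return system + dialog
    dropped = dialog[:-max_turns]
    summary = {
        "role": "system",
        "content": f"[Summary of {len(dropped)} earlier messages]",
    }
    return system + [summary] + dialog[-max_turns:]
```

Running this before every LLM call keeps per-turn token counts (and therefore both cost and TTFT) roughly flat, no matter how long the call runs.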

Step 4: Text-to-Speech (TTS)

TTS converts the LLM's text response into audio that sounds like a human speaking. This is where the "voice" in voice agent comes from, and voice quality has a massive impact on user perception. A robotic-sounding agent triggers immediate distrust; a natural-sounding one keeps callers engaged.

Modern neural TTS has crossed the uncanny valley. The best models produce speech that is indistinguishable from human recordings in blind tests. The challenge for voice agents is doing this fast enough for real-time conversation. TTS latency — the time from receiving text to producing the first audio chunk — is typically the second-largest contributor to end-to-end latency, after the LLM.

Leading TTS Providers

ElevenLabs

The quality benchmark. ElevenLabs produces the most natural, expressive voices in the market, with excellent prosody, emotion, and pacing. Their streaming API delivers first audio in 150-300ms. The trade-off is cost ($0.30/1K characters) and occasional latency spikes under load. Custom voice cloning is available with as little as 30 seconds of sample audio.

Best voice quality · 150-300ms TTFB · Higher cost

Rime

Built specifically for real-time voice applications. Rime prioritizes consistency and speed over peak naturalness. Their streaming latency (80-150ms TTFB) is the lowest in the market, making them the go-to choice for latency-obsessed teams. Voice quality is a step below ElevenLabs but still solidly natural — most callers won't notice the difference in a phone conversation.

Lowest latency · 80-150ms TTFB · Lower cost

PlayAI

PlayAI (formerly PlayHT) offers a good balance of quality, speed, and cost. Their streaming API is fast (100-200ms TTFB), and they provide a wide selection of pre-built voices across languages and accents. Their ultra-realistic voice cloning competes with ElevenLabs on quality for custom voices.

Good balance · 100-200ms TTFB · Strong voice cloning

For a detailed comparison, see our ElevenLabs vs Rime comparison. The decision usually comes down to: do you optimize for voice quality (ElevenLabs) or latency (Rime)?

Streaming TTS: The Key to Low Latency

Just like STT, TTS must be streaming for voice agents. Instead of waiting for the entire response to be synthesized (which could take 1-3 seconds for a long sentence), streaming TTS starts producing audio as soon as the first few words are ready. The orchestrator begins playing audio to the caller while the rest of the response is still being generated.

This interleaving of LLM generation and TTS synthesis is what makes sub-1-second response times possible. The LLM streams tokens, the TTS converts them to audio in chunks, and the telephony layer plays them back — all concurrently. Getting this pipeline right is the core engineering challenge of voice AI.
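One common way to do this interleaving is to buffer LLM tokens into clause-sized chunks before handing them to the TTS stream. A simplified sketch (real systems also flush on a timeout, omitted here; the punctuation set and minimum length are tuning choices):

```python
from typing import Iterable, Iterator

SENTENCE_BREAKS = {".", "!", "?", ",", ";", ":"}

def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group an LLM token stream into TTS-sized chunks.

    Flushes at clause/sentence punctuation once a minimum length is
    reached, so the first chunk can be synthesized and played while later
    tokens are still being generated.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        stripped = buf.rstrip()
        if len(buf) >= min_chars and stripped and stripped[-1] in SENTENCE_BREAKS:
            yield buf
            buf = ""
    if buf.strip():
        yield buf  # trailing partial chunk
```

The first yielded chunk is what determines TTS time-to-first-byte, which is why cutting at the first comma rather than the first period is such a common latency optimization.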

Step 5: Orchestration

Orchestration is the glue that connects every other layer. The orchestrator manages the real-time flow of data between telephony, STT, LLM, and TTS — routing audio, managing conversation state, handling interruptions, and recovering from errors. It's the most complex part of the stack and the primary reason voice agent platforms exist.

What the Orchestrator Does

  • Turn-taking management. Deciding when the user has finished speaking and when to start/stop playing the agent's response. This involves coordinating STT endpointing signals, VAD output, and TTS playback state.
  • Interruption handling (barge-in). When the user starts speaking while the agent is talking, the orchestrator must immediately stop TTS playback, discard any queued audio, capture the user's new input, and route it to the LLM with updated context. Poor interruption handling is the most common complaint about voice agents.
  • Audio buffering. Smoothing out the inherently bursty nature of streaming TTS. The orchestrator maintains an audio buffer deep enough to prevent gaps (buffer underruns) but shallow enough not to add unnecessary latency. This is a tuning exercise — different network conditions and TTS providers require different buffer strategies.
  • Function execution. When the LLM outputs a function call, the orchestrator executes it (calling external APIs, querying databases), handles timeouts and errors, and feeds results back to the LLM. During execution, it may play filler audio ("Let me check that for you...") to avoid dead air.
  • Error recovery. When an STT provider goes down, a TTS request times out, or the LLM returns an unexpected response, the orchestrator must gracefully handle the failure — retrying, falling back to alternate providers, or smoothly ending the call.
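The barge-in portion of this can be sketched as a tiny playback controller — illustrative only, not any platform's actual API:

```python
from collections import deque

class PlaybackController:
    """Minimal barge-in handler: while the agent is speaking, a detected
    user utterance cancels playback and flushes all queued TTS audio."""

    def __init__(self) -> None:
        self.queue: deque = deque()
        self.speaking = False

    def enqueue(self, audio_chunk: bytes) -> None:
        """Queue a synthesized audio chunk for playback."""
        self.queue.append(audio_chunk)
        self.speaking = True

    def on_user_speech(self) -> int:
        """Called when VAD fires during playback. Returns chunks discarded."""
        if not self.speaking:
            return 0
        discarded = len(self.queue)
        self.queue.clear()     # drop unplayed TTS audio immediately
        self.speaking = False  # caller then routes the new input to the LLM
        return discarded
```

The hard part in practice is not the flush itself but deciding *whether* to flush — distinguishing a true interruption from a backchannel like "uh huh", which is why some orchestrators consult the LLM before cancelling.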

Orchestration Approaches

Voice Agent Platforms

Vapi, Retell, Bland, and Synthflow handle all orchestration for you. You configure the agent (prompt, voice, tools) and the platform manages the real-time pipeline.

Fastest to production · Less control

Custom Pipelines (LangGraph, Pipecat)

Frameworks like LangGraph and Pipecat provide the building blocks for custom orchestration. You wire together the STT, LLM, and TTS yourself but get helper abstractions for streaming, state management, and turn-taking. Better for unique requirements.

Full control · Months of work

Latency: The Critical Challenge

Latency is what separates a good voice agent from a frustrating one. In human conversation, the average response gap is 200-300ms. People perceive anything over 1 second as an awkward pause, and anything over 2 seconds as the other person not paying attention or being confused. Voice agents need to hit sub-1-second end-to-end latency to feel natural.

The Latency Budget

Every pipeline stage contributes to the total. Here's a realistic budget for an optimized voice agent:

  • STT endpointing + final transcript: 100-200ms
  • LLM time-to-first-token: 150-400ms
  • TTS time-to-first-byte: 80-200ms
  • Network + orchestration overhead: 50-100ms
  • Total end-to-end: 380-900ms
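The budget arithmetic is simple addition, since these stages run sequentially within a single turn:

```python
# Latency budget from the article, in milliseconds (low, high)
BUDGET = {
    "stt_endpoint_and_final": (100, 200),
    "llm_time_to_first_token": (150, 400),
    "tts_time_to_first_byte": (80, 200),
    "network_and_orchestration": (50, 100),
}

def total_budget(budget: dict) -> tuple:
    """Sum per-stage bounds; stages run sequentially per turn, so bounds add."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high
```

Tracking per-stage numbers like this in production telemetry makes it obvious which component to attack first when end-to-end latency drifts above target.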

Optimization Techniques

  • Speculative LLM execution. Start sending STT interim results to the LLM before the user finishes speaking. If the final transcript matches, you've saved the entire STT finalization time. If it doesn't, discard and retry — a small waste for a large latency gain on most turns.
  • Token-level TTS streaming. Send LLM tokens to TTS as they arrive, not sentence by sentence. Modern TTS APIs accept partial text and start synthesizing immediately, producing audio for "Sure, I can" while the LLM is still generating "help you with that."
  • Smaller, faster LLMs. GPT-4o mini and Groq-hosted Llama models achieve 2-5x faster TTFT than full GPT-4o. For most voice agent conversations, the reasoning gap is unnoticeable to callers.
  • Geographic colocation. Run your orchestrator in the same region as your providers. Deepgram and most LLMs are US-East or US-West. A 40ms cross-country round trip adds up when you're doing it four times per turn (STT, LLM, TTS, telephony).
  • Filler responses. For turns that require function execution (database lookups, API calls), play a natural filler ("One moment while I look that up") immediately from a pre-cached audio clip. This buys 2-3 seconds without the user perceiving dead air.
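The decision logic for speculative execution reduces to a transcript comparison: reuse the speculative reply on a match, discard it otherwise. A sketch (the normalization rule is our simplification; real systems may use fuzzier matching):

```python
import re

def normalize(transcript: str) -> str:
    """Case- and punctuation-insensitive comparison key for transcripts."""
    return re.sub(r"[^a-z0-9 ]", "", transcript.lower()).strip()

def resolve_speculative(interim: str, final: str, speculative_reply: str):
    """If the final STT transcript matches the interim we speculated on,
    reuse the already-generated LLM reply; otherwise return None to signal
    that a fresh LLM call is needed on the final transcript."""
    if normalize(interim) == normalize(final):
        return speculative_reply  # saved a full LLM round trip
    return None  # transcript changed; discard and re-run the LLM
```

Since interim and final transcripts agree on most turns, the occasional discarded speculative call is cheap relative to the latency saved.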

The 800ms rule: In our testing across thousands of production calls, 800ms is the threshold where callers stop noticing latency. Under 600ms feels instant. Between 800ms and 1.2s feels slightly slow but acceptable. Over 1.5s, callers start to disengage or talk over the agent. If you're only optimizing one thing, optimize for total end-to-end latency below 800ms.

Build vs Buy

The biggest architectural decision in voice AI: do you use a platform that handles orchestration for you, or do you build the pipeline yourself? Both are valid — the right answer depends on your team, your use case, and how much of the stack you need to control.

Use a Platform When...

  • You need to ship in days or weeks, not months
  • Your use case is standard (receptionist, scheduler, support)
  • Your team doesn't have real-time audio experience
  • Call volume is under 100K minutes/month
  • You want to iterate on prompts, not infrastructure

Platforms: Vapi, Retell, Bland, Synthflow

Build Custom When...

  • You need non-standard conversation patterns
  • You're running 500K+ minutes/month (cost savings matter)
  • Strict data residency or compliance requirements
  • You need to control every millisecond of latency
  • Voice AI is your core product, not a feature

Frameworks: LangGraph, Pipecat, LiveKit Agents, custom WebSocket

Our recommendation: Start with a platform, even if you plan to build custom eventually. The fastest way to learn what your voice agent needs is to ship one and iterate with real callers. Platforms like Vapi let you go from zero to production calls in days. Once you've handled 10,000 real conversations, you'll know exactly which parts of the stack need custom work — and which don't.

The Future of Voice AI

The five-stage pipeline (telephony, STT, LLM, TTS, telephony) has been the dominant architecture since 2023, but it's about to change. Several trends are converging that will reshape how voice agents work.

Speech-to-Speech Models

OpenAI's GPT-4o and Google's Gemini can process audio natively — no separate STT or TTS needed. The model takes in raw audio and produces raw audio, collapsing the five-stage pipeline into three stages (telephony, model, telephony). This eliminates the information loss of text serialization and dramatically reduces latency. As these models mature and become more accessible, the STT+LLM+TTS pipeline will increasingly become the "legacy" approach.

Emotion and Tone Awareness

Current voice agents are tone-deaf: they process text transcripts and miss all the emotional information in the caller's voice. An angry customer and a happy one produce the same transcript. Speech-to-speech models can perceive and generate emotion, enabling agents that match the caller's energy, calm frustrated callers with softer tones, and express genuine-sounding empathy.

Multimodal Agents

Voice won't stay voice-only. Agents will simultaneously handle phone calls while sending text messages, emails, calendar invites, and visual confirmations. A support agent that talks through a diagnosis and simultaneously sends a screen-share link, a knowledge base article, and a follow-up email is the natural evolution. The orchestration challenge scales accordingly.

Cost Curve

The all-in cost of a voice agent minute has dropped from $0.50+ in early 2024 to $0.08-0.15 in 2026. Open-source STT (faster-whisper), LLMs (Llama, Mistral), and TTS (Coqui, Piper) are approaching commercial-quality at a fraction of the cost. By 2027, running a voice agent will cost less than $0.03/minute, making it economical for use cases that were previously impractical — like handling every single inbound call for a small business.

Frequently Asked Questions

How long does it take to build a voice agent from scratch?
A basic voice agent that can handle a single use case (e.g., appointment scheduling) can be built in 1-2 weeks using a platform like Vapi or Retell. Building from scratch with raw APIs (Twilio + Deepgram + OpenAI + ElevenLabs + custom orchestration) takes 4-8 weeks for an experienced team. Getting to production quality with error handling, interruption support, and fallbacks adds another 2-4 weeks on top of that.
What is the minimum latency achievable for a voice agent?
The theoretical minimum with current technology is around 500-600ms end-to-end (from the moment the user stops speaking to when the agent starts responding). This requires streaming STT (~100-150ms), a fast LLM like GPT-4o mini or Claude 3.5 Haiku with streaming (~200-300ms to first token), and streaming TTS (~100-150ms). In practice, most production agents achieve 700ms-1.2s, which still feels conversational.
Can voice agents handle multiple languages?
Yes, but with trade-offs. Modern STT engines like Deepgram and Whisper support dozens of languages, and TTS providers like ElevenLabs offer multilingual voice synthesis. The challenge is that LLM performance varies by language — English prompts and function calling work most reliably. Some teams run language detection on the first utterance and route to language-specific agent configurations for better quality.
How much does it cost to run a voice agent per minute?
A typical voice agent costs $0.08-0.20 per minute of conversation, depending on your provider choices. The breakdown: STT ($0.005-0.01/min), LLM inference ($0.01-0.08/min depending on model), TTS ($0.01-0.04/min), telephony ($0.01-0.02/min), and platform fees ($0-0.05/min). Using GPT-4o mini instead of GPT-4o can cut LLM costs by 10x, which is the biggest lever for most teams.
What happens when the voice agent doesn't understand the user?
Well-built agents use a confidence threshold on STT transcription. When confidence is low, they ask for clarification ("I didn't catch that — could you repeat?"). More sophisticated implementations use the LLM to detect ambiguity and ask targeted follow-up questions. Critical production agents also implement fallback-to-human transfers when the conversation goes off-script or the user explicitly asks for a person.
Do I need to use a voice agent platform, or can I build my own?
Both are valid. Platforms like Vapi and Retell handle orchestration, turn-taking, interruption detection, and provider integration for you — saving months of engineering. Building your own stack gives you more control and lower per-minute costs at scale, but you'll spend significant time on edge cases like audio buffering, silence detection, and error recovery. Most teams should start with a platform and only go custom when they hit its limitations.
How do voice agents handle interruptions?
Interruption handling (also called "barge-in") is one of the hardest problems in voice AI. The agent must detect that the user has started speaking while the agent is talking, stop TTS playback immediately, process the new user input, and generate a contextually appropriate response. Good platforms use voice activity detection (VAD) with tunable sensitivity, and some use the LLM itself to decide whether an utterance is a true interruption or just a backchannel ("uh huh", "yeah").
What is the difference between a voice agent and an IVR?
Traditional IVR (Interactive Voice Response) systems use pre-recorded prompts and DTMF (keypress) or basic speech recognition to route calls through a fixed decision tree. Voice agents use real-time AI to have freeform conversations — they understand natural language, maintain context across turns, and can take actions (book appointments, look up records, transfer calls) based on the conversation. The user experience difference is night and day.

Ready to Build?

Now that you understand the architecture, explore our platform reviews and comparisons to choose your stack.