How I Built a Custom AI Voice Agent by Combining Deepgram, OpenAI, and ElevenLabs

Sascha · 14 Oct 2025

Tired of voice assistants that take forever to reply, making the chat feel slow and robotic? The trick to building a fast, natural AI voice agent is simple: don’t depend on one provider for everything.

In this guide, I’ll walk you through how I built a voice agent using a cascading architecture, where each part of the STT → LLM → TTS pipeline works together to create smooth, human-like conversations.

Here’s what I used:

Deepgram for Speech-to-Text (STT) - converting voice into text
OpenAI for the LLM (Large Language Model) - understanding and generating replies
ElevenLabs for Text-to-Speech (TTS) - turning responses back into realistic voice

Each component plays its part in the pipeline. When connected in a cascading flow, they create a fast, flexible, and natural-sounding voice assistant.

By the end of this post, you’ll understand how a cascading architecture voice agent works, what tools to use, and how to connect the STT, LLM, and TTS components into one smooth AI experience.

Overview: How the Cascading System Works

The output of one step immediately drops down (cascades) to the next. For the agent to feel natural, this entire process must happen almost instantly (ideally under 1.5 seconds).
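To make the cascade concrete, here is a minimal sketch in plain Python. It is not the VideoSDK API: the three stages (fake_stt, fake_llm, fake_tts) are hypothetical stand-ins written as async generators. The only point is that each stage hands its output to the next as soon as it is produced, so speech synthesis can start before the model has finished its reply.

```python
import asyncio
from typing import AsyncIterator


async def fake_stt(log: list[str]) -> AsyncIterator[str]:
    """Hypothetical STT stand-in: emits one finished transcript."""
    await asyncio.sleep(0)  # pretend to wait for audio
    log.append("stt:transcript")
    yield "what is the weather"


async def fake_llm(
    transcripts: AsyncIterator[str], log: list[str]
) -> AsyncIterator[str]:
    """Hypothetical LLM stand-in: streams a canned reply token by token."""
    async for _transcript in transcripts:
        for token in ("It", "is", "sunny"):
            await asyncio.sleep(0)  # pretend to wait for the model
            log.append(f"llm:{token}")
            yield token


async def fake_tts(
    tokens: AsyncIterator[str], log: list[str]
) -> AsyncIterator[bytes]:
    """Hypothetical TTS stand-in: turns each token into fake audio bytes."""
    async for token in tokens:
        log.append(f"tts:{token}")
        yield token.encode()


async def run_cascade() -> tuple[list[bytes], list[str]]:
    """Chain STT -> LLM -> TTS and collect the audio plus an event log."""
    log: list[str] = []
    stages = fake_tts(fake_llm(fake_stt(log), log), log)
    audio = [chunk async for chunk in stages]
    return audio, log


audio, log = asyncio.run(run_cascade())
print(log)
```

In the printed log, tts:It appears before llm:sunny: the first word reaches the TTS stage while the reply is still being generated. That overlap is what keeps the total response time low.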


1. Speech-to-Text (STT)

Converts speech to text as you speak. Use streaming STT with turn detection for speed and accuracy.

2. Large Language Model (LLM)

Generates responses in real-time, streaming words to TTS immediately.
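One common way to stream LLM output into TTS is to buffer tokens until a sentence boundary and then hand the finished sentence to the synthesizer. The sketch below illustrates that idea only; chunk_sentences is a hypothetical helper, and the VideoSDK pipeline may handle this step differently.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences that can be sent to TTS early."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as the buffered text ends with sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()


stream = ["Hi", " there", ".", " How", " are", " you", "?", " Bye"]
print(list(chunk_sentences(stream)))
```

The trade-off is between latency and prosody: smaller chunks start playback sooner, while whole sentences give the TTS model more context for natural intonation. This simple rule also splits on any token that ends with a period, such as an abbreviation, so a real implementation would need a more careful boundary check.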

3. Text-to-Speech (TTS)

Turns text into a human-like voice instantly, with fast playback and natural tone.

4. Turn Detection

For turn detection, we will use VideoSDK’s specialized Namo Turn Detector model. This component determines the moment a user has finished speaking, so the agent neither interrupts the user nor pauses unnecessarily.
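The decision a turn detector has to make can be sketched as follows. This is an illustration under stated assumptions, not how the Namo model works internally: TurnState, user_finished, and the silence limits are all hypothetical, and only the 0.7 threshold is taken from the agent code in this guide.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    silence_ms: int         # how long the VAD has reported no speech
    eot_probability: float  # hypothetical score that the utterance is complete


def user_finished(
    state: TurnState,
    threshold: float = 0.7,      # same value as the turn detector threshold in this guide
    min_silence_ms: int = 200,   # assumed: ignore very short gaps between words
    max_silence_ms: int = 1500,  # assumed: after this much silence, always respond
) -> bool:
    """Decide whether the agent should start replying."""
    if state.silence_ms < min_silence_ms:
        return False  # still speaking, or only a brief pause
    if state.silence_ms >= max_silence_ms:
        return True  # long silence wins regardless of the score
    return state.eot_probability >= threshold


print(user_finished(TurnState(silence_ms=400, eot_probability=0.9)))
print(user_finished(TurnState(silence_ms=400, eot_probability=0.3)))
```

With these assumed limits, the first call reports that the user is done and the second does not: the same 400 ms pause is treated differently depending on whether the sentence sounds complete. A silence timer alone cannot make that distinction.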


How to Build the Cascading Pipeline

Installation Prerequisites

Before you begin, create a virtual environment.

For Windows:

python -m venv venv
venv\Scripts\activate

For macOS:

python3.12 -m venv venv
source venv/bin/activate

Install all dependencies

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero]"

Want to use a different provider? Other STT, LLM, and TTS plugins are available.

Plugin Installation

Install additional plugins as needed:

# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram

# Install namo turn detection model
pip install "videosdk-plugins-turn-detector"

Environment setup

DEEPGRAM_API_KEY = "Your Deepgram API Key"
OPENAI_API_KEY = "Your OpenAI API Key"
ELEVENLABS_API_KEY = "Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN = "VideoSDK Auth token"
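A missing key usually surfaces only once the agent tries to connect, so it can help to check the environment up front. The sketch below assumes the four values above are exposed as environment variables; missing_keys is a hypothetical helper, not part of the VideoSDK SDK.

```python
import os
from typing import Mapping, Optional

REQUIRED_KEYS = (
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
    "VIDEOSDK_AUTH_TOKEN",
)


def missing_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


missing = missing_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required keys are set.")
```

Running this before starting the agent turns a confusing connection failure into a clear message that names the variable to fix.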

API keys: get your API keys from Deepgram, OpenAI, and ElevenLabs, plus an auth token from VideoSDK.

Creating our AI Voice Agent

Create a Python file for the agent:

import asyncio

from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import (
    NamoTurnDetectorV1,
    pre_download_namo_turn_v1_model,
)

# Pre-download the English model to avoid delays
pre_download_namo_turn_v1_model(language="en")

# Initialize the Turn Detector for English
turn_detector = NamoTurnDetectorV1(
    language="en",
    threshold=0.7,
)


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks."
        )

    # Greet the user when the agent joins the session
    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    # Say goodbye when the agent leaves the session
    async def on_exit(self):
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=turn_detector,  # Add the Turn Detector to the cascading pipeline
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True,
    )

    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

The VideoSDK AI Agent SDK has everything you need to build your first AI agent fast.

You've now got the blueprint for building a voice agent that doesn't just talk, but responds instantly. By demanding streaming from your STT, LLM, and TTS providers and carefully managing the flow with the Turn Detection logic, you bypass the common lag issues that plague most voice assistants. This best-of-breed, cascading approach puts you in control, allowing you to future-proof your agent by swapping out a component (like upgrading your LLM) without rebuilding the entire system.
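The swap-a-component idea can be sketched in a few lines. The names below (LLM, EchoLLM, ShoutLLM, Pipeline) are hypothetical and unrelated to the VideoSDK classes; the sketch only shows why a pipeline that depends on an interface, rather than on one provider, lets you change a stage without touching the rest.

```python
from dataclasses import dataclass
from typing import Protocol


class LLM(Protocol):
    """Anything with a reply() method can serve as the LLM stage."""

    def reply(self, prompt: str) -> str: ...


class EchoLLM:
    def reply(self, prompt: str) -> str:
        return f"You said: {prompt}"


class ShoutLLM:
    def reply(self, prompt: str) -> str:
        return prompt.upper() + "!"


@dataclass
class Pipeline:
    llm: LLM  # the only thing the pipeline knows about the model

    def respond(self, transcript: str) -> str:
        return self.llm.reply(transcript)


print(Pipeline(llm=EchoLLM()).respond("hi"))
print(Pipeline(llm=ShoutLLM()).respond("hi"))
```

Only the constructor argument changes between the two calls. The same holds for the STT and TTS stages when each sits behind its own interface.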

We’d love to hear from you!

Did you manage to set up your first AI voice agent in Python?
What challenges did you face while integrating the cascading pipeline?
Are you more interested in the cascading pipeline or the realtime pipeline?
How do you see AI voice assistants transforming customer experience in your business?

Share your thoughts, roadblocks, or success stories in the comments. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
