
Replacing Vapi with a $35/month Self-Hosted Voice AI

Engineering 10 min read

I was paying $0.09/minute for Vapi to handle phone calls with my AI agent.

At 30 minutes/day, that’s $81/month for one feature. For a bootstrapped MBA student with a family of four in Provo, Utah, that math doesn’t work.

So I replaced it. The self-hosted version costs $0.035/minute — 61% cheaper — and it runs 24/7 on the same Mac Mini that runs everything else.

Here’s the exact stack and how it works.


The Problem with Vapi (and Similar Platforms)

Vapi is genuinely impressive. Fast setup, good documentation, reliable infrastructure. But the per-minute pricing stacks up like this:

  • Platform fee: ~$0.03/min
  • STT (Deepgram via Vapi): ~$0.02/min
  • TTS (ElevenLabs or Cartesia via Vapi): ~$0.025/min
  • LLM (Claude via Vapi): ~$0.015/min
  • Twilio passthrough: ~$0.005/min

Total: ~$0.09/min

When you pay Vapi, you’re paying for the convenience of not wiring this up yourself. That convenience is real — but at scale, or on a budget, it becomes expensive fast.


The Self-Hosted Stack

Phone call → Twilio → Cloudflare Tunnel → Pipecat pipeline
Deepgram (STT) → Claude (LLM+tools) → Cartesia (TTS)
Back to Twilio → Caller's phone
| Component | Service | Cost | Why |
| --- | --- | --- | --- |
| Telephony | Twilio | $0.005/min | Standard, reliable, well-documented |
| STT | Deepgram Nova-2 | ~$0.008/min | Best accuracy/cost ratio in 2026 |
| LLM | Claude Sonnet 4.6 | ~$0.012/min | Full tool use, streaming, low latency |
| TTS | Cartesia Sonic-2 | ~$0.007/min | Lowest TTFA (<100ms), natural voice |
| Hosting | Mac Mini (already running) | $0 | Already owned |
| Tunnel | Cloudflare (free) | $0 | Exposes local server to Twilio |

Total: ~$0.032–0.035/min (vs. ~$0.09/min on Vapi).
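As a quick sanity check, the metered components from the table sum to the low end of that range (hosting and the tunnel are $0; the $0.035 upper bound presumably reflects variance in LLM token usage per minute of conversation):

```python
# Per-minute cost of the metered components (USD), from the table above.
costs = {
    "twilio": 0.005,
    "deepgram_stt": 0.008,
    "claude_llm": 0.012,
    "cartesia_tts": 0.007,
}

total = sum(costs.values())
print(f"${total:.3f}/min")  # low end of the ~$0.032–0.035 range
```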


Why Pipecat

Pipecat is the open-source framework I chose to wire everything together. It handles:

  • Transport: Bidirectional WebSocket with Twilio’s Media Streams API
  • VAD: Voice Activity Detection using Silero (knows when you stop talking)
  • Pipeline: Connects STT → LLM → TTS with proper backpressure handling
  • Interruptions: User can speak over Claude’s response (barge-in)

Without Pipecat, you’d be manually managing WebSocket frames, audio chunks, and async coordination between four different APIs. That’s roughly 500 lines of infrastructure Pipecat gives you for free.


The Latency Budget

The target is <800ms TTFA (Time to First Audio) — the time between when you stop speaking and when Claude starts responding.

VAD silence detection: 300ms (configured in Silero)
Deepgram STT: 150ms (streaming, near-real-time)
Claude TTFB: 280ms (first token from API)
Cartesia TTS TTFA: 90ms (first audio chunk)
Network round-trip: 30ms
─────────────────────────────
Total: 850ms ← target <800ms

Getting under 800ms required:

  1. Reducing VAD stop_secs from 0.8 → 0.5 (detect silence faster)
  2. Pre-loading call context before the pipeline starts (saves 100-200ms on first turn)
  3. Playing buffer phrases immediately (“One moment…”) while tools execute, so there’s no dead air
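Step 1, as a hedged sketch. The `pipecat.vad.silero` import path matches the pipeline code in this post; in recent Pipecat releases the analyzer lives under `pipecat.audio.vad.silero` and takes a `VADParams` object, so adjust to your installed version:

```python
# Sketch: tightening Silero VAD silence detection (paths vary by version).
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# stop_secs is how long Silero waits in silence before declaring the
# user's turn over. Dropping it from 0.8 to 0.5 shaves ~300ms of
# perceived latency, at the cost of occasionally clipping slow speakers.
vad = SileroVADAnalyzer(params=VADParams(stop_secs=0.5))
```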

# pipeline.py — simplified
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.vad.silero import SileroVADAnalyzer

async def run_call(websocket, caller):
    # Load system context before the pipeline starts
    context = await load_call_context(caller)
    system_prompt = build_system_prompt(context)

    pipeline = Pipeline([
        transport.input(),        # Audio in from Twilio
        DeepgramSTTService(...),  # Audio → transcript
        user_aggregator,          # Accumulate user turns
        AnthropicLLMService(      # Transcript → response (with tools)
            system=system_prompt,
            tools=VOICE_TOOLS,
        ),
        CartesiaTTSService(...),  # Response → audio
        transport.output(),       # Audio out to Twilio
        assistant_aggregator,     # Accumulate assistant turns
    ])

    await PipelineRunner().run(PipelineTask(pipeline))

The beauty of Pipecat: the pipeline is just a list. Data flows left to right. The framework handles async coordination, backpressure, and interruption.


Tool Access

Claude on voice has access to:

VOICE_TOOLS = [
    "get_calendar_events",   # Google Calendar via gog CLI
    "create_calendar_event", # Add new events
    "add_reminder",          # Apple Reminders via remindctl
    "get_weather",           # wttr.in
    "search_web",            # Web search
    "send_imessage",         # iMessage via imsg CLI
    "append_to_notes",       # Daily notes / Obsidian
    "read_emails",           # Gmail via gog CLI
    "run_skill",             # Any OpenClaw skill as escape hatch
]

When Claude calls a tool, a buffer phrase fires immediately (“Let me check your calendar…”) so there’s no dead air while the tool executes. Then Claude responds with the result.
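A minimal sketch of that buffer-phrase pattern. The tool names come from `VOICE_TOOLS` above, but the phrase table and `buffer_phrase_for` helper are hypothetical, not Pipecat API:

```python
# Hypothetical buffer-phrase lookup: fires the instant the LLM emits a
# function call, so the caller never hears dead air while a tool runs.
# Phrases are illustrative.
BUFFER_PHRASES = {
    "get_calendar_events": "Let me check your calendar…",
    "search_web": "Give me a second to look that up…",
    "read_emails": "One moment while I check your email…",
}

def buffer_phrase_for(tool_name: str) -> str:
    # Fall back to a generic filler for tools without a custom phrase.
    return BUFFER_PHRASES.get(tool_name, "One moment…")
```

In the pipeline, this phrase would be pushed straight to TTS when the function call starts, and Claude's real answer follows once the tool result comes back.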


Memory Across Calls

Every call is transcribed and saved to ~/clawd/memory/voice-calls/. The index tracks:

{
  "calls": [
    {
      "date": "2026-03-28T10:30:00",
      "duration_secs": 342,
      "summary": "JD asked about schedule for next week...",
      "turns": 18
    }
  ]
}

At the start of every call, the last 3 summaries are loaded into the system prompt. Claude knows what we talked about. It’s not starting fresh every time.
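A sketch of that context load, assuming the index format above; the function name and output format are illustrative, not the actual implementation:

```python
import json
from pathlib import Path

def load_recent_summaries(index_path: Path, n: int = 3) -> str:
    # Pull the last n call summaries from the index and format them
    # as a block for injection into the system prompt.
    index = json.loads(index_path.read_text())
    recent = index["calls"][-n:]
    lines = [f"- {c['date'][:10]}: {c['summary']}" for c in recent]
    return "Recent calls:\n" + "\n".join(lines)
```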


The Cost Math

At 30 min/day usage:

| Platform | Cost/min | Daily | Monthly |
| --- | --- | --- | --- |
| Vapi | $0.090 | $2.70 | $81.00 |
| Self-hosted | $0.035 | $1.05 | $31.50 |
| Savings | $0.055 | $1.65 | $49.50 |

The build time was ~10 days. At $49.50/month savings, ROI breakeven is ~3 months.


What You Need to Build This

  1. Twilio account — free trial with $15 credit, then ~$1/month for a phone number
  2. Deepgram account — $200 free credit, then pay-as-you-go
  3. Cartesia account — 20K free credits, then pay-as-you-go
  4. Anthropic API key — you probably already have one
  5. Cloudflare — free tunnel for exposing local server
  6. A server — Mac Mini, Raspberry Pi, cheap VPS, whatever you have running 24/7

The Pipecat framework is open source. The code is yours.


Lessons Learned

Start with a quick Cloudflare tunnel. Don’t wait to configure a permanent one. cloudflared tunnel --url http://localhost:8765 gives you instant HTTPS — it takes 30 seconds.

Build the transcript system first. You’ll want to review calls to debug latency and tool execution. If you don’t have transcripts, you’re flying blind.

Test interruptions aggressively. The first version I built didn’t handle barge-in correctly. Users would speak, Claude wouldn’t stop. That’s a dealbreaker for phone UX. Pipecat handles it natively — just make sure interruptions_enabled=True.
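For reference, a hedged sketch of wiring that up — the exact flag name varies by Pipecat version (current releases spell it `allow_interruptions` on `PipelineParams`):

```python
# Sketch: enabling barge-in on the pipeline task. In recent Pipecat
# versions, allow_interruptions lets user speech cancel in-flight TTS.
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(allow_interruptions=True),
)
```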


The full build plan is at ~/clawd/output/VOICECLAW-V2-BUILDPLAN.md (private).

If you want to build this for your own AI system and want help with the architecture, reach out at jddavenport.com.


Next: The 54-Agent Stack: What I Learned — lessons from building and running 54 AI agents in production.