
Replacing Vapi with a $35/month Self-Hosted Voice AI

Engineering 10 min read

I was paying $0.09/minute for Vapi to handle phone calls with my AI agent.

At 30 minutes/day, that’s $81/month for one feature. For a bootstrapped MBA student with a family of four in Provo, Utah, that math doesn’t work.

So I replaced it. The self-hosted version costs $0.035/minute — 61% cheaper — and it runs 24/7 on the same Mac Mini that runs everything else.

Here’s the exact stack and how it works.


The Problem with Vapi (and Similar Platforms)

Vapi is genuinely impressive. Fast setup, good documentation, reliable infrastructure. But the per-minute pricing stacks up like this:

  • Platform fee: ~$0.03/min
  • STT (Deepgram via Vapi): ~$0.02/min
  • TTS (ElevenLabs or Cartesia via Vapi): ~$0.025/min
  • LLM (Claude via Vapi): ~$0.015/min
  • Twilio passthrough: ~$0.005/min

Total: ~$0.09/min

When you pay Vapi, you’re paying for the convenience of not wiring this up yourself. That convenience is real — but at scale, or on a budget, it becomes expensive fast.


The Self-Hosted Stack

Phone call → Twilio → Cloudflare Tunnel → Pipecat pipeline
Deepgram (STT) → Claude (LLM+tools) → Cartesia (TTS)
Back to Twilio → Caller's phone
| Component | Service | Cost | Why |
| --- | --- | --- | --- |
| Telephony | Twilio | $0.005/min | Standard, reliable, well-documented |
| STT | Deepgram Nova-2 | ~$0.008/min | Best accuracy/cost ratio in 2026 |
| LLM | Claude Sonnet 4.6 | ~$0.012/min | Full tool use, streaming, low latency |
| TTS | Cartesia Sonic-2 | ~$0.007/min | Lowest TTFA (<100ms), natural voice |
| Hosting | Mac Mini (already running) | $0 | Already owned |
| Tunnel | Cloudflare (free) | $0 | Exposes local server to Twilio |

Total: ~$0.032–0.035/min (vs. ~$0.09/min on Vapi).
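As a quick sanity check, the metered components from the table sum to the low end of that range (hosting and the tunnel are $0; the $0.035 upper bound presumably reflects variance in LLM token usage per minute of conversation):

```python
# Per-minute cost of the metered components (USD), from the table above.
costs = {
    "twilio": 0.005,
    "deepgram_stt": 0.008,
    "claude_llm": 0.012,
    "cartesia_tts": 0.007,
}

total = sum(costs.values())
print(f"${total:.3f}/min")  # low end of the ~$0.032–0.035 range
```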


Why Pipecat

Pipecat is the open-source framework I chose to wire everything together. It handles:

  • Transport: Bidirectional WebSocket with Twilio’s Media Streams API
  • VAD: Voice Activity Detection using Silero (knows when you stop talking)
  • Pipeline: Connects STT → LLM → TTS with proper backpressure handling
  • Interruptions: User can speak over Claude’s response (barge-in)

Without Pipecat, you’d be manually managing WebSocket frames, audio chunks, and async coordination between four different APIs. That’s roughly 500 lines of infrastructure Pipecat gives you for free.


The Latency Budget

The target is <800ms TTFA (Time to First Audio) — the time between when you stop speaking and when Claude starts responding.

VAD silence detection: 300ms (configured in Silero)
Deepgram STT: 150ms (streaming, near-real-time)
Claude TTFB: 280ms (first token from API)
Cartesia TTS TTFA: 90ms (first audio chunk)
Network round-trip: 30ms
─────────────────────────────
Total: 850ms ← target <800ms

Getting under 800ms required:

  1. Reducing VAD stop_secs from 0.8 → 0.5 (detect silence faster)
  2. Pre-loading call context before the pipeline starts (saves 100-200ms on first turn)
  3. Playing buffer phrases immediately (“One moment…”) while tools execute, so there’s no dead air
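Step 1, as a hedged sketch. The `pipecat.vad.silero` import path matches the pipeline code in this post; in recent Pipecat releases the analyzer lives under `pipecat.audio.vad.silero` and takes a `VADParams` object, so adjust to your installed version:

```python
# Sketch: tightening Silero VAD silence detection (paths vary by version).
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

# stop_secs is how long Silero waits in silence before declaring the
# user's turn over. Dropping it from 0.8 to 0.5 shaves ~300ms of
# perceived latency, at the cost of occasionally clipping slow speakers.
vad = SileroVADAnalyzer(params=VADParams(stop_secs=0.5))
```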

# pipeline.py — simplified
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.vad.silero import SileroVADAnalyzer

async def run_call(websocket, caller):
    # Load system context before the pipeline starts
    context = await load_call_context(caller)
    system_prompt = build_system_prompt(context)

    pipeline = Pipeline([
        transport.input(),        # Audio in from Twilio
        DeepgramSTTService(...),  # Audio → transcript
        user_aggregator,          # Accumulate user turns
        AnthropicLLMService(      # Transcript → response (with tools)
            system=system_prompt,
            tools=VOICE_TOOLS,
        ),
        CartesiaTTSService(...),  # Response → audio
        transport.output(),       # Audio out to Twilio
        assistant_aggregator,     # Accumulate assistant turns
    ])

    await PipelineRunner().run(PipelineTask(pipeline))

The beauty of Pipecat: the pipeline is just a list. Data flows left to right. The framework handles async coordination, backpressure, and interruption.


Tool Access

Claude on voice has access to:

VOICE_TOOLS = [
    "get_calendar_events",   # Google Calendar via gog CLI
    "create_calendar_event", # Add new events
    "add_reminder",          # Apple Reminders via remindctl
    "get_weather",           # wttr.in
    "search_web",            # Web search
    "send_imessage",         # iMessage via imsg CLI
    "append_to_notes",       # Daily notes / Obsidian
    "read_emails",           # Gmail via gog CLI
    "run_skill",             # Any OpenClaw skill as escape hatch
]

When Claude calls a tool, a buffer phrase fires immediately (“Let me check your calendar…”) so there’s no dead air while the tool executes. Then Claude responds with the result.
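A minimal sketch of that buffer-phrase pattern. The tool names come from `VOICE_TOOLS` above, but the phrase table and `buffer_phrase_for` helper are hypothetical, not Pipecat API:

```python
# Hypothetical buffer-phrase lookup: fires the instant the LLM emits a
# function call, so the caller never hears dead air while a tool runs.
# Phrases are illustrative.
BUFFER_PHRASES = {
    "get_calendar_events": "Let me check your calendar…",
    "search_web": "Give me a second to look that up…",
    "read_emails": "One moment while I check your email…",
}

def buffer_phrase_for(tool_name: str) -> str:
    # Fall back to a generic filler for tools without a custom phrase.
    return BUFFER_PHRASES.get(tool_name, "One moment…")
```

In the pipeline, this phrase would be pushed straight to TTS when the function call starts, and Claude's real answer follows once the tool result comes back.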


Memory Across Calls

Every call is transcribed and saved to ~/clawd/memory/voice-calls/. The index tracks:

{
  "calls": [
    {
      "date": "2026-03-28T10:30:00",
      "duration_secs": 342,
      "summary": "JD asked about schedule for next week...",
      "turns": 18
    }
  ]
}

At the start of every call, the last 3 summaries are loaded into the system prompt. Claude knows what we talked about. It’s not starting fresh every time.
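A sketch of that context load, assuming the index format above; the function name and output format are illustrative, not the actual implementation:

```python
import json
from pathlib import Path

def load_recent_summaries(index_path: Path, n: int = 3) -> str:
    # Pull the last n call summaries from the index and format them
    # as a block for injection into the system prompt.
    index = json.loads(index_path.read_text())
    recent = index["calls"][-n:]
    lines = [f"- {c['date'][:10]}: {c['summary']}" for c in recent]
    return "Recent calls:\n" + "\n".join(lines)
```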


The Cost Math

At 30 min/day usage:

| Platform | Cost/min | Daily | Monthly |
| --- | --- | --- | --- |
| Vapi | $0.090 | $2.70 | $81.00 |
| Self-hosted | $0.035 | $1.05 | $31.50 |
| Savings | $0.055 | $1.65 | $49.50 |

The build time was ~10 days. At $49.50/month savings, ROI breakeven is ~3 months.


What You Need to Build This

  1. Twilio account — free trial with $15 credit, then ~$1/month for a phone number
  2. Deepgram account — $200 free credit, then pay-as-you-go
  3. Cartesia account — 20K free credits, then pay-as-you-go
  4. Anthropic API key — you probably already have one
  5. Cloudflare — free tunnel for exposing local server
  6. A server — Mac Mini, Raspberry Pi, cheap VPS, whatever you have running 24/7

The Pipecat framework is open source. The code is yours.


Lessons Learned

Start with a quick Cloudflare tunnel. Don’t wait to configure a permanent one. cloudflared tunnel --url http://localhost:8765 gives you instant HTTPS — it takes 30 seconds.

Build the transcript system first. You’ll want to review calls to debug latency and tool execution. If you don’t have transcripts, you’re flying blind.

Test interruptions aggressively. The first version I built didn’t handle barge-in correctly. Users would speak, Claude wouldn’t stop. That’s a dealbreaker for phone UX. Pipecat handles it natively — just make sure interruptions_enabled=True.
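For reference, a hedged sketch of wiring that up — the exact flag name varies by Pipecat version (current releases spell it `allow_interruptions` on `PipelineParams`):

```python
# Sketch: enabling barge-in on the pipeline task. In recent Pipecat
# versions, allow_interruptions lets user speech cancel in-flight TTS.
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(allow_interruptions=True),
)
```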


The full build plan is at ~/clawd/output/VOICECLAW-V2-BUILDPLAN.md (private).

If you want to build this for your own AI system and want help with the architecture, reach out at jddavenport.com.


Next: The 54-Agent Stack: What I Learned — lessons from building and running 54 AI agents in production.