Voice In/Out

Tier 2 · Building · 7 min read

Before this, read:

Hooks: automating behavior the model can’t — the Stop hook mechanism this feature is built on
Telegram as a control plane — voice lives on top of the Telegram reply layer

A long Telegram reply is hard to read while you’re walking. The voice system solves this by automatically converting summary-class replies into audio and sending the audio to JD’s chat — without Claude having to remember to do it.

The key word is automatically. For weeks, JD had to prompt “and the voice?” after every summary reply, because Claude would follow the protocol early in a session and then forget it as the context window filled. The fix was moving the responsibility from Claude’s memory to a Stop hook.

How auto-voice works

After every Claude turn that sent a Telegram reply, the Stop hook (~/agent-system/scripts/telegram-stop-hook.py) runs maybe_send_voice(). The function checks five gates in order:

1. Is voice mode on or set to once? (Configured in ~/agent-system/state/voice-mode.json.) If off, stop.
2. Did Claude already explicitly call send_voice this turn? If yes, stop: Claude wins, and the hook must not double-fire.
3. Is the reply summary-class? If not, stop. A reply qualifies when any one of these holds:
   - Length ≥ 300 characters
   - Contains a markdown header (# or ##)
   - Contains 2+ bullet or numbered list items
4. Does the reply contain the literal token [no-voice]? If yes, stop. This is the escape hatch.
5. Was voice already sent in this session for this reply? (Dedup ledger at ~/agent-system/state/voice-sent.json.) If yes, stop.
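The gate sequence can be sketched in a few lines of Python. This is a minimal illustration, not the code in telegram-stop-hook.py: the names TurnInfo, is_summary_class, and should_auto_voice are hypothetical, the regular expressions are one plausible reading of "markdown header" and "list item", and the real hook reads the mode and the dedup ledger from the JSON state files rather than from an in-memory object.

```python
import re
from dataclasses import dataclass, field

NO_VOICE_TOKEN = "[no-voice]"
MIN_SUMMARY_CHARS = 300

# Assumed pattern: a line starting with "# " or "## " followed by text.
HEADER_RE = re.compile(r"^#{1,2}[ \t]+\S", re.MULTILINE)
# Assumed pattern: a line starting with "- ", "* ", or "1. " followed by text.
LIST_ITEM_RE = re.compile(r"^[ \t]*(?:[-*]|\d+\.)[ \t]+\S", re.MULTILINE)


def is_summary_class(text: str) -> bool:
    """Gate 3: long enough, OR has a header, OR has two or more list items."""
    if len(text) >= MIN_SUMMARY_CHARS:
        return True
    if HEADER_RE.search(text):
        return True
    return len(LIST_ITEM_RE.findall(text)) >= 2


@dataclass
class TurnInfo:
    """Hypothetical snapshot of what the hook knows about the finished turn."""
    reply_text: str
    reply_id: str
    voice_mode: str = "on"  # "on", "off", or "once"
    claude_called_send_voice: bool = False
    already_sent_ids: set = field(default_factory=set)


def should_auto_voice(turn: TurnInfo) -> bool:
    """Run the five gates in the documented order; any failing gate stops."""
    if turn.voice_mode not in ("on", "once"):    # gate 1: mode
        return False
    if turn.claude_called_send_voice:            # gate 2: Claude wins
        return False
    if not is_summary_class(turn.reply_text):    # gate 3: summary-class
        return False
    if NO_VOICE_TOKEN in turn.reply_text:        # gate 4: escape hatch
        return False
    if turn.reply_id in turn.already_sent_ids:   # gate 5: dedup ledger
        return False
    return True
```

Each gate is an early return, so adding or reordering a gate is a one-line change and a skipped voice message can always be traced to exactly one check.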

If all gates pass, the hook calls agents.ai_os.telegram_voice.send_voice(chat_id, text). Long replies (>700 characters) are shortened to the first paragraph plus a last-paragraph TL;DR. Code fences and markdown headers are stripped from the spoken text — they read badly as audio.
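A rough sketch of that text preparation follows. The function names are hypothetical, and two details are assumptions the description above does not settle: header lines are dropped whole (rather than only losing their # markers), and the 700-character limit is measured after stripping.

```python
import re

MAX_SPOKEN_CHARS = 700

FENCE = "`" * 3  # a markdown code fence delimiter
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
HEADER_LINE_RE = re.compile(r"^#{1,6}[ \t]+.*$", re.MULTILINE)


def strip_for_speech(text: str) -> str:
    """Remove fenced code blocks and markdown header lines."""
    text = FENCE_RE.sub("", text)
    text = HEADER_LINE_RE.sub("", text)
    # Collapse the runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def shorten_for_speech(text: str) -> str:
    """Keep short replies whole; reduce long ones to first + last paragraph."""
    spoken = strip_for_speech(text)
    if len(spoken) <= MAX_SPOKEN_CHARS:
        return spoken
    paragraphs = [p.strip() for p in spoken.split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        # A single long paragraph has no separate TL;DR; truncate instead.
        return spoken[:MAX_SPOKEN_CHARS]
    return paragraphs[0] + "\n\n" + paragraphs[-1]
```

Stripping before measuring means a reply that is mostly code does not get shortened just because its code block was long.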

The Kokoro upgrade (2026-06-08)

The original voice system used ElevenLabs TTS via API. When the ElevenLabs key went dead, the fallback was macOS say — which is serviceable but robotic.

The replacement: Kokoro-82M, a local neural TTS model that runs offline on Apple Silicon. Apache 2.0 license. No API calls, no cost per character.

Key facts about the Kokoro setup:

Isolated in its own virtual environment (.venv-tts) because Kokoro requires numpy>=2 while the agent fleet pins numpy==1.26.4 — mixing them would break the data science stack
Voice selected: am_adam (set in .env as KOKORO_VOICE, shared by both manual send_voice calls and the stop-hook auto path for consistency)
Tier order: Kokoro PRIMARY → ElevenLabs (only if key works) → macOS say fallback
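The tier order amounts to a fallback chain: try each backend in turn and take the first one that yields audio. The sketch below shows only that control flow. synthesize_with_fallback is a hypothetical name, and each backend is a plain callable standing in for the real Kokoro, ElevenLabs, and macOS say integrations, none of which are shown here.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is a (name, synthesize) pair; synthesize returns audio bytes.
Backend = Tuple[str, Callable[[str], bytes]]


def synthesize_with_fallback(
    text: str, backends: Sequence[Backend]
) -> Tuple[str, bytes]:
    """Try backends in priority order; return (backend_name, audio)."""
    errors: List[str] = []
    for name, synthesize in backends:
        try:
            audio = synthesize(text)
        except Exception as exc:  # any backend failure means "try the next"
            errors.append(f"{name}: {exc}")
            continue
        if audio:
            return name, audio
        errors.append(f"{name}: returned no audio")
    # Fail loudly so a dead chain is visible instead of silently skipped.
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))
```

With this shape, a missing or dead API key is just one backend raising an error. The caller never has to know which backends are configured.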

The build agent that shipped Kokoro also caught and reverted a near-break: during dependency installation, it detected that updating numpy in the main environment would conflict with pycaret and other analytics dependencies. It isolated Kokoro instead of breaking the existing stack. That judgment call (isolate over break) is the right default when adding a new capability with conflicting dependencies.

The stop-hook kill switch that was silently disabling voice

Before the Kokoro upgrade, the stop hook had a gate: if ELEVENLABS_API_KEY was not set in the environment, skip auto-voice entirely. When the ElevenLabs key expired, this gate silently disabled auto-voice on every path, including the macOS say fallback.

The fix: remove the ELEVENLABS_API_KEY gate from the stop hook. The tier order handles fallback; the hook shouldn’t make decisions about which TTS backend is available. This is an example of a kill switch that was correct when written (don’t attempt TTS if there’s no TTS configured) becoming wrong after the architecture changed (Kokoro doesn’t need an API key).

Controlling voice mode

Three settings, toggled by the /voice command from Telegram:

/voice on — auto-fire on every qualifying reply, forever
/voice off — never auto-fire
/voice once — auto-fire on the next qualifying reply, then turn off

The current mode is stored in ~/agent-system/state/voice-mode.json.

The escape hatch: include [no-voice] anywhere in a reply to suppress auto-fire for that reply regardless of mode. Useful when you’re sending a long status report that doesn’t need audio (e.g., a formatted table that reads badly when spoken).

What Claude needs to do (very little)

The whole point of the Stop hook approach is that Claude doesn’t have to remember anything:

Default summary reply: just call mcp__plugin_telegram_telegram__reply with the full text. Voice fires automatically.
Short acknowledgement (“got it,” “done”): just reply. It’s not summary-class, so voice doesn’t fire.
Want to suppress voice for a specific reply: include [no-voice] in the text.
Want to send voice without a reply: call send_voice directly via Bash or the Python import.

That’s the full interface. The infrastructure handles the rest.

Building something similar

The core pattern is:

A Stop hook runs after every turn
The hook inspects what the turn produced
For qualifying output, it takes a follow-on action automatically

The voice feature uses this to send audio. The same pattern works for auto-logging to a file every time a specific topic is discussed, auto-creating a watcher when a pending item is mentioned, or sending a notification to a second channel for high-priority replies. The hook mechanism is generic; the classification logic is the customizable part.
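One way to keep the mechanism generic is to express each behavior as a rule: a classifier paired with a follow-on action. The sketch below is illustrative only. Rule and run_stop_hook are hypothetical names, and how the hook obtains the turn's output (transcript, stdin payload, and so on) depends on the host tool and is left out.

```python
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]  # classification logic (customizable)
    action: Callable[[str], None]   # follow-on action for qualifying output


def run_stop_hook(turn_output: str, rules: List[Rule]) -> List[str]:
    """Apply every rule to the turn's output; return names of rules that fired."""
    fired: List[str] = []
    for rule in rules:
        if not rule.matches(turn_output):
            continue
        try:
            rule.action(turn_output)
        except Exception as exc:
            # Report and move on: one failing action must not block the others.
            print(f"stop-hook rule {rule.name!r} failed: {exc}", file=sys.stderr)
            continue
        fired.append(rule.name)
    return fired
```

Adding a new automatic behavior then means appending one Rule. Failures are written to stderr rather than swallowed, so a broken action does not turn into another silent kill switch.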

Next: The discipline that prevents bugs from recurring — how the system diagnoses root causes before patching symptoms. Root-cause-first as a build process.