Audio — STT & TTS
hal0 ships two audio endpoints: speech-to-text on the stt slot
and text-to-speech on the tts slot. Both follow the OpenAI Audio API
shape, so a client written for OpenAI’s audio API should work here
once it is pointed at the hal0 base URL.
Speech-to-text — Moonshine
The stt slot defaults to Moonshine, a small, fast ASR model
built for real-time use on edge devices. The toolbox image is
hal0-toolbox-moonshine.
curl http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file=@hello.wav \
  -F model=stt
Response (OpenAI-shape):
{
  "text": "Hello, world."
}
Alternates
The stt slot can host any ASR-compatible model the Moonshine
provider supports. For example, whisper-large-v3-turbo (~1.6 GB)
gives higher accuracy if you have the headroom, and
Canary-Qwen-2.5B (Open ASR Leaderboard leader, 5.63% WER) targets
state-of-the-art accuracy. Swap with:
hal0 slot swap stt --model whisper-large-v3-turbo
See Recommended loadouts → Voice mode for the picks per tier.
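For clients that are not curl, the transcription request shown earlier in this section can be assembled with the Python standard library. This is a minimal sketch, not hal0's own client: the base URL, the `stt` model name, and the `text` response field are taken from the curl example, while the helper names `build_transcription_request` and `transcribe` are hypothetical.

```python
import json
import os
import urllib.request
import uuid

# Assumption: base URL copied from the curl example on this page.
BASE_URL = "http://localhost:8080/v1"


def build_transcription_request(audio: bytes, filename: str, model: str = "stt"):
    """Build (but do not send) a multipart POST for /audio/transcriptions."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    parts = [
        # Form field: model (the slot name, as in `-F model=stt`).
        delimiter,
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),
        # Form field: file (the audio bytes, as in `-F file=@hello.wav`).
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        audio,
        # Closing delimiter, followed by a final CRLF.
        delimiter + b"--",
        b"",
    ]
    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=b"\r\n".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str) -> str:
    """Send the file at `path` to the stt slot and return the transcript text."""
    with open(path, "rb") as f:
        req = build_transcription_request(f.read(), os.path.basename(path))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

Calling `transcribe("hello.wav")` sends the request. Because the request names the slot (`stt`) rather than a specific model, the same client code would presumably keep working after a `hal0 slot swap`; that is an inference from the examples here, not something this page states outright.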
Text-to-speech — Kokoro
The tts slot defaults to Kokoro-82M v1.0, a small open TTS
model with 54 voices across 8 languages. The toolbox image is
hal0-toolbox-kokoro.
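For Python clients, a request to the tts slot might be assembled as in the sketch below, which mirrors the curl call in this section. The base URL, the `tts` model name, the `af_bella` voice, and the `speech.wav` output name all come from that curl example; the helper names `build_speech_request` and `synthesize` are hypothetical, and the sketch assumes the response body is the raw audio.

```python
import json
import urllib.request

# Assumption: base URL copied from the curl example on this page.
BASE_URL = "http://localhost:8080/v1"


def build_speech_request(text: str, voice: str = "af_bella", model: str = "tts"):
    """Build (but do not send) a JSON POST for /audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav", voice: str = "af_bella") -> str:
    """Send `text` to the tts slot and write the returned audio bytes to `out_path`."""
    req = build_speech_request(text, voice)
    # Assumption: the response body is the audio itself, as `--output` implies.
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
    return out_path
```

Calling `synthesize("Hello from hal0.")` would write the audio to `speech.wav`. The file extension only follows the curl example; the actual encoding is whatever the server returns.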
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello from hal0.",
    "voice": "af_bella"
  }' --output speech.wav
Alternates
For voice cloning, the Kokoro provider also supports F5-TTS. Swap
with:
hal0 slot swap tts --model f5-tts
Status today
Moonshine and Kokoro are first-class providers in the v1
plan: both have working code paths and slot lifecycle integration.
The toolbox container images (hal0-toolbox-moonshine,
hal0-toolbox-kokoro) are not yet published to
ghcr.io/hal0ai/. Phase 2 publishes them, which closes the last gap
before the v1.0 cut.
Until those images land, the stt and tts slots are visible in the
UI but won’t start. The dashboard’s Slots view marks them
“image pending” with a link to the relevant roadmap entry.
Coming soon — outline
- Real-time streaming TTS (chunked PCM output).
- Speaker diarization for transcription.
- Voice cloning UX in the dashboard.
- WebSocket transport for full-duplex voice mode.