v1 pre-alpha · Apache-2.0

Local AI for your home. Strix Halo native.

hal0 turns a Ryzen AI Max+ 395 with 128 GB of unified memory into a polished, OpenAI-compatible inference box. Slots, dispatcher, prewired chat — one Linux command installs the lot.

install on any modern Linux box
sh
curl -fsSL https://hal0.dev/install | bash

Why hal0

A polished home AI box, not another llama-server wrapper.

Strix Halo native

UMA-aware probe, FLM provider for the XDNA NPU, unified-memory slot-fit warnings sized to the real pool — not the BAR carve-out. 128 GB Ryzen AI Max+ 395 is the reference deployment, not a hopeful port.

Concurrent chat + embed + voice

Five built-in slot classes — chat, embed, STT, TTS, image — each a real systemd-managed process on its own port. Run them at once: with the primary chat and embed slots serving concurrently, Strix Halo measures ~258 tok/s with dispatch latency under 200 ms.
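
In practice that means two slots answering at the same time from the same box. A minimal sketch, assuming the default :8080 port; the model names are examples, use whatever your slots are serving:

concurrent chat + embed
sh
# chat and embed hit different slots, so the requests run side by side
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-coder-30b-a3b-instruct-q4_k_m","messages":[{"role":"user","content":"Hello!"}]}' &

curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text-v2-moe-q4_k_m","input":"hal0 slot lifecycle"}' &

wait  # both return independently; each slot is its own process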

Image gen, day one

POST /v1/images/generations served by ComfyUI on ROCm, inside the same slot lifecycle as everything else. Curated SDXL Turbo, SD 1.5, and Flux Schnell ship pre-pinned by sha256. Not a side-quest — a first-class slot.
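
The request shape follows the OpenAI images API; a minimal sketch against the local endpoint, with an illustrative model identifier for the curated SDXL Turbo checkpoint:

/v1/images/generations
sh
# model name is illustrative; list what the box actually serves via /v1/models
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sdxl-turbo",
    "prompt": "a cozy home lab, isometric illustration",
    "size": "1024x1024",
    "n": 1
  }'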

Dispatcher with single-flight

Registry-aware routing across local slots and external upstreams (OpenRouter, Anthropic, OpenAI). Cold-cache prefetch with request coalescing — a thundering herd of identical prefetches becomes one HTTP call. Every routing decision logged as structured breadcrumbs.
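
Those breadcrumbs land in the unit's journal; a quick way to watch routing live, assuming journald and the hal0-api unit name used elsewhere on this page (the log wording itself is illustrative):

routing breadcrumbs
sh
# follow dispatcher decisions as they happen
journalctl -u hal0-api -f | grep -i dispatch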

What ships in v1

A complete home inference platform.

Five-provider stack

llama.cpp (Vulkan / ROCm / CUDA) for chat + embed + rerank, FLM for the XDNA NPU, Moonshine for STT, Kokoro for TTS, ComfyUI for image gen — all wrapped in the same slot lifecycle.

OpenAI /v1/* surface

Chat, completions, embeddings, rerank, audio transcriptions, audio speech, images, models. Same shapes any OpenAI SDK already speaks — point your client at localhost:8080 and go.
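
Discovery works the same way it does against OpenAI; a quick check, assuming the default port:

/v1/models
sh
# list every model id the box currently serves
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'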

Slot state machine

Every workload has typed states (offline → pulling → warming → ready → serving ↔ idle → unloading) with atomic transitions, persisted to state.json and streamed over SSE.
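
Both artifacts are plain to inspect from a shell. A sketch with illustrative paths; the actual location of state.json and the SSE endpoint may differ:

slot state
sh
# pretty-print the persisted slot states (path is illustrative)
jq . /var/lib/hal0/state.json

# follow live transitions over SSE (endpoint path is illustrative)
curl -N http://localhost:8080/v1/slots/events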

Hugging Face model pulls

Streaming GGUF downloads from the dashboard or hal0 model pull, with live progress and resumable transfers. No manual repo wrangling, no git lfs.
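
From the CLI it looks roughly like this; the repo argument syntax is an assumption, and the dashboard flow is equivalent:

hal0 model pull
sh
# stream a GGUF straight from Hugging Face into the local model store
# (argument format is illustrative)
hal0 model pull Qwen/Qwen3-4B-Instruct-2507-GGUF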

Prewired OpenWebUI

Chat UI on :3001, zero config — the installer points it at the local hal0 API. Dashboard is for operating the box, not chatting.

Dashboard (Vue 3)

Nine views, including Slots, Models, Hardware, Logs, Settings, Providers, and First-run, plus an error shell. Dark by default. SSE-backed live status and log tail.

Auth + HTTPS, one flag

Off by default for trusted-LAN installs. --auth=basic brings up Caddy with basic_auth at the edge, bearer tokens for the OpenAI API, and automatic HTTPS — internal CA for .local, Let's Encrypt for real domains. Zero certbot.
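
With auth on, API clients present a bearer token at the edge. A sketch, assuming the flag is passed through the one-line installer and that HAL0_TOKEN holds a token issued at install time:

--auth=basic
sh
# enable Caddy + basic_auth + automatic HTTPS at install time
# (flag pass-through shown here is illustrative)
curl -fsSL https://hal0.dev/install | bash -s -- --auth=basic

# then call the OpenAI surface over HTTPS with a bearer token
curl https://your-box.example.com/v1/models \
  -H "Authorization: Bearer $HAL0_TOKEN"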

Atomic self-update

hal0 update --channel stable|nightly. Cosign-verified tarballs swap a current symlink; --rollback reverts. Slot units survive API restarts.
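
Day to day that is two commands; a sketch, assuming --rollback is a flag on hal0 update:

hal0 update
sh
# pull, cosign-verify, and atomically swap the current symlink
hal0 update --channel stable

# something looks off? revert to the previous release
hal0 update --rollback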

One-line install

Linux + systemd, idempotent, non-interactive. Pre-flight checks, systemd template units, working slot defaults dropped on disk — no manual yaml.

Hardware

Built for Strix Halo. Honest about everything else.

Linux + systemd is the only hard requirement (installer/install.sh:86). The probe picks the right provider; you don't pin a backend by hand.

Provider matrix — picked automatically by the hardware probe

Hardware | Vendor | Unified / VRAM | Support | Notes
AMD Ryzen AI Max+ 395 | AMD · "Strix Halo" | Unified 128 GB | first-class | Reference deployment. iGPU + XDNA NPU + UMA-aware probe. Vulkan default; FLM for NPU.
AMD Ryzen AI Max 385 / 390 | AMD · "Strix Halo" | Unified 64 GB | first-class | Same path as the 128 GB SKU; small + mid tiers fit comfortably, 70B Q4 with shorter context.
NVIDIA RTX 30/40/50 | NVIDIA | 10–32 GB | supported | CUDA-backed llama.cpp. Same slot lifecycle, dedicated VRAM instead of UMA.
AMD Radeon RX 7000 / discrete | AMD | 16–24 GB | supported | Vulkan path today; ROCm toolbox image on the build list for opt-in.
CPU-only x86_64 | Intel / AMD | System RAM | experimental | Vulkan-CPU fallback. Small models only — CI runs Qwen 0.5B here. Not the headline experience.

macOS and Windows are not in scope for v1. NPU benchmarks land once the FLM toolbox image is published to ghcr.io/hal0ai/.

Recommended loadouts

Three starting points. Mix, match, swap.

Curated picks for the 128 GB Strix Halo — refreshed to the latest open-weight releases as of May 2026. The slot system takes a different model per slot whenever you change your mind. See the full hardware page for discrete-GPU and CPU loadouts.

Coding · mid

~19 GB
  • primary · Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
  • embed · nomic-embed-text-v2-moe-Q4_K_M

MoE with 3B active params — runs near 3B speeds, reasons like a 30B. Pairs with a 140 MB embed for repo-aware search.

Voice mode

~3 GB
  • primary · Qwen3-4B-Instruct-2507-Q4_K_M
  • stt · Moonshine base
  • tts · Kokoro-82M v1.0

Low-latency reply, edge-built STT, 54-voice TTS. 128 GB leaves the entire rest of the budget free for a second chat model warm in another slot.
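
Both voice slots sit behind the standard OpenAI audio endpoints; a sketch with illustrative model and voice names for the Moonshine and Kokoro slots:

/v1/audio
sh
# speech-to-text: multipart upload, OpenAI transcription shape
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@note.wav \
  -F model=moonshine-base

# text-to-speech: returns audio bytes
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-82m","input":"Dinner is ready.","voice":"af_heart"}' \
  -o reply.wav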

Maxed-out · long context

~50 GB
  • primary · Llama-4-Scout-17B-16E-Instruct-Q4_K_M
  • embed · bge-m3 (8192-token context)

10M-token context, MoE with 17B active. The biggest realistic single-model loadout that still leaves room for STT/TTS slots warm on 128 GB unified.

Loadouts are starting points. Every real install ends up tweaked. Sizes are published GGUF file sizes (Hugging Face, May 2026); no tok/s numbers on this page.

Quickstart

From zero to a live /v1/chat in two commands.

The installer is idempotent and non-interactive. It probes the hardware, writes /etc/hal0/hardware.json, drops working slot defaults, and brings the API up on :8080.

1 · install

install.sh
sh
# install on any modern Linux box with systemd
curl -fsSL https://hal0.dev/install | bash

# optional overrides
# HAL0_PORT=9090 HAL0_PREFIX=/opt/hal0 curl … | bash
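
Optional sanity check before moving on; the unit name comes from elsewhere on this page, the paths from the installer description above:

post-install check
sh
# what did the hardware probe decide?
jq . /etc/hal0/hardware.json

# is the API unit up and answering on :8080?
systemctl status hal0-api --no-pager
curl -s http://localhost:8080/v1/models | jq '.data | length'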

2 · chat

/v1/chat/completions
sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b-instruct-q4_k_m",
    "messages": [{"role":"user","content":"Hello!"}]
  }'
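
Streaming works the same way it does against OpenAI, assuming the standard stream flag (same model id as the request above):

streaming
sh
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b-instruct-q4_k_m",
    "messages": [{"role":"user","content":"Hello!"}],
    "stream": true
  }'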

Prefer a chat UI? OpenWebUI ships prewired on :3001, pointed at the local hal0 API out of the box.

Comparison

Why not just ollama, LM Studio, or raw llama.cpp?

hal0 isn't an inference engine — it's the orchestration, lifecycle, and multi-modal surface around llama.cpp, FLM, Moonshine, and Kokoro. Honest take, in one table.

Feature | hal0 | ollama | LM Studio | raw llama.cpp | Cloud API
OpenAI /v1/* surface | chat, embed, rerank, STT, TTS | chat-only subset | chat-only | raw HTTP | full
systemd-managed lifecycle | ✓ | partial | desktop app | DIY | n/a
Hardware probe + fit warnings | ✓ | – | – | – | n/a
Headless one-line install | ✓ | ✓ | GUI installer | manual | n/a
Multi-model concurrent slots | ✓ | partial | – | DIY | –
Bundled chat UI | OpenWebUI prewired | – | built-in | – | varies
Signed self-update + rollback | cosign | manual | desktop updater | manual | n/a
Data stays on your box | ✓ | ✓ | ✓ | ✓ | ✗

vs. ollama — systemd-managed slots survive hal0-api restarts; the OpenAI surface includes embeddings, rerank, and STT/TTS, not just chat. Hardware probe + slot fit warnings are first-class.

vs. LM Studio — Linux-first, headless-first, one-line install, no GUI required. Prewired OpenWebUI handles chat; the dashboard is for operating the box.

vs. raw llama.cpp — hal0 owns the lifecycle: health probes, atomic env writes, cold-boot grace, single-flight prefetch, structured errors, signed self-update with rollback.

vs. cloud APIs — your hardware, your data, your models. External upstreams (OpenRouter, Anthropic, OpenAI, custom) can be configured as fallbacks behind the same /v1/* surface — mix local + remote per-model in one config.

Roadmap

v1 shipped. v0.2 is where it gets interesting.

Phase 1 landed on 2026-05-15 — 353 unit tests passing, integration tier on Vulkan-CPU + Qwen 0.5B. Here's what's next.

Full roadmap →

Now · v1

Slot lifecycle + dispatcher · shipped

State machine, single-flight, adaptive cold-boot, atomic TOML config, structured error envelopes.

Five providers wired · shipped

llama.cpp (Vulkan), FLM (XDNA NPU), Moonshine (STT), Kokoro (TTS), ComfyUI (image gen on ROCm).

Image generation · shipped

POST /v1/images/generations served by ComfyUI on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell, slot named img.

Caddy + auth + HTTPS · shipped

install.sh --auth=basic brings up Caddy with basic_auth + Bearer tokens + automatic HTTPS, with a post-install round-trip self-test.

Next · v0.2

Memory subsystem · next

Conversation memory store wired into the dispatcher — opt-in per slot.

MCP support · next

Model Context Protocol surface so hal0 can host MCP servers alongside slots.

hal0.local mDNS · next

Zero-config LAN discovery so the box is reachable as hal0.local without DNS edits.

Packaging · v0.2

AUR + Ubuntu PPA · next

Distro-native packages alongside the curl installer.

Benchmarks + Presets UI · next

In-dashboard benchmark runner and one-click loadout presets.

Trust

Boring guarantees, in writing.

License

Apache-2.0

Patent grant included. Bundle, fork, ship.

Telemetry

Off by default

Opting in reports hardware class, version, and slot count. No model names. No IPs. No config.

Releases

cosign-signed

hal0 update verifies GitHub-OIDC signer identity before unpacking.

Source

github.com/hal0ai/hal0 →

Issues, discussions, release manifests. Pre-alpha — see CONTRIBUTING for contribution status.

One Linux box. Your AI.

Install hal0 in under a minute.

Linux + systemd is the only hard requirement. The probe picks the right provider — Vulkan on Strix Halo, CUDA on NVIDIA, ROCm on AMD discrete, CPU as a fallback.