Strix Halo native
UMA-aware probe, FLM provider for the XDNA NPU, unified-memory slot-fit warnings sized to the real pool — not the BAR carve-out. 128 GB Ryzen AI Max+ 395 is the reference deployment, not a hopeful port.
hal0 turns a Ryzen AI Max+ 395 with 128 GB of unified memory into a polished, OpenAI-compatible inference box. Slots, dispatcher, prewired chat — one Linux command installs the lot.
curl -fsSL https://hal0.dev/install | bash
Why hal0
Five built-in slot classes — chat, embed, STT, TTS, image — each a real systemd-managed process on its own port. Run them all at once: primary + embed running concurrently on Strix Halo measures ~258 tok/s with <200 ms dispatch.
POST /v1/images/generations served by ComfyUI on ROCm, inside the same slot lifecycle as everything else. Curated SDXL Turbo, SD 1.5, and Flux Schnell ship pre-pinned by sha256. Not a side-quest — a first-class slot.
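A minimal request sketch, assuming the image slot is warm; the model id here is illustrative, not one of the pinned names:

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sdxl-turbo",
    "prompt": "isometric render of a small home server",
    "size": "1024x1024"
  }'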
Registry-aware routing across local slots and external upstreams (OpenRouter, Anthropic, OpenAI). Cold-cache prefetch with request coalescing — a thundering herd of identical prefetches becomes one HTTP call. Every routing decision logged as structured breadcrumbs.
What ships in v1
OpenAI-compatible /v1/* API. Point any client at localhost:8080 and go.
Slot state persisted to state.json and streamed over SSE.
Model downloads via hal0 model pull, with live progress and resumable transfers. No manual repo wrangling, no git lfs.
OpenWebUI prewired on :3001, zero config — the installer points it at the local hal0 API. The dashboard is for operating the box, not chatting.
install.sh --auth=basic brings up Caddy with basic_auth at the edge, bearer tokens for the OpenAI API, and automatic HTTPS — internal CA for .local, Let's Encrypt for real domains. Zero certbot.
Self-updates via hal0 update --channel stable|nightly. Cosign-verified tarballs swap a current symlink; --rollback reverts. Slot units survive API restarts.
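Both flows are plain CLI. For example, pulling a model and exercising the update channel (the model id matches the quickstart example below):

# resumable download with live progress
hal0 model pull qwen2.5-0.5b-instruct-q4_k_m
# track a channel; tarballs are cosign-verified before unpack
hal0 update --channel nightly
# revert to the previous release if the new one misbehaves
hal0 update --rollback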
Hardware
Linux + systemd is the only hard requirement (installer/install.sh:86). The probe picks the right provider; you don't pin a backend by hand.
Provider matrix — picked automatically by the hardware probe
| Hardware | Vendor | Unified / VRAM | Support | Notes |
|---|---|---|---|---|
| AMD Ryzen AI Max+ 395 | AMD · "Strix Halo" | Unified 128 GB | first-class | Reference deployment. iGPU + XDNA NPU + UMA-aware probe. Vulkan default; FLM for NPU. |
| AMD Ryzen AI Max 385 / 390 | AMD · "Strix Halo" | Unified 64 GB | first-class | Same path as the 128 GB SKU; small + mid tiers fit comfortably, 70B Q4 with shorter context. |
| NVIDIA RTX 30/40/50 | NVIDIA | 10–32 GB | supported | CUDA-backed llama.cpp. Same slot lifecycle, dedicated VRAM instead of UMA. |
| AMD Radeon RX 7000 / discrete | AMD | 16–24 GB | supported | Vulkan path today; ROCm toolbox image on the build list for opt-in. |
| CPU-only x86_64 | Intel / AMD | System RAM | experimental | Vulkan-CPU fallback. Small models only — CI runs Qwen 0.5B here. Not the headline experience. |
macOS and Windows are not in scope for v1. NPU benchmarks land once the FLM toolbox image is published to ghcr.io/hal0ai/.
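Curious what the probe decided? The installer writes its verdict to /etc/hal0/hardware.json; the fields below are illustrative, not the actual schema:

cat /etc/hal0/hardware.json
# illustrative shape only — real field names may differ:
# { "class": "strix-halo", "unified_memory_gb": 128, "provider": "vulkan", "npu": "xdna" }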
Recommended loadouts
Curated picks for the 128 GB Strix Halo — refreshed to the latest open-weight releases as of May 2026. The slot system lets you swap in a different model per slot whenever you change your mind. See the full hardware page for discrete-GPU and CPU loadouts.
MoE with 3B active params — runs near 3B speeds, reasons like a 30B. Pairs with a 140 MB embed for repo-aware search.
Low-latency reply, edge-built STT, 54-voice TTS. 128 GB leaves the entire rest of the budget free for a second chat model warm in another slot.
10M-token context, MoE with 17B active. The biggest realistic single-model loadout that still leaves room for STT/TTS slots warm on 128 GB unified.
Loadouts are starting points. Every real install ends up tweaked. Sizes are published GGUF file sizes (Hugging Face, May 2026); no tok/s numbers on this page.
Quickstart
From zero to /v1/chat/completions in two commands.
The installer is idempotent and non-interactive. It probes the hardware, writes /etc/hal0/hardware.json, drops working slot defaults, and brings the API up on :8080.
1 · install
# install on any modern Linux box with systemd
curl -fsSL https://hal0.dev/install | bash
# optional overrides
# HAL0_PORT=9090 HAL0_PREFIX=/opt/hal0 curl … | bash
2 · chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-0.5b-instruct-q4_k_m",
"messages": [{"role":"user","content":"Hello!"}]
}'
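The embed surface works the same way. A sketch, assuming an embedding model is loaded in the embed slot (the model id is a placeholder):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embed-model",
    "input": "repo-aware search query"
  }'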
Prefer a chat UI? OpenWebUI ships prewired on :3001, pointed at the local hal0 API out of the box.
Comparison
hal0 isn't an inference engine — it's the orchestration, lifecycle, and multi-modal surface around llama.cpp, FLM, Moonshine, and Kokoro. Honest take, in one table.
| Feature | hal0 | ollama | LM Studio | raw llama.cpp | Cloud API |
|---|---|---|---|---|---|
| OpenAI /v1/* surface | chat, embed, rerank, STT, TTS | chat-only subset | chat-only | raw HTTP | full |
| systemd-managed lifecycle | ✓ | partial | desktop app | DIY | n/a |
| Hardware probe + fit warnings | ✓ | — | — | — | n/a |
| Headless one-line install | ✓ | ✓ | GUI installer | manual | n/a |
| Multi-model concurrent slots | ✓ | partial | — | DIY | ✓ |
| Bundled chat UI | OpenWebUI prewired | — | built-in | — | varies |
| Signed self-update + rollback | cosign | manual | desktop updater | manual | n/a |
| Data stays on your box | ✓ | ✓ | ✓ | ✓ | — |
vs. ollama — systemd-managed slots survive hal0-api restarts; the OpenAI surface includes embeddings, rerank, and STT/TTS, not just chat. Hardware probe + slot fit warnings are first-class.
vs. LM Studio — Linux-first, headless-first, one-line install, no GUI required. Prewired OpenWebUI handles chat; the dashboard is for operating the box.
vs. raw llama.cpp — hal0 owns the lifecycle: health probes, atomic env writes, cold-boot grace, single-flight prefetch, structured errors, signed self-update with rollback.
vs. cloud APIs — your hardware, your data, your models. External upstreams (OpenRouter, Anthropic, OpenAI, custom) can be configured as fallbacks behind the same /v1/* surface — mix local + remote per-model in one config.
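As a sketch, assuming an OpenRouter upstream has been registered (the model id below is hypothetical), a remote model is reached through the exact same call shape as a local one:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openrouter/example-model",
    "messages": [{"role":"user","content":"Route me upstream."}]
  }'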
Roadmap
Phase 1 landed on 2026-05-15 — 353 unit tests passing, integration tier on Vulkan-CPU + Qwen 0.5B. Here's what's next.
v1
POST /v1/images/generations served by ComfyUI on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell, slot named img.
v1
install.sh --auth=basic brings up Caddy with basic_auth + Bearer tokens + automatic HTTPS, with a post-install round-trip self-test. hal0.local works without DNS edits.
Trust
License
Apache-2.0
Patent grant included. Bundle, fork, ship.
Telemetry
Off by default
Opting in surfaces hardware class, version, and slot count. No model names, no IPs, no config contents.
Releases
cosign-signed
hal0 update verifies the GitHub-OIDC signer identity before unpacking.
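The same check can be run by hand with stock cosign. The file names are placeholders, and the identity regexp is an assumption about the release workflow:

# manual verification of a release tarball (file names are placeholders)
cosign verify-blob \
  --signature hal0.tar.gz.sig \
  --certificate hal0.tar.gz.pem \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp 'github\.com/hal0ai/hal0' \
  hal0.tar.gz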
Source
github.com/hal0ai/hal0
Issues, discussions, release manifests. Pre-alpha — see CONTRIBUTING for contribution status.
Community
GitHub Issues
Bug reports and feature requests. Reference the slot or endpoint in the title; attach journalctl -u hal0-api output for crashes.
GitHub Discussions
Loadout questions, hardware tuning, model recommendations. Lower friction than an issue.
Email
Anything off the GitHub path — security disclosures, partnerships, hardware loaners, press, or just hello.
One Linux box. Your AI.
Linux + systemd is the only hard requirement. The probe picks the right provider — Vulkan on Strix Halo, CUDA on NVIDIA, ROCm on AMD discrete, CPU as a fallback.