Strix Halo native
UMA-aware probe, FLM provider for the XDNA NPU, unified-memory slot-fit warnings sized to the real pool — not the BAR carve-out. 128 GB Ryzen AI Max+ 395 is the reference deployment, not a hopeful port.
hal0 turns a Ryzen AI Max+ 395 with 128 GB of unified memory into a polished, OpenAI-compatible inference box. Slots, dispatcher, prewired chat — one Linux command installs the lot.
curl -fsSL https://hal0.dev/install | bash
Why hal0
Five built-in slot classes — chat, embed, STT, TTS, image — each a real systemd-managed process on its own port. Run them all at once: primary + embed running concurrently on Strix Halo measures ~258 tok/s with <200 ms dispatch.
POST /v1/images/generations served by ComfyUI on ROCm, inside the same slot lifecycle as everything else. Curated SDXL Turbo, SD 1.5, and Flux Schnell ship pre-pinned by sha256. Not a side-quest — a first-class slot.
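A minimal request sketch, assuming the image slot is warm; the model id here is illustrative, not one of the pinned names:

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sdxl-turbo",
    "prompt": "isometric render of a small home server",
    "size": "1024x1024"
  }'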
Registry-aware routing across local slots and external upstreams (OpenRouter, Anthropic, OpenAI). Cold-cache prefetch with request coalescing — a thundering herd of identical prefetches becomes one HTTP call. Every routing decision logged as structured breadcrumbs.
What ships in v1
OpenAI-compatible /v1/* API. Point any client at localhost:8080 and go.
Slot state persisted to state.json and streamed over SSE.
Model downloads via hal0 model pull, with live progress and resumable transfers. No manual repo wrangling, no git lfs.
OpenWebUI prewired on :3001, zero config — the installer points it at the local hal0 API. The dashboard is for operating the box, not chatting.
install.sh --auth=basic brings up Caddy with basic_auth at the edge, bearer tokens for the OpenAI API, and automatic HTTPS — internal CA for .local, Let's Encrypt for real domains. Zero certbot.
Self-updates via hal0 update --channel stable|nightly. Cosign-verified tarballs swap a current symlink; --rollback reverts. Slot units survive API restarts.
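Both flows are plain CLI. For example, pulling a model and exercising the update channel (the model id matches the quickstart example below):

# resumable download with live progress
hal0 model pull qwen2.5-0.5b-instruct-q4_k_m
# track a channel; tarballs are cosign-verified before unpack
hal0 update --channel nightly
# revert to the previous release if the new one misbehaves
hal0 update --rollback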
Hardware
Linux + systemd is the only hard requirement (installer/install.sh:86). The probe picks the right provider; you don't pin a backend by hand.
Provider matrix — picked automatically by the hardware probe
| Hardware | Vendor | Unified / VRAM | Support | Notes |
|---|---|---|---|---|
| AMD Ryzen AI Max+ 395 | AMD · "Strix Halo" | Unified 128 GB | first-class | Reference deployment. iGPU + XDNA NPU + UMA-aware probe. Vulkan default; FLM for NPU. |
| AMD Ryzen AI Max 385 / 390 | AMD · "Strix Halo" | Unified 64 GB | first-class | Same path as the 128 GB SKU; small + mid tiers fit comfortably, 70B Q4 with shorter context. |
| NVIDIA RTX 30/40/50 | NVIDIA | 10–32 GB | supported | CUDA-backed llama.cpp. Same slot lifecycle, dedicated VRAM instead of UMA. |
| AMD Radeon RX 7000 / discrete | AMD | 16–24 GB | supported | Vulkan path today; ROCm toolbox image on the build list for opt-in. |
| CPU-only x86_64 | Intel / AMD | System RAM | experimental | Vulkan-CPU fallback. Small models only — CI runs Qwen 0.5B here. Not the headline experience. |
macOS and Windows are not in scope for v1. NPU benchmarks land once the FLM toolbox image is published to ghcr.io/hal0ai/.
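Curious what the probe decided? The installer writes its verdict to /etc/hal0/hardware.json; the fields below are illustrative, not the actual schema:

cat /etc/hal0/hardware.json
# illustrative shape only — real field names may differ:
# { "class": "strix-halo", "unified_memory_gb": 128, "provider": "vulkan", "npu": "xdna" }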
Recommended loadouts
Curated picks for the 128 GB Strix Halo — refreshed to the latest open-weight releases as of May 2026. The slot system lets you swap in a different model per slot whenever you change your mind. See the full hardware page for discrete-GPU and CPU loadouts.
MoE with 3B active params — runs near 3B speeds, reasons like a 30B. Pairs with a 140 MB embed for repo-aware search.
Low-latency reply, edge-built STT, 54-voice TTS. 128 GB leaves the entire rest of the budget free for a second chat model warm in another slot.
10M-token context, MoE with 17B active. The biggest realistic single-model loadout that still leaves room for STT/TTS slots warm on 128 GB unified.
Loadouts are starting points. Every real install ends up tweaked. Sizes are published GGUF file sizes (Hugging Face, May 2026); no tok/s numbers on this page.
Quickstart
From zero to /v1/chat/completions in two commands.
The installer is idempotent and non-interactive. It probes the hardware, writes /etc/hal0/hardware.json, drops working slot defaults, and brings the API up on :8080.
1 · install
# install on any modern Linux box with systemd
curl -fsSL https://hal0.dev/install | bash
# optional overrides
# HAL0_PORT=9090 HAL0_PREFIX=/opt/hal0 curl … | bash
2 · chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-0.5b-instruct-q4_k_m",
"messages": [{"role":"user","content":"Hello!"}]
}'
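The embed surface works the same way. A sketch, assuming an embedding model is loaded in the embed slot (the model id is a placeholder):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embed-model",
    "input": "repo-aware search query"
  }'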
Prefer a chat UI? OpenWebUI ships prewired on :3001, pointed at the local hal0 API out of the box.
Comparison
hal0 isn't an inference engine — it's the orchestration, lifecycle, and multi-modal surface around llama.cpp, FLM, Moonshine, and Kokoro. Honest take, in one table.
| Feature | hal0 | ollama | LM Studio | raw llama.cpp | Cloud API |
|---|---|---|---|---|---|
| OpenAI /v1/* surface | chat, embed, rerank, STT, TTS | chat-only subset | chat-only | raw HTTP | full |
| systemd-managed lifecycle | ✓ | partial | desktop app | DIY | n/a |
| Hardware probe + fit warnings | ✓ | — | — | — | n/a |
| Headless one-line install | ✓ | ✓ | GUI installer | manual | n/a |
| Multi-model concurrent slots | ✓ | partial | — | DIY | ✓ |
| Bundled chat UI | OpenWebUI prewired | — | built-in | — | varies |
| Signed self-update + rollback | cosign | manual | desktop updater | manual | n/a |
| Data stays on your box | ✓ | ✓ | ✓ | ✓ | — |
vs. ollama — systemd-managed slots survive hal0-api restarts; the OpenAI surface includes embeddings, rerank, and STT/TTS, not just chat. Hardware probe + slot fit warnings are first-class.
vs. LM Studio — Linux-first, headless-first, one-line install, no GUI required. Prewired OpenWebUI handles chat; the dashboard is for operating the box.
vs. raw llama.cpp — hal0 owns the lifecycle: health probes, atomic env writes, cold-boot grace, single-flight prefetch, structured errors, signed self-update with rollback.
vs. cloud APIs — your hardware, your data, your models. External upstreams (OpenRouter, Anthropic, OpenAI, custom) can be configured as fallbacks behind the same /v1/* surface — mix local + remote per-model in one config.
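As a sketch, assuming an OpenRouter upstream has been registered (the model id below is hypothetical), a remote model is reached through the exact same call shape as a local one:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openrouter/example-model",
    "messages": [{"role":"user","content":"Route me upstream."}]
  }'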
Roadmap
Phase 1 landed on 2026-05-15 — 353 unit tests passing, integration tier on Vulkan-CPU + Qwen 0.5B. Here's what's next.
v1
POST /v1/images/generations served by ComfyUI on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell, slot named img.
v1
install.sh --auth=basic brings up Caddy with basic_auth + Bearer tokens + automatic HTTPS, with a post-install round-trip self-test. hal0.local works without DNS edits.
Trust
License
Apache-2.0
Patent grant included. Bundle, fork, ship.
Telemetry
Off by default
Opting in surfaces hardware class, version, and slot count. No model names, no IPs, no config contents.
Releases
cosign-signed
hal0 update verifies the GitHub-OIDC signer identity before unpacking.
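The same check can be run by hand with stock cosign. The file names are placeholders, and the identity regexp is an assumption about the release workflow:

# manual verification of a release tarball (file names are placeholders)
cosign verify-blob \
  --signature hal0.tar.gz.sig \
  --certificate hal0.tar.gz.pem \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp 'github\.com/hal0ai/hal0' \
  hal0.tar.gz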
Source
github.com/hal0ai/hal0
Issues, discussions, release manifests. Pre-alpha — see CONTRIBUTING for contribution status.
Community
GitHub Issues
Bug reports and feature requests. Reference the slot or endpoint in the title; attach journalctl -u hal0-api output for crashes.
GitHub Discussions
Loadout questions, hardware tuning, model recommendations. Lower friction than an issue.
Email
Anything off the GitHub path — security disclosures, partnerships, hardware loaners, press, or just hello.
One Linux box. Your AI.
Linux + systemd is the only hard requirement. The probe picks the right provider — Vulkan on Strix Halo, CUDA on NVIDIA, ROCm on AMD discrete, CPU as a fallback.