
CPU-only

hal0’s CPU-only path is the fallback tier. It exists so the installer works on a fresh VM, so CI can smoke-test slot lifecycle on boxes without GPUs, and so anyone trying hal0 for the first time can do so before deciding whether to commit hardware to it.

It is not the headline experience. The streaming chat feel that makes local AI worth using needs at least an iGPU.

The hardware probe detects “no GPU” and writes that to /etc/hal0/hardware.json. The installer picks the Vulkan-CPU path: llama.cpp compiled with the Vulkan backend running against lavapipe (Mesa’s software Vulkan implementation). This is the same path hal0’s CI uses to smoke-test the slot lifecycle with the Qwen 0.5B model.
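
For illustration, a probe result on a no-GPU box might look something like this; the field names are a guess at the shape, not the actual schema:

```
cat /etc/hal0/hardware.json
# Hypothetical output -- real fields may differ:
# {
#   "gpu": "none",
#   "backend": "vulkan-cpu"
# }
```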

All four built-in slots can theoretically run:

  • primary — small Q4 chat (4B and under)
  • embed — embeddings + rerank
  • stt — Moonshine (CPU-capable but latency-sensitive)
  • tts — Kokoro (CPU-capable but latency-sensitive)

The OpenAI-compatible /v1/* API, the slot lifecycle state machine, the dispatcher, and OpenWebUI all behave identically to a GPU box.
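
For example, a streaming chat request looks the same as it would against any OpenAI-compatible server. The port and the model name below are assumptions; substitute whatever your install actually exposes:

```
# Port 8080 and "primary" as the model name are assumptions.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "primary", "stream": true,
       "messages": [{"role": "user", "content": "Hello"}]}'
```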

How usable is it? The honest answer:

  • Chat: a few tokens per second on a 4B Q4 model with a modest context. Fine for occasional Q&A, painful for long conversations; the back-of-envelope after this list shows why.
  • Streaming voice: not realistic. Moonshine STT and Kokoro TTS run on CPU, but CPU round-trip latency isn’t what the streaming audio path was designed around. You can sanity-check the path; you can’t run voice mode comfortably.
  • Embeddings: fine. nomic-embed-text-v2-moe-Q4_K_M at 140 MB is small enough to run at usable speeds on any modern CPU, and the embed slot doesn’t need streaming.
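
That chat number falls out of memory bandwidth. CPU token generation reads essentially every weight once per token, so throughput is bounded by roughly bandwidth ÷ model size: on a back-of-envelope dual-channel desktop at ~50 GB/s, a 2.5 GB Q4 4B model tops out near 20 tokens/s in theory, and lavapipe overhead plus attention work put real throughput well below that.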

Recommended loadout (CPU-only, 32–64 GB RAM, no GPU)

  • primary: gemma-3-1b-it-Q4_K_M (~0.7 GB) for a snappier feel, or Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for stronger answers. (fallback: Phi-3-mini-4k-instruct-q4.gguf ~2.4 GB, the curated default.)
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) — runs fine on CPU.
  • No stt / tts slots — leave them in the offline state.

This is also the smallest viable hal0 install. The whole runtime, with a model loaded, fits comfortably under 3 GB of RSS.
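
A quick way to check that on a live install; the process name here is an assumption, so adjust it to whatever your box actually runs:

```
# RSS in MB for each matching process. "llama-server" is a
# guess at the process name -- check with: ps aux | grep hal0
ps -o rss=,comm= -C llama-server | awk '{printf "%d MB  %s\n", $1/1024, $2}'
```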

What this tier is good for:

  • Smoke-testing the install on a VM before committing hardware.
  • Development: running the API + dashboard against a tiny model while you build something against /v1/*.
  • CI: hal0’s own integration tier uses Vulkan-CPU + Qwen 0.5B — same path you’d hit here.
  • A box that’s already running for some other reason where you’d like an occasional local Q&A endpoint.

What it is not for:

  • A daily-driver chat box. Get an iGPU.
  • Anything voice-mode. Get an iGPU.
  • Any model larger than Q4 4B unless you are deeply patient.
  • Any model wider than your memory bandwidth can stream. Above 8B Q4 on CPU you hit wall-clock limits that no amount of patience fixes.

The standard installer from the install page detects no-GPU correctly and picks Vulkan-CPU:

```
curl -fsSL https://hal0.dev/install | bash
```

A few things to check:

  • The Vulkan loader must be installed (apt install libvulkan1 / pacman -S vulkan-icd-loader).
  • Mesa’s lavapipe (mesa-vulkan-drivers / vulkan-swrast) provides the software Vulkan implementation.
  • vulkaninfo --summary should show llvmpipe as a device.
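
On Debian/Ubuntu, the whole check condenses to the following (vulkaninfo ships in the vulkan-tools package; per-distro package names for the driver are listed below):

```
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
vulkaninfo --summary | grep -i llvmpipe   # should print a llvmpipe device line
```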

Troubleshooting

vulkaninfo shows no devices on a no-GPU box. Install Mesa’s software rasterizer Vulkan driver:

  • Debian/Ubuntu: apt install mesa-vulkan-drivers
  • Arch/CachyOS: pacman -S vulkan-swrast
  • Fedora: dnf install vulkan-loader mesa-vulkan-drivers

Slot starts but inference is extremely slow. Expected on CPU. Confirm the model is the size you think it is (a Q8 14B is dramatically slower than a Q4 4B), and shorten context windows where possible.

OOM on slot start. A Q4 model needs roughly its file size in RAM plus headroom for the KV cache. A 7 GB model on an 8 GB box won’t fit; swap matters more here than it does on GPU paths.
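
For budgeting, the KV cache costs roughly 2 × n_layers × n_kv_heads × head_dim × bytes_per_element per token of context. Illustrative numbers only, since the exact figures depend on the architecture: a 4B-class model with 36 layers, 8 KV heads, and a head dim of 128 at f16 needs about 144 KB per token, so an 8K context adds roughly 1.2 GB on top of the ~2.5 GB of weights.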

If you’re doing anything more than smoke-testing, the cheapest meaningful upgrade is any modern AMD APU with RDNA-class graphics. Even a 780M-class iGPU is dramatically faster than CPU-only Vulkan on chat workloads. The full Strix Halo experience is the top end; the floor is “any iGPU at all.”