Streaming
hal0 streams /v1/chat/completions and /v1/completions responses as
Server-Sent Events, exactly matching the OpenAI streaming
protocol. Any OpenAI SDK that handles streaming today works against
hal0 unmodified.
Enable streaming
Add "stream": true to the request body:
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Count to five."}
    ]
  }'
```
Wire format

Each chunk is a data: … line, JSON-encoded, terminated by a blank
line. The stream ends with data: [DONE]. Same shape OpenAI ships:
```
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"One"}}]}

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", two"}}]}

data: [DONE]
```
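Because the wire format is plain SSE, you can consume it without any SDK at all. Below is a minimal sketch of reassembling the assistant's reply from raw `data:` lines; the chunk shape follows the examples above, and `join_stream` is an illustrative helper, not part of hal0:

```python
import json

def join_stream(lines):
    """Reassemble assistant text from raw SSE `data:` lines.

    Stops at the `data: [DONE]` sentinel and skips the blank
    separator lines between events.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # blank separators between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        # The first chunk carries only the role; content may be absent.
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

Fed the three sample chunks above, this returns the string "One, two".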
Python SDK

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

stream = client.chat.completions.create(
    model="primary",
    stream=True,
    messages=[{"role": "user", "content": "Count to five."}],
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
What hal0 adds on top

Streaming flows through the same dispatcher that handles non-streaming requests, so you get:
- Single-flight prefetch — if two clients open identical streams on a cold slot, the slot fires one upstream call and fans the token stream to both.
- Adaptive cold-boot — the first request after a slot reaches `ready` keeps the connection open while the model finishes warming; you don’t get a 503 on a request that’s about to work.
- Structured errors mid-stream — if the slot transitions to `error` part-way through, the stream emits one final SSE event with the structured error envelope before closing.
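Client code should treat a mid-stream error event as fatal rather than as more tokens. A defensive sketch, assuming the final event's JSON carries a top-level `error` object — the exact envelope shape is hal0-specific, and `StreamError` and the field names here are illustrative assumptions:

```python
import json

class StreamError(Exception):
    """Raised when the stream ends with an error envelope."""

def iter_deltas(lines):
    """Yield delta content from raw SSE `data:` lines, raising
    StreamError if an error envelope arrives before [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        if "error" in event:  # assumed envelope key
            raise StreamError(event["error"].get("message", "stream error"))
        delta = event["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]
```

The caller wraps consumption in a try/except and can retry or surface the structured message as appropriate.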