Streaming

AI Gateway endpoints can stream responses as Server-Sent Events in OpenAI-compatible format. The client opts in per-request via "stream": true; the gateway decides whether to honour it based on the endpoint’s streaming_enabled flag.

To enable it, open the endpoint wizard, Tab 4 (Streaming), and tick the toggle.

When the toggle is off, stream: true from the client is ignored — the gateway buffers the full response and returns a normal JSON body.

When the toggle is on AND the request body has "stream": true, the gateway streams.

The gateway speaks OpenAI’s SSE shape regardless of which provider serves the request:

data: {"choices":[{"delta":{"content":"Hi"},"index":0}]}
data: {"choices":[{"delta":{"content":" there"},"index":0}]}
data: {"choices":[{"delta":{"content":"!"},"index":0}]}
data: [DONE]

Each event:

  • Starts with data: (note the space).
  • Followed by JSON or the literal [DONE].
  • Terminated by a blank line (\n\n).

This works whether the underlying provider is OpenAI (native SSE), Anthropic (different event type names that the adapter rewrites), or Cohere (different again). Every provider adapter that supports streaming converts upstream chunks to this OpenAI shape, so client code stays portable across providers.
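
For illustration, here is a minimal sketch of what such an adapter does, assuming a hypothetical to_openai_chunk helper and Anthropic-style upstream events (PromptGate's real adapter interface isn't shown in these docs):

import json

# Hypothetical adapter internals: rewrite one upstream Anthropic-style event
# into the OpenAI-compatible SSE line shown above, or drop it.
def to_openai_chunk(provider_event: dict) -> str | None:
    if provider_event.get("type") == "content_block_delta":
        text = provider_event["delta"].get("text", "")
        chunk = {"choices": [{"delta": {"content": text}, "index": 0}]}
        return f"data: {json.dumps(chunk)}\n\n"
    if provider_event.get("type") == "message_stop":
        return "data: [DONE]\n\n"
    return None  # pings and other metadata events are not forwarded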

Consume the stream with curl:
curl -N -X POST $URL/api/$UUID/$SLUG \
  -H "Authorization: Bearer pg_live_..." \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a haiku about Mondays.", "stream": true}'

The -N flag is critical — it disables curl’s output buffering so chunks land as they arrive.

Python:

import json
import os

import requests

url = f"{os.environ['PG_URL']}/api/{os.environ['PG_UUID']}/{os.environ['PG_SLUG']}"
headers = {"Authorization": f"Bearer {os.environ['PG_TOKEN']}"}

with requests.post(
    url,
    headers=headers,
    json={"message": "Write a poem.", "stream": True},
    stream=True,  # let requests hand us the body incrementally
    timeout=120,
) as r:
    r.raise_for_status()
    for raw in r.iter_lines():
        # Skip keep-alive blank lines and anything that isn't a data: field.
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[6:]  # drop the "data: " prefix
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)

JavaScript:

const url = `${process.env.PG_URL}/api/${process.env.PG_UUID}/${process.env.PG_SLUG}`;
const r = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.PG_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ message: 'Write a poem.', stream: true }),
});
if (!r.ok) throw new Error(`HTTP ${r.status}`);

// Split the byte stream into lines; keep any partial line for the next chunk.
const reader = r.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
read: while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') break read;  // end of stream
    const chunk = JSON.parse(data);
    process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
  }
}

If you’re using the AI Wrapper project type, you can use the OpenAI SDK’s streaming helpers as-is:

from openai import OpenAI

client = OpenAI(base_url=f"{PG_URL}/api/{UUID}/v1", api_key=TOKEN)
stream = client.chat.completions.create(
    model="fast",
    messages=[{"role": "user", "content": "Write a poem."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

PromptGate sets these headers on streaming responses:

Content-Type: text/event-stream
Cache-Control: no-cache
X-Accel-Buffering: no (disables nginx buffering when present)
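
A quick way to confirm an endpoint is actually streaming (an illustrative check; environment variables as in the Python example above):

import os
import requests

r = requests.post(
    f"{os.environ['PG_URL']}/api/{os.environ['PG_UUID']}/{os.environ['PG_SLUG']}",
    headers={"Authorization": f"Bearer {os.environ['PG_TOKEN']}"},
    json={"message": "ping", "stream": True},
    stream=True,
    timeout=30,
)
print(r.headers.get("Content-Type"))       # text/event-stream when streaming
print(r.headers.get("X-Accel-Buffering"))  # no
r.close()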

If you’re behind nginx, ensure proxy_buffering off; for the streaming route — otherwise nginx will buffer the SSE chunks and the client experiences chunked-but-not-streamed responses.
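
For example, a sketch of the relevant nginx location block (the path and upstream address are assumptions; match them to your deployment):

location /api/ {
    proxy_pass http://127.0.0.1:8787;  # wherever PromptGate listens (assumption)
    proxy_buffering off;               # hand SSE chunks to the client immediately
    proxy_cache off;
    proxy_read_timeout 300s;           # allow long-lived streams
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}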

For Caddy, no special config is needed.

Streaming composes with sessions: pass session_id and stream: true together. The streaming response contains the same [DONE] marker; the gateway records the assistant's full content into the session after the stream closes. The next request with that session ID sees the full transcript.
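
For example, reusing the url and headers variables from the Python example above (the session ID is illustrative):

body = {
    "message": "Continue the story.",
    "session_id": "sess_123",  # illustrative; send the same ID on the next request
    "stream": True,
}
with requests.post(url, headers=headers, json=body, stream=True, timeout=120) as r:
    r.raise_for_status()
    for raw in r.iter_lines():
        ...  # parse the SSE lines exactly as in the Python example above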

PromptGate doesn't implement SSE reconnection (no id: markers, no retry: hints). If the connection drops mid-stream, the partial response is all the client gets. Re-issue the request to start over.
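
If you want automatic retries, a simple client-side pattern is to re-issue the whole request on a dropped connection (a sketch; each retry regenerates from scratch, so output is not resumed):

import requests

def stream_with_retry(url, headers, body, attempts=3):
    """Yield raw SSE lines; re-issue the full request if the connection drops."""
    for attempt in range(1, attempts + 1):
        try:
            with requests.post(url, headers=headers, json=body,
                               stream=True, timeout=120) as r:
                r.raise_for_status()
                yield from r.iter_lines()
            return  # stream closed normally
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ChunkedEncodingError):
            if attempt == attempts:
                raise  # out of retries; caller sees the error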

Situation → what happens:

  • Client sends stream: true, endpoint streaming off → normal JSON response.
  • Client sends stream: true, endpoint streaming on → SSE response.
  • Network error mid-stream → connection closes; partial output retained client-side.
  • Provider error mid-stream → an error event, then [DONE] (depending on the provider adapter).
  • Output schema set on endpoint → streaming is disabled at the wizard level.

Next: Providers Overview.

