AI Endpoints

AI Endpoints are the core building block of an ai_gateway project. Each endpoint defines a fixed system prompt, a provider configuration, and runtime settings. Your application sends user input to the endpoint URL; PromptGate runs guardrails, applies the prompts, calls the upstream model, validates the output, logs the result, and returns the response.

```
POST /api/{project_uuid}/{endpoint_slug}
┌───────────────────────────┐
│ Token + scope check       │
│ Rate limit                │
│ Budget check              │
│ Guardrails                │
│ Input schema validation   │
│ System prompt + template  │
│ Provider call (+failover) │
│ Output schema validation  │
│ Log + audit               │
└───────────────────────────┘
Response
```

In your ai_gateway project: sidebar → AI Endpoints → + New endpoint.

The wizard has 7 tabs. Only the first two are required; the rest have sensible defaults.

| Field | Notes |
|---|---|
| Name | Shown in the UI and logs. |
| Slug | URL path segment. Auto-generated from name; editable. |
| Description | Optional notes for human readers. |
| Expose as MCP tool | Toggle. When on, the project’s MCP Bridge serves this endpoint to MCP clients. |

The endpoint URL is POST /api/{project_uuid}/{slug}.

Two modes:

Provider Template (recommended) — pick a pre-baked bundle that already pairs a provider, model, and credential. The endpoint inherits everything; you can override temperature, top_p, max_tokens on the endpoint without touching the template.

Manual — pick each piece individually:

  • Provider — openai, anthropic, google, mistral, groq, together, ollama, cohere
  • Model — provider-specific identifier (gpt-4o-mini, claude-3-5-sonnet-20241022, …)
  • Credential — list filtered by provider

Failover — an optional list of backup (credential, model) pairs. When the primary call fails (5xx, timeout, or rate limit), PromptGate retries with the next entry. Runtime settings (temperature, top_p, max_tokens) are reused from the endpoint, not taken from the failover entry.
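
PromptGate’s internal retry loop isn’t published; conceptually, the failover order described above behaves like this sketch (`call_with_failover` and `call_provider` are hypothetical names):

```python
def call_with_failover(primary, failover, runtime, call_provider):
    """Try the primary (credential, model) pair, then each failover pair in order.
    Runtime settings come from the endpoint, never from the failover entry."""
    last_err = None
    for credential, model in [primary, *failover]:
        try:
            return call_provider(credential, model, **runtime)
        except Exception as err:  # 5xx, timeout, rate limit
            last_err = err
    raise last_err
```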

Runtime settings:

| Field | Range | Notes |
|---|---|---|
| temperature | 0–2 | 0 = deterministic, 1 = default, 2 = creative |
| top_p | 0–1 | Nucleus sampling. Don’t tune both temp and top_p. |
| max_output_tokens | 1–200 000 | Hard cap per response. |

Cost / volume protection. Empty fields = unlimited.

| Field | Type | Behaviour |
|---|---|---|
| usage_hard_limit_tokens | int | Per-request cap on input tokens (~4 chars/token estimate). 422 if exceeded. |
| monthly_budget_usd | decimal | Cumulative ceiling per calendar month. |
| estimated_cost_per_1k_tokens_usd | decimal | Required if you set monthly_budget_usd; used to compute spend. |
| rate_limit_per_minute | int | 429 + Retry-After when exceeded. |
| rate_limit_per_hour | int | Independent second window. |

See Budgets and Rate Limits for the enforcement details.
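
The arithmetic behind usage_hard_limit_tokens and monthly_budget_usd can be sketched as follows, assuming the ~4 chars/token heuristic and the spend formula implied by the table (function names are illustrative):

```python
def estimate_tokens(text: str) -> int:
    # Rough ~4 characters/token estimate used for the per-request cap.
    return max(1, len(text) // 4)

def estimated_cost_usd(total_tokens: int, cost_per_1k_usd: float) -> float:
    # Spend accrued against monthly_budget_usd.
    return total_tokens / 1000 * cost_per_1k_usd
```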

Toggle Server-Sent Events. When on and the client sends "stream": true:

  • Response is streamed as data:-prefixed JSON chunks
  • Each chunk has the OpenAI Chat Completion shape (delta inside choices[0].delta.content)
  • Final chunk is data: [DONE]

When off, stream: true from the client is ignored.

See Streaming for the SSE format and client examples.

Server-side conversation state. When enabled:

| Field | Notes |
|---|---|
| session_ttl_seconds | Auto-expire idle sessions (60–604800). |
| session_max_messages | Cap on the conversation length (1–500). |
| session_max_tokens | Optional total token cap. |

The gateway creates a session on the first request and returns meta.session_id. Subsequent requests include session_id to resume.

See Sessions for the flow + edge cases.

| Field | Notes |
|---|---|
| prompt | System message prepended to every request. |
| user_prompt_template | Wraps the user’s input. Use {{input}} as the placeholder. |

If user_prompt_template is empty, the user’s message is passed through unchanged.

Example template:

```
You are responding to a customer support ticket.
Ticket from user:
{{input}}
Reply concisely.
```
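
Assuming the template does a plain literal substitution of {{input}} (the exact templating rules aren’t specified here), rendering behaves like:

```python
def render_user_prompt(template: str, user_input: str) -> str:
    # Empty template: the user's message passes through unchanged.
    if not template:
        return user_input
    return template.replace("{{input}}", user_input)
```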

JSON Schema validation, both directions.

| Field | Notes |
|---|---|
| input_schema | Validates the request payload. 422 if invalid. |
| output_schema | Validates the model’s response content. 502 if invalid. |

See JSON Schema Validation.
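
A minimal, hypothetical input_schema that would accept the {"message": "..."} payloads used on this page:

```json
{
  "type": "object",
  "properties": {
    "message": { "type": "string", "minLength": 1 }
  },
  "required": ["message"]
}
```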

```sh
curl -X POST $URL/api/$UUID/$SLUG \
  -H "Authorization: Bearer pg_live_..." \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain quantum computing."}'
```

```sh
curl -X POST $URL/api/$UUID/$SLUG \
  -H "Authorization: Bearer pg_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is AI?"},
      {"role": "assistant", "content": "AI is..."},
      {"role": "user", "content": "Tell me more."}
    ]
  }'
```

```python
import os, requests

resp = requests.post(
    f"{os.environ['PG_URL']}/api/{os.environ['PG_UUID']}/{os.environ['PG_SLUG']}",
    headers={"Authorization": f"Bearer {os.environ['PG_TOKEN']}"},
    json={"message": "Explain quantum computing."},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])
```

```js
const url = `${process.env.PG_URL}/api/${process.env.PG_UUID}/${process.env.PG_SLUG}`;

const r = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.PG_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ message: 'Explain quantum computing.' }),
});
const data = await r.json();
console.log(data.content);
```

First request creates the session:

```
POST .../my-chat
{ "message": "Hello, my name is Sam." }
```

Response:

```json
{
  "ok": true,
  "content": "Nice to meet you, Sam!",
  "meta": { "session_id": "0e2f...c4" }
}
```

Subsequent requests pass session_id:

```
POST .../my-chat
{
  "message": "What's my name?",
  "session_id": "0e2f...c4"
}
```

The gateway prepends the stored history before calling the model.
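
The flow above can be sketched client-side; `chat_with_session` is an illustrative helper, and `post` is any callable that sends the JSON payload to the endpoint and returns the parsed response:

```python
def chat_with_session(post, first_msg, followup_msg):
    """First call creates the session; the follow-up resumes it via
    the meta.session_id returned by the gateway."""
    first = post({"message": first_msg})
    session_id = first["meta"]["session_id"]
    return post({"message": followup_msg, "session_id": session_id})
```

Wire `post` to the real endpoint with, e.g., `post = lambda p: requests.post(url, headers=headers, json=p, timeout=120).json()`.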

```sh
curl -N -X POST $URL/api/$UUID/$SLUG \
  -H "Authorization: Bearer pg_live_..." \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a poem.", "stream": true}'
```

The -N flag disables curl’s output buffering so chunks land as they arrive.
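
Outside curl, a client has to split the stream on data: lines itself. A minimal Python sketch, assuming the OpenAI-style chunk shape described above:

```python
import json

def parse_sse_lines(lines):
    """Accumulate content from data:-prefixed SSE chunks until data: [DONE]."""
    out = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            out.append(delta)
    return "".join(out)
```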

The index page shows every endpoint for the project with:

  • Name + slug (with eye icon → bridge URL modal)
  • Provider or template chip
  • ON/OFF status (deactivated endpoints reject all calls)
  • MCP chip when expose_as_mcp_tool=true
  • Edit / Deactivate actions

The KPI strip at the top shows total / streaming / sessions / MCP tools counts.

Clicking a row opens the detail page with:

  • Live status (1h KPIs: RPS, p95, errors)
  • Routing policy (provider, model, credential, failover chain)
  • Recent trace (last 10 requests)
  • Prompt + schema preview
  • MCP card when exposed (tool name, bridge URL, curl example)

Tokens with admin scope can list endpoints:

```sh
curl $URL/api/$UUID/endpoints \
  -H "Authorization: Bearer pg_live_..."
```

Returns:

```json
{
  "ok": true,
  "data": [
    { "uuid": "...", "name": "Hello World", "slug": "hello-world", "is_active": true, "expose_as_mcp_tool": false }
  ]
}
```

| Situation | Response |
|---|---|
| No bearer token | 401 |
| Token from a different project | 403 |
| Wrong scope (admin-only token on a chat endpoint) | 403 |
| Endpoint inactive | 404 |
| Per-minute rate limit hit | 429 + Retry-After |
| Per-request token cap exceeded | 422 |
| Monthly budget exhausted | 422 |
| Guardrail blocks the request | 422 |
| Input schema invalid | 422 |
| Provider error after failover | 502 |
| Output schema invalid | 502 |
| Successful chat | 200 |
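
Of these, 429 is the one worth handling automatically, since the gateway tells you when to retry. A sketch using the Retry-After header (`post_with_retry` is an illustrative helper, not part of any PromptGate SDK):

```python
import time
import requests

def post_with_retry(url, headers, payload, max_attempts=3):
    """POST to an endpoint, sleeping on 429 per Retry-After; raise on other errors."""
    for _ in range(max_attempts):
        resp = requests.post(url, headers=headers, json=payload, timeout=120)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(int(resp.headers.get("Retry-After", "1")))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```
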
  • Locked-down product endpoint — set system prompt, input/output schema, low monthly budget, mask-mode PII filter, expose_as_mcp_tool=false.
  • Internal-tools agent — sessions on, no schemas, generous limits, expose_as_mcp_tool=true, MCP scope token issued to the agent.
  • Public-facing demo — strict per-minute rate limit, low monthly budget, content-length guardrail, no sessions.

Next: AI Wrapper — OpenAI-compatible mode.


© Akyros Labs LLC. All rights reserved.