API reference

Base URL: https://porten.ai/v1. Authenticate with Authorization: Bearer sk-porten-…. The surface is OpenAI-compatible, so any OpenAI SDK works by overriding base_url.
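As a minimal sketch of the authentication scheme above, the following builds (but does not send) a chat request using only Python's standard library. The helper name `build_chat_request` and the key `sk-porten-EXAMPLE` are illustrative placeholders, not part of the API.

```python
import json
import urllib.request

BASE_URL = "https://porten.ai/v1"  # base URL stated on this page


def build_chat_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /chat/completions (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "sk-porten-EXAMPLE",  # placeholder, not a real key
    {
        "model": "qwen2.5-coder-32b",
        "messages": [{"role": "user", "content": "Write a haiku about the aurora."}],
    },
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

With an OpenAI SDK the same two values (the base URL and the key) are what you would pass when constructing the client.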

POST /v1/chat/completions

The core endpoint. It supports both streaming and non-streaming responses.

Request:

{
  "model": "qwen2.5-coder-32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Write a haiku about the aurora." }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}

Response (non-streaming):

{
  "id": "chatcmpl-porten-7f3a2b",
  "object": "chat.completion",
  "model": "qwen2.5-coder-32b",
  "choices": [
    { "index": 0,
      "message": { "role": "assistant", "content": "Green fire dances…" },
      "finish_reason": "stop" }
  ],
  "usage": { "prompt_tokens": 32, "completion_tokens": 41, "total_tokens": 73 }
}

Response (streaming, stream: true) — OpenAI-style SSE:

data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Green"},"finish_reason":null}]}

data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
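One way to consume the stream shown above is to read it line by line, decode each `data:` payload as JSON, and stop at the `[DONE]` sentinel. This is a sketch rather than a full SSE parser: the function name `iter_content` is hypothetical, and the id `"x"` stands in for the elided id in the sample.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"id":"x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Green"},"finish_reason":null}]}',
    "",
    'data: {"id":"x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
answer = "".join(iter_content(sample))
```

The final chunk carries an empty `delta` and a `finish_reason`, so it contributes no text; only the first chunk's `"Green"` ends up in `answer`.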

Parameters

Param	Status	Behaviour
`model`	✅	Canonical id; unknown → `404 model_not_found`
`messages`	✅	`system` / `user` / `assistant` / `tool` roles
`stream`	✅	SSE chunks
`stream_options.include_usage`	✅	Hub fills `usage` from its own count
`max_tokens` / `max_completion_tokens`	✅	Both accepted
`temperature`, `top_p`, `stop`, `seed`	⚠️	Passed to the engine; `seed` is honoured only if the engine supports it
`presence_penalty`, `frequency_penalty`	⚠️	Passed through; ignored if the engine lacks them
`response_format`	✅	JSON mode / JSON schema. Routed to a model whose `json` capability is verified (the set `/v1/models` shows); forwarded to the engine. A served model that can't honour it → `400`
`tools`, `tool_choice`	✅	Routed to a model whose `tools` capability is verified (the set `/v1/models` shows); response carries `tool_calls` + `finish_reason: "tool_calls"`. A served model without tool support → `400` (clear message), not a capacity error
`content` with `image_url`	✅	Routed to a model whose `vision` capability is verified. Inline `data:` images are forwarded to the model; remote `http(s)` image URLs are not fetched (SSRF protection) — inline them as data URLs. A served model without vision support → `400` (clear message), not a capacity/`node_error`
`content` with `input_audio`	⚠️	Routed to a model whose `audio` capability is verified. Most engines can't actually deliver audio to the model (e.g. Ollama silently drops it), so `audio` is rarely verified → `400`. Never a silent wrong answer
`n`	⚠️	Only `n=1`; `n>1` → `400 unsupported_parameter`
`user`, `metadata`	✅	Logged for usage/abuse

Capabilities are verified, not declared. A model's features in /v1/models (chat, tools, vision, audio, thinking, json) are the capabilities a serving node has empirically passed a probe for — the Hub runs a tiny probe per capability when a model loads (ADR-0026). Routing only sends a capability-bearing request to a node that verified it, so a model can never be sent input its served build can't handle. An operator cannot hand-enable a capability a model doesn't actually have.

Principle: unknown convenience params are ignored silently (forward-compatible); params that would change semantics but can't be honoured (n>1, or a capability the model hasn't verified) are rejected with 400 rather than silently producing the wrong result.
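Because remote image URLs are not fetched, image input has to be inlined as a data URL. The sketch below shows one way to build such a content part. It assumes the standard OpenAI content-part shape (`type: "image_url"` with a nested `url`); the helper name `image_part` is hypothetical, and the bytes used are only the 8-byte PNG signature, not a real image.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build an OpenAI-style image_url content part with an inline data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this picture?"},
        image_part(b"\x89PNG\r\n\x1a\n"),  # stand-in bytes, not a full image
    ],
}
```

In practice `image_bytes` would come from reading a local file in binary mode.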

Reasoning models

Models that "think" (e.g. DeepSeek-R1 family) return their reasoning separately as reasoning_content (a delta field in streaming), kept distinct from the answer text — so you can show or hide the chain of thought.

Budget enough tokens. Reasoning is generated before the answer and counts against max_tokens. With a small cap the model can spend the whole budget thinking: you then get finish_reason: "length", content as an empty string "", and the thinking so far in reasoning_content. For reasoning models, set max_tokens to at least 4096 (more for hard prompts).

Usage counts reasoning. usage.completion_tokens includes reasoning tokens — they are generated output and are billed like answer tokens (same as OpenAI's reasoning models).
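The truncation case described above can be detected on the client. The following sketch assumes that, in a non-streaming response, `reasoning_content` sits next to `content` on the choice's `message`; the function name `split_reasoning` and the sample response are illustrative.

```python
def split_reasoning(response: dict) -> tuple[str, str, bool]:
    """Return (answer, reasoning, truncated) for the first choice."""
    choice = response["choices"][0]
    message = choice["message"]
    answer = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    # "length" with an empty answer means the budget was spent on thinking.
    truncated = choice.get("finish_reason") == "length" and answer == ""
    return answer, reasoning, truncated


# Hypothetical response from a reasoning model that ran out of tokens.
starved = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "",
            "reasoning_content": "First, consider...",
        },
        "finish_reason": "length",
    }]
}
answer, reasoning, truncated = split_reasoning(starved)
next_max_tokens = 4096 if truncated else None  # floor recommended above
```

When `truncated` is true, resending the request with a larger `max_tokens` is the straightforward remedy.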

POST /v1/embeddings

{ "model": "nomic-embed-text", "input": ["text to embed", "and another"] }

{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0123, -0.045] },
    { "object": "embedding", "index": 1, "embedding": [0.0210, -0.011] }
  ],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 12, "total_tokens": 12 }
}
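A common next step with an embeddings response is to compare vectors. The sketch below computes cosine similarity over the sample response above, using only the standard library. The two-element vectors are the abbreviated values from the sample, and `cosine_similarity` is an illustrative helper, not something the API provides.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.0123, -0.045]},
        {"object": "embedding", "index": 1, "embedding": [0.0210, -0.011]},
    ],
}
# Sort by index so vectors line up with the order of the input strings.
ordered = sorted(response["data"], key=lambda item: item["index"])
vectors = [item["embedding"] for item in ordered]
similarity = cosine_similarity(vectors[0], vectors[1])
```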

POST /v1/images/generations

Text-to-image generation, OpenAI-compatible. Routed to a node whose image-generation capability is verified (the set /v1/models shows) — a node that can actually run the diffusion model. Served on EU-sovereign community GPUs; the models are commercially usable and self-hosted.

{ "model": "flux.1-schnell", "prompt": "a stylized editorial illustration of a lakeside town hall, flat vector", "size": "1024x1024" }

{
  "created": 1730000000,
  "data": [
    { "b64_json": "iVBORw0KGgoAAAANSUhEUgAA…" }
  ]
}

Field	Status	Notes
`prompt`	✅	The image description (required).
`size`	✅	`WxH`, e.g. `1024x1024` or a `1200x630`-ish landscape for share cards. Defaults to `1024x1024`.
`n`	1 only	One image per request; `n>1` → `400 unsupported_parameter`.
`response_format`	`b64_json`	The image bytes come back base64-encoded in `data[].b64_json` (the default and only format today).
`negative_prompt`	accepted	Advisory. No-op on guidance-distilled models like FLUX.1 [schnell] (they ignore it) — put editorial guardrails (no faces/text/logos/photorealism) in the positive prompt.
`seed`, `steps`	optional	Reproducibility / quality-vs-speed hints, where the model honours them.
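Since images come back base64-encoded in `data[].b64_json`, a client has to decode them before saving or displaying. A minimal sketch follows; the helper name `decode_images` is hypothetical, and the payload is the base64 of the 8-byte PNG signature rather than a full image.

```python
import base64


def decode_images(response: dict) -> list[bytes]:
    """Decode every data[].b64_json entry into raw image bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]


# Stand-in payload: only the PNG signature, not a complete image.
response = {"created": 1730000000, "data": [{"b64_json": "iVBORw0KGgo="}]}
images = decode_images(response)
is_png = images[0].startswith(b"\x89PNG\r\n\x1a\n")
# To save a real image: pathlib.Path("out.png").write_bytes(images[0])
```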

Image generation is being rolled out — availability depends on a node serving an image model. If none is warm yet you'll get 503 no_available_node (retry shortly) or, where it isn't enabled, 501.

GET /v1/models

Lists every offered model, aggregated and deduplicated across the fleet.

{
  "object": "list",
  "data": [
    { "id": "qwen2.5-coder-32b", "object": "model", "owned_by": "porten",
      "x_porten": { "ready": true, "type": "chat", "ctx": 32768 } },
    { "id": "qwen3-coder-next", "object": "model", "owned_by": "porten",
      "x_porten": { "ready": false, "type": "chat", "ctx": 262144 } }
  ]
}

ready: false means the model is offered but not loaded this instant — your first request will trigger an on-demand load. See Models & on-demand loading.
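A client that wants to avoid cold starts can split the list by the `ready` flag before choosing a model. The sketch below does that over the sample response above; `split_by_readiness` is an illustrative name, and treating a missing `x_porten.ready` as not ready is an assumption.

```python
def split_by_readiness(models_response: dict) -> tuple[list[str], list[str]]:
    """Return (warm_ids, cold_ids) based on x_porten.ready."""
    warm, cold = [], []
    for model in models_response["data"]:
        # Assumption: a missing flag is treated as "not loaded".
        ready = model.get("x_porten", {}).get("ready", False)
        (warm if ready else cold).append(model["id"])
    return warm, cold


models_response = {
    "object": "list",
    "data": [
        {"id": "qwen2.5-coder-32b", "object": "model", "owned_by": "porten",
         "x_porten": {"ready": True, "type": "chat", "ctx": 32768}},
        {"id": "qwen3-coder-next", "object": "model", "owned_by": "porten",
         "x_porten": {"ready": False, "type": "chat", "ctx": 262144}},
    ],
}
warm, cold = split_by_readiness(models_response)
```

Models in `cold` are still usable; the first request simply pays the load time.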

Headers

Header	Direction	Note
`Authorization: Bearer sk-porten-…`	in	Required
`X-Request-Id`	out	Correlation id, echoed in logs
`X-Porten-Node`	out	Which node served the request
`Retry-After`	out	On `429` / `503`

Errors

All errors follow OpenAI's format: {"error":{"message","type","code","param"}}.

HTTP	`code`	Meaning
400	`unsupported_parameter` / `unsupported_content`	A param can't be honoured (e.g. `n>1`), or the model can't process the request content (e.g. image input the served build doesn't support)
401	`invalid_api_key`	Invalid or revoked key
403	`model_not_allowed`	The key may not use this model (region/policy)
404	`model_not_found`	No node advertises this model and it isn't offered
429	`rate_limit_exceeded`	Quota/rate exhausted (`Retry-After`)
502	`node_error`	All candidate nodes failed (a transient/node fault, after retries)
503	`no_available_node` / `model_warming`	Model exists but no free/healthy node, or it's still loading (`Retry-After`)
504	`gateway_timeout`	Total timeout exceeded
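Because every error shares the same envelope, a client can map any failing response to a single exception type. The sketch below is one way to do that. The class `ApiError`, the helper `raise_for_error`, and the `message` and `type` values in the sample body are illustrative assumptions; only the envelope keys and the `unsupported_parameter` code come from this page.

```python
class ApiError(Exception):
    """Exception carrying the fields of the OpenAI-style error envelope."""

    def __init__(self, status: int, body: dict):
        err = body.get("error", {})
        self.status = status
        self.code = err.get("code")
        self.type = err.get("type")
        self.param = err.get("param")
        super().__init__(f"{status} {self.code}: {err.get('message')}")


def raise_for_error(status: int, body: dict) -> None:
    """Raise ApiError for any 4xx/5xx status; do nothing otherwise."""
    if status >= 400:
        raise ApiError(status, body)


# Hypothetical body for a request that set n=2.
sample_body = {"error": {"message": "Only n=1 is supported",
                         "type": "invalid_request_error",
                         "code": "unsupported_parameter",
                         "param": "n"}}
try:
    raise_for_error(400, sample_body)
    caught = None
except ApiError as exc:
    caught = exc
```

Branching on `caught.code` rather than on the message text keeps client logic stable if wording changes.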

A 503 model_warming is expected when you hit a cold model for the first time and the load takes longer than the request's warm-up budget. Retry after the interval given in Retry-After; the model should be ready shortly. Most clients never see this error because the request blocks until the model is ready.
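Retrying on `429` and `503` while respecting `Retry-After` can be sketched as below. The transport is injected as a `send` callable so the example runs offline. The names `retry_delay` and `call_with_retry`, the 2-second default, the attempt limit, and the exact-case header lookup are all assumptions, not documented behaviour.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

RETRYABLE_STATUSES = {429, 503}  # statuses this page says carry Retry-After


def retry_delay(status: int, headers: Mapping[str, str],
                default: float = 2.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the status is not retryable."""
    if status not in RETRYABLE_STATUSES:
        return None
    try:
        return max(0.0, float(headers.get("Retry-After", default)))
    except ValueError:
        return default  # non-numeric value (e.g. an HTTP date): use the default


def call_with_retry(send: Callable[[], Tuple[int, Mapping[str, str], dict]],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, dict]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        delay = retry_delay(status, headers)
        if delay is None or attempt == max_attempts:
            return status, body
        sleep(delay)
    raise AssertionError("unreachable")


# Simulated transport: one model_warming response, then success.
responses = iter([
    (503, {"Retry-After": "1"}, {"error": {"code": "model_warming"}}),
    (200, {}, {"object": "chat.completion"}),
])
waits: list = []
status, body = call_with_retry(lambda: next(responses), sleep=waits.append)
```

Passing `sleep=waits.append` records the delays instead of actually waiting, which is convenient in tests; production code would keep the default `time.sleep`.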

📄 Reading as a machine? This page is available as raw Markdown at https://porten.ai/docs/api-reference.md — or grab the whole site via llms.txt / llms-full.txt.