API reference
Base URL: https://porten.ai/v1. Authenticate with Authorization: Bearer sk-porten-…. The surface is OpenAI-compatible, so any OpenAI SDK works by overriding base_url.
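As a minimal sketch of the authentication scheme above, the following builds (but does not send) an authenticated request using only the Python standard library. The helper name `build_request` and the placeholder key are illustrative assumptions, not part of the API.

```python
import json
import urllib.request

BASE_URL = "https://porten.ai/v1"  # base URL stated on this page


def build_request(api_key: str, path: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key; substitute a real one before sending.
req = build_request(
    "sk-porten-<your-key>",
    "/chat/completions",
    {"model": "qwen2.5-coder-32b", "messages": [{"role": "user", "content": "Hi"}]},
)
# To actually send it: urllib.request.urlopen(req), then json.load() the response.
```

Because the surface is OpenAI-compatible, an OpenAI SDK pointed at the same base URL should be able to replace this helper entirely.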
POST /v1/chat/completions
The core endpoint. Supports both streaming and non-streaming responses.
Request:
{
  "model": "qwen2.5-coder-32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Write a haiku about the aurora." }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
Response (non-streaming):
{
  "id": "chatcmpl-porten-7f3a2b",
  "object": "chat.completion",
  "model": "qwen2.5-coder-32b",
  "choices": [
    { "index": 0,
      "message": { "role": "assistant", "content": "Green fire dances…" },
      "finish_reason": "stop" }
  ],
  "usage": { "prompt_tokens": 32, "completion_tokens": 41, "total_tokens": 73 }
}
Response (streaming, stream: true) — OpenAI-style SSE:
data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Green"},"finish_reason":null}]}
data: {"id":"…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
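The stream format above can be consumed line by line. The sketch below assumes the lines have already been read from the HTTP response and decoded to text; `iter_content` is a hypothetical helper name, and the sample lines mirror the ones shown above with a stand-in id.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield answer-text fragments from OpenAI-style SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(data)
        # A chunk may carry no choices (for example a usage-only chunk).
        for choice in chunk.get("choices") or []:
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


sample = [
    'data: {"id":"x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Green"},"finish_reason":null}]}',
    "",
    'data: {"id":"x","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_content(sample))
```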
Parameters
| Param | Status | Behaviour |
|---|---|---|
| model | ✅ | Canonical id; unknown → 404 model_not_found |
| messages | ✅ | system / user / assistant / tool roles |
| stream | ✅ | SSE chunks |
| stream_options.include_usage | ✅ | Hub fills usage from its own count |
| max_tokens / max_completion_tokens | ✅ | Both accepted |
| temperature, top_p, stop, seed | ⚠️ | Passed to the engine; seed honoured only if it supports it |
| presence_penalty, frequency_penalty | ⚠️ | Passed through; ignored if the engine lacks them |
| response_format | ✅ | JSON mode / JSON schema, forwarded to the engine (best-effort per engine) |
| tools, tool_choice | ✅ | Routed to a model whose catalog entry declares tools (the same capability /v1/models shows); response carries tool_calls + finish_reason: "tool_calls". A model without tool support → 400 (clear message), not a capacity error |
| content with image_url | ✅ | Inline data: images forwarded to vision models. Remote http(s) image URLs are not fetched (SSRF protection); inline them as data URLs |
| n | ⚠️ | Only n=1; n>1 → 400 unsupported_parameter |
| user, metadata | ✅ | Logged for usage/abuse |
Principle: unknown convenience params are ignored silently (forward-compatible); params that would change semantics but can't be honoured (n>1) are rejected with 400 rather than silently producing the wrong result.
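Since remote image URLs are not fetched, images have to be inlined by the client. The sketch below shows one way to do that, assuming the OpenAI-style content-part shape (`type: "image_url"`); the helper name `image_part` and the stand-in bytes are illustrative only.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an inline data-URL content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        image_part(b"\x89PNG\r\n\x1a\n"),  # stand-in bytes, not a complete PNG
    ],
}
```

In practice the bytes would come from reading the image file in binary mode, with `mime_type` set to match the file.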
Reasoning models
Models that "think" (e.g. DeepSeek-R1 family) return their reasoning separately as reasoning_content (a delta field in streaming), kept distinct from the answer text — so you can show or hide the chain of thought.
Budget enough tokens. Reasoning is generated before the answer and counts against max_tokens. With a small cap the model can spend the whole budget thinking and you get finish_reason: "length" with empty content. For reasoning models, set max_tokens to at least 4096 (more for hard prompts).
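A client can keep the two streams of text apart and detect the exhausted-budget case described above. This is a sketch under the assumption that `reasoning_content` arrives in the same `delta` object as `content`; the function names and the sample chunks are made up for illustration.

```python
def split_reasoning(chunks):
    """Accumulate reasoning and answer text separately from stream chunks."""
    reasoning, answer, finish = [], [], None
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            reasoning.append(delta.get("reasoning_content") or "")
            answer.append(delta.get("content") or "")
            finish = choice.get("finish_reason") or finish
    return "".join(reasoning), "".join(answer), finish


def budget_exhausted(answer: str, finish_reason) -> bool:
    """True when the token cap was hit before any answer text was produced."""
    return finish_reason == "length" and not answer


# Made-up chunks illustrating a run that hit the cap while still reasoning.
chunks = [
    {"choices": [{"index": 0, "delta": {"reasoning_content": "First, consider"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "length"}]},
]
reasoning, answer, finish = split_reasoning(chunks)
if budget_exhausted(answer, finish):
    print("Reasoning used the whole budget; retry with a larger max_tokens.")
```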
POST /v1/embeddings
Request:
{ "model": "nomic-embed-text", "input": ["text to embed", "and another"] }
Response:
{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0123, -0.045] },
    { "object": "embedding", "index": 1, "embedding": [0.0210, -0.011] }
  ],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 12, "total_tokens": 12 }
}
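A typical next step is comparing the returned vectors. The sketch below pulls the embeddings out of a response in `index` order and computes cosine similarity with the standard library; the helper names are hypothetical, and the two-element vectors are the shortened illustrations from the example above rather than real model output.

```python
import math


def vectors(response: dict) -> list:
    """Return embedding vectors ordered by their `index` field."""
    rows = sorted(response["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.0210, -0.011]},
        {"object": "embedding", "index": 0, "embedding": [0.0123, -0.045]},
    ],
}
first, second = vectors(response)
similarity = cosine(first, second)
```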
GET /v1/models
Every offered model, aggregated and deduplicated across the fleet.
{
  "object": "list",
  "data": [
    { "id": "qwen2.5-coder-32b", "object": "model", "owned_by": "porten",
      "x_porten": { "ready": true, "type": "chat", "ctx": 32768 } },
    { "id": "qwen3-coder-next", "object": "model", "owned_by": "porten",
      "x_porten": { "ready": false, "type": "chat", "ctx": 262144 } }
  ]
}
ready: false means the model is offered but not loaded this instant — your first request will trigger an on-demand load. See Models & on-demand loading.
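To avoid the on-demand load on a latency-sensitive path, a client can prefer models that are already loaded. The sketch below filters a `/v1/models` listing by the `x_porten` fields shown above; `ready_models` is a hypothetical helper, and the listing is the example response from this page.

```python
def ready_models(listing: dict, model_type: str = "chat") -> list:
    """Return ids of models of the given type that are loaded right now."""
    ids = []
    for model in listing.get("data", []):
        info = model.get("x_porten", {})
        if info.get("type") == model_type and info.get("ready"):
            ids.append(model["id"])
    return ids


listing = {
    "object": "list",
    "data": [
        {"id": "qwen2.5-coder-32b", "object": "model", "owned_by": "porten",
         "x_porten": {"ready": True, "type": "chat", "ctx": 32768}},
        {"id": "qwen3-coder-next", "object": "model", "owned_by": "porten",
         "x_porten": {"ready": False, "type": "chat", "ctx": 262144}},
    ],
}
warm = ready_models(listing)
```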
Headers
| Header | Direction | Note |
|---|---|---|
| Authorization: Bearer sk-porten-… | in | Required |
| X-Request-Id | out | Correlation id, echoed in logs |
| X-Porten-Node | out | Which node served the request |
| Retry-After | out | On 429 / 503 |
Errors
All errors follow OpenAI's format: {"error":{"message","type","code","param"}}.
| HTTP | code | Meaning |
|---|---|---|
| 401 | invalid_api_key | Invalid or revoked key |
| 403 | model_not_allowed | The key may not use this model (region/policy) |
| 404 | model_not_found | No node advertises this model and it isn't offered |
| 429 | rate_limit_exceeded | Quota/rate exhausted (Retry-After) |
| 502 | node_error | All candidate nodes failed |
| 503 | no_available_node / model_warming | Model exists but no free/healthy node, or it's still loading (Retry-After) |
| 504 | gateway_timeout | Total timeout exceeded |
A 503 model_warming is expected the first time you hit a cold model and the load takes longer than the request's warm-up budget. Retry after the interval given in Retry-After; the model should be ready shortly. Most clients never see this error, because the request normally blocks until the model is ready.
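The retry advice for 429 and 503 can be sketched as a small wrapper. It assumes Retry-After carries a number of seconds and falls back to exponential backoff otherwise; `call_with_retry`, `retry_delay`, the attempt limit, and the 30-second cap are illustrative choices, not values specified by the API. The `send` callable is supplied by the caller and returns `(status, headers, body)`.

```python
import time

RETRYABLE = {429, 503}  # the statuses this page lists as carrying Retry-After


def retry_delay(headers: dict, attempt: int, cap: float = 30.0) -> float:
    """Prefer the server's Retry-After (seconds); else exponential backoff."""
    value = headers.get("Retry-After", "")  # real clients should match case-insensitively
    try:
        return max(0.0, min(float(value), cap))
    except ValueError:
        # Header missing or not a plain number of seconds.
        return min(2.0 ** attempt, cap)


def call_with_retry(send, max_attempts: int = 4, sleep=time.sleep):
    """Call send() -> (status, headers, body), retrying on 429 and 503."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(retry_delay(headers, attempt))
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting. Other errors, such as 401 or 404, are returned immediately, since retrying them would not change the outcome.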