If you ship features on top of GPT‑5, understanding response time is just as important as model quality. This post summarises practical latency expectations when using the OpenAI API with different reasoning levels, explains what actually drives those numbers, and provides a reproducible way to measure and monitor them in your stack.
Note: Numbers in this article are directional and environment‑dependent. They vary by prompt length, output length, system load, region, networking, and streaming vs. non‑streaming usage. Treat them as planning guidance, not hard SLAs.
TL;DR
- Reasoning level is the single biggest lever on latency; higher levels trade speed for better chain‑of‑thought depth.
- Model size does not map to speed the way you might expect: in these tests gpt-5 was generally the fastest of the three and gpt-5-mini the slowest.
- Time‑to‑first‑token (TTFT) is the UX‑critical metric for streaming. You can often keep TTFT low even when the total completion time rises.
- You control more than you think: prompt/response token budgets, parallel function calls, and smart streaming strategies make significant differences.
Latency metrics you should track
- Time‑to‑first‑token (TTFT): request start → first streamed token received. Primary UX metric for conversational UIs.
- Tokens per second (TPS): generation throughput once tokens start flowing. Good for sizing progress indicators.
- Total wall time: request start → stream closed (or full JSON received). Important for background jobs.
- Server vs. network: separate model compute time from client/edge/network overhead to find the real bottleneck.
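As a reference point, here is a minimal sketch of how the three client-side metrics relate to raw timestamps; the field names are illustrative, and the token count would come from the API's usage object or a client-side tokenizer.

// Minimal sketch: deriving TTFT, TPS, and total wall time from client-side timestamps.
// Field names are illustrative, not from any SDK.
interface StreamTimings {
  requestStart: number;   // performance.now() just before the request is sent
  firstTokenAt: number;   // performance.now() when the first output token arrives
  streamClosedAt: number; // performance.now() when the stream ends
  outputTokens: number;   // e.g. from the response's usage object
}

function latencyMetrics(t: StreamTimings) {
  const ttftMs = t.firstTokenAt - t.requestStart;
  const totalMs = t.streamClosedAt - t.requestStart;
  const generationMs = t.streamClosedAt - t.firstTokenAt;
  const tokensPerSecond = generationMs > 0 ? (t.outputTokens / generationMs) * 1000 : 0;
  return { ttftMs, totalMs, tokensPerSecond };
}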
Why reasoning level affects latency
Reasoning adds internal deliberation steps (planning, self‑checking, tool selection). Even when responses are streamed, the model may take longer to produce the first token because it spends more compute on pre‑generation thinking. At higher levels, the model may also generate longer answers, which lengthens total time.
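One way to see this pre-generation work directly is to compare visible output tokens with the reasoning tokens reported in the usage object. A hedged sketch, assuming your SDK version exposes usage.output_tokens_details.reasoning_tokens on Responses API results:

// Sketch: how much of the output budget went to hidden reasoning vs visible text.
// Assumes usage.output_tokens_details.reasoning_tokens is available in your SDK version.
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await client.responses.create({
  model: 'gpt-5-mini',
  reasoning: { effort: 'high' },
  input: 'Explain what a hash map is.',
});

console.log({
  visibleOutputTokens: response.usage?.output_tokens,
  reasoningTokens: response.usage?.output_tokens_details?.reasoning_tokens,
});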
Methodology you can reproduce
Use small, controlled prompts with stable token budgets. Run at least 30 trials per configuration and report medians and p90, not just averages.
- Fix input length: use a templated 800–1000 token prompt.
- Fix output length: set max_output_tokens and steer the model toward short, direct answers.
- Control network variance: same region, no VPN, warm HTTP/2 connections.
- Measure server and client: capture timestamps on both sides if you have a proxy; otherwise, instrument the client precisely.
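For the fixed input length, here is a rough sketch of padding a templated prompt to a target token budget; the ~4 characters per token estimate is only a heuristic, so use a real tokenizer such as tiktoken if you need precision.

// Sketch: pad a prompt template to a roughly fixed token budget so input length
// stays constant across trials. The 4-chars-per-token estimate is a rough heuristic.
function buildFixedLengthPrompt(question: string, targetTokens = 900) {
  const filler = 'Background context sentence used purely as padding. ';
  const approxTokens = (s: string) => Math.ceil(s.length / 4);
  let prompt = question;
  while (approxTokens(prompt) < targetTokens) {
    prompt += '\n' + filler;
  }
  return prompt;
}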
Example Node script (streaming TTFT, total time, and summary statistics per configuration)
The following script performs 30 runs for each model configuration with the same prompt and reports min, max, mean, median, p90, and spread (max - min) for TTFT and total time. Within each configuration the runs are fired in parallel, which cut the overall wall-clock time dramatically (roughly 8.5x in my case); configurations run one after another by default, and raising CONFIG_CONCURRENCY lets several run at once.
// run: npx tsx measure-gpt5-latency.ts
// (the script uses TypeScript annotations, so run it with tsx or ts-node rather than plain node)
import { performance } from 'node:perf_hooks';
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function measure({ model, reasoning }: { model: string; reasoning?: string }) {
  const start = performance.now();
  let firstTokenAt: number | null = null;
  let tokens = 0;
  const params: OpenAI.Responses.ResponseCreateParams = {
    model,
    input: [
      { role: 'developer', content: 'Answer concisely.' },
      { role: 'user', content: 'Explain what a hash map is.' },
    ],
    stream: true,
  };
  if (reasoning) {
    params.reasoning = { effort: reasoning }; // 'minimal' | 'low' | 'medium' | 'high'
  }
  const stream = await client.responses.create(params);
  for await (const chunk of stream) {
    if (chunk.type === 'response.output_text.delta') {
      // TTFT is taken at the first visible output token, not the first stream event
      // (the stream also emits lifecycle events such as response.created).
      if (!firstTokenAt) firstTokenAt = performance.now();
      const delta = chunk?.delta ?? '';
      tokens += delta.length > 0 ? 1 : 0; // rough proxy if you can’t count tokens client-side
    }
  }
  const end = performance.now();
  return {
    ttftMs: Math.round((firstTokenAt ?? end) - start),
    timeTakenMs: Math.round(end - start),
    tokens,
  };
}
(async () => {
  const overallStart = performance.now();
  const startTimeStr = new Date().toLocaleString();
  console.log(`Starting latency measurements at ${startTimeStr}`);

  const configs = [
    { model: 'gpt-5-nano', reasoning: 'minimal' },
    { model: 'gpt-5-nano', reasoning: 'low' },
    { model: 'gpt-5-nano', reasoning: 'medium' },
    { model: 'gpt-5-nano', reasoning: 'high' },
    { model: 'gpt-5-mini', reasoning: 'minimal' },
    { model: 'gpt-5-mini', reasoning: 'low' },
    { model: 'gpt-5-mini', reasoning: 'medium' },
    { model: 'gpt-5-mini', reasoning: 'high' },
    { model: 'gpt-5', reasoning: 'minimal' },
    { model: 'gpt-5', reasoning: 'low' },
    { model: 'gpt-5', reasoning: 'medium' },
    { model: 'gpt-5', reasoning: 'high' },
  ];

  const RUNS = 30;
  const CONFIG_CONCURRENCY = Number(process.env.CONFIG_CONCURRENCY ?? '1');
  const results = [];

  function percentile(arr: number[], p: number) {
    if (arr.length === 0) return NaN;
    const sorted = [...arr].sort((a, b) => a - b);
    const rank = Math.ceil(p * sorted.length) - 1; // nearest-rank method
    const idx = Math.min(sorted.length - 1, Math.max(0, rank));
    return sorted[idx];
  }

  function stats(arr: number[]) {
    const min = Math.min(...arr);
    const max = Math.max(...arr);
    const mean = arr.reduce((a, b) => a + b, 0) / arr.length;
    const spread = max - min;
    const p50 = percentile(arr, 0.5);
    const p90 = percentile(arr, 0.9);
    return { min, max, mean: Math.round(mean), spread, median: Math.round(p50), p90: Math.round(p90) };
  }

  function fmt(ms: number) {
    const s = Math.floor(ms / 1000);
    const m = Math.floor(s / 60);
    const h = Math.floor(m / 60);
    const remMs = ms % 1000;
    const remS = s % 60;
    const remM = m % 60;
    const hh = h > 0 ? String(h).padStart(2, '0') + ':' : '';
    const mm = String(remM).padStart(2, '0');
    const ss = String(remS).padStart(2, '0');
    const msStr = String(remMs).padStart(3, '0');
    return `${hh}${mm}:${ss}.${msStr}`;
  }

  const totalConfigs = configs.length;
  let printedTotal = false;
  async function runConfig(cfg: { model: string; reasoning: string }, index: number) {
    const cfgStart = performance.now();
    const cfgLabel = `Config ${index + 1}/${totalConfigs}: model=${cfg.model}, reasoning=${cfg.reasoning}`;
    console.log(`[${new Date().toLocaleTimeString()}] ${cfgLabel} — starting ${RUNS} runs`);
    const ttfts: number[] = [];
    const timeTaken: number[] = [];
    const tokens: number[] = [];
    const runPromises = Array.from({ length: RUNS }, (_, i) => (async () => {
      console.log(` Run ${i + 1}/${RUNS} — model=${cfg.model}, reasoning=${cfg.reasoning}`);
      try {
        const r = await measure(cfg);
        const elapsedSoFar = Math.round(performance.now() - overallStart);
        console.log(` Completed run ${i + 1}/${RUNS}: ttft=${r.ttftMs}ms timeTaken=${r.timeTakenMs}ms — elapsed so far=${fmt(elapsedSoFar)}`);
        ttfts.push(r.ttftMs);
        timeTaken.push(r.timeTakenMs);
        tokens.push(r.tokens);
      } catch (err) {
        const elapsedSoFar = Math.round(performance.now() - overallStart);
        console.warn(` Run ${i + 1}/${RUNS} failed — elapsed so far=${fmt(elapsedSoFar)}:`, err);
      }
    })());
    await Promise.allSettled(runPromises);
    const ttftStats = stats(ttfts);
    const timeTakenStats = stats(timeTaken);
    const meanTokens = Math.round(tokens.reduce((a, b) => a + b, 0) / tokens.length);
    results.push({
      ...cfg,
      runs: RUNS,
      ttftMinMs: ttftStats.min,
      ttftMaxMs: ttftStats.max,
      ttftMeanMs: ttftStats.mean,
      ttftMedianMs: ttftStats.median,
      ttftSpreadMs: ttftStats.spread,
      ttftP90Ms: ttftStats.p90,
      timeTakenMinMs: timeTakenStats.min,
      timeTakenMaxMs: timeTakenStats.max,
      timeTakenMeanMs: timeTakenStats.mean,
      timeTakenMedianMs: timeTakenStats.median,
      timeTakenSpreadMs: timeTakenStats.spread,
      timeTakenP90Ms: timeTakenStats.p90,
      meanTokens,
    });
    console.table(results);
    const cfgElapsed = Math.round(performance.now() - cfgStart);
    console.log(`[${new Date().toLocaleTimeString()}] Finished ${cfgLabel} — took ${fmt(cfgElapsed)}; total elapsed=${fmt(Math.round(performance.now() - overallStart))}`);
    if (!printedTotal && results.length === totalConfigs) {
      printedTotal = true;
      const totalElapsed = Math.round(performance.now() - overallStart);
      console.log(`All configs complete. Total time: ${fmt(totalElapsed)}`);
    }
  }

  // Simple worker pool: run up to CONFIG_CONCURRENCY configurations at once.
  const workers = Math.max(1, Math.min(CONFIG_CONCURRENCY, totalConfigs));
  let nextIndex = 0;
  await Promise.all(Array.from({ length: workers }, async () => {
    while (true) {
      const idx = nextIndex++;
      if (idx >= totalConfigs) break;
      await runConfig(configs[idx], idx);
    }
  }));
})();
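To run it, set OPENAI_API_KEY in your environment. By default the 12 configurations execute one after another; setting CONFIG_CONCURRENCY (for example CONFIG_CONCURRENCY=3) runs several configurations at once, at the risk of slightly noisier timings from contention.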
Example results
| model | reasoning | runs | ttftMinMs | ttftMaxMs | ttftMeanMs | ttftMedianMs | ttftSpreadMs | ttftP90Ms | timeTakenMinMs | timeTakenMaxMs | timeTakenMeanMs | timeTakenMedianMs | timeTakenSpreadMs | timeTakenP90Ms | meanTokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gpt-5-nano | minimal | 30 | 849 | 2559 | 1084 | 972 | 1710 | 1182 | 2906 | 6940 | 4131 | 3804 | 4034 | 5387 | 243 |
| gpt-5-nano | low | 30 | 196 | 920 | 405 | 295 | 724 | 821 | 2492 | 5630 | 3731 | 3284 | 3138 | 5344 | 276 |
| gpt-5-nano | medium | 30 | 198 | 1238 | 394 | 288 | 1040 | 724 | 5435 | 9817 | 7246 | 7192 | 4382 | 8528 | 260 |
| gpt-5-nano | high | 30 | 198 | 757 | 331 | 252 | 559 | 466 | 9648 | 20101 | 14947 | 14897 | 10453 | 17784 | 254 |
| gpt-5-mini | minimal | 30 | 231 | 1540 | 585 | 496 | 1309 | 1144 | 5155 | 10761 | 7153 | 6608 | 5606 | 9592 | 336 |
| gpt-5-mini | low | 30 | 200 | 1069 | 402 | 284 | 869 | 764 | 5804 | 11909 | 7783 | 7556 | 6105 | 9167 | 399 |
| gpt-5-mini | medium | 30 | 205 | 2231 | 414 | 298 | 2026 | 503 | 7519 | 14325 | 10304 | 10039 | 6806 | 12049 | 394 |
| gpt-5-mini | high | 30 | 190 | 821 | 358 | 325 | 631 | 491 | 9215 | 29832 | 17044 | 16680 | 20617 | 22556 | 311 |
| gpt-5 | minimal | 30 | 295 | 1224 | 520 | 404 | 929 | 784 | 1991 | 4076 | 2959 | 2850 | 2085 | 3723 | 132 |
| gpt-5 | low | 30 | 214 | 3034 | 545 | 299 | 2820 | 770 | 2584 | 7593 | 4311 | 4002 | 5009 | 5515 | 175 |
| gpt-5 | medium | 30 | 211 | 2440 | 500 | 310 | 2229 | 839 | 4417 | 11834 | 6492 | 6093 | 7417 | 8420 | 175 |
| gpt-5 | high | 30 | 231 | 973 | 426 | 377 | 742 | 606 | 5925 | 14314 | 8851 | 8552 | 8389 | 11179 | 162 |
If you rely on JSON mode or tool calls, test those paths specifically; both add overhead and can change TTFT.
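For example, you could extend measure() above with a function tool and compare the same prompt with and without it; the tool definition below is purely hypothetical and exists only to exercise the tool-calling path.

// Sketch: the same streaming request with a hypothetical function tool attached,
// so you can compare TTFT/total time against the tool-free baseline.
const paramsWithTool: OpenAI.Responses.ResponseCreateParams = {
  model: 'gpt-5-mini',
  reasoning: { effort: 'low' },
  tools: [
    {
      type: 'function',
      name: 'lookup_docs', // hypothetical tool, for measurement only
      description: 'Look up internal documentation by keyword.',
      parameters: {
        type: 'object',
        properties: { query: { type: 'string' } },
        required: ['query'],
        additionalProperties: false,
      },
      strict: true,
    },
  ],
  input: [
    { role: 'developer', content: 'Answer concisely.' },
    { role: 'user', content: 'Explain what a hash map is.' },
  ],
  stream: true,
};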
Mean time taken vs reasoning level

Key Findings
- Performance scaling:
  - Response times increase significantly as reasoning effort rises; the growth is nonlinear and particularly steep between medium and high reasoning.
- Relative speed:
  - gpt-5 has the lowest mean total time at almost every reasoning level in this simple example, which is surprising.
  - gpt-5-mini is the slowest of the three at every reasoning level; gpt-5-nano sits in between.
- Variability:
  - The spread (variability) of total time grows sharply at higher reasoning levels, especially for mini and nano.
  - This implies less predictable performance under heavier reasoning loads.
- Efficiency per token:
  - Mean token counts stay roughly stable (~130–400), so the longer times at higher reasoning levels come from extra computation, not longer responses.
- Overall ranking by mean total time (fastest → slowest):
  - gpt-5 (fastest, most consistent)
  - gpt-5-nano
  - gpt-5-mini (slowest in this test, with the widest spread at high reasoning)
Token usage vs reasoning level

Key Findings
- Token usage grows with reasoning level, but not uniformly:
  - Token counts rise from minimal to low reasoning for all models, then plateau and even drop slightly at high reasoning, suggesting that higher reasoning involves deeper computation rather than longer visible responses.
  - For example, gpt-5-mini peaks at medium reasoning (394 tokens) but drops to 311 at high, and gpt-5-nano stays almost flat between medium and high (260 → 254).
- gpt-5 is the most concise:
  - gpt-5: 132–175 tokens
  - gpt-5-nano: 243–276 tokens
  - gpt-5-mini: 311–399 tokens
  - The full model expresses equivalent answers in far fewer tokens; notably, the mid-sized model is the most verbose, not the smallest one.
- gpt-5-mini expands more as reasoning deepens:
  - It shows the largest swing in token usage (336 → 399 → 394 → 311), suggesting the intermediate-sized model elaborates more as reasoning effort increases.
- Token usage is stable in the smallest model:
  - gpt-5-nano stays within a narrow 243–276 token range regardless of reasoning level, indicating limited adaptability in verbosity, possibly because the smallest model caps how much it elaborates.
Optimisation checklist
- Stream responses and render progressively; prioritise TTFT.
- Keep prompts tight; prune unused context and reduce system message verbosity.
- Set max_output_tokens realistically; don’t over‑budget if you don’t need long answers.
- Cache immutable prefix prompts (client side) and send only the diff where possible.
- Prefer shorter tool traces; avoid unnecessary parallel tool calls.
- Enable gzip/br encodings and HTTP/2 keep‑alive; reuse clients between requests.
- Run close to the API region you target; minimise cross‑region hops.
- For background jobs, batch requests off the hot path and set longer timeouts.
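A minimal sketch pulling a few of these together: one shared client so connections stay warm, a tight output budget, and progressive rendering of the stream. The helper name and callback are illustrative, not part of any SDK.

// Sketch: stream with a small output budget and render deltas as they arrive.
// streamAnswer and onDelta are illustrative names.
import OpenAI from 'openai';

const client = new OpenAI(); // reuse one client across requests so connections stay warm

export async function streamAnswer(question: string, onDelta: (text: string) => void) {
  const stream = await client.responses.create({
    model: 'gpt-5-mini',
    reasoning: { effort: 'low' }, // lowest effort that still meets your acceptance criteria
    max_output_tokens: 400,       // don't over-budget short answers
    input: [
      { role: 'developer', content: 'Answer concisely.' },
      { role: 'user', content: question },
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.type === 'response.output_text.delta') {
      onDelta(chunk.delta); // progressive rendering keeps perceived latency close to TTFT
    }
  }
}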
Measuring in production
- Capture TTFT and total time per request in your observability layer (e.g. OpenTelemetry).
- Tag metrics with model and reasoning level so dashboards can alert when p95 drifts.
- Record token counts to correlate cost, latency, and user outcomes.
- Keep a small canary suite (stable prompts) that you run periodically to detect regressions independent of user traffic.
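A hedged sketch of the first two points using the OpenTelemetry JS metrics API; it assumes an OTel SDK and exporter are already configured elsewhere in your service, and the metric and attribute names are illustrative.

// Sketch: record TTFT and total time as histograms tagged by model and reasoning level.
// Assumes @opentelemetry/api is installed and an SDK/exporter is wired up elsewhere.
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('llm-latency');
const ttftHistogram = meter.createHistogram('llm.ttft', { unit: 'ms' });
const totalHistogram = meter.createHistogram('llm.total_time', { unit: 'ms' });
const tokenHistogram = meter.createHistogram('llm.output_tokens');

export function recordLatency(sample: {
  model: string;
  reasoning: string;
  ttftMs: number;
  totalMs: number;
  outputTokens: number;
}) {
  const attributes = { model: sample.model, reasoning: sample.reasoning };
  ttftHistogram.record(sample.ttftMs, attributes);
  totalHistogram.record(sample.totalMs, attributes);
  tokenHistogram.record(sample.outputTokens, attributes); // correlate cost with latency
}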
Takeaways
- Reasoning level is a quality/speed dial; use the lowest level that still meets your acceptance criteria.
- Streaming hides total time but not TTFT; keep TTFT low for perceived performance.
- Measure with discipline in your own environment; publish medians/p90s to your team so expectations remain realistic.
If you spot materially different numbers in your setup, share your methodology alongside the metrics; context is everything.