If you ship features on top of GPT‑5, understanding response time is just as important as model quality. This post summarises practical latency expectations when using the OpenAI API with different reasoning levels, explains what actually drives those numbers, and provides a reproducible way to measure and monitor them in your stack.
Note: Numbers in this article are directional and environment‑dependent. They vary by prompt length, output length, system load, region, networking, and streaming vs. non‑streaming usage. Treat them as planning guidance, not hard SLAs.
TL;DR
- Reasoning level is a massive latency multiplier for the standard models (GPT-5, GPT-5-nano), but newer iterations (GPT-5.1, GPT-5.2) can deliver consistently higher speeds depending on the task and prompt.
- For complex reasoning tasks, newer models can be 2-3x faster than the base GPT-5 model.
- Time‑to‑first‑token (TTFT) is the UX‑critical metric for streaming. You can often keep TTFT low even when the total completion time rises.
- You control more than you think: prompt/response token budgets, parallel function calls, and smart streaming strategies make significant differences.
Latency metrics you should track
- Time‑to‑first‑token (TTFT): request start → first streamed token received. Primary UX metric for conversational UIs.
- Tokens per second (TPS): generation throughput once tokens start flowing. Good for sizing progress indicators.
- Total wall time: request start → stream closed (or full JSON received). Important for background jobs.
- Server vs. network: separate model compute time from client/edge/network overhead to find the real bottleneck.
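To make these concrete, here is a minimal sketch of how the three timing metrics relate, assuming hypothetical timestamps (requestStart, firstTokenAt, streamClosedAt) captured by your own client instrumentation:
// Minimal sketch: deriving TTFT, total wall time, and TPS from client-side timestamps.
// The field names below are hypothetical; capture them however your client instruments requests.
interface StreamTimings {
  requestStart: number;   // ms, e.g. performance.now() just before the API call
  firstTokenAt: number;   // ms, when the first streamed token arrived
  streamClosedAt: number; // ms, when the stream closed
  outputTokens: number;   // token count reported by the API, or a client-side proxy
}

function deriveMetrics(t: StreamTimings) {
  const ttftMs = t.firstTokenAt - t.requestStart;         // time-to-first-token
  const totalMs = t.streamClosedAt - t.requestStart;      // total wall time
  const generationMs = t.streamClosedAt - t.firstTokenAt; // pure generation window
  const tps = generationMs > 0 ? t.outputTokens / (generationMs / 1000) : 0; // tokens per second
  return { ttftMs, totalMs, tps };
}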
Why reasoning level affects latency
Reasoning adds internal deliberation steps (planning, self‑checking, tool selection). Even when responses are streamed, the model may take longer to produce the first token because it spends more compute on pre‑generation thinking. At higher levels, the model may also generate longer answers, which lengthens total time.
Measurement methodology
Use a small, fixed prompt; run at least 30 trials per configuration and report medians and P90s, not just averages. The runs assess only speed and token usage, not response quality.
- Fix the input prompt: use a templated 800–1000 token prompt.
- Control output length: steer the model toward short, direct answers.
- Control network variance: same region, no VPN, warm HTTP/2 connections.
- Measure on the client: instrument request start, first token, and stream close precisely.
Example Node script (streaming: TTFT, total time, and summary statistics)
The following script performs 30 runs for each model configuration with the same prompt and reports min, max, mean, median, P90, and spread (max-min). Within each configuration the runs are fired in parallel, which significantly reduces the overall time to a result (roughly 8.5x faster than running them sequentially). It also uses a “leaky bucket” rate limiter to avoid hitting OpenAI rate limits; a more robust implementation would retry with exponential backoff when a rate limit is hit (a minimal sketch of that alternative follows the script).
// run: npx tsx measure-gpt5-latency.ts (the script uses TypeScript syntax, so use a TS-aware runner such as tsx)
import { performance } from 'node:perf_hooks';
import OpenAI from 'openai';
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function measure({ model, reasoning = undefined }: { model: string; reasoning?: string }) {
const start = performance.now();
let firstTokenAt = null;
let tokens = 0;
const params: OpenAI.Responses.ResponseCreateParams = {
model,
input: [
{ role: 'developer', content: 'Answer concisely.' },
{
role: 'user',
content: 'Explain what a hash map is, when to use it and when simpler data structures might be better.'
},
],
stream: true,
};
// gpt-5 models accept 'minimal' | 'low' | 'medium' | 'high'; newer models (gpt-5.1+) also accept 'none'
if (reasoning) {
// @ts-ignore - forward compatibility in case the SDK's effort type lags behind newer values
params.reasoning = { effort: reasoning };
}
const stream = await client.responses.create(params);
for await (const chunk of stream) {
const now = performance.now();
// Marks TTFT at the first streamed event of any type; move this inside the delta check below to time strictly the first visible text token.
if (!firstTokenAt) firstTokenAt = now;
if (chunk.type === 'response.output_text.delta') {
const delta = chunk?.delta ?? '';
tokens += delta.length > 0 ? 1 : 0; // rough proxy if you can’t count tokens client-side
}
}
const end = performance.now();
return {
ttftMs: Math.round((firstTokenAt ?? end) - start),
timeTakenMs: Math.round(end - start),
tokens,
};
}
class LeakyBucket {
private capacity: number;
private interval: number;
private lastRefill: number;
private tokens: number;
private queue: (() => void)[] = [];
constructor(capacity: number, requestsPerSecond: number) {
this.capacity = capacity;
this.interval = 1000 / requestsPerSecond;
this.tokens = capacity;
this.lastRefill = performance.now();
}
async acquire(): Promise<void> {
return new Promise((resolve) => {
this.queue.push(resolve);
this.processQueue();
});
}
private processQueue() {
const now = performance.now();
const elapsed = now - this.lastRefill;
const newTokens = Math.floor(elapsed / this.interval);
if (newTokens > 0) {
this.tokens = Math.min(this.capacity, this.tokens + newTokens);
this.lastRefill = now;
}
while (this.queue.length > 0 && this.tokens > 0) {
this.tokens--;
const next = this.queue.shift();
if (next) next();
}
if (this.queue.length > 0) {
// Schedule next check
const waitTime = this.interval - (performance.now() - this.lastRefill);
setTimeout(() => this.processQueue(), Math.max(0, waitTime));
}
}
}
// Global rate limiter: 5 requests per second, burst up to 10
const rateLimiter = new LeakyBucket(10, 5);
(async () => {
const overallStart = performance.now();
const startTimeStr = new Date().toLocaleString();
console.log(`Starting latency measurements at ${startTimeStr}`);
if (!process.env.OPENAI_API_KEY) {
console.error("Error: OPENAI_API_KEY environment variable is not set.");
process.exit(1);
}
const configs = [
{ model: 'gpt-5-nano', reasoning: 'minimal' },
{ model: 'gpt-5-nano', reasoning: 'low' },
{ model: 'gpt-5-nano', reasoning: 'medium' },
{ model: 'gpt-5-nano', reasoning: 'high' },
{ model: 'gpt-5-mini', reasoning: 'minimal' },
{ model: 'gpt-5-mini', reasoning: 'low' },
{ model: 'gpt-5-mini', reasoning: 'medium' },
{ model: 'gpt-5-mini', reasoning: 'high' },
{ model: 'gpt-5', reasoning: 'minimal' },
{ model: 'gpt-5', reasoning: 'low' },
{ model: 'gpt-5', reasoning: 'medium' },
{ model: 'gpt-5', reasoning: 'high' },
{ model: 'gpt-5.1', reasoning: 'none' },
{ model: 'gpt-5.1', reasoning: 'low' },
{ model: 'gpt-5.1', reasoning: 'medium' },
{ model: 'gpt-5.1', reasoning: 'high' },
{ model: 'gpt-5.2', reasoning: 'none' },
{ model: 'gpt-5.2', reasoning: 'low' },
{ model: 'gpt-5.2', reasoning: 'medium' },
{ model: 'gpt-5.2', reasoning: 'high' },
];
const RUNS = 30;
const CONFIG_CONCURRENCY = Number(process.env.CONFIG_CONCURRENCY ?? '1');
const results: Record<string, string | number>[] = [];
function percentile(arr: number[], p: number) {
if (arr.length === 0) return NaN;
const sorted = [...arr].sort((a, b) => a - b);
const rank = Math.ceil(p * sorted.length) - 1; // nearest-rank method
const idx = Math.min(sorted.length - 1, Math.max(0, rank));
return sorted[idx];
}
function stats(arr: number[]) {
if (arr.length === 0) return { min: 0, max: 0, mean: 0, spread: 0, median: 0, p90: 0 };
const min = Math.min(...arr);
const max = Math.max(...arr);
const mean = arr.reduce((a, b) => a + b, 0) / arr.length;
const spread = max - min;
const p50 = percentile(arr, 0.5);
const p90 = percentile(arr, 0.9);
return { min, max, mean: Math.round(mean), spread, median: Math.round(p50), p90: Math.round(p90) };
}
function fmt(ms: number) {
const s = Math.floor(ms / 1000);
const m = Math.floor(s / 60);
const h = Math.floor(m / 60);
const remMs = ms % 1000;
const remS = s % 60;
const remM = m % 60;
const hh = h > 0 ? String(h).padStart(2, '0') + ':' : '';
const mm = String(remM).padStart(2, '0');
const ss = String(remS).padStart(2, '0');
const msStr = String(remMs).padStart(3, '0');
return `${hh}${mm}:${ss}.${msStr}`;
}
const totalConfigs = configs.length;
let printedTotal = false;
async function runConfig(cfg: { model: string; reasoning: string }, index: number) {
const cfgStart = performance.now();
const cfgLabel = `Config ${index + 1}/${totalConfigs}: model=${cfg.model}, reasoning=${cfg.reasoning}`;
console.log(`[${new Date().toLocaleTimeString()}] ${cfgLabel} — starting ${RUNS} runs`);
const ttfts: number[] = [];
const timeTaken: number[] = [];
const tokens: number[] = [];
const runPromises = Array.from({ length: RUNS }, (_, i) => (async () => {
await rateLimiter.acquire();
console.log(` Run ${i + 1}/${RUNS} — model=${cfg.model}, reasoning=${cfg.reasoning}`);
try {
const r = await measure(cfg);
const elapsedSoFar = Math.round(performance.now() - overallStart);
console.log(` Completed run ${i + 1}/${RUNS}: ttft=${r.ttftMs}ms timeTaken=${r.timeTakenMs}ms — elapsed so far=${fmt(elapsedSoFar)}`);
ttfts.push(r.ttftMs);
timeTaken.push(r.timeTakenMs);
tokens.push(r.tokens);
} catch (err) {
const elapsedSoFar = Math.round(performance.now() - overallStart);
console.warn(` Run ${i + 1}/${RUNS} failed — elapsed so far=${fmt(elapsedSoFar)}:`, err);
}
})());
await Promise.allSettled(runPromises);
const ttftStats = stats(ttfts);
const timeTakenStats = stats(timeTaken);
const meanTokens = tokens.length > 0 ? Math.round(tokens.reduce((a, b) => a + b, 0) / tokens.length) : 0;
results.push({
...cfg,
runs: RUNS,
ttftMinMs: ttftStats.min,
ttftMaxMs: ttftStats.max,
ttftMeanMs: ttftStats.mean,
ttftMedianMs: ttftStats.median,
ttftSpreadMs: ttftStats.spread,
ttftP90Ms: ttftStats.p90,
timeTakenMinMs: timeTakenStats.min,
timeTakenMaxMs: timeTakenStats.max,
timeTakenMeanMs: timeTakenStats.mean,
timeTakenMedianMs: timeTakenStats.median,
timeTakenSpreadMs: timeTakenStats.spread,
timeTakenP90Ms: timeTakenStats.p90,
meanTokens: meanTokens,
});
const cfgElapsed = Math.round(performance.now() - cfgStart);
console.log(`[${new Date().toLocaleTimeString()}] Finished ${cfgLabel} — took ${fmt(cfgElapsed)}; total elapsed=${fmt(Math.round(performance.now() - overallStart))}`);
}
const workers = Math.max(1, Math.min(CONFIG_CONCURRENCY, totalConfigs));
let nextIndex = 0;
await Promise.all(Array.from({ length: workers }, async () => {
while (true) {
const idx = nextIndex++;
if (idx >= totalConfigs) break;
await runConfig(configs[idx], idx);
}
}));
console.table(results);
const totalElapsed = Math.round(performance.now() - overallStart);
console.log(`All configs complete. Total time: ${fmt(totalElapsed)}`);
})();
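As noted above, a more robust alternative to the leaky bucket is to retry with exponential backoff when a rate limit is hit. A minimal sketch, assuming the SDK surfaces rate-limit failures as errors carrying an HTTP status of 429 (check the error shape in your SDK version):
// Sketch only: retry a request with exponential backoff plus jitter on 429s.
// maxAttempts and the base delay are illustrative values, not recommendations.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (err: any) {
      attempt++;
      const isRateLimit = err?.status === 429; // assumption: the SDK exposes the HTTP status here
      if (!isRateLimit || attempt >= maxAttempts) throw err;
      const delayMs = Math.min(30_000, 500 * 2 ** attempt) + Math.random() * 250;
      console.warn(`Rate limited; retrying in ${Math.round(delayMs)}ms (attempt ${attempt}/${maxAttempts})`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: const r = await withBackoff(() => measure({ model: 'gpt-5-mini', reasoning: 'low' }));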
Example results
| Index | Model | Reasoning | Runs | ttftMinMs | ttftMaxMs | ttftMeanMs | ttftMedianMs | ttftSpreadMs | ttftP90Ms | timeTakenMinMs | timeTakenMaxMs | timeTakenMeanMs | timeTakenMedianMs | timeTakenSpreadMs | timeTakenP90Ms | meanTokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gpt-5-nano | minimal | 30 | 217 | 519 | 322 | 301 | 302 | 416 | 3610 | 7095 | 4746 | 4514 | 3485 | 5603 | 476 |
| 1 | gpt-5-nano | low | 30 | 158 | 368 | 241 | 233 | 210 | 309 | 5020 | 9731 | 6770 | 6696 | 4711 | 8236 | 472 |
| 2 | gpt-5-nano | medium | 30 | 171 | 414 | 267 | 252 | 243 | 352 | 12093 | 20257 | 16936 | 16792 | 8164 | 18950 | 455 |
| 3 | gpt-5-nano | high | 30 | 171 | 439 | 268 | 252 | 268 | 320 | 22910 | 62091 | 39188 | 39328 | 39181 | 46039 | 504 |
| 4 | gpt-5-mini | minimal | 30 | 224 | 941 | 353 | 327 | 717 | 391 | 7087 | 46279 | 10646 | 8761 | 39192 | 12593 | 634 |
| 5 | gpt-5-mini | low | 30 | 184 | 1296 | 309 | 283 | 1112 | 332 | 7777 | 12145 | 9788 | 9723 | 4368 | 11079 | 623 |
| 6 | gpt-5-mini | medium | 30 | 176 | 395 | 283 | 277 | 219 | 351 | 10033 | 15693 | 13118 | 12799 | 5660 | 14747 | 590 |
| 7 | gpt-5-mini | high | 30 | 171 | 358 | 280 | 270 | 187 | 325 | 19461 | 44936 | 29784 | 27781 | 25475 | 39261 | 558 |
| 8 | gpt-5 | minimal | 30 | 188 | 678 | 320 | 295 | 490 | 355 | 5347 | 11158 | 6954 | 6416 | 5811 | 8394 | 297 |
| 9 | gpt-5 | low | 30 | 183 | 417 | 298 | 297 | 234 | 352 | 8817 | 27184 | 14211 | 12461 | 18367 | 21695 | 346 |
| 10 | gpt-5 | medium | 30 | 190 | 1573 | 322 | 273 | 1383 | 354 | 15584 | 84439 | 25958 | 23497 | 68855 | 31028 | 356 |
| 11 | gpt-5 | high | 30 | 218 | 410 | 307 | 300 | 192 | 379 | 24975 | 59702 | 39029 | 37450 | 34727 | 50854 | 336 |
| 12 | gpt-5.1 | none | 30 | 221 | 1484 | 329 | 278 | 1263 | 361 | 9332 | 18009 | 12312 | 12162 | 8677 | 15421 | 681 |
| 13 | gpt-5.1 | low | 30 | 181 | 412 | 279 | 267 | 231 | 335 | 9995 | 18688 | 12286 | 11868 | 8693 | 14068 | 762 |
| 14 | gpt-5.1 | medium | 30 | 208 | 356 | 272 | 268 | 148 | 314 | 9372 | 15971 | 12812 | 12529 | 6599 | 14652 | 700 |
| 15 | gpt-5.1 | high | 30 | 177 | 347 | 256 | 264 | 170 | 295 | 9443 | 28959 | 14805 | 14553 | 19516 | 18094 | 773 |
| 16 | gpt-5.2 | none | 30 | 214 | 403 | 270 | 261 | 189 | 319 | 8130 | 15438 | 12044 | 12005 | 7308 | 13357 | 602 |
| 17 | gpt-5.2 | low | 30 | 194 | 412 | 291 | 273 | 218 | 386 | 10612 | 16003 | 12284 | 11923 | 5391 | 14233 | 644 |
| 18 | gpt-5.2 | medium | 30 | 172 | 1398 | 351 | 276 | 1226 | 387 | 11575 | 71061 | 16899 | 12978 | 59486 | 16291 | 669 |
| 19 | gpt-5.2 | high | 30 | 235 | 626 | 455 | 446 | 391 | 578 | 10464 | 14641 | 12858 | 13017 | 4177 | 14170 | 627 |
If you rely on JSON mode or tool calls, test those paths specifically; both add overhead and can change TTFT.
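If you want to time the tool-call path, a standalone sketch is shown below. The get_weather tool is a made-up example, and the exact shape of the tools parameter can differ between API/SDK versions, so verify it against the version you use:
// Sketch: time a streamed request that exposes a (hypothetical) function tool.
import { performance } from 'node:perf_hooks';
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function measureToolPath(model: string) {
  const start = performance.now();
  let firstEventAt: number | null = null;

  const stream = await client.responses.create({
    model,
    input: 'What is the weather like in Lisbon right now?',
    tools: [
      {
        type: 'function',
        name: 'get_weather', // hypothetical tool
        description: 'Get the current weather for a city',
        parameters: {
          type: 'object',
          properties: { city: { type: 'string' } },
          required: ['city'],
        },
        strict: false,
      },
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    if (!firstEventAt) firstEventAt = performance.now();
    // In a fuller harness, also record when the function-call arguments finish streaming.
  }

  const end = performance.now();
  return { ttftMs: Math.round((firstEventAt ?? end) - start), totalMs: Math.round(end - start) };
}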
Mean time taken vs reasoning level
Key findings
- Newer models prioritise consistency
  - GPT-5.1 and GPT-5.2 show remarkable stability: their total response times remain nearly flat (~12–15 seconds) across all reasoning levels, whether reasoning is set to ‘low’ or ‘high’. This could, however, be down to the nature of the prompt, which is fairly simple.
  - In contrast, the smaller and older models, gpt-5-nano and gpt-5, show steep, non-linear latency spikes as reasoning depth increases.
- The “reasoning penalty” varies by model
  - gpt-5-nano is the fastest model for simple tasks (~4.7s at “minimal” reasoning) but degrades severely at “high” reasoning (~39s), making it nearly 8x slower.
  - gpt-5 follows a similar pattern, starting fast (~7s) but matching nano’s slowness at “high” reasoning (~39s).
  - gpt-5-mini sits in the middle, scaling moderately (~10s → ~30s).
- High-reasoning speed champions
  - For deep reasoning tasks, gpt-5.1 and gpt-5.2 are the clear winners, delivering results 2–3x faster than gpt-5 or gpt-5-nano. Again, the caveat is that the test prompt is straightforward rather than complex.
- Overall ranking (speed); a small helper encoding this follows below
  - Low reasoning / simple tasks: gpt-5-nano > gpt-5 > gpt-5.1/5.2
  - High reasoning / complex tasks: gpt-5.2/5.1 > gpt-5-mini > gpt-5 ≈ gpt-5-nano
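To make the ranking concrete, here is a tiny illustrative helper that encodes it as a default model picker. The complexity buckets and chosen defaults are assumptions based on this single benchmark prompt, not a general recommendation:
// Illustrative only: encode the speed ranking above as a model picker.
type TaskComplexity = 'simple' | 'complex';

function pickFastConfig(complexity: TaskComplexity): { model: string; reasoning: string } {
  if (complexity === 'simple') {
    // Low reasoning / simple tasks: gpt-5-nano was fastest in these runs.
    return { model: 'gpt-5-nano', reasoning: 'minimal' };
  }
  // High reasoning / complex tasks: gpt-5.2 / gpt-5.1 delivered the best latency here.
  return { model: 'gpt-5.2', reasoning: 'high' };
}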
Token usage vs reasoning level
Key findings
- Conciseness vs. verbosity
  - gpt-5 is the most concise model by far, consistently producing the fewest tokens (~300–350) to answer the prompt.
  - gpt-5.1 is the most verbose, averaging ~700+ tokens, often double the output of gpt-5 for the same task.
  - gpt-5-mini and gpt-5.2 sit in the middle, generally in the 550–650 token range.
- Reasoning doesn’t always equal more tokens
  - Counter-intuitively, higher reasoning levels do not always lead to longer final answers.
  - gpt-5-mini actually produces fewer tokens at ‘high’ reasoning (558) than at ‘minimal’ (634), suggesting it optimises its final output after “thinking” more.
  - gpt-5-nano and gpt-5 remain relatively stable in token output regardless of the reasoning setting.
- Latency is compute, not length
  - Comparing the two graphs reveals a critical insight: the massive latency spikes in gpt-5 and gpt-5-nano at ‘high’ reasoning are not caused by generating more text (token counts are flat). They are caused purely by increased “thinking” time, which shows up as slower token generation and inter-token pauses (see the quick calculation below).
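A quick back-of-the-envelope check using the gpt-5 rows from the table above makes the point (remember the token counts are the script's rough client-side proxy, so treat the absolute numbers loosely):
// Effective throughput = mean tokens / mean total time, using the gpt-5 rows from the results table.
const gpt5Minimal = { meanTokens: 297, timeTakenMeanMs: 6954 };
const gpt5High = { meanTokens: 336, timeTakenMeanMs: 39029 };

const tps = (r: { meanTokens: number; timeTakenMeanMs: number }) => r.meanTokens / (r.timeTakenMeanMs / 1000);

console.log(tps(gpt5Minimal).toFixed(1)); // ≈ 42.7 tokens/s at 'minimal'
console.log(tps(gpt5High).toFixed(1));    // ≈ 8.6 tokens/s at 'high'
// Output length barely changes (297 → 336 tokens), but throughput drops roughly 5x,
// so the extra latency is compute, not length.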
Time-to-first-token (TTFT) stability
Key findings
- Reasoning happens during generation
  - The data reveals a surprising trend: TTFT remains flat (~250–350ms) across virtually all models and reasoning levels.
  - This contradicts the common assumption that “reasoning” implies a long “thinking” pause before the first token. Instead, these models appear to distribute their reasoning throughout the stream, simply generating tokens more slowly (lower tokens/sec) rather than waiting longer to start.
- UX implications
  - Because TTFT is unaffected by reasoning depth, streaming is mandatory: you can deliver an “instant”-feeling UI (the response starts rendering in <300ms) even if the full response takes 40 seconds to complete (a minimal streaming sketch follows).
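As a minimal sketch of that streaming UX, here writing deltas to stdout stands in for updating a UI, and the model and effort are placeholders:
// Sketch: render output progressively so users see text within the TTFT window,
// even when the full completion takes tens of seconds.
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function streamToUser(prompt: string) {
  const stream = await client.responses.create({
    model: 'gpt-5-mini',          // placeholder model
    reasoning: { effort: 'low' }, // placeholder effort
    input: prompt,
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.type === 'response.output_text.delta') {
      process.stdout.write(chunk.delta); // in a real UI, append to the visible message instead
    }
  }
  process.stdout.write('\n');
}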
Optimisation checklist
- Stream responses and render progressively; prioritise TTFT.
- Keep prompts tight; prune unused context and reduce system message verbosity.
- Set max_output_tokens realistically; don’t over-budget if you don’t need long answers (see the client sketch after this list).
- Cache immutable prefix prompts (client side) and send only the diff where possible.
- Prefer shorter tool traces; avoid unnecessary parallel tool calls.
- Enable gzip/br encodings and HTTP/2 keep‑alive; reuse clients between requests.
- Run close to the API region you target; minimise cross‑region hops.
- For background jobs, batch requests off the hot path and set longer timeouts.
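A minimal sketch combining a few of these knobs, assuming the Node SDK's client options (timeout, maxRetries) and the Responses API's max_output_tokens budget; the numbers are illustrative, not recommendations:
// Sketch: one shared client (connection reuse) plus a conservative output budget.
import OpenAI from 'openai';

// Create the client once and reuse it across requests so connections stay warm.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 60_000, // illustrative: generous timeout for background work
  maxRetries: 2,   // illustrative: the SDK retries transient failures for you
});

export async function shortAnswer(question: string) {
  return client.responses.create({
    model: 'gpt-5-mini',          // placeholder model
    reasoning: { effort: 'low' }, // lowest effort that meets your quality bar
    max_output_tokens: 500,       // keep budgets realistic; on reasoning models this also covers reasoning tokens
    input: question,
  });
}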
Measuring in production
- Capture TTFT and total time per request in your observability layer (e.g. OpenTelemetry).
- Tag metrics with model and reasoning level so dashboards can alert when p95 drifts.
- Record token counts to correlate cost, latency, and user outcomes.
- Keep a small canary suite (stable prompts) that you run periodically to detect regressions independent of user traffic.
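A minimal sketch of the recording side, assuming an OpenTelemetry MeterProvider is already registered elsewhere in your service; the meter and metric names are placeholders:
// Sketch: record TTFT, total time, and token counts as OpenTelemetry histograms,
// tagged with model and reasoning level so dashboards can slice and alert on them.
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('llm-latency');
const ttftHistogram = meter.createHistogram('llm.ttft', { unit: 'ms' });
const totalHistogram = meter.createHistogram('llm.total_time', { unit: 'ms' });
const tokenHistogram = meter.createHistogram('llm.output_tokens');

export function recordLatency(sample: {
  model: string;
  reasoning: string;
  ttftMs: number;
  totalMs: number;
  outputTokens: number;
}) {
  const attrs = { model: sample.model, reasoning: sample.reasoning };
  ttftHistogram.record(sample.ttftMs, attrs);
  totalHistogram.record(sample.totalMs, attrs);
  tokenHistogram.record(sample.outputTokens, attrs);
}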
Takeaways
- Reasoning level is a quality/speed dial; use the lowest level that still meets your acceptance criteria.
- Streaming hides total time but not TTFT; keep TTFT low for perceived performance.
- Measure with discipline in your own environment; publish medians/p90s to your team so expectations remain realistic.
If you spot materially different numbers in your setup, share your methodology alongside the metrics; context is everything.