Average response time metrics for OpenAI GPT‑5 models (by reasoning level)

If you ship features on top of GPT‑5, understanding response time is just as important as model quality. This post summarises practical latency expectations when using the OpenAI API with different reasoning levels, explains what actually drives those numbers, and provides a reproducible way to measure and monitor them in your stack.

Note: Numbers in this article are directional and environment‑dependent. They vary by prompt length, output length, system load, region, networking, and streaming vs. non‑streaming usage. Treat them as planning guidance, not hard SLAs.

TL;DR

  • Reasoning level is a massive latency multiplier for the standard models (GPT‑5, GPT‑5 Nano), while newer iterations (GPT‑5.1, GPT‑5.2) can deliver consistently higher speeds, depending on the task/prompt.
  • For complex reasoning tasks, newer models can be 2-3x faster than the base GPT-5 model.
  • Time‑to‑first‑token (TTFT) is the UX‑critical metric for streaming. You can often keep TTFT low even when the total completion time rises.
  • You control more than you think: prompt/response token budgets, parallel function calls, and smart streaming strategies make significant differences.

Latency metrics you should track

  • Time‑to‑first‑token (TTFT): request start → first streamed token received. Primary UX metric for conversational UIs.
  • Tokens per second (TPS): generation throughput once tokens start flowing. Good for sizing progress indicators.
  • Total wall time: request start → stream closed (or full JSON received). Important for background jobs.
  • Server vs. network: separate model compute time from client/edge/network overhead to find the real bottleneck.
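To make the relationships concrete, here is a minimal sketch of how these numbers fall out of three client‑side timestamps. The LatencyMetrics type and toMetrics helper are illustrative, not part of any SDK.

// Sketch only: illustrative types and helper, not part of the OpenAI SDK.
interface LatencyMetrics {
    ttftMs: number;          // request start → first streamed token
    tokensPerSecond: number; // throughput once tokens start flowing
    totalMs: number;         // request start → stream closed
}

function toMetrics(startMs: number, firstTokenMs: number, endMs: number, outputTokens: number): LatencyMetrics {
    const generationMs = Math.max(1, endMs - firstTokenMs); // avoid divide-by-zero on empty outputs
    return {
        ttftMs: Math.round(firstTokenMs - startMs),
        tokensPerSecond: Math.round((outputTokens / generationMs) * 1000),
        totalMs: Math.round(endMs - startMs),
    };
}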

Why reasoning level affects latency

Reasoning adds internal deliberation steps (planning, self‑checking, tool selection). Even when responses are streamed, the model may take longer to produce the first token because it spends more compute on pre‑generation thinking. At higher levels, the model may also generate longer answers, which lengthens total time.

Measurement methodology

Use a small, fixed prompt. Run at least 30 trials per configuration and report medians and P90s, not just averages. Response quality is not assessed here, only speed and token usage.

  • Fix the input prompt: use the same templated 800–1000 token prompt for every run.
  • Constrain output length: steer the model toward short, direct answers.
  • Control network variance: same region, no VPN, warm HTTP/2 connections (see the warm‑up sketch below).
  • Instrument the client: record request start, first streamed token, and stream close with a monotonic clock.
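For the connection‑warming step, one lightweight approach is a single untimed request per process before the measured runs. This is a sketch, assuming the same OpenAI client used in the script below; max_output_tokens keeps the throwaway response tiny.

import OpenAI from 'openai';

// Sketch: one cheap, untimed request to open and warm the HTTPS connection
// so the first measured run doesn't pay the handshake cost.
async function warmUp(client: OpenAI, model: string) {
    await client.responses.create({
        model,
        input: 'ping',
        max_output_tokens: 16,
    });
}

// Usage (once, before the timed runs): await warmUp(client, 'gpt-5-mini');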

Example Node/TypeScript script (streaming; TTFT, total time, min, max, mean, median, and P90)

The following script runs 30 measurements for each model configuration with the same prompt and reports min, max, mean, median, P90, and spread (max − min). It runs the batches in parallel to significantly reduce the overall time to a result (8.5x faster). It also uses a “leaky bucket” rate limiter to avoid hitting OpenAI rate limits, although a more robust implementation would retry with exponential backoff when a rate limit is hit (see the sketch after the script).

// run with a TypeScript runner, e.g.: npx tsx measure-gpt5-latency.ts
import { performance } from 'node:perf_hooks';
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function measure({ model, reasoning = undefined }: { model: string; reasoning?: string }) {
    const start = performance.now();
    let firstTokenAt = null;
    let tokens = 0;

    const params: OpenAI.Responses.ResponseCreateParams = {
        model,
        input: [
            { role: 'developer', content: 'Answer concisely.' },
            {
                role: 'user',
                content: 'Explain what a hash map is, when to use it and when simpler data structures might be better.'
            },
        ],
        stream: true,
    };

    // pass the requested reasoning effort straight through; supported values vary by model
    if (reasoning) {
        // @ts-ignore - aiming for forward compatibility with values the SDK types may not list yet (e.g. 'minimal', 'none')
        params.reasoning = { effort: reasoning };
    }
    const stream = await client.responses.create(params);

    for await (const chunk of stream) {
        const now = performance.now();
        if (!firstTokenAt) firstTokenAt = now; // timestamp of the first streamed event (used as the TTFT proxy)
        if (chunk.type === 'response.output_text.delta') {
            const delta = chunk?.delta ?? '';
            tokens += delta.length > 0 ? 1 : 0; // rough proxy if you can’t count tokens client-side
        }
    }

    const end = performance.now();
    return {
        ttftMs: Math.round((firstTokenAt ?? end) - start),
        timeTakenMs: Math.round(end - start),
        tokens,
    };
}

class LeakyBucket {
    private capacity: number;
    private interval: number;
    private lastRefill: number;
    private tokens: number;
    private queue: (() => void)[] = [];

    constructor(capacity: number, requestsPerSecond: number) {
        this.capacity = capacity;
        this.interval = 1000 / requestsPerSecond;
        this.tokens = capacity;
        this.lastRefill = performance.now();
    }

    async acquire(): Promise<void> {
        return new Promise((resolve) => {
            this.queue.push(resolve);
            this.processQueue();
        });
    }

    private processQueue() {
        const now = performance.now();
        const elapsed = now - this.lastRefill;
        const newTokens = Math.floor(elapsed / this.interval);

        if (newTokens > 0) {
            this.tokens = Math.min(this.capacity, this.tokens + newTokens);
            this.lastRefill = now;
        }

        while (this.queue.length > 0 && this.tokens > 0) {
            this.tokens--;
            const next = this.queue.shift();
            if (next) next();
        }

        if (this.queue.length > 0) {
            // Schedule next check
            const waitTime = this.interval - (performance.now() - this.lastRefill);
            setTimeout(() => this.processQueue(), Math.max(0, waitTime));
        }
    }
}

// Global rate limiter: 5 requests per second, burst up to 10
const rateLimiter = new LeakyBucket(10, 5);

(async () => {
    const overallStart = performance.now();
    const startTimeStr = new Date().toLocaleString();
    console.log(`Starting latency measurements at ${startTimeStr}`);

    if (!process.env.OPENAI_API_KEY) {
        console.error("Error: OPENAI_API_KEY environment variable is not set.");
        process.exit(1);
    }

    const configs = [
        { model: 'gpt-5-nano', reasoning: 'minimal' },
        { model: 'gpt-5-nano', reasoning: 'low' },
        { model: 'gpt-5-nano', reasoning: 'medium' },
        { model: 'gpt-5-nano', reasoning: 'high' },
        { model: 'gpt-5-mini', reasoning: 'minimal' },
        { model: 'gpt-5-mini', reasoning: 'low' },
        { model: 'gpt-5-mini', reasoning: 'medium' },
        { model: 'gpt-5-mini', reasoning: 'high' },
        { model: 'gpt-5', reasoning: 'minimal' },
        { model: 'gpt-5', reasoning: 'low' },
        { model: 'gpt-5', reasoning: 'medium' },
        { model: 'gpt-5', reasoning: 'high' },
        { model: 'gpt-5.1', reasoning: 'none' },
        { model: 'gpt-5.1', reasoning: 'low' },
        { model: 'gpt-5.1', reasoning: 'medium' },
        { model: 'gpt-5.1', reasoning: 'high' },
        { model: 'gpt-5.2', reasoning: 'none' },
        { model: 'gpt-5.2', reasoning: 'low' },
        { model: 'gpt-5.2', reasoning: 'medium' },
        { model: 'gpt-5.2', reasoning: 'high' },
    ];

    const RUNS = 30;
    const CONFIG_CONCURRENCY = Number(process.env.CONFIG_CONCURRENCY ?? '1');
    const results = [];

    function percentile(arr: number[], p: number) {
        if (arr.length === 0) return NaN as unknown as number;
        const sorted = [...arr].sort((a, b) => a - b);
        const rank = Math.ceil(p * sorted.length) - 1; // nearest-rank method
        const idx = Math.min(sorted.length - 1, Math.max(0, rank));
        return sorted[idx];
    }

    function stats(arr: number[]) {
        if (arr.length === 0) return { min: 0, max: 0, mean: 0, spread: 0, median: 0, p90: 0 };
        const min = Math.min(...arr);
        const max = Math.max(...arr);
        const mean = arr.reduce((a, b) => a + b, 0) / arr.length;
        const spread = max - min;
        const p50 = percentile(arr, 0.5);
        const p90 = percentile(arr, 0.9);
        return { min, max, mean: Math.round(mean), spread, median: Math.round(p50), p90: Math.round(p90) };
    }

    function fmt(ms: number) {
        const s = Math.floor(ms / 1000);
        const m = Math.floor(s / 60);
        const h = Math.floor(m / 60);
        const remMs = ms % 1000;
        const remS = s % 60;
        const remM = m % 60;
        const hh = h > 0 ? String(h).padStart(2, '0') + ':' : '';
        const mm = String(remM).padStart(2, '0');
        const ss = String(remS).padStart(2, '0');
        const msStr = String(remMs).padStart(3, '0');
        return `${hh}${mm}:${ss}.${msStr}`;
    }

    const totalConfigs = configs.length;
    let printedTotal = false;

    async function runConfig(cfg: { model: string; reasoning: string }, index: number) {
        const cfgStart = performance.now();
        const cfgLabel = `Config ${index + 1}/${totalConfigs}: model=${cfg.model}, reasoning=${cfg.reasoning}`;
        console.log(`[${new Date().toLocaleTimeString()}] ${cfgLabel} — starting ${RUNS} runs`);
        const ttfts: number[] = [];
        const timeTaken: number[] = [];
        const tokens: number[] = [];

        const runPromises = Array.from({ length: RUNS }, (_, i) => (async () => {
            await rateLimiter.acquire();
            console.log(`  Run ${i + 1}/${RUNS} — model=${cfg.model}, reasoning=${cfg.reasoning}`);
            try {
                const r = await measure(cfg);
                const elapsedSoFar = Math.round(performance.now() - overallStart);
                console.log(`  Completed run ${i + 1}/${RUNS}: ttft=${r.ttftMs}ms timeTaken=${r.timeTakenMs}ms — elapsed so far=${fmt(elapsedSoFar)}`);
                ttfts.push(r.ttftMs);
                timeTaken.push(r.timeTakenMs);
                tokens.push(r.tokens);
            } catch (err) {
                const elapsedSoFar = Math.round(performance.now() - overallStart);
                console.warn(`  Run ${i + 1}/${RUNS} failed — elapsed so far=${fmt(elapsedSoFar)}:`, err);
            }
        })());

        await Promise.allSettled(runPromises);

        const ttftStats = stats(ttfts);
        const timeTakenStats = stats(timeTaken);
        const meanTokens = tokens.length ? Math.round(tokens.reduce((a, b) => a + b, 0) / tokens.length) : 0; // guard against all runs failing

        results.push({
            ...cfg,
            runs: RUNS,
            ttftMinMs: ttftStats.min,
            ttftMaxMs: ttftStats.max,
            ttftMeanMs: ttftStats.mean,
            ttftMedianMs: ttftStats.median,
            ttftSpreadMs: ttftStats.spread,
            ttftP90Ms: ttftStats.p90,
            timeTakenMinMs: timeTakenStats.min,
            timeTakenMaxMs: timeTakenStats.max,
            timeTakenMeanMs: timeTakenStats.mean,
            timeTakenMedianMs: timeTakenStats.median,
            timeTakenSpreadMs: timeTakenStats.spread,
            timeTakenP90Ms: timeTakenStats.p90,
            meanTokens: meanTokens,
        });

        const cfgElapsed = Math.round(performance.now() - cfgStart);
        console.log(`[${new Date().toLocaleTimeString()}] Finished ${cfgLabel} — took ${fmt(cfgElapsed)}; total elapsed=${fmt(Math.round(performance.now() - overallStart))}`);
    }

    const workers = Math.max(1, Math.min(CONFIG_CONCURRENCY, totalConfigs));
    let nextIndex = 0;
    await Promise.all(Array.from({ length: workers }, async () => {
        while (true) {
            const idx = nextIndex++;
            if (idx >= totalConfigs) break;
            await runConfig(configs[idx], idx);
        }
    }));

    console.table(results);
    const totalElapsed = Math.round(performance.now() - overallStart);
    console.log(`All configs complete. Total time: ${fmt(totalElapsed)}`);
})();
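As mentioned above, a more robust alternative to the leaky bucket is to retry with exponential backoff when a rate limit is hit. A minimal sketch follows; withBackoff is an illustrative helper, not part of the OpenAI SDK (the SDK also has its own built‑in maxRetries option).

// Sketch: retry a request with exponential backoff plus jitter on HTTP 429.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (err: any) {
            lastError = err;
            const isRateLimit = err?.status === 429;
            if (!isRateLimit || attempt === maxAttempts - 1) throw err;
            const delayMs = 1000 * 2 ** attempt + Math.random() * 250; // 1s, 2s, 4s, ... plus jitter
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    throw lastError;
}

// Usage inside the run loop: const r = await withBackoff(() => measure(cfg));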

Example results

| Index | Model | Reasoning | Runs | ttftMinMs | ttftMaxMs | ttftMeanMs | ttftMedianMs | ttftSpreadMs | ttftP90Ms | timeTakenMinMs | timeTakenMaxMs | timeTakenMeanMs | timeTakenMedianMs | timeTakenSpreadMs | timeTakenP90Ms | meanTokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | gpt-5-nano | minimal | 30 | 217 | 519 | 322 | 301 | 302 | 416 | 3610 | 7095 | 4746 | 4514 | 3485 | 5603 | 476 |
| 1 | gpt-5-nano | low | 30 | 158 | 368 | 241 | 233 | 210 | 309 | 5020 | 9731 | 6770 | 6696 | 4711 | 8236 | 472 |
| 2 | gpt-5-nano | medium | 30 | 171 | 414 | 267 | 252 | 243 | 352 | 12093 | 20257 | 16936 | 16792 | 8164 | 18950 | 455 |
| 3 | gpt-5-nano | high | 30 | 171 | 439 | 268 | 252 | 268 | 320 | 22910 | 62091 | 39188 | 39328 | 39181 | 46039 | 504 |
| 4 | gpt-5-mini | minimal | 30 | 224 | 941 | 353 | 327 | 717 | 391 | 7087 | 46279 | 10646 | 8761 | 39192 | 12593 | 634 |
| 5 | gpt-5-mini | low | 30 | 184 | 1296 | 309 | 283 | 1112 | 332 | 7777 | 12145 | 9788 | 9723 | 4368 | 11079 | 623 |
| 6 | gpt-5-mini | medium | 30 | 176 | 395 | 283 | 277 | 219 | 351 | 10033 | 15693 | 13118 | 12799 | 5660 | 14747 | 590 |
| 7 | gpt-5-mini | high | 30 | 171 | 358 | 280 | 270 | 187 | 325 | 19461 | 44936 | 29784 | 27781 | 25475 | 39261 | 558 |
| 8 | gpt-5 | minimal | 30 | 188 | 678 | 320 | 295 | 490 | 355 | 5347 | 11158 | 6954 | 6416 | 5811 | 8394 | 297 |
| 9 | gpt-5 | low | 30 | 183 | 417 | 298 | 297 | 234 | 352 | 8817 | 27184 | 14211 | 12461 | 18367 | 21695 | 346 |
| 10 | gpt-5 | medium | 30 | 190 | 1573 | 322 | 273 | 1383 | 354 | 15584 | 84439 | 25958 | 23497 | 68855 | 31028 | 356 |
| 11 | gpt-5 | high | 30 | 218 | 410 | 307 | 300 | 192 | 379 | 24975 | 59702 | 39029 | 37450 | 34727 | 50854 | 336 |
| 12 | gpt-5.1 | none | 30 | 221 | 1484 | 329 | 278 | 1263 | 361 | 9332 | 18009 | 12312 | 12162 | 8677 | 15421 | 681 |
| 13 | gpt-5.1 | low | 30 | 181 | 412 | 279 | 267 | 231 | 335 | 9995 | 18688 | 12286 | 11868 | 8693 | 14068 | 762 |
| 14 | gpt-5.1 | medium | 30 | 208 | 356 | 272 | 268 | 148 | 314 | 9372 | 15971 | 12812 | 12529 | 6599 | 14652 | 700 |
| 15 | gpt-5.1 | high | 30 | 177 | 347 | 256 | 264 | 170 | 295 | 9443 | 28959 | 14805 | 14553 | 19516 | 18094 | 773 |
| 16 | gpt-5.2 | none | 30 | 214 | 403 | 270 | 261 | 189 | 319 | 8130 | 15438 | 12044 | 12005 | 7308 | 13357 | 602 |
| 17 | gpt-5.2 | low | 30 | 194 | 412 | 291 | 273 | 218 | 386 | 10612 | 16003 | 12284 | 11923 | 5391 | 14233 | 644 |
| 18 | gpt-5.2 | medium | 30 | 172 | 1398 | 351 | 276 | 1226 | 387 | 11575 | 71061 | 16899 | 12978 | 59486 | 16291 | 669 |
| 19 | gpt-5.2 | high | 30 | 235 | 626 | 455 | 446 | 391 | 578 | 10464 | 14641 | 12858 | 13017 | 4177 | 14170 | 627 |

If you rely on structured (JSON) output or tool calls, test those paths specifically; both add overhead and can change TTFT.
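For example, the measure() parameters could be extended to exercise the tool‑calling path. The get_weather tool below is a hypothetical stand‑in, and the exact tool parameter shape should be checked against the current Responses API reference.

// Sketch: a tool-calling variant of the benchmark request. Time it the same
// way as measure(); here "TTFT" is the time until the model starts streaming
// the function-call arguments, which is what blocks your tool loop.
const toolParams: OpenAI.Responses.ResponseCreateParams = {
    model: 'gpt-5-mini',
    input: [{ role: 'user', content: 'What is the weather in Berlin right now?' }],
    tools: [
        {
            type: 'function',
            name: 'get_weather', // hypothetical tool, never actually executed here
            description: 'Look up the current weather for a city.',
            parameters: {
                type: 'object',
                properties: { city: { type: 'string' } },
                required: ['city'],
                additionalProperties: false,
            },
            strict: true,
        },
    ],
    stream: true,
};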

Mean time taken vs reasoning level

[Figure: mean total time taken (ms) by model and reasoning level]

Key Findings

  1. Newer models prioritise consistency

    • GPT-5.1 and GPT-5.2 show remarkable stability. Their total response times remain nearly flat across all reasoning levels (~12–15 seconds), regardless of whether reasoning is set to ‘low’ or ‘high’, although this could be down to the nature of the prompt, which is fairly simple.
    • In contrast, the smaller or older models, gpt-5-nano and gpt-5, show steep, non-linear latency spikes as reasoning depth increases.
  2. The “Reasoning Penalty” varies by model

    • gpt-5-nano is the fastest model for simple tasks (4.7s at “minimal” reasoning) but degrades severely at “high” reasoning (~39s), making it nearly 8x slower.
    • gpt-5 follows a similar pattern, starting fast (~7s) but matching Nano’s slowness at “high” reasoning (~39s).
    • gpt-5-mini sits in the middle, scaling moderately (~10s → 30s).
  3. High Reasoning Speed Champions

    • For deep reasoning tasks, gpt-5.1 and gpt-5.2 are the clear winners, delivering results 2–3x faster than gpt-5 or gpt-5-nano. Again, the warning being the prompt is straightforward and not complex.
  4. Overall ranking (Speed):

    • Low Reasoning / Simple Tasks: gpt-5-nano > gpt-5 > gpt-5.1/5.2
    • High Reasoning / Complex Tasks: gpt-5.2/5.1 > gpt-5-mini > gpt-5 ≈ gpt-5-nano

Token usage vs reasoning level

[Figure: mean token usage by model and reasoning level]

Key Findings

  1. Conciseness vs. Verbosity

    • gpt-5 is the most concise model by far, consistently producing the fewest tokens (~300–350) to answer the prompt.
    • gpt-5.1 is the most verbose, averaging ~700+ tokens, often double the output of gpt-5 for the same task.
    • gpt-5-mini and gpt-5.2 sit in the middle, generally in the 550–650 token range.
  2. Reasoning doesn’t always equal more tokens

    • Counter-intuitively, higher reasoning levels do not always lead to longer final answers.
    • gpt-5-mini actually produces fewer tokens at ‘high’ reasoning (558) compared to ‘minimal’ (634), suggesting it optimises its final output better after “thinking” more.
    • gpt-5-nano and gpt-5 remain relatively stable in token output regardless of the reasoning setting.
  3. Latency is Compute, not Length

    • Comparing the two graphs reveals a critical insight: the massive latency spikes in gpt-5 and gpt-5-nano at ‘high’ reasoning are not caused by generating more text (token counts are flat). They are caused purely by extra “thinking” compute, distributed through the stream as slower generation and inter-token pauses.

Time-to-First-Token (TTFT) Stability

[Figure: mean time to first token (TTFT) by model and reasoning level]

Key Findings

  1. Reasoning happens during generation

    • The data reveals a surprising trend: TTFT remains flat (~250–350ms) across virtually all models and reasoning levels.
    • This contradicts the common assumption that “reasoning” implies a long “thinking” pause before the first token. Instead, these models appear to distribute their reasoning throughout the stream, generating tokens more slowly (lower tokens/sec) rather than waiting longer to start.
  2. UX Implications

    • Because TTFT is unaffected by reasoning depth, streaming is mandatory. You can deliver an “instant” feeling UI (starting to type in <300ms) even if the full response takes 40 seconds to complete.
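A minimal sketch of that streaming pattern, flushing text deltas as soon as they arrive (here to stdout; a real UI would append them to the chat transcript instead):

// Sketch: render progressively so perceived latency tracks TTFT, not total time.
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await client.responses.create({
    model: 'gpt-5-mini',
    input: 'Summarise the trade-offs of a hash map in two sentences.',
    reasoning: { effort: 'low' },
    stream: true,
});

for await (const event of stream) {
    if (event.type === 'response.output_text.delta') {
        process.stdout.write(event.delta); // flush each delta immediately
    }
}
process.stdout.write('\n');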

Optimisation checklist

  • Stream responses and render progressively; prioritise TTFT.
  • Keep prompts tight; prune unused context and reduce system message verbosity.
  • Set a realistic output token budget (max_output_tokens in the Responses API); don’t over‑budget if you don’t need long answers (see the sketch after this list).
  • Cache immutable prefix prompts (client side) and send only the diff where possible.
  • Prefer shorter tool traces; avoid unnecessary parallel tool calls.
  • Enable gzip/br encodings and HTTP/2 keep‑alive; reuse clients between requests.
  • Run close to the API region you target; minimise cross‑region hops.
  • For background jobs, batch requests off the hot path and set longer timeouts.
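A short sketch of two of the cheaper wins from this checklist: reusing a single client per process and budgeting output tokens explicitly. The shortAnswer helper and the 300‑token budget are illustrative choices.

import OpenAI from 'openai';

// Module-level singleton: reuse connections across requests instead of
// constructing a new client (and new TLS session) per call.
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function shortAnswer(question: string): Promise<string> {
    const response = await client.responses.create({
        model: 'gpt-5-mini',
        input: [
            { role: 'developer', content: 'Answer in at most three sentences.' },
            { role: 'user', content: question },
        ],
        reasoning: { effort: 'low' },
        max_output_tokens: 300, // don't over-budget if you don't need long answers
    });
    return response.output_text;
}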

Measuring in production

  • Capture TTFT and total time per request in your observability layer (e.g. OpenTelemetry).
  • Tag metrics with model and reasoning level so dashboards can alert when p95 drifts.
  • Record token counts to correlate cost, latency, and user outcomes.
  • Keep a small canary suite (stable prompts) that you run periodically to detect regressions independent of user traffic.
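A minimal sketch of the OpenTelemetry side, assuming a MeterProvider is already configured in your service; the metric and attribute names here are illustrative, not a standard.

import { metrics } from '@opentelemetry/api';

// Histograms for the two latency numbers, tagged by model and reasoning level.
const meter = metrics.getMeter('llm-latency');
const ttftHistogram = meter.createHistogram('llm.ttft', { unit: 'ms' });
const totalHistogram = meter.createHistogram('llm.total_time', { unit: 'ms' });

export function recordLatency(model: string, reasoning: string, ttftMs: number, totalMs: number) {
    const attributes = { 'llm.model': model, 'llm.reasoning_effort': reasoning };
    ttftHistogram.record(ttftMs, attributes);
    totalHistogram.record(totalMs, attributes);
}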

Takeaways

  • Reasoning level is a quality/speed dial: use the lowest level that still meets your acceptance criteria.
  • Streaming hides total time but not TTFT; keep TTFT low for perceived performance.
  • Measure with discipline in your own environment; publish medians/p90s to your team so expectations remain realistic.

If you spot materially different numbers in your setup, share your methodology alongside the metrics; context is everything.
