Average response time metrics for OpenAI GPT‑5 models (by reasoning level)

If you ship features on top of GPT‑5, understanding response time is just as important as model quality. This post summarises practical latency expectations when using the OpenAI API with different reasoning levels, explains what actually drives those numbers, and provides a reproducible way to measure and monitor them in your stack.

Note: Numbers in this article are directional and environment‑dependent. They vary by prompt length, output length, system load, region, networking, and streaming vs. non‑streaming usage. Treat them as planning guidance, not hard SLAs.

TL;DR

  • Reasoning level is a massive latency multiplier for the standard models (GPT‑5, GPT‑5 Nano), while newer iterations (GPT‑5.1, GPT‑5.2) can deliver consistently higher speeds, depending on the task/prompt.
  • For complex reasoning tasks, newer models can be 2-3x faster than the base GPT-5 model.
  • Time‑to‑first‑token (TTFT) is the UX‑critical metric for streaming. You can often keep TTFT low even when the total completion time rises.
  • You control more than you think: prompt/response token budgets, parallel function calls, and smart streaming strategies make significant differences.

Latency metrics you should track

  • Time‑to‑first‑token (TTFT): request start → first streamed token received. Primary UX metric for conversational UIs.
  • Tokens per second (TPS): generation throughput once tokens start flowing. Good for sizing progress indicators.
  • Total wall time: request start → stream closed (or full JSON received). Important for background jobs.
  • Server vs. network: separate model compute time from client/edge/network overhead to find the real bottleneck.
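To make the relationships concrete, here is a minimal sketch of how these numbers fall out of three client‑side timestamps. The LatencyMetrics type and toMetrics helper are illustrative, not part of any SDK.

// Sketch only: illustrative types and helper, not part of the OpenAI SDK.
interface LatencyMetrics {
    ttftMs: number;          // request start → first streamed token
    tokensPerSecond: number; // throughput once tokens start flowing
    totalMs: number;         // request start → stream closed
}

function toMetrics(startMs: number, firstTokenMs: number, endMs: number, outputTokens: number): LatencyMetrics {
    const generationMs = Math.max(1, endMs - firstTokenMs); // avoid divide-by-zero on empty outputs
    return {
        ttftMs: Math.round(firstTokenMs - startMs),
        tokensPerSecond: Math.round((outputTokens / generationMs) * 1000),
        totalMs: Math.round(endMs - startMs),
    };
}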

Why reasoning level affects latency

Reasoning adds internal deliberation steps (planning, self‑checking, tool selection). Even when responses are streamed, the model may take longer to produce the first token because it spends more compute on pre‑generation thinking. At higher levels, the model may also generate longer answers, which lengthens total time.

Measurement methodology

Use a small, fixed prompt. Run at least 30 trials per configuration and report medians and P90s, not just averages. Response quality is not assessed here, only speed and token usage.

  • Fix the input prompt: use the same templated 800–1000 token prompt for every run.
  • Constrain output length: steer the model toward short, direct answers.
  • Control network variance: same region, no VPN, warm HTTP/2 connections (see the warm‑up sketch below).
  • Instrument the client: record request start, first streamed token, and stream close with a monotonic clock.
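For the connection‑warming step, one lightweight approach is a single untimed request per process before the measured runs. This is a sketch, assuming the same OpenAI client used in the script below; max_output_tokens keeps the throwaway response tiny.

import OpenAI from 'openai';

// Sketch: one cheap, untimed request to open and warm the HTTPS connection
// so the first measured run doesn't pay the handshake cost.
async function warmUp(client: OpenAI, model: string) {
    await client.responses.create({
        model,
        input: 'ping',
        max_output_tokens: 16,
    });
}

// Usage (once, before the timed runs): await warmUp(client, 'gpt-5-mini');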

Example Node/TypeScript script (streaming; TTFT, total time, min, max, mean, median, and P90)

The following script runs 30 measurements for each model configuration with the same prompt and reports min, max, mean, median, P90, and spread (max − min). It runs the batches in parallel to significantly reduce the overall time to a result (8.5x faster). It also uses a “leaky bucket” rate limiter to avoid hitting OpenAI rate limits, although a more robust implementation would retry with exponential backoff when a rate limit is hit (see the sketch after the script).

// run with a TypeScript runner, e.g.: npx tsx measure-gpt5-latency.ts
import { performance } from 'node:perf_hooks';
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function measure({ model, reasoning = undefined }: { model: string; reasoning?: string }) {
    const start = performance.now();
    let firstTokenAt = null;
    let tokens = 0;

    const params: OpenAI.Responses.ResponseCreateParams = {
        model,
        input: [
            { role: 'developer', content: 'Answer concisely.' },
            {
                role: 'user',
                content: 'Explain what a hash map is, when to use it and when simpler data structures might be better.'
            },
        ],
        stream: true,
    };

    // pass the requested reasoning effort straight through; supported values vary by model
    if (reasoning) {
        // @ts-ignore - aiming for forward compatibility with values the SDK types may not list yet (e.g. 'minimal', 'none')
        params.reasoning = { effort: reasoning };
    }
    const stream = await client.responses.create(params);

    for await (const chunk of stream) {
        const now = performance.now();
        if (!firstTokenAt) firstTokenAt = now; // timestamp of the first streamed event (used as the TTFT proxy)
        if (chunk.type === 'response.output_text.delta') {
            const delta = chunk?.delta ?? '';
            tokens += delta.length > 0 ? 1 : 0; // rough proxy if you can’t count tokens client-side
        }
    }

    const end = performance.now();
    return {
        ttftMs: Math.round((firstTokenAt ?? end) - start),
        timeTakenMs: Math.round(end - start),
        tokens,
    };
}

class LeakyBucket {
    private capacity: number;
    private interval: number;
    private lastRefill: number;
    private tokens: number;
    private queue: (() => void)[] = [];

    constructor(capacity: number, requestsPerSecond: number) {
        this.capacity = capacity;
        this.interval = 1000 / requestsPerSecond;
        this.tokens = capacity;
        this.lastRefill = performance.now();
    }

    async acquire(): Promise<void> {
        return new Promise((resolve) => {
            this.queue.push(resolve);
            this.processQueue();
        });
    }

    private processQueue() {
        const now = performance.now();
        const elapsed = now - this.lastRefill;
        const newTokens = Math.floor(elapsed / this.interval);

        if (newTokens > 0) {
            this.tokens = Math.min(this.capacity, this.tokens + newTokens);
            this.lastRefill = now;
        }

        while (this.queue.length > 0 && this.tokens > 0) {
            this.tokens--;
            const next = this.queue.shift();
            if (next) next();
        }

        if (this.queue.length > 0) {
            // Schedule next check
            const waitTime = this.interval - (performance.now() - this.lastRefill);
            setTimeout(() => this.processQueue(), Math.max(0, waitTime));
        }
    }
}

// Global rate limiter: 5 requests per second, burst up to 10
const rateLimiter = new LeakyBucket(10, 5);

(async () => {
    const overallStart = performance.now();
    const startTimeStr = new Date().toLocaleString();
    console.log(`Starting latency measurements at ${startTimeStr}`);

    if (!process.env.OPENAI_API_KEY) {
        console.error("Error: OPENAI_API_KEY environment variable is not set.");
        process.exit(1);
    }

    const configs = [
        { model: 'gpt-5-nano', reasoning: 'minimal' },
        { model: 'gpt-5-nano', reasoning: 'low' },
        { model: 'gpt-5-nano', reasoning: 'medium' },
        { model: 'gpt-5-nano', reasoning: 'high' },
        { model: 'gpt-5-mini', reasoning: 'minimal' },
        { model: 'gpt-5-mini', reasoning: 'low' },
        { model: 'gpt-5-mini', reasoning: 'medium' },
        { model: 'gpt-5-mini', reasoning: 'high' },
        { model: 'gpt-5', reasoning: 'minimal' },
        { model: 'gpt-5', reasoning: 'low' },
        { model: 'gpt-5', reasoning: 'medium' },
        { model: 'gpt-5', reasoning: 'high' },
        { model: 'gpt-5.1', reasoning: 'none' },
        { model: 'gpt-5.1', reasoning: 'low' },
        { model: 'gpt-5.1', reasoning: 'medium' },
        { model: 'gpt-5.1', reasoning: 'high' },
        { model: 'gpt-5.2', reasoning: 'none' },
        { model: 'gpt-5.2', reasoning: 'low' },
        { model: 'gpt-5.2', reasoning: 'medium' },
        { model: 'gpt-5.2', reasoning: 'high' },
    ];

    const RUNS = 30;
    const CONFIG_CONCURRENCY = Number(process.env.CONFIG_CONCURRENCY ?? '1');
    const results = [];

    function percentile(arr: number[], p: number) {
        if (arr.length === 0) return NaN as unknown as number;
        const sorted = [...arr].sort((a, b) => a - b);
        const rank = Math.ceil(p * sorted.length) - 1; // nearest-rank method
        const idx = Math.min(sorted.length - 1, Math.max(0, rank));
        return sorted[idx];
    }

    function stats(arr: number[]) {
        if (arr.length === 0) return { min: 0, max: 0, mean: 0, spread: 0, median: 0, p90: 0 };
        const min = Math.min(...arr);
        const max = Math.max(...arr);
        const mean = arr.reduce((a, b) => a + b, 0) / arr.length;
        const spread = max - min;
        const p50 = percentile(arr, 0.5);
        const p90 = percentile(arr, 0.9);
        return { min, max, mean: Math.round(mean), spread, median: Math.round(p50), p90: Math.round(p90) };
    }

    function fmt(ms: number) {
        const s = Math.floor(ms / 1000);
        const m = Math.floor(s / 60);
        const h = Math.floor(m / 60);
        const remMs = ms % 1000;
        const remS = s % 60;
        const remM = m % 60;
        const hh = h > 0 ? String(h).padStart(2, '0') + ':' : '';
        const mm = String(remM).padStart(2, '0');
        const ss = String(remS).padStart(2, '0');
        const msStr = String(remMs).padStart(3, '0');
        return `${hh}${mm}:${ss}.${msStr}`;
    }

    const totalConfigs = configs.length;
    let printedTotal = false;

    async function runConfig(cfg: { model: string; reasoning: string }, index: number) {
        const cfgStart = performance.now();
        const cfgLabel = `Config ${index + 1}/${totalConfigs}: model=${cfg.model}, reasoning=${cfg.reasoning}`;
        console.log(`[${new Date().toLocaleTimeString()}] ${cfgLabel} — starting ${RUNS} runs`);
        const ttfts: number[] = [];
        const timeTaken: number[] = [];
        const tokens: number[] = [];

        const runPromises = Array.from({ length: RUNS }, (_, i) => (async () => {
            await rateLimiter.acquire();
            console.log(`  Run ${i + 1}/${RUNS} — model=${cfg.model}, reasoning=${cfg.reasoning}`);
            try {
                const r = await measure(cfg);
                const elapsedSoFar = Math.round(performance.now() - overallStart);
                console.log(`  Completed run ${i + 1}/${RUNS}: ttft=${r.ttftMs}ms timeTaken=${r.timeTakenMs}ms — elapsed so far=${fmt(elapsedSoFar)}`);
                ttfts.push(r.ttftMs);
                timeTaken.push(r.timeTakenMs);
                tokens.push(r.tokens);
            } catch (err) {
                const elapsedSoFar = Math.round(performance.now() - overallStart);
                console.warn(`  Run ${i + 1}/${RUNS} failed — elapsed so far=${fmt(elapsedSoFar)}:`, err);
            }
        })());

        await Promise.allSettled(runPromises);

        const ttftStats = stats(ttfts);
        const timeTakenStats = stats(timeTaken);
        const meanTokens = tokens.length ? Math.round(tokens.reduce((a, b) => a + b, 0) / tokens.length) : 0; // guard against all runs failing

        results.push({
            ...cfg,
            runs: RUNS,
            ttftMinMs: ttftStats.min,
            ttftMaxMs: ttftStats.max,
            ttftMeanMs: ttftStats.mean,
            ttftMedianMs: ttftStats.median,
            ttftSpreadMs: ttftStats.spread,
            ttftP90Ms: ttftStats.p90,
            timeTakenMinMs: timeTakenStats.min,
            timeTakenMaxMs: timeTakenStats.max,
            timeTakenMeanMs: timeTakenStats.mean,
            timeTakenMedianMs: timeTakenStats.median,
            timeTakenSpreadMs: timeTakenStats.spread,
            timeTakenP90Ms: timeTakenStats.p90,
            meanTokens: meanTokens,
        });

        const cfgElapsed = Math.round(performance.now() - cfgStart);
        console.log(`[${new Date().toLocaleTimeString()}] Finished ${cfgLabel} — took ${fmt(cfgElapsed)}; total elapsed=${fmt(Math.round(performance.now() - overallStart))}`);
    }

    const workers = Math.max(1, Math.min(CONFIG_CONCURRENCY, totalConfigs));
    let nextIndex = 0;
    await Promise.all(Array.from({ length: workers }, async () => {
        while (true) {
            const idx = nextIndex++;
            if (idx >= totalConfigs) break;
            await runConfig(configs[idx], idx);
        }
    }));

    console.table(results);
    const totalElapsed = Math.round(performance.now() - overallStart);
    console.log(`All configs complete. Total time: ${fmt(totalElapsed)}`);
})();
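As mentioned above, a more robust alternative to the leaky bucket is to retry with exponential backoff when a rate limit is hit. A minimal sketch follows; withBackoff is an illustrative helper, not part of the OpenAI SDK (the SDK also has its own built‑in maxRetries option).

// Sketch: retry a request with exponential backoff plus jitter on HTTP 429.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (err: any) {
            lastError = err;
            const isRateLimit = err?.status === 429;
            if (!isRateLimit || attempt === maxAttempts - 1) throw err;
            const delayMs = 1000 * 2 ** attempt + Math.random() * 250; // 1s, 2s, 4s, ... plus jitter
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    throw lastError;
}

// Usage inside the run loop: const r = await withBackoff(() => measure(cfg));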

Example results

| Index | Model | Reasoning | Runs | ttftMinMs | ttftMaxMs | ttftMeanMs | ttftMedianMs | ttftSpreadMs | ttftP90Ms | timeTakenMinMs | timeTakenMaxMs | timeTakenMeanMs | timeTakenMedianMs | timeTakenSpreadMs | timeTakenP90Ms | meanTokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | gpt-5-nano | minimal | 30 | 217 | 519 | 322 | 301 | 302 | 416 | 3610 | 7095 | 4746 | 4514 | 3485 | 5603 | 476 |
| 1 | gpt-5-nano | low | 30 | 158 | 368 | 241 | 233 | 210 | 309 | 5020 | 9731 | 6770 | 6696 | 4711 | 8236 | 472 |
| 2 | gpt-5-nano | medium | 30 | 171 | 414 | 267 | 252 | 243 | 352 | 12093 | 20257 | 16936 | 16792 | 8164 | 18950 | 455 |
| 3 | gpt-5-nano | high | 30 | 171 | 439 | 268 | 252 | 268 | 320 | 22910 | 62091 | 39188 | 39328 | 39181 | 46039 | 504 |
| 4 | gpt-5-mini | minimal | 30 | 224 | 941 | 353 | 327 | 717 | 391 | 7087 | 46279 | 10646 | 8761 | 39192 | 12593 | 634 |
| 5 | gpt-5-mini | low | 30 | 184 | 1296 | 309 | 283 | 1112 | 332 | 7777 | 12145 | 9788 | 9723 | 4368 | 11079 | 623 |
| 6 | gpt-5-mini | medium | 30 | 176 | 395 | 283 | 277 | 219 | 351 | 10033 | 15693 | 13118 | 12799 | 5660 | 14747 | 590 |
| 7 | gpt-5-mini | high | 30 | 171 | 358 | 280 | 270 | 187 | 325 | 19461 | 44936 | 29784 | 27781 | 25475 | 39261 | 558 |
| 8 | gpt-5 | minimal | 30 | 188 | 678 | 320 | 295 | 490 | 355 | 5347 | 11158 | 6954 | 6416 | 5811 | 8394 | 297 |
| 9 | gpt-5 | low | 30 | 183 | 417 | 298 | 297 | 234 | 352 | 8817 | 27184 | 14211 | 12461 | 18367 | 21695 | 346 |
| 10 | gpt-5 | medium | 30 | 190 | 1573 | 322 | 273 | 1383 | 354 | 15584 | 84439 | 25958 | 23497 | 68855 | 31028 | 356 |
| 11 | gpt-5 | high | 30 | 218 | 410 | 307 | 300 | 192 | 379 | 24975 | 59702 | 39029 | 37450 | 34727 | 50854 | 336 |
| 12 | gpt-5.1 | none | 30 | 221 | 1484 | 329 | 278 | 1263 | 361 | 9332 | 18009 | 12312 | 12162 | 8677 | 15421 | 681 |
| 13 | gpt-5.1 | low | 30 | 181 | 412 | 279 | 267 | 231 | 335 | 9995 | 18688 | 12286 | 11868 | 8693 | 14068 | 762 |
| 14 | gpt-5.1 | medium | 30 | 208 | 356 | 272 | 268 | 148 | 314 | 9372 | 15971 | 12812 | 12529 | 6599 | 14652 | 700 |
| 15 | gpt-5.1 | high | 30 | 177 | 347 | 256 | 264 | 170 | 295 | 9443 | 28959 | 14805 | 14553 | 19516 | 18094 | 773 |
| 16 | gpt-5.2 | none | 30 | 214 | 403 | 270 | 261 | 189 | 319 | 8130 | 15438 | 12044 | 12005 | 7308 | 13357 | 602 |
| 17 | gpt-5.2 | low | 30 | 194 | 412 | 291 | 273 | 218 | 386 | 10612 | 16003 | 12284 | 11923 | 5391 | 14233 | 644 |
| 18 | gpt-5.2 | medium | 30 | 172 | 1398 | 351 | 276 | 1226 | 387 | 11575 | 71061 | 16899 | 12978 | 59486 | 16291 | 669 |
| 19 | gpt-5.2 | high | 30 | 235 | 626 | 455 | 446 | 391 | 578 | 10464 | 14641 | 12858 | 13017 | 4177 | 14170 | 627 |

If you rely on structured (JSON) output or tool calls, test those paths specifically; both add overhead and can change TTFT.
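For example, the measure() parameters could be extended to exercise the tool‑calling path. The get_weather tool below is a hypothetical stand‑in, and the exact tool parameter shape should be checked against the current Responses API reference.

// Sketch: a tool-calling variant of the benchmark request. Time it the same
// way as measure(); here "TTFT" is the time until the model starts streaming
// the function-call arguments, which is what blocks your tool loop.
const toolParams: OpenAI.Responses.ResponseCreateParams = {
    model: 'gpt-5-mini',
    input: [{ role: 'user', content: 'What is the weather in Berlin right now?' }],
    tools: [
        {
            type: 'function',
            name: 'get_weather', // hypothetical tool, never actually executed here
            description: 'Look up the current weather for a city.',
            parameters: {
                type: 'object',
                properties: { city: { type: 'string' } },
                required: ['city'],
                additionalProperties: false,
            },
            strict: true,
        },
    ],
    stream: true,
};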

Mean time taken vs reasoning level

[Figure: mean total time taken (ms) by model and reasoning level]

Key Findings

  1. Newer models prioritise consistency

    • GPT-5.1 and GPT-5.2 show remarkable stability. Their total response times remain nearly flat across all reasoning levels (~12–15 seconds), regardless of whether reasoning is set to ‘low’ or ‘high’, although this could be down to the nature of the prompt, which is fairly simple.
    • In contrast, the smaller or older models, gpt-5-nano and gpt-5, show steep, non-linear latency spikes as reasoning depth increases.
  2. The “Reasoning Penalty” varies by model

    • gpt-5-nano is the fastest model for simple tasks (4.7s at “minimal” reasoning) but degrades severely at “high” reasoning (~39s), making it nearly 8x slower.
    • gpt-5 follows a similar pattern, starting fast (~7s) but matching Nano’s slowness at “high” reasoning (~39s).
    • gpt-5-mini sits in the middle, scaling moderately (~10s → 30s).
  3. High Reasoning Speed Champions

    • For deep reasoning tasks, gpt-5.1 and gpt-5.2 are the clear winners, delivering results 2–3x faster than gpt-5 or gpt-5-nano. Again, the warning being the prompt is straightforward and not complex.
  4. Overall ranking (Speed):

    • Low Reasoning / Simple Tasks: gpt-5-nano > gpt-5 > gpt-5.1/5.2
    • High Reasoning / Complex Tasks: gpt-5.2/5.1 > gpt-5-mini > gpt-5 ≈ gpt-5-nano

Token usage vs reasoning level

[Figure: mean token usage by model and reasoning level]

Key Findings

  1. Conciseness vs. Verbosity

    • gpt-5 is the most concise model by far, consistently producing the fewest tokens (~300–350) to answer the prompt.
    • gpt-5.1 is the most verbose, averaging ~700+ tokens, often double the output of gpt-5 for the same task.
    • gpt-5-mini and gpt-5.2 sit in the middle, generally in the 550–650 token range.
  2. Reasoning doesn’t always equal more tokens

    • Counter-intuitively, higher reasoning levels do not always lead to longer final answers.
    • gpt-5-mini actually produces fewer tokens at ‘high’ reasoning (558) compared to ‘minimal’ (634), suggesting it optimises its final output better after “thinking” more.
    • gpt-5-nano and gpt-5 remain relatively stable in token output regardless of the reasoning setting.
  3. Latency is Compute, not Length

    • Comparing the two graphs reveals a critical insight: the massive latency spikes in gpt-5 and gpt-5-nano at ‘high’ reasoning are not caused by generating more text (token counts are flat). They are caused purely by extra “thinking” compute, distributed through the stream as slower generation and inter-token pauses.

Time-to-First-Token (TTFT) Stability

[Figure: mean time to first token (TTFT) by model and reasoning level]

Key Findings

  1. Reasoning happens during generation

    • The data reveals a surprising trend: TTFT remains flat (~250–350ms) across virtually all models and reasoning levels.
    • This contradicts the common assumption that “reasoning” implies a long “thinking” pause before the first token. Instead, these models appear to distribute their reasoning throughout the stream, generating tokens more slowly (lower tokens/sec) rather than waiting longer to start.
  2. UX Implications

    • Because TTFT is unaffected by reasoning depth, streaming is mandatory. You can deliver an “instant” feeling UI (starting to type in <300ms) even if the full response takes 40 seconds to complete.
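A minimal sketch of that streaming pattern, flushing text deltas as soon as they arrive (here to stdout; a real UI would append them to the chat transcript instead):

// Sketch: render progressively so perceived latency tracks TTFT, not total time.
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await client.responses.create({
    model: 'gpt-5-mini',
    input: 'Summarise the trade-offs of a hash map in two sentences.',
    reasoning: { effort: 'low' },
    stream: true,
});

for await (const event of stream) {
    if (event.type === 'response.output_text.delta') {
        process.stdout.write(event.delta); // flush each delta immediately
    }
}
process.stdout.write('\n');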

Optimisation checklist

  • Stream responses and render progressively; prioritise TTFT.
  • Keep prompts tight; prune unused context and reduce system message verbosity.
  • Set a realistic output token budget (max_output_tokens in the Responses API); don’t over‑budget if you don’t need long answers (see the sketch after this list).
  • Cache immutable prefix prompts (client side) and send only the diff where possible.
  • Prefer shorter tool traces; avoid unnecessary parallel tool calls.
  • Enable gzip/br encodings and HTTP/2 keep‑alive; reuse clients between requests.
  • Run close to the API region you target; minimise cross‑region hops.
  • For background jobs, batch requests off the hot path and set longer timeouts.
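A short sketch of two of the cheaper wins from this checklist: reusing a single client per process and budgeting output tokens explicitly. The shortAnswer helper and the 300‑token budget are illustrative choices.

import OpenAI from 'openai';

// Module-level singleton: reuse connections across requests instead of
// constructing a new client (and new TLS session) per call.
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function shortAnswer(question: string): Promise<string> {
    const response = await client.responses.create({
        model: 'gpt-5-mini',
        input: [
            { role: 'developer', content: 'Answer in at most three sentences.' },
            { role: 'user', content: question },
        ],
        reasoning: { effort: 'low' },
        max_output_tokens: 300, // don't over-budget if you don't need long answers
    });
    return response.output_text;
}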

Measuring in production

  • Capture TTFT and total time per request in your observability layer (e.g. OpenTelemetry).
  • Tag metrics with model and reasoning level so dashboards can alert when p95 drifts.
  • Record token counts to correlate cost, latency, and user outcomes.
  • Keep a small canary suite (stable prompts) that you run periodically to detect regressions independent of user traffic.
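A minimal sketch of the OpenTelemetry side, assuming a MeterProvider is already configured in your service; the metric and attribute names here are illustrative, not a standard.

import { metrics } from '@opentelemetry/api';

// Histograms for the two latency numbers, tagged by model and reasoning level.
const meter = metrics.getMeter('llm-latency');
const ttftHistogram = meter.createHistogram('llm.ttft', { unit: 'ms' });
const totalHistogram = meter.createHistogram('llm.total_time', { unit: 'ms' });

export function recordLatency(model: string, reasoning: string, ttftMs: number, totalMs: number) {
    const attributes = { 'llm.model': model, 'llm.reasoning_effort': reasoning };
    ttftHistogram.record(ttftMs, attributes);
    totalHistogram.record(totalMs, attributes);
}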

Takeaways

  • Reasoning level is a quality/speed dial: use the lowest level that still meets your acceptance criteria.
  • Streaming hides total time but not TTFT; keep TTFT low for perceived performance.
  • Measure with discipline in your own environment; publish medians/p90s to your team so expectations remain realistic.

If you spot materially different numbers in your setup, share your methodology alongside the metrics; context is everything.
