
Build Hour: GPT-5

OpenAI · 6 min read

Based on OpenAI's video on YouTube.

TL;DR

GPT-5 is framed as a major step forward for coding quality and front-end UI generation, with stronger instruction adherence and better long-horizon tool-chaining.

Briefing

GPT-5 is positioned as OpenAI’s “smartest, most steerable” coding model yet—built to produce higher-quality code, handle long-running agentic workflows with many tool calls, and follow instructions more literally when prompts are precise. The core message from the build hour is practical: developers can get better results not just by switching models, but by using the Responses API correctly—especially by choosing the right reasoning mode, enabling stateful reasoning when appropriate, and tuning output controls like minimal reasoning and verbosity.

On capabilities, the session emphasized three areas. First, GPT-5’s coding performance shows a “step function” improvement in code quality, including front-end UI generation that looks more polished out of the box. Second, it’s designed for long-horizon tasks where the model must chain tool calls end-to-end—planning, calling tools, checking conditions, and correcting course when it starts down the wrong path. Third, GPT-5 adds new control knobs for reasoning and output behavior. A new reasoning parameter called “minimal” is described as doing the least reasoning needed to keep latency closer to non-reasoning models, while still retaining the intelligence of a reasoning model. Another parameter, “verbosity,” controls how much final output the model produces and even affects tool-call formatting.

A major technical focus landed on the Responses API as the recommended way to use GPT-5. Compared with the older Chat Completions API, Responses is framed as a more feature-rich “v2” that improves developer experience (like easier output text handling) and—crucially—supports statefulness by default. When state is enabled, GPT-5 can emit intermediate “reasoning items” (chain-of-thought tokens) that can be passed back after tool calls so the model continues its internal reasoning coherently across steps. For teams that can’t use statefulness, the session described an alternative: encrypted reasoning content that preserves the benefits without exposing raw intermediate tokens.
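The stateful pattern the session describes can be sketched as request parameters for the Responses API. This is an illustrative payload, not a verbatim demo from the session; the id "resp_123" is a hypothetical placeholder for an id returned by a previous `client.responses.create` call.

```python
# Sketch of a stateful follow-up turn with the Responses API.
# "resp_123" is a hypothetical response id from an earlier call.

def build_follow_up(prev_response_id: str, user_input: str) -> dict:
    """Build kwargs for a follow-up turn that reuses server-side state.

    With store=True, the API retains the previous turn's reasoning items,
    so chaining via previous_response_id lets GPT-5 continue its internal
    reasoning coherently across tool-call turns.
    """
    return {
        "model": "gpt-5",
        "input": user_input,
        "previous_response_id": prev_response_id,
        "store": True,
    }

params = build_follow_up("resp_123", "Now run the tests and fix any failures.")
# With a configured client, the live call would be:
#   from openai import OpenAI
#   client = OpenAI()
#   print(client.responses.create(**params).output_text)
```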

The build hour also connected these mechanics to measurable outcomes. In long agentic rollouts with 20–50 tool calls, preserving reasoning continuity can translate into small but meaningful benchmark differences (cited as roughly 2–4% on SWE-bench). The session also highlighted prompt caching behavior: caching depends on the prefix, and reasoning items are part of what gets included in that prefix, improving both cost efficiency and speed when requests share the same prefix.
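The prefix-matching behavior can be made concrete with a small sketch: two requests that keep their leading input items byte-identical share a cacheable prefix. The names and messages here are illustrative, not from the session.

```python
# Illustrative sketch of prefix-based prompt caching: keep the system
# prompt and earlier conversation items (including any reasoning items
# passed back from previous turns) identical across requests, and only
# the tail of the input differs.

SYSTEM = {"role": "system", "content": "You are a coding agent. Use tools when needed."}

def request_with_history(history: list, new_item: dict) -> dict:
    # The shared prefix (SYSTEM + history) stays identical call to call.
    return {"model": "gpt-5", "input": [SYSTEM, *history, new_item]}

history = [{"role": "user", "content": "Refactor utils.py"}]
r1 = request_with_history(history, {"role": "user", "content": "Now add tests"})
r2 = request_with_history(history, {"role": "user", "content": "Now add docs"})

# Count the leading items the two requests have in common — that shared
# prefix is what prompt caching can reuse.
shared = [a for a, b in zip(r1["input"], r2["input"]) if a == b]
```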

Demos made the model’s steering tangible. In Cursor, a simple prompt generated a functioning landing page for “vibe coder,” with GPT-5 producing working UI elements. In Codex CLI, GPT-5 was used to build a Minecraft clone from an empty folder, producing a playable world after several minutes, then iterating on visuals (adding pink flowers and improving rendering) through additional prompts. The research side framed future work around better web development, tighter “loop” iteration with testing, and more coherent long-horizon agent behavior—imagining an “agent slider” that lets users choose between quick help and extended autonomous work.

Finally, prompting guidance stressed that GPT-5 follows instructions more literally than older models, so conflicting or vague instructions can hurt performance. Practical advice included removing contradictions, selecting appropriate reasoning effort (starting medium, using low for latency-sensitive tasks and high for long tasks), using XML as a prompt structure based on internal tests, and “metaprompting” by asking the model why it did something and then correcting based on those reasons. The session closed with an applied example from Charlie Labs: an autonomous TypeScript-focused coding agent (“Charlie”) that uses GPT-5 via the Responses API, integrates with GitHub/Linear/Slack, runs tests in a VM, and creates PRs and issues—reporting improvements in internal evals and strong head-to-head results versus Claude Code.

Cornell Notes

GPT-5 is presented as a coding-first, highly steerable reasoning model designed for both high-quality code generation and long-running agentic tasks. The biggest practical lever is using the Responses API: it supports stateful tool-calling by default and can pass “reasoning items” (or encrypted reasoning content) back after tool calls so the model maintains coherent internal reasoning across many steps. Developers can tune latency and output with parameters like minimal reasoning and verbosity, and can improve reliability by writing precise, non-conflicting prompts. Prompting guidance also recommends structured formats (XML in internal tests), choosing appropriate reasoning effort, and using metaprompting (ask why, then fix) to reduce “slop.” These changes matter most for multi-tool workflows where small reasoning-continuity gains compound over 20–50 tool calls.

Why does the Responses API matter more than just swapping in GPT-5?

Responses API is framed as the “v2” path for GPT-5’s reasoning and tool-calling strengths. It supports statefulness by default (store=true), letting GPT-5 emit reasoning items that can be passed back after tool calls so the model continues its internal chain of thought coherently. If statefulness can’t be used, Responses can carry the same benefit via encrypted reasoning content. The session also contrasted developer experience: Responses makes output handling simpler than the older Chat Completions API’s choices/content structures.
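For the no-stored-state path, the request shape can be sketched as follows. This is a payload sketch based on the session's description; the user message is illustrative.

```python
def build_stateless_request(input_items: list) -> dict:
    """Sketch of the no-server-side-state pattern described in the session.

    With store=False the API retains nothing between calls; requesting
    reasoning.encrypted_content returns the reasoning items in encrypted
    form so they can be passed back verbatim in the next request's input,
    preserving reasoning continuity without exposing raw tokens.
    """
    return {
        "model": "gpt-5",
        "input": input_items,
        "store": False,
        "include": ["reasoning.encrypted_content"],
    }

req = build_stateless_request([{"role": "user", "content": "Summarize the diff"}])
# On each subsequent turn, append the encrypted reasoning items from the
# previous response's output to input_items before calling again.
```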

What does “minimal” reasoning do, and how does it affect latency?

“minimal” is described as the least amount of reasoning needed while still retaining the intelligence of a reasoning model. The demo used two identical requests: the one with minimal reasoning finished in about 0.9 seconds, while the other with higher reasoning effort finished around 6.9 seconds. The implication is that reasoning effort is a direct latency lever.
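The latency lever reduces to a single request parameter. The prompt and timing helper below are illustrative; with a live client, timing the two calls side by side is how the demo's gap was shown.

```python
import time

def build_request(effort: str) -> dict:
    # GPT-5's reasoning effort accepts "minimal", "low", "medium", "high";
    # "minimal" does only the reasoning needed, cutting latency.
    return {
        "model": "gpt-5",
        "input": "Extract the error code from: 'HTTP 503 upstream timeout'",
        "reasoning": {"effort": effort},
    }

def timed(call, **params) -> float:
    """Wall-clock seconds for one request; pass call=client.responses.create."""
    t0 = time.perf_counter()
    call(**params)
    return time.perf_counter() - t0

fast = build_request("minimal")
slow = build_request("high")
# With a live client, timed(client.responses.create, **fast) versus **slow
# reproduces the demo's roughly 0.9 s vs 6.9 s gap.
```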

How do reasoning items improve long agentic tasks?

For long rollouts with many tool calls, passing reasoning items back after each tool call prevents “amnesia” inside the reasoning process. The session described a workflow where the model decides whether it has enough context or needs another tool call; providing the reasoning items lets it continue that decision-making coherently. It also cited benchmark impact: preserving reasoning continuity can yield roughly a 2–4% improvement on SWE-bench in multi-tool scenarios.
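The loop described above can be sketched as follows. Here `call_model` and `run_tool` are hypothetical stand-ins for `client.responses.create` and your tool executor, and the item shapes follow the Responses API's function-calling format as described in the session; treat this as a sketch, not a verbatim implementation.

```python
def agent_loop(call_model, run_tool, user_input, max_steps=50):
    """Drive a tool-calling rollout, feeding reasoning items back each turn."""
    input_items = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        response = call_model(input_items)
        # Append the model's full output, including its reasoning items,
        # so the next turn continues the same chain of thought.
        input_items += response["output"]
        tool_calls = [item for item in response["output"]
                      if item["type"] == "function_call"]
        if not tool_calls:
            return response  # no more tool calls: final answer reached
        for call in tool_calls:
            input_items.append({
                "type": "function_call_output",
                "call_id": call["call_id"],
                "output": run_tool(call["name"], call["arguments"]),
            })
    raise RuntimeError("rollout did not converge within max_steps")
```

Dropping the reasoning items from `input_items` after each tool result is the “amnesia” failure mode the session warned about.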

How does verbosity change tool calls and code output?

Verbosity doesn’t only change final output length; it can affect how tool calls are emitted. In a custom tool example (a stand-in for an apply patch tool), verbosity=high produced tool-call code that was more readable and included better error handling, while verbosity=low was still correct but less readable. The practical takeaway: for code agent workflows where developer readability matters, verbosity high was found to work well.
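The knob itself is a single request field. The apply-patch prompt below is illustrative of the demo's scenario, not a transcript of it.

```python
def build_patch_request(verbosity: str) -> dict:
    # Verbosity is set via the text parameter; it accepts "low", "medium",
    # and "high", and shapes both final output and emitted tool-call code.
    return {
        "model": "gpt-5",
        "input": "Use the apply_patch tool to fix the null check in parser.py",
        "text": {"verbosity": verbosity},
    }

readable = build_patch_request("high")  # more readable tool-call code, per the demo
terse = build_patch_request("low")      # still correct, but denser
```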

What prompting mistake hurts GPT-5 most, compared with older models?

GPT-5’s literal instruction-following means conflicting or imprecise instructions can degrade performance more than with older “vibier” models. The prompting guide example described how GPT-5 can appear to proceed while its reasoning summaries show conflict between parts of the instruction. The fix is to remove contradictions and make requirements precise.
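A hypothetical before/after makes the failure mode concrete: the first prompt contains a direct contradiction, and the revision resolves it using the XML structure the session recommended from internal tests. Both prompts are invented examples.

```python
# Hypothetical conflicting prompt: "never" and "always" clash, which a
# literal instruction-follower cannot satisfy simultaneously.
conflicting = (
    "Never ask the user clarifying questions. "
    "Always ask before making assumptions about requirements."
)

# Revision: one precise rule, structured with XML tags.
revised = """\
<instructions>
  <clarification>Ask at most one clarifying question, and only when a
  requirement is genuinely ambiguous; otherwise proceed.</clarification>
  <style>State any assumption you make in a closing 'Assumptions' note.</style>
</instructions>"""
```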

What is metaprompting in this context, and why does it help?

Metaprompting here means asking the model to justify a behavior (“why did you do X?”) and then using that explanation to request a corrected action. Instead of making one-off edits that may overfit the original prompt, the model’s stated reasons become the basis for a more targeted revision.
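The two-step pattern can be sketched as message construction. In practice the `reasons` argument would come from the model's actual answer to the first turn; here it is a placeholder.

```python
def metaprompt_turns(behavior: str, reasons: str) -> list:
    """Two-step metaprompting sketch: ask why, then correct using the answer.

    `reasons` stands in for the model's explanation from the first turn;
    obtain it from a real response before building the second turn.
    """
    ask_why = {
        "role": "user",
        "content": f"Why did you {behavior}? List your specific reasons.",
    }
    correct = {
        "role": "user",
        "content": ("Given these reasons:\n" + reasons +
                    "\nRevise your answer so none of them apply."),
    }
    return [ask_why, correct]

turns = metaprompt_turns("omit the error handling",
                         "1. The prompt asked for brevity.")
```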

Review Questions

  1. In a multi-tool agent workflow, when would you enable statefulness (store=true) versus using encrypted reasoning content?
  2. How would you choose between minimal reasoning and higher reasoning effort for a task that must run 30 tool calls end-to-end?
  3. What kinds of prompt conflicts are most likely to hurt GPT-5’s performance, and how would you rewrite the prompt to remove them?

Key Points

  1. GPT-5 is framed as a major step forward for coding quality and front-end UI generation, with stronger instruction adherence and better long-horizon tool-chaining.

  2. The Responses API is the recommended integration path for GPT-5 because it supports reasoning continuity across tool calls via reasoning items (or encrypted reasoning content).

  3. Use the reasoning parameter “minimal” to reduce latency for latency-sensitive tasks, and increase reasoning effort for longer, more complex agentic work.

  4. The verbosity parameter affects both how much the model outputs and how tool calls are formatted; verbosity high can improve code readability and error handling.

  5. For long agentic rollouts (20–50 tool calls), preserving reasoning continuity can produce small but meaningful benchmark gains (cited around 2–4% on SWE-bench).

  6. Prompting reliability improves when instructions are precise and non-conflicting; GPT-5 follows wording more literally than older models.

  7. Prompt optimization tactics include choosing appropriate reasoning effort, using XML as a structured prompt format (per internal tests), and metaprompting by asking “why” before requesting fixes.

Highlights

GPT-5’s “minimal” reasoning mode delivered a large latency gap in a side-by-side demo: ~0.9 seconds versus ~6.9 seconds when reasoning effort was increased.
Responses API statefulness (store=true) lets GPT-5 maintain coherent reasoning across tool calls by passing reasoning items back after each tool result.
In Codex CLI, GPT-5 built a playable Minecraft clone from an empty folder, then iterated on visuals (pink flowers and rendering tweaks) through additional prompts.
Prompting guidance warned that GPT-5 can be harmed by conflicting instructions because it interprets wording literally and may show contradictions in reasoning summaries.
Charlie Labs’ autonomous agent (“Charlie”) uses GPT-5 through the Responses API and reports improvements in PR-creation and PR-review evals, plus strong head-to-head performance versus Claude Code.
