The New Prompting Rules: How to Prompt Frontier LLMs like Gemini 2.5, GPT-4.1 & Claude 3.7
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Frontier LLMs are getting dramatically easier to use because context windows have ballooned to 200,000 tokens and beyond, letting models reliably track long conversations and ingest huge tool outputs. In practice, that means prompts can include far more source material—entire codebases, large documents, or multi-step agent traces—without the model “forgetting” key details mid-task. The transcript cites OpenAI’s guidance that long-context performance can remain strong even as accuracy gradually dips at the extreme end, and it highlights real-world experience with Gemini 2.5 where conversations starting around 120,000 tokens continued with the model still able to reference earlier information deep into the thread. The payoff is not just better recall; it also enables longer chains of internal reasoning-like behavior even for models that aren’t explicitly “reasoning” models, because more tokens remain available for intermediate work.
A second shift is that newer frontier models follow instructions more reliably, which changes how developers can design agentic workflows. Instead of hoping a model will naturally do the right thing in the right order, teams can now specify execution sequences (such as "read material first, then use it to ground the response") and demand ordered output formats. The transcript emphasizes that these models are especially strong at producing structured outputs in markdown, JSON, and XML. XML in particular has moved from a niche technique to a recommended pattern: OpenAI's materials reportedly show XML performing well in long-context testing, and the transcript notes that XML delimiters with attributes and nested input/output examples make complex output requirements easier to enforce.
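To make the XML pattern concrete, here is a minimal Python sketch of a prompt using XML delimiters with attributes and a nested input/output example. The tag names, attributes, and classification task are illustrative assumptions, not a schema from the transcript.

```python
# A minimal sketch of an XML-delimited prompt with attributes and a nested
# input/output example. Tag names, attributes, and the classification task
# are illustrative assumptions, not a schema from the transcript.
def build_xml_prompt(ticket_text: str) -> str:
    return f"""Read the source material first, then use it to ground your response.

<instructions>
Classify the support ticket below and answer in the JSON shape shown in the example.
</instructions>

<example id="1">
  <input>My router keeps dropping the connection every few minutes.</input>
  <output>{{"category": "connectivity", "urgency": "high"}}</output>
</example>

<ticket source="email" language="en">
{ticket_text}
</ticket>"""


print(build_xml_prompt("I was double-billed in March and need a refund."))
```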
Prompting guidance also shifts toward stricter control mechanisms. Delimiters and formatting matter more as prompts grow, and the transcript suggests starting with markdown for simpler tasks, then layering XML when prompts become large or when output needs to be more repeatable. It also warns that even with better models, repetitive or highly constrained outputs may still require splitting work across multiple prompts.
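As a rough sketch of that splitting advice: rather than requesting one long, highly constrained batch of outputs, issue one small prompt per item. The `call_model` function below is a hypothetical stand-in for whichever provider SDK you actually use.

```python
# A sketch of splitting highly constrained work across multiple prompts
# instead of one giant batch request. `call_model` is a hypothetical
# stand-in for whichever provider SDK you actually use.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire up your provider's client here")


def extract_invoices(documents: list[str]) -> list[str]:
    results = []
    for doc in documents:
        # One small, tightly constrained prompt per document keeps the
        # output format from drifting across a long batch.
        prompt = (
            "Extract the invoice number and total as JSON with keys "
            '"invoice_number" and "total".\n\nDocument:\n' + doc
        )
        results.append(call_model(prompt))
    return results
```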
Instruction wording is getting more nuanced as well. Where earlier prompting advice leaned heavily on positive examples, OpenAI’s newer guidance for GPT-4.1–class models and up reportedly allows the use of negating terms. That matters for reducing hallucinations in extraction and QA tasks: prompts can explicitly instruct the model to respond “I don’t know” when information is missing, rather than inventing details. The transcript claims Gemini 2.5 Pro behaves better on this front, but it stops short of promising zero hallucinations.
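A minimal sketch of such a prompt for extraction/QA follows; the exact wording of the fallback instruction is an assumption to adapt per task.

```python
# A minimal sketch of a QA prompt with an explicit "I don't know" escape
# hatch. The exact wording is an assumption; adapt it to your task.
def build_qa_prompt(context: str, question: str) -> str:
    return f"""Answer the question using ONLY the context below.
Do not use outside knowledge and do not guess.
If the answer is not in the context, reply exactly: I don't know

<context>
{context}
</context>

Question: {question}"""


print(build_qa_prompt("The device ships with a 2-year warranty.",
                      "What is the battery capacity?"))
```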
Finally, the transcript lays out a practical prompt structure: begin with role and objective, add step-by-step instructions and output formatting requirements (including schemas), include one-shot or two-shot examples plus edge cases, then provide the bulk context (customer history, device info, retrieved documents, etc.). It also recommends repeating key instructions at the end because models pay more attention to the final directives. However, it flags a cost tradeoff: repeating instructions can break prompt caching, increasing compute bills at scale.
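Putting the pieces together, here is a sketch of that structure as a Python string builder. All section names, tags, and example content are illustrative assumptions, not taken from the transcript.

```python
# A sketch of the template described above: role/objective, step-by-step
# instructions, output schema, examples with an edge case, bulk context,
# and key instructions repeated at the end. All section names, tags, and
# example content are illustrative assumptions.
def build_support_prompt(customer_history: str, device_info: str,
                         retrieved_docs: str, question: str) -> str:
    return f"""# Role and Objective
You are a customer support agent. Resolve the question accurately.

# Instructions
1. Read the customer history, device info, and retrieved documents first.
2. Ground every claim in that material.
3. If the material does not contain the answer, answer "I don't know".

# Output Format
Return JSON: {{"answer": str, "sources": [str]}}

# Examples
Input: "How do I reset my router?"
Output: {{"answer": "Hold the reset button for 10 seconds.", "sources": ["manual"]}}
Edge case: an unrelated question -> {{"answer": "I don't know", "sources": []}}

# Context
<customer_history>{customer_history}</customer_history>
<device_info>{device_info}</device_info>
<retrieved_docs>{retrieved_docs}</retrieved_docs>

Question: {question}

# Key Instructions (repeated on purpose)
Ground every claim in the context above and answer "I don't know" when unsure."""
```

The repeated block at the end is exactly the part the transcript flags as a caching/cost tradeoff at scale.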
The overall message is pragmatic: longer context and stronger instruction following make prompting more effective, but outputs still aren’t deterministic. The transcript concludes that teams should evaluate models with real metrics and expect some hallucination risk even when using advanced frontier systems.
Cornell Notes
Long-context frontier LLMs (200k+ tokens) make prompting more effective by letting models retain large histories and ingest massive tool outputs without losing important details. Newer models also follow instructions and output constraints better, enabling more reliable agent workflows with ordered steps and structured outputs (especially markdown, JSON, and XML). XML delimiters are increasingly recommended for long prompts, while markdown works well for simpler formatting needs. OpenAI guidance for GPT-4.1–class models and up supports negating instructions, such as telling the model not to guess and to answer "I don't know" when information is missing, which reduces hallucinations in extraction/QA tasks, though hallucinations can't be eliminated. A practical prompt template starts with role/objective, adds instructions and formatting (schemas), includes examples and edge cases, then supplies the main context, often repeating key instructions at the end, with a caching/cost tradeoff.
Why do larger context windows change day-to-day prompting?
How does improved instruction following affect agentic application design?
When should a developer use XML delimiters versus markdown?
What’s the purpose of using negating terms like “I don’t know”?
What prompt structure is recommended for complex tasks with tools and formatting?
Why can repeating instructions at the end increase cost?
Review Questions
- How does a 200,000+ token context window change what you can safely include in a prompt (and what kinds of failures it reduces)?
- What are the practical differences between using markdown-only formatting and adding XML delimiters for structured outputs?
- In a prompt template, where should examples and edge cases go, and why might repeating key instructions at the end be both helpful and costly?
Key Points
1. Use longer context windows to include large histories and tool outputs, improving grounding and reducing mid-task forgetting.
2. Design agent workflows with explicit step order and ordered output requirements, leveraging stronger instruction following in newer models.
3. Prefer markdown for simpler formatting, then add XML delimiters when prompts get large or when output repeatability needs tighter constraints.
4. Use negating instructions and an explicit "I don't know" fallback to reduce hallucinations in extraction/QA, but still evaluate and expect some errors.
5. Adopt a structured prompt template: role/objective → instructions → formatting/schema → examples/edge cases → main context → repeated key instructions.
6. Be mindful that repeating instructions can break prompt caching and raise costs at high request volumes.
7. Treat model outputs as non-deterministic and rely on evaluation metrics rather than assuming perfect behavior.