
The New Prompting Rules: How to Prompt Frontier LLM Models like Gemini 2.5, GPT 4.1 & Claude 3.7

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use longer context windows to include large histories and tool outputs, improving grounding and reducing mid-task forgetting.

Briefing

Frontier LLMs are getting dramatically easier to use because context windows have ballooned to 200,000 tokens and beyond, letting models reliably track long conversations and ingest huge tool outputs. In practice, that means prompts can include far more source material—entire codebases, large documents, or multi-step agent traces—without the model “forgetting” key details mid-task. The transcript cites OpenAI’s guidance that long-context performance can remain strong even as accuracy gradually dips at the extreme end, and it highlights real-world experience with Gemini 2.5 where conversations starting around 120,000 tokens continued with the model still able to reference earlier information deep into the thread. The payoff is not just better recall; it also enables longer chains of internal reasoning-like behavior even for models that aren’t explicitly “reasoning” models, because more tokens remain available for intermediate work.

A second shift is that newer frontier models follow instructions more reliably, which changes how developers can design agentic workflows. Instead of hoping a model will naturally do the right thing in the right order, teams can now specify execution sequences—such as “read material first, then use it to ground the response”—and also demand ordered output formats. The transcript emphasizes that these models are especially strong at producing structured outputs in markdown, JSON, and XML. XML, in particular, has moved from a niche technique to a recommended pattern: OpenAI’s materials reportedly show XML performing well in long-context testing, and the transcript notes that XML delimiters with attributes and nested input/output examples can make complex instructions easier to constrain.

Prompting guidance also shifts toward stricter control mechanisms. Delimiters and formatting matter more as prompts grow, and the transcript suggests starting with markdown for simpler tasks, then layering XML when prompts become large or when output needs to be more repeatable. It also warns that even with better models, repetitive or highly constrained outputs may still require splitting work across multiple prompts.

Instruction wording is getting more nuanced as well. Where earlier prompting advice leaned heavily on positive examples, OpenAI’s newer guidance for GPT-4.1–class models and up reportedly allows the use of negating terms. That matters for reducing hallucinations in extraction and QA tasks: prompts can explicitly instruct the model to respond “I don’t know” when information is missing, rather than inventing details. The transcript claims Gemini 2.5 Pro behaves better on this front, but it stops short of promising zero hallucinations.

Finally, the transcript lays out a practical prompt structure: begin with role and objective, add step-by-step instructions and output formatting requirements (including schemas), include one-shot or two-shot examples plus edge cases, then provide the bulk context (customer history, device info, retrieved documents, etc.). It also recommends repeating key instructions at the end because models pay more attention to the final directives. However, it flags a cost tradeoff: repeating instructions can break prompt caching, increasing compute bills at scale.

The overall message is pragmatic: longer context and stronger instruction following make prompting more effective, but outputs still aren’t deterministic. The transcript concludes that teams should evaluate models with real metrics and expect some hallucination risk even when using advanced frontier systems.

Cornell Notes

Long-context frontier LLMs (200k+ tokens) make prompting more effective by letting models retain large histories and ingest massive tool outputs without losing important details. Newer models also follow instructions and output constraints better, enabling more reliable agent workflows with ordered steps and structured outputs (especially markdown, JSON, and XML). XML delimiters are increasingly recommended for long prompts, while markdown works well for simpler formatting needs. OpenAI guidance for GPT-4.1–class models and up supports using negating terms like “I don’t know” to reduce hallucinations in extraction/QA tasks, though hallucinations can’t be eliminated. A practical prompt template starts with role/objective, adds instructions and formatting (schemas), includes examples and edge cases, then supplies the main context—often repeating key instructions at the end, with a caching/cost tradeoff.

Why do larger context windows change day-to-day prompting?

With context windows reaching 200,000 tokens or more across major frontier models, prompts can include far more than a short instruction plus a small document. The transcript cites OpenAI guidance and personal experience with Gemini 2.5, where an initial context around 120,000 tokens was still remembered later in the conversation. That means developers can pack in long histories, retrieved documents, and large tool outputs (e.g., from web search, scraping, or document returns) and expect the model to reference earlier material instead of drifting.

How does improved instruction following affect agentic application design?

Better instruction adherence lets prompts specify explicit order-of-operations. For example, an agent can be told to first read material, then use that material plus user context to decide next steps, rather than relying on the model’s default behavior. The transcript also notes that models can be directed to order outputs in a specified sequence and to emit content in strict formats—particularly markdown, JSON, and XML—making downstream automation more dependable.
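The read-first-then-respond pattern can be sketched as a small prompt-assembly helper. This is my own illustration of the idea, not the transcript's exact wording; the role and step text are placeholders:

```python
# Sketch of encoding an explicit order of operations in a system prompt.
# Step wording and role are illustrative assumptions, not from the transcript.
AGENT_STEPS = [
    "Read the provided material in full before doing anything else.",
    "Use only that material plus the user's context to decide next steps.",
    "Answer in the requested output format, in the order specified below.",
]

def build_agent_prompt(role: str, steps: list[str]) -> str:
    """Number the steps so the model is nudged to follow them in sequence."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return f"You are a {role}.\n\nFollow these steps in order:\n{numbered}"

prompt = build_agent_prompt("customer support agent", AGENT_STEPS)
```

Numbering the steps explicitly is the key design choice: it gives the model a fixed sequence to adhere to rather than an unordered wish list.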

When should a developer use XML delimiters versus markdown?

Markdown is recommended as a starting point for simpler prompts because models handle titles, numbered lists, and code fences well. XML becomes valuable as prompts grow or when stricter structure is needed for repeatability. The transcript highlights OpenAI’s long-context testing where XML performed well, and it describes XML patterns with nested input/output examples and attributes to constrain what the model should produce.
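A hypothetical sketch of that XML pattern, with an attribute on the instruction delimiter and a nested input/output example pair. The tag names and attributes here are my own illustration, not a fixed spec from OpenAI or the transcript:

```python
# Illustrative XML-delimited prompt: attributes on tags plus a nested
# input/output example, as the transcript describes. Tag names are assumptions.
doc = "Quarterly report: revenue grew 12% year over year."

xml_prompt = f"""<instructions format="json">
Extract the growth figure from the document. If it is absent, answer "I don't know".
</instructions>
<examples>
  <example id="1">
    <input>Revenue grew 8% in Q2.</input>
    <output>{{"growth": "8%"}}</output>
  </example>
</examples>
<document source="report">
{doc}
</document>"""
```

The paired open/close tags make section boundaries unambiguous even in very long prompts, which is why the pattern scales better than markdown headings alone.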

What’s the purpose of using negating terms like “I don’t know”?

Negating terms help prevent fabricated answers in tasks where the source may not contain the requested information. The transcript frames this as especially important for extraction and QA: if the document lacks the needed detail, the model should respond “I don’t know.” It contrasts older advice that emphasized positive examples with newer OpenAI guidance for GPT-4.1–class models and up that allows negating instructions. The transcript also claims Gemini 2.5 Pro behaves better on this front, while still not guaranteeing zero hallucinations.
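One way to phrase such a negating instruction for an extraction task; the helper function and exact wording are illustrative assumptions, not a quoted recommendation:

```python
# Sketch of a negating instruction for extraction/QA: tell the model to
# fall back to a fixed phrase instead of guessing. Wording is an assumption.
FALLBACK = "I don't know"

def extraction_prompt(question: str, document: str) -> str:
    return (
        "Answer using ONLY the document below. "
        f'If the document does not contain the answer, reply exactly "{FALLBACK}" '
        "and do not guess.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}"
    )

p = extraction_prompt(
    "What was the CEO's salary?",
    "The company was founded in 2010.",
)
```

Here the document genuinely lacks the answer, so a well-behaved model should return the fallback phrase rather than inventing a figure.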

What prompt structure is recommended for complex tasks with tools and formatting?

A suggested template begins with role and objective (e.g., customer support agent), then provides instructions and optional reasoning steps, followed by explicit output formatting requirements (like “reply in JSON” plus a schema) or markdown preferences. Next come examples: one general example plus one-shot/two-shot style cases, including 3–5 edge cases. Finally, the main context (user/customer history, device info, retrieved documents, etc.) is inserted as the bulk of the prompt, with key instructions repeated at the end to improve attention—while noting that repeating instructions can break prompt caching.
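That template could be assembled programmatically along these lines. The section headings, sample values, and schema are assumptions for illustration, not the transcript's exact wording:

```python
# Sketch of the suggested prompt order: role/objective -> instructions ->
# output format -> examples -> bulk context -> repeated key instructions.
def assemble_prompt(role, objective, instructions, output_format, examples, context):
    bullet = lambda items: "\n".join(f"- {i}" for i in items)
    sections = [
        f"# Role and Objective\nYou are a {role}. Your objective: {objective}",
        "# Instructions\n" + bullet(instructions),
        f"# Output Format\n{output_format}",
        "# Examples\n" + "\n\n".join(examples),
        f"# Context\n{context}",
        # Repeated at the end because models weight final directives heavily,
        # at the cost of possibly breaking prompt caching.
        "# Reminder\n" + bullet(instructions),
    ]
    return "\n\n".join(sections)

prompt = assemble_prompt(
    role="customer support agent",
    objective="resolve the user's billing question",
    instructions=["Read the customer history first.", "Cite the document you used."],
    output_format='Reply in JSON: {"answer": str, "source": str}',
    examples=['Input: "Why was I charged twice?" -> cite the relevant invoice.'],
    context="Customer history, device info, and retrieved documents go here.",
)
```

Putting the bulk context late keeps the instructions and schema stable across requests, with only the reminder section repeating them after the variable material.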

Why can repeating instructions at the end increase cost?

The transcript warns that repeating instructions at the end can break prompt caching. Even if much of the prompt content stays the same across requests, the uncached portion forces more computation, which can add up significantly at scale (thousands or tens of thousands of requests).
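A rough illustration of the mechanics, assuming the common provider behavior of caching only the longest identical prefix across requests (the exact caching policy varies by provider):

```python
import os

# Sketch (my illustration, not from the transcript): with prefix-based
# caching, anything after the first differing character is recomputed,
# so a reminder repeated after varying context is never cached.
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two prompts."""
    return len(os.path.commonprefix([a, b]))

base = "Instructions: reply in JSON.\nContext: "
req1 = base + "order #1\nReminder: reply in JSON."
req2 = base + "order #2\nReminder: reply in JSON."

cached = shared_prefix_len(req1, req2)   # reusable portion across requests
wasted = len(req1) - cached              # recomputed; includes the reminder
```

The repeated reminder sits entirely in the uncached tail, so its cost is paid on every request even though its text never changes.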

Review Questions

  1. How does a 200,000+ token context window change what you can safely include in a prompt (and what kinds of failures it reduces)?
  2. What are the practical differences between using markdown-only formatting and adding XML delimiters for structured outputs?
  3. In a prompt template, where should examples and edge cases go, and why might repeating key instructions at the end be both helpful and costly?

Key Points

  1. Use longer context windows to include large histories and tool outputs, improving grounding and reducing mid-task forgetting.
  2. Design agent workflows with explicit step order and ordered output requirements, leveraging stronger instruction following in newer models.
  3. Prefer markdown for simpler formatting, then add XML delimiters when prompts get large or when output repeatability needs tighter constraints.
  4. Use negating instructions like “I don’t know” to reduce hallucinations in extraction/QA, but still evaluate and expect some errors.
  5. Adopt a structured prompt template: role/objective → instructions → formatting/schema → examples/edge cases → main context → repeated key instructions.
  6. Be mindful that repeating instructions can break prompt caching and raise costs at high request volumes.
  7. Treat model outputs as non-deterministic and rely on evaluation metrics rather than assuming perfect behavior.

Highlights

Context windows of 200,000+ tokens let prompts carry entire codebases and large tool outputs while maintaining reference to earlier information deep in the conversation.
XML delimiters have become a mainstream recommendation for long prompts, with OpenAI reporting strong long-context performance for XML patterns.
Negating terms such as “I don’t know” are now explicitly supported in GPT-4.1–class guidance to curb hallucinations when information is missing.
Repeating key instructions at the end can improve attention but may break prompt caching, increasing compute bills.
Even with advanced models, hallucinations aren’t eliminated—evaluation and guardrails remain necessary.

Topics

Mentioned

  • GPT-4.1
  • GPT-4.1 mini
  • JSON
  • XML
  • VS Code
  • R1
  • OM