
Prompt Engineering: Zero-shot, One-shot, Few-shot Techniques Explained (Practical Implementation)

AI Researcher
5 min read

Based on AI Researcher's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Prompting steers a pre-trained LLM using task instructions and optional in-context examples without changing model weights.

Briefing

Prompting lets a pre-trained language model follow tasks using only instructions and examples—no weight updates—so performance can be improved by moving from zero-shot to one-shot to few-shot prompting. The core takeaway is that adding a small number of in-context examples meaningfully raises output quality, and the transcript demonstrates this using dialog summarization with a flan-T5 model.

The discussion starts by contrasting prompting with fine-tuning. Prompting works by feeding an LLM a task description plus optional context; the model generates output using its existing knowledge without changing parameters or weights. Its main advantages are speed and flexibility: users can adapt to new tasks quickly by rewriting prompts, without retraining. The tradeoffs are weaker performance when the task/domain is unfamiliar and a dependence on prompt quality—complex tasks often require careful experimentation.

Fine-tuning, by contrast, retrains the model on task-specific data, updating parameters through gradient-based learning. That specialization can improve performance on targeted domains (the transcript gives legal-document handling as an example), but it comes with costs: additional compute and time, plus risks like overfitting when datasets are small or overly narrow. The transcript frames the key differences as: prompting keeps model parameters fixed while fine-tuning updates them; prompting is dynamic and general-purpose while fine-tuning is targeted; prompting is cheaper and faster but lower-performing overall, while fine-tuning is more expensive but can deliver stronger task-specific results.

To ground the prompting concepts, the transcript then walks through traditional fine-tuning at a high level: the model trains on example input-output pairs, computes errors, applies gradient updates, repeats across many examples, and gradually improves its ability to generate correct responses.
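The gradient-update loop described above can be illustrated with a deliberately tiny toy: a one-parameter model y = w * x trained on input-output pairs. This is a sketch of the idea only; real fine-tuning updates millions of weights, but the train-on-pairs, compute-error, apply-gradient cycle is the same.

```python
# Toy gradient-based fine-tuning loop: learn w in y = w * x
# from example input-output pairs (target relation: y = 2x).
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0    # the single model "weight"
lr = 0.05  # learning rate

for epoch in range(100):          # repeat across many examples
    for x, y in pairs:
        pred = w * x              # generate a prediction
        error = pred - y          # compute the error
        w -= lr * error * x       # gradient step on squared-error loss

print(round(w, 3))  # converges toward 2.0
```

Each pass nudges the weight toward the value that reproduces the training pairs, which is exactly the gradual improvement the transcript describes.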

The practical implementation section focuses on prompting methods using Google Colab. It installs and imports Hugging Face libraries (datasets and Transformers), loads a dialog dataset from Hugging Face that includes dialogues and baseline human summaries, and uses flan-T5 base with its tokenizer. The notebook demonstrates tokenization (string-to-token IDs and back) and then runs summarization in three prompting regimes.
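A minimal sketch of that setup is below. The dataset identifier is an assumption for illustration (the transcript only says "a dialog dataset from Hugging Face"), and loading the flan-T5 model weights triggers a roughly 1 GB download, so it is deferred to a helper here.

```python
# pip install transformers datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"

# Tokenization demo: string -> token IDs -> string round trip.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
text = "What time does the meeting start?"
token_ids = tokenizer(text)["input_ids"]   # list of integer token IDs
decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
print(token_ids)
print(decoded)  # recovers the original text

def load_model(name: str = MODEL_NAME):
    """Load the seq2seq model (large download on first use)."""
    return AutoModelForSeq2SeqLM.from_pretrained(name)

# The dialog dataset; the exact identifier is illustrative, not from the transcript:
# from datasets import load_dataset
# dataset = load_dataset("knkarthick/dialogsum")  # dialogues + baseline human summaries
```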

In zero-shot prompting, the model receives only an instruction such as “Please summarize in the three lines” along with the dialogue. No example pairs are provided, and no gradient updates occur. The resulting summaries are generally on-task but differ in phrasing and focus from the dataset’s baseline summaries.
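A zero-shot prompt can be assembled as a plain string; the instruction wording and field labels below are illustrative rather than copied from the notebook:

```python
def zero_shot_prompt(dialogue: str) -> str:
    """Instruction plus the input dialogue only: no example pairs."""
    return (
        "Please summarize the following dialogue in three lines.\n\n"
        f"Dialogue:\n{dialogue}\n\nSummary:"
    )

prompt = zero_shot_prompt("A: Are we still on for lunch?\nB: Yes, noon works for me.")
print(prompt)
```

Feeding `prompt` through the flan-T5 tokenizer and `model.generate` would then produce the summary, with no gradient updates anywhere in the process.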

In one-shot prompting, the prompt includes the task instruction plus a single dialogue-summary example before asking the model to summarize a new dialogue. The transcript notes that this extra example makes outputs more acceptable and closer to the baseline style.
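The one-shot variant only changes the prompt string: one worked dialogue-summary pair is placed before the new dialogue. Again, the exact formatting is an illustrative choice:

```python
def one_shot_prompt(example_dialogue: str, example_summary: str, dialogue: str) -> str:
    """One in-context example pair, then the new dialogue to summarize."""
    return (
        "Please summarize the following dialogue in three lines.\n\n"
        f"Dialogue:\n{example_dialogue}\nSummary:\n{example_summary}\n\n"
        f"Dialogue:\n{dialogue}\nSummary:"
    )
```

The model sees the completed example and tends to imitate its style when filling in the final, empty `Summary:` slot.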

In few-shot prompting, the prompt includes multiple examples (three in the described setup) before summarizing a new dialogue. With more demonstrations in-context, the flan-T5 model produces higher-quality summaries, and the transcript explicitly reports an improvement as the method progresses from zero-shot to one-shot to few-shot.
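Few-shot prompting generalizes the same pattern to a list of example pairs (three in the described setup; one-shot is simply the single-element case). A sketch with illustrative formatting:

```python
def few_shot_prompt(examples, dialogue):
    """examples: list of (dialogue, summary) pairs shown in-context
    before the new dialogue to be summarized."""
    parts = ["Please summarize the following dialogue in three lines.\n"]
    for ex_dialogue, ex_summary in examples:
        parts.append(f"Dialogue:\n{ex_dialogue}\nSummary:\n{ex_summary}\n")
    parts.append(f"Dialogue:\n{dialogue}\nSummary:")
    return "\n".join(parts)

examples = [("d1", "s1"), ("d2", "s2"), ("d3", "s3")]  # placeholder pairs
print(few_shot_prompt(examples, "A: See you at five?\nB: Yes, at the cafe."))
```

Each added demonstration consumes context-window space, so in practice the number of shots is bounded by the model's maximum input length.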

Overall, the transcript positions in-context learning as a practical alternative to fine-tuning: it avoids retraining while still improving results through better prompt construction and additional examples.

Cornell Notes

Prompting uses a pre-trained LLM’s existing weights to follow tasks via instructions and optional context, without retraining. The transcript contrasts this with fine-tuning, which updates model parameters using task-specific datasets, improving specialized performance at the cost of compute, time, and overfitting risk. For prompting, three in-context strategies are demonstrated for dialog summarization with flan-T5 base: zero-shot (instruction only), one-shot (instruction + one example dialogue-summary pair), and few-shot (instruction + multiple example pairs). The results reported in the notebook show a quality improvement as the prompt moves from zero-shot to one-shot to few-shot, with outputs becoming closer to baseline human summaries. This makes few-shot prompting a practical way to boost performance when retraining isn’t feasible.

How does prompting differ from fine-tuning in terms of model behavior and cost?

Prompting keeps model parameters fixed: it steers output by providing a task description and (optionally) contextual examples in the input prompt, with no gradient updates. That makes it fast and flexible because users can adapt to new tasks by rewriting prompts. Fine-tuning updates internal weights via training on task-specific data, which can improve performance on targeted domains but requires additional compute/time and carries overfitting risk when data is small or narrow.

Why does zero-shot prompting sometimes underperform even when the model is strong?

Zero-shot prompting relies on the model’s pre-existing knowledge and the quality of the instruction alone. If the task or domain isn’t well represented in the model’s learned knowledge, outputs can be inaccurate or off-target. The transcript also highlights that prompt engineering becomes important: complex tasks may require iterative prompt experimentation to get reliable results.

What changes between zero-shot, one-shot, and few-shot prompting?

Zero-shot provides only the task instruction (e.g., “Please summarize in the three lines”) plus the input dialogue, with no example pairs. One-shot adds exactly one dialogue-summary example before the new dialogue to guide the model’s style and content selection. Few-shot adds several examples (three in the described setup) so the model can generalize from multiple demonstrations before summarizing a new dialogue.

What does the notebook do to implement prompting for summarization?

It installs Hugging Face datasets and Transformers, loads a dialog dataset containing dialogues and baseline summaries, and initializes flan-T5 base with its tokenizer. It then tokenizes inputs, runs summarization under different prompting setups, and prints the dialogue, the baseline human summary, and the model-generated summary for comparison.
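The comparison printout can be sketched as a small formatting helper (the section labels are illustrative, not taken from the notebook):

```python
def format_comparison(dialogue: str, baseline_summary: str, generated_summary: str) -> str:
    """Side-by-side report for eyeballing model output against the human baseline."""
    sep = "-" * 40
    return "\n".join([
        sep, "DIALOGUE:", dialogue,
        sep, "BASELINE HUMAN SUMMARY:", baseline_summary,
        sep, "MODEL SUMMARY:", generated_summary,
        sep,
    ])

print(format_comparison("A: Lunch at noon?\nB: Sure.",
                        "They agree to meet for lunch at noon.",
                        "A and B plan lunch at noon."))
```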

What evidence of improvement is reported as prompting examples increase?

The transcript reports that summaries improve as the prompt moves from zero-shot to one-shot to few-shot. Zero-shot summaries are described as slightly different in phrasing and focus from baseline. One-shot outputs are described as more acceptable. Few-shot outputs are described as closer to the baseline, with the transcript giving an example where the generated summary is “perfect” relative to the dialogue’s expected key points.

Review Questions

  1. In what specific ways do prompting and fine-tuning differ regarding parameter updates, performance expectations, and failure modes?
  2. How would you construct a one-shot prompt for a new summarization dataset using the same approach shown for flan-T5 base?
  3. Why might few-shot prompting outperform one-shot prompting, and what practical limits could still remain without fine-tuning?

Key Points

  1. Prompting steers a pre-trained LLM using task instructions and optional in-context examples without changing model weights.
  2. Fine-tuning updates model parameters on task-specific data, often improving specialized performance but requiring more compute and risking overfitting.
  3. Zero-shot prompting uses only an instruction plus the input, which can lead to summaries that differ from baseline phrasing and focus.
  4. One-shot prompting adds a single example dialogue-summary pair to better align output style and content.
  5. Few-shot prompting adds multiple example pairs, producing higher-quality summaries in the demonstrated dialog task.
  6. The practical workflow uses Hugging Face datasets/Transformers in Google Colab, with flan-T5 base for sequence-to-sequence summarization.
  7. Comparing baseline human summaries to model outputs across prompting regimes is the main evaluation method used in the notebook.

Highlights

Prompting avoids retraining by keeping model parameters fixed, making it fast and flexible for new tasks.
Moving from zero-shot to one-shot to few-shot improves dialog summarization quality because the model gains more in-context guidance.
The implementation uses flan-T5 base with Hugging Face datasets, printing baseline vs generated summaries to compare results.