Fine-tuning a Phi-3 LeetCode Expert? - Dataset Generation, Unsloth ++
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Build a LeetCode fine-tuning dataset by pairing complete problem statements (including examples and constraints) with structured outputs that include a step-by-step plan, Python 3 code, and an explanation.
Briefing
Fine-tuning a Phi-3-style “LeetCode expert” starts with building a clean, structured dataset: each training example pairs a full LeetCode problem statement (including examples and constraints) with a step-by-step plan, a Python 3 solution, and an explanation. The workflow hinges on avoiding data contamination by selecting problems from a contest that has already ended, then converting those scraped problem texts into a consistent JSON schema with three fields: instruction, input, and output.
The dataset creation process begins by copying the LeetCode problem text—everything needed to solve it—then saving it in a repeatable template. The creator adds a “solution to problem X” section and uses a strong LLM (examples mentioned include Claude 3 Opus and GPT-4) to generate the reasoning and code. A system prompt is used to enforce structure: the model should provide a step-by-step chain-of-thought-style plan before writing Python code. The resulting output is then validated by pasting the generated code back into LeetCode’s Python 3 editor and running the test cases until the solution is accepted.
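The plan-before-code structure can be enforced with a system prompt, plus a cheap local check before the manual LeetCode validation step. The prompt wording and helper names below are illustrative assumptions, not the creator's exact prompt:

```python
# Hypothetical system prompt enforcing the plan-then-code output structure.
SYSTEM_PROMPT = (
    "You are a LeetCode expert. For each problem, first write a numbered "
    "step-by-step solution plan, then the Python 3 code in a fenced block, "
    "then a short explanation of the approach."
)

def plan_precedes_code(output: str) -> bool:
    """Sanity-check that the plan section appears before the first code fence."""
    plan_pos = output.lower().find("step-by-step")
    code_pos = output.find("```python")
    return 0 <= plan_pos < code_pos

# A well-formed generation passes; one that leads with code does not.
sample = (
    "Step-by-step solution:\n1. Sort the array.\n2. Two-pointer scan.\n\n"
    "```python\ndef solve(nums): ...\n```\n\nExplanation: ..."
)
```

A check like this only verifies ordering, not correctness — pasting the code into LeetCode's Python 3 editor remains the real quality gate.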
Once a single problem is verified, the same pattern is repeated across multiple problems. In the walkthrough, 13 LeetCode problems are assembled into a single dataset file, with each JSON record formatted so the instruction is standardized (e.g., “Solve the following LeetCode problem in Python 3 syntax”), the input contains the raw problem statement, and the output contains the step-by-step plan plus the final Python code and explanation. The dataset is then uploaded to Hugging Face as a public dataset, making it easy to reuse in training pipelines.
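A single record in that schema can be sketched as follows; the field values are placeholders standing in for the real problem text and verified solution:

```python
import json

# One hypothetical training record in the instruction/input/output schema.
record = {
    "instruction": "Solve the following LeetCode problem in Python 3 syntax",
    "input": "Given an array of integers nums ... (full problem statement, "
             "examples, and constraints pasted verbatim)",
    "output": "Step-by-step solution:\n1. ...\n\n```python\n...\n```\n\n"
              "Explanation: ...",
}

# Records are commonly stored one per line (JSON Lines) for training pipelines.
line = json.dumps(record)
parsed = json.loads(line)
```

Keeping the instruction string identical across all 13 records is what makes it a standardized task prompt rather than 13 unrelated examples.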
Training is performed using Unsloth notebooks aimed at fine-tuning a Phi-3 mini 4K instruct model. The notebook loads the Hugging Face dataset, maps the instruction/input/output fields into the training format, and appends an EOS token for proper sequence termination. Training runs for a set number of epochs (60 is shown) on a small GPU setup (the walkthrough mentions a Tesla T4). Training loss declines from roughly 1.0 toward the mid-0.7 range, suggesting the model is learning patterns from the 13 examples.
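The field-mapping step can be sketched as below. The Alpaca-style template and the EOS handling mirror what Unsloth notebooks commonly do, but the exact template, function names, and the "</s>" stand-in token here are assumptions — the real EOS token comes from the tokenizer:

```python
# Stand-in EOS token; in a real notebook this is tokenizer.eos_token.
EOS_TOKEN = "</s>"

TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def formatting_func(examples):
    """Map instruction/input/output columns to a single 'text' column,
    appending EOS so the model learns to terminate its responses."""
    texts = [
        TEMPLATE.format(instruction=i, input=inp, output=out) + EOS_TOKEN
        for i, inp, out in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# A function like this is typically passed to dataset.map(..., batched=True).
batch = {
    "instruction": ["Solve the following LeetCode problem in Python 3 syntax"],
    "input": ["Two Sum: given nums and target, return indices ..."],
    "output": ["Step-by-step solution: ..."],
}
formatted = formatting_func(batch)
```

Without the appended EOS token, a fine-tuned model tends to keep generating past the end of its answer instead of stopping.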
Evaluation shows mixed results. A base test on a new easy LeetCode problem yields partial correctness, and the fine-tuned model does not consistently improve accuracy on the same task. Still, the fine-tuned model tends to follow the desired output structure—producing a step-by-step solution section, then a code solution, then an explanation—even when the code itself fails some test cases. The takeaway is less about achieving perfect LeetCode performance from a tiny dataset and more about demonstrating a practical, end-to-end method: scrape and verify problems, generate structured reasoning+code with a stronger model, validate with LeetCode, convert to JSON, upload to Hugging Face, then fine-tune with Unsloth.
The creator also notes that better results might require more data, different hyperparameters, or training a stronger base model (e.g., Llama 3 instead of Phi-3), but the core contribution remains the repeatable dataset-generation pipeline and the structured prompting that makes the fine-tuned outputs look like the target format.
Cornell Notes
A reliable way to fine-tune a Phi-3-style model for LeetCode-style coding is to build a dataset where every example is fully structured: a standardized instruction, the complete LeetCode problem text (including examples and constraints), and an output that contains a step-by-step plan, Python 3 code, and an explanation. The workflow emphasizes quality control: generate solutions with a stronger LLM, then paste the code back into LeetCode and confirm it passes test cases before adding it to the dataset. In the walkthrough, 13 verified problems are converted into JSON, uploaded to Hugging Face, and used in Unsloth to fine-tune a Phi-3 mini 4K instruct model. Training loss drops during fine-tuning, and the model more reliably produces the expected reasoning/code/explanation format, even though correctness on new problems remains inconsistent.
Why does the dataset start from contest problems that have already ended?
What exact structure is used for each training example in the JSON dataset?
How are the step-by-step plans and code generated, and how is format enforced?
What quality gate ensures the generated solutions are suitable for training?
What does the Unsloth fine-tuning pipeline require from the dataset?
What improvement is observed after fine-tuning, and what limitation remains?
Review Questions
- How does validating generated code on LeetCode before adding it to the dataset change the expected quality of fine-tuning outcomes?
- What are the roles of the instruction, input, and output fields in shaping the fine-tuned model’s behavior?
- Why might training loss decrease while accuracy on new LeetCode problems still stays inconsistent?
Key Points
1. Build a LeetCode fine-tuning dataset by pairing complete problem statements (including examples and constraints) with structured outputs that include a step-by-step plan, Python 3 code, and an explanation.
2. Reduce contamination risk by selecting problems from a contest that has already ended before generating training examples.
3. Generate reasoning+code using a stronger LLM with a system prompt that forces the plan to appear before the code, then keep that format consistent across all examples.
4. Validate every generated solution by pasting the Python code back into LeetCode and confirming it passes test cases before including it in the training set.
5. Convert the curated text into a JSON schema with instruction, input, and output fields so training pipelines can load it reliably.
6. Upload the dataset to Hugging Face to make it easy to reference from Unsloth notebooks and training scripts.
7. Expect format adherence to improve faster than raw problem-solving accuracy when the dataset is small; correctness may require more data, better hyperparameters, or a stronger base model.
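Before the Hugging Face upload, a stdlib-only pass over the dataset file can catch malformed records early. The filename comment and check logic below are illustrative assumptions, not part of the original walkthrough:

```python
import json

EXPECTED_FIELDS = {"instruction", "input", "output"}

def validate_records(lines):
    """Return indices of records with missing/extra fields or empty values."""
    bad = []
    for idx, line in enumerate(lines):
        rec = json.loads(line)
        if set(rec) != EXPECTED_FIELDS or not all(str(v).strip() for v in rec.values()):
            bad.append(idx)
    return bad

# In practice these lines would come from the 13-problem file, e.g.:
# lines = open("leetcode_dataset.jsonl").read().splitlines()
lines = [
    json.dumps({"instruction": "Solve ...", "input": "Problem 1 ...",
                "output": "Plan ... code ... explanation ..."}),
    json.dumps({"instruction": "Solve ...", "input": "",  # empty input field
                "output": "Plan ..."}),
]
bad = validate_records(lines)
```

With only 13 examples, a single empty or misnamed field is a meaningful fraction of the training signal, so a check like this is cheap insurance.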