Fine-tuning a Phi-3 LeetCode Expert? - Dataset Generation, Unsloth ++
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Build a LeetCode fine-tuning dataset by pairing complete problem statements (including examples and constraints) with structured outputs that include a step-by-step plan, Python 3 code, and an explanation.
Briefing
Fine-tuning a Phi-3-style “LeetCode expert” starts with building a clean, structured dataset: each training example pairs a full LeetCode problem statement (including examples and constraints) with a step-by-step plan, a Python 3 solution, and an explanation. The workflow hinges on avoiding data contamination by selecting problems from a contest that has already ended, then converting those scraped problem texts into a consistent JSON schema with three fields: instruction, input, and output.
The dataset creation process begins by copying the LeetCode problem text—everything needed to solve it—then saving it in a repeatable template. The creator adds a “solution to problem X” section and uses a strong LLM (examples mentioned include Claude 3 Opus and GPT-4) to generate the reasoning and code. A system prompt is used to enforce structure: the model should provide a step-by-step chain-of-thought-style plan before writing Python code. The resulting output is then validated by pasting the generated code back into LeetCode’s Python 3 editor and running the test cases until the solution is accepted.
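The plan-before-code structure can be enforced with a system prompt, plus a cheap local check before the manual LeetCode validation step. The prompt wording and helper names below are illustrative assumptions, not the creator's exact prompt:

```python
# Hypothetical system prompt enforcing the plan-then-code output structure.
SYSTEM_PROMPT = (
    "You are a LeetCode expert. For each problem, first write a numbered "
    "step-by-step solution plan, then the Python 3 code in a fenced block, "
    "then a short explanation of the approach."
)

def plan_precedes_code(output: str) -> bool:
    """Sanity-check that the plan section appears before the first code fence."""
    plan_pos = output.lower().find("step-by-step")
    code_pos = output.find("```python")
    return 0 <= plan_pos < code_pos

# A well-formed generation passes; one that leads with code does not.
sample = (
    "Step-by-step solution:\n1. Sort the array.\n2. Two-pointer scan.\n\n"
    "```python\ndef solve(nums): ...\n```\n\nExplanation: ..."
)
```

A check like this only verifies ordering, not correctness — pasting the code into LeetCode's Python 3 editor remains the real quality gate.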
Once a single problem is verified, the same pattern is repeated across multiple problems. In the walkthrough, 13 LeetCode problems are assembled into a single dataset file, with each JSON record formatted so the instruction is standardized (e.g., “Solve the following LeetCode problem in Python 3 syntax”), the input contains the raw problem statement, and the output contains the step-by-step plan plus the final Python code and explanation. The dataset is then uploaded to Hugging Face as a public dataset, making it easy to reuse in training pipelines.
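A single record in that schema can be sketched as follows; the field values are placeholders standing in for the real problem text and verified solution:

```python
import json

# One hypothetical training record in the instruction/input/output schema.
record = {
    "instruction": "Solve the following LeetCode problem in Python 3 syntax",
    "input": "Given an array of integers nums ... (full problem statement, "
             "examples, and constraints pasted verbatim)",
    "output": "Step-by-step solution:\n1. ...\n\n```python\n...\n```\n\n"
              "Explanation: ...",
}

# Records are commonly stored one per line (JSON Lines) for training pipelines.
line = json.dumps(record)
parsed = json.loads(line)
```

Keeping the instruction string identical across all 13 records is what makes it a standardized task prompt rather than 13 unrelated examples.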
Training is performed using Unsloth notebooks aimed at fine-tuning a Phi-3 mini 4K instruct model. The notebook loads the Hugging Face dataset, maps the instruction/input/output fields into the training format, and appends an EOS token for proper sequence termination. Training runs for a set number of epochs (60 is shown) on a small GPU setup (the walkthrough mentions a Tesla T4). Training loss declines from roughly 1.0 toward the mid-0.7 range, suggesting the model is learning patterns from the 13 examples.
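The field-mapping step can be sketched as below. The Alpaca-style template and the EOS handling mirror what Unsloth notebooks commonly do, but the exact template, function names, and the "</s>" stand-in token here are assumptions — the real EOS token comes from the tokenizer:

```python
# Stand-in EOS token; in a real notebook this is tokenizer.eos_token.
EOS_TOKEN = "</s>"

TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def formatting_func(examples):
    """Map instruction/input/output columns to a single 'text' column,
    appending EOS so the model learns to terminate its responses."""
    texts = [
        TEMPLATE.format(instruction=i, input=inp, output=out) + EOS_TOKEN
        for i, inp, out in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# A function like this is typically passed to dataset.map(..., batched=True).
batch = {
    "instruction": ["Solve the following LeetCode problem in Python 3 syntax"],
    "input": ["Two Sum: given nums and target, return indices ..."],
    "output": ["Step-by-step solution: ..."],
}
formatted = formatting_func(batch)
```

Without the appended EOS token, a fine-tuned model tends to keep generating past the end of its answer instead of stopping.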
Evaluation shows mixed results. A base test on a new easy LeetCode problem yields partial correctness, and the fine-tuned model does not consistently improve accuracy on the same task. Still, the fine-tuned model tends to follow the desired output structure—producing a step-by-step solution section, then a code solution, then an explanation—even when the code itself fails some test cases. The takeaway is less about achieving perfect LeetCode performance from a tiny dataset and more about demonstrating a practical, end-to-end method: scrape and verify problems, generate structured reasoning+code with a stronger model, validate with LeetCode, convert to JSON, upload to Hugging Face, then fine-tune with Unsloth.
The creator also notes that better results might require more data, different hyperparameters, or training a stronger base model (e.g., Llama 3 instead of Phi-3), but the core contribution remains the repeatable dataset-generation pipeline and the structured prompting that makes the fine-tuned outputs look like the target format.
Cornell Notes
A reliable way to fine-tune a Phi-3-style model for LeetCode-style coding is to build a dataset where every example is fully structured: a standardized instruction, the complete LeetCode problem text (including examples and constraints), and an output that contains a step-by-step plan, Python 3 code, and an explanation. The workflow emphasizes quality control: generate solutions with a stronger LLM, then paste the code back into LeetCode and confirm it passes test cases before adding it to the dataset. In the walkthrough, 13 verified problems are converted into JSON, uploaded to Hugging Face, and used in Unsloth to fine-tune a Phi-3 mini 4K instruct model. Training loss drops during fine-tuning, and the model more reliably produces the expected reasoning/code/explanation format, even though correctness on new problems remains inconsistent.
Why does the dataset start from contest problems that have already ended?
What exact structure is used for each training example in the JSON dataset?
How are the step-by-step plans and code generated, and how is format enforced?
What quality gate ensures the generated solutions are suitable for training?
What does the Unsloth fine-tuning pipeline require from the dataset?
What improvement is observed after fine-tuning, and what limitation remains?
Review Questions
- How does validating generated code on LeetCode before adding it to the dataset change the expected quality of fine-tuning outcomes?
- What are the roles of the instruction, input, and output fields in shaping the fine-tuned model’s behavior?
- Why might training loss decrease while accuracy on new LeetCode problems still stays inconsistent?
Key Points
1. Build a LeetCode fine-tuning dataset by pairing complete problem statements (including examples and constraints) with structured outputs that include a step-by-step plan, Python 3 code, and an explanation.
2. Reduce contamination risk by selecting problems from a contest that has already ended before generating training examples.
3. Generate reasoning+code using a stronger LLM with a system prompt that forces the plan to appear before the code, then keep that format consistent across all examples.
4. Validate every generated solution by pasting the Python code back into LeetCode and confirming it passes test cases before including it in the training set.
5. Convert the curated text into a JSON schema with instruction, input, and output fields so training pipelines can load it reliably.
6. Upload the dataset to Hugging Face to make it easy to reference from Unsloth notebooks and training scripts.
7. Expect format adherence to improve faster than raw problem-solving accuracy when the dataset is small; correctness may require more data, better hyperparameters, or a stronger base model.
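Before the Hugging Face upload, a stdlib-only pass over the dataset file can catch malformed records early. The filename comment and check logic below are illustrative assumptions, not part of the original walkthrough:

```python
import json

EXPECTED_FIELDS = {"instruction", "input", "output"}

def validate_records(lines):
    """Return indices of records with missing/extra fields or empty values."""
    bad = []
    for idx, line in enumerate(lines):
        rec = json.loads(line)
        if set(rec) != EXPECTED_FIELDS or not all(str(v).strip() for v in rec.values()):
            bad.append(idx)
    return bad

# In practice these lines would come from the 13-problem file, e.g.:
# lines = open("leetcode_dataset.jsonl").read().splitlines()
lines = [
    json.dumps({"instruction": "Solve ...", "input": "Problem 1 ...",
                "output": "Plan ... code ... explanation ..."}),
    json.dumps({"instruction": "Solve ...", "input": "",  # empty input field
                "output": "Plan ..."}),
]
bad = validate_records(lines)
```

With only 13 examples, a single empty or misnamed field is a meaningful fraction of the training signal, so a check like this is cheap insurance.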