Fine-Tuning ChatGPT 3.5 with Synthetic Data from GPT-4 | VERY Interesting Results (!)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
A synthetic dataset generated by GPT-4 can be used to fine-tune GPT-3.5 turbo toward more structured, step-by-step reasoning.
Briefing
Fine-tuning ChatGPT 3.5 turbo on a fully synthetic training set generated by GPT-4 can measurably improve step-by-step reasoning, though the result doesn't reliably match GPT-4's accuracy and can still produce internally inconsistent answers. The experiment builds a pipeline that uses GPT-4 to generate riddles, math problems, and logic puzzles with an explicit multi-step solution structure, then trains a GPT-3.5 fine-tuned model on a few hundred such examples. After training, the resulting model gives more structured answers than vanilla GPT-3.5, and in at least one test it moves closer to GPT-4's correctness.
The workflow starts by choosing problem types that are easy to benchmark: riddles, math problems, and logical puzzles. Each synthetic example is created with a consistent “assignments → reasoning steps → final answer” format. A Python script automates generation: it feeds GPT-4 prompts that request new problems solvable only through step-by-step reasoning, then captures GPT-4’s structured solution. The output is saved first as plain text and then converted into JSONL for fine-tuning, using a system instruction that pushes the model to solve problems in a step-by-step way (framed as “Chain of Thought” reasoning).
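The video does not show the script's source, but the "plain text → JSONL with a system instruction" conversion step can be sketched as follows. This is a minimal illustration, not the creator's actual code: the `SYSTEM_PROMPT` wording, function names, and file path are all hypothetical, and only the JSONL chat-message shape (system/user/assistant) follows the standard fine-tuning format.

```python
import json

# Hypothetical system instruction; the exact wording used in the video differs.
SYSTEM_PROMPT = (
    "You are a problem solver that always reasons step by step. "
    "Break every problem into assignments, work through each step, "
    "then state the final answer."
)

def to_chat_record(problem: str, structured_solution: str) -> str:
    """Convert one synthetic problem/solution pair into a single JSONL line
    in the chat fine-tuning format (system / user / assistant messages)."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": problem},
            {"role": "assistant", "content": structured_solution},
        ]
    }
    return json.dumps(record)

def write_jsonl(pairs, path="training_data.jsonl"):
    """Write all (problem, solution) pairs to a JSONL file, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for problem, solution in pairs:
            f.write(to_chat_record(problem, solution) + "\n")
```

Each training example thus carries the same system instruction, which is what pushes the fine-tuned model toward the step-by-step style regardless of the user's prompt.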
The creator runs the generator at a scale that balances cost against coverage. After monitoring expenses, the plan targets 200 synthetic examples; the run is interrupted after about 1 hour and 50 minutes, by which point the dataset contains 161 examples (about 600 KB). Fine-tuning is then launched on the generated JSONL as a GPT-3.5 turbo fine-tune job. Progress is tracked via epochs and training loss; the job completes after multiple epochs, with the loss fluctuating before reaching its low point. Total spending is reported as roughly $35 for the combined process (data generation plus fine-tuning), with data generation dominating the cost.
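Launching the fine-tune job itself is a short sequence: validate the JSONL, upload it, and create the job. The sketch below assumes the OpenAI Python client (v1+); `validate_jsonl` and `launch_finetune` are hypothetical helper names, and the launch function is shown but deliberately never called here, since it makes billable API calls.

```python
import json

def validate_jsonl(path: str) -> int:
    """Check that every line of a JSONL training file parses and follows
    the chat format (a 'messages' list of role/content dicts).
    Returns the number of valid examples."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            record = json.loads(line)
            messages = record["messages"]
            assert messages, f"line {lineno}: empty messages list"
            for m in messages:
                assert m["role"] in {"system", "user", "assistant"}, \
                    f"line {lineno}: unexpected role {m['role']!r}"
                assert isinstance(m["content"], str)
            count += 1
    return count

def launch_finetune(path: str = "training_data.jsonl") -> str:
    """Upload the validated file and start a GPT-3.5 turbo fine-tune job.
    Requires the `openai` package (>=1.0) and an OPENAI_API_KEY env var."""
    from openai import OpenAI
    client = OpenAI()
    upload = client.files.create(
        file=open(path, "rb"), purpose="fine-tune"
    )
    job = client.fine_tuning.jobs.create(
        training_file=upload.id, model="gpt-3.5-turbo"
    )
    return job.id
```

Validating before upload is cheap insurance: a single malformed line would otherwise fail the job only after the file has been uploaded and queued.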
Benchmarking focuses on whether the fine-tuned model behaves more like a “reasoning-first” system. In one puzzle about identifying a cartoon character and then inferring a country of origin, vanilla GPT-3.5 refuses to answer without more information. The fine-tuned model produces a more confident chain of reasoning and lands on a country, though the reasoning path is still somewhat speculative (it assumes a character identity based on the prompt).
A second test uses a real-world physics-style scenario: a ball is dropped into a cone with holes, and the cone is then shipped to a friend. GPT-4 gives the correct outcome: the ball falls through the bottom hole and is not in the shipped box. Vanilla GPT-3.5 produces a clearly inconsistent answer, placing the ball inside the shipped container. The fine-tuned model lands between the two: its step-by-step explanation correctly identifies the ball passing through the cone, but it becomes internally inconsistent when reconciling later steps (whether the ball remains in the office, gets left behind, or ends up elsewhere). The creator interprets this as partial improvement: better reasoning structure and closer alignment with GPT-4's logic, but not full reliability.
Overall, the results suggest synthetic GPT-4 data can “bake in” structured reasoning behavior for GPT-3.5, yet accuracy still lags GPT-4 and depends heavily on dataset size, diversity, and how well the synthetic reasoning matches real-world constraints. The experiment ends with a call for further testing ideas, including narrower fine-tunes on synthetic data.
Cornell Notes
A pipeline generates a synthetic training set using GPT-4, then fine-tunes ChatGPT 3.5 turbo on that data to encourage step-by-step reasoning. The synthetic examples are riddles, math, and logic problems formatted with explicit multi-step “assignments” and a final answer, then converted to JSONL for fine-tuning. After training on 161 examples (targeting up to 200), the fine-tuned model shows more structured reasoning than vanilla GPT-3.5 and sometimes reaches GPT-4-like conclusions. In a physics-style cone-and-ball scenario, vanilla GPT-3.5 gives a wrong answer, GPT-4 is correct, and the fine-tuned model improves but still shows internal uncertainty when later steps conflict.
How was the synthetic dataset constructed to influence the fine-tuned model’s behavior?
Why did the experiment focus on “step-by-step” structure rather than just asking for answers?
What did the fine-tuning process involve after the dataset was generated?
What happened in the cartoon/museum puzzle benchmark?
How did the cone-and-ball physics benchmark distinguish GPT-4, vanilla GPT-3.5, and the fine-tuned model?
Review Questions
- What specific formatting choices in the synthetic prompts were meant to teach the fine-tuned model how to reason?
- In the cone-and-ball test, which detail about the cone’s holes drives the correct conclusion, and how did each model handle it?
- Why might training loss trends alone be insufficient to guarantee correct reasoning on unseen problems?
Key Points
1. A synthetic dataset generated by GPT-4 can be used to fine-tune GPT-3.5 turbo toward more structured, step-by-step reasoning.
2. The dataset generation emphasized consistent multi-step solution formatting (assignments, intermediate determinations, final answer) to “bake in” reasoning behavior.
3. Fine-tuning required converting synthetic outputs into JSONL chat format with a system instruction that reinforces step-by-step problem solving.
4. On a cartoon/museum inference puzzle, vanilla GPT-3.5 refused due to missing information, while the fine-tuned model produced a reasoned answer by making assumptions.
5. On a cone-and-ball physics scenario, vanilla GPT-3.5 produced a clearly wrong outcome, GPT-4 was correct, and the fine-tuned model improved but still showed internal inconsistency when later steps conflicted.
6. Cost can become dominated by GPT-4 synthetic data generation; even hundreds of examples can add up quickly compared with the fine-tuning compute itself.
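The cost asymmetry is easy to see with back-of-the-envelope arithmetic. The sketch below uses GPT-4's 2023 8K-context rates ($0.03 per 1K input tokens, $0.06 per 1K output tokens) and assumed average token counts per example; all of these numbers are illustrative assumptions, not figures from the video, and the $35 total reported there bundles retries and fine-tuning as well.

```python
# ASSUMED prices: GPT-4 8K-context rates circa 2023 (USD per 1K tokens).
# Actual prices change over time; treat these as illustrative only.
GPT4_INPUT_PER_1K = 0.03
GPT4_OUTPUT_PER_1K = 0.06

def generation_cost(n_examples, prompt_tokens=300, completion_tokens=700):
    """Rough cost of generating n synthetic examples with GPT-4,
    given assumed average prompt/completion token counts per request."""
    per_example = (prompt_tokens / 1000) * GPT4_INPUT_PER_1K \
                + (completion_tokens / 1000) * GPT4_OUTPUT_PER_1K
    return n_examples * per_example
```

Under these assumptions, each example costs about five cents, so generation cost scales linearly with dataset size and quickly dwarfs the one-off fine-tuning compute charge.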