
Fine-Tuning ChatGPT 3.5 with Synthetic Data from GPT-4 | VERY Interesting Results (!)

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

A synthetic dataset generated by GPT-4 can be used to fine-tune GPT-3.5 turbo toward more structured, step-by-step reasoning.

Briefing

Fine-tuning ChatGPT 3.5 turbo on a fully synthetic training set generated by GPT-4 can measurably improve step-by-step reasoning, though the result does not reliably match GPT-4's accuracy and can still answer with visible uncertainty. The experiment builds a pipeline that uses GPT-4 to generate riddles, math problems, and logic puzzles with an explicit multi-step solution structure, then fine-tunes GPT-3.5 on a few hundred such examples. After training, the resulting model gives more structured answers than vanilla GPT-3.5, and in at least one test it moves closer to GPT-4's correctness.

The workflow starts by choosing problem types that are easy to benchmark: riddles, math problems, and logical puzzles. Each synthetic example is created with a consistent “assignments → reasoning steps → final answer” format. A Python script automates generation: it feeds GPT-4 prompts that request new problems solvable only through step-by-step reasoning, then captures GPT-4’s structured solution. The output is saved first as plain text and then converted into JSONL for fine-tuning, using a system instruction that pushes the model to solve problems in a step-by-step way (framed as “Chain of Thought” reasoning).
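A minimal sketch of such a generation loop, assuming the current openai Python SDK (v1.x); the prompt wording, file name, and delimiter here are illustrative, not the video's exact script:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENERATION_PROMPT = (
    "Create one new riddle, math problem, or logic puzzle that can only be "
    "solved through step-by-step reasoning. Structure your answer as: "
    "assignments, numbered reasoning steps, and a final answer."
)

def generate_example() -> str:
    """Request one problem plus its structured solution from GPT-4."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
        temperature=1.0,  # encourage variety across examples
    )
    return response.choices[0].message.content

# Append each example to a plain-text file, separated by a delimiter,
# for later conversion to JSONL.
with open("synthetic_examples.txt", "a", encoding="utf-8") as f:
    for _ in range(200):  # the video targets roughly 200 examples
        f.write(generate_example() + "\n---\n")
```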

The creator sizes the generation run to balance cost and coverage: after monitoring expenses, the plan targets 200 synthetic examples. The run is interrupted after about 1 hour and 50 minutes, leaving a dataset of 161 examples (about 600 KB). A GPT-3.5 turbo fine-tuning job is then launched on the generated JSONL. Training is tracked through epochs and training loss; the job completes after multiple epochs, with the loss fluctuating before hitting its low point. Total spending is reported as roughly $35 for the combined process (data generation plus fine-tuning), with data generation dominating the cost.
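At the reported totals, that works out to roughly $35 / 161 ≈ $0.22 per training example, all-in.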

Benchmarking focuses on whether the fine-tuned model behaves more like a “reasoning-first” system. In one puzzle about identifying a cartoon character and then inferring a country of origin, vanilla GPT-3.5 refuses to answer without more information. The fine-tuned model produces a more confident chain of reasoning and lands on a country, though the reasoning path is still somewhat speculative (it assumes a character identity based on the prompt).
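Benchmarking the fine-tune amounts to calling it by the model ID the fine-tuning job returns. A sketch with the openai Python SDK (v1.x); the model ID is hypothetical (real IDs look like "ft:gpt-3.5-turbo-0613:org::abc123"), and the user prompt is a paraphrase of the puzzle, not the video's exact wording:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # hypothetical fine-tuned model ID
    messages=[
        {"role": "system",
         "content": "You are an expert problem solver. Think step by step."},
        {"role": "user",
         "content": "A cartoon character is standing in front of a famous "
                    "painting in a museum. Which country is the painting from?"},
    ],
)
print(response.choices[0].message.content)
```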

A second test uses a real-world, physics-style scenario: a ball is dropped into a cone that has holes, and the cone is then shipped to a friend. GPT-4 gives the correct outcome: the ball falls through the bottom hole and is not in the shipped box. Vanilla GPT-3.5 produces a clearly inconsistent answer, placing the ball inside the shipped container. The fine-tuned model lands between the two: it provides a step-by-step explanation that correctly identifies the ball passing through the cone, but it becomes internally inconsistent in the later steps (whether the ball remains in the office, gets left behind, or ends up elsewhere). The creator reads this as partial improvement: better reasoning structure and closer alignment with GPT-4's logic, but not full reliability.

Overall, the results suggest synthetic GPT-4 data can “bake in” structured reasoning behavior for GPT-3.5, yet accuracy still lags GPT-4 and depends heavily on dataset size, diversity, and how well the synthetic reasoning matches real-world constraints. The experiment ends with a call for further testing ideas, including narrower fine-tunes on synthetic data.

Cornell Notes

A pipeline generates a synthetic training set using GPT-4, then fine-tunes ChatGPT 3.5 turbo on that data to encourage step-by-step reasoning. The synthetic examples are riddles, math, and logic problems formatted with explicit multi-step “assignments” and a final answer, then converted to JSONL for fine-tuning. After training on 161 examples (targeting up to 200), the fine-tuned model shows more structured reasoning than vanilla GPT-3.5 and sometimes reaches GPT-4-like conclusions. In a physics-style cone-and-ball scenario, vanilla GPT-3.5 gives a wrong answer, GPT-4 is correct, and the fine-tuned model improves but still shows internal uncertainty when later steps conflict.

How was the synthetic dataset constructed to influence the fine-tuned model’s behavior?

GPT-4 was prompted to create problems (riddles, math, logic) that require step-by-step solving. Each example included a structured reasoning sequence—identifying relevant elements, associating them to the task, determining intermediate conclusions, and then producing a final answer. The generator automated this with Python loops, saving outputs and then converting them into JSONL formatted as chat messages (system + user prompt + assistant response). A system instruction during fine-tuning emphasized being an “expert problem solver” that thinks step-by-step using Chain of Thought-style reasoning.
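For reference, one training record in OpenAI's documented chat fine-tuning JSONL format; the system instruction mirrors the description above, while the sample puzzle and solution text are illustrative:

```python
import json

record = {
    "messages": [
        {"role": "system",
         "content": "You are an expert problem solver. Think step by step "
                    "using Chain of Thought reasoning."},
        {"role": "user",
         "content": "A farmer has 17 sheep; all but 9 run away. "
                    "How many are left?"},
        {"role": "assistant",
         "content": "Assignment: find the remaining sheep.\n"
                    "Step 1: 'All but 9' means 9 sheep do not run away.\n"
                    "Final answer: 9 sheep are left."},
    ]
}

# Each record occupies one line of the JSONL training file.
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```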

Why did the experiment focus on “step-by-step” structure rather than just asking for answers?

The goal was to bake reasoning format into the fine-tuned model. The synthetic prompts repeatedly requested the same procedure: break the problem into assignments, solve in order, and then provide a final answer. That consistency is intended to teach the model not only what to answer, but how to produce an answer with intermediate steps that match the training pattern.

What did the fine-tuning process involve after the dataset was generated?

The JSONL file was uploaded, then a fine-tuning job was created against the GPT-3.5 turbo base model. Training was monitored through epochs and training loss. Epochs were described as full passes over the training examples, and training loss as a measure of the discrepancy between the model's predictions and the training labels, useful for spotting learning progress and potential overfitting.
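A hedged sketch of those steps with the openai Python SDK (v1.x); the video may have used the web dashboard or an older SDK instead, and the file name carries over from the earlier sketch:

```python
from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job against the GPT-3.5 turbo base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# Job events report progress, including per-step training loss.
events = client.fine_tuning.jobs.list_events(
    fine_tuning_job_id=job.id, limit=10
)
for event in events.data:
    print(event.message)
```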

What happened in the cartoon/museum puzzle benchmark?

Vanilla GPT-3.5 produced an unhelpful response, effectively saying the country of origin could not be determined without more information about the cartoon character. The fine-tuned model instead generated a chain of reasoning that assumed the painting was the Mona Lisa, connected the cartoon character to that Leonardo association, and ultimately gave a country-of-origin answer. The improvement was mainly a willingness to reason through the missing link rather than refuse.

How did the cone-and-ball physics benchmark distinguish GPT-4, vanilla GPT-3.5, and the fine-tuned model?

GPT-4 correctly reasoned that the cone has a hole at the bottom, so the ball passes through and falls out before the cone is shipped—meaning the ball isn’t in the shipped box. Vanilla GPT-3.5 gave a contradictory answer that kept the ball inside the shipped container. The fine-tuned model produced a more GPT-4-like step breakdown (ball passes through the cone), but it still struggled with consistency later—when asked for a final location, it expressed uncertainty about whether the ball remained in the office or was left behind en route.

Review Questions

  1. What specific formatting choices in the synthetic prompts were meant to teach the fine-tuned model how to reason?
  2. In the cone-and-ball test, which detail about the cone’s holes drives the correct conclusion, and how did each model handle it?
  3. Why might training loss trends alone be insufficient to guarantee correct reasoning on unseen problems?

Key Points

  1. A synthetic dataset generated by GPT-4 can be used to fine-tune GPT-3.5 turbo toward more structured, step-by-step reasoning.

  2. The dataset generation emphasized consistent multi-step solution formatting (assignments, intermediate determinations, final answer) to “bake in” reasoning behavior.

  3. Fine-tuning required converting synthetic outputs into JSONL chat format with a system instruction that reinforces step-by-step problem solving.

  4. On a cartoon/museum inference puzzle, vanilla GPT-3.5 refused due to missing information, while the fine-tuned model produced a reasoned answer by making assumptions.

  5. On a cone-and-ball physics scenario, vanilla GPT-3.5 produced a clearly wrong outcome, GPT-4 was correct, and the fine-tuned model improved but still showed internal inconsistency when later steps conflicted.

  6. Cost can become dominated by GPT-4 synthetic data generation; even hundreds of examples can add up quickly compared with the fine-tuning compute itself.

Highlights

Synthetic GPT-4 data didn’t just change answers—it changed the model’s willingness to produce structured reasoning steps instead of refusing.
In the cone-and-ball test, the fine-tuned model recognized the key physical constraint (holes mean the ball falls through), unlike vanilla GPT-3.5.
The fine-tuned model sometimes landed between GPT-4 and vanilla GPT-3.5: better reasoning structure, but not fully consistent conclusions.
Training on a few hundred synthetic examples produced noticeable behavioral shifts, but accuracy still depended on problem complexity and consistency demands.

Topics

  • Synthetic Data Generation
  • GPT-4 Prompting
  • Fine-Tuning GPT-3.5
  • Chain of Thought
  • Benchmarking Reasoning

Mentioned

  • GPT-4
  • GPT-3.5
  • JSONL
  • LLM