Building an LLM Fine-Tuning Dataset
Based on sentdex's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Fine-tuning an LLM on Reddit comments is less about model training and more about building a usable dataset—especially when the goal is multi-turn, multi-speaker conversations. The workflow centers on extracting large volumes of Reddit comment data (not posts), converting it into conversation chains, and then formatting those chains into training samples that a causal language model can learn from. The payoff is a dataset tailored to realistic social dynamics: on Reddit, responses often involve more than two participants, and the “bot” reply needs to be grounded in a history of prior comments.
The process starts with sourcing comment archives. A long-running Reddit comment dataset exists via torrents/archives and, crucially for scale and convenience, through a maintained BigQuery dataset containing Reddit comments from roughly 2005 through the end of 2019. That BigQuery resource is large enough to train or fine-tune models at meaningful sizes (the transcript mentions a 7B-class model as a plausible target) and can be filtered by subreddit to target a specific community's behavior. The practical bottleneck is not the amount of data, but exporting it in manageable chunks, downloading it reliably, and reshaping it into something trainable.
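As a concrete sketch of that export step, the snippet below uses the google-cloud-bigquery client to shard one month of comments into gzipped newline-delimited JSON files on GCS. The table name and bucket path are placeholders, not confirmed by the transcript:

```python
# Sketch: export one month of Reddit comments from BigQuery to GCS as
# gzipped newline-delimited JSON. Table and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical fully-qualified table holding one month of comments.
source_table = "fh-bigquery.reddit_comments.2019_12"

# A wildcard URI lets BigQuery shard a large table into many files.
destination_uri = "gs://your-bucket/reddit/2019_12/comments-*.json.gz"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)

extract_job = client.extract_table(
    source_table, destination_uri, job_config=job_config
)
extract_job.result()  # blocks until the export job finishes
```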
To handle export, the workflow uses BigQuery exports to Google Cloud Storage, typically organized by year and month. The transcript emphasizes that CSV exports become unwieldy for free-text comment bodies (embedded newlines and commas break naive parsing), so newline-delimited JSON is preferred, even though it creates many small files. Downloading and storage then become operational problems: gzip compression reduces transfer size but adds decompression overhead, and unstable internet can force overnight downloads. Because decompression and parsing are expensive at scale, the workflow pre-decompresses files once, locally or on a NAS, so that CPU work is not repeated during later iterations.
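A minimal sketch of that one-time decompression pass, assuming gzipped newline-delimited JSON exports; the source and destination paths are placeholders:

```python
# Sketch: decompress downloaded *.json.gz exports once, so later parsing
# passes read plain JSON lines instead of paying the gzip CPU cost each run.
import gzip
import shutil
from pathlib import Path

SRC = Path("downloads/reddit/2019_12")   # where the exports were downloaded
DST = Path("/mnt/nas/reddit/2019_12")    # e.g., a NAS share for decompressed data
DST.mkdir(parents=True, exist_ok=True)

for gz_path in sorted(SRC.glob("*.json.gz")):
    out_path = DST / gz_path.name[:-3]   # strip the ".gz" suffix
    if out_path.exists():                # resume-friendly: skip finished files
        continue
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```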
Once the raw JSON is decompressed, the next step is data wrangling: selecting only the fields needed for conversation reconstruction (comment body, author, timestamp, comment ID, parent ID, and score). The transcript notes that many comments never participate in reply chains, and that top-level comments point at the post itself, a parent that a comments-only dump does not contain, so naive parent-based reconstruction can drop them; anchoring those replies to post titles requires extra logic. A key design decision is how to build training samples: instead of simple instruction-response pairs, the dataset is assembled into conversation chains by walking parent IDs backward until no parent is found, then packaging the resulting history.
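A sketch of that reconstruction, assuming Reddit's comment schema (parent_id carries a "t1_" prefix for comment parents and "t3_" when the parent is the post itself); the file path is a placeholder:

```python
# Sketch: keep only the fields needed for reply-chain reconstruction, then
# walk parent_id links backward to rebuild a conversation history.
import json

FIELDS = ("id", "parent_id", "body", "author", "created_utc", "score")

comments = {}  # comment id -> trimmed record
with open("/mnt/nas/reddit/2019_12/comments-000000000000.json") as f:
    for line in f:
        c = json.loads(line)
        comments[c["id"]] = {k: c.get(k) for k in FIELDS}

def build_chain(comment_id):
    """Return the conversation from the oldest ancestor down to comment_id."""
    chain = []
    cid = comment_id
    while cid in comments:
        chain.append(comments[cid])
        parent = comments[cid]["parent_id"] or ""
        # "t1_" prefixes a comment parent; "t3_" means the parent is the
        # post itself, which is absent from a comments-only dump.
        if not parent.startswith("t1_"):
            break
        cid = parent[3:]  # strip the "t1_" prefix to get the parent comment id
    return list(reversed(chain))  # oldest comment first
```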
For WallStreetBets, the dataset is filtered using minimum reply length and minimum upvotes (score), producing multiple dataset variants (e.g., min score 3/5/10 and min length 2/5). The transcript also explores multi-speaker formatting by assigning speaker IDs to each author and using a dedicated bot name (“WallStreetBot”) for the target reply. The resulting samples can be used either as full prompt+response training text or as structured instruction formats, depending on the fine-tuning approach.
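A hedged sketch of that filtering and multi-speaker formatting; the thresholds, word-based length check, and speaker-label scheme are illustrative rather than the video's exact template:

```python
# Sketch: filter chains by the target reply's score and length, then format
# a multi-speaker training sample with "WallStreetBot" as the target speaker.
MIN_SCORE, MIN_LENGTH = 5, 5  # one of the experimented variants; length in words here

def format_sample(chain):
    target = chain[-1]  # the last comment in the chain is the bot's reply
    if target["score"] < MIN_SCORE or len(target["body"].split()) < MIN_LENGTH:
        return None  # drop low-signal samples

    # Assign a stable speaker id per author within this conversation.
    speakers = {}
    lines = []
    for c in chain[:-1]:
        sid = speakers.setdefault(c["author"], f"speaker_{len(speakers)}")
        lines.append(f"{sid}: {c['body']}")
    lines.append(f"WallStreetBot: {target['body']}")
    return "\n".join(lines)
```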
On the training side, the workflow experiments with model choices and ultimately uses parameter-efficient fine-tuning (LoRA/QLoRA-style) through PEFT's AutoPeftModelForCausalLM. That loader supports checkpointing adapters and testing them without repeatedly merging or dequantizing full models, making iteration faster and cheaper. Early results suggest the curated, filtered dataset improves quality versus an earlier attempt, and the transcript claims diminishing returns beyond roughly 500–1,000 training steps, with the possibility that far fewer samples could work if the filtering is strong enough. The end state is a set of Hugging Face datasets and fine-tuned adapters, plus a plan to iterate on better prompts and more targeted sample selection (including higher score thresholds) while avoiding "junk" comments that inflate scores without meaningful content.
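A minimal sketch of testing an adapter checkpoint with PEFT's AutoPeftModelForCausalLM, assuming the checkpoint directory also contains a saved tokenizer; the path and prompt are placeholders:

```python
# Sketch: load a LoRA adapter checkpoint for quick testing without first
# merging it into the base model.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ckpt = "out/wallstreetbot/checkpoint-500"  # hypothetical adapter checkpoint

# AutoPeftModelForCausalLM reads the adapter config, pulls the matching
# base model, and attaches the adapter weights in one call.
model = AutoPeftModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt = "speaker_0: GME to the moon?\nWallStreetBot:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```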
Cornell Notes
The core work is turning massive Reddit comment archives into multi-turn, multi-speaker training data for LLM fine-tuning. The workflow pulls Reddit comments from a BigQuery dataset (2005–2019), exports them to JSON chunks in Google Cloud Storage, downloads and pre-decompresses them, then parses only the fields needed to reconstruct reply chains (body, author, creation time, ID, parent ID, score). Training samples are built by walking parent IDs to form conversation histories, then labeling participants with speaker roles and a dedicated bot name for the target reply. Filtering by minimum reply length and minimum upvotes creates multiple dataset variants, and parameter-efficient fine-tuning (PEFT) enables fast adapter-checkpoint testing without repeatedly merging full models.
Why does the workflow focus on Reddit comments (not posts) and on conversation chains rather than simple instruction-response pairs?
How does the dataset reconstruction avoid producing meaningless training samples?
What makes exporting and preparing the raw Reddit data so time-consuming?
How are multi-speaker training samples represented for fine-tuning?
What role do score and length thresholds play in dataset quality?
Why does parameter-efficient fine-tuning (via PEFT's AutoPeft loading) matter for iteration speed?
Review Questions
- When reconstructing conversation chains from Reddit comments, which fields are essential (and why) to link a reply to its prior context?
- What trade-offs arise when choosing JSON vs CSV exports and when using gzip compression for large comment bodies?
- How do minimum score and minimum length filters change the distribution of training samples, and what kinds of “junk” comments are they meant to suppress?
Key Points
1. Treat dataset curation as the main engineering challenge: training is comparatively straightforward once reply chains are correctly reconstructed.
2. Use a scalable comment source (BigQuery for 2005–2019 Reddit comments) and export in JSON chunks to keep long comment bodies usable.
3. Pre-decompress exported data once to avoid repeated CPU overhead during parsing and dataset building; consider decompressing on a NAS to reduce network bottlenecks.
4. Rebuild multi-turn context by walking parent IDs backward to form conversation histories, then package those histories as multi-speaker samples with explicit speaker labels.
5. Filter by minimum reply length and minimum upvotes (score) to reduce low-effort replies and to create multiple dataset variants for experimentation.
6. Prefer parameter-efficient fine-tuning (PEFT with LoRA/QLoRA adapters) so adapter checkpoints can be tested quickly without repeatedly merging or dequantizing full models.
7. Expect diminishing returns beyond roughly 500–1,000 training steps when the dataset is already well-filtered; further gains may require better sample selection rather than longer training.