Build a custom dataset with LightningDataModule in PyTorch Lightning
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical path to text classification in PyTorch Lightning starts with turning the multi-annotator GoEmotions dataset into one clean label per comment, then wrapping that data in a custom Dataset and a LightningDataModule. The key move is collapsing the 28 one-hot label columns (27 emotion categories plus a neutral class) into a single target by grouping annotations by comment ID, summing the one-hot label columns, and taking the index of the maximum count, so every comment ends up with exactly one emotion label suitable for supervised training.
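The collapse step can be sketched with a toy dataframe; the column names and sample rows below are illustrative stand-ins (the real GoEmotions files carry 28 one-hot columns and many more rows):

```python
import pandas as pd

# Toy stand-in for GoEmotions annotation rows: one row per annotator,
# one one-hot column per emotion. The real data has 28 such columns.
emotion_cols = ["admiration", "amusement", "anger", "neutral"]
df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "text": ["great job!", "great job!", "great job!", "ugh"],
    "admiration": [1, 1, 0, 0],
    "amusement": [0, 1, 0, 0],
    "anger": [0, 0, 0, 1],
    "neutral": [0, 0, 1, 0],
})

# Group by comment ID, sum the one-hot columns, take the argmax index.
counts = df.groupby("id")[emotion_cols].sum()
label_ids = counts.values.argmax(axis=1)  # integer label per comment
label_names = counts.idxmax(axis=1)       # same choice, as a column name
```

Note that `argmax` resolves ties in favor of the lowest column index, which is the tie-breaking behavior the first review question below probes.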
After the labels are consolidated, the workflow adds a small but useful enrichment: an emoji mapped from each emotion category. Using an emoji map from the original GoEmotions example, the code maps each comment’s chosen emotion to its human-readable name and then to the corresponding emoji, producing a dataframe that pairs text with a single categorical label (and an emoji column for readability/debugging).
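The enrichment can be expressed as two `map` calls; the emotion names and emoji entries below are placeholders, since the full mapping ships with the original GoEmotions example:

```python
import pandas as pd

# Index -> emotion name, and name -> emoji. Both lists are truncated and
# illustrative; the real ones come from the GoEmotions example code.
emotion_names = ["admiration", "anger", "neutral"]
emoji_map = {"admiration": "👏", "anger": "😡", "neutral": "😐"}

df = pd.DataFrame({"text": ["great job!", "ugh"], "label": [0, 1]})
df["emotion"] = df["label"].map(lambda i: emotion_names[i])  # readable name
df["emoji"] = df["emotion"].map(emoji_map)                   # debugging aid
```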
With the dataframe ready, the next step is a custom PyTorch Dataset that plugs into transformer tokenization. An `EmotionDataset` class inherits from the PyTorch dataset base and stores the dataframe plus an Electra tokenizer. The dataset implements `__len__` to return the number of rows and `__getitem__` to tokenize the comment text with `max_length=64`, enabling truncation and padding to a fixed length. The tokenizer outputs `input_ids` and `attention_mask` as PyTorch tensors, while the emotion label is converted into a `torch.tensor` for training. Each dataset item is returned as a dictionary containing `input_ids`, `attention_mask`, and `labels`.
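A minimal sketch of such a Dataset, assuming a Hugging Face tokenizer (e.g. loaded via `ElectraTokenizerFast.from_pretrained`) and a dataframe with `text` and `label` columns; the exact argument names are assumptions:

```python
import torch
from torch.utils.data import Dataset

class EmotionDataset(Dataset):
    """Pairs tokenized comment text with a single integer emotion label."""

    def __init__(self, df, tokenizer, max_length=64):
        self.df = df
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Truncate/pad every comment to a fixed 64-token length.
        encoding = self.tokenizer(
            row["text"],
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "labels": torch.tensor(row["label"], dtype=torch.long),
        }
```

The `.flatten()` calls drop the batch dimension the tokenizer adds with `return_tensors="pt"`, so each item yields 1-D tensors that the default DataLoader collate can stack into `(batch_size, 64)` batches.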
Once the Dataset works in isolation, the LightningDataModule becomes the abstraction layer that handles splitting and batching. A custom `EmotionDataModule` inherits from `LightningDataModule` in PyTorch Lightning (conventionally imported as `pl`). In `setup`, the dataframe is split into training and test sets using `train_test_split` with `test_size=0.2`, after seeding all random number generators via `pl.seed_everything(42)` for reproducibility. The test portion is then split again to create a validation set (50/50 split of the test data), yielding train/val/test dataframes.
The module then defines dataloaders for each stage. `train_dataloader` returns a DataLoader built from the custom `EmotionDataset`, using the provided tokenizer, batch size, shuffling enabled for training, and `num_workers` set via `os.cpu_count()` to match available CPU cores. Validation and test dataloaders are created similarly but without shuffling. A quick manual call to `setup` and inspection of a batch confirms the expected tensor shapes: sequences are padded/truncated to length 64, and each batch contains `batch_size` items with `input_ids`, `attention_mask`, and `labels`.
The result is a reusable data pipeline: swapping in a different dataset requires only a new LightningDataModule (or Dataset) while keeping downstream model code intact. The next logical step—mentioned as the follow-up—is wrapping an Electra model into a Lightning module for training or fine-tuning.
Cornell Notes
The workflow converts GoEmotions’ multi-annotator labels into one emotion label per comment by grouping rows by comment ID, summing the one-hot emotion columns, and taking the index of the maximum count. It then tokenizes each comment using an Electra tokenizer with `max_length=64`, truncation, and padding, returning `input_ids`, `attention_mask`, and `labels` from a custom PyTorch Dataset. A LightningDataModule wraps that Dataset to handle train/validation/test splitting (20% test, then half of that for validation) with a fixed seed for reproducibility. Finally, dataloaders are created with batch size, shuffling for training, and `num_workers` from `os.cpu_count()`, and batch contents are sanity-checked for expected tensor shapes.
How does the pipeline turn multiple emotion annotations into a single training label per comment?
Why use `max_length=64` with truncation and padding in the Dataset?
What exactly does `__getitem__` return for each example?
How does the LightningDataModule create train/val/test splits?
What parameters control batching and parallelism in the dataloaders?
Review Questions
- If a comment has equal counts across multiple emotion categories after summing annotations, what label selection behavior does the `max`/index approach imply?
- How would changing `max_length` from 64 to a larger value affect memory use and batch throughput during training?
- Why is calling `setup` important before requesting dataloaders from a LightningDataModule in this workflow?
Key Points
1. Collapse GoEmotions’ multi-annotator labels into one emotion per comment by grouping by comment ID, summing one-hot emotion columns, and taking the argmax index.
2. Create a custom PyTorch Dataset that tokenizes text with an Electra tokenizer using `max_length=64`, truncation, and padding to a fixed sequence length.
3. Return a dictionary from `__getitem__` containing `input_ids`, `attention_mask`, and `labels` as tensors to match transformer training expectations.
4. Use a LightningDataModule to centralize splitting logic: 80/20 train/test, then split the test half into validation and final test.
5. Make splits reproducible by setting a fixed seed via `pl.seed_everything(42)` before calling `train_test_split`.
6. Build DataLoaders from the custom Dataset inside `train_dataloader`, `val_dataloader`, and `test_dataloader`, using batch size and `num_workers=os.cpu_count()`.
7. Sanity-check a batch to confirm tensor shapes: sequence length equals 64 and batch dimension equals the chosen batch size.