Build a custom dataset with LightningDataModule in PyTorch Lightning
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical path to text classification in PyTorch Lightning starts with turning the multi-annotator GoEmotions dataset into one clean label per comment, then wrapping that data in a custom Dataset and a LightningDataModule. The key move is collapsing the 28 one-hot label columns (27 emotion categories plus a neutral class) into a single target by grouping annotations by comment ID, summing the one-hot label columns, and taking the index of the maximum count, so every comment ends up with exactly one emotion label suitable for supervised training.
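The collapse step can be sketched with a toy dataframe; the column names and sample rows below are illustrative stand-ins (the real GoEmotions files carry 28 one-hot columns and many more rows):

```python
import pandas as pd

# Toy stand-in for GoEmotions annotation rows: one row per annotator,
# one one-hot column per emotion. The real data has 28 such columns.
emotion_cols = ["admiration", "amusement", "anger", "neutral"]
df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "text": ["great job!", "great job!", "great job!", "ugh"],
    "admiration": [1, 1, 0, 0],
    "amusement": [0, 1, 0, 0],
    "anger": [0, 0, 0, 1],
    "neutral": [0, 0, 1, 0],
})

# Group by comment ID, sum the one-hot columns, take the argmax index.
counts = df.groupby("id")[emotion_cols].sum()
label_ids = counts.values.argmax(axis=1)  # integer label per comment
label_names = counts.idxmax(axis=1)       # same choice, as a column name
```

Note that `argmax` resolves ties in favor of the lowest column index, which is the tie-breaking behavior the first review question below probes.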
After the labels are consolidated, the workflow adds a small but useful enrichment: an emoji mapped from each emotion category. Using an emoji map from the original GoEmotions example, the code maps each comment’s chosen emotion to its human-readable name and then to the corresponding emoji, producing a dataframe that pairs text with a single categorical label (and an emoji column for readability/debugging).
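The enrichment can be expressed as two `map` calls; the emotion names and emoji entries below are placeholders, since the full mapping ships with the original GoEmotions example:

```python
import pandas as pd

# Index -> emotion name, and name -> emoji. Both lists are truncated and
# illustrative; the real ones come from the GoEmotions example code.
emotion_names = ["admiration", "anger", "neutral"]
emoji_map = {"admiration": "👏", "anger": "😡", "neutral": "😐"}

df = pd.DataFrame({"text": ["great job!", "ugh"], "label": [0, 1]})
df["emotion"] = df["label"].map(lambda i: emotion_names[i])  # readable name
df["emoji"] = df["emotion"].map(emoji_map)                   # debugging aid
```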
With the dataframe ready, the next step is a custom PyTorch Dataset that plugs into transformer tokenization. An `EmotionDataset` class inherits from the PyTorch dataset base and stores the dataframe plus an Electra tokenizer. The dataset implements `__len__` to return the number of rows and `__getitem__` to tokenize the comment text with `max_length=64`, enabling truncation and padding to a fixed length. The tokenizer outputs `input_ids` and `attention_mask` as PyTorch tensors, while the emotion label is converted into a `torch.tensor` for training. Each dataset item is returned as a dictionary containing `input_ids`, `attention_mask`, and `labels`.
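A minimal sketch of such a Dataset, assuming a Hugging Face tokenizer (e.g. loaded via `ElectraTokenizerFast.from_pretrained`) and a dataframe with `text` and `label` columns; the exact argument names are assumptions:

```python
import torch
from torch.utils.data import Dataset

class EmotionDataset(Dataset):
    """Pairs tokenized comment text with a single integer emotion label."""

    def __init__(self, df, tokenizer, max_length=64):
        self.df = df
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Truncate/pad every comment to a fixed 64-token length.
        encoding = self.tokenizer(
            row["text"],
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "labels": torch.tensor(row["label"], dtype=torch.long),
        }
```

The `.flatten()` calls drop the batch dimension the tokenizer adds with `return_tensors="pt"`, so each item yields 1-D tensors that the default DataLoader collate can stack into `(batch_size, 64)` batches.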
Once the Dataset works in isolation, the LightningDataModule becomes the abstraction layer that handles splitting and batching. A custom `EmotionDataModule` inherits from `LightningDataModule` in PyTorch Lightning (conventionally imported as `pl`). In `setup`, the dataframe is split into training and test sets using `train_test_split` with `test_size=0.2`, after seeding all random number generators via `pl.seed_everything(42)` for reproducibility. The test portion is then split again to create a validation set (50/50 split of the test data), yielding train/val/test dataframes.
The module then defines dataloaders for each stage. `train_dataloader` returns a DataLoader built from the custom `EmotionDataset`, using the provided tokenizer, batch size, shuffling enabled for training, and `num_workers` set via `os.cpu_count()` to match available CPU cores. Validation and test dataloaders are created similarly but without shuffling. A quick manual call to `setup` and inspection of a batch confirms the expected tensor shapes: sequences are padded/truncated to length 64, and each batch contains `batch_size` items with `input_ids`, `attention_mask`, and `labels`.
The result is a reusable data pipeline: swapping in a different dataset requires only a new LightningDataModule (or Dataset) while keeping downstream model code intact. The next logical step—mentioned as the follow-up—is wrapping an Electra model into a Lightning module for training or fine-tuning.
Cornell Notes
The workflow converts GoEmotions’ multi-annotator labels into one emotion label per comment by grouping rows by comment ID, summing the one-hot emotion columns, and taking the index of the maximum count. It then tokenizes each comment using an Electra tokenizer with `max_length=64`, truncation, and padding, returning `input_ids`, `attention_mask`, and `labels` from a custom PyTorch Dataset. A LightningDataModule wraps that Dataset to handle train/validation/test splitting (20% test, then half of that for validation) with a fixed seed for reproducibility. Finally, dataloaders are created with batch size, shuffling for training, and `num_workers` from `os.cpu_count()`, and batch contents are sanity-checked for expected tensor shapes.
How does the pipeline turn multiple emotion annotations into a single training label per comment?
Why use `max_length=64` with truncation and padding in the Dataset?
What exactly does `__getitem__` return for each example?
How does the LightningDataModule create train/val/test splits?
What parameters control batching and parallelism in the dataloaders?
Review Questions
- If a comment has equal counts across multiple emotion categories after summing annotations, what label selection behavior does the `max`/index approach imply?
- How would changing `max_length` from 64 to a larger value affect memory use and batch throughput during training?
- Why is calling `setup` important before requesting dataloaders from a LightningDataModule in this workflow?
Key Points
1. Collapse GoEmotions’ multi-annotator labels into one emotion per comment by grouping by comment ID, summing one-hot emotion columns, and taking the argmax index.
2. Create a custom PyTorch Dataset that tokenizes text with an Electra tokenizer using `max_length=64`, truncation, and padding to a fixed sequence length.
3. Return a dictionary from `__getitem__` containing `input_ids`, `attention_mask`, and `labels` as tensors to match transformer training expectations.
4. Use a LightningDataModule to centralize splitting logic: 80/20 train/test, then split the test half into validation and final test.
5. Make splits reproducible by setting a fixed seed via `pl.seed_everything(42)` before calling `train_test_split`.
6. Build DataLoaders from the custom Dataset inside `train_dataloader`, `val_dataloader`, and `test_dataloader`, using batch size and `num_workers=os.cpu_count()`.
7. Sanity-check a batch to confirm tensor shapes: sequence length equals 64 and batch dimension equals the chosen batch size.