
Create Custom Dataset for Question Answering with T5 using HuggingFace, Pytorch Lightning & PyTorch

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Extract BioASQ QA JSON into a DataFrame by iterating paragraphs → questions → answers and using answer_start to compute answer_end inside the context.

Briefing

Fine-tuning T5 for question answering starts with turning BioASQ biomedical QA files into a model-ready dataset: each training example becomes a (question, context) input pair with the answer text extracted as a span from the context. The workflow builds a PyTorch Dataset that tokenizes the source as “question + context” and tokenizes the target as the answer, then prepares labels by masking padding tokens so loss is computed only on real answer tokens. This matters because T5 is a text-to-text, sequence-to-sequence model—so the data must be shaped exactly around its input/output format, not around classification-style targets.

The pipeline begins by setting up the environment for HuggingFace Transformers and PyTorch Lightning, including installing SentencePiece and specific library versions. After imports, it loads the BioASQ question answering training data from a downloaded and unzipped BioASQ zip (the example uses the “factoid” training split files). Each JSON document contains a title and a list of paragraphs; each paragraph includes a context string plus multiple question entries. For every question, the script iterates through the associated answers, capturing both the answer text and its character start index inside the context. The end index is computed from the start plus the answer text length, enabling the key property of the dataset: the answer is a substring of the context.
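A minimal sketch of this extraction step, assuming the BioASQ factoid files follow the SQuAD-style JSON layout described above (`data` → `paragraphs` → `qas` → `answers`); the function name and exact field names are assumptions, not the transcript's verbatim code:

```python
import json
from pathlib import Path

import pandas as pd


def extract_questions_and_answers(factoid_path: Path) -> pd.DataFrame:
    """Flatten one BioASQ training file into (question, context, answer) rows."""
    with factoid_path.open() as f:
        data = json.load(f)["data"]

    rows = []
    for document in data:
        for paragraph in document["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                for answer in qa.get("answers", []):
                    answer_text = answer["text"]
                    answer_start = answer["answer_start"]
                    # Key property: the answer is a substring of the context,
                    # so the end index is start + length of the answer text.
                    answer_end = answer_start + len(answer_text)
                    rows.append({
                        "question": question,
                        "context": context,
                        "answer_text": answer_text,
                        "answer_start": answer_start,
                        "answer_end": answer_end,
                    })
    return pd.DataFrame(rows)
```

Running this over each training file and concatenating the resulting DataFrames yields the aggregated dataset used below.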

Those extracted fields are assembled into a Pandas DataFrame with columns for question, context, answer text, answer start, and answer end. The resulting dataset is then aggregated across the training files and checked for scale: the transcript reports roughly 443 unique questions and about 2.5k total (question, answer) examples when considering unique contexts. A quick visualization step uses colored terminal output to highlight the answer span within the context, confirming that the indexing aligns with the text.
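The highlighting idea can be sketched with ANSI escape codes; `color_answer` is a hypothetical helper name, and the row fields match the DataFrame columns above:

```python
def color_answer(row: dict) -> str:
    """Return the context with the answer span wrapped in ANSI green,
    so printing it to a terminal visually confirms the span indices."""
    start, end = row["answer_start"], row["answer_end"]
    context = row["context"]
    green, reset = "\033[32m", "\033[0m"
    return context[:start] + green + context[start:end] + reset + context[end:]
```

If the indices are off by even one character, the colored region visibly drifts from the answer text, which makes this a cheap sanity check.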

Next comes the tokenization logic tailored to T5. Using the HuggingFace T5Tokenizer loaded from the “t5-base” checkpoint, the script demonstrates how the tokenizer produces input_ids and an attention_mask for a sample question, and how it inserts special tokens to separate sequences. For training, it encodes the source with a maximum length of 396 tokens while truncating only the context if needed, and it encodes the target answer with its own maximum length of 32 tokens. Labels are derived from the target input_ids, but padding positions are replaced with -100 so the training loss ignores them.
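A sketch of that encoding step (the function name `encode_example` is an assumption; the tokenizer arguments follow the standard HuggingFace call signature, where `truncation="only_second"` truncates only the second member of the pair, i.e. the context):

```python
import torch


def encode_example(tokenizer, row, source_max_len=396, target_max_len=32):
    # `tokenizer` is expected to be a HuggingFace T5 tokenizer, e.g.
    # T5Tokenizer.from_pretrained("t5-base").
    source = tokenizer(
        row["question"],
        row["context"],
        max_length=source_max_len,
        padding="max_length",
        truncation="only_second",  # truncate only the context, never the question
        return_tensors="pt",
    )
    target = tokenizer(
        row["answer_text"],
        max_length=target_max_len,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    labels = target["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100  # loss ignores padding
    return {
        "input_ids": source["input_ids"].flatten(),
        "attention_mask": source["attention_mask"].flatten(),
        "labels": labels.flatten(),
    }
```

Note that T5's pad token id is 0, which is why the transcript describes the raw labels as containing zeros before masking.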

Finally, a custom PyTorch Dataset class is wrapped in a PyTorch Lightning DataModule. The DataModule splits the DataFrame into training and validation sets, constructs dataset instances with the tokenizer and max lengths, and returns DataLoaders with a batch size of 8 and four worker processes. The setup is verified by instantiating the DataModule and running its setup method, preparing everything needed for the next phase: actually fine-tuning T5 on the BioASQ question answering task.

Cornell Notes

The dataset-building step for T5 QA turns BioASQ JSON files into training rows where each example is (question, context) → answer text. The script extracts answer spans by using each answer’s text plus its character start index inside the context, then stores question, context, answer text, and span indices in a Pandas DataFrame. A custom PyTorch Dataset tokenizes the source as a question-context pair and tokenizes the target as the answer, using max lengths (396 for source, 32 for target). Training labels come from the target token ids, but padding tokens are masked to -100 so loss ignores them. A PyTorch Lightning DataModule then splits into train/validation and provides DataLoaders for fine-tuning.

How does the transcript ensure the answer is correctly aligned inside the context text?

Each paragraph contains a context string and a list of questions, and each question contains answers with an answer start index. For every answer, the script records answer_start and computes answer_end as answer_start + len(answer_text). This span logic is later validated by a visualization function that prints the context with the answer characters highlighted (green) between answer_start and answer_end.

Why does T5 require a different data format than extractive QA models?

T5 is a sequence-to-sequence text-to-text model: it takes text in and generates text out. That means the input must be tokenized from the combined question and context, and the output must be tokenized from the answer text itself. The dataset therefore produces labels that correspond to the answer tokens, not to start/end positions as in classic span-extraction setups.

What exactly becomes the model’s “source” and “target” during tokenization?

The source is encoded from the question plus context pair using the T5 tokenizer with a max length of 396, truncating only the context when necessary. The target is encoded from the answer text with a max length of 32. The dataset returns input_ids and attention_mask for the source, and labels derived from the target input_ids.

Why are padding tokens converted to -100 in the labels?

The transcript notes that labels initially contain padding positions (shown as zeros after encoding). Before training, those padding positions are replaced with -100 so the loss computation ignores them. This prevents the model from being penalized for predictions on padded tokens that carry no semantic meaning.

How is the training/validation split handled for Lightning?

The transcript uses PyTorch Lightning utilities to split the DataFrame into training_df and validation_df. The DataModule’s setup method then constructs a BioASQ QA Dataset for each split, and the DataLoaders are returned with batch_size=8 and num_workers=4.

What scale checks are performed after extracting the BioASQ data?

After concatenating extracted rows across the training files, the script checks counts of unique questions and the number of examples. It reports about 443 different questions and roughly 2.5k (question, answer) examples when considering unique contexts, acknowledging that some questions can repeat across different contexts.
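A sketch of such a check with pandas (the helper name is mine, and the column names follow the extraction step above):

```python
import pandas as pd


def scale_check(df: pd.DataFrame) -> tuple:
    """Count distinct questions and distinct (question, context) pairs.

    On the combined BioASQ training data, the transcript reports roughly
    443 distinct questions and about 2.5k pairs.
    """
    num_questions = df["question"].nunique()
    num_pairs = len(df.drop_duplicates(subset=["question", "context"]))
    return num_questions, num_pairs
```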

Review Questions

  1. How would you modify the dataset if you wanted the model to handle multiple answers per question differently (e.g., choose one vs. generate all)?
  2. What changes would be needed if you wanted to truncate the question-context pair differently (e.g., truncate from the beginning of the context)?
  3. Why might using a fixed target max length of 32 harm performance for longer biomedical answers, and how could you address it?

Key Points

  1. Extract BioASQ QA JSON into a DataFrame by iterating paragraphs → questions → answers and using answer_start to compute answer_end inside the context.

  2. Represent each training example for T5 as (question, context) → answer text, since T5 is text-to-text rather than span-position based.

  3. Tokenize the source with a max length of 396 and truncate only the context to preserve the question as much as possible.

  4. Tokenize the target answer with a max length of 32 and build labels from target input_ids.

  5. Mask padding positions in labels by replacing them with -100 so loss ignores padding tokens.

  6. Wrap the tokenization and label logic in a custom PyTorch Dataset and expose train/validation DataLoaders via a PyTorch Lightning DataModule.

  7. Validate the extraction and tokenization by sampling rows and visually checking that the answer span matches the context indices.

Highlights

BioASQ answers are treated as spans inside the context using answer_start and answer_end, then converted into answer text targets for T5 generation.
T5 training labels are created from the target token ids, with padding masked to -100 to prevent loss on meaningless padding.
The source encoding uses question+context with max length 396, while the target answer uses max length 32—two separate length budgets.
A PyTorch Lightning DataModule cleanly separates dataset construction (tokenization/labels) from DataLoader batching for fine-tuning.

Topics

  • BioASQ Data Preparation
  • T5 Tokenization
  • Question Answering Dataset
  • PyTorch Lightning DataModule
  • Span Extraction

Mentioned

  • T5
  • QA
  • GPU
  • CPU
  • NLP
  • pl
  • PyTorch
  • T5Tokenizer