Create Custom Dataset for Question Answering with T5 using HuggingFace, Pytorch Lightning & PyTorch
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Fine-tuning T5 for question answering starts with turning BioASQ biomedical QA files into a model-ready dataset: each training example becomes a (question, context) input pair with the answer text extracted as a span from the context. The workflow builds a PyTorch Dataset that tokenizes the source as “question + context” and tokenizes the target as the answer, then prepares labels by masking padding tokens so loss is computed only on real answer tokens. This matters because T5 is a text-to-text, sequence-to-sequence model: the data must be shaped around its input/output format, not around classification-style targets.
The pipeline begins by setting up the environment for HuggingFace Transformers and PyTorch Lightning, including installing SentencePiece and specific library versions. After imports, it loads the BioASQ question answering training data from a downloaded and unzipped BioASQ zip (the example uses the “factoid” training split files). Each JSON document contains a title and a list of paragraphs; each paragraph includes a context string plus multiple question entries. For every question, the script iterates through the associated answers, capturing both the answer text and its character start index inside the context. The end index is computed from the start plus the answer text length, enabling the key property of the dataset: the answer is a substring of the context.
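A minimal sketch of that extraction loop, assuming the files follow the SQuAD-style layout described above (documents → paragraphs → qas → answers with `text` and `answer_start`); the function name and variable names are illustrative:

```python
import json
from pathlib import Path

import pandas as pd


def extract_questions_and_answers(factoid_path: Path) -> pd.DataFrame:
    """Flatten one BioASQ factoid JSON file into rows of (question, context, answer span)."""
    with factoid_path.open() as json_file:
        data = json.load(json_file)

    rows = []
    # SQuAD-style layout: data -> documents -> paragraphs -> qas -> answers
    for document in data["data"]:
        for paragraph in document["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                for answer in qa["answers"]:
                    answer_text = answer["text"]
                    answer_start = answer["answer_start"]
                    # Key property: the answer is a substring of the context
                    answer_end = answer_start + len(answer_text)
                    rows.append({
                        "question": question,
                        "context": context,
                        "answer_text": answer_text,
                        "answer_start": answer_start,
                        "answer_end": answer_end,
                    })
    return pd.DataFrame(rows)
```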
Those extracted fields are assembled into a Pandas DataFrame with columns for question, context, answer text, answer start, and answer end. The resulting dataset is then aggregated across the training files and checked for scale: the transcript reports roughly 443 unique questions and about 2.5k total (question, answer) examples when considering unique contexts. A quick visualization step uses colored terminal output to highlight the answer span within the context, confirming that the indexing aligns with the text.
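The span check can be reproduced with any terminal-coloring helper; a small sketch using `termcolor` (the helper name `highlight_answer` is illustrative):

```python
from termcolor import colored


def highlight_answer(row) -> str:
    """Return the context with the extracted answer span rendered in bold green."""
    start, end = row["answer_start"], row["answer_end"]
    context = row["context"]
    return (
        context[:start]
        + colored(context[start:end], "green", attrs=["bold"])
        + context[end:]
    )


# Usage: print(highlight_answer(df.iloc[0]))
```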
Next comes the tokenization logic tailored to T5. Using the HuggingFace T5Tokenizer loaded from the “t5-base” checkpoint, the script demonstrates how the tokenizer produces input_ids and an attention_mask for a sample question, and how it inserts special tokens to separate the two sequences. For training, it encodes the source (question plus context) with a maximum length of 396 tokens, truncating only the context when needed, and encodes the target answer with its own maximum length of 32 tokens. Labels are derived from the target input_ids, with padding positions replaced by -100 so the training loss ignores them.
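A sketch of that encoding step with the HuggingFace tokenizer; `truncation="only_second"` is what restricts truncation to the context, and the -100 masking follows the convention HuggingFace uses to exclude positions from the loss (here `row` stands for one row of the extracted DataFrame; variable names are illustrative):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Source: question + context, padded/truncated to 396 tokens
source_encoding = tokenizer(
    row["question"],
    row["context"],
    max_length=396,
    padding="max_length",
    truncation="only_second",   # truncate the context, never the question
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt",
)

# Target: the answer text, padded/truncated to 32 tokens
target_encoding = tokenizer(
    row["answer_text"],
    max_length=32,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt",
)

# Compute loss only on real answer tokens: mask padding positions with -100
labels = target_encoding["input_ids"].clone()
labels[labels == tokenizer.pad_token_id] = -100
```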
Finally, a custom PyTorch Dataset class is wrapped in a PyTorch Lightning DataModule. The DataModule splits the DataFrame into training and validation sets, constructs dataset instances with the tokenizer and max lengths, and returns DataLoaders with a batch size of 8 and four worker processes. The setup is verified by instantiating the DataModule and running its setup method, preparing everything needed for the next phase: actually fine-tuning T5 on the BioASQ question answering task.
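A condensed sketch of that Dataset/DataModule pairing under the settings above; the class names, the 5% validation fraction, and the use of sklearn's train_test_split are assumptions, since the transcript only states that the DataFrame is split into train and validation sets. Instantiating the module and calling `setup()` mirrors the verification step described here.

```python
import pandas as pd
import pytorch_lightning as pl
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset
from transformers import T5Tokenizer


class BioQADataset(Dataset):
    """Turns one DataFrame row into tokenized source/target tensors for T5."""

    def __init__(self, data: pd.DataFrame, tokenizer: T5Tokenizer,
                 source_max_len: int = 396, target_max_len: int = 32):
        self.data = data
        self.tokenizer = tokenizer
        self.source_max_len = source_max_len
        self.target_max_len = target_max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        source = self.tokenizer(
            row["question"], row["context"],
            max_length=self.source_max_len, padding="max_length",
            truncation="only_second", return_tensors="pt",
        )
        target = self.tokenizer(
            row["answer_text"],
            max_length=self.target_max_len, padding="max_length",
            truncation=True, return_tensors="pt",
        )
        labels = target["input_ids"].flatten()
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return dict(
            input_ids=source["input_ids"].flatten(),
            attention_mask=source["attention_mask"].flatten(),
            labels=labels,
        )


class BioQADataModule(pl.LightningDataModule):
    """Splits the DataFrame and serves train/validation DataLoaders."""

    def __init__(self, df: pd.DataFrame, tokenizer: T5Tokenizer, batch_size: int = 8):
        super().__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Assumed split ratio; the transcript does not specify one
        train_df, val_df = train_test_split(self.df, test_size=0.05)
        self.train_dataset = BioQADataset(train_df, self.tokenizer)
        self.val_dataset = BioQADataset(val_df, self.tokenizer)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size,
                          shuffle=True, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, num_workers=4)
```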
Cornell Notes
The dataset-building step for T5 QA turns BioASQ JSON files into training rows where each example is (question, context) → answer text. The script extracts answer spans by using each answer’s text plus its character start index inside the context, then stores question, context, answer text, and span indices in a Pandas DataFrame. A custom PyTorch Dataset tokenizes the source as a question-context pair and tokenizes the target as the answer, using max lengths (396 for source, 32 for target). Training labels come from the target token ids, but padding tokens are masked to -100 so loss ignores them. A PyTorch Lightning DataModule then splits into train/validation and provides DataLoaders for fine-tuning.
- How does the transcript ensure the answer is correctly aligned inside the context text?
- Why does T5 require a different data format than extractive QA models?
- What exactly becomes the model’s “source” and “target” during tokenization?
- Why are padding tokens converted to -100 in the labels?
- How is the training/validation split handled for Lightning?
- What scale checks are performed after extracting the BioASQ data?
Review Questions
- How would you modify the dataset if you wanted the model to handle multiple answers per question differently (e.g., choose one vs. generate all)?
- What changes would be needed if you wanted to truncate the question-context pair differently (e.g., truncate from the beginning of the context)?
- Why might using a fixed target max length of 32 harm performance for longer biomedical answers, and how could you address it?
Key Points
1. Extract BioASQ QA JSON into a DataFrame by iterating paragraphs → questions → answers and using answer_start to compute answer_end inside the context.
2. Represent each training example for T5 as (question, context) → answer text, since T5 is text-to-text rather than span-position based.
3. Tokenize the source with a max length of 396 and truncate only the context to preserve the question as much as possible.
4. Tokenize the target answer with a max length of 32 and build labels from target input_ids.
5. Mask padding positions in labels by replacing them with -100 so loss ignores padding tokens.
6. Wrap the tokenization and label logic in a custom PyTorch Dataset and expose train/validation DataLoaders via a PyTorch Lightning DataModule.
7. Validate the extraction and tokenization by sampling rows and visually checking that the answer span matches the context indices.