Intent Recognition with BERT using Keras and TensorFlow 2 in Python | Text Classification Tutorial
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Fine-tuning a pre-trained BERT model for intent recognition on a seven-class dataset can deliver near-saturating accuracy—about 97% on a held-out test set—using a straightforward TensorFlow 2 + Keras training pipeline. The workflow pairs BERT’s built-in tokenizer with a custom preprocessing layer that converts short user queries into fixed-length token ID sequences, then trains a small classification head on top of the BERT encoder outputs.
The tutorial starts by preparing data for a classic intent recognition setup: each input is a short text (a user query) and each output is one of seven intent labels (multi-intent classification is mentioned as a possibility, but the training here is single-label). The dataset comes from a GitHub repository accompanying a paper on an embedded spoken language understanding system; it is stored as JSON and converted into CSV for easier handling. The original training and validation splits are merged, and a fresh validation set is carved out later via a validation split during model fitting. With roughly 14,000 examples across seven intents, the class distribution is checked with a bar chart; the counts are described as roughly balanced, reducing the need for imbalance mitigation.
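A minimal sketch of this preparation step with pandas, assuming the splits live in local CSV files and the label column is named `intent` (both the file names and the column name are illustrative, not taken from the tutorial):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file names; the tutorial's actual paths may differ.
train = pd.read_csv("train.csv")
valid = pd.read_csv("valid.csv")
test = pd.read_csv("test.csv")

# Merge train and validation; a fresh validation split is created later by model.fit().
train = pd.concat([train, valid]).reset_index(drop=True)

# Quick class-balance check across the seven intents ("intent" column name is assumed).
print(train["intent"].value_counts())
train["intent"].value_counts().plot(kind="bar")
plt.show()
```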
The core modeling step uses BERT “base” (the uncased variant with 12 encoder layers). Because the original BERT implementation isn’t compatible with TensorFlow 2 as-is, the pipeline relies on a TensorFlow 2-compatible BERT package that can load pre-trained checkpoints. The model weights are downloaded from the original Google research links, then the checkpoint, vocabulary, and config JSON are wired into the BERT layer.
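A plausible loading step, assuming the TensorFlow 2-compatible package is bert-for-tf2 and the checkpoint directory follows Google's uncased BERT-base release; both the package choice and the paths are assumptions, not confirmed details of the tutorial:

```python
import os
import bert  # bert-for-tf2 (assumed package choice)
from bert import BertModelLayer

# Directory holding the downloaded checkpoint, vocab.txt and bert_config.json
# (directory name mirrors Google's uncased BERT-base release; path is illustrative).
bert_model_dir = "uncased_L-12_H-768_A-12"
bert_ckpt = os.path.join(bert_model_dir, "bert_model.ckpt")

# Build a Keras-compatible BERT layer from the pre-trained configuration.
bert_params = bert.params_from_pretrained_ckpt(bert_model_dir)
bert_layer = BertModelLayer.from_params(bert_params, name="bert")
# The checkpoint weights are loaded into this layer after the full model is built,
# e.g. with bert.load_stock_weights(bert_layer, bert_ckpt).
```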
A custom preprocessing class handles tokenization and sequence shaping. It uses BERT’s tokenizer to add special tokens ([CLS] and [SEP]), converts tokens to integer IDs, and tracks the maximum sequence length observed across training and test data. Sequences are then padded with zeros to a fixed length so Keras can batch them efficiently. Labels are encoded as integers from 0 to 6, aligning with sparse categorical cross-entropy.
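A sketch of that preprocessing, assuming the bert-for-tf2 tokenizer and a hypothetical helper (`encode_texts` is our own illustration, not the tutorial's exact class; it takes the max length as an argument rather than tracking it from the data):

```python
import numpy as np
from bert.tokenization.bert_tokenization import FullTokenizer

# Tokenizer built from the downloaded WordPiece vocabulary (path is illustrative).
tokenizer = FullTokenizer(vocab_file="uncased_L-12_H-768_A-12/vocab.txt")

def encode_texts(texts, max_seq_len):
    """Wrap each text with [CLS]/[SEP], convert tokens to IDs, zero-pad to a fixed length."""
    all_ids = []
    for text in texts:
        tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
        ids = tokenizer.convert_tokens_to_ids(tokens)
        ids = ids[:max_seq_len]                      # truncate anything too long
        ids = ids + [0] * (max_seq_len - len(ids))   # pad the rest with zeros
        all_ids.append(ids)
    return np.array(all_ids)

# Labels become integers 0..6, matching sparse categorical cross-entropy.
intent_names = sorted(train["intent"].unique())
label_to_id = {name: i for i, name in enumerate(intent_names)}
y_train = train["intent"].map(label_to_id).values
```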
On the modeling side, the BERT layer produces a tensor of hidden states; the pipeline extracts the first token representation (the [CLS] position) and feeds it through a compact classification head: dropout, a dense layer with ReLU, another dropout, and a final dense softmax layer sized to the seven intents. The network is compiled with Adam using a very small learning rate (appropriate for fine-tuning), sparse categorical cross-entropy loss, and sparse categorical accuracy.
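A Keras sketch of that head on top of the `bert_layer` built above; the dropout rates, dense width, exact learning rate, and placeholder `max_seq_len` are assumptions consistent with the description, not values quoted from the tutorial:

```python
from tensorflow import keras

max_seq_len = 38  # the tutorial computes this from the data; 38 is only a placeholder

# Input is a batch of fixed-length token ID sequences.
input_ids = keras.layers.Input(shape=(max_seq_len,), dtype="int32", name="input_ids")

# BERT emits one hidden vector per token: shape (batch, max_seq_len, hidden_size).
bert_output = bert_layer(input_ids)

# Keep only the [CLS] position (index 0) as the sequence representation.
cls_output = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)

x = keras.layers.Dropout(0.5)(cls_output)
x = keras.layers.Dense(768, activation="relu")(x)
x = keras.layers.Dropout(0.5)(x)
probs = keras.layers.Dense(7, activation="softmax")(x)  # seven intent classes

model = keras.Model(inputs=input_ids, outputs=probs)
model.build(input_shape=(None, max_seq_len))
# With bert-for-tf2, the pre-trained weights would now be loaded via load_stock_weights.

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # very small LR for fine-tuning
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
```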
Training runs for five epochs with shuffling and a 10% validation split, while TensorBoard logs track progress. Results are reported as extremely strong: validation accuracy rises above 97% after the first epoch and reaches about 98.5% by the end, with signs of overfitting in the later epochs. Evaluation on the test set lands around 97% accuracy. A classification report (precision, recall, and F1 per intent) is described as uniformly high, and a confusion matrix is referenced.
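Under those settings, the fit call might look like the sketch below; the batch size and the TensorBoard log directory are assumptions, and `X_train`/`y_train` come from the preprocessing sketch above:

```python
import datetime
from tensorflow import keras

# TensorBoard logging directory (path format is an assumption).
log_dir = "logs/intent_recognition/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_cb = keras.callbacks.TensorBoard(log_dir=log_dir)

history = model.fit(
    x=X_train,             # padded token ID sequences from the preprocessing step
    y=y_train,             # integer intent labels 0..6
    validation_split=0.1,  # re-creates a validation set from the merged data
    batch_size=16,         # batch size is an assumption, not stated in the summary
    epochs=5,
    shuffle=True,
    callbacks=[tensorboard_cb],
)
```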
Finally, the trained model is used for inference by tokenizing new sentences, padding them to the same max length, running the model to get softmax probabilities, and mapping the predicted integer label back to the intent name. The tutorial closes by pointing to Hugging Face Transformers as a broader ecosystem for transformer models in both PyTorch and TensorFlow.
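A minimal inference sketch reusing the helpers defined above; the example sentences and helper names are illustrative:

```python
import numpy as np

# Example queries (illustrative only).
sentences = [
    "Play the latest album by Queen",
    "Will it rain in Berlin tomorrow?",
]

# Repeat the exact training-time preprocessing: tokenize, add special tokens,
# convert to IDs, and zero-pad to the same max sequence length.
X_new = encode_texts(sentences, max_seq_len)

probs = model.predict(X_new)          # softmax probabilities over the 7 intents
pred_ids = np.argmax(probs, axis=-1)  # predicted integer labels

# Map the predicted integers back to the intent names used during training.
id_to_label = {i: name for name, i in label_to_id.items()}
for sentence, pred in zip(sentences, pred_ids):
    print(sentence, "->", id_to_label[pred])
```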
Cornell Notes
The pipeline fine-tunes BERT (uncased “base”, 12-layer) for seven-class intent recognition in TensorFlow 2/Keras. Text inputs are tokenized with BERT’s tokenizer, wrapped with [CLS] and [SEP], converted to token IDs, and padded/truncated to a fixed max sequence length so batches can train efficiently. A small classification head sits on top of the BERT [CLS] embedding: dropout → dense(ReLU) → dropout → dense(softmax over 7 intents). Training with Adam at a very small learning rate and sparse categorical cross-entropy yields about 97% test accuracy, with very high precision/recall across intents. The approach also demonstrates how to run inference on new user queries by repeating the same tokenization and padding steps.
How does the tutorial turn raw intent text into model-ready inputs for BERT?
Why is padding/truncation necessary in this Keras setup?
What architecture sits on top of BERT for intent classification?
What training choices are used to fine-tune BERT effectively?
What performance results are reported, and what do they imply?
How is inference performed on new sentences after training?
Review Questions
- What specific preprocessing steps (special tokens, token-to-ID conversion, padding/truncation) must be repeated at inference time to keep predictions consistent with training?
- Why does the model use sparse categorical cross-entropy instead of categorical cross-entropy, and how does that choice relate to the label format (integers vs one-hot)?
- Which part of BERT’s output is used for classification in this pipeline, and how does that choice affect the design of the classification head?
Key Points
1. The dataset is converted into CSV and used for a seven-intent single-label classification task, with class balance checked before training.
2. BERT’s tokenizer is used directly, adding [CLS] and [SEP], converting tokens to IDs, and padding sequences with zeros to a fixed max length.
3. A custom preprocessing class computes the effective max sequence length from both training and test data, then enforces fixed-length inputs for batching.
4. The model fine-tunes BERT base (uncased, 12 encoders) and adds a lightweight head on the [CLS] embedding: dropout, dense(ReLU), dropout, and a 7-way softmax.
5. Training uses Adam with a very small learning rate, sparse categorical cross-entropy loss, and sparse categorical accuracy, running for five epochs with a 10% validation split.
6. Reported results are about 98.5% validation accuracy and roughly 97% test accuracy, with signs of overfitting after early epochs.
7. Inference repeats the same tokenization and padding steps, then maps the argmax softmax output back to the intent label.