Intent Recognition with BERT using Keras and TensorFlow 2 in Python | Text Classification Tutorial
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Fine-tuning a pre-trained BERT model for intent recognition on a seven-class dataset can deliver near-saturating accuracy—about 97% on a held-out test set—using a straightforward TensorFlow 2 + Keras training pipeline. The workflow pairs BERT’s built-in tokenizer with a custom preprocessing layer that converts short user queries into fixed-length token ID sequences, then trains a small classification head on top of the BERT encoder outputs.
The tutorial starts by preparing data for a classic intent recognition setup: each input is a short text (a user query) and each output is one of seven intent labels (multi-intent classification is mentioned as a possibility, but the training here is single-label). The dataset comes from a GitHub repository accompanying a paper on an embedded spoken language understanding system; it is stored as JSON and converted into CSV for easier handling. The original training and validation splits are merged, and a fresh validation set is carved out later via a validation split during model fitting. With roughly 14,000 examples across seven intents, the class distribution is checked with a bar chart; the counts are described as roughly balanced, reducing the need for imbalance mitigation.
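A minimal sketch of this preparation step with pandas, assuming the splits live in local CSV files and the label column is named `intent` (both the file names and the column name are illustrative, not taken from the tutorial):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file names; the tutorial's actual paths may differ.
train = pd.read_csv("train.csv")
valid = pd.read_csv("valid.csv")
test = pd.read_csv("test.csv")

# Merge train and validation; a fresh validation split is created later by model.fit().
train = pd.concat([train, valid]).reset_index(drop=True)

# Quick class-balance check across the seven intents ("intent" column name is assumed).
print(train["intent"].value_counts())
train["intent"].value_counts().plot(kind="bar")
plt.show()
```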
The core modeling step uses BERT “base” (the uncased variant with 12 encoder layers). Because the original BERT implementation isn’t compatible with TensorFlow 2 as-is, the pipeline relies on a TensorFlow 2-compatible BERT package that can load pre-trained checkpoints. The model weights are downloaded from the original Google research links, then the checkpoint, vocabulary, and config JSON are wired into the BERT layer.
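A plausible loading step, assuming the TensorFlow 2-compatible package is bert-for-tf2 and the checkpoint directory follows Google's uncased BERT-base release; both the package choice and the paths are assumptions, not confirmed details of the tutorial:

```python
import os
import bert  # bert-for-tf2 (assumed package choice)
from bert import BertModelLayer

# Directory holding the downloaded checkpoint, vocab.txt and bert_config.json
# (directory name mirrors Google's uncased BERT-base release; path is illustrative).
bert_model_dir = "uncased_L-12_H-768_A-12"
bert_ckpt = os.path.join(bert_model_dir, "bert_model.ckpt")

# Build a Keras-compatible BERT layer from the pre-trained configuration.
bert_params = bert.params_from_pretrained_ckpt(bert_model_dir)
bert_layer = BertModelLayer.from_params(bert_params, name="bert")
# The checkpoint weights are loaded into this layer after the full model is built,
# e.g. with bert.load_stock_weights(bert_layer, bert_ckpt).
```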
A custom preprocessing class handles tokenization and sequence shaping. It uses BERT’s tokenizer to add special tokens ([CLS] and [SEP]), converts tokens to integer IDs, and tracks the maximum sequence length observed across training and test data. Sequences are then padded with zeros to a fixed length so Keras can batch them efficiently. Labels are encoded as integers from 0 to 6, aligning with sparse categorical cross-entropy.
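A sketch of that preprocessing, assuming the bert-for-tf2 tokenizer and a hypothetical helper (`encode_texts` is our own illustration, not the tutorial's exact class; it takes the max length as an argument rather than tracking it from the data):

```python
import numpy as np
from bert.tokenization.bert_tokenization import FullTokenizer

# Tokenizer built from the downloaded WordPiece vocabulary (path is illustrative).
tokenizer = FullTokenizer(vocab_file="uncased_L-12_H-768_A-12/vocab.txt")

def encode_texts(texts, max_seq_len):
    """Wrap each text with [CLS]/[SEP], convert tokens to IDs, zero-pad to a fixed length."""
    all_ids = []
    for text in texts:
        tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
        ids = tokenizer.convert_tokens_to_ids(tokens)
        ids = ids[:max_seq_len]                      # truncate anything too long
        ids = ids + [0] * (max_seq_len - len(ids))   # pad the rest with zeros
        all_ids.append(ids)
    return np.array(all_ids)

# Labels become integers 0..6, matching sparse categorical cross-entropy.
intent_names = sorted(train["intent"].unique())
label_to_id = {name: i for i, name in enumerate(intent_names)}
y_train = train["intent"].map(label_to_id).values
```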
On the modeling side, the BERT layer produces a tensor of hidden states; the pipeline extracts the first token representation (the [CLS] position) and feeds it through a compact classification head: dropout, a dense layer with ReLU, another dropout, and a final dense softmax layer sized to the seven intents. The network is compiled with Adam using a very small learning rate (appropriate for fine-tuning), sparse categorical cross-entropy loss, and sparse categorical accuracy.
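A Keras sketch of that head on top of the `bert_layer` built above; the dropout rates, dense width, exact learning rate, and placeholder `max_seq_len` are assumptions consistent with the description, not values quoted from the tutorial:

```python
from tensorflow import keras

max_seq_len = 38  # the tutorial computes this from the data; 38 is only a placeholder

# Input is a batch of fixed-length token ID sequences.
input_ids = keras.layers.Input(shape=(max_seq_len,), dtype="int32", name="input_ids")

# BERT emits one hidden vector per token: shape (batch, max_seq_len, hidden_size).
bert_output = bert_layer(input_ids)

# Keep only the [CLS] position (index 0) as the sequence representation.
cls_output = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)

x = keras.layers.Dropout(0.5)(cls_output)
x = keras.layers.Dense(768, activation="relu")(x)
x = keras.layers.Dropout(0.5)(x)
probs = keras.layers.Dense(7, activation="softmax")(x)  # seven intent classes

model = keras.Model(inputs=input_ids, outputs=probs)
model.build(input_shape=(None, max_seq_len))
# With bert-for-tf2, the pre-trained weights would now be loaded via load_stock_weights.

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # very small LR for fine-tuning
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
```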
Training runs for five epochs with shuffling and a 10% validation split, while TensorBoard logs track progress. Results are reported as extremely strong: validation accuracy rises above 97% after the first epoch and reaches about 98.5% by the end, with signs of overfitting in the later epochs. Evaluation on the test set lands around 97% accuracy. A classification report (precision, recall, and F1 per intent) is described as uniformly high, and a confusion matrix is referenced.
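Under those settings, the fit call might look like the sketch below; the batch size and the TensorBoard log directory are assumptions, and `X_train`/`y_train` come from the preprocessing sketch above:

```python
import datetime
from tensorflow import keras

# TensorBoard logging directory (path format is an assumption).
log_dir = "logs/intent_recognition/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_cb = keras.callbacks.TensorBoard(log_dir=log_dir)

history = model.fit(
    x=X_train,             # padded token ID sequences from the preprocessing step
    y=y_train,             # integer intent labels 0..6
    validation_split=0.1,  # re-creates a validation set from the merged data
    batch_size=16,         # batch size is an assumption, not stated in the summary
    epochs=5,
    shuffle=True,
    callbacks=[tensorboard_cb],
)
```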
Finally, the trained model is used for inference by tokenizing new sentences, padding them to the same max length, running the model to get softmax probabilities, and mapping the predicted integer label back to the intent name. The tutorial closes by pointing to Hugging Face Transformers as a broader ecosystem for transformer models in both PyTorch and TensorFlow.
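A minimal inference sketch reusing the helpers defined above; the example sentences and helper names are illustrative:

```python
import numpy as np

# Example queries (illustrative only).
sentences = [
    "Play the latest album by Queen",
    "Will it rain in Berlin tomorrow?",
]

# Repeat the exact training-time preprocessing: tokenize, add special tokens,
# convert to IDs, and zero-pad to the same max sequence length.
X_new = encode_texts(sentences, max_seq_len)

probs = model.predict(X_new)          # softmax probabilities over the 7 intents
pred_ids = np.argmax(probs, axis=-1)  # predicted integer labels

# Map the predicted integers back to the intent names used during training.
id_to_label = {i: name for name, i in label_to_id.items()}
for sentence, pred in zip(sentences, pred_ids):
    print(sentence, "->", id_to_label[pred])
```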
Cornell Notes
The pipeline fine-tunes BERT (uncased “base”, 12-layer) for seven-class intent recognition in TensorFlow 2/Keras. Text inputs are tokenized with BERT’s tokenizer, wrapped with [CLS] and [SEP], converted to token IDs, and padded/truncated to a fixed max sequence length so batches can train efficiently. A small classification head sits on top of the BERT [CLS] embedding: dropout → dense(ReLU) → dropout → dense(softmax over 7 intents). Training with Adam at a very small learning rate and sparse categorical cross-entropy yields about 97% test accuracy, with very high precision/recall across intents. The approach also demonstrates how to run inference on new user queries by repeating the same tokenization and padding steps.
How does the tutorial turn raw intent text into model-ready inputs for BERT?
Why is padding/truncation necessary in this Keras setup?
What architecture sits on top of BERT for intent classification?
What training choices are used to fine-tune BERT effectively?
What performance results are reported, and what do they imply?
How is inference performed on new sentences after training?
Review Questions
- What specific preprocessing steps (special tokens, token-to-ID conversion, padding/truncation) must be repeated at inference time to keep predictions consistent with training?
- Why does the model use sparse categorical cross-entropy instead of categorical cross-entropy, and how does that choice relate to the label format (integers vs one-hot)?
- Which part of BERT’s output is used for classification in this pipeline, and how does that choice affect the design of the classification head?
Key Points
1. The dataset is converted into CSV and used for a seven-intent single-label classification task, with class balance checked before training.
2. BERT’s tokenizer is used directly, adding [CLS] and [SEP], converting tokens to IDs, and padding sequences with zeros to a fixed max length.
3. A custom preprocessing class computes the effective max sequence length from both training and test data, then enforces fixed-length inputs for batching.
4. The model fine-tunes BERT base (uncased, 12 encoders) and adds a lightweight head on the [CLS] embedding: dropout, dense(ReLU), dropout, and a 7-way softmax.
5. Training uses Adam with a very small learning rate, sparse categorical cross-entropy loss, and sparse categorical accuracy, running for five epochs with a 10% validation split.
6. Reported results are about 98.5% validation accuracy and roughly 97% test accuracy, with signs of overfitting after early epochs.
7. Inference repeats the same tokenization and padding steps, then maps the argmax softmax output back to the intent label.