
LSTM | Part 3 | Next Word Predictor Using | CampusX

CampusX
5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Reframe next-word prediction as supervised learning by generating many prefix-to-next-word training pairs from each sentence.

Briefing

A next word predictor can be built as a text generator, but it becomes much easier to train when the problem is reframed as supervised learning: turn each sentence into many “input → next word” pairs, then train an LSTM to predict the next token. The practical payoff is straightforward—once the model learns these word-to-word transitions from text, it can take a partial sequence and output the most likely next word, enabling features like keyboard suggestions and Smart Compose-style writing assistance.
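
As a concrete illustration of that payoff, the sketch below shows what inference could look like once such a model and tokenizer exist. The `predict_next_word` helper and its arguments are hypothetical placeholders standing in for objects built later in the pipeline, not code from the video.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_next_word(model, tokenizer, prompt, max_len):
    """Return the most likely next word for a partial sentence."""
    token_ids = tokenizer.texts_to_sequences([prompt])[0]        # text -> integer IDs
    padded = pad_sequences([token_ids], maxlen=max_len, padding="pre")
    probs = model.predict(padded, verbose=0)                     # shape: (1, vocab_size)
    next_id = int(np.argmax(probs, axis=-1)[0])
    # Reverse-lookup the predicted ID in the tokenizer's word-to-index mapping.
    for word, idx in tokenizer.word_index.items():
        if idx == next_id:
            return word
    return None
```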

The core workflow starts by converting raw text into a dataset suitable for supervised training. For every sentence, the system slides a window across the words. Using the example “Hi my name is Nitish,” the training pairs become: input “Hi” → output “my,” input “Hi my” → output “name,” input “Hi my name” → output “is,” and input “Hi my name is” → output “Nitish.” The same process repeats for every sentence in the corpus, producing a large collection of sequences where the input is a prefix and the output is the next word.
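
In plain Python, the sliding-window idea for a single sentence could look like the following sketch; the variable names are illustrative, not taken from the video's notebook.

```python
# Sliding-window pair generation for one sentence (word level).
sentence = "Hi my name is Nitish"
words = sentence.split()

pairs = []
for i in range(1, len(words)):
    prefix = " ".join(words[:i])   # input: all words up to position i
    target = words[i]              # output: the word that follows the prefix
    pairs.append((prefix, target))

for prefix, target in pairs:
    print(prefix, "->", target)
# Hi -> my
# Hi my -> name
# Hi my name -> is
# Hi my name is -> Nitish
```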

The next major step is making the data numeric, because neural networks operate on numbers rather than raw English tokens. A Keras Tokenizer assigns an integer index to each unique word in the dataset. After fitting the tokenizer on the text, the model can transform any sentence into a sequence of token IDs. For instance, if “Hi” maps to 1, “My” to 2, “Name” to 3, “Is” to 4, and “Nitish” to 5, then the sentence “Hi My Name Is Nitish” becomes [1, 2, 3, 4, 5]. From there, training pairs can be generated directly from these integer sequences.
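
A minimal sketch of this step with the Keras Tokenizer might look like the following, using the same toy sentence. Note that the Tokenizer lowercases words by default, so the exact mapping shown in the comments is illustrative.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["Hi my name is Nitish"]   # fit_on_texts expects a list of strings

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)      # learn the vocabulary and assign integer IDs

print(tokenizer.word_index)
# e.g. {'hi': 1, 'my': 2, 'name': 3, 'is': 4, 'nitish': 5}

print(tokenizer.texts_to_sequences(["Hi my name is Nitish"]))
# [[1, 2, 3, 4, 5]]
```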

With the text-to-integers conversion in place, the remaining work is to construct and train an LSTM architecture on the resulting input-output pairs. The LSTM’s job is to learn patterns in sequences—capturing how earlier words influence which word is likely to come next. The transcript also notes that this approach has real-world precedent: SwiftKey’s early next-word system relied on LSTM, even though later versions moved toward more complex models.
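
One plausible way to assemble such an architecture in Keras is sketched below; the vocabulary size, sequence length, and layer widths are placeholder values, not the exact ones used in the video.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = 300   # placeholder: number of unique words + 1 (index 0 reserved for padding)
max_len = 20       # placeholder: length of the padded input prefixes

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, 100),               # word IDs -> dense vectors
    LSTM(150),                                # summarizes the prefix into one state
    Dense(vocab_size, activation="softmax"),  # probability of each possible next word
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```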

For demonstration, the plan uses a small FAQ-derived dataset copied into a Google Colab notebook, chosen because live coding with a large dataset would be impractical. The method still scales conceptually: better performance typically requires more text so the model can learn broader language patterns. The immediate focus, however, is on the mechanics—tokenizing the vocabulary, splitting text into sentences, converting each sentence into sequences of token IDs, and then preparing the supervised learning dataset that an LSTM can train on.
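
A hedged sketch of that final preparation step is shown below. Padding the prefixes to a common length and one-hot encoding the targets are standard Keras conventions assumed here, rather than details spelled out in the summary above.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

vocab_size = 6   # toy value: 5 distinct words plus index 0 reserved for padding

# Each row is a prefix plus the word that follows it, as integer IDs.
sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]]

max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="pre")

X = padded[:, :-1]                                          # input: the padded prefix
y = to_categorical(padded[:, -1], num_classes=vocab_size)   # target: one-hot next word
```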

Cornell Notes

The transcript frames next-word prediction as supervised learning by converting text generation into many “prefix → next word” training examples. Each sentence is split into word prefixes: for “Hi my name is Nitish,” the inputs are “Hi,” “Hi my,” “Hi my name,” and “Hi my name is,” with outputs “my,” “name,” “is,” and “Nitish.” Because neural networks require numbers, a Keras Tokenizer assigns an integer ID to every unique word and converts sentences into sequences of token IDs. Once the text is tokenized and structured into input-output pairs, an LSTM can be trained to predict the next token given a prefix. This approach mirrors the logic behind real typing and writing assistants, including early SwiftKey models that used LSTM.

How does next-word prediction become a supervised learning problem?

Instead of generating text token-by-token directly, the method creates training pairs from each sentence. For every position in a sentence, the prefix up to that point becomes the input, and the next word becomes the output. Example: “Hi my” → “name,” “Hi my name” → “is,” and “Hi my name is” → “Nitish.” Training on many such pairs teaches the model to map a sequence prefix to the most likely next token.

Why must text be converted into numbers before training an LSTM?

Models like LSTMs operate on numeric tensors, not raw strings. A Keras Tokenizer assigns each unique word an integer index. After fitting the tokenizer on the dataset text, sentences are converted using texts_to_sequences into arrays of token IDs (e.g., “Hi My Name Is Nitish” → [1, 2, 3, 4, 5]). These integer sequences then form the basis for input-output pairs.

What does the Keras Tokenizer do in this pipeline?

The Tokenizer is used in two key steps: (1) fit_on_texts learns the vocabulary and assigns token IDs to each unique word; (2) texts_to_sequences converts each sentence into a sequence of those IDs. The transcript notes that fit_on_texts expects text in a list, and the tokenizer’s internal word-to-index mapping can be inspected to see which number was assigned to each word.

How are sentences turned into training examples after tokenization?

After splitting the dataset into sentences (by line breaks/new lines), each sentence is converted into token IDs. Then the training examples are created by taking prefixes of the token sequence as inputs and the subsequent token as the output. For a token sequence [w1, w2, w3, w4, w5], the pairs are ([w1]→w2), ([w1,w2]→w3), ([w1,w2,w3]→w4), and ([w1,w2,w3,w4]→w5).
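
A small sketch of this step is given below, assuming the corpus is split on new lines and converted with texts_to_sequences; the toy text and variable names are illustrative.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

text = "Hi my name is Nitish\nWhat is your name"

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

input_sequences = []
for line in text.split("\n"):                          # one sentence per line
    token_ids = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_ids)):
        input_sequences.append(token_ids[: i + 1])     # prefix followed by its next token
# The last ID of each entry is the target; the IDs before it form the input prefix.
```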

Why use a small dataset for a live demonstration?

Large text corpora can make the model and preprocessing steps heavier and slower, which is impractical for live coding. The transcript describes using a smaller FAQ-derived dataset copied into Google Colab to keep the workflow manageable while still demonstrating the full pipeline. For better accuracy, a larger dataset would be needed.

Review Questions

  1. When converting a sentence into training pairs, what exactly counts as the input and what counts as the output?
  2. What two Tokenizer methods are used, and what does each one accomplish in the preprocessing pipeline?
  3. Given a tokenized sentence sequence like [1,2,3,4,5], what are the input-output pairs used to train the next-word predictor?

Key Points

  1. Reframe next-word prediction as supervised learning by generating many prefix-to-next-word training pairs from each sentence.

  2. Use a sliding window over each sentence so every word position contributes an input sequence and a single next-word target.

  3. Convert words to integers with Keras Tokenizer so the LSTM receives numeric sequences rather than text strings.

  4. Fit the tokenizer on the full dataset text first, then convert each sentence into token ID sequences using texts_to_sequences.

  5. Split the raw text into sentences (e.g., by new lines) before building the prefix/next-token pairs.

  6. Train an LSTM on the resulting input-output dataset so it learns sequence patterns that determine the next likely token.

  7. For demonstration, a smaller dataset can be used for feasibility, but larger corpora typically improve model quality.

Highlights

Next-word prediction becomes trainable by turning each sentence into many supervised examples: every prefix maps to the next word.
Keras Tokenizer provides the bridge from language to numbers by assigning each unique word an integer ID and converting sentences into token ID sequences.
The approach scales from a small FAQ dataset to larger corpora, with accuracy generally improving as more text is used.
Early SwiftKey next-word systems relied on LSTM, showing the method’s real-world lineage.
