LSTM | Part 3 | Next Word Predictor Using LSTM | CampusX
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A next word predictor can be built as a text generator, but it becomes much easier to train when the problem is reframed as supervised learning: turn each sentence into many “input → next word” pairs, then train an LSTM to predict the next token. The practical payoff is straightforward—once the model learns these word-to-word transitions from text, it can take a partial sequence and output the most likely next word, enabling features like keyboard suggestions and Smart Compose-style writing assistance.
The core workflow starts by converting raw text into a dataset suitable for supervised training. For every sentence, the system slides a window across the words. Using the example “Hi, my name is Nitish,” the training pairs become: input “Hi” → output “my,” input “Hi my” → output “name,” input “Hi my name” → output “is,” and input “Hi my name is” → output “Nitish.” The same process repeats for every sentence in the corpus, producing a large collection of sequences where the input is a prefix and the output is the next word.
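The sliding-window pairing described above can be sketched in a few lines of Python (the helper name and the whitespace-based word split are illustrative choices, not from the video):

```python
def make_pairs(sentence):
    """Turn one sentence into (prefix, next word) training pairs."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        prefix = " ".join(words[:i])   # input: everything seen so far
        target = words[i]              # output: the very next word
        pairs.append((prefix, target))
    return pairs

pairs = make_pairs("Hi my name is Nitish")
# First pair: ("Hi", "my"); last pair: ("Hi my name is", "Nitish")
```

A five-word sentence yields four pairs, so even a small corpus multiplies into many training examples.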
The next major step is making the data numeric, because neural networks operate on numbers rather than raw English tokens. A Keras Tokenizer assigns an integer index to each unique word in the dataset. After fitting the tokenizer on the text, the model can transform any sentence into a sequence of token IDs. For instance, if “Hi” maps to 1, “My” to 2, “Name” to 3, “Is” to 4, and “Nitish” to 5, then the sentence “Hi My Name Is Nitish” becomes [1, 2, 3, 4, 5]. From there, training pairs can be generated directly from these integer sequences.
With the text-to-integers conversion in place, the remaining work is to construct and train an LSTM architecture on the resulting input-output pairs. The LSTM’s job is to learn patterns in sequences—capturing how earlier words influence which word is likely to come next. The transcript also notes that this approach has real-world precedent: SwiftKey’s early next-word system relied on LSTM, even though later versions moved toward more complex models.
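A plausible Keras architecture for this setup is an Embedding layer feeding an LSTM, with a softmax over the vocabulary. The layer sizes, vocabulary size, and sequence length below are assumptions for illustration; the video's exact values may differ:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 283 + 1  # hypothetical vocabulary size (+1 because ID 0 is reserved for padding)
max_len = 17          # hypothetical length of the longest padded input prefix

model = Sequential([
    Embedding(vocab_size, 100),               # token IDs -> 100-dim dense vectors
    LSTM(150),                                # reads the prefix left to right into a fixed-size state
    Dense(vocab_size, activation="softmax"),  # probability of each word being the next one
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# One forward pass on dummy data: a batch of 2 padded prefixes.
probs = model(np.zeros((2, max_len), dtype="int32"))
print(probs.shape)  # one probability distribution over the vocabulary per prefix
```

At inference time, `argmax` over the output distribution gives the predicted next-word ID, which the tokenizer's index maps back to a word.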
For demonstration, the plan uses a small FAQ-derived dataset copied into a Google Colab notebook, chosen because live coding with a large dataset would be impractical. The method still scales conceptually: better performance typically requires more text so the model can learn broader language patterns. The immediate focus, however, is on the mechanics—tokenizing the vocabulary, splitting text into sentences, converting each sentence into sequences of token IDs, and then preparing the supervised learning dataset that an LSTM can train on.
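Putting those preprocessing steps together, here is a pure-Python sketch of the pipeline (Keras's `pad_sequences` would normally handle the left-padding; the two-line corpus is a made-up stand-in for the FAQ text):

```python
text = "Hi my name is Nitish\nWhat is the course fee"
sentences = text.split("\n")  # split the raw text into sentences by newline

# Build the word -> ID vocabulary (IDs start at 1; 0 is kept for padding).
word_index = {}
for s in sentences:
    for w in s.lower().split():
        word_index.setdefault(w, len(word_index) + 1)

# Every prefix of length >= 2 becomes one training sequence.
sequences = []
for s in sentences:
    ids = [word_index[w] for w in s.lower().split()]
    for i in range(2, len(ids) + 1):
        sequences.append(ids[:i])

# Left-pad to a common length, then split into inputs X and targets y.
max_len = max(len(seq) for seq in sequences)
padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]
X = [row[:-1] for row in padded]  # everything except the last token
y = [row[-1] for row in padded]   # the last token is the next-word label
```

Each row of `X` is one padded prefix and the matching entry of `y` is its next-word target, which is exactly the supervised dataset the LSTM trains on.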
Cornell Notes
The transcript frames next-word prediction as supervised learning by converting text generation into many “prefix → next word” training examples. Each sentence is split into word prefixes: for “Hi my name is Nitish,” the inputs are “Hi,” “Hi my,” “Hi my name,” and “Hi my name is,” with outputs “my,” “name,” “is,” and “Nitish.” Because neural networks require numbers, a Keras Tokenizer assigns an integer ID to every unique word and converts sentences into sequences of token IDs. Once the text is tokenized and structured into input-output pairs, an LSTM can be trained to predict the next token given a prefix. This approach mirrors the logic behind real typing and writing assistants, including early SwiftKey models that used LSTM.
How does next-word prediction become a supervised learning problem?
Why must text be converted into numbers before training an LSTM?
What does the Keras Tokenizer do in this pipeline?
How are sentences turned into training examples after tokenization?
Why use a small dataset for a live demonstration?
Review Questions
- When converting a sentence into training pairs, what exactly counts as the input and what counts as the output?
- What two Tokenizer methods are used, and what does each one accomplish in the preprocessing pipeline?
- Given a tokenized sentence sequence like [1,2,3,4,5], what are the input-output pairs used to train the next-word predictor?
Key Points
1. Reframe next-word prediction as supervised learning by generating many prefix-to-next-word training pairs from each sentence.
2. Use a sliding window over each sentence so every word position contributes an input sequence and a single next-word target.
3. Convert words to integers with the Keras Tokenizer so the LSTM receives numeric sequences rather than text strings.
4. Fit the tokenizer on the full dataset text first, then convert each sentence into token ID sequences using texts_to_sequences.
5. Split the raw text into sentences (e.g., by newlines) before building the prefix/next-token pairs.
6. Train an LSTM on the resulting input-output dataset so it learns the sequence patterns that determine the next likely token.
7. A smaller dataset keeps a live demonstration feasible, but larger corpora typically improve model quality.