LSTM Architecture | Part 2 | The How? | CampusX
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
LSTM maintains two states—cell state (long-term memory) and hidden state (short-term output)—and updates them separately at each time step.
Briefing
LSTM’s architecture is built to decide, at every time step, what information to keep, what to overwrite, and what to discard—using a three-part “gating” system that prevents long sequences from collapsing under vanishing gradients. Instead of treating the hidden state as the only memory, the design maintains a separate long-term memory (the cell state) and a short-term working memory (the hidden state), then forces controlled interaction between them.
The architecture looks more complex than a basic RNN because it carries two parallel states: the cell state (long-term memory) and the hidden state (short-term memory). At each time step, the model receives three inputs: the current input x_t, the previous hidden state h_{t−1}, and the previous cell state c_{t−1}; it produces the updated cell state c_t and the new hidden state h_t. The computation happens in two stages: first, the cell state is updated by removing irrelevant information and adding new candidate information; second, the new hidden state is computed from the updated cell state.
The core mechanism is the gates concept—three sigmoid-based gates that regulate information flow element-by-element. The **forget gate** decides what portion of the previous cell state should be erased. It takes the previous hidden state and the current input, passes them through a neural layer with a sigmoid activation, and produces a vector of values between 0 and 1. A value near 0 means “remove this part,” while a value near 1 means “keep it.” This is described as a controlled “removal” step: the previous cell state is multiplied by the forget gate output, effectively zeroing out unneeded components.
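The forget-gate step above can be sketched in numpy. This is a minimal illustration with made-up dimensions and random (untrained) weights, not a real trained model; the names `W_f`, `b_f` follow common LSTM notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: input dimension 3, hidden/cell dimension 4.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W_f = rng.standard_normal((n_h, n_h + n_in)) * 0.1  # forget-gate weights
b_f = np.zeros(n_h)                                  # forget-gate bias

x_t = rng.standard_normal(n_in)    # current input
h_prev = np.zeros(n_h)             # previous hidden state h_{t-1}
c_prev = rng.standard_normal(n_h)  # previous cell state c_{t-1}

# Sigmoid layer over [h_{t-1}, x_t] yields per-element keep factors in (0, 1).
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
c_partial = f_t * c_prev           # element-wise "erase" of the old cell state
```

Because `f_t` is strictly between 0 and 1, each component of the old cell state is scaled down (near 0 means "remove") or kept nearly intact (near 1 means "keep").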
Next comes the **input gate**, which determines what new information should be written into the cell state. It computes a set of candidate values (via a tanh-like transformation) from the current input and previous hidden state, then uses the input gate (sigmoid) to filter those candidates—allowing only the parts deemed useful to pass through. The updated cell state is then formed by combining the retained old information (after forget) with the filtered new candidate information (after input).
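The write step can be sketched the same way. Again, sizes and weights here are arbitrary placeholders; the key line is the final one, which combines "retained old" with "accepted new".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_h = 3, 4                       # hypothetical sizes
z = rng.standard_normal(n_h + n_in)    # concat of h_{t-1} and x_t
c_prev = rng.standard_normal(n_h)      # previous cell state

W_f, W_i, W_c = (rng.standard_normal((n_h, n_h + n_in)) * 0.1 for _ in range(3))
b = np.zeros(n_h)

f_t = sigmoid(W_f @ z + b)             # forget gate: what old info to keep
i_t = sigmoid(W_i @ z + b)             # input gate: which candidates to accept
c_tilde = np.tanh(W_c @ z + b)         # candidate values, each in (-1, 1)

c_t = f_t * c_prev + i_t * c_tilde     # retained old + accepted new
```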
Finally, the **output gate** decides what part of the cell state becomes the hidden state that will be exposed to the next time step. The cell state is passed through a tanh transformation, and the output gate—again a sigmoid vector—filters the result element-wise to produce h_t. This gating structure is presented as the reason LSTMs can preserve relevant information over long sequences: instead of relying on repeated transformations of a single hidden state (a weakness in vanilla RNNs), the cell state can carry forward useful signals while gates block irrelevant changes.
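Putting the three gates together, one full LSTM time step can be written as a single function. This is a sketch with hypothetical dimensions and random parameters, using the standard gate equations the section describes (forget, input, candidate, output).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: returns (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(p["Wf"] @ z + p["bf"])  # forget gate: erase parts of c_prev
    i = sigmoid(p["Wi"] @ z + p["bi"])  # input gate: accept parts of candidate
    g = np.tanh(p["Wc"] @ z + p["bc"])  # candidate cell values
    o = sigmoid(p["Wo"] @ z + p["bo"])  # output gate: expose parts of tanh(c)
    c = f * c_prev + i * g              # stage 1: update long-term memory
    h = o * np.tanh(c)                  # stage 2: derive short-term output
    return h, c

# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(2)
n_in, n_h = 3, 4
p = {f"W{k}": rng.standard_normal((n_h, n_h + n_in)) * 0.1 for k in "fico"}
p.update({f"b{k}": np.zeros(n_h) for k in "fico"})

h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
```

Note that the cell state `c` is only ever scaled (by `f`) and added to—never pushed through a weight matrix between steps—which is exactly the path that lets gradients survive long sequences.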
The transcript also grounds the math in a sentiment-analysis example, where words are converted into numeric vectors (e.g., one-hot style representations), then fed into the LSTM to predict whether sentiment is 0 or 1. The overall takeaway is that LSTM’s “keep/forget/write/read” logic—implemented through vectorized gates and element-wise operations—turns sequence modeling into a controlled memory management problem rather than an uncontrolled recurrence.
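The sentiment-analysis pipeline can be sketched end to end: one-hot encode each word, run the sequence through an LSTM step, and squash the final hidden state into a probability. Vocabulary, sentence, and all parameters here are invented and untrained; the point is only the data flow.

```python
import numpy as np

# Hypothetical tiny vocabulary; real setups use learned embeddings.
vocab = ["the", "movie", "was", "great", "terrible"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Untrained LSTM parameters, only to show the data flow.
rng = np.random.default_rng(3)
n_in, n_h = len(vocab), 4
Wf, Wi, Wc, Wo = (rng.standard_normal((n_h, n_h + n_in)) * 0.1 for _ in range(4))
b = np.zeros(n_h)
w_out = rng.standard_normal(n_h) * 0.1   # final sigmoid classifier weights

h = c = np.zeros(n_h)
for word in ["the", "movie", "was", "great"]:
    z = np.concatenate([h, one_hot(word)])
    f, i, o = sigmoid(Wf @ z + b), sigmoid(Wi @ z + b), sigmoid(Wo @ z + b)
    c = f * c + i * np.tanh(Wc @ z + b)  # update long-term memory
    h = o * np.tanh(c)                   # update short-term output

p_positive = sigmoid(w_out @ h)          # probability that sentiment is 1
```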
Cornell Notes
LSTM keeps two kinds of memory: a long-term cell state (c_t) and a short-term hidden state (h_t). At each time step, three sigmoid gates decide what to keep, what to write, and what to output. The forget gate multiplies the previous cell state by a vector of 0–1 values to erase irrelevant parts. The input gate filters newly computed candidate values, and the cell state becomes “retained old + accepted new.” The output gate then filters the transformed cell state to produce the next hidden state, helping long sequences avoid vanishing-gradient collapse.
Why does LSTM need both a cell state and a hidden state instead of using only the hidden state like a basic RNN?
How does the forget gate decide what to remove from the previous cell state?
What is the difference between the input gate and the candidate cell state in the LSTM update?
How does the output gate produce the hidden state from the cell state?
What does the transcript mean by “gates” operating on vectors rather than scalars?
How is the LSTM used in the sentiment-analysis example described?
Review Questions
- In an LSTM time step, which operations update the cell state, and how do the forget and input gates each affect that update?
- Why does multiplying by a sigmoid gate output help prevent irrelevant information from accumulating over long sequences?
- Describe the flow of information needed to compute h_t: which gate determines what portion of the cell state becomes the hidden state?
Key Points
1. LSTM maintains two states—cell state (long-term memory) and hidden state (short-term output)—and updates them separately at each time step.
2. Three sigmoid gates control information flow element-wise: forget (erase), input (write), and output (expose).
3. The forget gate scales the previous cell state by values in [0,1], effectively removing unneeded components.
4. The input gate filters candidate values computed from the current input and previous hidden state before writing them into the cell state.
5. The updated cell state is formed by combining retained old information with accepted new candidate information.
6. The output gate filters the transformed cell state to produce the next hidden state, which then feeds the next time step.
7. Vectorized gate operations (point-wise multiply/add) let LSTM selectively keep or discard different features simultaneously.