LSTM Architecture | Part 2 | The How? | CampusX
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
LSTM maintains two states—cell state (long-term memory) and hidden state (short-term output)—and updates them separately at each time step.
Briefing
LSTM’s architecture is built to decide, at every time step, what information to keep, what to overwrite, and what to discard—using a three-part “gating” system that prevents long sequences from collapsing under vanishing gradients. Instead of treating the hidden state as the only memory, the design maintains a separate long-term memory (the cell state) and a short-term working memory (the hidden state), then forces controlled interaction between them.
The architecture looks more complex than a basic RNN because it carries two parallel states: the cell state (long-term memory) and the hidden state (short-term memory). At each time step, the model receives three inputs: the current input x_t, the previous hidden state h_{t−1}, and the previous cell state c_{t−1}; it produces the updated cell state c_t and the new hidden state h_t. The computation happens in two stages: first, the cell state is updated by removing irrelevant information and adding new candidate information; second, the new hidden state is computed from the updated cell state.
The core mechanism is the gates concept—three sigmoid-based gates that regulate information flow element-by-element. The **forget gate** decides what portion of the previous cell state should be erased. It takes the previous hidden state and the current input, passes them through a neural layer with a sigmoid activation, and produces a vector of values between 0 and 1. A value near 0 means “remove this part,” while a value near 1 means “keep it.” This is described as a controlled “removal” step: the previous cell state is multiplied by the forget gate output, effectively zeroing out unneeded components.
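The forget-gate step above can be sketched in numpy. This is a minimal illustration with made-up dimensions and random (untrained) weights, not a real trained model; the names `W_f`, `b_f` follow common LSTM notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: input dimension 3, hidden/cell dimension 4.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W_f = rng.standard_normal((n_h, n_h + n_in)) * 0.1  # forget-gate weights
b_f = np.zeros(n_h)                                  # forget-gate bias

x_t = rng.standard_normal(n_in)    # current input
h_prev = np.zeros(n_h)             # previous hidden state h_{t-1}
c_prev = rng.standard_normal(n_h)  # previous cell state c_{t-1}

# Sigmoid layer over [h_{t-1}, x_t] yields per-element keep factors in (0, 1).
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
c_partial = f_t * c_prev           # element-wise "erase" of the old cell state
```

Because `f_t` is strictly between 0 and 1, each component of the old cell state is scaled down (near 0 means "remove") or kept nearly intact (near 1 means "keep").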
Next comes the **input gate**, which determines what new information should be written into the cell state. It computes a set of candidate values (via a tanh-like transformation) from the current input and previous hidden state, then uses the input gate (sigmoid) to filter those candidates—allowing only the parts deemed useful to pass through. The updated cell state is then formed by combining the retained old information (after forget) with the filtered new candidate information (after input).
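The write step can be sketched the same way. Again, sizes and weights here are arbitrary placeholders; the key line is the final one, which combines "retained old" with "accepted new".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_h = 3, 4                       # hypothetical sizes
z = rng.standard_normal(n_h + n_in)    # concat of h_{t-1} and x_t
c_prev = rng.standard_normal(n_h)      # previous cell state

W_f, W_i, W_c = (rng.standard_normal((n_h, n_h + n_in)) * 0.1 for _ in range(3))
b = np.zeros(n_h)

f_t = sigmoid(W_f @ z + b)             # forget gate: what old info to keep
i_t = sigmoid(W_i @ z + b)             # input gate: which candidates to accept
c_tilde = np.tanh(W_c @ z + b)         # candidate values, each in (-1, 1)

c_t = f_t * c_prev + i_t * c_tilde     # retained old + accepted new
```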
Finally, the **output gate** decides what part of the cell state becomes the hidden state that will be exposed to the next time step. The cell state is passed through a tanh transformation, and the output gate—again a sigmoid vector—filters the result element-wise to produce h_t. This gating structure is presented as the reason LSTMs can preserve relevant information over long sequences: instead of relying on repeated transformations of a single hidden state (a weakness in vanilla RNNs), the cell state can carry forward useful signals while gates block irrelevant changes.
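Putting the three gates together, one full LSTM time step can be written as a single function. This is a sketch with hypothetical dimensions and random parameters, using the standard gate equations the section describes (forget, input, candidate, output).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: returns (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(p["Wf"] @ z + p["bf"])  # forget gate: erase parts of c_prev
    i = sigmoid(p["Wi"] @ z + p["bi"])  # input gate: accept parts of candidate
    g = np.tanh(p["Wc"] @ z + p["bc"])  # candidate cell values
    o = sigmoid(p["Wo"] @ z + p["bo"])  # output gate: expose parts of tanh(c)
    c = f * c_prev + i * g              # stage 1: update long-term memory
    h = o * np.tanh(c)                  # stage 2: derive short-term output
    return h, c

# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(2)
n_in, n_h = 3, 4
p = {f"W{k}": rng.standard_normal((n_h, n_h + n_in)) * 0.1 for k in "fico"}
p.update({f"b{k}": np.zeros(n_h) for k in "fico"})

h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
```

Note that the cell state `c` is only ever scaled (by `f`) and added to—never pushed through a weight matrix between steps—which is exactly the path that lets gradients survive long sequences.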
The transcript also grounds the math in a sentiment-analysis example, where words are converted into numeric vectors (e.g., one-hot style representations), then fed into the LSTM to predict whether sentiment is 0 or 1. The overall takeaway is that LSTM’s “keep/forget/write/read” logic—implemented through vectorized gates and element-wise operations—turns sequence modeling into a controlled memory management problem rather than an uncontrolled recurrence.
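The sentiment-analysis pipeline can be sketched end to end: one-hot encode each word, run the sequence through an LSTM step, and squash the final hidden state into a probability. Vocabulary, sentence, and all parameters here are invented and untrained; the point is only the data flow.

```python
import numpy as np

# Hypothetical tiny vocabulary; real setups use learned embeddings.
vocab = ["the", "movie", "was", "great", "terrible"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Untrained LSTM parameters, only to show the data flow.
rng = np.random.default_rng(3)
n_in, n_h = len(vocab), 4
Wf, Wi, Wc, Wo = (rng.standard_normal((n_h, n_h + n_in)) * 0.1 for _ in range(4))
b = np.zeros(n_h)
w_out = rng.standard_normal(n_h) * 0.1   # final sigmoid classifier weights

h = c = np.zeros(n_h)
for word in ["the", "movie", "was", "great"]:
    z = np.concatenate([h, one_hot(word)])
    f, i, o = sigmoid(Wf @ z + b), sigmoid(Wi @ z + b), sigmoid(Wo @ z + b)
    c = f * c + i * np.tanh(Wc @ z + b)  # update long-term memory
    h = o * np.tanh(c)                   # update short-term output

p_positive = sigmoid(w_out @ h)          # probability that sentiment is 1
```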
Cornell Notes
LSTM keeps two kinds of memory: a long-term cell state (c_t) and a short-term hidden state (h_t). At each time step, three sigmoid gates decide what to keep, what to write, and what to output. The forget gate multiplies the previous cell state by a vector of 0–1 values to erase irrelevant parts. The input gate filters newly computed candidate values, and the cell state becomes “retained old + accepted new.” The output gate then filters the transformed cell state to produce the next hidden state, helping long sequences avoid vanishing-gradient collapse.
Why does LSTM need both a cell state and a hidden state instead of using only the hidden state like a basic RNN?
How does the forget gate decide what to remove from the previous cell state?
What is the difference between the input gate and the candidate cell state in the LSTM update?
How does the output gate produce the hidden state from the cell state?
What does the transcript mean by “gates” operating on vectors rather than scalars?
How is the LSTM used in the sentiment-analysis example described?
Review Questions
- In an LSTM time step, which operations update the cell state, and how do the forget and input gates each affect that update?
- Why does multiplying by a sigmoid gate output help prevent irrelevant information from accumulating over long sequences?
- Describe the flow of information needed to compute h_t: which gate determines what portion of the cell state becomes the hidden state?
Key Points
1. LSTM maintains two states—cell state (long-term memory) and hidden state (short-term output)—and updates them separately at each time step.
2. Three sigmoid gates control information flow element-wise: forget (erase), input (write), and output (expose).
3. The forget gate scales the previous cell state by values in [0,1], effectively removing unneeded components.
4. The input gate filters candidate values computed from the current input and previous hidden state before writing them into the cell state.
5. The updated cell state is formed by combining retained old information with accepted new candidate information.
6. The output gate filters the transformed cell state to produce the next hidden state, which then feeds the next time step.
7. Vectorized gate operations (point-wise multiply/add) let LSTM selectively keep or discard different features simultaneously.