Get AI summaries of any video or article — Sign up free
Can We Build an Artificial Hippocampus? thumbnail

Can We Build an Artificial Hippocampus?

Artem Kirsanov·
6 min read

Based on Artem Kirsanov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

TEM frames navigation as next-observation prediction from action–observation sequences, with learning driven by prediction-error minimization.

Briefing

Artificial hippocampus–style intelligence hinges on a simple but powerful idea: learn to predict the next sensory event by factoring experience into reusable spatial structure plus a memory that binds “where” to “what.” The Tolman–Eichenbaum machine (TEM) framework described here treats navigation as a sequence-prediction problem—observations arrive alongside actions, and the system trains to minimize prediction error. Because the training data come from structured environments, the most efficient way to reduce error is to infer the underlying relational rules of the world, not just memorize transitions. That pressure to discover structure is what enables flexible generalization.

The model is built from two core components inspired by the entorhinal cortex and hippocampus. A position module, analogous to medial entorhinal inputs, performs path integration: it updates an internal “location belief” using actions alone, without directly receiving sensory observations. A memory module then binds this positional state to incoming sensory inputs, storing associations like “at X, I saw Y.” Crucially, the memory is associative: partial cues can retrieve full content. Provide position and it reconstructs the likely sensory observation; provide sensory input and it can infer the corresponding location. During prediction, the system rolls the position module forward through an action sequence, then queries memory using the resulting position to retrieve the next expected observation.

To demonstrate generalization, the transcript uses a family-tree navigation task. After training on many trees generated from shared transition rules, the position module learns to return to the same internal state after loops, effectively embedding the world’s transition logic into its update dynamics. When the model reaches the end of a sequence, it uses the path-integrated position to retrieve the correct missing sensory element from memory—without having explicitly memorized every possible edge.

Performance is contrasted with a lookup-table baseline that stores every “previous observation + action → next observation” pair. That approach requires visiting all possible combinations to reach perfect accuracy, scaling with the number of edges in the relational graph. TEM reaches high accuracy by visiting nodes rather than all edges, reflecting a learned structural representation. Internally, the model’s emergent representations mirror hippocampal and entorhinal cell types: position-module units develop periodic, grid-like firing patterns; memory-module units behave like place cells, firing for specific conjunctions of location and sensory context. Representations remap across environments, and the transcript emphasizes that grid and place-like structure was not hard-coded—random parameters optimized under prediction pressure produced it.

The framework also adapts when sensory statistics change to resemble real animal behavior. Training with boundary-seeking and object-approach biases yields boundary cells, object-vector-like responses, and even landmark-like selectivity. In more complex alternation tasks, the model learns latent rules about reward timing and develops splitter-cell-like activity modulated by both position and future turn direction.

Finally, the model helps reinterpret hippocampal remapping. Place-cell remapping is not purely random in this account: because place-like units are conjunctions of sensory and structural inputs, grid-cell structure constrains where place fields shift. The transcript notes that a predicted correlation between grid–place alignment across environments matches experimental findings in real brains.

In short, TEM offers a unified computational story for how factorized spatial structure and associative memory can produce rapid abstraction, biologically plausible representations, and testable links between entorhinal grid dynamics and hippocampal remapping—plus a claimed connection to transformer architectures via a modified TEM variant.

Cornell Notes

The Tolman–Eichenbaum machine (TEM) frames navigation as next-step prediction from sequences of actions and observations. It learns world structure by minimizing prediction error, using a factorized architecture: a position module performs path integration from actions to estimate location, while a memory module binds that location to sensory inputs and supports associative recall from partial cues. After training on many structured environments, the system can predict unseen transitions by retrieving the correct “what” for a path-integrated “where,” without memorizing every edge in the state graph. Internally, it develops grid-like periodic activity in the position module and place-like conjunction activity in the memory module, including hippocampal remapping across contexts. Changing training statistics to match animal behavior yields boundary, object-vector, landmark, and splitter-like representations, and it constrains remapping via grid–place alignment correlations seen in real data.

Why does TEM treat navigation as a prediction problem instead of a direct mapping from state to action or reward?

TEM receives sequences of observations and the actions that produced them, then trains to predict the next observation. Minimizing prediction error forces the model to exploit regularities in the data. When the environment is structured, the optimal solution is to infer relational structure (how actions move through space/graphs), because that structure makes future observations predictable. This is why the model can generalize: it learns rules that generate many transitions rather than storing every transition explicitly.

How do the position module and memory module work together to produce correct next-observation predictions?

The position module updates an internal location belief using actions only, via path integration. At each step, the current location belief is combined with the sensory observation and stored in the memory module as an association (“at X, I saw Y”). For prediction, the model path-integrates through the action sequence to reach a final location belief, then queries memory using that position cue to retrieve the sensory observation associated with that location. Because memory is associative, it can reconstruct full content from partial information (position-only or sensory-only queries).

What is the key difference between TEM and a lookup-table approach in terms of what must be experienced to reach high accuracy?

A lookup table stores transitions as (previous observation + action) → next observation. To reach 100% accuracy, it must encounter all possible combinations, scaling with the number of edges in the relational graph. TEM instead learns structure: it can infer missing relationships after visiting enough nodes. In the transcript’s family-tree example, perfect performance depends more on node coverage than on enumerating every edge, reflecting learned abstractions rather than brute-force memorization.

What emergent neural-like representations arise inside TEM, and how are they linked to hippocampal and entorhinal cell types?

After training on grid-like analogs of 2D space, units in the position module develop periodic activity that tiles space with regular hexagonal patterns—described as grid-like and band-cell-like. Memory-module units form conjunctions of position and sensory input, resembling place cells that fire in specific spatial patches. Across environments, place-like firing remaps because sensory contexts change, mirroring hippocampal remapping. The transcript stresses these patterns are not hard-coded; they emerge from optimizing the prediction objective starting from random parameters.

How does changing the statistics of sensory experience affect the representations TEM learns?

When training sequences are generated to mimic mouse-like behavior—spending more time near boundaries and approaching objects—the model’s optimal representations shift accordingly. The transcript describes boundary-selective responses, object-vector-like cells that respond to distance and direction from objects, and landmark-like selectivity in the memory module. In an alternation task, TEM learns the rule that reward alternates between sides and develops splitter-cell-like activity modulated by both position and future turn direction, suggesting it can map latent structure not directly given in raw observations.

What does TEM suggest about why hippocampal place-cell remapping is not entirely random?

In TEM, place-like units are conjunctions of sensory and structural (grid-like) information. That means place-field shifts across environments are constrained by the underlying grid structure. When a place field remaps, it tends to land on locations that align with particular grid-cell hexagonal patterns. The transcript notes a predicted correlation between grid–place alignment across environments, and reports that experimental data support this relationship in real brains.

Review Questions

  1. How does path integration in the position module enable prediction when the final sensory observation is missing from the sequence?
  2. Why does TEM’s accuracy scale more with visited nodes than with visited edges, and what does that imply about learned structure?
  3. What evidence in the transcript links grid-like and place-like representations to constrained remapping rather than purely random drift?

Key Points

  1. 1

    TEM frames navigation as next-observation prediction from action–observation sequences, with learning driven by prediction-error minimization.

  2. 2

    A factorized architecture separates “where” (position module using actions and path integration) from “what” (memory module binding position to sensory inputs).

  3. 3

    Associative memory lets the model retrieve sensory content from partial cues, enabling prediction via a path-integrated location query.

  4. 4

    Compared with lookup tables, TEM reaches high accuracy without memorizing every possible transition edge, because it learns relational structure.

  5. 5

    Emergent internal representations resemble entorhinal grid-like activity and hippocampal place-cell conjunctions, including remapping across environments.

  6. 6

    When training data reflect animal-like behavior, TEM develops boundary, object-vector, landmark-like, and splitter-cell-like selectivity and learns latent task rules.

  7. 7

    TEM links remapping to grid–place alignment: place-field shifts across environments are constrained by underlying grid structure rather than drifting randomly.

Highlights

TEM’s prediction success comes from path-integrating an internal location state and then using that state to retrieve the next sensory observation from associative memory.
Grid-like periodic activity and place-like conjunction activity emerge from optimization rather than being hard-coded into the model.
A node-coverage argument explains why TEM can generalize with far fewer experiences than a transition lookup table.
Changing sensory statistics to match mouse behavior produces boundary, object-vector, landmark-like, and splitter-cell-like representations.
Remapping is predicted to preserve structural constraints: place-field shifts across environments correlate with grid–place alignment in both the model and real neural data.

Topics

  • Tolman–Eichenbaum Machine
  • Predictive Coding
  • Path Integration
  • Grid Cells
  • Place Cells
  • Hippocampal Remapping
  • Transformer Equivalence

Mentioned

  • James Whittington
  • TEM