Playing a Neural Network's version of GTA V: GAN Theft Auto
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
The neural network generates GTA 5 frames directly from player key presses, effectively acting as the game environment rather than just an agent controlling a conventional engine.
Briefing
A neural network can run an entire slice of Grand Theft Auto 5—generating the visuals, responding to key presses, and reproducing game-like physics—without relying on the game’s original rules engine. In “GAN Theft Auto,” the model takes real-time player inputs (steering left or right, driving into obstacles, following the road) and outputs pixel frames that include not just scenery and motion, but also lighting, reflections, shadows, and collision responses. The result matters because it demonstrates a practical path from “AI that plays a game” to “AI that becomes the game,” hinting at future software where core simulation and rendering are learned rather than hand-coded.
The project builds on NVIDIA’s GameGAN research, but pushes beyond a simple GAN that maps inputs to frames. Here, the neural network is effectively the environment: it decides how the road looks, how the car turns, what happens when the vehicle hits walls or barriers, how background elements shift with movement, and how light behaves on the car’s surfaces. The team also adds super sampling on top of the model output to reduce native pixelation, including a selected “times 8” super-sampled version that produces the most playable, visually coherent results.
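The control flow described above—key press in, generated frame out, then a separate super-sampling pass—can be sketched as a simple loop. This is a hedged illustration only: `G`, `upsample_x8`, the resolution, and the action encoding are all hypothetical stand-ins, not the project's actual code.

```python
# Minimal sketch of the "network as environment" loop, assuming a trained
# generator G that maps (previous frame, action) -> next frame, followed by
# an x8 super-sampling step. G, upsample_x8, and the 48x80 native resolution
# are illustrative assumptions, not details from the project.
import numpy as np

H, W = 48, 80                                  # assumed native GAN output size
ACTIONS = {"left": 0, "straight": 1, "right": 2}

def G(prev_frame, action_id):
    """Stand-in for the trained generator's forward pass."""
    # A real model would run a neural network here; we just shift pixels
    # so the loop is runnable end to end.
    shift = action_id - 1                      # -1, 0, +1 for left/straight/right
    return np.roll(prev_frame, shift, axis=1)

def upsample_x8(frame):
    """Nearest-neighbour x8 upscale, standing in for the super-sampler model."""
    return frame.repeat(8, axis=0).repeat(8, axis=1)

frame = np.zeros((H, W, 3), dtype=np.uint8)    # initial seed frame
for key in ["left", "left", "straight", "right"]:
    frame = G(frame, ACTIONS[key])             # the network *is* the game step
    display = upsample_x8(frame)               # 48x80 -> 384x640 for the player

print(display.shape)  # (384, 640, 3)
```

The key structural point is that there is no physics or rendering engine in the loop: every frame the player sees is a direct model output, with super sampling applied afterwards purely for visual clarity.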
Getting to GTA 5 required solving data and compute bottlenecks. Manual data collection would have taken months, so Daniel Kukiela created a rules-based AI to drive through GTA 5 and generate training footage. Multiple AI agents were used simultaneously to accelerate data gathering, with cars appearing to clip or overlap in the same road segment during capture. Training itself was GPU-memory intensive; early experiments ran on RTX 8000 systems, and later work leveraged NVIDIA’s DGX Station with four A100 80 GB GPUs (320 GB total VRAM) to scale experiments.
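The data-gathering step amounts to recording (frame, key press) pairs while a scripted driver plays. The sketch below shows that shape of pipeline under stated assumptions: `drive_policy` and `grab_frame` are hypothetical placeholders for the rules-based driving AI and a screen-capture routine, not the project's implementation.

```python
# Hedged sketch of scripted training-data collection. drive_policy and
# grab_frame are hypothetical stand-ins for the rules-based driving AI
# and a game screen capture; a real pipeline would return pixel data.
import random

def drive_policy(step):
    """Toy stand-in for the rules-based driver choosing the next key press."""
    return random.choice(["left", "straight", "right"])

def grab_frame(step):
    """Stand-in for capturing the current game frame."""
    return f"frame_{step}"

dataset = []
for step in range(1000):              # one agent; the project ran several in parallel
    action = drive_policy(step)
    frame = grab_frame(step)
    dataset.append((frame, action))   # (observation, key press) training pair

print(len(dataset))  # 1000
```

Running several such agents at once multiplies throughput, which is why captured footage sometimes showed cars clipping through each other on the same road segment.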
Initial models were trained on a limited highway section and struggled with hard boundaries—driving into walls could quickly confuse the system. Subsequent training introduced wall collisions, and the model learned plausible contact behavior: the car approaches, impacts, then turns and slides along the barrier rather than the simulation simply breaking down. Lighting improvements followed as well, including moving sun reflections on windows and matte surface highlights, plus shadows that track with motion.
The team also experimented with other vehicles, adding police cars and collisions. Results were mixed: sometimes the model handled interactions partially (including a collision where a police car appears to split), but often other cars would disappear or the player vehicle could not steer through them reliably. Even so, the project’s strongest takeaway is the breadth of learned dynamics—road perspective changes, background parallax, and subtle vehicle roll during turns—emerging from training on real gameplay frames.
After roughly two months of work and limited training scope, the project still produced a playable, GTA-5-recognizable environment entirely generated by a neural network. The team plans to host the base model and a super-sampler via GitHub, and invites others to train and share new variants, with the next obvious research targets being larger roaming ranges and more robust multi-car collision behavior.
Cornell Notes
“GAN Theft Auto” demonstrates a neural network that functions as the game environment for a portion of GTA 5. Instead of using GTA’s original physics and rendering, the model takes player key inputs and generates pixel frames in real time, including road visuals, turning behavior, lighting, reflections, and shadows. Early versions struggled with hard boundaries like walls, but later training added collision scenarios and produced more realistic wall impacts and sliding. Super sampling (including a times 8 option) improves visual clarity enough to make the generated GTA slice playable. The work matters because it moves beyond “AI plays a game” toward “AI simulates a game,” raising the prospect of learned physics and rendering replacing large parts of traditional engines.
How does the system turn player actions into a generated GTA 5 world?
What changed between early models and later ones regarding collisions?
Why did super sampling matter, and what version was highlighted?
How did the team gather enough GTA 5 data without months of manual play?
What evidence suggests the model learned more than just frame-to-frame texture?
How reliable were interactions with other cars?
Review Questions
- What inputs does the neural network receive, and what outputs does it generate to replace the game’s simulation loop?
- Why did wall collisions initially fail, and what training change improved the model’s behavior?
- Which learned visual cues (lighting, reflections, shadows, parallax, roll) were highlighted as evidence of dynamic understanding?
Key Points
1. The neural network generates GTA 5 frames directly from player key presses, effectively acting as the game environment rather than just an agent controlling a conventional engine.
2. Wall collisions improved after training data included hard boundary events; later models produced impacts followed by sliding along barriers.
3. Super sampling significantly reduces pixelation, with a selected times 8 model delivering the most playable visual quality.
4. Training required large-scale data generation; manual play was replaced by a rules-based driving AI that collected footage quickly.
5. Compute constraints shaped the workflow, with early training on RTX 8000 systems and later scaling using an NVIDIA DGX Station with four A100 80 GB GPUs (320 GB total VRAM).
6. The model learned dynamic lighting and reflections (including moving sun highlights) plus shadows that track with motion.
7. Multi-car interactions remain unreliable: police car collisions sometimes work partially, but other-car behavior often degrades into disappearance or steering failure.