Playing a Neural Network's version of GTA V: GAN Theft Auto

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The neural network generates GTA 5 frames directly from player key presses, effectively acting as the game environment rather than just an agent controlling a conventional engine.

Briefing

A neural network can run an entire slice of Grand Theft Auto 5—generating the visuals, responding to key presses, and reproducing game-like physics—without relying on the game’s original rules engine. In “GAN Theft Auto,” the model takes real-time player inputs (left/right, turning into obstacles, interacting with the road) and outputs pixel frames that include not just scenery and motion, but also lighting, reflections, shadows, and collision responses. The result matters because it demonstrates a practical path from “AI that plays a game” to “AI that becomes the game,” hinting at future software where core simulation and rendering are learned rather than hand-coded.

The project builds on NVIDIA’s GameGAN research, but pushes beyond a simple GAN that maps inputs to frames. Here, the neural network is effectively the environment: it decides how the road looks, how the car turns, what happens when the vehicle hits walls or barriers, how background elements shift with movement, and how light behaves on the car’s surfaces. The team also adds super sampling on top of the model output to reduce native pixelation, with a selected 8× super-sampled version producing the most playable, visually coherent results.

Getting to GTA 5 required solving data and compute bottlenecks. Manual data collection would have taken months, so Daniel Kukiela created a rules-based AI to drive through GTA 5 and generate training footage. Multiple AI agents were run simultaneously to accelerate data gathering, with cars appearing to clip or overlap in the same road segment during capture. Training itself was GPU-memory intensive; early experiments ran on RTX 8000 systems, and later work leveraged an NVIDIA DGX Station with four A100 GPUs (320 GB total VRAM) to scale experiments.

Initial models were trained on a limited highway section and struggled with hard boundaries: driving into walls could quickly confuse the system. Subsequent training introduced wall collisions, and the model learned plausible contact behavior: the car approaches, impacts, then turns and slides along the barrier instead of the generated world breaking down. Lighting improvements followed as well, including moving sun reflections on windows and matte surface highlights, plus shadows that track with motion.

The team also experimented with other vehicles, adding police cars and collisions. Results were mixed: sometimes the model handled interactions partially (including a collision where a police car appears to split), but often other cars would disappear or the player vehicle could not steer through them reliably. Even so, the project’s strongest takeaway is the breadth of learned dynamics—road perspective changes, background parallax, and subtle vehicle roll during turns—emerging from training on real gameplay frames.

After roughly two months of work and limited training scope, the project still produced a playable, GTA-5-recognizable environment entirely generated by a neural network. The team plans to host the base model and a super-sampler via GitHub, and invites others to train and share new variants, with the next obvious research targets being larger roaming ranges and more robust multi-car collision behavior.

Cornell Notes

“GAN Theft Auto” demonstrates a neural network that functions as the game environment for a portion of GTA 5. Instead of using GTA’s original physics and rendering, the model takes player key inputs and generates pixel frames in real time, including road visuals, turning behavior, lighting, reflections, and shadows. Early versions struggled with hard boundaries like walls, but later training added collision scenarios and produced more realistic wall impacts and sliding. Super sampling (including an 8× option) improves visual clarity enough to make the generated GTA slice playable. The work matters because it moves beyond “AI plays a game” toward “AI simulates a game,” raising the prospect of learned physics and rendering replacing large parts of traditional engines.

How does the system turn player actions into a generated GTA 5 world?

The model runs in Python and receives live key presses from the player (for example, left/right turning). It outputs pixel values for each frame, so the neural network determines what the road looks like, how the car turns, and what happens during interactions such as hitting walls or barriers. In this setup, the game’s original rules engine isn’t driving the behavior; the learned model is.
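
As a rough illustration of that loop, here is a minimal Python sketch; the model interface, the recurrent state, and the key-reading and display helpers are assumptions for illustration, not the project’s actual code:

```python
import torch

# Hypothetical action encoding: one-hot over the three driving inputs.
ACTIONS = {"straight": 0, "left": 1, "right": 2}

def play_loop(model, first_frame, read_key, show_frame, device="cuda"):
    """Feed live key presses into the generator and display each
    generated frame; the network itself stands in for the game engine."""
    frame = torch.as_tensor(first_frame, device=device).float().unsqueeze(0)
    hidden = None  # recurrent state carrying the learned "world" forward
    while True:
        key = read_key()  # e.g. "left", "right", or "straight"
        action = torch.zeros(1, len(ACTIONS), device=device)
        action[0, ACTIONS[key]] = 1.0
        with torch.no_grad():
            # The generator predicts the next frame from the current frame,
            # the player's action, and its internal state.
            frame, hidden = model(frame, action, hidden)
        show_frame(frame.squeeze(0).cpu().numpy())
```

The architectural point is that there is no physics or rendering code anywhere in this loop; everything the player sees comes out of `model`.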

What changed between early models and later ones regarding collisions?

Early training data didn’t include hard boundary events like wall impacts, so when the car drove into walls during inference, the model often became confused quickly. After adding wall collision scenarios to training, the model learned a more consistent response: the car approaches, hits the wall, then is forced to turn straight and slide along the barrier rather than failing outright.

Why did super sampling matter, and what version was highlighted?

The base model outputs were noticeably pixelated. To improve clarity, the team added super sampling on top of the generated frames. Among many super-sampler variants, they selected an 8× super-sampling model as the best-looking option, producing a substantial visual upgrade over the native output.
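
Conceptually, the super sampler is a second network applied to each generated frame before display. A minimal sketch, assuming a PyTorch upscaling module (`sampler` here is a placeholder, not the project’s published model):

```python
import torch

def upscale_8x(frame, sampler, device="cuda"):
    """Run an 8x super-sampling network over one generated frame.

    frame: (H, W, 3) uint8 array from the base model.
    Returns an (8H, 8W, 3) uint8 array for display.
    """
    x = torch.as_tensor(frame, device=device).float().div(255.0)
    x = x.permute(2, 0, 1).unsqueeze(0)  # (H, W, 3) -> (1, 3, H, W)
    with torch.no_grad():
        y = sampler(x)  # (1, 3, H, W) -> (1, 3, 8H, 8W)
    y = y.squeeze(0).permute(1, 2, 0).clamp(0, 1)
    return (y * 255).byte().cpu().numpy()
```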

How did the team gather enough GTA 5 data without months of manual play?

Manual collection wasn’t feasible within the project’s timeframe. Daniel Kukiela built a rules-based driving AI to generate training footage automatically. To speed up capture, twelve AI agents drove around the same road segment in parallel, producing a larger dataset for training the GAN models.
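
In outline, each agent simply records (frame, action) pairs while a scripted policy drives. A simplified single-agent sketch, where `grab_frame`, `scripted_policy`, and `send_keys` are hypothetical stand-ins for the actual capture tooling:

```python
import numpy as np

def collect_episode(grab_frame, scripted_policy, send_keys, steps, out_path):
    """Record one driving episode as (frame, action) pairs for GAN training."""
    frames, actions = [], []
    for _ in range(steps):
        frame = grab_frame()             # screenshot of the game window
        action = scripted_policy(frame)  # rules-based choice: left/right/straight
        send_keys(action)                # press the corresponding key in-game
        frames.append(frame)
        actions.append(action)
    np.savez(out_path, frames=np.stack(frames), actions=np.array(actions))
```

Running several such agents in parallel (the project used twelve) multiplies the capture rate, at the cost of agents occasionally overlapping in the recorded footage.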

What evidence suggests the model learned more than just frame-to-frame texture?

The model reproduced dynamic, physics-like and camera-like effects: moving sun reflections and window highlights, shadows that behave as expected, background parallax (mountains shifting at the right pace as the car travels), and subtle vehicle roll during turns. These behaviors indicate the network learned structured relationships between motion, viewpoint, and visual outcomes rather than only static appearance.

How reliable were interactions with other cars?

Interactions with police cars were inconsistent. In some cases, the model partially responded—one example showed a police car collision where the vehicle appeared to split. But often other cars would disappear or the player vehicle couldn’t steer through them correctly (e.g., being dragged along straight despite attempts to turn). The project notes that more work is needed for robust multi-car collision handling.

Review Questions

  1. What inputs does the neural network receive, and what outputs does it generate to replace the game’s simulation loop?
  2. Why did wall collisions initially fail, and what training change improved the model’s behavior?
  3. Which learned visual cues (lighting, reflections, shadows, parallax, roll) were highlighted as evidence of dynamic understanding?

Key Points

  1. The neural network generates GTA 5 frames directly from player key presses, effectively acting as the game environment rather than just an agent controlling a conventional engine.

  2. Wall collisions improved after training data included hard boundary events; later models produced impacts followed by sliding along barriers.

  3. Super sampling significantly reduces pixelation, with a selected 8× model delivering the most playable visual quality.

  4. Training required large-scale data generation; manual play was replaced by a rules-based driving AI that collected footage quickly.

  5. Compute constraints shaped the workflow, with early training on RTX 8000 systems and later scaling on an NVIDIA DGX Station with four A100 GPUs.

  6. The model learned dynamic lighting and reflections (including moving sun highlights) plus shadows that track with motion.

  7. Multi-car interactions remain unreliable: police car collisions sometimes work partially, but other-car behavior often degrades into disappearance or steering failure.

Highlights

  • The model doesn’t just render a scene: it decides how the car turns, how the road behaves, and what happens on collisions by generating pixel frames from inputs.
  • After adding wall collision training, the car began to hit barriers and slide along them in a more game-like way.
  • Moving sun reflections, matte highlights, and shadows appear to be learned and change correctly as the vehicle moves.
  • Background parallax and subtle vehicle roll during turns emerged from training, even though the system wasn’t explicitly given those rules.
  • Collisions with police cars were mixed: partial successes occurred, but cars often disappeared or blocked steering unpredictably.

Topics

  • GANs
  • GameGAN
  • Super Sampling
  • Neural Rendering
  • Learned Physics
