A. I. Learns to Play Starcraft 2 (Reinforcement Learning)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Convert StarCraft 2 into an RL problem by defining a compact observation (a handcrafted minimap) and a small discrete macro action set.
Briefing
A reinforcement-learning agent can learn to play StarCraft 2 at least at the “macro” level by using a custom, simplified minimap representation as input and a small set of high-level actions—then training with a reward that favors sustained combat rather than resource hoarding or endless survival. After starting from a baseline where random action selection never wins, the training setup produced a model that reached roughly a 70% win rate against the hard computer bot, with peak training reward around 200. The result matters because it shows a workable path from game state to RL signals without needing full control of every in-game micro decision.
The approach begins by reframing StarCraft 2's complexity into inputs, outputs, and reward. Instead of feeding raw video frames, the agent receives a minimap built from a blank canvas onto which the game state is drawn: minerals, vespene gas, building health, the enemy starting location, enemy units and structures, the player's nexus and other structures, and unit positions. Visibility is encoded by brightness, so minerals not currently observed by any unit appear dim and the agent can infer uncertainty. Attack units (void rays) are colored distinctly to make combat-relevant information easier to parse.
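As a rough illustration of that observation design (not the project's actual drawing code), entities can be plotted onto a blank NumPy canvas, with half brightness marking resources outside current vision; the canvas size, colors, and helper names here are assumptions:

```python
import numpy as np

MAP_SIZE = 224  # illustrative canvas size, not necessarily the project's

def blank_map():
    """Start each frame from an empty RGB canvas."""
    return np.zeros((MAP_SIZE, MAP_SIZE, 3), dtype=np.uint8)

def draw_entity(canvas, x, y, color, visible=True):
    """Plot one entity; entities outside current vision get half brightness."""
    shade = color if visible else tuple(c // 2 for c in color)
    canvas[int(y), int(x)] = shade

canvas = blank_map()
draw_entity(canvas, 50, 60, (255, 255, 255), visible=True)   # observed mineral: bright
draw_entity(canvas, 80, 90, (255, 255, 255), visible=False)  # unseen mineral: dim
```

The same scheme extends to distinct colors per entity type (e.g. void rays drawn in their own color) so combat-relevant pixels stand out.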
On the action side, the agent is not controlling every unit command. It chooses among six macro actions implemented as explicit rules: (0) expand by ensuring supply, probes, assimilators, and a new base; (1) build stargates by creating required tech structures (gateways and cybernetics cores) and then stargates (currently capped at one per base); (2) build a void ray when affordable; (3) scout by sending a probe infrequently (throttled to once every 200 frames to avoid constant “suicide scouting”); (4) attack by prioritizing nearby units and buildings, otherwise moving void rays toward the enemy start location; and (5) retreat/fall back by returning void rays to base after fights.
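A minimal sketch of how such a macro-action dispatcher might look, with the scout throttle made explicit; the action names and the 200-frame interval follow the description above, while the function itself is hypothetical (the real project wires each branch to in-game commands):

```python
# Hypothetical macro-action dispatcher; the real project attaches each
# branch to python-sc2 unit commands, which are omitted here.
ACTION_NAMES = ["expand", "build_stargate", "build_voidray", "scout", "attack", "flee"]
SCOUT_INTERVAL = 200  # frames between scouting probes, per the throttle above

def dispatch(action, frame, last_scout_frame):
    """Map an action index to a macro behavior, throttling scout attempts."""
    name = ACTION_NAMES[action]
    if name == "scout":
        if frame - last_scout_frame < SCOUT_INTERVAL:
            return "noop", last_scout_frame  # too soon: skip this scout
        return "scout", frame                # record when we last scouted
    return name, last_scout_frame
```

The throttle is what prevents the constant "suicide scouting" mentioned above: repeated scout actions within 200 frames collapse into no-ops.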
The training hinges on reward design, where naive choices can derail learning. A benchmark run of 200 games using random actions produced zero wins, confirming the task is nontrivial. For learning, the system avoids sparse "win/loss only" rewards because the agent faces thousands of steps per game and can't easily assign credit. Intermediate rewards are used instead: the agent earns a small per-step reward for each void ray actively in combat, while a terminal reward or penalty is applied for winning or losing. Attempts to reward resource gathering or total army size tended to encourage stalling, lengthening games or building without finishing, so combat-focused reward proved more effective.
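The shaped reward could be sketched like this; the per-step scale and terminal magnitudes are illustrative guesses, not the project's tuned values:

```python
def compute_reward(voidrays_in_combat, done=False, won=False):
    """Small per-step reward per void ray actively fighting, plus a large
    terminal bonus or penalty at game end.
    The 0.015 and 500 scales are illustrative assumptions."""
    reward = 0.015 * voidrays_in_combat
    if done:
        reward += 500.0 if won else -500.0
    return reward
```

Because the per-step signal only fires while void rays are fighting, the agent gains nothing from hoarding resources or idling, which is the stalling failure mode the paragraph above describes.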
Finally, the implementation connects StarCraft 2 to Stable Baselines 3 (using PPO) through a custom OpenAI Gym environment. Because the game loop and the RL loop can't communicate directly in a clean way, the system relies on a shared state file (a pickle) and multiple processes: the RL side publishes its chosen action and waits for fresh state, while the game process executes commands in StarCraft 2, redraws the minimap, computes the reward, and signals episode completion. After training, the best model achieved about a 70% win rate versus the hard bot, marking a strong early milestone.
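A stripped-down version of that pickle handshake might look like the following; the file path and dictionary keys are assumptions for illustration, not the project's actual schema:

```python
import os
import pickle
import tempfile

# Illustrative shared-state file; the real project uses its own path and schema.
STATE_FILE = os.path.join(tempfile.gettempdir(), "sc2_state.pkl")

def write_state(action=None, observation=None, reward=0.0, done=False):
    """One side (RL env or game bot) publishes the latest shared state."""
    with open(STATE_FILE, "wb") as f:
        pickle.dump({"action": action, "observation": observation,
                     "reward": reward, "done": done}, f)

def read_state():
    """The other side polls this file until the field it needs appears."""
    with open(STATE_FILE, "rb") as f:
        return pickle.load(f)

# RL side writes its chosen action; the game process would read it,
# advance the game one step, and write back observation/reward/done.
write_state(action=4)
state = read_state()
```

In the real system each side loops on `read_state` until the other has updated the file, which is what lets the Gym `step()` call block until StarCraft 2 has actually advanced.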
The next planned step is to improve beyond macro decisions—especially micro-level targeting and movement. Experiments with “hunting” void rays at random coordinates when no enemies are visible performed worse, suggesting that coordinate selection should likely become its own learned micro policy rather than hand-coded logic. The project is positioned as a foundation for combining macro RL with a separate micro RL module later.
Cornell Notes
The project turns StarCraft 2 into a reinforcement-learning problem by feeding a handcrafted minimap-like state into a PPO agent and restricting decisions to six macro actions (expand, stargates, build void rays, scout, attack, retreat). Minerals, gas, structures, and units are drawn onto a blank map with color coding; unseen resources are dim to reflect limited visibility. Training avoids sparse win/loss rewards by giving small per-step rewards when void rays are actively fighting, plus terminal rewards for winning or losing. Random-action play never wins in 200 trials, but the best trained model reaches about a 70% win rate against the hard computer bot. The work sets up a foundation for later micro-focused learning, such as learning where to move and attack rather than hand-coding it.
Why does the setup avoid using raw game frames, and what does the minimap representation encode?
What are the six macro actions, and how do they translate into concrete in-game behavior?
Why is reward shaping necessary here, and what goes wrong with naive rewards?
How was the baseline performance measured before training, and what did it show?
How does the system connect Stable Baselines 3 to StarCraft 2 despite process communication limits?
What does the training outcome indicate, and what limitation remains?
Review Questions
- How does the minimap brightness scheme represent uncertainty, and why might that matter for scouting and attacking decisions?
- Which reward components were tested to avoid stalling behavior, and why did combat-focused per-step rewards work better than resource-focused rewards?
- What engineering tradeoff does the shared pickle state file introduce, and how does it affect the RL training loop’s step timing?
Key Points
1. Convert StarCraft 2 into an RL problem by defining a compact observation (a handcrafted minimap) and a small discrete macro action set.
2. Encode partial observability in the minimap by using brightness to distinguish observed versus unseen minerals.
3. Use intermediate rewards to solve the credit-assignment problem created by thousands of steps per game; win/loss-only signals are too sparse.
4. Avoid reward signals that incentivize stalling, such as rewarding resource gathering or simply increasing army size without requiring victory.
5. Benchmark random-action play to confirm the task is hard and that later wins reflect learning rather than chance.
6. Integrate Stable Baselines 3 PPO with StarCraft 2 via a custom Gym environment and a shared state file to coordinate actions, observations, and rewards across processes.
7. Treat macro learning as a foundation and plan micro learning separately, since hand-coded hunting/coordinate logic underperformed.