
A. I. Learns to Play Starcraft 2 (Reinforcement Learning)

sentdex · 6 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Convert StarCraft 2 into an RL problem by defining a compact observation (a handcrafted minimap) and a small discrete macro action set.

Briefing

A reinforcement-learning agent can learn to play StarCraft 2 at least at the “macro” level by using a custom, simplified minimap representation as input and a small set of high-level actions—then training with a reward that favors sustained combat rather than resource hoarding or endless survival. After starting from a baseline where random action selection never wins, the training setup produced a model that reached roughly a 70% win rate against the hard computer bot, with peak training reward around 200. The result matters because it shows a workable path from game state to RL signals without needing full control of every in-game micro decision.

The approach begins by reframing StarCraft 2’s complexity into inputs, outputs, and reward. Instead of feeding raw video frames, the agent receives a “mini-map” built from a blank canvas: minerals, vespene gas, building health, the enemy starting location, enemy units and structures, and the player’s nexus, other structures, and unit positions. Visibility is encoded by brightness (minerals not currently observed by units appear dim), so the agent can infer uncertainty. Attack units (void rays) are colored distinctly to make combat-relevant information easier to parse.
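
As a rough illustration, a minimal sketch of such a canvas-based observation might look like the following; the record fields, colors, and per-pixel drawing are assumptions for illustration, not the video’s exact code:

```python
import numpy as np

def build_minimap(map_w, map_h, minerals, enemy_units, own_structures, void_rays):
    """Draw a simplified observation onto a blank canvas (BGR, uint8)."""
    canvas = np.zeros((map_h, map_w, 3), dtype=np.uint8)

    for m in minerals:
        # Brightness encodes visibility: unseen minerals are drawn dim.
        b = 255 if m["visible"] else 50
        canvas[int(m["y"]), int(m["x"])] = (b, b, b)

    for e in enemy_units:
        canvas[int(e["y"]), int(e["x"])] = (0, 0, 255)  # enemies in red

    for s in own_structures:
        canvas[int(s["y"]), int(s["x"])] = (0, 255, 0)  # own buildings in green

    for v in void_rays:
        canvas[int(v["y"]), int(v["x"])] = (255, 0, 0)  # void rays stand out

    return canvas
```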

On the action side, the agent is not controlling every unit command. It chooses among six macro actions implemented as explicit rules: (0) expand by ensuring supply, probes, assimilators, and a new base; (1) build stargates by creating required tech structures (gateways and cybernetics cores) and then stargates (currently capped at one per base); (2) build a void ray when affordable; (3) scout by sending a probe infrequently (throttled to once every 200 frames to avoid constant “suicide scouting”); (4) attack by prioritizing nearby units and buildings, otherwise moving void rays toward the enemy start location; and (5) retreat/fall back by returning void rays to base after fights.
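
A hedged sketch of that six-way dispatch; the handler methods (bot.expand, bot.scout, and so on) and the last_scout bookkeeping are hypothetical stand-ins for the rule-based routines described above:

```python
SCOUT_INTERVAL = 200  # frames between scouting probes, per the throttle above

async def take_action(bot, action: int, iteration: int):
    if action == 0:
        await bot.expand()          # supply, probes, assimilators, new nexus
    elif action == 1:
        await bot.build_stargate()  # gateways -> cybernetics core -> stargate
    elif action == 2:
        await bot.build_voidray()   # only when affordable
    elif action == 3 and iteration - getattr(bot, "last_scout", 0) > SCOUT_INTERVAL:
        bot.last_scout = iteration
        await bot.scout()           # one probe; avoids constant suicide scouting
    elif action == 4:
        await bot.attack()          # nearby units > buildings > enemy start
    elif action == 5:
        await bot.retreat()         # pull void rays back to base
```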

The training hinges on reward design, where naive choices can derail learning. A benchmark run of 200 games using random actions produced zero wins, confirming the task is nontrivial. For learning, the system avoids sparse “win/loss only” rewards because the agent faces thousands of steps per game and can’t easily assign credit. Intermediate rewards are used instead: the agent earns a small per-step reward for each void ray actively in combat, while a terminal reward (and punishment) is applied for winning or losing. Attempts to reward resource gathering or total army size tended to encourage stalling—lengthening games or building without finishing—so combat-focused reward proved more effective.
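
In code, that reward scheme could look roughly like this; the combat radius and the numeric constants are illustrative assumptions, not the video’s exact values:

```python
import math

def compute_reward(voidray_positions, enemy_positions, game_result=None):
    """Small per-step bonus per fighting void ray, plus terminal win/loss."""
    COMBAT_RANGE = 8.0  # assumed "in combat" radius
    reward = 0.0
    for vx, vy in voidray_positions:
        if any(math.hypot(vx - ex, vy - ey) < COMBAT_RANGE
               for ex, ey in enemy_positions):
            reward += 0.01       # per-step reward for each engaged void ray
    if game_result == "win":
        reward += 500.0          # terminal reward
    elif game_result == "loss":
        reward -= 500.0          # terminal punishment
    return reward
```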

Finally, the implementation connects StarCraft 2 to Stable Baselines 3 (using PPO) through a custom OpenAI Gym environment. Because the game loop and the RL loop can’t communicate directly in a clean way, the system coordinates two processes through a shared state file (pickle): the RL process writes actions into the file and blocks until a fresh observation appears, while the game process consumes those actions, executes commands in StarCraft 2, redraws the minimap, computes rewards, and signals episode completion. After training, the best model achieved about a 70% win rate versus the hard bot, marking a strong early milestone.
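
A minimal sketch of that wrapper, assuming the state file holds a dict with map, reward, action, and done fields; the file name, observation size, and run_game.py launcher are all hypothetical:

```python
import os
import pickle
import subprocess
import time

import gym
import numpy as np
from gym import spaces

STATE_FILE = "state.pkl"  # assumed path for the shared pickle


class Sc2Env(gym.Env):
    """Sketch of the shared-file wrapper: step() hands an action to the
    game process and blocks until it writes back a fresh state dict of
    the form {"map": ..., "reward": ..., "action": None, "done": ...}."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(6)  # the six macro actions
        self.observation_space = spaces.Box(
            low=0, high=255, shape=(224, 224, 3), dtype=np.uint8)  # assumed size

    def _wait_for_state(self):
        while True:
            try:
                with open(STATE_FILE, "rb") as f:
                    state = pickle.load(f)
                if state.get("action") is None:  # game has consumed our action
                    return state
            except (FileNotFoundError, EOFError, pickle.UnpicklingError):
                pass  # file missing or mid-write; retry shortly
            time.sleep(0.01)

    def step(self, action):
        with open(STATE_FILE, "wb") as f:
            pickle.dump({"action": int(action)}, f)
        state = self._wait_for_state()
        return state["map"], state["reward"], state["done"], {}

    def reset(self):
        if os.path.exists(STATE_FILE):
            os.remove(STATE_FILE)                    # drop any stale state
        subprocess.Popen(["python", "run_game.py"])  # hypothetical game script
        return self._wait_for_state()["map"]
```

Training then just hands this environment to Stable Baselines 3 (the timestep budget here is illustrative):

```python
from stable_baselines3 import PPO

model = PPO("CnnPolicy", Sc2Env(), verbose=1)
model.learn(total_timesteps=200_000)  # illustrative budget
```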

The next planned step is to improve beyond macro decisions—especially micro-level targeting and movement. Experiments with “hunting” void rays at random coordinates when no enemies are visible performed worse, suggesting that coordinate selection should likely become its own learned micro policy rather than hand-coded logic. The project is positioned as a foundation for combining macro RL with a separate micro RL module later.

Cornell Notes

The project turns StarCraft 2 into a reinforcement-learning problem by feeding a handcrafted minimap-like state into a PPO agent and restricting decisions to six macro actions (expand, stargates, build void rays, scout, attack, retreat). Minerals, gas, structures, and units are drawn onto a blank map with color coding; unseen resources are dim to reflect limited visibility. Training avoids sparse win/loss rewards by giving small per-step rewards when void rays are actively fighting, plus terminal rewards for winning or losing. Random-action play never wins in 200 trials, but the best trained model reaches about a 70% win rate against the hard computer bot. The work sets up a foundation for later micro-focused learning, such as learning where to move and attack rather than hand-coding it.

Why does the setup avoid using raw game frames, and what does the minimap representation encode?

Instead of feeding pixel video frames, the agent uses a simplified minimap built from a blank canvas sized to the game map. It draws minerals, vespene gas, building and unit health cues, enemy starting location, enemy units, enemy structures, the player’s nexus and other structures, and void rays. Visibility is encoded by brightness: resources not currently observed by units appear dim, while observed ones are brighter. This reduces input complexity and makes combat-relevant state (especially void rays and enemy presence) easier to learn from.

What are the six macro actions, and how do they translate into concrete in-game behavior?

Action 0 expands: it checks supply, builds probes (workers), adds assimilators for vespene, and then creates a new base at another resource location. Action 1 builds stargates: it ensures gateways and cybernetics cores exist, then constructs stargates (with a current rule limiting one stargate per base). Action 2 builds a void ray if affordable. Action 3 scouts by sending a probe infrequently—throttled to once every 200 frames. Action 4 attacks by prioritizing close units, then close buildings, then known units, then known buildings; if nothing is found, void rays move toward the enemy start location. Action 5 retreats by returning void rays to base after fights.

Why is reward shaping necessary here, and what goes wrong with naive rewards?

Games can involve 5,000–10,000+ action steps, so win/loss-only rewards are too sparse for the agent to learn which actions caused success. Intermediate rewards help assign credit. Rewards for resource gathering or total army size can cause stalling: the agent may prolong the game to collect more resources or build indefinitely without actually finishing. The chosen solution gives a small per-step reward for each void ray that is actively in combat, encouraging engagement and progress toward elimination, while still applying terminal rewards for winning or losing.

How was the baseline performance measured before training, and what did it show?

A benchmark run tested 200 games where actions were chosen randomly. The agent won zero games, establishing that random macro decisions are not sufficient and that any later wins reflect learning rather than luck.
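
Reproducing that benchmark is a short loop; this sketch reuses the hypothetical Sc2Env wrapper from earlier and assumes a win shows up as a large positive terminal reward:

```python
wins = 0
env = Sc2Env()
for game in range(200):
    obs, done, reward = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(env.action_space.sample())
    if reward > 0:  # large positive terminal reward marks a win (assumed)
        wins += 1
print(f"random baseline: {wins}/200 wins")
```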

How does the system connect Stable Baselines 3 to StarCraft 2 despite process communication limits?

A custom OpenAI Gym environment wraps the interaction. The environment’s reset method prepares a shared state file (pickle) containing the minimap observation, the reward/action fields, and a done flag. A separate process runs StarCraft 2 via the Blizzard API and waits for new actions written into the shared file. The RL training loop (Stable Baselines 3 PPO) calls the environment step method, which blocks until the game process updates the state with a new observation and computed reward. This shared-file approach is described as “spaghetified” but functional.
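
For the other side of the file, a game-process sketch might look like this; draw_minimap, compute_reward, and take_action are the hypothetical helpers from the earlier sketches, and a real version would also set done and a terminal reward when the match ends:

```python
import pickle

from sc2.bot_ai import BotAI  # python-sc2 (BurnySC2) base class

STATE_FILE = "state.pkl"  # must match the path used by the Gym wrapper


class VoidRayBot(BotAI):
    """Game-side sketch of the shared-file protocol. draw_minimap and
    compute_reward are assumed methods wrapping the earlier sketches."""

    async def on_step(self, iteration):
        # 1) Publish the current observation and reward for the RL loop.
        state = {"map": self.draw_minimap(), "reward": self.compute_reward(),
                 "action": None, "done": False}
        with open(STATE_FILE, "wb") as f:
            pickle.dump(state, f)

        # 2) Block until the RL loop writes an action back into the file.
        while True:
            try:
                with open(STATE_FILE, "rb") as f:
                    state = pickle.load(f)
                if state.get("action") is not None:
                    break
            except (EOFError, pickle.UnpicklingError):
                pass  # file was mid-write; retry

        # 3) Execute the chosen macro action (the dispatch sketch above,
        #    assumed importable as take_action).
        await take_action(self, state["action"], iteration)
```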

What does the training outcome indicate, and what limitation remains?

The best model peaked around a reward of ~200 and achieved about a 70% win rate against the hard computer bot, compared with 0% for random actions. The remaining limitation is that decisions are macro-level and void-ray combat behavior is still largely rule-driven; improving micro—like learning where to move and how to hunt beyond the enemy start location—likely requires a separate learned micro policy.

Review Questions

  1. How does the minimap brightness scheme represent uncertainty, and why might that matter for scouting and attacking decisions?
  2. Which reward components were tested to avoid stalling behavior, and why did combat-focused per-step rewards work better than resource-focused rewards?
  3. What engineering tradeoff does the shared pickle state file introduce, and how does it affect the RL training loop’s step timing?

Key Points

  1. Convert StarCraft 2 into an RL problem by defining a compact observation (a handcrafted minimap) and a small discrete macro action set.

  2. Encode partial observability in the minimap by using brightness to distinguish observed versus unseen minerals.

  3. Use intermediate rewards to solve the credit-assignment problem created by thousands of steps per game; win/loss-only signals are too sparse.

  4. Avoid reward signals that incentivize stalling, such as rewarding resource gathering or simply increasing army size without requiring victory.

  5. Benchmark random-action play to confirm the task is hard and that later wins reflect learning rather than chance.

  6. Integrate Stable Baselines 3 PPO with StarCraft 2 via a custom Gym environment and a shared state file to coordinate actions, observations, and rewards across processes.

  7. Treat macro learning as a foundation and plan micro learning separately, since hand-coded hunting/coordinate logic underperformed.

Highlights

  • A random-action baseline produced 0 wins across 200 games, setting a clear difficulty floor for the RL agent.
  • The minimap input is built from a blank canvas with color-coded minerals, gas, structures, and void rays, and dim brightness marks unseen resources.
  • Reward design centered on small per-step gains when void rays are actively fighting, avoiding stalling that came from resource- or size-based rewards.
  • Despite communication hurdles, PPO training worked by coordinating the RL loop and StarCraft 2 loop through a shared pickle state file.
  • The best trained model reached about a 70% win rate against the hard computer bot, showing macro RL can produce real competitive behavior.

Topics

Mentioned

  • RL
  • PPO
  • SC2
  • Gym
  • API