
Why Github Why?

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

A timing “sleep” implemented as a tight Bash while-loop can become infinite if elapsed-time checks miss the intended boundary under system load.

Briefing

GitHub Actions has been plagued for years by a runaway “sleep” loop in its runner code—an error that can peg CPU at 100%, stall CI pipelines, and rack up real cloud costs. The core failure traces to a small Bash construct meant to wait until a target time, but under certain timing conditions it can miss the stop condition and continue indefinitely, turning a supposed delay into a zombie process that keeps burning compute.

The transcript links the problem to the historical evolution of the runner’s “safe sleep” logic. Early public code used Windows-friendly workarounds: if a native sleep command wasn’t available, it would approximate a delay by running ping (which waits roughly one second between packets) and, in some variants, even “echo” repeatedly into a null sink to consume time. That kind of hacky timing approach is tolerable as long as it actually yields control back to the system. The later version, however, relied on a tight while-loop that checks elapsed time via Bash’s special SECONDS variable. If the loop’s timing drifts, because the system is busy enough that successive checks skip over the expected boundary value, an exact-match exit condition can never become true and the loop fails to terminate. Worse, because the loop is actively spinning rather than sleeping, it consumes an entire CPU core for the duration of what should have been a short wait.
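As a rough sketch of the flawed pattern (a reconstruction for illustration, not the runner’s actual code): an exact-match exit check against Bash’s SECONDS variable, with nothing inside the loop that yields the CPU.

```shell
# Hypothetical reconstruction of the buggy "safe sleep" pattern.
# SECONDS is derived from wall-clock time, so under load successive
# reads can jump from, say, 2 straight to 4. With an exact-match
# exit check, a target of 3 is then never observed and the loop
# spins forever, burning a full CPU core.
safe_sleep_buggy() {
  local target=$1
  SECONDS=0                       # reset Bash's elapsed-time counter
  while [ "$SECONDS" -ne "$target" ]; do
    :                             # busy-wait: nothing here yields the CPU
  done
}
```

Calling `safe_sleep_buggy 5` usually returns after about five seconds, but only if the loop happens to read SECONDS while it equals exactly 5.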

That behavior matters because GitHub Actions runners typically have limited CPU resources. If one job burns CPU while “waiting,” it reduces available capacity for other jobs, causing queues to back up. The transcript claims this manifested as CI systems so backed up that even commits to the master branch could fail to get checked.

A specific fix is described as changing the termination logic from a strict inequality to a less-than-or-equal comparison, which prevents the loop from running forever when timing overshoots. The transcript cites an example where a single runner ran for 5,135 hours stuck in the loop. Since GitHub CI is billed per CPU minute, the estimated cost for that one process is put at about $2,400—illustrating how a tiny logic bug can translate into thousands of dollars in wasted spend.
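In loop form, the described fix amounts to replacing the exact-match check with a range comparison, so an overshoot still satisfies the exit condition. A minimal sketch (my reconstruction, not the actual diff):

```shell
# Corrected exit logic: loop only while SECONDS is below the target.
# Any value at or past the boundary terminates the loop, so skipping
# a second under load can no longer trap the process forever.
safe_sleep_fixed() {
  local target=$1
  SECONDS=0
  while [ "$SECONDS" -lt "$target" ]; do
    :   # still a busy-wait that burns CPU, but now bounded in duration
  done
}
```

Note that this only fixes the non-termination; the loop still spins at 100% CPU for the duration of the wait instead of sleeping.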

Despite the fix, the transcript criticizes the long delay between patching and merging, describing issues that lingered for years and were only resolved after extended periods of neglect. It also points to additional Actions-related defects, including a “failed to hash files” problem and other messy code patterns that allegedly introduced new complexity and bugs.

Overall, the transcript frames the situation as a mismatch between the platform’s scale—millions of projects depend on it—and the apparent sloppiness in its runner code and maintenance process. The takeaway is blunt: a single flawed while-loop in CI infrastructure can silently degrade reliability and inflate costs, and the harm persists when fixes don’t land promptly or cleanly.

Cornell Notes

GitHub Actions runner code contains a timing bug that can turn a short “sleep” into an infinite CPU-burning loop. The failure comes from a Bash while-loop that waits until elapsed time reaches a target; if timing drifts, the loop may never satisfy its exit condition. When the loop spins instead of using a real sleep mechanism, it can consume an entire CPU core on a runner, backing up CI queues and preventing even new commits from running. A later correction adjusts the loop’s comparison logic (using a less-than-or-equal style check) to stop the runaway behavior. The transcript emphasizes the cost impact: one stuck runner reportedly ran 5,135 hours, costing roughly $2,400 based on per-CPU-minute billing.

What specific coding pattern can cause a “sleep” function to run forever in CI infrastructure?

A tight Bash while-loop that waits for elapsed time using the SECONDS variable can fail to terminate if its checks skip over the expected boundary. If the condition expects an exact threshold (or uses a strict inequality) and the system is busy enough that the loop’s checks “jump over” the stopping point, the exit condition may never become true. In that case, the loop spins continuously, consuming CPU rather than yielding.

Why does CPU spinning matter more on CI runners than on a typical workstation?

CI runners have limited compute. If a runner has only a couple of CPU cores and one job burns a core at 100% while “waiting,” it reduces capacity for other jobs. That throttling compounds over time: more jobs queue up, and pipelines can become so delayed that new commits (including those to master) may not get processed promptly.

How do timing workarounds like ping-based delays relate to the bug described?

Earlier code used Windows-friendly hacks: if sleep wasn’t available, it approximated a delay by pinging once per second and counting packets. That approach can still behave like a delay without necessarily spinning at full CPU. The transcript contrasts that with a later “safe sleep” that uses an elapsed-time while-loop; when that loop spins and misses its stop condition, it becomes a runaway process.
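A sketch of that ping-based workaround (a generic illustration, not the original script; the Windows packet-count flag is -n, while Unix ping uses -c):

```shell
# Approximate an N-second delay without a sleep binary: ping waits
# about one second between packets, so N+1 packets to the loopback
# address take roughly N seconds while yielding the CPU between sends.
delay_with_ping() {
  local seconds=$1
  ping -c "$((seconds + 1))" 127.0.0.1 > /dev/null 2>&1
}
```

The key difference from the while-loop variant is that ping blocks between packets, so the waiting process consumes essentially no CPU.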

What change is described as fixing the infinite loop, and why would it help?

The transcript says the fix changed the loop’s termination logic to a less-than-or-equal style comparison instead of a strict, exact-match check. That adjustment makes the loop exit even when elapsed time overshoots the target, preventing the “missed boundary” scenario that can otherwise lead to an infinite run.

How can a small CI bug translate into measurable money loss?

Because CI billing is tied to CPU usage over time. The transcript cites an example where a single runner ran for 5,135 hours in the stuck loop. At a rate of about $0.008 per CPU minute, that one process works out to roughly $2,400 in wasted compute, before considering knock-on effects like delayed builds and additional queued jobs.
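The arithmetic is easy to verify, assuming a rate of about $0.008 per CPU-minute (the assumption here; it is the figure consistent with the roughly $2,400 total cited):

```shell
# 5,135 hours of a stuck runner, billed per CPU-minute.
minutes=$((5135 * 60))                    # 308,100 CPU-minutes
rate_thousandths_per_min=8                # $0.008/min written as 8/1000 dollars
cost_dollars=$((minutes * rate_thousandths_per_min / 1000))
echo "$cost_dollars"                      # 2464, i.e. roughly $2,400
```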

What other Actions issues are mentioned beyond the runaway loop?

The transcript briefly highlights a “failed to hash files” issue attributed to a diff merged two weeks prior, and it also criticizes other code patterns described as overly complex or refactored in ways that allegedly introduced additional bugs. These are presented as further evidence of broader maintenance and code-quality problems.

Review Questions

  1. How can a while-loop that checks elapsed time using SECONDS fail to terminate under load?
  2. What economic mechanism turns a CPU-spinning CI bug into thousands of dollars of cost?
  3. Why might a comparison operator change (e.g., strict vs less-than-or-equal) prevent an infinite loop in timing code?

Key Points

  1. A timing “sleep” implemented as a tight Bash while-loop can become infinite if elapsed-time checks miss the intended boundary under system load.

  2. When the loop spins instead of sleeping, it can consume 100% of a CPU core on a CI runner, reducing capacity for other jobs.

  3. Runner CPU starvation can back up CI queues enough to delay or block even master-branch checks.

  4. A fix described as adjusting the loop’s comparison logic (to a less-than-or-equal style) can prevent overshoot from turning the loop into a zombie process.

  5. The transcript estimates significant direct costs from runaway runners by applying per-CPU-minute billing to multi-day stuck processes.

  6. Long delays between identifying fixes and merging them can prolong outages and wasted spend even after a patch exists.

  7. Additional Actions defects (like failed hashing) are cited as part of a broader pattern of reliability and code-quality concerns.

Highlights

A “sleep” meant for delays can turn into an infinite CPU-burning loop when timing drift prevents the exit condition from ever being met.
The described fix hinges on changing the loop’s termination comparison so overshoot doesn’t trap the process forever.
One cited incident claims a single runner ran 5,135 hours, translating to roughly $2,400 in wasted CPU time under per-CPU-minute billing.
CI backlogs can become severe enough that even master branch commits may not get checked in time.
