The Magic Of ARM w/ Casey Muratori

The PrimeTime · 6 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Instruction-set semantics alone don’t justify “ARM is intrinsically lower power”; ABI and compiler choices can make assembly look similar across ARM and x86/x64 for simple code.

Briefing

ARM and x86/x64 aren’t fundamentally different “instruction sets for power”; the more substantive difference lies in how their machine code is encoded and decoded—and that decoding difference can matter for chip design. The most concrete hardware-level contrast comes from instruction encoding: x86/x64 uses variable-length instructions, while ARM (especially ARM64) uses fixed-size instruction encodings (with some ARM variants, like Thumb, introducing 16-bit forms). Variable-length encoding forces CPUs to do more complex decoding work to figure out where the next instruction begins, which complicates scaling to wider, multi-instruction-per-cycle decoders.

To make the comparison tangible, the discussion uses Compiler Explorer (godbolt.org), a tool that compiles the same C code for multiple targets and shows the resulting assembly and even the exact machine-code bytes. With a trivial function that just returns a value, the assembly looks nearly identical across architectures—the differences mostly come from ABI conventions governing where parameters and return values live. For example, under the x86-64 System V ABI (as assumed by the setup), the first integer parameter arrives in EDI and the return value goes back in EAX, so returning the first parameter may require a move. On ARM64, the first parameter and the return value share the same register (X0, or W0 for 32-bit values), so the compiler can avoid that move in simple cases; when returning a different parameter, ARM must move the value into the return register too. The takeaway: the “ARM is better for low power” claim can’t be justified by instruction semantics alone, because the ISA-level differences don’t obviously show up in these tiny examples.
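
A minimal version of the kind of function discussed (the name is illustrative, not taken from the video) makes the ABI point concrete:

```c
// Returning the first parameter. Under the x86-64 System V ABI, 'a' arrives
// in EDI but the return value must be in EAX, so the compiler typically
// emits "mov eax, edi" before "ret". On ARM64 (AAPCS64), 'a' arrives in W0
// and is also returned in W0, so no move is needed and the body is just "ret".
int return_first(int a, int b) {
    (void)b;  // unused; present only so the argument registers differ
    return a;
}
```

Pasting this into Compiler Explorer with an x86-64 and an ARM64 compiler side by side should show the move on one side and its absence on the other.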

The hardware-level difference appears when looking at the actual bytes. x86/x64 instructions can be compact (e.g., a one-byte return) or longer (e.g., multi-byte encodings for other operations), so instruction boundaries aren’t known until decoding. ARM’s encodings are more regular: instructions are typically a consistent size, making it easier for the CPU to decode multiple instructions in parallel by fixed offsets. The practical implication is that x86 decoders often rely on “guessing” instruction boundaries and then falling back when guesses fail, while ARM can feed a straightforward fixed-offset decoding pipeline. That extra decoding complexity translates into real engineering cost—more circuitry, more power, and more difficulty sustaining high decode throughput.
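
To illustrate the byte-level contrast, here is a one-parameter identity function with the encodings GCC or Clang commonly emit at -O2 noted in comments (a sketch for illustration; exact bytes depend on compiler and options):

```c
// x86-64 encoding (System V ABI), variable-length:
//   mov eax, edi   -> 89 F8        (2 bytes)
//   ret            -> C3           (1 byte)
// ARM64 encoding, fixed-length:
//   ret            -> C0 03 5F D6  (4 bytes; every instruction is 4 bytes)
int identity(int a) {
    return a;
}
```

The x86-64 version mixes 1- and 2-byte instructions, so the second instruction's start is only known after decoding the first; the ARM64 version's boundaries fall at multiples of 4 regardless of content.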

Still, the conversation pushes back on attributing ARM’s historical low-power advantage purely to ISA design. Modern ARM chips (including Snapdragon and Apple M-series) execute complex instruction sets via micro-ops, similar to x86; the idea that ARM is “reduced” in the meaningful sense doesn’t hold up. Instead, the stronger explanation is market and ecosystem inertia: ARM grew up in low-power niches, with many licensees optimizing for battery life and thermal limits, while x86/x64 grew up in higher-power performance markets. Even when x86/x64 later pivoted toward efficiency, hardware design can’t change overnight. The openness of ARM licensing is also framed as a structural advantage—more designers can build ARM-based low-power systems, increasing the volume of innovation in that direction.

In the end, the most defensible “why ARM can be lower power” story is narrower: fixed-length decoding can simplify and reduce the cost of scaling instruction decode, which may help at the margins. But the biggest historical gap is portrayed as business dynamics and design lineage rather than a single magic ISA feature.

Cornell Notes

ARM’s advantage in low power is often overstated as an ISA-level truth, but the discussion identifies a real hardware-relevant difference: ARM’s instruction encodings are more regular (often fixed-size), while x86/x64 uses variable-length instructions. Variable-length encoding forces CPUs to decode more intelligently to find instruction boundaries, which complicates scaling to wide, multi-instruction-per-cycle decoders. ABI conventions explain why simple assembly listings can look similar across ARM and x86/x64 even when the machine-code bytes differ. Historically, ARM’s low-power dominance is attributed less to ISA “magic” and more to market dynamics: ARM’s ecosystem evolved around low-power design, while x86/x64 evolved around performance and only later shifted toward efficiency.

Why do ARM and x86/x64 assembly listings look almost the same in simple examples, even though the underlying machine code differs a lot?

Because many visible differences come from ABI conventions, not from the core instruction semantics. In the example used with Compiler Explorer, returning a parameter depends on where the ABI places arguments and return values. On x86-64 (System V ABI assumed), the first argument arrives in EDI and the return value is in EAX, so returning the first parameter may require a move. On ARM64, the first argument and the return value both use X0/W0, so the compiler can omit that move in the simplest case. When returning a different parameter, ARM must move it into the return register too—making the assembly patterns converge again.
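
The converging case can be sketched as follows (function name is illustrative):

```c
// Returning the second parameter forces a move on both architectures:
//   x86-64 (System V): 'b' arrives in ESI, return value in EAX -> "mov eax, esi"
//   ARM64 (AAPCS64):   'b' arrives in W1, return value in W0   -> "mov w0, w1"
int return_second(int a, int b) {
    (void)a;  // unused
    return b;
}
```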

What is the biggest hardware-level ISA difference highlighted between x86/x64 and ARM?

Instruction encoding regularity. x86/x64 instructions are variable-length, so the CPU can’t know where the next instruction starts until it decodes the current one. ARM (especially ARM64) uses fixed-size instruction encodings, so instruction boundaries are predictable at fixed offsets. That predictability makes it easier to decode multiple instructions in parallel without complex boundary-guessing logic.

How does variable-length decoding affect CPU performance and power?

Modern CPUs chase “instructions per clock” by decoding multiple instructions each cycle. With variable-length x86/x64, the decoder must first determine instruction boundaries, often using speculative/guessing approaches with fallback paths for when guesses are wrong. That extra decode complexity costs silicon area and power, and it makes scaling decode width harder. ARM’s fixed offsets allow straightforward parallel decoding by replicating the decoder at fixed byte positions.
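
A toy model (not real x86 decoding) shows why boundary-finding is inherently sequential for variable-length encodings but trivially parallel for fixed-width ones. The encoding scheme here is hypothetical: assume each "instruction" stores its own length in its first byte.

```c
#include <stddef.h>

// Variable-length: the Nth boundary is only known after walking every
// preceding instruction, because each length depends on decoded content.
size_t nth_boundary_variable(const unsigned char *stream, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++)
        off += stream[off];  // must decode each instruction to find the next
    return off;
}

// Fixed-width (4-byte) encoding: every boundary is computable independently,
// so N decoders can each start at a known offset in the same cycle.
size_t nth_boundary_fixed(size_t n) {
    return n * 4;
}

// A sample variable-length stream: instructions of length 2, 1, and 3 bytes.
static const unsigned char demo_stream[] = {2, 0, 1, 3, 0, 0};
```

In hardware terms, the fixed-width function corresponds to wiring several copies of the decoder to consecutive 4-byte slots, while the variable-length loop corresponds to the boundary-guessing logic the discussion describes.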

Does the discussion claim ARM is always more power efficient because of the ISA?

No. It treats the decoding advantage as a plausible contributor, but argues it’s not enough to explain ARM’s large historical lead by itself. ARM chips still execute complex ISAs through micro-ops, much like x86/x64. The larger historical explanation is market and ecosystem: ARM’s licensing and early focus on low-power devices drove many designers to optimize for efficiency, while x86/x64’s lineage emphasized high performance and only later added stronger low-power efforts.

Why do “hundreds of instructions” exist in modern ARM, and does that contradict the idea of ARM being “reduced”?

The discussion frames instruction-set growth as a response to performance targets and system needs. CPUs moved from single-instruction/single-data toward SIMD-style operations (adding wide operations like vector arithmetic), which multiplies the number of instruction variants. Additionally, system control and security features require many instructions. Even if the ISA is large, the hardware cost is often managed via micro-ops, so having many instructions doesn’t mean the chip has a dedicated circuit for each one.
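
A loop like the following is the kind of code those SIMD instruction variants exist for (a generic sketch, not an example from the video): at -O2/-O3, compilers typically turn it into wide vector adds using SSE/AVX on x86-64 or NEON/SVE on ARM64.

```c
// Element-wise addition over float arrays; a textbook auto-vectorization
// target. One scalar "add" concept fans out into many ISA variants:
// different widths, element types, and predication forms.
void add_arrays(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Compiling this in Compiler Explorer with and without optimization shows the scalar and vector instruction forms side by side.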

What role does Compiler Explorer (godbolt.org) play in the comparison?

It provides fast, side-by-side compilation outputs across architectures and compilers, including assembly and the exact machine-code bytes. That lets the discussion move from “assembly looks similar” to “the actual encoding differs,” by showing how x86/x64 instruction lengths vary while ARM encodings are more regular. The tool also supports interactive changes (e.g., changing return values or data types) to see how ABI-driven moves appear.

Review Questions

  1. What specific mechanism makes x86/x64 decoding harder than ARM according to the discussion, and how does that relate to multi-instruction-per-cycle performance?
  2. How do ABI conventions explain differences in whether a compiler needs to move values into the return register on ARM versus x86-64?
  3. Why does the discussion argue that ARM’s historical low-power advantage can’t be fully explained by ISA differences alone?

Key Points

  1. Instruction-set semantics alone don’t justify “ARM is intrinsically lower power”; ABI and compiler choices can make assembly look similar across ARM and x86/x64 for simple code.

  2. The most direct hardware-relevant ISA contrast is instruction encoding regularity: ARM’s fixed-size encodings make instruction boundaries predictable, while x86/x64’s variable-length encodings require more complex decoding.

  3. Variable-length decoding complicates scaling to wide decoders because the CPU must determine where each instruction ends and the next begins, often involving speculative boundary handling.

  4. ARM’s historical low-power dominance is attributed more to market and ecosystem dynamics—ARM’s licensing and early focus on low-power designs—than to a single ISA feature.

  5. Modern ARM and x86/x64 both execute complex ISAs via micro-ops, so “reduced instruction set” is not a reliable explanation for power differences today.

  6. Binary size and instruction-cache behavior can still matter: larger, variable-length encodings can contribute to instruction-cache pressure and more frequent instruction-cache misses.

  7. Having many instructions is largely a byproduct of performance targets (e.g., SIMD/vector) and system/security requirements, with hardware cost managed through micro-op translation rather than dedicated circuitry per instruction.

Highlights

ABI conventions can make ARM and x86/x64 assembly look nearly identical in trivial functions; the real divergence shows up in machine-code encoding and decoding behavior.
x86/x64 variable-length instructions force CPUs to decode to find instruction boundaries, making wide parallel decoding harder than ARM’s fixed-offset approach.
ARM’s low-power reputation is framed as historically driven more by ecosystem and licensing than by ISA “magic,” since both sides rely on micro-ops for complex execution.
Compiler Explorer is used to connect the abstract ISA debate to concrete machine-code bytes, revealing how instruction length variability differs across architectures.

Topics

  • ARM vs x86-64
  • Instruction Encoding
  • CPU Decoding
  • ABI Conventions
  • Low-Power Hardware

Mentioned

  • Compiler Explorer
  • godbolt.org
  • Snapdragon
  • Apple M series
  • Casey Muratori
  • Matt Godbolt
  • Ken Shirriff
  • TJ
  • ABI
  • ISA
  • IPC
  • SIMD
  • NEON
  • SVE