Mind-bending new programming language for GPUs just dropped...
Based on Fireship's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to the channel.
Briefing
A new GPU-focused programming language called Bend is pitching a simple promise: write high-level, Python-like code and get parallel execution “for free,” without manually managing CUDA, blocks, locks, mutexes, or other low-level concurrency details. The pitch matters because parallel computing can turn week-long workloads into shorter runs by spreading work across many CPU cores—or even thousands of GPU cores—but doing that correctly is notoriously hard. Bend’s central claim is that it can automatically run anything that can be parallelized in parallel, avoiding the fragile, error-prone choreography that typically comes with multithreading and GPU programming.
Under the hood, Bend’s approach centers on representing computations as a graph of “interaction combinators,” a structure that organizes the steps of a program into nodes and rewrite rules. When nodes interact, the runtime repeatedly rewrites the computation according to those rules, enabling parts of the work to proceed simultaneously. Once the graph has been fully reduced, results are merged back into the expression returned from the function. This model traces back to interaction combinators research from the 1990s, but Bend wraps it in a higher-level language so developers don’t have to work directly with the underlying runtime.
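To make the rewriting model concrete, consider a sketch in Bend's Python-like syntax (the function below is a hypothetical illustration, not code from the video): the two recursive calls share no data, so each becomes an independent subgraph that the runtime can rewrite simultaneously before the final addition merges the results.

```
# Hypothetical illustration: the two recursive calls are independent,
# so the runtime can reduce each subgraph in parallel and merge the
# results at the final `+` node once both sides are fully rewritten.
def work(n):
  if n == 0:
    return 1
  else:
    return work(n - 1) + work(n - 1)

def main():
  return work(10)
```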
Bend interfaces with a runtime called the Higher Order Virtual Machine (HVM). The language itself is implemented in Rust and uses syntax described as similar to Python, including a straightforward "Hello World" style example with a main function returning a string. Code is run with a command like `bend run`, which by default uses a Rust interpreter and executes sequentially, like a conventional language.
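Based on Bend's documented Python-like syntax, that "Hello World" example looks roughly like this (a sketch consistent with the docs, not a verbatim copy from the video):

```
# hello.bend -- a main function that returns a string, which
# `bend run` evaluates and prints
def main():
  return "Hello, world!"
```

Saving this as `hello.bend` and invoking `bend run hello.bend` would execute it on the sequential Rust interpreter.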
The performance story shifts when the same algorithm is executed with different backends. Bend reportedly removes the need for explicit loops: instead of a `for` loop, it uses a `fold`, described as a data-type “search and replace” that can consume recursive structures such as lists or trees in parallel. A complementary `bend` keyword is used to construct the recursive data type that `fold` will consume.
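As an illustration of the two keywords, here is a sketch following the patterns in Bend's documentation (the type name `MyTree`, its field names, and the depth cutoff are assumptions, not code shown in the video): `fold` consumes the tree by replacing each constructor with the matching branch, and `bend` builds the tree that `fold` consumes.

```
# A recursive binary tree type; `~` marks the recursive fields.
type MyTree:
  Node { ~left, ~right }
  Leaf { value }

# `fold` pattern-matches each constructor; the recursive fields
# (left/right) arrive already folded, so independent subtrees can
# be consumed in parallel.
def sum(tree):
  fold tree:
    case MyTree/Node:
      return tree.left + tree.right
    case MyTree/Leaf:
      return tree.value

# `bend` grows the recursive structure: each `fork` spawns a branch.
def main():
  bend depth = 0, x = 1:
    when depth < 3:
      tree = MyTree/Node { left: fork(depth + 1, x * 2), right: fork(depth + 1, x * 2 + 1) }
    else:
      tree = MyTree/Leaf { value: x }
  return sum(tree)
```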
In a benchmark-style demonstration, an algorithm that counts and sums values takes more than 10 minutes when run on a single thread. Running the identical code with `bend run-c`, the parallel CPU backend, reportedly uses all 24 CPU threads and drops the runtime to about 30 seconds. The same code then reportedly accelerates further on an Nvidia RTX 4090 via `bend run-cu`, the CUDA backend, reaching roughly 1.5 seconds without modifying the algorithm. The takeaway is less a new programming trick than a workflow shift: parallelism is treated as a default execution property rather than a manual engineering task.
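For reference, the parallel-sum benchmark in Bend's own README has roughly this shape (reproduced from memory of Bend's documentation, so treat the details as approximate); the point is that the same file runs unchanged under `bend run`, `bend run-c`, and `bend run-cu`:

```
# Sums the numbers 0 .. 2^28 - 1 as an implicit binary tree of
# additions; every `fork` is an independent subgraph that the CPU
# or CUDA backend can reduce in parallel.
def main():
  bend d = 0, i = 0:
    when d < 28:
      sum = fork(d + 1, i * 2 + 0) + fork(d + 1, i * 2 + 1)
    else:
      sum = i
  return sum
```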
Cornell Notes
Bend is a Python-like programming language aimed at parallel execution on CPUs and GPUs without requiring developers to write low-level concurrency or CUDA code. It represents programs as graphs of “interaction combinators,” then repeatedly rewrites those graphs using rules that enable parallel progress. The runtime behind it is the Higher Order Virtual Machine (HVM), while Bend provides a higher-level interface implemented in Rust. Bend also avoids traditional loops by using `fold` over recursive data types, which can be consumed in parallel. In a performance demo, the same counting-and-summing algorithm runs far faster when executed with CPU and CUDA backends, dropping from 10+ minutes (single thread) to ~30 seconds (24 CPU threads) and then to ~1.5 seconds on an Nvidia RTX 4090.
What problem is Bend trying to solve, and why is parallelism hard in practice?
How does Bend’s execution model work at a high level?
What role does the Higher Order Virtual Machine (HVM) play?
Why does Bend avoid traditional loops, and what replaces them?
What does the performance demonstration claim across CPU and GPU backends?
Review Questions
- How does representing a computation as an interaction-combinator graph enable parallel execution compared with a single-threaded execution model?
- What is the relationship between `fold` and recursive data types in Bend, and how does that replace traditional looping?
- Why might a developer prefer Bend’s approach over writing CUDA or explicit multithreaded code for GPU acceleration?
Key Points
1. Bend’s core promise is automatic parallel execution for code that can be parallelized, without requiring manual CUDA or concurrency primitives.
2. Computations are modeled as graphs of interaction combinators, and execution proceeds through rewrite rules that allow parallel progress.
3. The Higher Order Virtual Machine (HVM) is the underlying runtime, while Bend provides a Python-like interface implemented in Rust.
4. Bend avoids traditional loops by using `fold` to consume recursive data types (like lists or trees) in a parallel-friendly way.
5. A single algorithm can be run sequentially, on CPU threads, or on a CUDA GPU backend using different `bend run-*` commands, without rewriting the algorithm.
6. A benchmark-style example reports large speedups: from 10+ minutes (single thread) to ~30 seconds (24 CPU threads) to ~1.5 seconds on an Nvidia RTX 4090 (CUDA).