Mind-bending new programming language for GPUs just dropped...
Based on Fireship's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to the channel.
Briefing
A new GPU-focused programming language called Bend is pitching a simple promise: write high-level, Python-like code and get parallel execution “for free,” without manually managing CUDA, blocks, locks, mutexes, or other low-level concurrency details. The pitch matters because parallel computing can turn week-long workloads into shorter runs by spreading work across many CPU cores—or even thousands of GPU cores—but doing that correctly is notoriously hard. Bend’s central claim is that it can automatically run anything that can be parallelized in parallel, avoiding the fragile, error-prone choreography that typically comes with multithreading and GPU programming.
Under the hood, Bend’s approach centers on representing computations as a graph of “interaction combinators,” a structure that organizes the steps of a program into nodes and rewrite rules. When nodes interact, the runtime repeatedly rewrites the computation according to those rules, enabling parts of the work to proceed simultaneously. Once the graph has been fully reduced, results are merged back into the expression returned from the function. This model traces back to interaction combinators research from the 1990s, but Bend wraps it in a higher-level language so developers don’t have to work directly with the underlying runtime.
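To make the rewriting model concrete, consider a sketch in Bend's Python-like syntax (the function below is a hypothetical illustration, not code from the video): the two recursive calls share no data, so each becomes an independent subgraph that the runtime can rewrite simultaneously before the final addition merges the results.

```
# Hypothetical illustration: the two recursive calls are independent,
# so the runtime can reduce each subgraph in parallel and merge the
# results at the final `+` node once both sides are fully rewritten.
def work(n):
  if n == 0:
    return 1
  else:
    return work(n - 1) + work(n - 1)

def main():
  return work(10)
```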
Bend interfaces with a runtime called the Higher Order Virtual Machine (HVM). The language itself is implemented in Rust and uses syntax described as similar to Python, including a straightforward "Hello World" style example with a main function returning a string. Code is run with a command like `bend run`, which by default uses a Rust interpreter and executes sequentially, like a conventional language.
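Based on Bend's documented Python-like syntax, that "Hello World" example looks roughly like this (a sketch consistent with the docs, not a verbatim copy from the video):

```
# hello.bend -- a main function that returns a string, which
# `bend run` evaluates and prints
def main():
  return "Hello, world!"
```

Saving this as `hello.bend` and invoking `bend run hello.bend` would execute it on the sequential Rust interpreter.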
The performance story shifts when the same algorithm is executed with different backends. Bend reportedly removes the need for explicit loops: instead of a `for` loop, it uses a `fold`, described as a data-type “search and replace” that can consume recursive structures such as lists or trees in parallel. A complementary `bend` keyword is used to construct the recursive data type that `fold` will consume.
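As an illustration of the two keywords, here is a sketch following the patterns in Bend's documentation (the type name `MyTree`, its field names, and the depth cutoff are assumptions, not code shown in the video): `fold` consumes the tree by replacing each constructor with the matching branch, and `bend` builds the tree that `fold` consumes.

```
# A recursive binary tree type; `~` marks the recursive fields.
type MyTree:
  Node { ~left, ~right }
  Leaf { value }

# `fold` pattern-matches each constructor; the recursive fields
# (left/right) arrive already folded, so independent subtrees can
# be consumed in parallel.
def sum(tree):
  fold tree:
    case MyTree/Node:
      return tree.left + tree.right
    case MyTree/Leaf:
      return tree.value

# `bend` grows the recursive structure: each `fork` spawns a branch.
def main():
  bend depth = 0, x = 1:
    when depth < 3:
      tree = MyTree/Node { left: fork(depth + 1, x * 2), right: fork(depth + 1, x * 2 + 1) }
    else:
      tree = MyTree/Leaf { value: x }
  return sum(tree)
```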
In a benchmark-style demonstration, an algorithm that counts and sums values takes more than 10 minutes when run on a single thread. Running the identical code with `bend run-c`, the parallel CPU backend, reportedly uses all 24 CPU threads and drops the runtime to about 30 seconds. The same code then reportedly accelerates further on an Nvidia RTX 4090 via `bend run-cu`, the CUDA backend, reaching roughly 1.5 seconds without modifying the algorithm. The takeaway is less a new programming trick than a workflow shift: parallelism is treated as a default execution property rather than a manual engineering task.
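For reference, the parallel-sum benchmark in Bend's own README has roughly this shape (reproduced from memory of Bend's documentation, so treat the details as approximate); the point is that the same file runs unchanged under `bend run`, `bend run-c`, and `bend run-cu`:

```
# Sums the numbers 0 .. 2^28 - 1 as an implicit binary tree of
# additions; every `fork` is an independent subgraph that the CPU
# or CUDA backend can reduce in parallel.
def main():
  bend d = 0, i = 0:
    when d < 28:
      sum = fork(d + 1, i * 2 + 0) + fork(d + 1, i * 2 + 1)
    else:
      sum = i
  return sum
```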
Cornell Notes
Bend is a Python-like programming language aimed at parallel execution on CPUs and GPUs without requiring developers to write low-level concurrency or CUDA code. It represents programs as graphs of “interaction combinators,” then repeatedly rewrites those graphs using rules that enable parallel progress. The runtime behind it is the Higher Order Virtual Machine (HVM), while Bend provides a higher-level interface implemented in Rust. Bend also avoids traditional loops by using `fold` over recursive data types, which can be consumed in parallel. In a performance demo, the same counting-and-summing algorithm runs far faster when executed with CPU and CUDA backends, dropping from 10+ minutes (single thread) to ~30 seconds (24 CPU threads) and then to ~1.5 seconds on an Nvidia RTX 4090.
What problem is Bend trying to solve, and why is parallelism hard in practice?
How does Bend’s execution model work at a high level?
What role does the Higher Order Virtual Machine (HVM) play?
Why does Bend avoid traditional loops, and what replaces them?
What does the performance demonstration claim across CPU and GPU backends?
Review Questions
- How does representing a computation as an interaction-combinator graph enable parallel execution compared with a single-threaded execution model?
- What is the relationship between `fold` and recursive data types in Bend, and how does that replace traditional looping?
- Why might a developer prefer Bend’s approach over writing CUDA or explicit multithreaded code for GPU acceleration?
Key Points
1. Bend’s core promise is automatic parallel execution for code that can be parallelized, without requiring manual CUDA or concurrency primitives.
2. Computations are modeled as graphs of interaction combinators, and execution proceeds through rewrite rules that allow parallel progress.
3. The Higher Order Virtual Machine (HVM) is the underlying runtime, while Bend provides a Python-like interface implemented in Rust.
4. Bend avoids traditional loops by using `fold` to consume recursive data types (like lists or trees) in a parallel-friendly way.
5. A single algorithm can be run sequentially, on CPU threads, or on a CUDA GPU backend using different `bend run-*` commands, without rewriting the algorithm.
6. A benchmark-style example reports large speedups: from 10+ minutes (single thread) to ~30 seconds (24 CPU threads) to ~1.5 seconds on an Nvidia RTX 4090 (CUDA).