
Processing 100+ GBs Of Data In Seconds Using Polars GPU Engine

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Enable GPU acceleration by passing engine="gpu" to Polars' collect() so supported expressions run on Nvidia hardware.

Briefing

Polars’ GPU engine can cut multi-step data preprocessing workloads from seconds to hundreds of milliseconds by offloading supported operations to Nvidia GPUs—while still keeping a safety net when a query can’t run on the GPU. In side-by-side tests on a Tesla T4 in Google Colab, simple aggregations dropped from ~1.88 seconds on CPU to ~294 ms on GPU, and more complex group-by and join pipelines showed similar order-of-magnitude speedups.

The workflow starts with installing Polars 1.5.0 from a provided wheel build, then adding GPU-related dependencies (including pynvml) and visualization tooling (HoloViews and hvPlot, plus related Jupyter components). The setup also includes a pragmatic data-sizing step: it pulls a simulated financial transactions dataset from Kaggle, but uses only 20% of the data when GPU memory is under 24 GB, reducing storage and iteration friction.

Once the data is loaded lazily (via Polars’ scan_parquet) and operations are expressed as query plans, the key performance lever is the engine selection inside collect(). For CPU runs, collect() executes normally. For GPU runs, collect() is called with engine="gpu", which attempts to execute supported expressions on the GPU; if something isn’t supported, Polars can fall back to CPU automatically. A stricter mode is also demonstrated: setting the raise_on_fail flag to True forces an error instead of fallback, useful for catching unsupported operations early.

The transcript walks through several benchmarks. Summing transaction amounts (total spending) completes in ~1.88 seconds on CPU but in ~294 ms on GPU. Grouping by customer_id, summing amount, and sorting by the totals takes ~8.67 seconds on CPU and ~344 ms on GPU. Even SQL-style queries—translated through Polars’ SQL interface into an execution plan—show similar gains: a “top 5 customers by total spend” query runs in ~7.53 seconds on CPU versus ~371 ms on GPU.

The biggest gains appear when the workload grows in complexity. A “largest single transaction per customer” style aggregation drops from ~7.85 seconds (CPU) to ~390 ms (GPU). An end-to-end pipeline that joins cleaned rainfall data (2010–2020) with transaction aggregates, including type casting for messy date fields, takes ~19.2 seconds on CPU but ~554 ms on GPU.

Finally, the transcript highlights what happens when GPU support is incomplete. Operations like rolling_mean are not GPU-executable, so Polars either falls back to CPU (when fallback is allowed) or raises an error (when strict mode is enabled). The practical takeaway is that Polars can accelerate real preprocessing and visualization pipelines on Nvidia hardware without forcing every transformation to be GPU-compatible.

Cornell Notes

Polars’ GPU engine can accelerate data preprocessing by running supported operations on Nvidia GPUs and falling back to CPU when needed. In benchmarks on a Tesla T4, a simple sum of transaction amounts dropped from about 1.88 seconds on CPU to about 294 ms on GPU. More complex workloads—group-by + sort, SQL-style “top 5” queries, max-per-customer aggregations, and even join-heavy pipelines with type casting—also fell from many seconds to hundreds of milliseconds. The engine is enabled via the engine parameter in collect(), and strict mode can be turned on to raise errors for unsupported GPU operations. This makes it practical for production-style preprocessing where not every transformation is GPU-ready.

How does Polars decide whether to run an operation on the GPU or CPU?

GPU execution is triggered by passing engine="gpu" to collect(). Polars attempts to execute supported expressions on the GPU; if a particular operation can’t run on the GPU, Polars can automatically fall back to CPU. A stricter option is demonstrated with the raise_on_fail flag (set to True), which prevents fallback and instead raises an error when GPU execution isn’t possible.

What benchmark results show the magnitude of speedup?

For a total spending aggregation (sum of the amount column), CPU time is about 1.88 seconds while GPU time is about 294 ms. For customer-level totals (group by customer_id, sum amount, sort descending), CPU is about 8.67 seconds versus about 344 ms on GPU. For a SQL-style “top 5 customers by total spend” query, CPU is about 7.53 seconds versus about 371 ms on GPU. A join-heavy pipeline combining rainfall data with transaction aggregates drops from about 19.2 seconds on CPU to about 554 ms on GPU.

Why does the transcript limit the dataset size to 20% sometimes?

The setup checks available GPU memory using pynvml. If memory is less than 24 GB, it downloads only 20% of the Kaggle dataset to avoid storage and memory constraints; if memory is sufficient, it uses the full dataset. This keeps preprocessing experiments feasible in a notebook environment.
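The sizing decision can be sketched as a plain function. The real notebook queries the device via pynvml; here the free-memory figure is passed in so the logic runs anywhere (threshold and fractions follow the transcript):

```python
GB = 1024 ** 3

def dataset_fraction(gpu_memory_bytes: int, threshold_gb: int = 24) -> float:
    """Return the fraction of the Kaggle dataset to download."""
    if gpu_memory_bytes < threshold_gb * GB:
        return 0.20  # small GPU (e.g. a 16 GB Tesla T4): take a 20% subset
    return 1.0       # enough memory: use the full dataset

print(dataset_fraction(16 * GB))  # 0.2
print(dataset_fraction(40 * GB))  # 1.0
```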

How are SQL queries handled when using the GPU engine?

Polars supports both a DataFrame-style API and an SQL interface. The transcript shows a SQL query (select customer_id, sum(amount) as sum_amount, group by customer_id, order by sum_amount desc, limit 5) executed through Polars’ SQL interface, which compiles the query into the same kind of execution plan as the DataFrame API. When collect() is called with the GPU engine, that plan can run on the GPU where supported.

What kinds of operations may still require CPU execution?

Some transformations aren’t GPU-executable in the demonstrated setup; rolling_mean is given as an example. When fallback is allowed, Polars runs such steps on the CPU; when strict mode is enabled (raise_on_fail=True), the same operation fails instead of falling back.

Review Questions

  1. In what exact place in the Polars workflow does the GPU engine get enabled, and what happens when an operation isn’t supported on the GPU?
  2. Compare the CPU vs GPU timings for at least two different workloads mentioned (e.g., sum aggregation, group-by + sort, join pipeline). What pattern do they share?
  3. How does strict mode (raising on unsupported GPU operations) change debugging and reliability compared with automatic CPU fallback?

Key Points

  1. Enable GPU acceleration by passing engine="gpu" to Polars' collect() so supported expressions run on Nvidia hardware.

  2. Expect large speedups for aggregations, group-by + sort, and join-heavy preprocessing pipelines—often moving from multi-second CPU runtimes to hundreds of milliseconds on GPU.

  3. Use a memory-aware data download strategy (via pynvml) to select the full dataset or a 20% subset depending on available GPU memory.

  4. Polars supports both DataFrame-style operations and an SQL interface; both are translated into execution plans that benefit from the GPU engine when supported.

  5. Strict mode can raise errors for unsupported GPU operations instead of silently falling back to CPU, helping catch performance regressions early.

  6. Some operations (like rolling_mean) may remain CPU-only; plan preprocessing steps accordingly or rely on fallback behavior.

Highlights

A simple sum of transaction amounts fell from ~1.88 seconds on CPU to ~294 ms on GPU using Polars’ GPU engine on a Tesla T4.
Customer-level group-by + sort dropped from ~8.67 seconds to ~344 ms, showing that the speedup persists beyond trivial aggregations.
A join-heavy pipeline with type casting and rainfall/transaction integration went from ~19.2 seconds on CPU to ~554 ms on GPU.
Strict GPU mode can be configured to fail loudly when an operation (e.g., rolling_mean) can’t run on the GPU.
