
RAPIDS, a suite of NVIDIA CUDA-X libraries for Python data science, released version 25.06, introducing exciting new features. These include a Polars GPU streaming engine, a unified API for graph neural networks (GNNs), and acceleration for support vector machines with zero code changes required. In this blog post, we’ll explore a few of these updates.
Polars GPU engine updates
In September 2024, we worked with the Polars team to launch a Polars GPU engine built on top of NVIDIA cuDF. The 25.06 release brings significant updates to the Polars GPU engine's capabilities.
Streaming executor is now experimentally available
With the 25.06 release, we introduced streaming execution in the Polars GPU engine. The streaming executor uses data partitioning and parallel processing to enable execution on larger-than-VRAM datasets. To use the new streaming executor, pass an appropriately configured GPUEngine object to the Polars collect() call:
from polars import GPUEngine

streaming_engine = GPUEngine(
    executor="streaming",
    executor_options={"scheduler": "synchronous"},
)
# `query` is an existing Polars LazyFrame
result = query.collect(engine=streaming_engine)
This new streaming mode also enables users to scale data processing workflows to multiple GPUs. This can help accelerate analytics operations on datasets that scale from hundreds of GBs to TBs. For operations that require data movement between partitions (such as joins and groupbys), a new shuffle mechanism handles redistributing data between devices. Multi-GPU execution is orchestrated through the Dask distributed scheduler and requires first setting up a Dask client:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start a local cluster with one Dask worker per available GPU
client = Client(LocalCUDACluster())

multi_gpu_streaming = GPUEngine(
    executor="streaming",
    executor_options={"scheduler": "distributed"},
)
result = query.collect(engine=multi_gpu_streaming)
The streaming executor is currently under active development, and unsupported operations will fall back to using the in-memory executor. To learn more, check out our recent blog and NVIDIA GTC Paris talk.
Support for rolling aggregations and more column manipulations
The latest release also includes support for key new DataFrame functionality in the Polars GPU engine. First, we added support for .rolling() operations, which let users create rolling groups based on another column in their DataFrame. This is especially useful when working with time series datasets.
import polars as pl

dates = [
    "2025-01-01 13:45:48",
    "2025-01-01 16:42:13",
    "2025-01-01 16:45:09",
    "2025-01-02 18:12:48",
    "2025-01-03 19:45:32",
    "2025-01-08 23:16:43",
]
df = (
    pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]})
    .with_columns(pl.col("dt").str.strptime(pl.Datetime()))
    .lazy()
)
query = df.rolling(index_column="dt", period="2d").agg(
    pl.sum("a").alias("sum_a"),
    pl.min("a").alias("min_a"),
    pl.max("a").alias("max_a"),
)
query.collect(engine="gpu")
Additionally, the GPU engine now supports a wider class of expressions for manipulating datetime columns. Newly supported methods include .strftime() and .cast_time_unit(), and more are planned in upcoming releases as we continue to expand overall API coverage.
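As a minimal sketch (reusing the lazy df with the datetime column dt from the rolling example above), these expressions run through the GPU engine like any other:

# Format timestamps as strings and cast the underlying time unit,
# both executed by the GPU engine
formatted = df.select(
    pl.col("dt").dt.strftime("%Y-%m-%d").alias("dt_str"),
    pl.col("dt").dt.cast_time_unit("ms").alias("dt_ms"),
).collect(engine="gpu")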
Unified API for GNNs
NVIDIA cuGraph-PyG has further integrated WholeGraph for accelerated feature fetching, creating what we refer to as the Unified API. This API enables users to take advantage of WholeGraph's accelerated feature storage in single-GPU workflows, while eliminating the need to modify scripts for multi-GPU or multi-node workflows.
With the Unified API, the same GNN training script used for prototyping on a single GPU works on a single node with multiple GPUs, and on multiple nodes. The torchrun command from PyTorch manages process setup, making the Unified API familiar to most PyTorch users, as sketched below.
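As an illustrative sketch (the script name train.py and the node and GPU counts are our own assumptions), only the torchrun launch arguments change between scales:

# Single GPU (prototyping)
torchrun --nproc_per_node=1 train.py

# One node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Two nodes, 8 GPUs each (standard torchrun rendezvous settings)
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=head-node:29500 train.py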
Check out our examples to learn how to get started.
Zero code change cuML enhancements
In March, we launched our zero-code-change accelerator for scikit-learn, powered by cuML, available in open beta. With the 25.06 release, cuML brings even more zero-code-change functionality to users.
Support vector machines with zero code change
NVIDIA cuML expanded its zero-code-change acceleration capabilities with the addition of support vector machines. Support Vector Classification (SVC) and Support Vector Regression (SVR) are powerful algorithms that handle high-dimensional data well and can see significant speed-ups when executed on the GPU. With the addition of these estimators to cuML’s zero-code-change interface, existing scikit-learn workflows that leverage support vector machines can now be accelerated with no modifications needed. Note that there are some key differences between the cuML and scikit-learn implementations of SVC and SVR that users should be aware of.
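As a minimal sketch of the zero-code-change workflow (the synthetic dataset below is our own), an unmodified scikit-learn script is simply launched through the cuml.accel module:

# svm_example.py -- unchanged scikit-learn code; run it on the GPU with:
#   python -m cuml.accel svm_example.py
# (in a notebook, use %load_ext cuml.accel instead)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(clf.score(X_test, y_test))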
To learn more about cuML’s zero-code-change support for various algorithms, visit our documentation.
Improved scikit-learn compatibility
The 25.06 release also includes a significant under-the-hood redesign of how cuML accelerates scikit-learn estimators. This change improves scikit-learn parity, with better parameter validation and improved error handling. The redesign also enhances compatibility with scikit-learn's API, making it even easier to accelerate third-party libraries that integrate with scikit-learn today.
Random Forest integration with the updated Forest Inference Library
cuML’s Random Forest estimators (RandomForestRegressor and RandomForestClassifier) received an upgrade with the integration of the faster and more robust Forest Inference Library (FIL). This implementation delivers higher performance and better memory management while maintaining backward compatibility.
Users should note that some current API knobs specific to the previous implementation are now deprecated and will be removed in the upcoming 25.08 release.
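Because the change is under the hood, existing Random Forest code continues to work unchanged. As an illustrative sketch (the synthetic data is our own), inference in code like the following is now served by the updated FIL:

import cupy as cp
from cuml.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X = cp.random.random((10_000, 20), dtype=cp.float32)
y = (X[:, 0] > 0.5).astype(cp.int32)

clf = RandomForestClassifier(n_estimators=100, max_depth=8)
clf.fit(X, y)
preds = clf.predict(X)  # prediction runs through the integrated FIL backend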
RAPIDS Memory Manager compatibility with NVIDIA Blackwell decompression engine
The RAPIDS Memory Manager (RMM) library added new features to ensure users can take advantage of the latest NVIDIA hardware. In our latest release, the RMM async memory resource added support for the hardware-based decompression engine on compatible NVIDIA Blackwell GPUs. With the decompression engine, users can see performance improvements in IO-heavy workflows.
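For context, here is a minimal sketch of opting into the async memory resource from Python; on supported Blackwell GPUs, this is the resource through which the decompression engine is exposed (no additional configuration is shown here):

import rmm

# Make the stream-ordered, cudaMallocAsync-backed resource the default
# allocator for the current device
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())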
Additionally, RMM is now a precompiled shared library rather than a header-only library. We anticipate that this change will unlock new features in the future.
Platform updates: Python and NVIDIA CUDA support
The 25.06 release added support for Python 3.13 to all RAPIDS libraries and is also the final release to support CUDA 11. Beginning with the 25.08 RAPIDS release, support for CUDA 11 will be removed. Users who want to continue using CUDA 11 can pin to 25.06. This is documented in RSN 48.
Conclusion
The RAPIDS 25.06 release brings zero-code-change functionality to new machine learning algorithms, a new Polars GPU streaming engine, hardware decompression capabilities for async memory resources, and more.
We welcome your feedback on GitHub. You can also join the 3,500+ members of the RAPIDS Slack community to talk GPU-accelerated data processing.
If you’re new to RAPIDS, check out these resources to get started. To learn more about accelerated data science, explore our DLI Learning Path and enroll in a hands-on course.