DeePMD-kit v3 adds a pluggable multi-backend framework supporting TensorFlow, PyTorch, JAX, and PaddlePaddle without breaking DeePMD-kit v2’s Python and C/C++ APIs.
Briefing
This paper addresses a practical but increasingly central bottleneck in machine learning potentials (MLPs) for atomistic simulation: most MLP software is tightly coupled to a single deep-learning framework (TensorFlow, PyTorch, JAX, or PaddlePaddle). As scientific workflows become more complex—mixing training pipelines, inference engines, differentiable force-field components, and simulation backends—developers and users face friction when components rely on different frameworks. The research question is therefore not about a new potential model’s physics accuracy, but about software architecture: can DeePMD-kit be redesigned so that the same user-facing interfaces (training, inference, molecular dynamics integration) work across multiple machine-learning backends, enabling interoperability and reducing integration costs?
The significance is twofold. First, it directly impacts reproducibility and usability: users can switch backends to match hardware constraints, performance needs, or ecosystem compatibility without rewriting workflows. Second, it enables “composability” of the broader MLP ecosystem. The authors position DeePMD-kit v3 as a hub that can integrate external models and differentiable force-field modules that may be implemented in different frameworks. In the broader context of materials science and computational chemistry, where end-to-end pipelines often span data generation (e.g., active learning), model training, and large-scale molecular dynamics (MD), backend interoperability is a prerequisite for scalable and maintainable research software.
Methodologically, the paper is primarily a software engineering and benchmarking study. The authors introduce DeePMD-kit v3’s multiple-backend framework, designed to be pluggable while preserving backward compatibility. They explicitly state that no breaking changes were made to the existing Python and C/C++ APIs from DeePMD-kit v2, so existing integrations with MD and workflow packages can continue to work. The framework uses a unified set of interfaces for users and developers; backend-specific implementations are invoked under the hood. A key engineering challenge is ensuring that the “same model” yields consistent results across backends. To address this, the authors implement serialization/deserialization and develop tests that compare outputs when models are converted between backends.
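To make the consistency requirement concrete, here is a minimal sketch of the idea, not DeePMD-kit's actual test suite: parameters are serialized into a framework-neutral structure, loaded by two "backends" (NumPy and PyTorch stand-ins below), and their outputs on identical inputs are compared within a tolerance.

```python
# Minimal sketch of a cross-backend consistency check; the model, parameter
# layout, and tolerances are illustrative, not DeePMD-kit's real ones.
import numpy as np
import torch

# framework-neutral "serialized" parameters (stand-in for a backend-independent format)
params = {"w": np.random.rand(4, 4), "b": np.random.rand(4)}

def eval_numpy(x: np.ndarray) -> np.ndarray:
    # reference implementation using plain NumPy
    return np.tanh(x @ params["w"] + params["b"])

def eval_torch(x: np.ndarray) -> np.ndarray:
    # the same serialized parameters loaded into a PyTorch implementation
    w, b = torch.from_numpy(params["w"]), torch.from_numpy(params["b"])
    return torch.tanh(torch.from_numpy(x) @ w + b).numpy()

x = np.random.rand(3, 4)
np.testing.assert_allclose(eval_numpy(x), eval_torch(x), rtol=1e-10)
print("backends agree on identical serialized parameters")
```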
The paper also describes several design principles that enable backend-agnostic development. They introduce metaprogramming to generate backend-specific classes from a reference implementation (the DP backend). They add an “atomic model” abstraction that decomposes learned quantities into sums of atomic contributions, enabling developers to implement only per-atom terms while the framework handles summation and derives forces and virials via derivatives with respect to coordinates and cell vectors. They further refactor computations that previously relied on custom TensorFlow operators (e.g., neighbor list and coordinate matrix gradients) so that other backends can use standard operators and automatic differentiation to support higher-order derivatives such as Hessians.
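A minimal sketch of the atomic-model idea, using a hypothetical toy per-atom energy rather than a real DeePMD descriptor: the developer supplies only per-atom energies, the framework sums them, and forces follow from differentiating the total energy with respect to coordinates (virials follow analogously from cell-vector derivatives).

```python
# Illustrative only: toy per-atom energy, not a DeePMD-kit model.
import torch

def atomic_energies(coords: torch.Tensor) -> torch.Tensor:
    """Toy per-atom energies E_i from an inverse-square pairwise term."""
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)                         # (N, N, 3) pair vectors
    r2 = (diff * diff).sum(-1) + torch.eye(len(coords), dtype=coords.dtype)  # mask the self term
    return 0.5 * (1.0 / r2).sum(dim=1)                                       # (N,) atomic contributions

coords = torch.rand(8, 3, dtype=torch.float64, requires_grad=True)
energy = atomic_energies(coords).sum()         # framework sums atomic contributions: E = sum_i E_i
(grad,) = torch.autograd.grad(energy, coords)  # dE/dr via automatic differentiation
forces = -grad                                 # F_i = -dE/dr_i
print(float(energy), forces.shape)
```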
A major practical extension is support for graph neural network (GNN) style models (e.g., DPA-2). For distributed/multi-GPU MD, the authors implement a customized C++ operator using MPI to exchange atom and edge features between processors or GPU cards, avoiding the high overhead of rebuilding neighbor lists for ghost atoms in an extended cutoff.
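The paper's operator is implemented in C++ inside the backends; the following is only a toy mpi4py illustration of the underlying halo-exchange pattern (the ring topology and array shapes are invented for the example), in which per-atom features are communicated between ranks instead of rebuilding ghost-atom neighbor lists over an extended cutoff.

```python
# Toy halo exchange of per-atom features between MPI ranks (run with mpirun -n 2 ...).
# Illustrates the communication pattern only; it is not the paper's C++ operator.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_features = np.full((4, 8), float(rank))   # (n_boundary_atoms, n_feature), invented shapes
recv_features = np.empty_like(local_features)

# exchange boundary-atom features with a neighboring rank (a ring stands in for the
# spatial-decomposition neighbors a LAMMPS run would use)
dest = (rank + 1) % size
source = (rank - 1) % size
comm.Sendrecv(sendbuf=local_features, dest=dest, recvbuf=recv_features, source=source)

print(f"rank {rank} received boundary features originating from rank {source}")
```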
For benchmarking, the authors evaluate MD performance (time per MD step in milliseconds) using DeePMD-kit v3 integrated with LAMMPS. They test three model families: DPA-1 without attention layers (L=0), DPA-1 with two attention layers (L=2), and DPA-2 (medium). Benchmarks are run in both single and double precision on three GPU types: NVIDIA H100 (80 GB), A800 (40 GB), and 4090 (24 GB). Each calculation is repeated 500 times to obtain an average speed. The benchmark uses a water system with varying atom counts, and results are reported in the main figures and supporting tables. The paper also notes backend limitations at the time of writing: only DPA-1 (L=0) supports model compression; TensorFlow does not support DPA-2; JAX does not support model compression; and PaddlePaddle is still under development.
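As a toy illustration of the averaging protocol only (the paper's numbers come from LAMMPS-driven MD runs, not a standalone loop like this), the per-step cost is the wall-clock time of repeated evaluations divided by the number of repeats:

```python
# Toy timing harness: average the cost of a stand-in "MD step" over 500 repeats.
import time
import numpy as np

def fake_md_step() -> None:
    a = np.random.rand(512, 512)   # stand-in for one force/energy evaluation
    _ = a @ a

n_repeat = 500
t0 = time.perf_counter()
for _ in range(n_repeat):
    fake_md_step()
ms_per_step = (time.perf_counter() - t0) / n_repeat * 1e3
print(f"{ms_per_step:.2f} ms/step")
```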
The key quantitative results are the relative performance trends across backends and model types, plus concrete examples of backend-dependent speed and memory behavior. From the supporting tables:
1) On the H100 (80 GB), for DPA-1 (L=0) in FP64 at 12288 atoms, the fastest backend is compressed TensorFlow (TFc) at 34.46 ms/step, compared with PyTorch at 41.09 ms/step and JAX at 64.76 ms/step. At 6144 atoms, TFc is 17.50 ms/step versus 21.47 for PyTorch and 34.19 for JAX.
2) On the H100 (80 GB), for DPA-1 (L=2) in FP64 at 12288 atoms, the fastest is TensorFlow at 14.09 ms/step (JAX 21.46, PyTorch 30.84). In FP32 at 12288 atoms, TensorFlow is 12.11 ms/step versus 18.43 for PyTorch and 51.97 for JAX.
3) For DPA-2 (medium) on the H100 in FP64 at 12288 atoms, the fastest backend is PyTorch at 217.21 ms/step, while TensorFlow is 286.22 ms/step and JAX runs out of memory (OOM) at that size. In FP32 at 12288 atoms, PyTorch is again fastest at 132.45 ms/step, with TensorFlow at 146.37 and JAX at 240.44.
4) On the A800 (40 GB), for DPA-2 in FP64 at 6144 atoms, TensorFlow is fastest at 231.34 ms/step, with JAX at 239.78 ms/step and PyTorch at 315.86 ms/step. At 12288 atoms in FP64, JAX and other backends are marked OOM for DPA-2 (the 69.48 ms/step TensorFlow figure at that size belongs to the DPA-1 (L=0) row of the table, not DPA-2), illustrating that memory constraints can dominate.
5) On the 4090 (24 GB), for DPA-1 (L=0) in FP32 at 1536 atoms, TensorFlow is 6.24 ms/step, PyTorch 7.87, and JAX 12.22. For DPA-1 (L=2) in FP32 at 3072 atoms, TensorFlow is 5.49 ms/step, PyTorch 8.23, and JAX 21.69.
Overall, the authors’ main performance claim is qualitative but supported by these numbers: no single backend consistently outperforms the others across all models and precisions. They state that for DPA-1 without attention layers, compressed TensorFlow is fastest; for more computationally demanding models (DPA-1 with attention and DPA-2), JAX generally achieves the highest performance, with explicit exceptions: DPA-2 FP32 at 12288 atoms on H100 where PyTorch outperforms JAX, and DPA-1 (L=2) FP64 on 4090 where TensorFlow outperforms JAX. They also emphasize that GPU memory usage differs by backend, affecting which backend is feasible for a given system size.
Limitations are acknowledged in two ways. First, the paper explicitly lists current feature gaps: TensorFlow backend lacks DPA-2 support, JAX lacks model compression, and PaddlePaddle is under development. Second, the benchmark scope is limited to a specific set of models (DPA-1 variants and DPA-2 medium), a water system, and performance measured as ms/step; the paper does not provide accuracy comparisons (e.g., energy/force errors) across backends, implying that the primary contribution is interoperability and performance engineering rather than new scientific results.
Practically, the implications are clear for different stakeholders. Researchers using DeePMD-kit can choose the backend that best matches their hardware and model type, and can integrate DeePMD-kit with external PyTorch/JAX/Paddle-based components. Developers can add new backends or models more easily due to the atomic model and descriptor block abstractions, and due to the Array API-based approach for backend implementation. The paper also demonstrates extensibility via plugins: DeePMD-GNN integrates external GNN potentials (MACE and NequIP) into the PyTorch backend, and a DMFF plugin integrates long-range interaction methods (Ewald summation and charge equilibration) into DeePMD-kit’s PyTorch backend, enabling hybrid MLP + classical long-range physics workflows.
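As a generic sketch of how such plugins can extend a framework (the registry and class names below are hypothetical, not DeePMD-kit's or DeePMD-GNN's actual API), an external package registers its model class under a string key so the host framework can instantiate it from an input file:

```python
# Hypothetical plugin registry; DeePMD-kit's real extension mechanism differs in detail.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(key: str):
    """Decorator an external plugin uses to expose a model class to the host framework."""
    def wrap(cls: type) -> type:
        MODEL_REGISTRY[key] = cls
        return cls
    return wrap

@register_model("external_gnn")
class ExternalGNNModel:
    """Stand-in for a wrapper around an external potential such as MACE or NequIP."""
    def __init__(self, cutoff: float = 6.0):
        self.cutoff = cutoff

# the host framework only needs the registered key (e.g., read from an input file)
model = MODEL_REGISTRY["external_gnn"](cutoff=5.0)
print(type(model).__name__, model.cutoff)
```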
In sum, DeePMD-kit v3’s core contribution is a multi-backend, interoperable architecture that preserves existing APIs while enabling backend switching, model conversion, and integration with external MLP and differentiable force-field packages. The benchmark results show that backend choice can materially affect MD throughput and feasibility under memory constraints, reinforcing the value of the multi-backend design for real-world simulation workflows.
Cornell Notes
DeePMD-kit v3 introduces a multi-backend framework that preserves DeePMD-kit’s existing user and developer interfaces while enabling interchangeable TensorFlow, PyTorch, JAX, and PaddlePaddle backends. The paper also demonstrates extensibility through plugins (external GNN potentials and DMFF long-range interactions) and benchmarks MD throughput across backends for DPA-1 and DPA-2 models on multiple GPUs.
What problem does DeePMD-kit v3 aim to solve?
It targets the integration difficulty caused by DeePMD-kit v2 being TensorFlow-based, which makes it hard to combine MLP components that rely on different deep-learning frameworks in complex MD workflows.
What study design or evaluation approach is used?
The paper combines a software-architecture and engineering study with performance benchmarking of MD simulations (ms/step) across multiple backends and GPU types, averaging 500 repeated runs per configuration.
How does the framework maintain a unified user interface across backends?
Users interact with backend-agnostic training/inference interfaces; backend-specific modules are invoked internally. Models are saved in backend-specific formats and backend detection/selection happens automatically at inference time.
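A hedged sketch of the detection step (the suffix-to-backend mapping is partly an assumption; .pb and .pth are the TensorFlow and PyTorch frozen-model suffixes used by DeePMD-kit, the JAX entry is illustrative):

```python
# Sketch of backend selection from the model file suffix; the mapping below is
# partly an assumption and not an exhaustive list.
from pathlib import Path

BACKEND_BY_SUFFIX = {
    ".pb": "tensorflow",    # frozen TensorFlow graph
    ".pth": "pytorch",      # frozen/TorchScript PyTorch model
    ".savedmodel": "jax",   # assumed: JAX model exported as a SavedModel
}

def detect_backend(model_path: str) -> str:
    suffix = Path(model_path).suffix
    try:
        return BACKEND_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no registered backend for suffix {suffix!r}") from None

print(detect_backend("water_frozen.pth"))  # -> pytorch
```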
How do the authors address consistency of results across backends?
They implement serialization/deserialization and backend conversion, and they develop tests that compare outputs when the same model is executed under different backends.
What is the “atomic model” design principle introduced in v3?
It assumes the learned quantity can be decomposed into atomic contributions, enabling the framework to sum atomic energies and compute forces/virials via derivatives with respect to coordinates and cell vectors.
How is higher-order differentiation (e.g., Hessians) supported in non-TensorFlow backends?
The framework replaces custom TensorFlow operators for neighbor list/coordinate matrix gradients with standard operators in each backend, leveraging automatic differentiation capabilities.
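A hedged sketch of this second-order-derivative point, using a toy energy and PyTorch's generic Hessian utility rather than DeePMD-kit code: once the energy is built from standard differentiable operators, a Hessian with respect to coordinates can be obtained by nesting automatic differentiation.

```python
# Toy energy only; demonstrates that standard operators allow nested autodiff.
import torch

def toy_energy(coords: torch.Tensor) -> torch.Tensor:
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)
    r2 = (diff * diff).sum(-1) + torch.eye(len(coords), dtype=coords.dtype)
    return 0.5 * (1.0 / r2).sum()

coords = torch.rand(4, 3, dtype=torch.float64)
hessian = torch.autograd.functional.hessian(toy_energy, coords)  # shape (4, 3, 4, 3)
print(hessian.shape)
```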
How does v3 handle GNN-style message passing efficiently in distributed/multi-GPU MD?
It implements a customized C++ operator using MPI to exchange atom and edge features between processors/GPUs, avoiding expensive ghost-atom neighbor-list rebuilding.
What benchmark setup is used to compare backends?
LAMMPS is interfaced with DeePMD-kit v3; benchmarks run DPA-1 (L=0), DPA-1 (L=2), and DPA-2 (medium) in FP32 and FP64 on H100 (80 GB), A800 (40 GB), and 4090 (24 GB), using a water system with varying atom counts and 500 repeats per configuration.
What is one concrete performance result showing backend dependence?
On H100 in FP64 for DPA-1 (L=0) at 12288 atoms, TF compressed (TFc) is 34.46 ms/step versus PyTorch 41.09 and JAX 64.76.
What is a key implication for users choosing a backend?
No single backend is always fastest; backend choice depends on model complexity, precision, and GPU memory limits. Users should benchmark or select the backend that best matches their model and hardware.
Review Questions
Explain how DeePMD-kit v3 preserves backward compatibility while adding multiple backends. What interfaces remain unchanged?
Describe the atomic model abstraction and how it leads to force and virial computation.
Why is supporting GNN message passing in multi-GPU MD nontrivial, and what solution does the paper implement?
Using at least one example from the tables, show how backend performance can reverse depending on model type and precision.
What current feature gaps (e.g., DPA-2 support, model compression support) limit the multi-backend system at the time of publication?
Key Points
1. DeePMD-kit v3 adds a pluggable multi-backend framework supporting TensorFlow, PyTorch, JAX, and PaddlePaddle without breaking DeePMD-kit v2’s Python and C/C++ APIs.
2. A unified interface lets users train, save, convert, and run inference without needing to know backend-specific details; models are serialized in backend-specific formats and backend detection occurs at inference.
3. The paper introduces design abstractions (metaprogramming, atomic model decomposition, descriptor block) to make model development backend-agnostic and maintainable.
4. Neighbor list/coordinate matrix computations are refactored to use standard operators in non-TensorFlow backends, enabling Hessian training via automatic differentiation.
5. For GNN models, v3 implements MPI-based feature exchange to support efficient multi-node/multi-GPU MD without expensive ghost-atom neighbor-list rebuilding.
6. Benchmarks on water systems show backend-dependent throughput and feasibility: e.g., on H100 FP64 for DPA-1 (L=0) at 12288 atoms, TF compressed is 34.46 ms/step vs PyTorch 41.09 and JAX 64.76.
7. The authors find no universal “best” backend across all models; memory usage differences can determine which backend can run large systems.
8. Extensibility is demonstrated via plugins: DeePMD-GNN (integrating MACE and NequIP into PyTorch) and a DMFF plugin (Ewald summation and QEq for long-range interactions).