ADAPT: Lightweight, Long-Range Machine Learning Force Fields Without Graphs

Evan Dramko Department of Computer Science, Rice University, Houston, TX, USA Yihuang Xiong Rice Advanced Materials Institute, Rice University, Houston, TX, USA Yizhi Zhu Department of Materials Science and Nanoengineering, Rice University, Houston, TX, USA Rice Advanced Materials Institute, Rice University, Houston, TX, USA Thayer School of Engineering, Dartmouth College, Hanover, NH, USA Geoffroy Hautier Corresponding author: gh55@rice.edu Department of Materials Science and Nanoengineering, Rice University, Houston, TX, USA Rice Advanced Materials Institute, Rice University, Houston, TX, USA Thayer School of Engineering, Dartmouth College, Hanover, NH, USA Thomas Reps Department of Computer Sciences, University of Wisconsin–Madison, Madison, WI, USA Christopher Jermaine Department of Computer Science, Rice University, Houston, TX, USA Anastasios Kyrillidis Corresponding author: ak85@rice.edu Department of Computer Science, Rice University, Houston, TX, USA

Abstract

Point defects play a central role in driving the properties of materials. First-principles methods are widely used to compute defect energetics and structures, including at scale for high-throughput defect databases. However, these methods are computationally expensive, making machine-learning force fields (MLFFs) an attractive alternative for accelerating structural relaxations. Most existing MLFFs are based on graph neural networks (GNNs), which can suffer from oversmoothing and poor representation of long-range interactions. Both of these issues are especially of concern when modeling point defects. To address these challenges, we introduce the Accelerated Deep Atomic Potential Transformer (ADAPT), an MLFF that replaces graph representations with a direct coordinates-in-space formulation and explicitly considers all pairwise atomic interactions. Atoms are treated as “tokens,” with a Transformer encoder modeling their interactions. Applied to a dataset of silicon point defects, ADAPT achieves a $\sim 33\%$ reduction in both force and energy prediction errors relative to a state-of-the-art GNN-based model, while requiring only a fraction of the computational cost.

1 Introduction

First-principles computations offer a powerful way to compute and predict materials and molecular structure and energetics. However, these physics-based approaches have a substantial computational cost. Machine learning force fields (MLFFs)—also referred to as machine learning interatomic potentials (MLIPs)—present a computationally efficient alternative. MLFFs often exhibit runtimes orders of magnitude lower than Density Functional Theory (DFT), making them increasingly considered in materials-discovery pipelines. MLFFs leverage large datasets to build a function approximating the original DFT calculations.

State-of-the-art MLFFs are often graph-based and equivariant neural networks (GNNs) [1, 2], excelling on bulk datasets and many chemistry tasks [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. GNNs often excel when training data is scarce; exactly the situation with expensive DFT trajectories. GNN MLFF are experiencing intense and rapid developments with for instance the introduction of specialized attention mechanisms [13, 6] and higher-order information in message passing [3].

GNNs have been considered to compute point-defect properties, which are usually simulated on a large periodic supercell with an isolated defect center. The first approaches focused on fitting GNNs to defect-formation energies data [14, 15], but more recent work has used MLFFs to compute forces and accelerate first-principles atomic relaxation [16]. However, challenges in directly applying GNNs to point defects have been raised. For instance, one work [17] suggested modifying GNNs to focus on the local defect region to combat oversmoothing [18]. We also note that defect computations typically involve large supercells of hundred to thousands of atoms, and are computationally demanding for the message-passing algorithms used in GNNs. Recent work [19] showed success on a GNN “one-hop” initial-to-relaxed approach for defects in 2D materials. Such an approach though might require prohibitive amounts of data [20, 21, 22] for use in complicated 3D complex defect trajectories.

Consideration of only local interactions is inherent to graph architectures; however, non-local interactions play a vital role in the structural formation of defects. Inspired by the success of Transformers [23] in natural language [24], computer vision [25], and computational biology [26], we explore an alternative to directly handle such relationships: a coordinate-based Transformer with attention computed over all possible atom interactions, trained to predict per-atom forces from raw Cartesian coordinates and atomic features. This new approach is referred to as Accelerated Deep Atomic Potential Transformer (ADAPT), and is trained on a DFT database of defects in silicon, primarily consisting of complex defects. We show that ADAPT achieves state-of-the-art performance (both in energy and forces), outperforming pretrained universal MLFFs, such as MACE [3] and MatterSim [5], as well as MACE retrained on the same data set. Further, ADAPT demonstrates a training cost two orders of magnitude lower than message-passing architectures.

2 Results

In contrast to MACE [3] and related model architectures, ADAPT employs distinct networks for predicting atomic forces and structure energies. As mentioned before, both proposed architectures eschew graphs and inductive biases entirely, instead focusing on precise representations of geometries. Our primary aim is to develop force and energy predictors tailored for defect computations, with the longer-term objective of bypassing costly DFT relaxations altogether.

ADAPT adopts the now standard tokenization paradigm [27] from deep learning of breaking inputs into sequences of tokens. Here, each token corresponds to a single atom, so a structure with $n$ atoms is represented by $n$ tokens. Every token is initially a 12-dimensional vector containing:

(x,y,z,\text{column},\text{row},\chi,r_{\text{cov}},N_{\text{val}},E_{\text{ion}_{1}},E_{\text{EA}},r_{\text{atom}},V_{\text{mol}}),

where we define $x,y,z$ as the coordinates of the atom, column is the atom’s group, row is the atom’s period, $\chi$ is the electronegativity, $r_{\text{cov}}$ is the covalent radius, $N_{\text{val}}$ is the number of valence electrons, $E_{\text{ion}_{1}}$ is the first ionization energy of the atom, $E_{\text{EA}}$ is the electron affinity, $r_{\text{atom}}$ is the atomic radius, and $V_{\text{mol}}$ is the molar volume. These specific descriptors are used because they were naturally present in the raw data. Determining the best set of descriptors remains an open problem. ADAPT has been designed to predict the forces and energy for structures that are simulated on computations in a supercell. We consider defect computations in silicon as our motivating example. Full details on the training are available in Supplementary Material Section B.

2.1 Force-Prediction Methodology

Herein, we consider the model architecture used to predict per-atom force vectors, as shown in Figure 1. It can be viewed as a function mapping each token to a corresponding force vector.

Embedding. Rather than working in the native 12-dimensional space, we embed each token into a higher-dimensional space of size $d_{\text{model}}$ (a user-set hyperparameter). High-dimensional representations enable neural networks to map complex nonlinear dynamics into spaces where linear and simple nonlinear transformations suffice to approximate the underlying oracle function¹¹1The oracle function denotes the assumed true generative function of the real world from which the data originates..

A multi-layer perceptron (MLP) [28] is used to learn the embedding transformation, and can be represented as:

\texttt{MLP}(\mathbf{x})=\mathbf{W}_{k}\sigma\Bigl(\mathbf{W}_{k-1}\sigma\bigl(\dots\sigma(\mathbf{W}_{0}\mathbf{x}+\mathbf{b}_{0})\dots\bigr)+\mathbf{b}_{k-1}\Bigr)+\mathbf{b}_{k},

(1)

where $\mathbf{x}\in\mathbb{R}^{12}$ is the input token, $\sigma$ is the element-wise ReLU operation,²²2ReLU $(x)=\max(0,x)$ , $\mathbf{b}_{j}\in\mathbb{R}^{d_{\text{out},j}}$ are the trainable bias terms, and $\mathbf{W}_{j}\in\mathbb{R}^{d_{\text{out},j}\times d_{\text{in},j}}$ are learnable weight matrices. Here $d_{\text{in},0}=12$ , and $\mathbf{W}_{k}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{out},k-1}}$ . The embedding MLP is applied independently to each token.

[Uncaptioned image] — Figure 1: ADAPT architecture. Each of $n$ atoms is embedded via an identically weighted MLP, passed through a stack of Attention-based encoder blocks, and linearly projected from $(n\times d_{\text{model}})$ to $(n\times 3)$ force vectors.

2.1.1 Transformer Encoder

The embedded sequence is processed by $k$ encoder blocks. Each block has the same structure but distinct parameters. A block is defined by:

$\displaystyle\mathbf{H}_{1}$	$\displaystyle=\texttt{LN}\bigl(\mathbf{X}_{\text{in}}+\texttt{Attn}(\mathbf{X}_{\text{in}})\bigr),$	(2)
$\displaystyle\mathbf{H}_{2}$	$\displaystyle=\texttt{FFN}\bigl(\texttt{LN}(\mathbf{H}_{1})\bigr),$	(3)
$\displaystyle\mathbf{X}_{\text{out}}$	$\displaystyle=\texttt{LN}(\mathbf{H}_{2}+\mathbf{H}_{1}).$	(4)

The main components are:

$(i)$ Layer Normalization (LN). This is used to ensure numeric stability in training, and prevent the chaining together of multiplied terms from growing or shrinking rapidly. Given an input $\mathbf{x}\in\mathbb{R}^{H}$ , layer norm normalizes across feature channels:

	$\displaystyle\mu$	$\displaystyle=\tfrac{1}{H}\sum_{i=1}^{H}x_{i},$	$\displaystyle\sigma^{2}$	$\displaystyle=\tfrac{1}{H}\sum_{i=1}^{H}(x_{i}-\mu)^{2},$		(5)
	$\displaystyle\hat{x}_{i}$	$\displaystyle=\frac{x_{i}-\mu}{\sqrt{\sigma^{2}+\epsilon}},$	$\displaystyle y_{i}$	$\displaystyle=\gamma_{i}\hat{x}_{i}+\beta_{i},\quad i=1,\dots,H,$		(6)

where $\bm{\gamma},\bm{\beta}\in\mathbb{R}^{H}$ are learnable parameters and $\epsilon$ is a small constant for stability.

$(ii)$ (Multiheaded) Scaled Dot-Produce Attention (Attn). In the model, this is the only place where the tokens³³3Recall each token corresponds to an atom. interact and influence each other. In multiheaded attention, each “head” performs an Attention operation over a subset of the data. Given $\mathbf{X}\in\mathbb{R}^{n\times d_{\text{model}}}$ (sequence length $n$ ), each head $i=1,\dots,h$ is defined by:

\text{head}_{i}=\operatorname{softmax}\left(\frac{\mathbf{Q}_{i}\mathbf{K}_{i}^{\mathsf{T}}}{\sqrt{d_{k}}}\right)\mathbf{V}_{i},

(7)

where

\mathbf{Q}_{i}=\mathbf{X}\mathbf{W}_{\mathbf{Q}_{i}},\quad\mathbf{K}_{i}=\mathbf{X}\mathbf{W}_{\mathbf{K}_{i}},\quad\mathbf{V}_{i}=\mathbf{X}\mathbf{W}_{\mathbf{V}_{i}},

(8)

with projection matrices $\mathbf{W}_{\mathbf{Q}_{i}},\mathbf{W}_{\mathbf{K}_{i}},\mathbf{W}_{\mathbf{V}_{i}}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}$ . The raw similarity matrix $\mathbf{Q}_{i}\mathbf{K}_{i}^{\mathsf{T}}\in\mathbb{R}^{n\times n}$ encodes pairwise token similarities. The row-wise softmax⁴⁴4Softmax: $\text{softmax}(\mathbf{z}_{i})=\frac{e^{\mathbf{z}_{i}}}{\sum_{j=1}^{n}e^{\mathbf{z}_{j}}}$ maps each row into a probability distribution over tokens.

Outputs from all heads are concatenated and projected:

\texttt{Attn}(\mathbf{X})=\operatorname{Concat}(\text{head}_{1},\dots,\text{head}_{h})\mathbf{W}_{O},

(9)

with $\mathbf{W}_{O}\in\mathbb{R}^{hd_{k}\times d_{\text{model}}}$ .

$(iii)$ Feed-Forward Network (FFN). FFNs work on individual tokens independently, and do not allow any interactions between tokens. They allow for expressive transformations of the token beyond what Attention alone can capture. A position-wise MLP, applied identically to each token:

\texttt{FFN}(\mathbf{H})=\mathbf{W}_{2}\operatorname{ReLU}\bigl(\mathbf{W}_{1}\mathbf{H}^{\mathsf{T}}+\mathbf{b}_{1}\bigr)+\mathbf{b}_{2},

(10)

where

\mathbf{H}\in\mathbb{R}^{n\times d},\quad\mathbf{W}_{1}\in\mathbb{R}^{d_{\text{ff}}\times d},\quad\mathbf{W}_{2}\in\mathbb{R}^{d\times d_{\text{ff}}},\quad\mathbf{b}_{1}\in\mathbb{R}^{d_{\text{ff}}},\quad\mathbf{b}_{2}\in\mathbb{R}^{d}.

$(iv)$ Dropout. Dropout randomly masks neuron activations (set to $0$ ), resampled at each pass during training. This has been shown to prevent models from overfitting to the data, and improve generalizability. It is applied to the outputs of attention and feed-forward layers. Following convention, we exclude it from the equations for the model definition since it is only used during training and not inference.

Force Projection. Finally, after the encoder blocks, forces are obtained by a linear projection:

\mathbf{\widehat{y}}=\mathbf{X}_{\text{enc}}\mathbf{W}_{\text{out}},\quad\mathbf{W}_{\text{out}}\in\mathbb{R}^{d_{\text{model}}\times 3},

(11)

producing per-token force vectors $(f_{x},f_{y},f_{z})$ . The resulting tensor has shape $n\times 3$ . Appendix C covers standard Transformer computations in further detail.

2.1.2 Handling Imbalance in Scaling

In crystalline defects, we see that there is a substantial disparity between the scale of forces in the local area of the defects, and in the bulk lattice. A similar imbalance occurs across atomic feature magnitudes, where certain descriptors (see Section 2.1) differ by several orders of magnitude. Such imbalance in the scale of features is known to cause issues in the training of NNs [29, 30]. This disparity motivates the use of a specialized loss function, as discussed below.

Loss Function. Training requires a differentiable objective that captures the mismatch between predicted and true atomic forces. A natural baseline is the mean‑squared error (MSE). Plain MSE, however, does not bias towards any one atom implicitly, even though domain knowledge tells us that atoms nearest the defects dominate the crystal’s mechanical response.

To emphasize these critical regions, we introduce a new loss function: “importance‑weighted MSE.” In particular, we create an importance mask $\mathbf{m}\in\mathbb{R}_{+}^{n}$ , where each of the $n$ atoms, $a_{i}$ , receives weight:

m_{i}=\prod_{j\in\mathcal{D}}\Bigl(1+\frac{\lambda_{1}}{\lVert\mathbf{r}_{i}-\mathbf{r}_{j}\rVert^{2}+\lambda_{2}}\Bigr),\quad\mathcal{{D}}=\{\text{defects}\},

(12)

where ${\mathcal{D}}$ is the set of defect locations⁵⁵5The formulation used herein does not consider vacancies, but could easily be modified to do so if necessary., and $\mathbf{r}_{i}$ is the coordinate vector for atom $i$ . This is similar to laws observed in nature, where the effect of many interactions decay as a power law of the distance between them.⁶⁶6An alternative weighting would be $\sum\ln{\frac{1}{\lVert\mathbf{r}_{i}-\mathbf{r}_{j}\rVert^{2}+\lambda_{2}}}.$ (13) We experimented with Eq. (13), but found that for silicon defects Eq. (12) gave better results. It is possible that Eq. (13) would perform better in some applications.It is possible that other weighting rules perform well; we present one that worked well for our training data. Hyperparameters $\lambda_{1},\lambda_{2}$ are used to ensure numerical stability and to “temper” the scaling. The resulting loss becomes:

\mathcal{L}(\mathbf{\widehat{y}},\mathbf{y})=\sum_{i}m_{i}\sum_{j}({\widehat{y}}_{i,j}-{y}_{i,j})^{2}

Where $\mathbf{y},\mathbf{\widehat{y}}$ are the actual and predicted forces for each of the atoms (indexed $i$ ) and across each of the $3$ components of the force vectors (indexed $j$ ).

where the force vectors predicted by the model is denoted $\mathbf{\widehat{y}}$ , and we have actual force vectors $\mathbf{y}$ . While this weighting produces comparable—but often slightly worse— $\mathcal{L}_{2}$ error as a plain MSE loss function, we find that it performs better when we consider practical use of the network. Section 2.3 details this difference.

2.2 Energy Prediction

We train a separate formation energy-predictor model to complement the MLFF. For this task, we consider three distinct architectures: (1) a decoder E, (2) a multilayer perceptron (MLP) 2.1, and (3) an MLP $+$ residual network. In each case, the model receives only the atomic structure and returns an estimated crystal energy. Architectures (1) and (2) serve as natural baselines; the decoder as a single-output is the natural extension of the encoder framework, and the MLP is a widely used approach [31, 32, 33]. Architecture (3), however, substantially outperforms both, and we adopt it as our primary design.

2.2.1 MLP $+$ Residual Architecture

Residuals connections, where the input and output of a layer are added together, have become widespread in ML literature. It has been noted that the residual architecture bears striking resemblance to Euler integration [34, 35] making it a common choice [36, 37, 38] when considering modeling physical systems which are governed by differential equations. The architecture of a MLP with residual connections for raw input tokens $\mathbf{x}$ is:

	$\displaystyle\mathbf{t}_{0}$	$\displaystyle=\sigma(\mathbf{W}_{0}\mathbf{x}+\mathbf{b}_{0})$
	$\displaystyle\mathbf{h}_{0}$	$\displaystyle=\texttt{LN}(\mathbf{P}_{0}\mathbf{t}_{0}+\mathbf{t}_{0})$
	$\displaystyle\mathbf{t}_{1}$	$\displaystyle=\sigma(\mathbf{W}_{1}\mathbf{h}_{0}+\mathbf{b}_{1})$
	$\displaystyle\mathbf{h}_{1}$	$\displaystyle=\texttt{LN}(\mathbf{P}_{1}\mathbf{t}_{1}+\mathbf{t}_{1})$
		$\displaystyle~~\vdots$
	$\displaystyle\mathbf{\widehat{y}}$	$\displaystyle=\mathbf{W}_{k}\mathbf{h}_{k}+\mathbf{b}_{k}$

where $\mathbf{W}_{i},\mathbf{P}_{i},\mathbf{b}_{i}$ are learnable weight matrices/vectors of any mathematically valid dimensions. Dropout 2.1.1 is applied after each ReLU activation function $\sigma$ , and all other notation matches that used in Section 2.1.1. Unlike Transformers, MLPs and MLP $+$ residuals, require fixed‑length inputs. Based on the structures present in our data, we pad⁷⁷7“Padding” refers to the creation of dummy atoms where all values are 0. every structure to 220 atoms before feeding it to the network. The selection of 220 atoms stems from the regular Si lattice box in the dataset having $6^{3}=216$ atoms, and allowance for the inclusion of dopants. For larger systems, the energy-predictor model can be retrained or fine‑tuned with a higher maximum length rather than truncating atoms.

Table 1: Selection Performance

Information	$\mathcal{L}_{2}$ Error
Decoder	23.5508
MLP Only	50.3728
MLP + residual	11.1683

2.2.2 Model Selection and Comparison

To quantify performance, we train each candidate for 200 epochs, save the weights from the best validation step, and evaluate on the test set. The results are shown in Table 1.

The MLP $+$ residual achieves the lowest error, justifying its selection as the recommended architecture. After adopting it, we further refine the model with an additional 200 epochs of training until convergence.

Refer to caption — Figure 2: Side-by-side comparison of outputs. Top row: ADAPT. Bottom row: MACE retrained on the data used to train ADAPT. Predicted forces are shown in black, actual forces are shown in red.

2.3 Numerical Results

The primary criterion for comparing MLFFs is accuracy in force and energy prediction, typically measured by $\mathcal{L}_{2}$ or MAE error. We benchmark ADAPT against two state-of-the-art models: MACE [3] and MatterSim [5]. To ensure comparability, we train both MACE and ADAPT from scratch on a dataset of $6{,}082$ silicon defect DFT trajectories from our previous works, which contains both simple and complex defects with a total of 56 elements[39, 40]. Only charge neutral defects are considered in this work. Details of DFT calculations are provided in Supplementary information. All testing cases are complex defects. We additionally report results from previously benchmarked MACE models[41]. For MatterSim, which is positioned as a large-scale foundation model, retraining is computationally prohibitive; we therefore evaluate using its publicly released checkpoints. All models are tested on $100$ structures whose trajectories were not included in training.

Recall that the primary motivation for MLFFs is to generate relaxation trajectories. Metrics such as $\mathcal{L}_{2}$ loss of predicted forces and energies are a proxy used to compare MLFFs, but they are not the main goal. In practice, the decisive measure of MLFF capability is its performance in the meta-stable structure-determination pipeline, diagrammed in Figure 3. To this end, we do not evaluate on full trajectories because $\mathcal{L}_{2}$ error can be misleading in the latter steps of crystalline-defect structure relaxation. When atomic forces are near zero, $\mathcal{L}_{2}$ often favors trivial or uninformative predictions. For example, the zero vector, $\vec{\mathbf{0}}$ , can achieve lower error than nontrivial force predictions—even though it is not helpful in practice. This phenomenon occurs because most atoms in the bulk lattice undergo negligible displacement, allowing a model to minimize error by suppressing all motion across the lattice, at the cost of missing the subtle, yet critical, displacements that govern structural evolution.

In practice, however, MLFFs and relaxation procedures are often tolerant to small perturbations in the bulk lattice. Predictions typically exhibit small stochastic deviations, yet these are often self-correcting over successive relaxation steps. The practical utility of MLFFs lies in their ability to capture the significant atomic-force vectors that drive structural rearrangements. By evaluating on candidate structures from the beginnings of trajectories rather than full trajectories, the standard $\mathcal{L}_{2}$ metric better reflects practical utility for defects. These initial configurations often contain larger force magnitudes, reducing the advantage of trivial predictions.

Force Predictions. Table 5 shows that the small ADAPT configuration ( $d_{\text{model}}=256$ , $d_{\text{ff}}=512$ , 80 epochs) outperforms its larger counterpart ( $d_{\text{model}}=512$ , $d_{\text{ff}}=1024$ , 750 epochs). The larger configuration exhibited overfitting, indicating that the smaller model already distilled nearly all available information from the data. Accordingly, no further model training on the same inputs is likely to achieve a meaningful performance gain⁸⁸8Under the assumption of no additional inductive biases..

Results are summarized in Table 5: ADAPT achieves a $33\%$ error reduction relative to retrained MACE, and far outperforms the strongest pretrained model. Scatter plots of force and energy errors across all predictions are shown in Figure 5, and examples showing the effect on selected structures are included in Figure 2. The accuracy in forces obtained with ADAPT is around 0.01 eV/A as MAE. This is in the order of magnitude of the stopping criteria for many atomic relaxation within DFT including in our data set. This indicates that ADAPT could be a good surrogate to DFT relaxation and at least provide useful pre-relaxation.

Energy Predictions. We also show that the ADAPT defect formation energy-predictor model produces performance superior to both MACE and MatterSim. A table of results is given as Table 5, and scatter plots showing the results are given in Figure 6. We achieve near identical error to MatterSim 5M—the best of the existing energy predictors—after 200 epochs, and reach our final result—with a better than $30\%$ reduction in MAE error over MatterSim 5M—after 400 epochs.

2.4 Computational Efficiency

Force Predictions. An advantage of the ADAPT architecture is its computational efficiency. Training Small ADAPT required approximately 2.24 minutes per epoch on a single NVIDIA A100, and converged after 80 epochs (totaling 3 compute hours). In comparison, retraining MACE required 8.5 minutes per epoch for 300 epochs on 16 NVIDIA A100s, amounting to 680 compute hours: more than 227 $\times$ the amount of compute used to train ADAPT’s force-prediction model. The compact design of ADAPT permits training on commodity hardware, including workstations and even consumer-grade laptops equipped with GPUs,⁹⁹9The authors successfully trained Small ADAPT on a personal laptop. thereby significantly reducing hardware requirements for adoption. This accessibility is consistent with the overarching objective of the MLFF literature: to accelerate structural determination by reducing dependence on large-scale computational resources.

These improvements are attributed to the departure from graph-based architectures. Graph neural networks inherently involve sparse operations, which are not easily expressed in the dense linear algebraic form favored by modern accelerators. Consequently, graph-based models typically exhibit lower hardware utilization due to sparse operations, which lack the extensive optimization and backend support available with dense-matrix operations [42]. By forgoing graph representations and adopting architectural paradigms widely developed in natural-language processing and computer vision—where such operations benefit from extensive backend and library support—ADAPT achieves markedly higher computational throughput.

Energy Prediction. MACE generates energy predictions concurrently with force predictions within the same forward pass, yielding identical timing characteristics for both quantities. ADAPT trains an additional energy-predictor model, which required 1.93 compute hours on a single NVIDIA A100 GPU. Model training was conducted for 400 epochs, with the duration of a single epoch being 29 seconds on the same hardware. When including this cost, training both ADAPT models takes a total of 4.92 A100 hours, which is still more than 138 $\times$ faster than MACE.

3 Discussion

On the Use of Separate Models. ADAPT employs separate models for force and energy prediction, a design choice that carries several practical advantages. First, when only one quantity is required, the corresponding model can be deployed independently, reducing both runtime and memory consumption. This could be particularly important for defect-MLFF, as defect properties are often simulated in large supercells containing hundreds of atoms. This efficiency is relevant for practitioners working on local workstations or clusters with limited hardware capacity. Second, the separation increases modularity: force and energy predictors can be updated or retrained independently, allowing the integration of datasets without both quantities present, and enabling incremental model refinements without retraining the entire system.

We note, however, that separating forces and energies comes with important trade-offs. Because no physical constraint links the two predictions, the resulting MLFF is non-conservative: forces are not guaranteed to correspond to gradients of the energy surface. While recent studies suggest that abandoning this constraint may yield more efficient neural networks and even improved accuracy in some settings [43, 44, 45], we refrain from using such models for molecular dynamics simulations [46, 47]. Moreover, modularity itself introduces limitations. Some applications—–such as the FIRE optimizer [48]—–require forces and energies simultaneously. In these cases, a joint model is often more parameter-efficient [49], as it learns a shared representation across tasks and can exploit the inherent correlations between forces and energies, potentially improving generalization when sufficient data are available¹⁰¹⁰10Interpretations of neural-network representations should be made cautiously: the “black-box” nature of the architecture makes it difficult to directly characterize internal dynamics..

Architectural considerations also play a role in the two-model system. Unlike conventional neural networks, which allow outputs to be flexibly defined, Transformer architectures are inherently structured around token-to-token transformations. In ADAPT, where tokens correspond to atoms, the energy of the structure constitutes a non-token, global output. Accommodating this mismatch requires additional mechanisms. Extensive prior literature on this issue has yielded two main strategies: $i)$ the introduction of “special” tokens representing global properties [50, 51], and $ii)$ the use of specialized output heads appended to the model [52].

Given the limited training data available for silicon defects, it is not surprising [53, 54, 55] that a simpler MLP with residual connections outperformed a Transformer decoder in this setting—see Table 1. Nonetheless, the authors expect that, with sufficient force and energy data, Transformer architectures augmented with specialized heads may provide a more scalable and accurate solution. The design of such heads remains an active area of research, and identifying architectures that best balance modularity, efficiency, and accuracy is an open problem.

Coordinates vs. Graphs. GNNs are the default backbone for modern MLFFs [3, 4, 6, 5, 8] where atoms define nodes, and atomic bonds or proximity determine edge placement. By encoding geometric priors (permutation, rotation, and translation invariance), they incorporate strong inductive biases that improve data efficiency [1, 56, 57, 58] and have been argued to stabilize relaxation trajectories [13].

Representing continuous atomic interactions using discrete graph topologies introduces mismatches that can limit accuracy, especially in defects where long-range effects and precise geometries are important. GNNs inherently restrict interactions to local regions, relying on network depth to propagate forward information that is outside the interaction radius. This approach often leads to over-smoothing and over-squashing [59, 60], where long-range signals degrade rapidly as depth increases. Bulk crystal far from the defect core can substantially shape local defect structures. While long-range influences are less critical in many other chemical systems, neglecting them in crystalline materials can cause large errors. The poor performance of GNNs on large periodic systems—an issue especially relevant in modeling crystalline defects—has been noted [13, 17]. Adding long-range interactions into graph architectures [13, 6] often leads to significant cost in computation and model complexity. Thus, we arrive at the motivation for using an alternative MLFF strategy for modeling crystal defects in ADAPT, and a need recognized in [13, 17] as well.

Table 2: Full vs Local Interaction.

Allowed Interactions (%)	Total $\mathcal{L}_{2}$ Loss
1.46	$13.16^{*}$
18.7	$13.61^{\dagger}$
51.3	$11.13^{\dagger}$
100	$8.11^{*}$

Radius is the percentage of every-to-every interactions allowed during training and inference. Interactions are controlled in Attention via Key-Structural Masks (Appendix D). Lower scores mean less error.
Note: ^∗ training converged after 80 epochs; ^† training ran for 200 epochs until convergence.

With the advent of Transformer architectures and growing datasets, it is now feasible to move away from hard-coded geometric priors and instead focus on explicit representations of global distances and angles. ADAPT employs a Transformer encoder (Section 2.1, Appendix C) with full, unmasked self-attention, enabling all-to-all comparisons between atoms at each layer. This approach directly captures non-bonded and long-range interactions without depending on depth-based message passing. Although the model lacks explicit geometric equivariances, permutation invariance is inherent to unmasked attention, and experiments show that translational and rotational invariances can be learned sufficiently well from data. The importance of global attention is underscored in Table 2: restricting attention to local neighborhoods—as in GNNs—drastically degrades performance.

Accurate Representation of Geometries. Graphs excel at capturing connectivity, but do not inherently encode exact distances or angles. To handle this deficiency, many GNN variants supplement node and edge features with geometric data [1, 13, 3, 8, 4]; however, such information must still be passed iteratively from neighbor to neighbor, which can introduce truncation and discretization errors—an effect that compounds with increasing path lengths between atoms.

By contrast, a coordinate-based approach gives direct access to precise pairwise distances and angles for all atoms in a single computation step. This approach not only avoids approximations from multi-hop propagation, but also preserves geometric detail across all interaction scales.

Limitations and Future Directions. The ADAPT architecture is not inherently limited to defect relaxation or force prediction. However, it remains an open problem to determine ADAPT’s applicability to other problems including diverse bulk structures. Additionally, Transformers typically require substantial quantities of data [53, 54, 55], making ADAPT unsuitable for tasks with limited training data. Our work however points out that GNN-free MLFFs can reach high accuracy.

Future directions include $i)$ enforcing physical invariances algorithmically within both the architecture and the loss; $ii)$ extending training beyond silicon to encompass a wider class of defects and materials; $iii)$ developing force-field models that integrate physical constraints directly into the model architecture; and $iv)$ extending the framework to simulate charged defects in semiconductors.

4 Acknowledgments and Availability

4.1 Code and Data Availability

The datasets generated and/or analyzed during the current study are available in the “ADAPT Stable” repository, [released after publication].
The underlying code and training/validation datasets for this study are available in the GitHub repository: ADAPT-released and can be accessed via this link [released after publication].

4.2 Acknowledgments

This study was funded by NSF grants CCF-2212558, CCF-2212557, and CCF 1918651. The first principles work has been supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences in Quantum Information Science under Award Number DE-SC0022289. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award BES-ERCAP0020966. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the sponsoring entities.

This research was funded in part by: The Robert A. Welch Foundation (grant No. C-2118 A.K.); Rice University (Faculty Initiative award); NSF CAREER (award no. 2145629); an Amazon Research Award; a Microsoft Research Award.

4.3 Competing Interests

All authors declare no financial or non-financial competing interests.

References

[1] Michael M Bronstein, Joan Bruna, Taco Cohen and Petar Veličković “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges” In arXiv preprint arXiv:2104.13478, 2021
[2] Patrick Reiser et al. “Graph neural networks for materials science and chemistry” In Communications Materials 3.1 Nature Publishing Group UK London, 2022, pp. 93
[3] Ilyes Batatia et al. “MACE: Higher order equivariant message passing neural networks for fast and accurate force fields” In Advances in neural information processing systems 35, 2022, pp. 11423–11436
[4] Bowen Deng et al. “CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling” In Nature Machine Intelligence 5.9 Nature Publishing Group UK London, 2023, pp. 1031–1041
[5] Han Yang et al. “Mattersim: A deep learning atomistic model across elements, temperatures and pressures” In arXiv preprint arXiv:2405.04967, 2024
[6] J Thorben Frank, Oliver T Unke, Klaus-Robert Müller and Stefan Chmiela “A Euclidean transformer for fast and stable machine learned force fields” In Nature Communications 15.1 Nature Publishing Group UK London, 2024, pp. 6539
[7] Igor Poltavsky and Alexandre Tkatchenko “Machine learning force fields: Recent advances and remaining challenges” In The journal of physical chemistry letters 12.28 ACS Publications, 2021, pp. 6551–6564
[8] Chi Chen and Shyue Ping Ong “A universal graph deep learning interatomic potential for the periodic table” In Nature Computational Science 2.11 Nature Publishing Group US New York, 2022, pp. 718–728
[9] Kamal Choudhary and Brian DeCost “Atomistic line graph neural network for improved materials property predictions” In npj Computational Materials 7.1 Nature Publishing Group UK London, 2021, pp. 185
[10] Kristof Schütt et al. “Schnet: A continuous-filter convolutional neural network for modeling quantum interactions” In Advances in neural information processing systems 30, 2017
[11] Simon Batzner et al. “E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials” In Nature communications 13.1 Nature Publishing Group UK London, 2022, pp. 2453
[12] Albert Musaelian et al. “Learning local equivariant representations for large-scale atomistic dynamics” In Nature Communications 14.1 Nature Publishing Group UK London, 2023, pp. 579
[13] J Thorben Frank, Oliver T Unke and Klaus-Robert Müller “So3krates: Equivariant attention for interactions on arbitrary length-scales in molecular systems” In arXiv preprint arXiv:2205.14276, 2022
[14] Md Habibur Rahman et al. “Accelerating defect predictions in semiconductors using graph neural networks” In APL Machine Learning 2.1 AIP Publishing, 2024
[15] Xiaofeng Xiang, Dylan Soh and Scott Dunham “Exploration of deep learning models for accelerated defect property predictions and device design of cubic semiconductor crystals” In The Journal of Physical Chemistry C 128.21 ACS Publications, 2024, pp. 8821–8829
[16] Irea Mosquera-Lois, Seán R Kavanagh, Alex M Ganose and Aron Walsh “Machine-learning structural reconstructions for accelerated point defect calculations” In npj Computational Materials 10.1 Nature Publishing Group UK London, 2024, pp. 121
[17] Qimin Yan, Swastik Kar, Sugata Chowdhury and Arun Bansil “The case for a defect genome initiative” In Advanced Materials 36.11 Wiley Online Library, 2024, pp. 2303098
[18] Qimai Li, Zhichao Han and Xiao-Ming Wu “Deeper insights into graph convolutional networks for semi-supervised learning” In Proceedings of the AAAI conference on artificial intelligence 32.1, 2018
[19] Ziduo Yang et al. “Modeling crystal defects using defect informed neural networks” In npj Computational Materials 11.1 Nature Publishing Group UK London, 2025, pp. 229
[20] Arturo D Lopez-Rojas and Carlos A Cruz-Villar “Neural networks as an approximator for a family of optimization algorithm solutions for online applications” In Neural Computing and Applications 36.6 Springer, 2024, pp. 3125–3140
[21] Brandon Amos “Tutorial on amortized optimization”, 2025 arXiv: https://arxiv.org/abs/2202.00665
[22] Ruizhong Qiu, Zhiqing Sun and Yiming Yang “Dimes: A differentiable meta solver for combinatorial optimization problems” In Advances in Neural Information Processing Systems 35, 2022, pp. 25531–25546
[23] Ashish Vaswani et al. “Attention is all you need” In Advances in neural information processing systems 30, 2017
[24] Ce Zhou et al. “A comprehensive survey on pretrained foundation models: A history from bert to chatgpt” In International Journal of Machine Learning and Cybernetics Springer, 2024, pp. 1–65
[25] Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale” In arXiv preprint arXiv:2010.11929, 2020
[26] Josh Abramson et al. “Accurate structure prediction of biomolecular interactions with AlphaFold 3” In Nature 630.8016 Nature Publishing Group UK London, 2024, pp. 493–500
[27] Jonathan J Webster and Chunyu Kit “Tokenization as the initial phase in NLP” In COLING 1992 volume 4: The 14th international conference on computational linguistics, 1992
[28] George Cybenko “Approximation by superpositions of a sigmoidal function” In Mathematics of control, signals and systems 2.4 Springer, 1989, pp. 303–314
[29] Adam Khakhar and Jacob Buckman “Neural regression for scale-varying targets” In arXiv preprint arXiv:2211.07447, 2022
[30] Jae-Han Lee, Chul Lee and Chang-Su Kim “Learning multiple pixelwise tasks based on loss scale balancing” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5107–5116
[31] Dipendra Jha et al. “Elemnet: Deep learning the chemistry of materials from only elemental composition” In Scientific reports 8.1 Nature Publishing Group UK London, 2018, pp. 17593
[32] Yingzong Liang et al. “A universal model for accurately predicting the formation energy of inorganic compounds” In Science China Materials 66.1 Springer, 2023, pp. 343–351
[33] Linfeng Zhang et al. “Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics” In Physical review letters 120.14 APS, 2018, pp. 143001
[34] Johannes Müller “On the space-time expressivity of ResNets” In arXiv preprint arXiv:1910.09599, 2019
[35] Jonas Baggenstos and Diyora Salimova “Approximation properties of residual neural networks for Kolmogorov PDEs” In arXiv preprint arXiv:2111.00215, 2021
[36] Mahdi Movahedian Moghaddam, Kourosh Parand and Saeed Reza Kheradpisheh “Advanced Physics-Informed Neural Network with Residuals for Solving Complex Integral Equations” In arXiv preprint arXiv:2501.16370, 2025
[37] A Noorizadegan, R Cavoretto, Der-Liang Young and CHUIN-SHAN Chen “Stable weight updating: A key to reliable PDE solutions using deep learning” In Engineering Analysis with Boundary Elements 168 Elsevier, 2024, pp. 105933
[38] Karthik Kashinath et al. “Physics-informed machine learning: case studies for weather and climate modelling” In Philosophical Transactions of the Royal Society A 379.2194 The Royal Society Publishing, 2021, pp. 20200093
[39] Yihuang Xiong et al. “Computationally Driven Discovery of T Center-like Quantum Defects in Silicon” In Journal of the American Chemical Society 146.44, 2024, pp. 30046–30056
[40] Yihuang Xiong et al. “High-throughput identification of spin-photon interfaces in silicon” In Science Advances 9.40, 2023, pp. eadh8617 DOI: 10.1126/sciadv.adh8617
[41] Ilyes Batatia et al. “A foundation model for atomistic materials chemistry” In arXiv preprint arXiv:2401.00096, 2023
[42] Shengwen Liang et al. “EnGN: A high-throughput and energy-efficient accelerator for large graph neural networks” In IEEE Transactions on Computers 70.9 IEEE, 2020, pp. 1511–1525
[43] Johannes Klicpera, Florian Becker and Stephan Günnemann “Gemnet: Universal directional graph neural networks for molecules” In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021, pp. 6790–6802
[44] Mark Neumann et al. “Orb: A Fast, Scalable Neural Network Potential. 2024” In arXiv preprint arXiv:2410.22570 33
[45] Yi-Lun Liao, Brandon Wood, Abhishek Das and Tess Smidt “Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations” In arXiv preprint arXiv:2306.12059, 2023
[46] Filippo Bigi, Marcel Langer and Michele Ceriotti “The dark side of the forces: assessing non-conservative force models for atomistic machine learning” In arXiv preprint arXiv:2412.11569, 2024
[47] Ryan Jacobs et al. “A practical guide to machine learning interatomic potentials–Status and future” In Current Opinion in Solid State and Materials Science 35 Elsevier, 2025, pp. 101214
[48] Erik Bitzek et al. “Structural relaxation made simple” In Physical review letters 97.17 APS, 2006, pp. 170201
[49] Yu Zhang and Qiang Yang “A survey on multi-task learning” In IEEE transactions on knowledge and data engineering 34.12 IEEE, 2021, pp. 5586–5609
[50] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2019 arXiv: https://arxiv.org/abs/1810.04805
[51] Jean-Baptiste Alayrac et al. “Flamingo: a Visual Language Model for Few-Shot Learning”, 2022 arXiv: https://arxiv.org/abs/2204.14198
[52] Long Ouyang et al. “Training language models to follow instructions with human feedback”, 2022 arXiv: https://arxiv.org/abs/2203.02155
[53] Yahui Liu et al. “Efficient training of visual transformers with small datasets” In Advances in Neural Information Processing Systems 34, 2021, pp. 23818–23830
[54] Haoran Zhu, Boyuan Chen and Carter Yang “Understanding why vit trains badly on small datasets: An intuitive perspective” In arXiv preprint arXiv:2302.03751, 2023
[55] Yian Zhang, Alex Warstadt, Haau-Sing Li and Samuel R Bowman “When do you need billions of words of pretraining data?” In arXiv preprint arXiv:2011.04946, 2020
[56] Tsz Wai Ko and Shyue Ping Ong “Data-efficient construction of high-fidelity graph deep learning interatomic potentials” In npj Computational Materials 11.1 Nature Publishing Group UK London, 2025, pp. 65
[57] Johannes Kiechle et al. “Graph Neural Networks: A Suitable Alternative to MLPs in Latent 3D Medical Image Classification?” In International Workshop on Graphs in Biomedical Image Analysis, 2024, pp. 12–22 Springer
[58] Marco Oliva, Soubarna Banik, Josip Josifovski and Alois Knoll “Graph Neural Networks for Relational Inductive Bias in Vision-based Deep Reinforcement Learning of Robot Control”, 2022 arXiv: https://arxiv.org/abs/2203.05985
[59] Jhony H Giraldo, Konstantinos Skianis, Thierry Bouwmans and Fragkiskos D Malliaros “On the trade-off between over-smoothing and over-squashing in deep graph neural networks” In Proceedings of the 32nd ACM international conference on information and knowledge management, 2023, pp. 566–576
[60] T. Rusch, Michael M. Bronstein and Siddhartha Mishra “A Survey on Oversmoothing in Graph Neural Networks”, 2023 arXiv: https://arxiv.org/abs/2303.10993
[61] Anubhav Jain et al. “Commentary: The Materials Project: A materials genome approach to accelerating materials innovation” In APL materials 1.1 American Institute of PhysicsAIP, 2013, pp. 11002 DOI: 10.1063/1.4812323
[62] Kiran Mathew et al. “Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows” In Computational Materials Science 139, 2017, pp. 140–152 DOI: 10.1016/j.commatsci.2017.07.030
[63] Shyue Ping Ong et al. “Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis” In Computational Materials Science 68, 2013, pp. 314–319 DOI: 10.1016/j.commatsci.2012.10.028
[64] G. Kresse and J. Furthmüller “Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set” In Phys. Rev. B 54 American Physical Society, 1996, pp. 11169–11186 DOI: 10.1103/PhysRevB.54.11169
[65] G. Kresse and J. Furthmüller “Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set” In Computational Materials Science 6.1, 1996, pp. 15–50 DOI: 10.1016/0927-0256(96)00008-0
[66] P.. Blöchl “Projector augmented-wave method” In Phys. Rev. B 50 American Physical Society, 1994, pp. 17953–17979 DOI: 10.1103/PhysRevB.50.17953
[67] John P Perdew, Kieron Burke and Matthias Ernzerhof “Generalized gradient approximation made simple” In Physical review letters 77.18 APS, 1996, pp. 3865
[68] Jiankang Deng, Jia Guo, Niannan Xue and Stefanos Zafeiriou “Arcface: Additive angular margin loss for deep face recognition” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699

Appendix A Individual Contributions

Table 3: Author contributions by role (filled = contributed)

	Software.	Domain.	Method	Data Cur.	MACE.	Writing
ED
YX
YZ
CJ
TR
GH
TK

Roles: Software.=Creation of project software and documentation; Domain.=Domain Knowledge; Method. = Design of MLFF architecture; Data Cur.=Data Curation; MACE=Training of MACE; Writing=Writing and Editing.

Appendix B Dataset Details

The DFT trajectories dataset contains both simple and complex defects in silicon, which correspond to our previous works [39, 40]. The complex defects are in substitutional-interstitial configuration. The defect elements in the dataset span most of the periodic table besides the noble gas, rare-earth, and the ones that are difficult to implantable, giving in total 56 elements [40]. In this work, we extract $252{,}240$ number of single-point calculations of neutral charge defects from the relaxation trajectories. The high-throughput defect computations were performed using the automatic workflows that are implemented in atomate software package [61, 62, 63]. The first-principles calculations were performed using Vienna Ab-initio Simulation Package (VASP) [64, 65] with the projector augmented wave (PAW) method [66]. All the calculations were spin-polarized at the Perdew-Burke-Erzhenhoff (PBE) level[67]. Defect atoms were embedded in a Si supercell with 216 atoms. 520 eV cutoff energies were used for the plane-wave basis and the Brillouin zone was sampled with single $\Gamma$ . All the defect structures were optimized at a fixed volume until the ionic forces were smaller than 0.01 eV/Å.

Appendix C Architecture Details and Hyperparameters

Transformer Details.

A full writeup of the mathematics of Scaled Dot-Product Attention and Transformers can be found at the following links:

•

Attention: https://evandramko.github.io/files/attention.pdf
•

Transformers: https://evandramko.github.io/files/transformer.pdf

Hyperparameters.

•

ADAPT: We define the “small” model size by: [ $d_{\text{model}}=256$ , $d_{\text{ff}}=512$ , #-layers $=8$ , #-heads $=8$ , dropout rate $=0.05$ ] trained for 80 epochs. The “large” model size is: [ $d_{\text{model}}=512$ , $d_{\text{ff}}=1024$ , #-layers $=8$ , #-heads $=8$ , dropout rate $=0.05$ ] trained for 750 epochs. All training was in single precision.
•

MACE: The retrained version of MACE (v0.3.14, PyTorch 2.6.0) uses: num_interactions=2, num_channels=256, max_L=2, correlation=3, r_max=5.0, trained for 300 epochs on single precision (float32).

C.1 Evaluation At Different Levels

While $\mathcal{L}_{2}$ error is the conventional standard for comparing force predictions, we find that it is insufficient to fully capture the dynamics of point defects in crystals. To perform a more appropriate comparison, we use two complementary levels. (i) Model level (MLFF): accuracy of force and energy predictions. (ii) Predictor level: quality of the final relaxed structure obtained by running a geometry optimizer with the MLFF.

Model-Level Evaluation of Forces: When comparing candidate models, in addition to the loss scores (see Section 2.1.2), we also consider the average angle and magnitude errors separately. We use the dot product to calculate the angular error in degrees via¹¹¹¹11In practice, we clamp the $\cdot$ $\arccos(\cdot)$ to ensure that $\arccos$ is always operating on valid values. This detail is omitted for clarity in the provided formula.

\displaystyle\mathit{angle}(\mathbf{y},\mathbf{\widehat{y}})

\displaystyle=\arccos\left(\frac{\mathbf{y}\cdot\mathbf{\widehat{y}}}{||\mathbf{y}||_{2}\cdot||\mathbf{\widehat{y}}||_{2}}\right)\cdot\frac{180}{\pi},

and we calculate the difference in magnitudes via

\displaystyle\mathit{mag}(\mathbf{y},\mathbf{\widehat{y}})

\displaystyle=\left\lvert\left\lVert\mathbf{y}\right\rVert_{2}-\left\lVert\mathbf{\widehat{y}}\right\rVert_{2}\right\rvert

These results help to determine whether the model is genuinely learning the underlying dynamics or artificially minimizing error by predicting uniformly negligible forces—knowing that in reality, most of them will be close to zero.¹²¹²12In practice, many implementations of different models tended to produce near-zero results for all forces, and then stop improving. From a domain perspective, it is often more important to predict the direction (angle) of the force correctly than its exact magnitude. Although this angular-magnitude metric is differentiable and theoretically usable as a loss function for the MLFF, in practice it is difficult to balance the angular and magnitude components effectively. Empirical results show that angular-loss functions are often brittle and require significant engineering effort to implement reliably [68]—a result borne out in our own experiments. In contrast, using a weighted mean-squared-error (MSE) loss is simpler, more robust, and yields strong performance at both the MLFF and Predictor (Structural-Relaxation) levels, making it the preferred choice. However, we did use the angle-prediction performance of models to compare and rank different training runs and different hyperparameter choices for our models.

Evaluation of Energy: The total energy of the crystal is represented with a single number, making evaluation very easy. We use the common $\mathcal{L}_{2}$ distance metric.

Evaluation of Predictor (Figure 3): In order to evaluate the final result of the full relaxation procedure, we use the well known SOAP and delta Q metrics. Other checkers (such as those which check bond lengths) are also viable, although we do not use them in this work.

Appendix D Masking in Attention

When restricting interactions in Attention, we apply masks to the attention logit matrix

QK^{\mathsf{T}}\in\mathbb{R}^{B\times H\times T\times T},

where $B$ is the batch size, $H$ the number of heads, and $T$ the sequence length (number of tokens). Masking is applied along the Key dimension (the columns), so that certain tokens cannot be attended to. We use two types of masks:

1.

Padding mask. To enable batching, all sequences are padded.¹³¹³13Padding means appending dummy tokens, typically all zeros, to make every sequence the same length. Padding tokens must not affect the model’s output, so we mask them out of the attention computation.
2.

Restricted visibility (local radius). To study the effect of limiting each token’s visible neighborhood, we compute a restricted attention mask. Allowed interactions are precomputed from the $\mathcal{L}_{2}$ distances between raw coordinates, and then the same mask is applied to every attention step in the forward pass.

Key masking mechanism.

After computing $QK^{\mathsf{T}}$ , all disallowed positions are replaced with -inf. During the row-wise softmax, these entries become zero, ensuring that they cannot contribute, regardless of the values in $V$ . Consequently, masked tokens never influence the update of valid tokens. Query values at masked positions can be arbitrary (“nonsense” numbers),¹⁴¹⁴14Some implementations explicitly zero them out after each attention layer for safety and clarity. but they cannot affect non-padded tokens.

Appendix E Decoder

The natural extension of using an encoder to predict forces is to use a decoder to predict energy. While the encoder architecture produces a per-token output 2.1.1, the decoder architecture produces individual outputs, like a scalar crystal energy, using a similar Attention/Transformer based architecture. The decoder design we use starts with a stack of encoder layers like in the force-prediction model 2.1.1, but instead of the final linear down-scaling, the stack is followed by a decoder head. This head defines a “dummy” token, $\mathbf{q}$ , which is used to allow the calculations to shrink the output to a constant size. This modification requires us to use a slightly different notation; rather than having Attn as an function of a single variable, we denote it as a function of three variables. Each is used (in order) to provide the conditioning of one of $\mathbf{Q},\mathbf{K},\mathbf{V}$ .

The Decoder architecture is formulated as:

	$\displaystyle\mathbf{M}$	$\displaystyle=\texttt{encoder}(\mathbf{X});$
	$\displaystyle\mathbf{h}_{0}$	$\displaystyle=\texttt{LN}(\mathbf{q}+\texttt{Attn}(\mathbf{q},\mathbf{M},\mathbf{M}));$
	$\displaystyle\mathbf{h}_{1}$	$\displaystyle=\texttt{LN}(\mathbf{h}_{0}+\texttt{MLP}(\mathbf{h}_{0}));$
	$\displaystyle\mathbf{\widehat{y}}$	$\displaystyle=\mathbf{W}\mathbf{h}_{1}+\mathbf{b},$

where the notation follows that used in Section 2.1.1, and dropout is applied after Attn and MLP. Recall that $\mathbf{M}\in\mathbbm{R}^{n\times d_{model}}$ , and note that $\mathbf{W}\in\mathbbm{R^{1\times n}}$ . Although it is a matrix of shape $\mathbf{q}\in\mathbbm{R}^{(1\times d_{model})}$ we denote it in lowercase vector form to make clear that it has only one non-trivial dimension. We train both the encoder and decoder layers jointly.