Learning to Solve
Optimization Problems Constrained
with Partial Differential Equations

Yusuf Güven
Istanbul Technical University
Control and Automation Engineering Department
Istanbul, Turkey
ngu6fh@virginia.edu
&Vincenzo Di Vito and Ferdinando Fioretto
University of Virginia
Department of Computer Science
Charlottesville, USA
{eda8pc,fioretto}@virginia.edu
Equal Contribution

The implementation of the methods used to reproduce the experiments described in this work can be found at https://github.com/vdvf96/PDE-OP.
Abstract

Partial differential equation (PDE)-constrained optimization arises in many scientific and engineering domains, such as energy systems, fluid dynamics, and material design. In these problems, the decision variables (e.g., control inputs or design parameters) are tightly coupled with the PDE state variables, and the feasible set is implicitly defined by the governing PDE constraints. This coupling makes the problems computationally demanding, as it requires handling high-dimensional discretizations and dynamic constraints. To address these challenges, this paper introduces a learning-based framework that integrates a dynamic predictor with an optimization surrogate. The dynamic predictor, a novel time-discrete Neural Operator (Lu_2021, ), efficiently approximates system trajectories governed by PDE dynamics, while the optimization surrogate leverages proxy optimizer techniques (kotary2021end, ) to approximate the associated optimal decisions. This dual-network design enables real-time approximation of optimal strategies while explicitly capturing the coupling between decisions and PDE dynamics. We validate the proposed approach on benchmark PDE-constrained optimization tasks including Burgers’ equation, heat equation, and voltage regulation, and demonstrate that it achieves solution quality comparable to classical control-based algorithms such as the Direct Method and Model Predictive Control (MPC), while providing up to four orders of magnitude improvement in computational speed.


1 Introduction

Partial differential equations (PDEs) are central to modeling a wide range of physical and engineering systems, capturing phenomena that evolve jointly over space and time. Applications include fluid transport, modeled by Burgers’ equations, which are key in fluid mechanics and gas dynamics; thermal processes, captured by heat equations, which are also widely used in financial mathematics and image analysis; and voltage dynamics in power systems. Optimizing decisions in such domains requires solving PDE-constrained optimization problems, where control or design objectives must be optimized while ensuring consistency with the underlying governing PDE dynamics.

Despite its importance, PDE-constrained optimization remains computationally demanding. Traditional approaches typically rely on finite element and finite difference discretizations combined with adjoint-based optimization. These methods often suffer from high dimensionality, high computational cost, and convergence difficulties when applied to nonlinear dynamics. For example, when applied to the nonlinear Burgers’ equation, our experiments show that the adjoint-based and nonlinear MPC approaches require approximately $120\,\mathrm{s}$ and $878\,\mathrm{s}$, respectively; such computational costs hinder their applicability in large-scale or real-time decision-making tasks.

In various classes of continuous optimization tasks not constrained by PDEs, proxy optimizers (kotary2021end, ) have recently emerged as a powerful tool to quickly find approximate solutions. In parallel, researchers have focused on developing a variety of surrogates (raissi2017physics, ; kovachki2024neuraloperatorlearningmaps, ; li2021fourier, ; chen2019neural, ) for capturing system dynamics. Our approach is a unique combination of elements from both these areas, and consists of a dual-network architecture to tackle PDE-constrained optimization problems where both the state dynamics and the control decisions evolve in space and time.

Contributions. Specifically, the paper makes the following contributions: (1) It introduces a novel learning-based method to efficiently approximate solutions of PDE-constrained optimization problems. Our approach, PDE-OP (Optimization Proxy), consists of a dual-network architecture in which a dynamic predictor captures the system dynamics, while a surrogate controller approximates the optimal control actions. These components are trained in a synergistic manner in a self-supervised fashion using a primal–dual method. (2) It shows that the proposed time-discrete Neural Operator accurately captures the dynamics governed by PDEs. (3) It demonstrates that the proposed framework achieves decision quality comparable to state-of-the-art numerical solvers, such as Model Predictive Control and the Adjoint-sensitivity method, while providing several orders of magnitude improvements in computational speed. The approach is validated on a set of widely adopted PDE-constrained optimization tasks, including Burgers’, heat, and voltage equations. We believe that these results could boost the adoption of learning-based methods for control problems that require high-fidelity dynamic modeling and near real-time optimal decision-making.

2 Related Work

A central paradigm for decision-making coupled with dynamical systems is Model Predictive Control (MPC) (mayne2000constrained, ). It repeatedly solves an optimization problem over a finite time horizon using predicted system states. While MPC provides high-quality control decisions, its reliance on repeatedly solving an optimization problem makes it computationally expensive and limits its applicability in real-time settings. In a similar fashion, exact PDE-constrained optimization methods rely on time–space discretization of the governing equations, after which the resulting high-dimensional system is incorporated into an optimization problem (hinze2008optimization, ; troeltzsch2010optimal, ). While theoretically well-founded, these approaches quickly become impractical as the problem size grows, especially in the presence of nonlinear dynamics or when real-time feasibility is required. To mitigate these limitations, surrogate models have been developed for reduced-order adjoint solutions (zahr2014progressive, ), sensitivity PDEs (dihlmann2015certified, ), and learning PDE operators (Hwang_2022, ), offering significant computational advantages. Nevertheless, these methods often face trade-offs between accuracy and efficiency and limited adaptability across operating regimes (kovachki2024neuraloperatorlearningmaps, ; Lu_2021, ; willcox2021physics, ).

In parallel, to approximate solutions of constrained optimization problems in real-time settings, machine learning researchers have developed a class of methods known as proxy optimizers. These methods employ machine learning models to learn mappings from problem parameters or system configurations to optimal decisions, thereby significantly accelerating inference (kotary2021end, ) compared to classical optimization solvers. Both supervised approaches (fioretto2020lagrangian, ), which rely on precomputed optimal solutions, and self-supervised ones (park2023self, ), which exploit knowledge of the problem structure, have been proposed, showing promising results. A recurring challenge in this setting is ensuring feasibility: learned solutions may violate problem constraints. This issue has been tackled through penalty-based training (fioretto2020lagrangian, ), implicit-layer formulations that embed constraints directly into the model architecture (donti2020dc3, ), or post-processing projections to restore feasibility (kotary2024learningjointmodelsprediction, ).

A large body of recent work has also focused on learning surrogates for PDE dynamics. Physics-Informed Neural Networks (2019JCoPh.378..686R, ) directly encode PDE residuals into the loss function, enabling solutions that adhere to the governing equations. Neural operator approaches, such as the Fourier Neural Operator (li2021fourier, ) and DeepONet (Lu_2021, ), learn mappings between infinite-dimensional function spaces, which allows them to approximate solution operators of families of PDEs, with strong generalization across different boundary conditions and coefficients (kovachki2024neuraloperatorlearningmaps, ). Other operator-learning frameworks (li2020multipole, ; bhattacharya2021model, ) and kernel-based methods (raissi2021deepgreen, ) have also been developed. While these approaches excel at reproducing PDE solutions, they are typically not designed to incorporate decision variables or to directly solve PDE-constrained optimization problems. The closest related work is (divito2024learningsolvedifferentialequation, ), where the authors propose a learning-based surrogate solver to tackle ODE or SDE-constrained optimization tasks. Our approach goes further, by integrating operator-learning surrogates within the optimization pipeline, thereby coupling the representation of PDE dynamics with proxy optimizers and enabling real-time solutions to PDE-constrained optimization tasks.

3 Settings and Goals

This paper focuses on optimization problems constrained by a system of partial differential equations:

\operatorname*{Minimize}_{\,u:Q\to\mathbb{R}}\;\; \overbrace{L\big(y(T,\cdot)\big)+\int_{0}^{T}\!\!\int_{\Omega}\Phi\big(u(t,x),y(t,x)\big)\,dx\,dt+\int_{0}^{T}\!\!\int_{\Omega}\|u(t,x)\|_{2}^{2}\,dx\,dt}^{J(u(t,x),\,y(t,x))} (1a)
\text{s.t.}\;\; \partial_{t}y(t,x)=\mathcal{I}_{t}\big(y(t,x),u(t,x),t,x\big),\qquad (t,x)\in Q, (1b)
\partial_{x}y(t,x)=\mathcal{I}_{x}\big(y(t,x),u(t,x),t,x\big),\qquad (t,x)\in Q, (1c)
y(t_{0},x)=y_{t_{0}}(x),\qquad x\in\Omega, (1d)
y(t,x)=y_{\partial}(t,x),\qquad (t,x)\in[t_{0},T]\times\partial\Omega, (1e)
\bm{g}\big(u(t,x),y(t,x)\big)\le\bm{0},\quad \bm{h}\big(u(t,x),y(t,x)\big)=\bm{0},\qquad (t,x)\in Q. (1f)

In this formulation, the control action is denoted by $u:Q\to\mathbb{R}$, where $Q=[t_{0},T]\times\Omega$ is the space–time domain, with $t_{0},T\in\mathbb{R}$ being the initial time instant and the time horizon, respectively, and $\Omega$ being the spatial domain, e.g., if $x\in\mathbb{R}$, then $\Omega\subseteq\mathbb{R}$. The system state is represented by $y:Q\to\mathbb{R}$; $t$ and $x$ denote the independent temporal and spatial variables, respectively. The optimization objective (1a) combines a terminal cost $L\big(y(T,\cdot)\big)$, which depends on the state at the final time $T$, with a running cost $\Phi$, integrated over time and space, and a control-effort cost, which penalizes the energy used to control the system.

The system dynamics are governed by two PDE constraints. Equation (1b) models the temporal evolution of the state, while (1c) captures spatial dependencies, for example through diffusion–reaction relations. Initial conditions (1d) prescribe the state distribution at $t=t_{0}$, while boundary conditions (1e) specify the state behavior on the spatial boundary $\partial\Omega$. Finally, (1f) represents additional inequality and equality constraints on $(u,y)$, which may encode feasibility requirements or application-specific restrictions.

Refer to caption
Figure 1: Temperature regulation process: control actions $u(t,x)$ steer the system from the initial state $y(t_{0},x)$ towards a target state $y(T,x)$.

Temperature control example. In the context of thermal regulation problems, the decision variable $u(t,x)$ represents localized actuators applied along the medium, while the state variable $y(t,x)$ describes the temperature distribution over space and time. These two components are tightly coupled: the control inputs directly enter the governing PDE as forcing terms, shaping the spatio-temporal evolution of the state, while the optimization objective is evaluated on the resulting state trajectory. The system dynamics (1b)–(1c) follow a controlled reaction–diffusion equation, which models heat diffusion, dissipation, and externally applied control. The initial condition (1d) specifies the temperature distribution at $t=t_{0}$, and the boundary condition (1e) enforces zero-flux (Neumann) boundaries, preventing heat transfer across the domain boundary. The objective (1a) combines a terminal cost to match the target temperature profile, a running cost for deviations along the horizon, and a control-effort penalty. A simplified illustration is provided in Fig. 1, while a full mathematical description is given in Section 5.2.

Challenges. While fundamental for many applications, Problem (1) presents three key challenges:

  1. Finding optimal solutions to Problem (1) is computationally intractable. Even without the differential equation constraints, the decision version of the problem alone is NP-hard in general.

  2. Achieving high-quality approximations of the system dynamics (1b)–(1c) in near real-time poses the second significant challenge. The non-linearity of these dynamics further complicates the task.

  3. Finally, the integration of the system dynamics in the optimization framework renders traditional numerical methods impractical for real-time applications.

4 PDE-Optimization Proxy

To tackle the challenges discussed above, we introduce PDE-Optimization Proxy (PDE-OP): a fully differentiable surrogate for PDE-constrained optimization. PDE-OP adopts a dual-network architecture: the first network, $\mathcal{U}_{\omega}$, approximates the optimal decision variable $u^{*}(t,x)$, while the second, $\mathcal{Y}_{\theta}$, predicts the corresponding state variables $y(t,x)$ using a discrete-time neural operator architecture. Here, $\omega$ and $\theta$ denote the trainable parameters of $\mathcal{U}$ and $\mathcal{Y}$, respectively. These two components are jointly trained in an end-to-end differentiable manner, and a schematic illustration of their interaction is provided in Figure 2. In what follows, we first describe each network individually and then detail their integration within the proposed learning framework.

Learning Goal. PDE-OP trains on a distribution Π\Pi of problem instances generated by different states y(t,x)y(t,x) for various tt and xx values. For instance, in the thermal regulation example, this distribution corresponds to the variety of temperature profiles that arise from different control inputs. With reference to (1a), the learning objective is:

\operatorname*{Minimize}_{\omega,\theta}\;\mathbb{E}_{y(t,x)\sim\Pi}\left[\mathcal{J}\big(\mathcal{U}_{\omega}(t,x,y),\,\mathcal{Y}_{\theta}(t,x,\hat{u})\big)\right] (2a)
\text{s.t.}\;\;\text{(1b)–(1f)}, (2b)

where, to simplify the notation, we denote the estimated control variable as u^(t,x)=𝒰ω(t,x,y^)\hat{{u}}(t,x)\!=\!\mathcal{U}_{\omega}(t,x,\hat{y}), while the state variable estimate is denoted as y^(t,x)=𝒴θ(t,x,u^)\hat{{y}}(t,x)\!=\!\mathcal{Y}_{\theta}(t,x,\hat{u}). Given the complexity of solving Problem (2), the goal is to develop a fast and accurate neural PDE optimization surrogate. This approach uses the concept of proxy optimizers (kotary2021end, ), which is detailed next.

Refer to caption
Figure 2: At each time instant $t_{k}$, PDE-OP uses a dual-network architecture consisting of a proxy optimization model $\mathcal{U}_{\omega}$ to estimate the basis-function weights $\bm{c}(t_{k})$ and construct the control action $\hat{u}(t_{k},x)$, and a neural-DE model $\mathcal{Y}_{\theta}$ to estimate the state variables $\hat{y}(t_{k+1},x)$, with the objective function $\mathcal{J}(\hat{u}(t,x),\hat{y}(t,x))$ capturing the overall loss.

4.1 Surrogate Controller

The first objective is to establish a neural network-based mapping that transforms the system state $y(t,x)$ of the PDE-constrained optimization problem (1) into a control action $u(t,x)$. Practically, this mapping is implemented as a neural network $\mathcal{U}_{\omega}$, which takes the current system state $\hat{y}(t,x)$ and the desired terminal state $y(T,x)$ as input, and outputs an array of coefficients $\hat{\bm{c}}(t)$ that are then used to estimate the corresponding optimal decision variables. These coefficients are the weights used to combine a set of basis functions representing the control action, and are described next.

Basis function representation of the control action. To approximate the optimal control action using a neural network, we represent the control trajectory $u(t,x)$ as a linear combination of basis functions. This representation has also been proposed in (basis_cite_1, ; basis_cite_2, ). This design choice is motivated by the fact that, although the output of a neural network lies in a finite-dimensional space, combining it with continuous basis functions $\phi_{i}(x)$ allows us to construct a continuous approximation of the control trajectory $u(t,x)$. In this way, the controller model $\mathcal{U}_{\omega}$ operates in a low-dimensional space, while the basis functions map this finite number of outputs into a continuous-time signal. As shown in the additional experiments provided in Appendix 9.6, direct modeling of continuous-time control actions, either with a neural network or a classical method, leads to qualitatively similar performance to our basis expansion approach; however, while the computational cost of our method remains extremely low, the running time of the classical methods grows significantly. These results show that our method is suitable for real-time tasks.
Consider the control action u(tk,x){u}(t_{k},{x}), where tkt_{k} denotes a collocation point; at each time instant tkt_{k}, the control u(tk,x){u}(t_{k},{x}) is expressed as a weighted sum of MM continuous basis functions {ϕi(x)}i=1M\{\phi_{i}(x)\}_{i=1}^{M}:

u(t_{k},x)=\sum_{i=1}^{M}c_{i}(t_{k})\,\phi_{i}(x),\qquad t_{k}=0,\ldots,T, (3)

where 𝒄(tk)=[c1(tk)cM(tk)]\bm{c}(t_{k})=[c_{1}(t_{k})\;\ldots\;c_{M}(t_{k})]^{\top} denotes the coefficients of the basis functions at time instant tkt_{k}. For example, in one of our experiments, we adopt sinusoidal basis functions, e.g. ϕ1(x)=sin(πx/X)\phi_{1}(x)=\sin(\pi x/X), ϕ2(x)=cos(πx/X)\phi_{2}(x)=\cos(\pi x/X), \dots, ϕ2M1(x)=sin(Mπx/X)\phi_{2M-1}(x)=\sin(M\pi x/X), ϕ2M(x)=cos(Mπx/X)\phi_{2M}(x)=\cos(M\pi x/X), which provide a Fourier-like expansion of the control signal. Here MM denotes the number of sinusoidal modes included in the expansion, with higher values of MM allowing for better approximations of the control variable.
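For concreteness, the expansion in (3) can be implemented in a few lines of NumPy. The sketch below is illustrative rather than taken from our released code; the helper names (`sinusoidal_basis`, `control_from_weights`) and the domain length `X` are assumptions.

```python
import numpy as np

def sinusoidal_basis(x, n_modes, X=1.0):
    """Evaluate 2*n_modes sinusoidal basis functions phi_i(x) at the locations x."""
    feats = []
    for m in range(1, n_modes + 1):
        feats.append(np.sin(m * np.pi * x / X))
        feats.append(np.cos(m * np.pi * x / X))
    return np.stack(feats, axis=-1)            # shape (..., 2*n_modes)

def control_from_weights(coeffs, x, X=1.0):
    """Reconstruct u(t_k, x) = sum_i c_i(t_k) * phi_i(x) from predicted weights, Eq. (3)."""
    phi = sinusoidal_basis(x, n_modes=coeffs.shape[-1] // 2, X=X)
    return phi @ coeffs                         # control evaluated on the spatial grid

# Toy usage: 8 basis functions (4 sine/cosine pairs) on a 64-point spatial grid.
x_grid = np.linspace(0.0, 1.0, 64)
c_k = np.random.randn(8)                        # stand-in for the controller output c(t_k)
u_k = control_from_weights(c_k, x_grid)         # shape (64,)
```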

Surrogate controller model training. The role of the controller surrogate 𝒰ω\mathcal{U}_{\omega} is to predict the array of coefficients 𝒄(tk)=[c1(tk)cM(tk)]T\bm{c}(t_{k})=[c_{1}(t_{k})\;\dots\;c_{M}(t_{k})]^{T}. Formally, the network 𝒰ω\mathcal{U}_{\omega} approximates the function

U:\big(y(t_{k},x),\,y(T,x)\big)\mapsto\bm{c}(t_{k}), (4)

which maps the state variable $y(t_{k},x)$ and the desired terminal state $y(T,x)$ to the corresponding controller weights $\bm{c}(t_{k})$, subsequently used to construct the control action $u(t_{k},x)$ according to (3). The surrogate controller model $\mathcal{U}_{\omega}$ is trained on-the-fly, using a dataset $\mathcal{D}=\{(\hat{\bm{y}}^{i}(t_{k},x),\,y(T,x))\}_{i=1}^{N}$, where $\hat{\bm{y}}(t_{k},x)$ denotes the state variables estimated by the dynamic predictor $\mathcal{Y}_{\theta}$, $y(T,x)$ is the desired terminal state, and $N$ is the number of samples. Here, bold symbols $\bm{y}(t_{k},x)$ denote the space-discrete version of the corresponding function $y(t_{k},x)$, with each collocation point corresponding to a specific output of $\mathcal{Y}_{\theta}$, e.g., $\bm{y}(t_{k},x)=\big[\hat{y}(t_{k},x_{1})\;\ldots\;\hat{y}(t_{k},x_{n})\big]^{\top}$. Note that this setting is unsupervised: optimal weights $\bm{c}(t_{k})$ are unknown, and candidate weights $\hat{\bm{c}}(t_{k})=\mathcal{U}_{\omega}(\hat{\bm{y}}(t_{k},x),y(T,x))$ are evaluated solely by the objective value $\mathcal{J}$ in (1a) they induce. Figure 2 illustrates the interaction between PDE-OP’s proxy optimizer $\mathcal{U}_{\omega}$ and dynamic predictor $\mathcal{Y}_{\theta}$ at time instant $t=t_{k}$. Given the predicted state $\hat{y}(t_{k},x)$ (for $k=0$ we have $\hat{y}(t_{0},x)=y_{0}(x)$), the surrogate controller outputs an estimate of $\bm{c}(t_{k})$, which is then used to construct the control action $u(t_{k},x)$; this is subsequently used by $\mathcal{Y}_{\theta}$ to predict the next state $\hat{y}(t_{k+1},x)$. This process is repeated at each collocation point, until the terminal state $y(T,x)$ is estimated.
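A minimal sketch of this closed-loop interaction is given below, assuming both networks are PyTorch modules whose input and output dimensions match; the function name `rollout` and the concatenation-based interfaces are illustrative choices, not prescribed by the architecture.

```python
import torch

def rollout(controller, predictor, y0, y_target, n_steps):
    """Unroll the coupled controller/predictor from y0 to the horizon (cf. Figure 2).

    controller: maps (current state, target state) -> basis weights c(t_k)
    predictor:  maps (current state, c(t_k))       -> next state y(t_{k+1})
    """
    y_k = y0
    states, weights = [y0], []
    for _ in range(n_steps):
        c_k = controller(torch.cat([y_k, y_target], dim=-1))   # proxy optimizer U_omega
        y_k = predictor(torch.cat([y_k, c_k], dim=-1))         # dynamic predictor Y_theta
        states.append(y_k)
        weights.append(c_k)
    # Both trajectories stay on the computational graph, so the objective J can be
    # back-propagated through the entire rollout.
    return torch.stack(states, dim=1), torch.stack(weights, dim=1)
```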
Our surrogate controller builds on the proxy optimizer method proposed in (park2023self, ), which allows self-supervised training without access to precomputed optimal solutions, while directly encouraging feasibility through primal–dual updates. Once trained, the model $\mathcal{U}_{\omega}$ can be used to generate near-optimal solutions at low cost. The next section discusses how the state variables are estimated using a discrete-time Neural Operator architecture.

4.2 Discrete-time Neural Operator

The second component of PDE-OP is a discrete-time Neural Operator architecture, here denoted as 𝒴θ\mathcal{Y}_{\theta}, which models the PDE solution operator. The proposed dynamic predictor employs two subnetworks: the first, which we denote branch network, encodes input functions (e.g., control signals), and the second, called trunk network, encodes the coordinates where the solution is evaluated (e.g., spatial locations). Their outputs are then combined to approximate the target operator. This architecture is inspired by (Lu_2021, ), a neural operator framework designed to approximate mappings between infinite-dimensional function spaces.

Recall that the PDE-constrained problem couples system dynamics and decision variables. To account for this coupling, the proposed architecture incorporates both the system state and the controller output. Given the current estimated state $\hat{\bm{y}}(t_{k},x)$ and the control weights $\hat{\bm{c}}(t_{k})$ estimated by $\mathcal{U}_{\omega}$, the network $\mathcal{Y}_{\theta}$ outputs an estimate of the state at the subsequent collocation point in time:

\hat{\bm{y}}(t_{k+1},x)=\mathcal{Y}_{\theta}\big(t,x,\hat{\bm{y}}(t_{k},x),\hat{\bm{c}}(t_{k})\big). (5)

Rolling out (5) across k=0,,T1k=0,\ldots,T-1 yields the full estimated state trajectory, {𝒚^(tk,x)}k=0T\{\hat{\bm{y}}(t_{k},x)\}_{k=0}^{T}.

Branch network. The branch net encodes the estimated system state sampled at nn fixed sensor locations, 𝒚^(tk,x)\hat{\bm{y}}(t_{k},x), together with estimated control weights 𝒄^(tk)\hat{\bm{c}}(t_{k}), and maps this concatenated input to a dd-dimensional vector of coefficients 𝒃(𝒚^(tk,x),𝒄^(tk))d\bm{b}\!\left(\hat{\bm{y}}(t_{k},x),\hat{\bm{c}}(t_{k})\right)\in\mathbb{R}^{d}.
Trunk network. The trunk net encodes the spatial query coordinate xx, producing a latent feature vector 𝜸(x)d\bm{\gamma}(x)\in\mathbb{R}^{d}.
The estimated subsequent (in time) state 𝒚^(tk+1,x)\hat{\bm{y}}(t_{k+1},x) at location xix_{i} is obtained by combining the branch coefficients and trunk features through an inner product:

\hat{y}(t_{k+1},x_{i})=\sum_{j=1}^{d}b_{j}\big(\hat{y}(t_{k},x_{i}),\hat{\bm{c}}(t_{k})\big)\,\gamma_{j}(x_{i}),\qquad i=1,\dots,n. (6)

The remainder of this section details the method by which the state variables are precisely estimated using the proposed discrete-time Neural Operator.
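To make the branch–trunk combination in (6) concrete, the following PyTorch sketch shows one possible one-step operator; layer widths, activations, and the class name `DiscreteTimeOperator` are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DiscreteTimeOperator(nn.Module):
    """One-step map (y(t_k, .), c(t_k)) -> y(t_{k+1}, .), evaluated at query locations x."""

    def __init__(self, n_sensors, n_weights, latent_dim=64):
        super().__init__()
        # Branch net: encodes the state samples and the control weights.
        self.branch = nn.Sequential(
            nn.Linear(n_sensors + n_weights, 128), nn.Tanh(),
            nn.Linear(128, latent_dim))
        # Trunk net: encodes the spatial query coordinate.
        self.trunk = nn.Sequential(
            nn.Linear(1, 128), nn.Tanh(),
            nn.Linear(128, latent_dim))

    def forward(self, y_k, c_k, x_query):
        b = self.branch(torch.cat([y_k, c_k], dim=-1))   # (batch, d) branch coefficients
        g = self.trunk(x_query)                          # (n_query, d) trunk features
        return b @ g.T                                   # inner product of Eq. (6): (batch, n_query)
```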

Model initialization and training. Since the proposed time-discrete neural operator model $\mathcal{Y}_{\theta}$ takes as input the weights of the basis functions estimated by the surrogate controller $\mathcal{U}_{\omega}$, it is practical to pre-initialize the dynamic predictor model. To achieve this, we initialize $\mathcal{Y}_{\theta}$ in a supervised fashion; its training data are generated by sampling time-varying control weight sequences $\bm{c}(t_{k})$ from a Gaussian distribution, e.g., $c_{i}(t_{k})\sim\mathcal{N}(0,\sigma_{i})$, and by specifying the terminal state $y(T,x)$. The corresponding solution $y(t,x)$ satisfying (1b)–(1f) is computed using a numerical PDE solver. Each trajectory is discretized into one-step training samples of the form $(y(t_{k},x),\bm{c}(t_{k}))\mapsto y(t_{k+1},x)$. Formally, the dataset of training pairs is given by $\mathcal{D}=\big\{\big(y^{i}(t_{k},x),\bm{c}^{i}(t_{k}),y^{i}(t_{k+1},x)\big)\;\big|\;t_{k}\in\{0,\dots,T-1\}\big\}_{i=1}^{N}$. The network $\mathcal{Y}_{\theta}$ is trained to minimize the prediction error between its outputs and the ground-truth values. Formally, this model is trained by minimizing the loss:

\min_{\theta}\;\mathbb{E}_{(\bm{c},y)\sim\mathcal{D}}\Big[\big\|\hat{y}(t,x)-y(t,x)\big\|^{2}\Big], (7)

where 𝒟\mathcal{D} denotes the dataset of pairs (𝒄,y)(\bm{c},y). This training strategy ensures that 𝒴θ\mathcal{Y}_{\theta} learns to approximate the PDE solution operator given a distribution of weights 𝒄(t)\bm{c}(t) approximating the distribution of estimated weights generated by the surrogate controller 𝒰ω\mathcal{U}_{\omega}, so that it can provide accurate PDE-solutions given the predicted controller weights 𝒄^(t)\hat{\bm{c}}(t).
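A sketch of this supervised initialization is shown below, assuming the one-step samples have already been assembled into tensors; the optimizer choice, batch size, and the helper name `pretrain_operator` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pretrain_operator(model, y_curr, c_curr, y_next, x_query, epochs=50, lr=1e-3):
    """Fit the one-step map (y(t_k,.), c(t_k)) -> y(t_{k+1},.) with the MSE loss of Eq. (7)."""
    loader = DataLoader(TensorDataset(y_curr, c_curr, y_next), batch_size=64, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for y_k, c_k, y_kp1 in loader:
            pred = model(y_k, c_k, x_query)          # e.g., the branch/trunk operator above
            loss = torch.mean((pred - y_kp1) ** 2)   # one-step prediction error
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```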

4.3 Handling Static and Dynamic Constraints Jointly

To integrate the proposed time-discrete neural operator within the decision process, this paper proposes a primal–dual learning approach, inspired by the augmented Lagrangian relaxation technique (Hestenes1969MultiplierAG, ) adopted in classic optimization. In Lagrangian relaxation, some or all of the problem constraints are relaxed into the objective function using Lagrangian multipliers to capture the penalties induced by violating them. The proposed formulation leverages Lagrangian duality to integrate trainable regularization terms that encapsulate violations of the constraint functions. When all the constraints are relaxed, the violation-based Lagrangian of problem (1) is

\operatorname*{Minimize}_{u}\;\mathcal{J}(u(t,x),y(t,x))+\bm{\lambda}^{\top}_{h^{\prime}}\,\big|\bm{h}^{\prime}(u(t,x),y(t,x))\big|+\bm{\lambda}^{\top}_{g}\max\big(0,\bm{g}(u(t,x),y(t,x))\big),

where 𝒥\mathcal{J}, 𝒈\bm{g} are defined in (1a), and (1f), respectively, and 𝒉\bm{h}^{\prime} is defined as follows,

\bm{h}^{\prime}(u,y)\;=\;\begin{bmatrix}\partial_{t}y(t,x)-\mathcal{I}_{t}\big(y(t,x),u(t,x),t,x\big)\\ \partial_{x}y(t,x)-\mathcal{I}_{x}\big(y(t,x),u(t,x),t,x\big)\\ y(0,x)-y_{0}(x)\\ y(t,x)-y_{\partial}(t,x)\\ \bm{h}(u,y)\end{bmatrix}=\bm{0}. (8)

It denotes the set of all equality constraints of problem (1), thus extending the constraints $\bm{h}$ in (1f) with the system dynamics (1b)–(1c) and the initial and boundary conditions (1d)–(1e), written in implicit form as above. Therein, $\bm{\lambda}_{h^{\prime}}$ and $\bm{\lambda}_{g}$ are the vectors of Lagrange multipliers associated with functions $\bm{h}^{\prime}$ and $\bm{g}$, e.g., $\lambda^{i}_{h^{\prime}}$ and $\lambda^{j}_{g}$ are associated with the $i$-th equality $h^{\prime}_{i}$ in $\bm{h}^{\prime}$ and the $j$-th inequality $g_{j}$ in $\bm{g}$, respectively. The key advantage of expressing the system dynamics (1b)–(1c) and the initial and boundary conditions (1d)–(1e) in the same implicit form as the equality constraints $\bm{h}$, as shown in (8), is that the system dynamics can be treated in the same manner as the constraint functions $\bm{h}$. This enables us to satisfy both the system dynamics and the static set of constraints, ensuring that they are incorporated seamlessly into the optimization process.

The proposed primal–dual learning method uses an iterative approach to find good values of the primal ($\omega$, $\theta$) and dual ($\bm{\lambda}_{h^{\prime}}$, $\bm{\lambda}_{g}$) variables; it uses an augmented Lagrangian as the loss function to train the predictions $\hat{u}(t,x)$ and $\hat{y}(t,x)$:

\mathcal{L}^{\text{PDE-OP}}(\hat{u},\hat{y})=\mathcal{J}(\hat{u}(t,x),\hat{y}(t,x))+\bm{\lambda}^{\top}_{h^{\prime}}\,\big|\bm{h}^{\prime}(\hat{u}(t,x),\hat{y}(t,x))\big|+\bm{\lambda}^{\top}_{g}\max\big(0,\bm{g}(\hat{u}(t,x),\hat{y}(t,x))\big), (9)

where $\mathcal{J}(\hat{u}(t,x),\hat{y}(t,x))$ represents the total objective cost, while $\bm{\lambda}^{\top}_{h^{\prime}}|\bm{h}^{\prime}(\hat{u}(t,x),\hat{y}(t,x))|$ and $\bm{\lambda}^{\top}_{g}\max(0,\bm{g}(\hat{u}(t,x),\hat{y}(t,x)))$ measure the constraint violations incurred by the predictions $\hat{u}(t,x)$ and $\hat{y}(t,x)$. The loss function in (9) combines an objective term, ensuring minimization of the objective function (1a), with a weighted penalty on violations of the constraint functions $\bm{h}^{\prime},\bm{g}$. This accounts for the contribution of both networks $\mathcal{U},\mathcal{Y}$ during training, which is balanced via iterative updates of the multipliers in (9) based on the amount of violation of the associated constraint function.
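In code, the loss (9) amounts to combining the objective with multiplier-weighted violation terms. The sketch below assumes the constraint residuals are exposed as callables returning stacked tensors; all names are illustrative.

```python
import torch

def pde_op_loss(u_hat, y_hat, objective, eq_residuals, ineq_residuals, lam_h, lam_g):
    """Violation-based Lagrangian of Eq. (9).

    objective(u, y)      -> scalar objective J(u, y)
    eq_residuals(u, y)   -> tensor stacking h'(u, y): dynamics, initial/boundary terms, h
    ineq_residuals(u, y) -> tensor stacking g(u, y)
    lam_h, lam_g         -> current (non-trainable) Lagrange multiplier vectors
    """
    h_viol = eq_residuals(u_hat, y_hat).abs()                    # |h'(u, y)|
    g_viol = torch.clamp(ineq_residuals(u_hat, y_hat), min=0.0)  # max(0, g(u, y))
    return objective(u_hat, y_hat) + (lam_h * h_viol).sum() + (lam_g * g_viol).sum()
```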

At iteration j+1j+1, finding the optimal parameters ω,θ\omega,\theta requires solving

\omega^{j+1},\theta^{j+1}=\operatorname*{arg\,min}_{\omega,\theta}\;\mathbb{E}_{y(t,x)\sim\mathcal{D}}\left[\mathcal{L}^{\text{PDE-OP}}\left(\mathcal{U}_{\omega^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{y}),\,\mathcal{Y}_{\theta^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{u})\right)\right], (10)

where $\mathcal{U}_{\omega}^{\bm{\lambda}^{j}}$ and $\mathcal{Y}_{\theta}^{\bm{\lambda}^{j}}$ denote PDE-OP’s optimization and predictor models $\mathcal{U}_{\omega}$ and $\mathcal{Y}_{\theta}$ at iteration $j$, with $\bm{\lambda}^{j}=[\bm{\lambda}^{j}_{h^{\prime}}\ \bm{\lambda}^{j}_{g}]^{\top}$. This step is approximated using a stochastic gradient descent method

\omega^{j+1}=\omega^{j}-\eta\,\nabla_{\omega}\mathcal{L}^{\text{PDE-OP}}\left(\mathcal{U}_{\omega^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{y}),\,\mathcal{Y}_{\theta^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{u})\right)
\theta^{j+1}=\theta^{j}-\eta\,\nabla_{\theta}\mathcal{L}^{\text{PDE-OP}}\left(\mathcal{U}_{\omega^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{y}),\,\mathcal{Y}_{\theta^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{u})\right) (11)

where $\eta$ denotes the learning rate and $\nabla_{\omega}\mathcal{L}$ and $\nabla_{\theta}\mathcal{L}$ represent the gradients of the loss function $\mathcal{L}$ with respect to the parameters $\omega$ and $\theta$, respectively, at the current iteration $j$. Importantly, this step does not recompute the training parameters from scratch, but updates the weights $\omega$, $\theta$ based on their values at the previous iteration. Finally, the Lagrange multipliers are updated as

\bm{\lambda}^{j+1}_{h^{\prime}}=\bm{\lambda}^{j}_{h^{\prime}}+\rho\,\big|\bm{h}^{\prime}(\hat{u}(t,x),\hat{y}(t,x))\big|
\bm{\lambda}^{j+1}_{g}=\bm{\lambda}^{j}_{g}+\rho\,\max\big(\bm{0},\bm{g}(\hat{u}(t,x),\hat{y}(t,x))\big) (12)

where ρ\rho denotes the Lagrange step size and the max operation is performed element-wise.
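One outer iteration of (10)–(12) can be sketched as follows; averaging the violations over the batch before the dual update, the step size `rho`, and the function name `primal_dual_step` are assumptions made for illustration.

```python
import torch

def primal_dual_step(optimizer, lam_h, lam_g, u_hat, y_hat, loss_fn, residuals, rho=0.1):
    """One primal gradient step, Eq. (11), followed by the multiplier update, Eq. (12)."""
    loss = loss_fn(u_hat, y_hat, lam_h, lam_g)       # Lagrangian loss of Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates both omega and theta
    with torch.no_grad():                            # dual (multiplier) update
        h_res, g_res = residuals(u_hat, y_hat)
        lam_h += rho * h_res.abs().mean(dim=0)
        lam_g += rho * torch.clamp(g_res, min=0.0).mean(dim=0)
    return loss.item(), lam_h, lam_g
```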

Next, we evaluate our method on a range of PDE-constrained optimization tasks with varying levels of complexity, including cases with space–time varying control actions and nonlinear PDE dynamics.

5 Experimental Setting

In this section, we evaluate the proposed method on three optimal control tasks of increasing complexity. In these tasks, the dynamics are described by the following PDEs: voltage dynamics (linear PDE with time-invariant control input $u(x)$), the 1D heat equation (linear PDE with time-varying control input $u(t,x)$), and the 1D viscous Burgers’ equation (nonlinear PDE with time-varying control input $u(t,x)$). PDE-OP is compared against state-of-the-art approaches for optimal control problems: the Direct method (Direct-Method, ), the Adjoint-Sensitivity method (adjoint-method, ), and (nonlinear) Model Predictive Control (NMPC) (MPC, ). These methods are widely adopted in the literature, and all rely on numerical PDE solvers whenever dynamics or gradients are evaluated. The open-loop (direct/adjoint) control method requires a full forward rollout and a backward adjoint evaluation per gradient step, while MPC solves a finite-horizon problem at every control step (i.e., a receding horizon). The runtime of these methods is determined by repeated calls to the underlying numerical PDE solver and grows with the spatial grid size ($n$), horizon length ($T$), and complexity (e.g., nonlinearity) of the PDE dynamics. For each experiment and method, we report the accuracy, measured as the $L_{2}$ error between the terminal state and the predicted terminal state, $\|y(T,x)-\hat{y}(T,x)\|^{2}_{2}$, and the inference time of PDE-OP and the runtime of the classical methods, both expressed in seconds. For each task, we assess the capability of the proposed time-discrete neural operator to capture system dynamics by comparing its outputs against the solutions obtained from a numerical PDE solver.
Please refer to Appendices 8 and 10 for details on the baseline methods and hyperparameter settings, respectively. In Appendix 9.4 we report an ablation to evaluate the impact of the basis functions introduced in Section 4.1. In summary, this experiment shows that using a full formulation is impractical for NMPC, pushing solve times well beyond real-time limits. Finally, we show that Adjoint methods suffer from inaccurate gradients, which leads to a marked decrease in solution quality.

5.1 Voltage Control Optimization

The problem of regulating the voltage magnitude in a leaky electrical transmission line, subject to a reaction-diffusion PDE with Neumann conditions is described as follows:

\operatorname*{Minimize}_{u(x)}\quad \int_{\Omega}\big(y(T,x)-y_{\text{target}}(x)\big)^{2}\,dx+\gamma\int_{\Omega}u(x)^{2}\,dx (13a)
\text{s.t.}\quad \frac{\partial y}{\partial t}=D\frac{\partial^{2}y}{\partial x^{2}}-\beta\big(y-y_{\text{ref}}(x)\big)+\alpha u(x),\qquad \forall(x,t)\in\Omega\times[0,T] (13b)
y(0,x)=y_{0}(x),\qquad \forall x\in\Omega,\ t=0 (13c)
\frac{\partial y}{\partial x}(t,x)=0,\qquad \forall x\in\partial\Omega,\ t\in[0,T] (13d)
u_{\text{min}}\leq u(x)\leq u_{\text{max}},\qquad \forall x\in\Omega. (13e)

Here, $y(t,x)\in\mathbb{R}$ represents the voltage profile across the space–time domain defined by the Cartesian product of $\Omega$ and the time horizon $[0,T]$. The system dynamics, expressed by (13b), model the physical phenomena of diffusion ($D\frac{\partial^{2}y}{\partial x^{2}}$), voltage leakage towards a reference $y_{\text{ref}}(x)$ ($-\beta(y-y_{\text{ref}}(x))$), and the applied control action ($\alpha u(x)$). $D$, $\beta$, and $\alpha$ are constants representing the diffusion rate, leakage rate, and control gain, respectively. The system is initialized from a state $y_{0}(x)$ according to (13c) and is constrained by zero-flux Neumann boundary conditions (13d), implying no voltage diffusion across the boundaries of the domain. The control decision $u(x)$ is a space-dependent function representing the externally applied voltage, constrained by actuator limits in (13e). The objective (13a) minimizes a combination of terminal cost and control energy.

Datasets and methods. In the system dynamics (13b), we set the diffusion rate $D=0.1$, leakage rate $\beta=1.0$, and control gain $\alpha=2.0$, with a uniform reference voltage $y_{\text{ref}}(x)=1.0$. The ground-truth dynamics are simulated using a second-order implicit Crank–Nicolson finite-difference method (Crank_Nicolson_1947, ). To initialize the proposed time-discrete neural operator $\mathcal{Y}_{\theta}$, we generate a dataset by sampling a diverse set of control functions $u(x)$ from a zero-mean Gaussian Random Field (GRF) with a squared-exponential kernel, with mean $m(x)$ and covariance function $k_{\ell}(x_{1},x_{2})$ given by:

m(x)=0,\qquad k_{\ell}(x_{1},x_{2})\;=\;\sigma^{2}\exp\!\left(-\frac{\|x_{1}-x_{2}\|_{2}^{2}}{2\ell^{2}}\right), (14)

where >0\ell>0 is the length–scale and σ2>0\sigma^{2}>0 the marginal variance. The corresponding solution y(t,x)y(t,x) is computed using a numerical solver, and then split into 80%80\% training, 10%10\% validation, and 10%10\% test set.
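The GRF sampling step in (14) can be reproduced with a Cholesky factorization of the kernel matrix, as in the sketch below; the grid size, the default length scale, and the function name `sample_grf` are illustrative assumptions.

```python
import numpy as np

def sample_grf(x, length_scale=0.2, sigma2=1.0, n_samples=1, jitter=1e-8):
    """Draw zero-mean Gaussian-random-field samples with the squared-exponential kernel (14)."""
    d2 = (x[:, None] - x[None, :]) ** 2                     # pairwise squared distances
    K = sigma2 * np.exp(-d2 / (2.0 * length_scale ** 2))    # covariance matrix k_l(x_i, x_j)
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))     # jitter for numerical stability
    return (L @ np.random.randn(len(x), n_samples)).T       # shape (n_samples, len(x))

# e.g., 100 random control profiles u(x) on a 64-point spatial grid
u_samples = sample_grf(np.linspace(0.0, 1.0, 64), n_samples=100)
```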

PDE-OP’s model $\mathcal{Y}_{\theta}$ estimates the voltage profile $y(t,x)$; for $t=T$ this prediction yields $y(T,x)$, which is compared against the target profile $y_{\text{target}}(x)$. PDE-OP’s surrogate controller $\mathcal{U}_{\omega}$ is trained to produce an optimal control action $\hat{u}(x)$ while minimizing objective (13a). Our approach is benchmarked against the Direct and Adjoint methods, which are described in Appendices 8.1–8.2. At inference time, PDE-OP’s surrogate controller $\mathcal{U}_{\omega}$ is tested on three unseen target profiles of different levels of complexity: constant ($y_{\text{target}}(x)=p_{1}$), linear ramp ($y_{\text{target}}(x)=p_{1}x+p_{2}$), and sine wave ($y_{\text{target}}(x)=p_{1}+p_{2}(\sin f_{0}x+p_{3})$). For each target, the control $\hat{u}(x)$ is generated in a single forward pass.

Refer to caption
Figure 3: Voltage Control Optimization; comparison of PDE-OP and classical methods for three target profiles: ytarget(x)=1+0.2sin(6x)y_{\text{target}}(x)=1+0.2\sin(6x) (left), ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5 (center), and ytarget(x)=1y_{\text{target}}(x)=1 (right)
Refer to caption
Figure 4: Voltage control optimization; comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

Results. Figure 4 shows the predicted voltage (central sub-figure of Fig. 4) of PDE-OP’s dynamic component 𝒴θ\mathcal{Y}_{\theta} at test time, which is compared to the solution (left sub-figure) obtained with a numerical PDE-solver. The figure shows that the predicted voltage closely matches the numerical PDE-solver solution, with an absolute error (right sub-figure) on the order of 10210^{-2} across the space–time domain.

Figure 3 presents a comparison between PDE-OP’s predicted terminal state $\hat{y}(T,x)$ and that obtained using the classical methods, over 3 target profiles $y_{\text{target}}(x)$: sinusoidal, linear, and constant. This figure shows that the voltage distribution produced by PDE-OP is well aligned with those produced by the Direct and the Adjoint-sensitivity methods. As shown in the first row of Table 1, the Direct method is the slowest, requiring $26.097\,\mathrm{s}$ and yielding an MSE of $7.46\times10^{-3}$. The Adjoint method runs in $0.755\,\mathrm{s}$ with a similar accuracy of $7.13\times10^{-3}$. Notably, PDE-OP is the fastest by a significant margin: it runs in only $0.035\,\mathrm{s}$ ($\sim\!21.6\times$ faster than Adjoint; $\sim\!746\times$ faster than Direct) and yields the lowest MSE, $6.95\times10^{-3}$. These results show the potential of the proposed approach, which produces solutions that are qualitatively similar to (or even slightly better than) those of the classical methods, while being between $20$ and $750$ times faster.

Table 1: Comparison across PDE tasks and classical methods. All results use sinusoidal target profiles. Further details and results for linear and constant profiles can be found in Appendix 9.
PDE System                            Control Type   Method                  Runtime (s)   Accuracy (MSE)
Voltage Control (Reaction-Diffusion)  Open-Loop      Direct (Finite Diff.)    26.097        0.00746
                                                     Adjoint-Method            0.755        0.00713
                                                     PDE-OP (ours)             0.035        0.00695
1D Heat Equation (Linear PDE)         Closed-Loop    LMPC                      0.384        0.00035
                                                     Adjoint-Method            0.662        0.00035
                                                     PDE-OP (ours)             0.124        0.00035
1D Burgers’ Equation (Nonlinear PDE)  Closed-Loop    NMPC                    878.667        0.00007
                                                     Adjoint-Method          120.696        0.00023
                                                     PDE-OP (ours)             0.062        0.00030
Refer to caption
Figure 5: Optimal Control of the Heat Equation; comparison of PDE-OP and classical methods for three target profiles: ytarget(x)=0.6+0.3(sin(2x))y_{\text{target}}(x)=0.6+0.3(\sin(2x)), ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5, and ytarget(x)=1y_{\text{target}}(x)=1.
Refer to caption
Figure 6: Optimal Control of the 1D Heat Equation: comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

5.2 Optimal Control of the Heat Equation

This task involves controlling the evolution of temperature in a one-dimensional domain described by a reaction–diffusion PDE. Given an initial profile, the controller aims to reach a prescribed target state with minimal control effort. The problem is formulated as:

\operatorname*{Minimize}_{\bm{c}(t),\,t\in[0,T]}\quad \mathcal{L}=\mathcal{L}_{\text{terminal}}+\lambda\mathcal{L}_{\text{running}}+\gamma\mathcal{L}_{\text{effort}} (15a)
\text{s.t.}\quad \frac{\partial y}{\partial t}=D\frac{\partial^{2}y}{\partial x^{2}}-\beta\big(y-y_{\text{ref}}(x)\big)+\alpha u(t,x),\qquad \forall(x,t)\in\Omega\times[0,T] (15b)
u(t,x)=\bm{c}(t)^{\top}\bm{\phi}(x),\qquad \forall(x,t)\in\Omega\times[0,T] (15c)
y(0,x)=y_{0}(x),\qquad \forall x\in\Omega (15d)
\frac{\partial y}{\partial x}(t,x)=0,\qquad \forall x\in\partial\Omega,\ t\in[0,T] (15e)
c_{\text{min}}\leq\bm{c}(t)\leq c_{\text{max}},\qquad \forall t\in[0,T]. (15f)

The loss components in (15a) are defined as follows:

\mathcal{L}_{\text{terminal}}:=\int_{\Omega}\big|y(T,x)-y_{\text{target}}(x)\big|^{2}\,dx, (16a)
\mathcal{L}_{\text{running}}:=\int_{0}^{T}\!\!\int_{\Omega}\big|y(t,x)-y_{\text{target}}(x)\big|^{2}\,dx\,dt, (16b)
\mathcal{L}_{\text{effort}}:=\int_{0}^{T}\|\bm{c}(t)\|_{2}^{2}\,dt. (16c)

Here y(t,x)y(t,x) represents the temperature profile. The objective (15a) balances three terms: a terminal loss for final state accuracy, a running loss for temperature tracking, and an effort loss for control efficiency, described by (16a)–(16c).

Datasets and methods. The solutions of (15b) are generated using a Crank–Nicolson method, which handles the time-varying source term by averaging the control input over each time step. To train our models, we generate a dataset of diverse trajectories. First, smooth, time-varying control weights $\bm{c}(t)$ are generated by sampling each of the $M$ weight trajectories from a GRF over the time domain. The full spatio-temporal control field $u(t,x)$, constrained by (15f), is then reconstructed from these weights using a set of fixed basis functions, as defined in (15c). For each generated control field, the full PDE is solved to generate a ground-truth temperature evolution $y(t,x)$. Each complete simulation yields a training sample, consisting of the control input weight trajectory $\{\bm{c}(t_{k})\}$ and the corresponding state trajectory $\{y(t_{k},x)\}$ evaluated at a set of fixed sensor locations. The dataset is then split into 80% training, 10% validation, and 10% test sets.
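For reference, a single Crank–Nicolson step for the controlled reaction–diffusion dynamics (15b) with zero-flux boundaries can be sketched as below; the dense linear solve and the default coefficient values are simplifying assumptions made for illustration (a production solver would use banded matrices).

```python
import numpy as np

def crank_nicolson_step(y, u_n, u_np1, dx, dt, D=0.1, beta=1.0, alpha=2.0, y_ref=1.0):
    """One step of y_t = D y_xx - beta (y - y_ref) + alpha u with Neumann (zero-flux) BCs."""
    n = len(y)
    lap = np.zeros((n, n))
    for i in range(n):                       # standard second-difference Laplacian
        lap[i, i] = -2.0
        if i > 0:
            lap[i, i - 1] = 1.0
        if i < n - 1:
            lap[i, i + 1] = 1.0
    lap[0, 1] = 2.0                          # ghost-point treatment of zero-flux boundaries
    lap[-1, -2] = 2.0
    lap /= dx ** 2
    A = D * lap - beta * np.eye(n)           # linear part of the right-hand side
    src = beta * y_ref + alpha * 0.5 * (u_n + u_np1)   # control averaged over the time step
    lhs = np.eye(n) - 0.5 * dt * A
    rhs = (np.eye(n) + 0.5 * dt * A) @ y + dt * src
    return np.linalg.solve(lhs, rhs)
```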

Our framework uses two neural networks. First, the discrete-time neural operator $\mathcal{Y}_{\theta}$ is trained as a surrogate model as described in Section 4.2. It acts as a one-step time-advancement operator, learning the mapping $(y(t_{k},x),\bm{c}(t_{k}))\rightarrow y(t_{k+1},x)$. Second, PDE-OP’s surrogate controller $\mathcal{U}_{\omega}$ is a recurrent neural network that acts as a sequential decision-maker. At each time step $t_{k}$, it takes the current state $y(t_{k},x)$ and the final target $y_{\text{target}}(x)$ to produce the control weights $\bm{c}(t_{k})$. The controller is trained end-to-end by unrolling the trajectory using the surrogate $\mathcal{Y}_{\theta}$. We benchmark our approach against a highly tuned Adjoint method and Linear MPC (LMPC). Further implementation details are in Appendices 8.3–8.4.

At inference, we evaluate the PDE-OP model on three unseen target profiles: constant ($y_{\text{target}}(x)=p_{1}$), linear ramp ($y_{\text{target}}(x)=p_{1}x+p_{2}$), and sine wave ($y_{\text{target}}(x)=p_{1}+p_{2}(\sin f_{0}x+p_{3})$).

Results. Figure 6 shows the temperature (central sub-figure of Fig. 6) estimated by PDE-OP’s dynamic component 𝒴θ\mathcal{Y}_{\theta} at test time, which is compared to the solution (left sub-figure) obtained with a numerical PDE-solver. The figure shows that also in this setting the predicted temperature aligns closely with the numerical PDE-solver solution, with an absolute error on the order of 10310^{-3} across the space–time domain.

Figure 5 illustrates the comparison of PDE-OP’s predicted terminal state $\hat{y}(T,x)$ and those obtained using the classical methods, over the same target profiles introduced in the previous experiment. As also shown in Table 1, all three methods achieve the same accuracy, reporting an MSE of $3.5\times10^{-4}$. However, PDE-OP is by far the fastest method, running in $0.124\,\mathrm{s}$. This corresponds to a $3.1\times$ speedup over LMPC (running in $0.384\,\mathrm{s}$) and a $5.3\times$ speedup over the Adjoint method (which runs in $0.662\,\mathrm{s}$). These results demonstrate that even in a more complex setting, with time-varying control actions and linear PDE dynamics, our method produces solutions that are qualitatively comparable to those of the classical methods, but at a significantly reduced computational cost.

5.3 Optimal Control of the 1D Burgers’ Equation

To test the ability of PDE-OP to handle nonlinear dynamics, we address the control of the 1D viscous Burgers’ equation with Dirichlet boundary conditions. The control input, an external forcing term $u(t,x)$, is parameterized identically to the heat equation task, using time-varying weights $\bm{c}(t_{k})$ for a set of basis functions. The optimization goal is to find the optimal trajectory of these weights subject to Burgers’ equation (17b), the initial condition (17d), Dirichlet boundary conditions (17e), and box constraints (17f).

\operatorname*{Minimize}_{\bm{c}(t),\,t\in[0,T]}\quad \mathcal{L}=\mathcal{L}_{\text{terminal}}+\lambda\mathcal{L}_{\text{running}}+\gamma\mathcal{L}_{\text{effort}} (17a)
\text{s.t.}\quad \frac{\partial y}{\partial t}+y\frac{\partial y}{\partial x}=\nu\frac{\partial^{2}y}{\partial x^{2}}+u(t,x),\qquad \forall(x,t)\in\Omega\times[0,T] (17b)
u(t,x)=\bm{c}(t)^{\top}\bm{\phi}(x),\qquad \forall(x,t)\in\Omega\times[0,T] (17c)
y(0,x)=y_{0}(x),\qquad \forall x\in\Omega (17d)
y(t,x)=0,\qquad \forall x\in\partial\Omega,\ t\in[0,T] (17e)
c_{\text{min}}\leq\bm{c}(t)\leq c_{\text{max}},\qquad \forall t\in[0,T]. (17f)

Here, $y(t,x)$ is the velocity field and $\nu$ is the viscosity parameter. The PDE in (17b) contains a convective nonlinearity ($y\frac{\partial y}{\partial x}$), which poses a significant control challenge not present in the linear heat equation. The objective retains the same structure, balancing final-state accuracy, path tracking, and control efficiency.

Datasets and methods. The ground-truth dynamics for the nonlinear Burgers’ equation are simulated using a high-fidelity numerical solver. The dataset generation process mirrors that of the heat equation: time-varying weights 𝒄(t)\bm{c}(t) are sampled from a GRF, and the force field u(t,x)u(t,x) is reconstructed using spatial basis functions, and the PDE is solved to generate trajectories. Due to the system’s high nonlinearity, we benchmark our method against a much more general and computationally expensive Nonlinear MPC (NMPC) and Adjoint-Sensitivity Method. Further implementation details are in Appendix 8.3 - 8.4.

At inference, we test our PDE-OP model on three unseen target profiles: constant ($y_{\text{target}}(x)=p_{1}$), parabolic ($y_{\text{target}}(x)=p_{1}x(1-x)$), and sine wave ($y_{\text{target}}(x)=p_{1}(\sin f_{0}x+p_{2})$). We do not use a linear target profile because of the Dirichlet boundary conditions.

Refer to caption
Figure 7: Optimal control of the 1D Burgers’ equation; comparison of PDE-OP and classical methods for three target profiles: ytarget(x)=0.8(sin(x))y_{\text{target}}(x)=0.8(\sin(x)) (left), ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x) (center), and ytarget(x)=0y_{\text{target}}(x)=0 (right).
Refer to caption
Figure 8: Optimal control of the 1D Burgers’ equation: comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

Results. Figure 8 shows the system dynamics (central sub-figure of Fig. 8) estimated by PDE-OP’s dynamic component $\mathcal{Y}_{\theta}$ at test time, which is compared to the solution (left sub-figure) obtained with a numerical PDE solver. The figure shows that the predicted velocity field aligns closely with the numerical PDE-solver solution, with an absolute error on the order of $10^{-3}$ across the space–time domain, demonstrating the ability of PDE-OP’s dynamic predictor to accurately capture system dynamics even under nonlinearities.

Figure 7 illustrates the comparison of PDE-OP’s predicted terminal state $\hat{y}(T,x)$ and those obtained using the classical methods, over the same target profiles introduced in the previous experiments. As shown in Figure 7 and Table 1, each method produces highly accurate solutions, with NMPC achieving the lowest error ($\text{MSE}=7.0\times10^{-5}$) but requiring $878.667\,\mathrm{s}$. The Adjoint method is slightly less accurate, yielding an $\text{MSE}=2.3\times10^{-4}$ in $120.696\,\mathrm{s}$. In contrast, PDE-OP produces qualitatively comparable predictions while being the fastest by a wide margin, running in $0.062\,\mathrm{s}$: approximately $1.42\times10^{4}$ times faster than NMPC and $1.95\times10^{3}$ times faster than the Adjoint method.

6 Conclusion

This work was motivated by the efficiency requirements associated with solving partial differential equations-constrained optimization problems. It introduced a novel learning-based framework, PDE-OP, which incorporates differential equation constraints into optimization tasks for real-time applications. The approach uses a dual-network architecture, with one approximating the control strategies and another capturing the associated PDEs. This architecture exploits a primal-dual method to ensure that both the dynamics dictated by the PDEs and the optimization objectives are concurrently learned and respected. This integration enables end-to-end differentiation, allowing for efficient gradient-based optimization, and, to our knowledge, solves PDE-constrained optimization problems in real-time for the first time. Empirical evaluations across a set of PDE-constrained optimization tasks illustrated PDE-OP’s capability to address these complex challenges adeptly. Our comprehensive results demonstrate the effectiveness of our approach and its broad potential applicability across various scientific and engineering domains where system dynamics arise in optimization or control processes.

7 Acknowledgments

The material is based upon work supported by National Science Foundation (NSF) awards 2533631, 2401285, 2334448, and 2334936, and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011252E005. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or DARPA.

References

  • (1) Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, March 2021.
  • (2) James Kotary, Ferdinando Fioretto, Pascal Van Hentenryck, and Bryan Wilder. End-to-end constrained optimization learning: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4475–4482, 2021.
  • (3) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations, 2017.
  • (4) Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces, 2024.
  • (5) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations, 2021.
  • (6) Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 6572–6583. Curran Associates Inc., 2018.
  • (7) David Q Mayne, James B Rawlings, CV Rao, and PO Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
  • (8) Michael Hinze, René Pinnau, Michael Ulbrich, and Stefan Ulbrich. Optimization with PDE Constraints, volume 23 of Mathematical Modelling: Theory and Applications. Springer, 2008.
  • (9) Fredi Tröltzsch. Optimal Control of Partial Differential Equations: Theory, Methods, and Applications, volume 112 of Graduate Studies in Mathematics. American Mathematical Society, 2010.
  • (10) Matthew J Zahr and Charbel Farhat. Progressive construction of a parametric reduced-order model for pde-constrained optimization. International Journal for Numerical Methods in Engineering, 2014. https://arxiv.org/abs/1407.7618.
  • (11) Markus A Dihlmann and Bernard Haasdonk. Certified pde-constrained parameter optimization using reduced basis surrogate models for evolution problems. Computational Optimization and Applications, 60:753–787, 2015.
  • (12) Rakhoon Hwang, Jae Yong Lee, Jin Young Shin, and Hyung Ju Hwang. Solving pde-constrained control problems using operator learning. Proceedings of the AAAI Conference on Artificial Intelligence, 36(4):4504–4512, June 2022.
  • (13) Karen Willcox and Omar Ghattas. The imperative of physics-based machine learning and the role of reduced-order models. Nature Computational Science, 1(3):166–168, 2021.
  • (14) Ferdinando Fioretto, Pascal Van Hentenryck, Terrence WK Mak, Cuong Tran, Federico Baldo, and Michele Lombardi. Lagrangian duality for constrained deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 118–135. Springer, 2020.
  • (15) Seonho Park and Pascal Van Hentenryck. Self-supervised primal-dual learning for constrained optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4052–4060, 2023.
  • (16) Priya L Donti, David Rolnick, and J Zico Kolter. Dc3: A learning method for optimization with hard constraints. In ICLR, 2020.
  • (17) James Kotary, Vincenzo Di Vito, Jacob Cristopher, Pascal Van Hentenryck, and Ferdinando Fioretto. Learning joint models of prediction and optimization, 2024.
  • (18) M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, Feb 2019.
  • (19) Zongyi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial differential equations. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6755–6766, 2020.
  • (20) Kaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. Model reduction and neural networks for parametric pdes. In Mathematical and Scientific Machine Learning (MSML), volume 145 of Proceedings of Machine Learning Research, pages 203–233. PMLR, 2021.
  • (21) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. The deepgreen algorithm for learning generic pdes. Journal of Computational Physics, 420:109707, 2021.
  • (22) Vincenzo Di Vito, Mostafa Mohammadian, Kyri Baker, and Ferdinando Fioretto. Learning to solve differential equation constrained optimization problems, 2024.
  • (23) Mingquan Feng, Zhijie Chen, Yixin Huang, Yizhou Liu, and Junchi Yan. Optimal control operator perspective and a neural adaptive spectral method. Proceedings of the AAAI Conference on Artificial Intelligence, 39(14):14567–14575, Apr. 2025.
  • (24) Zhexian Li, Athanassios S. Fokas, and Ketan Savla. On linear quadratic regulator for the heat equation with general boundary conditions. In 2024 IEEE 63rd Conference on Decision and Control (CDC), pages 7594–7599, 2024.
  • (25) Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320, 1969.
  • (26) John T. Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. SIAM, 2 edition, 2010.
  • (27) Yang Cao, Shengtai Li, Linda Petzold, and Radu Serban. Adjoint sensitivity analysis for differential-algebraic equations: The adjoint dae system and its numerical solution. SIAM Journal on Scientific Computing, 24(3):1076–1089, 2003.
  • (28) Max Schwenzer, Muzaffer Ay, Thomas Bergs, and Dirk Abel. Review on model predictive control: an engineering perspective. The International Journal of Advanced Manufacturing Technology, 117:1327 – 1349, 2021.
  • (29) J. Crank and P. Nicolson. A practical method for numerical evaluation of solutions of partial differential equations of the heat-conduction type. Mathematical Proceedings of the Cambridge Philosophical Society, 43(1):50–67, 1947.
  • (30) Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
  • (31) Steven Diamond and Stephen Boyd. Cvxpy: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
  • (32) Dieter Kraft. Algorithm 733: Tomp–fortran modules for optimal control calculations. ACM Transactions on Mathematical Software (TOMS), 20:262 – 281, 1994.
  • (33) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • (34) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.

Appendix

8 Classical Methods

8.1 Direct Method (Finite-Difference Approach)

The direct method treats the simulator as a black box mapping the bounded control vector 𝒖\bm{u} to the discrete objective J(𝒖)J(\bm{u}) and employs a standard optimizer to minimize it. Here the control variable is discretized at each collocation point: 𝒖Nx\bm{u}\in\mathbb{R}^{N_{x}}.

Mechanism. We use L-BFGS-B[30] with per–coordinate box bounds uminujumaxu_{\min}\leq u_{j}\leq u_{\max} to minimize (31). No analytic gradients are provided; instead, numerical derivatives are queried via finite differences. With a forward-difference method,

JujJ(𝒖+ϵj𝒆j)J(𝒖)ϵj,ϵj=η(1+|uj|),\frac{\partial J}{\partial u_{j}}\;\approx\;\frac{J(\bm{u}+\epsilon_{j}\bm{e}_{j})-J(\bm{u})}{\epsilon_{j}},\qquad\epsilon_{j}=\eta\,\bigl(1+|u_{j}|\bigr), (18)

where \bm{e}_{j} is the j-th unit vector and \eta is a small scalar (e.g., \eta=10^{-6}). For higher accuracy (at roughly double the computational cost), the central-difference method is adopted:

JhujJh(𝒖+ϵj𝒆j)Jh(𝒖ϵj𝒆j)2ϵj.\frac{\partial J_{h}}{\partial u_{j}}\;\approx\;\frac{J_{h}(\bm{u}+\epsilon_{j}\bm{e}_{j})-J_{h}(\bm{u}-\epsilon_{j}\bm{e}_{j})}{2\epsilon_{j}}. (19)

Computational cost. Each gradient evaluation requires Nx+1N_{x}\!+\!1 simulator calls with forward differences (or 2Nx2N_{x} with central differences), where NxN_{x} denotes the number of spatial collocation points. Since one simulator call is a complete rollout (27) over NtN_{t} time steps, the complexity per iteration is 𝒪((Nx+1)NtNx)\mathcal{O}\big((N_{x}\!+\!1)\,N_{t}\,N_{x}\big). This method is robust and simple, but its cost scales linearly with NxN_{x}, which becomes prohibitive for dense spatial discretizations.
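A minimal Python sketch of this procedure is given below; it assumes a black-box routine simulate(u) performing one full CN rollout and returning the discrete objective J_h(u), and uses illustrative names, bounds, and tolerances rather than the exact released implementation.

import numpy as np
from scipy.optimize import minimize

Nx, eta = 101, 1e-6          # number of collocation points, FD step scale
u_min, u_max = -1.0, 1.0     # actuator bounds

def grad_fd(u, simulate):
    # Forward differences (Eq. 18): N_x + 1 simulator calls per gradient.
    g = np.zeros_like(u)
    J0 = simulate(u)
    for j in range(u.size):
        eps = eta * (1.0 + abs(u[j]))
        u_pert = u.copy(); u_pert[j] += eps
        g[j] = (simulate(u_pert) - J0) / eps
    return g

def solve_direct(simulate):
    # L-BFGS-B with per-coordinate box bounds on the discretized control.
    u0 = np.zeros(Nx)
    res = minimize(simulate, u0, jac=lambda u: grad_fd(u, simulate),
                   method="L-BFGS-B", bounds=[(u_min, u_max)] * Nx,
                   options={"maxiter": 100, "ftol": 1e-6})
    return res.x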

8.2 Adjoint Sensitivity Method for Open-Loop Problem

The Adjoint-Sensitivity Method efficiently computes exact gradients of PDE-constrained objectives w.r.t. high-dimensional control variables. Crucially, the cost of one gradient evaluation is one forward solve plus one backward adjoint solve, regardless of the number of control parameters. We derive the continuous adjoint for the open-loop, time-invariant Voltage Control Optimization task (Section 5.1) and then present the discrete-time adjoint used in our Crank–Nicolson (CN) implementation.

Function spaces & BCs. Consider u\in L^{2}(\Omega), y\in L^{2}([0,T]\times\Omega), and impose homogeneous Neumann BCs on y so that the boundary terms vanish in the variational calculus; the adjoint inherits the same Neumann BCs.

The problem consists of finding u(x)u(x) minimizing

J(u)=12Ω(y(T,x)ytarget(x))2𝑑x+γ2Ωu(x)2𝑑x,J(u)=\tfrac{1}{2}\!\int_{\Omega}\!\!(y(T,x)-y_{\text{target}}(x))^{2}\,dx\;+\;\tfrac{\gamma}{2}\!\int_{\Omega}\!u(x)^{2}\,dx, (20)

subject to

ytD2yx2+β(yyref(x))αu(x)=0,\displaystyle\frac{\partial y}{\partial t}-D\frac{\partial^{2}y}{\partial x^{2}}+\beta\big(y-y_{\text{ref}}(x)\big)-\alpha u(x)=0, (t,x)[0,T]×Ω,\displaystyle\forall(t,x)\in[0,T]\times\Omega, (21a)
y(0,x)=y0(x),\displaystyle y(0,x)=y_{0}(x), xΩ,\displaystyle\forall x\in\Omega, (21b)
xy(t,x)=0,\displaystyle\partial_{x}y(t,x)=0, (t,x)[0,T]×Ω,\displaystyle\forall(t,x)\in[0,T]\times\partial\Omega, (21c)

where γ>0\gamma>0 is the regularization weight in the control effort.

Lagrangian. We define

(u,y,p)=J(u)+0TΩp(t,x)[ytD2yx2+β(yyref(x))αu(x)]𝑑x𝑑t,\mathcal{L}(u,y,p)=J(u)+\int_{0}^{T}\!\!\int_{\Omega}p(t,x)\!\left[\frac{\partial y}{\partial t}-D\frac{\partial^{2}y}{\partial x^{2}}+\beta\big(y-y_{\text{ref}}(x)\big)-\alpha u(x)\right]dx\,dt, (22)

where p(t,x) is the adjoint state. Imposing \nabla_{y}\mathcal{L}=0 and integrating by parts in t and x (with Neumann BCs) yields the adjoint system.

Adjoint PDE.

ptD2px2+βp=0,xp|Ω=0.-\frac{\partial p}{\partial t}-D\frac{\partial^{2}p}{\partial x^{2}}+\beta\,p=0,\qquad\partial_{x}p|_{\partial\Omega}=0. (23)

Terminal condition.

p(T,x)=y(T,x)ytarget(x).p(T,x)=y(T,x)-y_{\text{target}}(x). (24)

Gradient w.r.t. u(x)u(x). Because uu is time-invariant, variation w.r.t. uu gives

u(x)J=γu(x)α0Tp(t,x)𝑑t.\nabla_{u(x)}J\;=\;\gamma\,u(x)\;-\;\alpha\int_{0}^{T}p(t,x)\,dt. (25)

We optimize \bm{u} with L-BFGS-B under elementwise bounds u_{\min}\leq u_{i}\leq u_{\max}; no projection step is needed.

Semi-discrete forward dynamics. To simplify the notation, let \bm{y}(t)=[y(t,x_{1})\;\dots\;y(t,x_{i})\;\dots\;y(t,x_{N_{x}})]\in\mathbb{R}^{N_{x}} be the discretized version of y(t,x) on a uniform grid with Neumann BCs, let \bm{L}\in\mathbb{R}^{N_{x}\times N_{x}} be the Neumann Laplacian, and let \bm{u}\in\mathbb{R}^{N_{x}} be the static control vector with elements u(x_{i}). Define \bm{A}:=D\,\bm{L}-\beta\,\bm{I}_{N_{x}} and \bm{s}:=\beta\,\bm{y}_{\text{ref}}. Then

𝒚˙(t)=𝑨𝒚(t)+α𝒖+𝒔.\dot{\bm{y}}(t)\;=\;\bm{A}\,\bm{y}(t)\;+\;\alpha\,\bm{u}\;+\;\bm{s}. (26)

Crank–Nicolson (CN) step. With Δt=T/Nt\Delta t=T/N_{t} and tk=kΔtt_{k}=k\Delta t, CN gives

(𝑰NxΔt2𝑨)𝒚(tk+1)=(𝑰Nx+Δt2𝑨)𝒚(tk)+Δtα𝒖+Δt𝒔.\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)\bm{y}(t_{k+1})\;=\;\big(\bm{I}_{N_{x}}+\tfrac{\Delta t}{2}\bm{A}\big)\,\bm{y}(t_{k})\;+\;\Delta t\,\alpha\,\bm{u}\;+\;\Delta t\,\bm{s}. (27)

Equivalently, the affine one-step map is

𝒚(tk+1)\displaystyle\bm{y}(t_{k+1}) =𝑨CN𝒚(tk)+𝑩stat𝒖+𝒅,\displaystyle=\bm{A}_{\mathrm{CN}}\,\bm{y}(t_{k})\;+\;\bm{B}_{\mathrm{stat}}\,\bm{u}\;+\;\bm{d}, (28)
𝑨CN\displaystyle\bm{A}_{\mathrm{CN}} =(𝑰NxΔt2𝑨)1(𝑰Nx+Δt2𝑨),𝑩stat=Δt(𝑰NxΔt2𝑨)1α𝑰Nx,\displaystyle=\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-1}\!\big(\bm{I}_{N_{x}}+\tfrac{\Delta t}{2}\bm{A}\big),\qquad\bm{B}_{\mathrm{stat}}=\Delta t\,\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-1}\,\alpha\,\bm{I}_{N_{x}}, (29)
𝒅\displaystyle\bm{d} =Δt(𝑰NxΔt2𝑨)1𝒔.\displaystyle=\Delta t\,\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-1}\,\bm{s}. (30)

Discrete objective. With 𝑴=Δx𝑰Nx\bm{M}=\Delta x\,\bm{I}_{N_{x}} we have:

Jh(𝒖)=12𝒚Nt𝒚target𝑴2+γ2𝒖𝑴2=12(𝒚Nt𝒚target)𝑴(𝒚Nt𝒚target)+γ2𝒖𝑴𝒖.J_{h}(\bm{u})=\tfrac{1}{2}\,\|\bm{y}_{N_{t}}-\bm{y}_{\text{target}}\|_{\bm{M}}^{2}+\tfrac{\gamma}{2}\,\|\bm{u}\|_{\bm{M}}^{2}=\tfrac{1}{2}(\bm{y}_{N_{t}}-\bm{y}_{\text{target}})^{\top}\bm{M}(\bm{y}_{N_{t}}-\bm{y}_{\text{target}})+\tfrac{\gamma}{2}\,\bm{u}^{\top}\bm{M}\bm{u}. (31)

Discrete adjoint & gradient w.r.t. static u\bm{u}.

𝝀Nt=𝑴(𝒚Nt𝒚target),𝝀k=𝑨CN𝝀k+1,k=Nt1,,0,\bm{\lambda}_{N_{t}}=\bm{M}(\bm{y}_{N_{t}}-\bm{y}_{\text{target}}),\qquad\bm{\lambda}_{k}=\bm{A}_{\mathrm{CN}}^{\top}\,\bm{\lambda}_{k+1},\quad k=N_{t}-1,\dots,0, (32)
𝒖Jh=γ𝑴𝒖k=0Nt1𝑩stat𝝀k+1=γ𝑴𝒖αΔt(𝑰NxΔt2𝑨)(k=0Nt1𝝀k+1).\nabla_{\bm{u}}J_{h}=\gamma\,\bm{M}\,\bm{u}-\sum_{k=0}^{N_{t}-1}\bm{B}_{\mathrm{stat}}^{\top}\,\bm{\lambda}_{k+1}\;=\;\gamma\,\bm{M}\,\bm{u}-\alpha\,\Delta t\,\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-\!\top}\!\left(\sum_{k=0}^{N_{t}-1}\bm{\lambda}_{k+1}\right). (33)

Computational cost. One forward CN rollout (27) and one backward adjoint sweep (32) yield the exact discrete gradient (33); \bm{u} is then updated with L-BFGS-B under box bounds. In 1D, the CN matrix is time-invariant and can be prefactored once, so each step reduces to a cheap triangular solve; the total cost per iteration is \mathcal{O}(N_{t}N_{x}).
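The following Python sketch illustrates this forward/backward procedure for the static control under illustrative 1D problem data (D, beta, alpha, gamma, y_ref, y_target); it assembles a Neumann Laplacian, performs the CN rollout (27) and the backward sweep (32), and evaluates the gradient (33). It is a sketch of the equations above, not the released implementation.

import numpy as np

def neumann_laplacian(Nx, dx):
    # Standard second-difference matrix with mirrored ghost nodes (zero-flux BCs).
    L = (np.diag(-2.0 * np.ones(Nx)) + np.diag(np.ones(Nx - 1), 1)
         + np.diag(np.ones(Nx - 1), -1))
    L[0, 1] = 2.0
    L[-1, -2] = 2.0
    return L / dx**2

def objective_and_gradient(u, y0, y_target, y_ref, D, beta, alpha, gamma, T, Nt, dx):
    Nx = y0.size
    A = D * neumann_laplacian(Nx, dx) - beta * np.eye(Nx)
    dt = T / Nt
    M_L = np.eye(Nx) - 0.5 * dt * A            # CN left matrix (I - dt/2 A)
    M_R = np.eye(Nx) + 0.5 * dt * A            # CN right matrix (I + dt/2 A)
    s = beta * y_ref
    # Forward CN rollout (Eq. 27).
    y = y0.copy()
    for _ in range(Nt):
        y = np.linalg.solve(M_L, M_R @ y + dt * alpha * u + dt * s)
    M = dx * np.eye(Nx)                         # lumped mass matrix
    # Backward adjoint sweep (Eq. 32), accumulating the sum of lambda_{k+1}.
    lam = M @ (y - y_target)
    A_CN = np.linalg.solve(M_L, M_R)
    lam_sum = np.zeros(Nx)
    for _ in range(Nt):
        lam_sum += lam
        lam = A_CN.T @ lam
    # Exact discrete gradient (Eq. 33).
    grad = gamma * (M @ u) - alpha * dt * np.linalg.solve(M_L.T, lam_sum)
    J = 0.5 * (y - y_target) @ M @ (y - y_target) + 0.5 * gamma * u @ M @ u
    return J, grad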

8.3 Time-Varying Adjoint for Closed-Loop Baselines

The adjoint method extends naturally to a time-varying control u(t,x)u(t,x). The continuous adjoint PDE and terminal/boundary conditions remain as per Eq. (23)–(24), while the gradients w.r.t. u(t,x)u(t,x) are calculated as:

u(t,x)J=γu(t,x)αp(t,x).\nabla_{u(t,x)}J\;=\;\gamma\,u(t,x)\;-\;\alpha\,p(t,x). (34)

We use this time-varying adjoint for the heat and Burgers' experiments (see Sections 5.2 and 5.3); the discrete forms below match our CN implementations.

Heat optimization problem — CN discretization.

With \bm{y}(t_{k+1})=\bm{A}\bm{y}(t_{k})+\bm{B}\bm{u}(t_{k})+\bm{g} and quadratic running and terminal costs, the discrete adjoint equation and per-step gradient are

𝝀Nt\displaystyle\bm{\lambda}_{N_{t}} =𝑸(𝒚Nt𝒚target),𝝀k=𝑨𝝀k+1+𝑻Q(𝒚(tk)𝒚target),\displaystyle=\bm{Q}\,(\bm{y}_{N_{t}}-\bm{y}_{\text{target}}),\qquad\bm{\lambda}_{k}=\bm{A}^{\top}\bm{\lambda}_{k+1}+\bm{T}_{Q}\,(\bm{y}(t_{k})-\bm{y}_{\text{target}}), (35)
𝒖(tk)Jh\displaystyle\nabla_{\bm{u}(t_{k})}J_{h} =𝑹𝒖(tk)+𝑩𝝀k+1.\displaystyle=\bm{R}\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{\lambda}_{k+1}. (36)

This is the gradient used for the closed-loop heat-equation experiments in Section 5.2. \bm{Q}, \bm{T}_{Q}, and \bm{R} denote the weight matrices.
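A minimal sketch of this backward recursion, assuming the CN matrices A, B and the weight matrices Q, T_Q, R have already been assembled (all names are illustrative), is:

import numpy as np

def heat_adjoint_gradients(ys, us, A, B, Q, TQ, R, y_target):
    # ys: states y(t_0), ..., y(t_Nt); us: controls u(t_0), ..., u(t_{Nt-1}).
    Nt = len(us)
    lam = Q @ (ys[Nt] - y_target)                  # terminal adjoint (Eq. 35)
    grads = [None] * Nt
    for k in range(Nt - 1, -1, -1):
        grads[k] = R @ us[k] + B.T @ lam           # per-step gradient (Eq. 36)
        lam = A.T @ lam + TQ @ (ys[k] - y_target)  # adjoint recursion (Eq. 35)
    return grads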

Burgers’ optimization problem — implicit CN step.

We write one CN step as the implicit map 𝒚(tk+1)=𝒢(𝒚(tk),𝒖(tk);Δt)\bm{y}(t_{k+1})=\mathcal{G}(\bm{y}(t_{k}),\bm{u}(t_{k});\Delta t) with Dirichlet endpoints enforced at each step. Let

Ak=IΔt2J(𝒚(tk+1)),Bk=IΔt2J(𝒚(tk)),A_{k}\;=\;I-\tfrac{\Delta t}{2}\,J(\bm{y}(t_{k+1})),\qquad B_{k}\;=\;-I-\tfrac{\Delta t}{2}\,J(\bm{y}(t_{k})), (37)

where J(𝒚)=𝑫diag(𝒚)+ν𝑫2J(\bm{y})=-\bm{D}\,\mathrm{diag}(\bm{y})+\nu\,\bm{D}^{2} is the Jacobian of the semi-discrete residual, and 𝑩\bm{B} stacks the actuator shapes. For running cost Lk=Δt2𝒚(tk+1)𝒚target22+αΔt2𝒖(tk)22L_{k}=\tfrac{\Delta t}{2}\|\bm{y}(t_{k+1})-\bm{y}_{\text{target}}\|_{2}^{2}+\tfrac{\alpha\Delta t}{2}\|\bm{u}(t_{k})\|_{2}^{2}, the closed-loop adjoint/backward sweep is

Ak𝒒k\displaystyle A_{k}^{\top}\bm{q}_{k} =Δt(𝒚(tk+1)𝒚target),\displaystyle=\Delta t\,(\bm{y}(t_{k+1})-\bm{y}_{\text{target}}), (38)
𝒑k\displaystyle\bm{p}_{k} =Δt(𝒚(tk)𝒚target)Bk𝒒k,\displaystyle=\Delta t\,(\bm{y}(t_{k})-\bm{y}_{\text{target}})-B_{k}^{\top}\bm{q}_{k}, (39)
𝒖(tk)Jh\displaystyle\nabla_{\bm{u}(t_{k})}J_{h} =Δt(α𝒖(tk)+𝑩𝒒k),\displaystyle=\Delta t\big(\alpha\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{q}_{k}\big), (40)

which are the adjoint equations used in our NMPC implementation for the experiments in Section 5.3. Here \bm{q}_{k} and \bm{p}_{k} are the adjoint variables.
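A minimal sketch of this per-window sweep, assuming a callable jac(y) returning the Jacobian J(y) and an actuator matrix Bmat (both illustrative placeholders), is:

import numpy as np

def burgers_adjoint_gradients(ys, us, jac, Bmat, y_target, alpha, dt):
    # ys: states over the window; us: controls; returns per-step gradients (Eq. 40).
    n = ys[0].size
    I = np.eye(n)
    grads = []
    for k in range(len(us)):
        A_k = I - 0.5 * dt * jac(ys[k + 1])                        # Eq. (37)
        B_k = -I - 0.5 * dt * jac(ys[k])                           # Eq. (37)
        q_k = np.linalg.solve(A_k.T, dt * (ys[k + 1] - y_target))  # Eq. (38)
        p_k = dt * (ys[k] - y_target) - B_k.T @ q_k                # Eq. (39), kept for completeness
        grads.append(dt * (alpha * us[k] + Bmat.T @ q_k))          # Eq. (40)
    return grads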

8.4 Model Predictive Control (MPC)

In what follows, we introduce the MPC algorithm; to simplify notation, we omit the dependency on the spatial variable x. At each time instant t_{k}, MPC solves a finite-horizon problem of N_{p} steps over the inputs \{\bm{u}(t_{i})\}_{i=k}^{k+N_{p}-1}; it applies the first input \bm{u}(t_{k}), then shifts the horizon and iterates (receding horizon).

Linear MPC for the Heat equation.

With Crank–Nicolson discretization and cosine-basis expansion for the control action, the dynamics are linear:

𝒚(ti+1)=𝑨𝒚(ti)+𝑩𝒄(ti)+𝒈,𝒚(ti)Nx,𝒄(ti)M.\bm{y}(t_{i+1})\;=\;\bm{A}\,\bm{y}(t_{i})\;+\;\bm{B}\,\bm{c}(t_{i})\;+\;\bm{g},\qquad\bm{y}(t_{i})\in\mathbb{R}^{N_{x}},\;\;\bm{c}(t_{i})\in\mathbb{R}^{M}. (41)

We solve the convex QP

min{𝒄(ti)}i=kk+Np1\displaystyle\min_{\{\bm{c}(t_{i})\}_{i=k}^{k+N_{p}-1}} i=kk+Np1𝒚(ti)𝒚target𝑻Q2+𝒄(ti)𝑹2+𝒚(tk+Np)𝒚target𝑸2\displaystyle\sum_{i=k}^{k+N_{p}-1}\|\bm{y}(t_{i})-\bm{y}_{\text{target}}\|_{\bm{T}_{Q}}^{2}+\|\bm{c}(t_{i})\|_{\bm{R}}^{2}+\|\bm{y}(t_{k+N_{p}})-\bm{y}_{\text{target}}\|_{\bm{Q}}^{2} (42)
s.t. (41)\displaystyle(\ref{eq:lmpc_dyn})
cmin𝒄(ti)cmax.\displaystyle c_{\min}\leq\bm{c}(t_{i})\leq c_{\max}.

Here, 𝑨=ML1MR\bm{A}=M_{L}^{-1}M_{R}, 𝑩=ML1(Δt𝑩c)\bm{B}=M_{L}^{-1}(\Delta t\,\bm{B}_{c}), 𝐠=ML1(Δt𝐬)\mathbf{g}=M_{L}^{-1}(\Delta t\,\mathbf{s}) (Neumann BCs), 𝐁c\mathbf{B}_{c} stacks cos(jπx/X)\cos(j\pi x/X); (42) is solved with CVXPY/OSQP [31]. Here MLM_{L} and MRM_{R} are the CN matrices of the system, obtained from the semi-discrete PDE operator AcA_{c}:

ML=IΔt2Ac,MR=I+Δt2Ac.M_{L}\;=\;I-\tfrac{\Delta t}{2}\,A_{c},\qquad M_{R}\;=\;I+\tfrac{\Delta t}{2}\,A_{c}. (43)

This way one CN step writes as:

ML𝒚(tk+1)=MR𝒚(tk)+Δt𝑩c𝒄(tk)+Δt𝒔.M_{L}\,\bm{y}(t_{k+1})\;=\;M_{R}\,\bm{y}(t_{k})\;+\;\Delta t\,\bm{B}_{c}\,\bm{c}(t_{k})\;+\;\Delta t\,\bm{s}. (44)
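A minimal CVXPY sketch of one receding-horizon LMPC step, assuming the CN matrices A, B, g and positive semidefinite weights T_Q, R, Q are given (all names are illustrative), is shown below; only the first optimized input is applied before the horizon is shifted.

import cvxpy as cp
import numpy as np

def lmpc_step(y_k, A, B, g, TQ, R, Q, y_target, Np, c_min, c_max):
    M = B.shape[1]
    c = cp.Variable((Np, M))
    y, cost = y_k, 0
    for i in range(Np):
        cost += cp.quad_form(y - y_target, TQ) + cp.quad_form(c[i], R)  # running cost
        y = A @ y + B @ c[i] + g                                        # linear CN dynamics (41)
    cost += cp.quad_form(y - y_target, Q)                               # terminal cost
    prob = cp.Problem(cp.Minimize(cost), [c >= c_min, c <= c_max])
    prob.solve(solver=cp.OSQP, warm_start=True)
    return c.value[0]                                                   # apply the first input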

Nonlinear MPC for the Burgers equation.

For viscous Burgers', the CN step is nonlinear due to the convection term; we write one step as the implicit map

𝒚(ti+1)=𝒢(𝒚(ti),𝒄(ti);Δt),\bm{y}(t_{i+1})\;=\;\mathcal{G}(\bm{y}(t_{i}),\bm{c}(t_{i});\Delta t), (45)

where \mathcal{G} is the unique CN solution of the Burgers' step with input \bm{B}\bm{c}(t_{i}), computed via fsolve on the CN residual with Dirichlet endpoints imposed at each step.

min{𝒄(ti)}i=kk+Np1\displaystyle\min_{\{\bm{c}(t_{i})\}_{i=k}^{k+N_{p}-1}} i=kk+Np1𝒚(ti)𝒚target𝑻Q2+𝒄(ti)𝑹2+𝒚(tk+Np)𝒚target𝑸2\displaystyle\sum_{i=k}^{k+N_{p}-1}\|\bm{y}(t_{i})-\bm{y}_{\text{target}}\|_{\bm{T}_{Q}}^{2}+\|\bm{c}(t_{i})\|_{\bm{R}}^{2}+\|\bm{y}(t_{k+N_{p}})-\bm{y}_{\text{target}}\|_{\bm{Q}}^{2} (46)
s.t. 45,cmin𝒄(ti)cmax.\displaystyle\ref{eq:nmpc_dyn},\;\;c_{\min}\leq\bm{c}(t_{i})\leq c_{\max}.

We use single shooting over the full-order model: propagate with (45), optimize with SLSQP [32] under box bounds, apply the first input \bm{c}(t_{k}), advance one CN step, and iterate. Here \bm{B} stacks the \sin((j{+}1)\pi x/X) shapes used in the implementation. Notably, \bm{T}_{Q}, \bm{R}, and \bm{Q} represent the weight matrices, which correspond to the multipliers \bm{\lambda} in our PDE-OP method.
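A minimal single-shooting sketch of one NMPC step is given below. It assumes a user-supplied residual function cn_residual(y_next, y, u) defining the implicit CN map (45), an actuator matrix Bmat, and weight matrices TQ, R, Q; these names are illustrative placeholders rather than the released code.

import numpy as np
from scipy.optimize import fsolve, minimize

def cn_step(y, u, cn_residual):
    # Solve the implicit CN residual for the next state (the map G in Eq. 45).
    return fsolve(lambda y_next: cn_residual(y_next, y, u), y)

def nmpc_step(y_k, Bmat, y_target, TQ, R, Q, Np, c_min, c_max, cn_residual, c_warm=None):
    M = Bmat.shape[1]

    def horizon_cost(c_flat):
        c = c_flat.reshape(Np, M)
        y, J = y_k, 0.0
        for i in range(Np):
            J += (y - y_target) @ TQ @ (y - y_target) + c[i] @ R @ c[i]  # running cost
            y = cn_step(y, Bmat @ c[i], cn_residual)                     # full-order rollout
        return J + (y - y_target) @ Q @ (y - y_target)                   # terminal cost

    c0 = np.zeros(Np * M) if c_warm is None else c_warm
    res = minimize(horizon_cost, c0, method="SLSQP",
                   bounds=[(c_min, c_max)] * (Np * M))
    # Apply only the first input; res.x can warm start the next (shifted) horizon.
    return res.x.reshape(Np, M)[0], res.x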

Choice of controller.

For the voltage optimization task, we use the Direct and Adjoint methods in open loop. For the heat equation, CN yields a linear system, so the finite-horizon problem is a convex QP, which we solve with Linear MPC. For Burgers' equation, the dynamics are nonlinear and CN produces an implicit nonlinear update, so we use Nonlinear MPC, which optimizes over the implicit CN step. In both the heat and Burgers' tasks, we also run the closed-loop Adjoint method and report its results.

9 Additional Experimental Results

In this section, we evaluate the runtime and terminal tracking error (MSE) achieved by each method on polynomial target profiles, and report these metrics in Tables 2, 3, and 4. All methods use the same spatial/temporal discretization and solver settings described in Section 10.

Figure 9: Optimal Control of the 1D Burgers’ equation with direct modeling; comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

9.1 Voltage Control Optimization

We test each method on additional target profiles y_{\text{target}}(x)=p_{1}x+p_{2}: (i) constant: (p_{1},p_{2})=(0,1), so y_{\text{target}}(x)=1; (ii) ramp: (p_{1},p_{2})=(1,0.5), so y_{\text{target}}(x)=x+0.5. Since a ramp target profile is infeasible under zero-flux (Neumann) boundaries, all methods can only produce approximations.

Table 2: Voltage Optimization Task. For each method and target profile we report runtime (s) and MSE.
Methods ytarget(x)=1y_{\text{target}}(x)=1 ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5
Runtime MSE Runtime MSE
Direct (Finite Diff.) 2.422s 4.6498e-11 12.810s 0.00005
Adjoint-Method 0.049s 2.8213e-12 0.464s 0.00004
PDE-OP (ours) 0.037s 1.5275e-6 0.050s 0.00007

Results.

Constant target y_{\text{target}}(x)=1. The Adjoint method attains the highest accuracy (MSE 2.82\times 10^{-12}) in 0.049 s, while the Direct method is much slower at 2.422 s with MSE 4.65\times 10^{-11}. PDE-OP is fastest at 0.037 s (about 65.5\times faster than Direct and 1.32\times faster than Adjoint) and reports an MSE of 1.53\times 10^{-6}.

Ramp target y_{\text{target}}(x)=x+0.5. The Adjoint method yields the lowest error (4.0\times 10^{-5}), running in 0.464 s; the Direct method achieves 5.0\times 10^{-5} but takes 12.810 s. PDE-OP runs in 0.050 s, about 256\times faster than Direct and 9.3\times faster than the Adjoint method, producing an MSE of 7.0\times 10^{-5}, which is of the same order of magnitude as the baselines.

9.2 Optimal Control of the 1D Heat Equation

For the 1D heat equation (closed-loop control), we consider affine targets y_{\text{target}}(x)=p_{1}x+p_{2} with (p_{1},p_{2})=(0,1) and (1,0.5). Under zero-flux (Neumann) boundary conditions, the ramp profile is not exactly admissible, as its derivative at the boundaries is nonzero; all controllers therefore track its best achievable approximation under these constraints. Here, we compare the proposed PDE-OP model with LMPC and the Adjoint-Sensitivity Method.

Table 3: Heat Equation Task. For each method and target profile we report runtime (s) and MSE.
Methods ytarget(x)=1y_{\text{target}}(x)=1 ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5
Runtime MSE Runtime MSE
LMPC 0.443s 2.3289e-14 0.562s 0.00008
Adjoint-Method 0.485s 6.6612e-11 0.552s 0.00008
PDE-OP (ours) 0.223s 1.5210e-6 0.132s 0.00009

Results.

Constant target y_{\text{target}}(x)=1. LMPC produces a highly accurate terminal profile (MSE 2.33\times 10^{-14}) and runs in 0.443 s; the Adjoint method runs in 0.485 s with MSE 6.66\times 10^{-11}. PDE-OP is fastest at 0.223 s (about 2.0\times faster than LMPC and 2.2\times faster than Adjoint) with a still small error of 1.52\times 10^{-6}.

Ramp target y_{\text{target}}(x)=x+0.5. LMPC and Adjoint achieve the highest accuracy (MSE = 8.0\times 10^{-5}) with runtimes of 0.562 s and 0.552 s, respectively. PDE-OP runs in only 0.132 s (about 4.3\times faster than LMPC and 4.2\times faster than the Adjoint), with an MSE of 9.0\times 10^{-5}, the same order of magnitude as the baselines.

9.3 Optimal Control of the 1D Burgers’ Equation

For the 1D viscous Burgers' equation with Dirichlet boundaries y(t,0)=y(t,L)=0, admissible steady targets must satisfy these conditions. We therefore consider (i) a zero (constant) target y_{\text{target}}(x)=0, and (ii) a parabolic target y_{\text{target}}(x)=p_{1}\,x(1-x) (we use p_{1}=2.0 in the experiments). We compare our method with NMPC and the time-varying Adjoint baseline.

Table 4: Burgers' Optimization Task. For each method and target profile we report runtime (s) and MSE.
Methods ytarget(x)=0y_{\text{target}}(x)=0 ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x)
Runtime MSE Runtime MSE
NMPC 132.243s 5.6760e-14 671.033s 0.00006
Adjoint-Method 21.487s 5.8290e-11 122.116s 0.00013
PDE-OP (ours) 0.059s 0.00002 0.073s 0.00031

Results.

Zero target y_{\text{target}}(x)=0. NMPC reaches nearly machine precision (MSE 5.68\times 10^{-14}) but is slow at 132.243 s; the Adjoint baseline is faster at 21.487 s with MSE 5.83\times 10^{-11}. PDE-OP is fastest at 0.059 s (about 2.24\times 10^{3} times faster than NMPC and 3.64\times 10^{2} times faster than Adjoint) with a small, though larger, error of 2.0\times 10^{-5}.

Parabolic target ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1\!-\!x). NMPC attains the lowest error (MSE 6.0×𝟏𝟎𝟓\mathbf{6.0\times 10^{-5}}) in 671.033671.033 s; Adjoint yields 1.3×1041.3\times 10^{-4} at 122.116122.116 s. PDE-OP runs in 0.073\mathbf{0.073} s—about 9.19×1039.19\times 10^{3} faster than NMPC and 1.67×1031.67\times 10^{3} faster than Adjoint—with MSE 3.1×1043.1\times 10^{-4} (same order of magnitude).

Takeaway. For nonlinear Burgers dynamics with Dirichlet BCs, NMPC/Adjoint provide the best MSE, while PDE-OP delivers three to four orders of magnitude lower runtime with modest accuracy loss.

9.4 Discrete-Time Neural Operator with Direct Control Field u(x,t)u(x,t)

In this section, we describe a variant of our discrete-time neural operator that learns the spatio-temporal control field u(t,x) directly, rather than the basis-function weights introduced in Section 4.2. Similarly to the original DeepONet [1], PDE-OP's model \mathcal{Y}_{\theta} factors into a branch network that encodes input functions and a trunk network that encodes query coordinates. At each time t_{k}, given the current state y(t_{k},x) and the applied control u(t_{k},x), this model predicts the next state at coordinate x:

𝒚^(tk+1,x)=𝒴θ(t,x,𝒚^(tk,x),𝒖^(tk,x))\hat{\bm{y}}(t_{k+1},x)\;=\;\mathcal{Y}_{\theta}\!\big(t,\,x,\,\hat{\bm{y}}(t_{k},x),\,\hat{\bm{u}}({t_{k}},x)\big) (47)

Branch network. The branch net encodes the estimated system state 𝒚^(tk,x)\hat{\bm{y}}(t_{k},x) sampled at nn fixed sensor locations together with estimated control signals 𝒖^(tk,x)\hat{\bm{u}}(t_{k},x), and maps this concatenated input to a dd-dimensional vector of coefficients 𝒃(𝒚^(tk,x),𝒖^(tk))d\bm{b}\!\left(\hat{\bm{y}}(t_{k},x),\hat{\bm{u}}(t_{k})\right)\in\mathbb{R}^{d}.
Trunk network. The trunk net encodes the spatial query coordinate xx, producing a latent feature vector 𝜸(x)d\bm{\gamma}(x)\in\mathbb{R}^{d}.
The estimated subsequent (in time) state 𝒚^(tk+1,x)\hat{\bm{y}}(t_{k+1},x) at location xix_{i} is obtained by combining the branch coefficients and trunk features through an inner product:

y^(tk+1,xi)=j=1dbj(y^(tk,xi),𝒖^(tk,xi))γj(xi),i=1,,n.\hat{y}(t_{k+1},x_{i})\;=\;\sum_{j=1}^{d}b_{j}\!\left(\hat{{y}}(t_{k},x_{i}),\hat{\bm{u}}(t_{k},x_{i})\right)\,\gamma_{j}(x_{i}),\;\;i=1,\dots,n. (48)

Model initialization and training (direct uu). In this setting the dynamic predictor 𝒴θ\mathcal{Y}_{\theta} takes as input the control action sampled on the spatial grid, together with the current state. 𝒴θ\mathcal{Y}_{\theta} is initialized in a supervised fashion by generating the target trajectories with a numerical PDE solver under diverse time-varying controls u(t,x)u(t,x).

We sample admissible control sequences {u(tk,xi)}\{u(t_{k},x_{i})\} on the grid {xi}i=1n\{x_{i}\}_{i=1}^{n} from a zero-mean Gaussian random field (GRF) with prescribed spatial covariance (and optional temporal correlation), after which a clipping operation is applied to meet the actuator bounds uminu(tk,xi)umaxu_{\min}\leq u(t_{k},x_{i})\leq u_{\max}. Rolling the solver with these inputs produces states y(tk,)y(t_{k},\cdot) satisfying (1b)-(1f). Each trajectory is converted into one-step training pairs

(y(tk,xi),u(tk,xi))y(tk+1,xi),k=0,,T1,i=1,,n\big(y(t_{k},x_{i}),\,u(t_{k},x_{i})\big)\;\longmapsto\;y(t_{k+1},x_{i}),\qquad k=0,\dots,T-1,\quad i=1,\dots,n

Formally, the dataset consists of triples 𝒟={(𝒚(tk,x),𝒖(tk,x),𝒚(tk+1,x))}\mathcal{D}=\Big\{\big(\bm{y}(t_{k},x),\,\bm{u}(t_{k},x),\,\bm{y}(t_{k+1},x)\big)\Big\} , with 𝒚(tk,x)=[y(tk,x1),,y(tk,xn)]\bm{y}(t_{k},x)=[y(t_{k},x_{1}),\ldots,y(t_{k},x_{n})]^{\top} and 𝒖(tk,x)=[u(tk,x1),,u(tk,xn)]\bm{u}(t_{k},x)=[u(t_{k},x_{1}),\ldots,u(t_{k},x_{n})]^{\top}. 𝒴θ\mathcal{Y}_{\theta} is trained to minimize the following loss

minθ𝔼(𝒖,y)𝒟[y^(t,x)y(t,x)2],\min_{\theta}\;\mathbb{E}_{(\bm{u},y)\sim\mathcal{D}}\Big[\big\|\hat{y}(t,x)-y(t,x)\big\|^{2}\Big], (49)
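A minimal PyTorch sketch of this branch-trunk predictor (Eqs. 47-48) is shown below; the layer widths, sensor count, and feature dimension are illustrative and do not reproduce the exact architecture reported in the hyperparameter tables. Training then amounts to standard mini-batch regression of the one-step pairs in \mathcal{D} under the loss (49).

import torch
import torch.nn as nn

class DiscreteTimeOperator(nn.Module):
    def __init__(self, n_sensors, d=128, hidden=256):
        super().__init__()
        # Branch net: encodes the sampled state and control (y(t_k, .), u(t_k, .)).
        self.branch = nn.Sequential(
            nn.Linear(2 * n_sensors, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d))
        # Trunk net: encodes the spatial query coordinate x.
        self.trunk = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d))

    def forward(self, y_k, u_k, x_query):
        # y_k, u_k: (batch, n_sensors); x_query: (batch, n_query, 1).
        b = self.branch(torch.cat([y_k, u_k], dim=-1))   # (batch, d)
        gamma = self.trunk(x_query)                       # (batch, n_query, d)
        # Inner product over the feature dimension gives y_hat(t_{k+1}, x) (Eq. 48).
        return torch.einsum("bd,bqd->bq", b, gamma)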

9.5 MPC for Direct Control Field u(x,t)u(x,t)

Here we describe a “direct-u” variant for viscous Burgers' with Dirichlet boundaries, where the control action is parameterized at every spatial collocation point. At each time t_{k}, the optimization task consists of finding the control vector \bm{u}(t_{k},x)\in\mathbb{R}^{N_{x}} (one value per spatial collocation point) and applying an implicit Crank–Nicolson step to propagate the state from t_{k} to t_{k+1}.

min{𝒖(ti)}i=kk+Np1\displaystyle\min_{\{\bm{u}(t_{i})\}_{i=k}^{k+N_{p}-1}} i=kk+Np1𝒚(ti)𝒚target𝑻Q2+𝒖(ti)𝑹2+𝒚(tk+Np)𝒚target𝑸2\displaystyle\sum_{i=k}^{k+N_{p}-1}\|\bm{y}(t_{i})-\bm{y}_{\text{target}}\|_{\bm{T}_{Q}}^{2}+\|\bm{u}(t_{i})\|_{\bm{R}}^{2}+\|\bm{y}(t_{k+N_{p}})-\bm{y}_{\text{target}}\|_{\bm{Q}}^{2} (50)
s.t. 45,umin𝒖(ti)umax.\displaystyle\ref{eq:nmpc_dyn},\;\;u_{\min}\leq\bm{u}(t_{i})\leq u_{\max}.

In case of closed loop (NMPC), a horizon of length N_{p} yields N_{p}\,N_{x} decision variables per iteration; we warm start across receding horizons and use bound-constrained quasi-Newton methods (e.g., L-BFGS-B/SLSQP). Here \bm{T}_{Q}, \bm{R}, and \bm{Q} represent the weight matrices, which correspond to the Lagrange multipliers \bm{\lambda} in PDE-OP's training algorithm. To compute the gradients, we either backpropagate through the discrete CN integrator using automatic differentiation or use the fully discrete adjoint; both methods are exact but require one forward rollout and one backward sweep over the horizon. This grid-wise formulation removes any basis bias and shows the attainable tracking accuracy when the controller has full spatial freedom. However, while direct modeling of the control action does not require any basis representation, its computational cost is higher than that of the basis-based formulation: the per-step cost scales roughly with N_{p}\,N_{t}\,N_{x}, since each line-search/gradient evaluation entails multiple CN solves. We therefore include it as a comprehensive baseline to contextualize the efficiency of our reduced-parameter (mode-based) controllers.

9.6 Experimental Results For Direct Control Field u(t,x)u(t,x)

When using the direct control field modeling for u(t,x), we evaluate each method on the nonlinear, time-varying optimal control of the 1D Burgers' equation, using the same target profiles presented in Section 5.3. Tables 5-6 report the runtime (s) and terminal tracking error (MSE) of each method. All methods use the same spatial/temporal discretization and solver settings as in Section 10. Figure 9 shows the predicted state (central sub-figure of Fig. 9) of PDE-OP's dynamic component \mathcal{Y}_{\theta} at test time, compared to the solution (left sub-figure) obtained with a numerical PDE solver. Figures 10 and 11 compare PDE-OP's predicted terminal state \hat{y}(T,x) with those obtained using the baseline methods, over the same target profiles introduced in the main text. In Figure 10, we set N_{p}=1 for NMPC and N_{p}=10 for the Adjoint method, while in Figure 11, we set N_{p}=4 for NMPC and N_{p}=20 for the Adjoint method.

Table 5: PDE-OP vs. classical baselines with short horizon NpN_{p}. Here we use Np=1N_{p}{=}1 for NMPC and Np=10N_{p}{=}10 for adjoint. We report runtime (s) and MSE (lower is better).
ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x) ytarget(x)=0.8(sin(x))y_{\text{target}}(x)=0.8(\sin(x)) ytarget(x)=0y_{\text{target}}(x)=0
Method Runtime MSE Runtime MSE Runtime MSE
NMPC 39.254s 0.00002 76.368s 0.00014 36.483s 7.9141e-06
Adjoint-Method 47.805s 0.00124 79.644s 0.00317 24.956s 2.7618e-08
PDE-OP (ours) 0.267s 0.00003 0.256s 0.00008 0.257s 0.00008

Results.

Across all targets, PDE-OP is two orders of magnitude faster than the classical methods, between 100\times and 300\times faster than NMPC and the Adjoint. In terms of accuracy, for the sine target, PDE-OP achieves the lowest MSE; for the parabola, NMPC attains the best result with PDE-OP close behind; for the zero target, the Adjoint method achieves the lowest MSE, with PDE-OP producing an MSE on the order of 10^{-5}.

Table 6: PDE-OP vs. classical baselines with longer horizons. Here we use Np=4N_{p}{=}4 for NMPC and Np=20N_{p}{=}20 for adjoint. We report runtime (s) and MSE (lower is better).
ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x) ytarget(x)=0.8(sin(x))y_{\text{target}}(x)=0.8(\sin(x)) ytarget(x)=0y_{\text{target}}(x)=0
Method Runtime MSE Runtime MSE Runtime MSE
NMPC 826.176s 0.00016 3143.735s 0.00032 470.851s 0.00061
Adjoint-Method 186.910s 0.00129 263.891s 0.00316 51.702s 4.9213e-09
PDE-OP (ours) 0.254s 0.00003 0.277s 0.00008 0.956s 0.00008

Results.

With N_{p}=4 (NMPC) and N_{p}=20 (adjoint), PDE-OP remains dramatically faster, by three to four orders of magnitude, than the classical solvers. For the parabola target, PDE-OP is about 3250\times faster than NMPC and about 740\times faster than the adjoint, while also reporting the best MSE (3\times 10^{-5}). For the sine target, the speedups are about 11300\times vs. NMPC and 950\times vs. the adjoint, while again attaining the lowest MSE (8\times 10^{-5}). For the zero target, the Adjoint method attains near-perfect regulation (4.9\times 10^{-9}); PDE-OP is still the fastest, running in 0.956 s, approximately 54\times faster than the Adjoint method and 495\times faster than NMPC, while reporting a very small tracking error (8\times 10^{-5}).

Figure 10: Comparison between PDE-OP and each baseline method's solution on three different target profiles: y_{\text{target}}(x)=0.8\sin(x) (left), y_{\text{target}}(x)=2x(1-x) (center), and y_{\text{target}}(x)=0 (right). We set N_{p}=1 for NMPC and N_{p}=10 for the Adjoint method.
Figure 11: Comparison between PDE-OP and each baseline method's solution on three different target profiles: y_{\text{target}}(x)=0.8\sin(x) (left), y_{\text{target}}(x)=2x(1-x) (center), and y_{\text{target}}(x)=0 (right). We set N_{p}=4 for NMPC and N_{p}=20 for the Adjoint method.

10 Hyperparameter Settings

In Tables 7, 8, and 9, we provide the hyperparameter values of the PDE-OP model and of the baseline methods for the voltage optimization experiments; Tables 10, 11, and 12 report the corresponding values for the heat equation experiments; lastly, Tables 13, 14, and 15 report the values for the Burgers' equation experiments. For a fair evaluation, each method adopts the same PDE, grid, and horizon parameters n, N_{x}, N_{t}, [u_{\min},u_{\max}], [c_{\min},c_{\max}].

Table 7: Voltage equation: simulation parameters and model/training hyperparameters.
MLP [33] denotes a fully connected feedforward network.
Category Hyperparameter Value
PDE & Simulation Domain length LL 1.01.0
Final time TT 5.05.0
Diffusion DD 0.10.1
Leakage β\beta 1.01.0
Control gain α\alpha 2.02.0
Initial state V(x,0)V(x,0) 0.00.0
Reference Vref(x)V_{\text{ref}}(x) 1.01.0
Control bounds [umin,umax][u_{\min},u_{\max}] [1.0, 1.0][-1.0,\;1.0]
Discretization Solver grid (n)(n) 101101
Solver steps (Nt)(N_{t}) 101101
Number of sensors (Nx=n)(N_{x}=n) 101101
PDE-OP’s 𝒴θ\cal Y_{\theta} Learning Rate 5×1045\times 10^{-4}
Number of epochs 10001000
Optimizer Adam
Batch size 10241024
Activation function SiLU
Lagrange Multiplier ρ\rho 0.050.05
Branch net (b)(b): # layers 44
Branch net (b)(b): hidden sizes [512,512,512,512][512,512,512,512]
Trunk net (γ)(\gamma): # layers 44
Trunk net (γ)(\gamma): hidden sizes [512,512,512,512][512,512,512,512]
Feature dimension (d)(d) 512512
PDE-OP’s 𝒰ω\mathcal{U}_{\omega} Learning rate 1×1031\times 10^{-3}
Number of epochs 10001000
Optimizer AdamW
Batch size 256256
Activation function RELU
MLP: # layers 33
MLP: hidden sizes [256,256,256][256,256,256]
Table 8: Voltage equation: Direct (finite-difference) method parameters.
Category Hyperparameter Value / Note
Optimizer Algorithm L-BFGS-B (box bounds [umin,umax][u_{\min},u_{\max}])
Max iterations 100
Tolerances ftol=106\texttt{ftol}=10^{-6} (and/or gtol)
Initialization 𝒖(t0)=𝟎\bm{u}(t_{0})=\mathbf{0} (uniform)
Finite diff. Stencil forward (optionally central)
Step size ϵj=η(1+|uj|)\epsilon_{j}=\eta(1+|u_{j}|), η=106\eta=10^{-6}
Cost eval Rollout solver CN
Runtime note Nx+1N_{x}{+}1 evals/grad (or 2Nx2N_{x} for central)
Table 9: Voltage equation: Adjoint-method parameters.
Category Hyperparameter Value / Note
Optimizer Algorithm L-BFGS-B (box bounds [umin,umax][u_{\min},u_{\max}])
Max iterations 100
Tolerances ftol=106\texttt{ftol}=10^{-6} (and/or gtol)
Initialization 𝒖(t0)=𝟎\bm{u}(t_{0})=\mathbf{0}
Adjoint solves Forward stepper CN
Backward sweep discrete adjoint (one pass)
Numerics Linear solves direct (LU) or iterative tol 101010^{-10}
Table 10: Heat equation: simulation parameters and model/training hyperparameters.
LSTM [34] denotes a Long Short-Term Memory network.
Category Hyperparameter Value
PDE & Simulation Domain length LL 1.01.0
Final time TT 1.01.0
Diffusion DD 0.10.1
Leakage β\beta 0.50.5
Control gain α\alpha 2.02.0
Initial state V(x,0)V(x,0) 0.00.0
Reference Vref(x)V_{\text{ref}}(x) 0.00.0
Weight bounds [cmin,cmax][c_{\min},c_{\max}] [1.0, 1.0][-1.0,\;1.0]
Discretization Solver grid (n)(n) 4141
Solver steps (Nt)(N_{t}) 4141
Number of sensors (Nx=n)(N_{x}=n) 4141
Number of basis functions (M)(M) 66
PDE-OP’s 𝒴θ\cal Y_{\theta} Learning Rate 1×1031\times 10^{-3}
Number of epochs 20002000
Optimizer AdamW
Batch size 128128
Activation function RELU
Branch net (b)(b): # layers 44
Branch net (b)(b): hidden sizes [512,512,512,512][512,512,512,512]
Trunk net (γ)(\gamma): # layers 44
Trunk net (γ)(\gamma): hidden sizes [512,512,512,512][512,512,512,512]
Feature dimension (d)(d) 512512
PDE-OP’s 𝒰ω\mathcal{U}_{\omega} Learning rate 1×1041\times 10^{-4}
Number of epochs 50005000
Optimizer AdamW
Batch size 20482048
Activation function RELU
LSTM: # layers 22
LSTM: hidden sizes [256,256][256,256]
Table 11: Heat equation: Linear MPC (LMPC) parameters. (*) The basis functions are used by each method.
Category Parameter Value / Note
Horizon Prediction length NpN_{p} 1010
Dynamics State update 𝒚(tk+1)=𝑨𝒚(tk)+𝑩𝒖(tk)+𝒈\bm{y}(t_{k+1})=\bm{A}\bm{y}(t_{k})+\bm{B}\bm{u}(t_{k})+\bm{g}
Discretization Crank–Nicolson (linear)
Actuation Basis cos(jπx/L)\cos(j\pi x/L), j=0,,m1j=0,\dots,m-1
Solver QP solver OSQP (CVXPY), warm start
Stopping Tolerances default OSQP (primal/dual res.)
Table 12: Heat equation: Adjoint method parameters.
Category Parameter Value / Note
Horizon Prediction length NpN_{p} 1010
Forward Integrator Crank–Nicolson (linear), same 𝑨,𝑩,𝒈\bm{A},\bm{B},\bm{g}
Adjoint Backward recursion 𝝀Nt=𝑸(𝒚Nt𝒚target)\bm{\lambda}_{N_{t}}=\bm{Q}\,(\bm{y}_{N_{t}}-\bm{y}_{\text{target}})
𝝀k=𝑨𝝀k+1+𝑻Q(𝒚(tk)𝒚target)\ \bm{\lambda}_{k}=\bm{A}^{\top}\bm{\lambda}_{k+1}+\bm{T}_{Q}\,(\bm{y}(t_{k})-\bm{y}_{\text{target}})
Gradient Per step 𝒖(tk)Jh=𝑹𝒖(tk)+𝑩𝝀k+1.\nabla_{\bm{u}(t_{k})}J_{h}=\bm{R}\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{\lambda}_{k+1}.
Optimizer Algorithm L-BFGS-B, warm start
Stopping Max iters / tolerances 100; ftol=106\texttt{ftol}=10^{-6}
Table 13: Burgers’ equation: simulation parameters and model/training hyperparameters.
LSTM[34] denotes a Long Short-Term Memory network.
Category Hyperparameter Value
PDE & Simulation Domain length LL 1.01.0
Final time TT 4.04.0
Viscosity ν\nu 0.030.03
Weight bounds [cmin,cmax][c_{\min},c_{\max}] [1.0, 1.0][-1.0,\;1.0]
Discretization Solver grid (n)(n) 8181
Solver steps (Nt)(N_{t}) 201201
Number of sensors (Nx=n)(N_{x}=n) 8181
Number of basis functions (M)(M) 44
PDE-OP’s 𝒴θ\cal Y_{\theta} Learning Rate 1×1031\times 10^{-3}
Number of epochs 20002000
Optimizer AdamW
Batch size 512512
Activation function RELU
Branch net (b)(b): # layers 33
Branch net (b)(b): hidden sizes [256,256,256][256,256,256]
Trunk net (γ)(\gamma): # layers 55
Trunk net (γ)(\gamma): hidden sizes [256,256,256,256,256][256,256,256,256,256]
Feature dimension (d)(d) 128128
PDE-OP’s 𝒰ω\mathcal{U}_{\omega} Learning rate 1×1041\times 10^{-4}
Number of epochs 50005000
Optimizer AdamW
Batch size 20482048
Activation function RELU
LSTM: # layers 22
LSTM: hidden sizes [256,256][256,256]
Table 14: Burgers' equation: Nonlinear MPC (NMPC) parameters. (*) The basis functions are used by each method.
Category Parameter Value / Note
Horizon Prediction length NpN_{p} e.g. 1010 (receding horizon)
Dynamics Implicit step (CN) 𝒚(tk+1)=𝒢(𝒚(tk),𝒄(tk))\bm{y}(t_{k+1})\;=\;\mathcal{G}(\bm{y}(t_{k}),\bm{c}(t_{k}))
Actuation Basis sin((j+1)πx/X)\sin\!\big((j{+}1)\pi x/X\big), j=0,,M1j=0,\dots,M-1
Solver Nonlinear optimizer SLSQP (single-shooting), box bounds; warm start by shift
CN inner Newton solve per step tolerance 101010^{-10}, max iters 2525
Table 15: Burgers’ equation: Adjoint method parameters.
Category Parameter Value / Note
Horizon Prediction length N_{p} 10
Forward Implicit CN dynamics as in NMPC, Dirichlet enforced each step
Jacobian Semi-discrete residual J(𝒚)=𝑫diag(𝒚)+ν𝑫2J(\bm{y})=-\bm{D}\,\mathrm{diag}(\bm{y})+\nu\,\bm{D}^{2}
Adjoint Backward sweep Ak𝒒k=Δt(𝒚(tk+1)𝒚target)A_{k}^{\top}\bm{q}_{k}=\Delta t\,(\bm{y}(t_{k+1})-\bm{y}_{\text{target}});
𝒑k=Δt(𝒚(tk)𝒚target)Bk𝒒k\bm{p}_{k}=\Delta t\,(\bm{y}(t_{k})-\bm{y}_{\text{target}})-B_{k}^{\top}\bm{q}_{k}
Gradient Per-step control gradient 𝒖(tk)J=Δt(α𝒖(tk)+𝑩𝒒k),\nabla_{\bm{u}(t_{k})}J=\Delta t\big(\alpha\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{q}_{k}\big),
Optimizer Outer optimizer L-BFGS-B, warm start