Learning to Solve
Optimization Problems Constrained
with Partial Differential Equations

Yusuf Güven
Istanbul Technical University
Control and Automation Engineering Department
Istanbul, Turkey
ngu6fh@virginia.edu
&Vincenzo Di Vito and Ferdinando Fioretto
University of Virginia
Department of Computer Science
Charlottesville, USA
{eda8pc,fioretto}@virginia.edu
Equal Contribution

The implementation of the methods used to reproduce the experiments described in this work can be found at https://github.com/vdvf96/PDE-OP.
Abstract

Partial differential equation (PDE)-constrained optimization arises in many scientific and engineering domains, such as energy systems, fluid dynamics, and material design. In these problems, the decision variables (e.g., control inputs or design parameters) are tightly coupled with the PDE state variables, and the feasible set is implicitly defined by the governing PDE constraints. This coupling makes the problems computationally demanding, as it requires handling high-dimensional discretizations and dynamic constraints. To address these challenges, this paper introduces a learning-based framework that integrates a dynamic predictor with an optimization surrogate. The dynamic predictor, a novel time-discrete Neural Operator (Lu_2021, ), efficiently approximates system trajectories governed by PDE dynamics, while the optimization surrogate leverages proxy optimizer techniques (kotary2021end, ) to approximate the associated optimal decisions. This dual-network design enables real-time approximation of optimal strategies while explicitly capturing the coupling between decisions and PDE dynamics. We validate the proposed approach on benchmark PDE-constrained optimization tasks including Burgers’ equation, heat equation, and voltage regulation, and demonstrate that it achieves solution quality comparable to classical control-based algorithms such as the Direct Method and Model Predictive Control (MPC), while providing up to four orders of magnitude improvement in computational speed.


1 Introduction

Partial differential equations (PDEs) are central to modeling a wide range of physical and engineering systems, capturing phenomena that evolve jointly over space and time. Applications include fluid transport, modeled by Burgers’ equations, which are key in fluid mechanics and gas dynamics; thermal processes, captured by heat equations, which are also widely used in financial mathematics and image analysis; and voltage dynamics in power systems. Optimizing decisions in such domains requires solving PDE-constrained optimization problems, where control or design objectives must be optimized while ensuring consistency with the underlying governing PDE dynamics.

Despite its importance, PDE-constrained optimization remains computationally demanding. Traditional approaches typically rely on finite element and finite difference discretizations combined with adjoint-based optimization. These methods often suffer from high dimensionality, high computational cost, and convergence difficulties when applied to nonlinear dynamics. For example, when applied to the nonlinear Burgers’ equation, our experiments show that the adjoint-based and nonlinear MPC approaches require approximately $120\,\mathrm{s}$ and $878\,\mathrm{s}$, respectively; such computational costs hinder their applicability in large-scale or real-time decision-making tasks.

In various classes of continuous optimization tasks not constrained by PDEs, proxy optimizers (kotary2021end, ) have recently emerged as a powerful tool to quickly find approximate solutions. In parallel, researchers have focused on developing a variety of surrogates (raissi2017physics, ; kovachki2024neuraloperatorlearningmaps, ; li2021fourier, ; chen2019neural, ) for capturing system dynamics. Our approach is a unique combination of elements from both these areas, and consists of a dual-network architecture to tackle PDE-constrained optimization problems where both the state dynamics and the control decisions evolve in space and time.

Contributions. Specifically, the paper makes the following contributions: (1) It introduces a novel learning-based method to efficiently approximate solutions of PDE-constrained optimization problems. Our approach, PDE-OP (Optimization Proxy), consists of a dual-network architecture in which a dynamic predictor captures the system dynamics, while a surrogate controller approximates the optimal control actions. These components are trained in a synergistic manner in a self-supervised fashion using a primal–dual method. (2) It shows that the proposed time-discrete Neural Operator accurately captures the dynamics governed by PDEs. (3) It demonstrates that the proposed framework achieves decision quality comparable to state-of-the-art numerical solvers, such as Model Predictive Control and the Adjoint-sensitivity method, while providing several orders of magnitude improvements in computational speed. The approach is validated on a set of widely adopted PDE-constrained optimization tasks, including Burgers’, heat, and voltage equations. We believe that these results could boost the adoption of learning-based methods for control problems that require high-fidelity dynamic modeling and near real-time optimal decision-making.

2 Related Work

A central paradigm for decision-making coupled with dynamical systems is Model Predictive Control (MPC) (mayne2000constrained, ). It repeatedly solves an optimization problem over a finite time horizon using predicted system states. While MPC provides high-quality control decisions, its reliance on repeatedly solving an optimization problem makes it computationally expensive and limits its applicability in real-time settings. In a similar fashion, exact PDE-constrained optimization methods rely on time–space discretization of the governing equations, after which the resulting high-dimensional system is incorporated into an optimization problem (hinze2008optimization, ; troeltzsch2010optimal, ). While theoretically well-founded, these approaches quickly become impractical as the problem size grows, especially in the presence of nonlinear dynamics or when real-time feasibility is required. To mitigate these limitations, surrogate models have been developed for reduced-order adjoint solutions (zahr2014progressive, ), sensitivity PDEs (dihlmann2015certified, ), and learning PDE operators (Hwang_2022, ), offering significant computational advantages. Nevertheless, these methods often face trade-offs between accuracy and efficiency and limited adaptability across operating regimes (kovachki2024neuraloperatorlearningmaps, ; Lu_2021, ; willcox2021physics, ).

In parallel, to approximate solutions of constrained optimization problems in real-time settings, machine learning researchers have developed a class of methods known as proxy optimizers. These methods employ machine learning models to learn mappings from problem parameters or system configurations to optimal decisions, thereby significantly accelerating inference (kotary2021end, ) compared to classical optimization solvers. Both supervised approaches (fioretto2020lagrangian, ), which rely on precomputed optimal solutions, and self-supervised ones (park2023self, ), which exploit knowledge of the problem structure, have been proposed, showing promising results. A recurring challenge in this setting is ensuring feasibility: learned solutions may violate problem constraints. This issue has been tackled through penalty-based training (fioretto2020lagrangian, ), implicit-layer formulations that embed constraints directly into the model architecture (donti2020dc3, ), or post-processing projections to restore feasibility (kotary2024learningjointmodelsprediction, ).

A large body of recent work has also focused on learning surrogates for PDE dynamics. Physics-Informed Neural Networks (2019JCoPh.378..686R, ) directly encode PDE residuals into the loss function, enabling solutions that adhere to the governing equations. Neural operator approaches, such as the Fourier Neural Operator (li2021fourier, ) and DeepONet (Lu_2021, ), learn mappings between infinite-dimensional function spaces, which allows them to approximate solution operators of families of PDEs, with strong generalization across different boundary conditions and coefficients (kovachki2024neuraloperatorlearningmaps, ). Other operator-learning frameworks (li2020multipole, ; bhattacharya2021model, ) and kernel-based methods (raissi2021deepgreen, ) have also been developed. While these approaches excel at reproducing PDE solutions, they are typically not designed to incorporate decision variables or to directly solve PDE-constrained optimization problems. The closest related work is (divito2024learningsolvedifferentialequation, ), where the authors propose a learning-based surrogate solver to tackle ODE or SDE-constrained optimization tasks. Our approach goes further, by integrating operator-learning surrogates within the optimization pipeline, thereby coupling the representation of PDE dynamics with proxy optimizers and enabling real-time solutions to PDE-constrained optimization tasks.

3 Settings and Goals

This paper focuses on optimization problems constrained by a system of partial differential equations:

\operatorname*{Minimize}_{\,u:Q\to\mathbb{R}}\;\; \overbrace{L\big(y(T,\cdot)\big)+\int_{0}^{T}\!\!\int_{\Omega}\Phi\big(u(t,x),y(t,x)\big)\,dx\,dt+\int_{0}^{T}\!\!\int_{\Omega}\|u(t,x)\|_{2}^{2}\,dx\,dt}^{J(u(t,x),\,y(t,x))} (1a)
\text{s.t.}\;\; \partial_{t}y(t,x)=\mathcal{I}_{t}\big(y(t,x),u(t,x),t,x\big),\qquad (t,x)\in Q, (1b)
\partial_{x}y(t,x)=\mathcal{I}_{x}\big(y(t,x),u(t,x),t,x\big),\qquad (t,x)\in Q, (1c)
y(t_{0},x)=y_{t_{0}}(x),\qquad x\in\Omega, (1d)
y(t,x)=y_{\partial}(t,x),\qquad (t,x)\in[t_{0},T]\times\partial\Omega, (1e)
\bm{g}\big(u(t,x),y(t,x)\big)\le\bm{0},\quad \bm{h}\big(u(t,x),y(t,x)\big)=\bm{0},\qquad (t,x)\in Q. (1f)

In this formulation, the control action is denoted by $u:Q\to\mathbb{R}$, where $Q=[t_{0},T]\times\Omega$ is the space–time domain, with $t_{0},T\in\mathbb{R}$ being the initial time instant and the time horizon, respectively, and $\Omega$ being the spatial domain, e.g., if $x\in\mathbb{R}$, then $\Omega\subseteq\mathbb{R}$. The system state is represented by $y:Q\to\mathbb{R}$; $t$ and $x$ denote the independent temporal and spatial variables, respectively. The optimization objective (1a) combines a terminal cost $L\big(y(T,\cdot)\big)$, which depends on the state at the final time $T$, with a running cost $\Phi$, integrated over time and space, and a control-effort cost, which penalizes the energy used to control the system.

The system dynamics are governed by two PDE constraints. Equation (1b) models the temporal evolution of the state, while (1c) captures spatial dependencies, for example through diffusion–reaction relations. Initial conditions (1d) prescribe the state distribution at $t=t_{0}$, while boundary conditions (1e) specify the state behavior on the spatial boundary $\partial\Omega$. Finally, (1f) represents additional inequality and equality constraints on $(u,y)$, which may encode feasibility requirements or application-specific restrictions.

Refer to caption
Figure 1: Temperature regulation process: control actions $u(t,x)$ steer the system from the initial state $y(t_{0},x)$ towards a target state $y(T,x)$.

Temperature control example. In the context of thermal regulation problems, the decision variable $u(t,x)$ represents localized actuators applied along the medium, while the state variable $y(t,x)$ describes the temperature distribution over space and time. These two components are tightly coupled: the control inputs directly enter the governing PDE as forcing terms, shaping the spatio-temporal evolution of the state, while the optimization objective is evaluated on the resulting state trajectory. The system dynamics (1b)–(1c) follow a controlled reaction–diffusion equation, which models heat diffusion, dissipation, and externally applied control. The initial condition (1d) specifies the temperature distribution at $t=t_{0}$, and the boundary condition (1e) enforces zero-flux (Neumann) boundaries, preventing heat transfer across the domain boundary. The objective (1a) combines a terminal cost to match the target temperature profile, a running cost for deviations along the horizon, and a control-effort penalty. A simplified illustration is provided in Fig. 1, while a full mathematical description is given in Section 5.2.

Challenges. While fundamental for many applications, Problem (1) presents three key challenges:

  1. Finding optimal solutions to Problem (1) is computationally intractable. Even without the differential equation constraints, the decision version of the problem alone is NP-hard in general.

  2. Achieving high-quality approximations of the system dynamics (1b)–(1c) in near real-time poses the second significant challenge. The non-linearity of these dynamics further complicates the task.

  3. Finally, the integration of the system dynamics in the optimization framework renders traditional numerical methods impractical for real-time applications.

4 PDE-Optimization Proxy

To tackle the challenges discussed above, we introduce PDE-Optimization Proxy (PDE-OP): a fully differentiable surrogate for PDE-constrained optimization. PDE-OP adopts a dual-network architecture: the first network, $\mathcal{U}_{\omega}$, approximates the optimal decision variable $u^{*}(t,x)$, while the second, $\mathcal{Y}_{\theta}$, predicts the corresponding state variables $y(t,x)$ using a discrete-time neural operator architecture. Here, $\omega$ and $\theta$ denote the trainable parameters of $\mathcal{U}$ and $\mathcal{Y}$, respectively. These two components are jointly trained in an end-to-end differentiable manner, and a schematic illustration of their interaction is provided in Figure 2. In what follows, we first describe each network individually and then detail their integration within the proposed learning framework.

Learning Goal. PDE-OP trains on a distribution Π\Pi of problem instances generated by different states y(t,x)y(t,x) for various tt and xx values. For instance, in the thermal regulation example, this distribution corresponds to the variety of temperature profiles that arise from different control inputs. With reference to (1a), the learning objective is:

\operatorname*{Minimize}_{\omega,\theta}\;\mathbb{E}_{y(t,x)\sim\Pi}\left[\mathcal{J}\big(\mathcal{U}_{\omega}(t,x,y),\,\mathcal{Y}_{\theta}(t,x,\hat{u})\big)\right] (2a)
\text{s.t.}\;\;\text{(1b)–(1f)}, (2b)

where, to simplify the notation, we denote the estimated control variable as u^(t,x)=𝒰ω(t,x,y^)\hat{{u}}(t,x)\!=\!\mathcal{U}_{\omega}(t,x,\hat{y}), while the state variable estimate is denoted as y^(t,x)=𝒴θ(t,x,u^)\hat{{y}}(t,x)\!=\!\mathcal{Y}_{\theta}(t,x,\hat{u}). Given the complexity of solving Problem (2), the goal is to develop a fast and accurate neural PDE optimization surrogate. This approach uses the concept of proxy optimizers (kotary2021end, ), which is detailed next.

Refer to caption
Figure 2: At each time instant $t_{k}$, PDE-OP uses a dual-network architecture consisting of a proxy optimization model $\mathcal{U}_{\omega}$ to estimate the basis-function weights $\bm{c}(t_{k})$ and construct the control action $\hat{u}(t_{k},x)$, and a neural-DE model $\mathcal{Y}_{\theta}$ to estimate the state variables $\hat{y}(t_{k+1},x)$, with the objective function $\mathcal{J}(\hat{u}(t,x),\hat{y}(t,x))$ capturing the overall loss.

4.1 Surrogate Controller

The first objective is to establish a neural network-based mapping that transforms the system state $y(t,x)$ of the PDE-constrained optimization problem (1) into a control action $u(t,x)$. Practically, this mapping is implemented as a neural network $\mathcal{U}_{\omega}$, which takes the current system state $\hat{y}(t,x)$ and the desired terminal state $y(T,x)$ as input, and outputs an array of coefficients $\hat{\bm{c}}(t)$ that are then used to estimate the corresponding optimal decision variables. These coefficients are the weights used to combine a set of basis functions representing the control action, and are described next.

Basis function representation of the control action. To approximate the optimal control action using a neural network, we represent the control trajectory $u(t,x)$ as a linear combination of basis functions. This representation has also been proposed in (basis_cite_1, ; basis_cite_2, ). This design choice is motivated by the fact that, although the output of a neural network lies in a finite-dimensional space, combining it with continuous basis functions $\phi_{i}(x)$ allows us to construct a continuous approximation of the control trajectory $u(t,x)$. In this way, the controller model $\mathcal{U}_{\omega}$ operates in a low-dimensional space, while the basis functions map this finite number of outputs into a continuous-time signal. As shown in the additional experiments provided in Appendix 9.6, direct modeling of continuous-time control actions, either with a neural network or a classical method, leads to qualitatively similar performance to our basis expansion approach; however, while the computational cost of our method remains extremely low, the running time of the classical methods grows significantly. These results show that our method is suitable for real-time tasks.
Consider the control action u(tk,x){u}(t_{k},{x}), where tkt_{k} denotes a collocation point; at each time instant tkt_{k}, the control u(tk,x){u}(t_{k},{x}) is expressed as a weighted sum of MM continuous basis functions {ϕi(x)}i=1M\{\phi_{i}(x)\}_{i=1}^{M}:

u(t_{k},x)=\sum_{i=1}^{M}c_{i}(t_{k})\,\phi_{i}(x),\qquad t_{k}=0,\ldots,T, (3)

where 𝒄(tk)=[c1(tk)cM(tk)]\bm{c}(t_{k})=[c_{1}(t_{k})\;\ldots\;c_{M}(t_{k})]^{\top} denotes the coefficients of the basis functions at time instant tkt_{k}. For example, in one of our experiments, we adopt sinusoidal basis functions, e.g. ϕ1(x)=sin(πx/X)\phi_{1}(x)=\sin(\pi x/X), ϕ2(x)=cos(πx/X)\phi_{2}(x)=\cos(\pi x/X), \dots, ϕ2M1(x)=sin(Mπx/X)\phi_{2M-1}(x)=\sin(M\pi x/X), ϕ2M(x)=cos(Mπx/X)\phi_{2M}(x)=\cos(M\pi x/X), which provide a Fourier-like expansion of the control signal. Here MM denotes the number of sinusoidal modes included in the expansion, with higher values of MM allowing for better approximations of the control variable.
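For concreteness, the expansion in (3) can be implemented in a few lines of NumPy. The sketch below is illustrative rather than taken from our released code; the helper names (`sinusoidal_basis`, `control_from_weights`) and the domain length `X` are assumptions.

```python
import numpy as np

def sinusoidal_basis(x, n_modes, X=1.0):
    """Evaluate 2*n_modes sinusoidal basis functions phi_i(x) at the locations x."""
    feats = []
    for m in range(1, n_modes + 1):
        feats.append(np.sin(m * np.pi * x / X))
        feats.append(np.cos(m * np.pi * x / X))
    return np.stack(feats, axis=-1)            # shape (..., 2*n_modes)

def control_from_weights(coeffs, x, X=1.0):
    """Reconstruct u(t_k, x) = sum_i c_i(t_k) * phi_i(x) from predicted weights, Eq. (3)."""
    phi = sinusoidal_basis(x, n_modes=coeffs.shape[-1] // 2, X=X)
    return phi @ coeffs                         # control evaluated on the spatial grid

# Toy usage: 8 basis functions (4 sine/cosine pairs) on a 64-point spatial grid.
x_grid = np.linspace(0.0, 1.0, 64)
c_k = np.random.randn(8)                        # stand-in for the controller output c(t_k)
u_k = control_from_weights(c_k, x_grid)         # shape (64,)
```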

Surrogate controller model training. The role of the controller surrogate 𝒰ω\mathcal{U}_{\omega} is to predict the array of coefficients 𝒄(tk)=[c1(tk)cM(tk)]T\bm{c}(t_{k})=[c_{1}(t_{k})\;\dots\;c_{M}(t_{k})]^{T}. Formally, the network 𝒰ω\mathcal{U}_{\omega} approximates the function

U:\big(y(t_{k},x),\,y(T,x)\big)\mapsto\bm{c}(t_{k}), (4)

which maps the state variable $y(t_{k},x)$ and the desired terminal state $y(T,x)$ to the corresponding controller weights $\bm{c}(t_{k})$, subsequently used to construct the control action $u(t_{k},x)$ according to (3). The surrogate controller model $\mathcal{U}_{\omega}$ is trained on-the-fly, using a dataset $\mathcal{D}=\{(\hat{\bm{y}}^{i}(t_{k},x),\,y(T,x))\}_{i=1}^{N}$, where $\hat{\bm{y}}(t_{k},x)$ denotes the state variables estimated by the dynamic predictor $\mathcal{Y}_{\theta}$, $y(T,x)$ is the desired terminal state, and $N$ is the number of samples. Here, bold symbols $\bm{y}(t_{k},x)$ denote the space-discrete version of the corresponding function $y(t_{k},x)$, with each collocation point corresponding to a specific output of $\mathcal{Y}_{\theta}$, e.g., $\bm{y}(t_{k},x)=\big[\hat{y}(t_{k},x_{1})\;\ldots\;\hat{y}(t_{k},x_{n})\big]^{\top}$. Note that this setting is unsupervised: optimal weights $\bm{c}(t_{k})$ are unknown, and candidate weights $\hat{\bm{c}}(t_{k})=\mathcal{U}_{\omega}(\hat{\bm{y}}(t_{k},x),y(T,x))$ are evaluated solely by the objective value $\mathcal{J}$ in (1a) they induce. Figure 2 illustrates the interaction between PDE-OP’s proxy optimizer $\mathcal{U}_{\omega}$ and dynamic predictor $\mathcal{Y}_{\theta}$ at time instant $t=t_{k}$. Given the predicted state $\hat{y}(t_{k},x)$ (for $k=0$ we have $\hat{y}(t_{0},x)=y_{0}(x)$), the surrogate controller outputs an estimate of $\bm{c}(t_{k})$, which is then used to construct the control action $u(t_{k},x)$; this is subsequently used by $\mathcal{Y}_{\theta}$ to predict the next state $\hat{y}(t_{k+1},x)$. This process is repeated at each collocation point, until the terminal state $y(T,x)$ is estimated.
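A minimal sketch of this closed-loop interaction is given below, assuming both networks are PyTorch modules whose input and output dimensions match; the function name `rollout` and the concatenation-based interfaces are illustrative choices, not prescribed by the architecture.

```python
import torch

def rollout(controller, predictor, y0, y_target, n_steps):
    """Unroll the coupled controller/predictor from y0 to the horizon (cf. Figure 2).

    controller: maps (current state, target state) -> basis weights c(t_k)
    predictor:  maps (current state, c(t_k))       -> next state y(t_{k+1})
    """
    y_k = y0
    states, weights = [y0], []
    for _ in range(n_steps):
        c_k = controller(torch.cat([y_k, y_target], dim=-1))   # proxy optimizer U_omega
        y_k = predictor(torch.cat([y_k, c_k], dim=-1))         # dynamic predictor Y_theta
        states.append(y_k)
        weights.append(c_k)
    # Both trajectories stay on the computational graph, so the objective J can be
    # back-propagated through the entire rollout.
    return torch.stack(states, dim=1), torch.stack(weights, dim=1)
```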
Our surrogate controller builds on the proxy optimizer method proposed in (park2023self, ), which allows self-supervised training without access to precomputed optimal solutions, while directly encouraging feasibility through primal–dual updates. Once trained, the model $\mathcal{U}_{\omega}$ can be used to generate near-optimal solutions at low cost. The next section discusses how the state variables are estimated using a discrete-time Neural Operator architecture.

4.2 Discrete-time Neural Operator

The second component of PDE-OP is a discrete-time Neural Operator architecture, here denoted as 𝒴θ\mathcal{Y}_{\theta}, which models the PDE solution operator. The proposed dynamic predictor employs two subnetworks: the first, which we denote branch network, encodes input functions (e.g., control signals), and the second, called trunk network, encodes the coordinates where the solution is evaluated (e.g., spatial locations). Their outputs are then combined to approximate the target operator. This architecture is inspired by (Lu_2021, ), a neural operator framework designed to approximate mappings between infinite-dimensional function spaces.

Recall that the PDE-constrained problem couples system dynamics and decision variables. To account for this coupling, the proposed architecture incorporates both the system state and the controller output. Given the current estimated state $\hat{\bm{y}}(t_{k},x)$ and the control weights $\hat{\bm{c}}(t_{k})$ estimated by $\mathcal{U}_{\omega}$, the network $\mathcal{Y}_{\theta}$ outputs an estimate of the state at the subsequent collocation point in time:

\hat{\bm{y}}(t_{k+1},x)=\mathcal{Y}_{\theta}\big(t,x,\hat{\bm{y}}(t_{k},x),\hat{\bm{c}}(t_{k})\big). (5)

Rolling out (5) across k=0,,T1k=0,\ldots,T-1 yields the full estimated state trajectory, {𝒚^(tk,x)}k=0T\{\hat{\bm{y}}(t_{k},x)\}_{k=0}^{T}.

Branch network. The branch net encodes the estimated system state sampled at nn fixed sensor locations, 𝒚^(tk,x)\hat{\bm{y}}(t_{k},x), together with estimated control weights 𝒄^(tk)\hat{\bm{c}}(t_{k}), and maps this concatenated input to a dd-dimensional vector of coefficients 𝒃(𝒚^(tk,x),𝒄^(tk))d\bm{b}\!\left(\hat{\bm{y}}(t_{k},x),\hat{\bm{c}}(t_{k})\right)\in\mathbb{R}^{d}.
Trunk network. The trunk net encodes the spatial query coordinate xx, producing a latent feature vector 𝜸(x)d\bm{\gamma}(x)\in\mathbb{R}^{d}.
The estimated subsequent (in time) state 𝒚^(tk+1,x)\hat{\bm{y}}(t_{k+1},x) at location xix_{i} is obtained by combining the branch coefficients and trunk features through an inner product:

\hat{y}(t_{k+1},x_{i})=\sum_{j=1}^{d}b_{j}\big(\hat{y}(t_{k},x_{i}),\hat{\bm{c}}(t_{k})\big)\,\gamma_{j}(x_{i}),\qquad i=1,\dots,n. (6)

The remainder of this section details the method by which the state variables are precisely estimated using the proposed discrete-time Neural Operator.
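To make the branch–trunk combination in (6) concrete, the following PyTorch sketch shows one possible one-step operator; layer widths, activations, and the class name `DiscreteTimeOperator` are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DiscreteTimeOperator(nn.Module):
    """One-step map (y(t_k, .), c(t_k)) -> y(t_{k+1}, .), evaluated at query locations x."""

    def __init__(self, n_sensors, n_weights, latent_dim=64):
        super().__init__()
        # Branch net: encodes the state samples and the control weights.
        self.branch = nn.Sequential(
            nn.Linear(n_sensors + n_weights, 128), nn.Tanh(),
            nn.Linear(128, latent_dim))
        # Trunk net: encodes the spatial query coordinate.
        self.trunk = nn.Sequential(
            nn.Linear(1, 128), nn.Tanh(),
            nn.Linear(128, latent_dim))

    def forward(self, y_k, c_k, x_query):
        b = self.branch(torch.cat([y_k, c_k], dim=-1))   # (batch, d) branch coefficients
        g = self.trunk(x_query)                          # (n_query, d) trunk features
        return b @ g.T                                   # inner product of Eq. (6): (batch, n_query)
```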

Model initialization and training. Since the proposed time-discrete neural operator model $\mathcal{Y}_{\theta}$ takes as input the weights of the basis functions estimated by the surrogate controller $\mathcal{U}_{\omega}$, it is practical to pre-initialize the dynamic predictor model. To achieve this, we initialize $\mathcal{Y}_{\theta}$ in a supervised fashion; its training data are generated by sampling time-varying control weight sequences $\bm{c}(t_{k})$ from a Gaussian distribution, e.g., $c_{i}(t_{k})\sim\mathcal{N}(0,\sigma_{i})$, and by specifying the terminal state $y(T,x)$. The corresponding solution $y(t,x)$ satisfying (1b)–(1f) is computed using a numerical PDE solver. Each trajectory is discretized into one-step training samples of the form $(y(t_{k},x),\bm{c}(t_{k}))\mapsto y(t_{k+1},x)$. Formally, the dataset of training pairs is given by $\mathcal{D}=\big\{\big(y^{i}(t_{k},x),\bm{c}^{i}(t_{k}),y^{i}(t_{k+1},x)\big)\;\big|\;t_{k}\in\{0,\dots,T-1\}\big\}_{i=1}^{N}$. The network $\mathcal{Y}_{\theta}$ is trained to minimize the prediction error between its outputs and the ground-truth values. Formally, this model is trained by minimizing the loss:

\min_{\theta}\;\mathbb{E}_{(\bm{c},y)\sim\mathcal{D}}\Big[\big\|\hat{y}(t,x)-y(t,x)\big\|^{2}\Big], (7)

where 𝒟\mathcal{D} denotes the dataset of pairs (𝒄,y)(\bm{c},y). This training strategy ensures that 𝒴θ\mathcal{Y}_{\theta} learns to approximate the PDE solution operator given a distribution of weights 𝒄(t)\bm{c}(t) approximating the distribution of estimated weights generated by the surrogate controller 𝒰ω\mathcal{U}_{\omega}, so that it can provide accurate PDE-solutions given the predicted controller weights 𝒄^(t)\hat{\bm{c}}(t).
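A sketch of this supervised initialization is shown below, assuming the one-step samples have already been assembled into tensors; the optimizer choice, batch size, and the helper name `pretrain_operator` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pretrain_operator(model, y_curr, c_curr, y_next, x_query, epochs=50, lr=1e-3):
    """Fit the one-step map (y(t_k,.), c(t_k)) -> y(t_{k+1},.) with the MSE loss of Eq. (7)."""
    loader = DataLoader(TensorDataset(y_curr, c_curr, y_next), batch_size=64, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for y_k, c_k, y_kp1 in loader:
            pred = model(y_k, c_k, x_query)          # e.g., the branch/trunk operator above
            loss = torch.mean((pred - y_kp1) ** 2)   # one-step prediction error
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```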

4.3 Handling Static and Dynamic Constraints Jointly

To integrate the proposed time-discrete neural operator within the decision process, this paper proposes a primal–dual learning approach, inspired by the augmented Lagrangian relaxation technique (Hestenes1969MultiplierAG, ) adopted in classic optimization. In Lagrangian relaxation, some or all of the problem constraints are relaxed into the objective function using Lagrangian multipliers to capture the penalties induced by violating them. The proposed formulation leverages Lagrangian duality to integrate trainable regularization terms that encapsulate violations of the constraint functions. When all the constraints are relaxed, the violation-based Lagrangian of problem (1) is

\operatorname*{Minimize}_{u}\;\mathcal{J}(u(t,x),y(t,x))+\bm{\lambda}^{\top}_{h^{\prime}}\,\big|\bm{h}^{\prime}(u(t,x),y(t,x))\big|+\bm{\lambda}^{\top}_{g}\max\big(0,\bm{g}(u(t,x),y(t,x))\big),

where 𝒥\mathcal{J}, 𝒈\bm{g} are defined in (1a), and (1f), respectively, and 𝒉\bm{h}^{\prime} is defined as follows,

\bm{h}^{\prime}(u,y)\;=\;\begin{bmatrix}\partial_{t}y(t,x)-\mathcal{I}_{t}\big(y(t,x),u(t,x),t,x\big)\\ \partial_{x}y(t,x)-\mathcal{I}_{x}\big(y(t,x),u(t,x),t,x\big)\\ y(0,x)-y_{0}(x)\\ y(t,x)-y_{\partial}(t,x)\\ \bm{h}(u,y)\end{bmatrix}=\bm{0}. (8)

It denotes the set of all equality constraints of problem (1), thus extending the constraints $\bm{h}$ in (1f) with the system dynamics (1b)–(1c) and the initial and boundary conditions (1d)–(1e), written in implicit form as above. Therein, $\bm{\lambda}_{h^{\prime}}$ and $\bm{\lambda}_{g}$ are the vectors of Lagrange multipliers associated with functions $\bm{h}^{\prime}$ and $\bm{g}$, e.g., $\lambda^{i}_{h^{\prime}}$ and $\lambda^{j}_{g}$ are associated with the $i$-th equality $h^{\prime}_{i}$ in $\bm{h}^{\prime}$ and the $j$-th inequality $g_{j}$ in $\bm{g}$, respectively. The key advantage of expressing the system dynamics (1b)–(1c) and the initial and boundary conditions (1d)–(1e) in the same implicit form as the equality constraints $\bm{h}$, as shown in (8), is that the system dynamics can be treated in the same manner as the constraint functions $\bm{h}$. This enables us to satisfy both the system dynamics and the static set of constraints, ensuring that they are incorporated seamlessly into the optimization process.

The proposed primal–dual learning method uses an iterative approach to find good values of the primal ($\omega$, $\theta$) and dual ($\bm{\lambda}_{h^{\prime}}$, $\bm{\lambda}_{g}$) variables; it uses an augmented Lagrangian as the loss function to train the predictions $\hat{u}(t,x)$ and $\hat{y}(t,x)$:

\mathcal{L}^{\text{PDE-OP}}(\hat{u},\hat{y})=\mathcal{J}(\hat{u}(t,x),\hat{y}(t,x))+\bm{\lambda}^{\top}_{h^{\prime}}\,\big|\bm{h}^{\prime}(\hat{u}(t,x),\hat{y}(t,x))\big|+\bm{\lambda}^{\top}_{g}\max\big(0,\bm{g}(\hat{u}(t,x),\hat{y}(t,x))\big), (9)

where $\mathcal{J}(\hat{u}(t,x),\hat{y}(t,x))$ represents the total objective cost, while $\bm{\lambda}^{\top}_{h^{\prime}}|\bm{h}^{\prime}(\hat{u}(t,x),\hat{y}(t,x))|$ and $\bm{\lambda}^{\top}_{g}\max(0,\bm{g}(\hat{u}(t,x),\hat{y}(t,x)))$ measure the constraint violations incurred by the predictions $\hat{u}(t,x)$ and $\hat{y}(t,x)$. The loss function in (9) combines an objective term, ensuring minimization of the objective function (1a), with a weighted penalty on violations of the constraint functions $\bm{h}^{\prime},\bm{g}$. This accounts for the contribution of both networks $\mathcal{U},\mathcal{Y}$ during training, which is balanced via iterative updates of the multipliers in (9) based on the amount of violation of the associated constraint function.
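In code, the loss (9) amounts to combining the objective with multiplier-weighted violation terms. The sketch below assumes the constraint residuals are exposed as callables returning stacked tensors; all names are illustrative.

```python
import torch

def pde_op_loss(u_hat, y_hat, objective, eq_residuals, ineq_residuals, lam_h, lam_g):
    """Violation-based Lagrangian of Eq. (9).

    objective(u, y)      -> scalar objective J(u, y)
    eq_residuals(u, y)   -> tensor stacking h'(u, y): dynamics, initial/boundary terms, h
    ineq_residuals(u, y) -> tensor stacking g(u, y)
    lam_h, lam_g         -> current (non-trainable) Lagrange multiplier vectors
    """
    h_viol = eq_residuals(u_hat, y_hat).abs()                    # |h'(u, y)|
    g_viol = torch.clamp(ineq_residuals(u_hat, y_hat), min=0.0)  # max(0, g(u, y))
    return objective(u_hat, y_hat) + (lam_h * h_viol).sum() + (lam_g * g_viol).sum()
```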

At iteration j+1j+1, finding the optimal parameters ω,θ\omega,\theta requires solving

\omega^{j+1},\theta^{j+1}=\operatorname*{arg\,min}_{\omega,\theta}\;\mathbb{E}_{y(t,x)\sim\mathcal{D}}\left[\mathcal{L}^{\text{PDE-OP}}\left(\mathcal{U}_{\omega^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{y}),\,\mathcal{Y}_{\theta^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{u})\right)\right], (10)

where $\mathcal{U}_{\omega}^{\bm{\lambda}^{j}}$ and $\mathcal{Y}_{\theta}^{\bm{\lambda}^{j}}$ denote PDE-OP’s optimization and predictor models $\mathcal{U}_{\omega}$ and $\mathcal{Y}_{\theta}$ at iteration $j$, with $\bm{\lambda}^{j}=[\bm{\lambda}^{j}_{h^{\prime}}\ \bm{\lambda}^{j}_{g}]^{\top}$. This step is approximated using a stochastic gradient descent method

\omega^{j+1}=\omega^{j}-\eta\,\nabla_{\omega}\mathcal{L}^{\text{PDE-OP}}\left(\mathcal{U}_{\omega^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{y}),\,\mathcal{Y}_{\theta^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{u})\right)
\theta^{j+1}=\theta^{j}-\eta\,\nabla_{\theta}\mathcal{L}^{\text{PDE-OP}}\left(\mathcal{U}_{\omega^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{y}),\,\mathcal{Y}_{\theta^{j}}^{\bm{\lambda}^{j}}(t,x,\hat{u})\right) (11)

where $\eta$ denotes the learning rate and $\nabla_{\omega}\mathcal{L}$ and $\nabla_{\theta}\mathcal{L}$ represent the gradients of the loss function $\mathcal{L}$ with respect to the parameters $\omega$ and $\theta$, respectively, at the current iteration $j$. Importantly, this step does not recompute the training parameters from scratch, but updates the weights $\omega$, $\theta$ based on their values at the previous iteration. Finally, the Lagrange multipliers are updated as

\bm{\lambda}^{j+1}_{h^{\prime}}=\bm{\lambda}^{j}_{h^{\prime}}+\rho\,\big|\bm{h}^{\prime}(\hat{u}(t,x),\hat{y}(t,x))\big|
\bm{\lambda}^{j+1}_{g}=\bm{\lambda}^{j}_{g}+\rho\,\max\big(\bm{0},\bm{g}(\hat{u}(t,x),\hat{y}(t,x))\big) (12)

where ρ\rho denotes the Lagrange step size and the max operation is performed element-wise.
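One outer iteration of (10)–(12) can be sketched as follows; averaging the violations over the batch before the dual update, the step size `rho`, and the function name `primal_dual_step` are assumptions made for illustration.

```python
import torch

def primal_dual_step(optimizer, lam_h, lam_g, u_hat, y_hat, loss_fn, residuals, rho=0.1):
    """One primal gradient step, Eq. (11), followed by the multiplier update, Eq. (12)."""
    loss = loss_fn(u_hat, y_hat, lam_h, lam_g)       # Lagrangian loss of Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates both omega and theta
    with torch.no_grad():                            # dual (multiplier) update
        h_res, g_res = residuals(u_hat, y_hat)
        lam_h += rho * h_res.abs().mean(dim=0)
        lam_g += rho * torch.clamp(g_res, min=0.0).mean(dim=0)
    return loss.item(), lam_h, lam_g
```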

Next, we evaluate our method on a range of PDE-constrained optimization tasks with varying levels of complexity, including cases with space–time varying control actions and nonlinear PDE dynamics.

5 Experimental Setting

In this section, we evaluate the proposed method on three optimal control tasks of increasing complexity. In these tasks, the dynamics are described by the following PDEs: voltage dynamics (linear PDE with time-invariant control input $u(x)$), the 1D heat equation (linear PDE with time-varying control input $u(t,x)$), and the 1D viscous Burgers’ equation (nonlinear PDE with time-varying control input $u(t,x)$). PDE-OP is compared against state-of-the-art approaches for optimal control problems: the Direct method (Direct-Method, ), the Adjoint-Sensitivity method (adjoint-method, ), and (nonlinear) Model Predictive Control (NMPC) (MPC, ). These methods are widely adopted in the literature, and all rely on numerical PDE solvers whenever dynamics or gradients are evaluated. The open-loop (direct/adjoint) control method requires a full forward rollout and a backward adjoint evaluation per gradient step, while MPC solves a finite-horizon problem at every control step (i.e., a receding horizon). The runtime of these methods is determined by repeated calls to the underlying numerical PDE solver and grows with the spatial grid size ($n$), horizon length ($T$), and complexity (e.g., nonlinearity) of the PDE dynamics. For each experiment and method, we report the accuracy, measured as the $L_{2}$ error between the terminal state and the predicted terminal state, $\|y(T,x)-\hat{y}(T,x)\|^{2}_{2}$, and the inference time of PDE-OP and the runtime of the classical methods, both expressed in seconds. For each task, we assess the capability of the proposed time-discrete neural operator to capture system dynamics by comparing its outputs against the solutions obtained from a numerical PDE solver.
Please refer to Appendices 8 and 10 for details on the baseline methods and hyperparameter settings, respectively. In Appendix 9.4 we report an ablation to evaluate the impact of the basis functions introduced in Section 4.1. In summary, this experiment shows that using a full formulation is impractical for NMPC, pushing solve times well beyond real-time limits. Finally, we show that Adjoint methods suffer from inaccurate gradients, which leads to a marked decrease in solution quality.

5.1 Voltage Control Optimization

The problem of regulating the voltage magnitude in a leaky electrical transmission line, subject to a reaction-diffusion PDE with Neumann conditions is described as follows:

\operatorname*{Minimize}_{u(x)}\quad \int_{\Omega}\big(y(T,x)-y_{\text{target}}(x)\big)^{2}\,dx+\gamma\int_{\Omega}u(x)^{2}\,dx (13a)
\text{s.t.}\quad \frac{\partial y}{\partial t}=D\frac{\partial^{2}y}{\partial x^{2}}-\beta\big(y-y_{\text{ref}}(x)\big)+\alpha u(x),\qquad \forall(x,t)\in\Omega\times[0,T] (13b)
y(0,x)=y_{0}(x),\qquad \forall x\in\Omega,\ t=0 (13c)
\frac{\partial y}{\partial x}(t,x)=0,\qquad \forall x\in\partial\Omega,\ t\in[0,T] (13d)
u_{\text{min}}\leq u(x)\leq u_{\text{max}},\qquad \forall x\in\Omega. (13e)

Here, $y(t,x)\in\mathbb{R}$ represents the voltage profile across the space–time domain defined by the Cartesian product of $\Omega$ and the time horizon $[0,T]$. The system dynamics, expressed by (13b), model the physical phenomena of diffusion ($D\frac{\partial^{2}y}{\partial x^{2}}$), voltage leakage towards a reference $y_{\text{ref}}(x)$ ($-\beta(y-y_{\text{ref}}(x))$), and the applied control action ($\alpha u(x)$). $D$, $\beta$, and $\alpha$ are constants representing the diffusion rate, leakage rate, and control gain, respectively. The system is initialized from a state $y_{0}(x)$ according to (13c) and is constrained by zero-flux Neumann boundary conditions (13d), implying no voltage diffusion across the boundaries of the domain. The control decision $u(x)$ is a space-dependent function representing the externally applied voltage, constrained by actuator limits in (13e). The objective (13a) minimizes a combination of terminal cost and control energy.

Datasets and methods. In the system dynamics (13b), we set the diffusion rate $D=0.1$, leakage rate $\beta=1.0$, and control gain $\alpha=2.0$, with a uniform reference voltage $y_{\text{ref}}(x)=1.0$. The ground-truth dynamics are simulated using a second-order implicit Crank–Nicolson finite-difference method (Crank_Nicolson_1947, ). To initialize the proposed time-discrete neural operator $\mathcal{Y}_{\theta}$, we generate a dataset by sampling a diverse set of control functions $u(x)$ from a zero-mean Gaussian Random Field (GRF) with a squared-exponential kernel, with mean $m(x)$ and covariance function $k_{\ell}(x_{1},x_{2})$ given by:

m(x)=0,\qquad k_{\ell}(x_{1},x_{2})\;=\;\sigma^{2}\exp\!\left(-\frac{\|x_{1}-x_{2}\|_{2}^{2}}{2\ell^{2}}\right), (14)

where >0\ell>0 is the length–scale and σ2>0\sigma^{2}>0 the marginal variance. The corresponding solution y(t,x)y(t,x) is computed using a numerical solver, and then split into 80%80\% training, 10%10\% validation, and 10%10\% test set.
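The GRF sampling step in (14) can be reproduced with a Cholesky factorization of the kernel matrix, as in the sketch below; the grid size, the default length scale, and the function name `sample_grf` are illustrative assumptions.

```python
import numpy as np

def sample_grf(x, length_scale=0.2, sigma2=1.0, n_samples=1, jitter=1e-8):
    """Draw zero-mean Gaussian-random-field samples with the squared-exponential kernel (14)."""
    d2 = (x[:, None] - x[None, :]) ** 2                     # pairwise squared distances
    K = sigma2 * np.exp(-d2 / (2.0 * length_scale ** 2))    # covariance matrix k_l(x_i, x_j)
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))     # jitter for numerical stability
    return (L @ np.random.randn(len(x), n_samples)).T       # shape (n_samples, len(x))

# e.g., 100 random control profiles u(x) on a 64-point spatial grid
u_samples = sample_grf(np.linspace(0.0, 1.0, 64), n_samples=100)
```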

PDE-OP’s model $\mathcal{Y}_{\theta}$ estimates the voltage profile $y(t,x)$; for $t=T$ this prediction yields $y(T,x)$, which is compared against the target profile $y_{\text{target}}(x)$. PDE-OP’s surrogate controller $\mathcal{U}_{\omega}$ is trained to produce an optimal control action $\hat{u}(x)$ while minimizing objective (13a). Our approach is benchmarked against the Direct and Adjoint methods, which are described in Appendices 8.1–8.2. At inference time, PDE-OP’s surrogate controller $\mathcal{U}_{\omega}$ is tested on three unseen target profiles of different levels of complexity: constant ($y_{\text{target}}(x)=p_{1}$), linear ramp ($y_{\text{target}}(x)=p_{1}x+p_{2}$), and sine wave ($y_{\text{target}}(x)=p_{1}+p_{2}(\sin f_{0}x+p_{3})$). For each target, the control $\hat{u}(x)$ is generated in a single forward pass.

Refer to caption
Figure 3: Voltage Control Optimization; comparison of PDE-OP and classical methods for three target profiles: ytarget(x)=1+0.2sin(6x)y_{\text{target}}(x)=1+0.2\sin(6x) (left), ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5 (center), and ytarget(x)=1y_{\text{target}}(x)=1 (right)
Refer to caption
Figure 4: Voltage control optimization; comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

Results. Figure 4 shows the predicted voltage (central sub-figure of Fig. 4) of PDE-OP’s dynamic component 𝒴θ\mathcal{Y}_{\theta} at test time, which is compared to the solution (left sub-figure) obtained with a numerical PDE-solver. The figure shows that the predicted voltage closely matches the numerical PDE-solver solution, with an absolute error (right sub-figure) on the order of 10210^{-2} across the space–time domain.

Figure 3 presents a comparison between PDE-OP’s predicted terminal state $\hat{y}(T,x)$ and that obtained using the classical methods, over 3 target profiles $y_{\text{target}}(x)$: sinusoidal, linear, and constant. This figure shows that the voltage distribution produced by PDE-OP is well aligned with those produced by the Direct and the Adjoint-sensitivity methods. As shown in the first row of Table 1, the Direct method is the slowest, requiring $26.097\,\mathrm{s}$ and yielding an MSE of $7.46\times10^{-3}$. The Adjoint method runs in $0.755\,\mathrm{s}$ with a similar accuracy of $7.13\times10^{-3}$. Notably, PDE-OP is the fastest by a significant margin: it runs in only $0.035\,\mathrm{s}$ ($\sim\!21.6\times$ faster than Adjoint; $\sim\!746\times$ faster than Direct) and yields the lowest MSE, $6.95\times10^{-3}$. These results show the potential of the proposed approach, which produces solutions that are qualitatively similar to (or even slightly better than) those of the classical methods, while being between $20$ and $750$ times faster.

Table 1: Comparison across PDE tasks and classical methods. All results use sinusoidal target profiles. Further details and results for linear and constant profiles can be found in Appendix 9.
PDE System                            Control Type   Method                  Runtime (s)   Accuracy (MSE)
Voltage Control (Reaction-Diffusion)  Open-Loop      Direct (Finite Diff.)    26.097        0.00746
                                                     Adjoint-Method            0.755        0.00713
                                                     PDE-OP (ours)             0.035        0.00695
1D Heat Equation (Linear PDE)         Closed-Loop    LMPC                      0.384        0.00035
                                                     Adjoint-Method            0.662        0.00035
                                                     PDE-OP (ours)             0.124        0.00035
1D Burgers’ Equation (Nonlinear PDE)  Closed-Loop    NMPC                    878.667        0.00007
                                                     Adjoint-Method          120.696        0.00023
                                                     PDE-OP (ours)             0.062        0.00030
Refer to caption
Figure 5: Optimal Control of the Heat Equation; comparison of PDE-OP and classical methods for three target profiles: ytarget(x)=0.6+0.3(sin(2x))y_{\text{target}}(x)=0.6+0.3(\sin(2x)), ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5, and ytarget(x)=1y_{\text{target}}(x)=1.
Refer to caption
Figure 6: Optimal Control of the 1D Heat Equation: comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

5.2 Optimal Control of the Heat Equation

This task involves controlling the evolution of temperature in a one-dimensional domain described by a reaction–diffusion PDE. Given an initial profile, the controller aims to reach a prescribed target state with minimal control effort. The problem is formulated as:

\operatorname*{Minimize}_{\bm{c}(t),\,t\in[0,T]}\quad \mathcal{L}=\mathcal{L}_{\text{terminal}}+\lambda\mathcal{L}_{\text{running}}+\gamma\mathcal{L}_{\text{effort}} (15a)
\text{s.t.}\quad \frac{\partial y}{\partial t}=D\frac{\partial^{2}y}{\partial x^{2}}-\beta\big(y-y_{\text{ref}}(x)\big)+\alpha u(t,x),\qquad \forall(x,t)\in\Omega\times[0,T] (15b)
u(t,x)=\bm{c}(t)^{\top}\bm{\phi}(x),\qquad \forall(x,t)\in\Omega\times[0,T] (15c)
y(0,x)=y_{0}(x),\qquad \forall x\in\Omega (15d)
\frac{\partial y}{\partial x}(t,x)=0,\qquad \forall x\in\partial\Omega,\ t\in[0,T] (15e)
c_{\text{min}}\leq\bm{c}(t)\leq c_{\text{max}},\qquad \forall t\in[0,T]. (15f)

The loss components in (15a) are defined as follows:

\mathcal{L}_{\text{terminal}}:=\int_{\Omega}\big|y(T,x)-y_{\text{target}}(x)\big|^{2}\,dx, (16a)
\mathcal{L}_{\text{running}}:=\int_{0}^{T}\!\!\int_{\Omega}\big|y(t,x)-y_{\text{target}}(x)\big|^{2}\,dx\,dt, (16b)
\mathcal{L}_{\text{effort}}:=\int_{0}^{T}\|\bm{c}(t)\|_{2}^{2}\,dt. (16c)

Here y(t,x)y(t,x) represents the temperature profile. The objective (15a) balances three terms: a terminal loss for final state accuracy, a running loss for temperature tracking, and an effort loss for control efficiency, described by (16a)–(16c).

Datasets and methods. The solutions of (15b) are generated using a Crank–Nicolson method, which handles the time-varying source term by averaging the control input over each time step. To train our models, we generate a dataset of diverse trajectories. First, smooth, time-varying control weights $\bm{c}(t)$ are generated by sampling each of the $M$ weight trajectories from a GRF over the time domain. The full spatio-temporal control field $u(t,x)$, constrained by (15f), is then reconstructed from these weights using a set of fixed basis functions, as defined in (15c). For each generated control field, the full PDE is solved to generate a ground-truth temperature evolution $y(t,x)$. Each complete simulation yields a training sample, consisting of the control input weight trajectory $\{\bm{c}(t_{k})\}$ and the corresponding state trajectory $\{y(t_{k},x)\}$ evaluated at a set of fixed sensor locations. The dataset is then split into 80% training, 10% validation, and 10% test sets.
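For reference, a single Crank–Nicolson step for the controlled reaction–diffusion dynamics (15b) with zero-flux boundaries can be sketched as below; the dense linear solve and the default coefficient values are simplifying assumptions made for illustration (a production solver would use banded matrices).

```python
import numpy as np

def crank_nicolson_step(y, u_n, u_np1, dx, dt, D=0.1, beta=1.0, alpha=2.0, y_ref=1.0):
    """One step of y_t = D y_xx - beta (y - y_ref) + alpha u with Neumann (zero-flux) BCs."""
    n = len(y)
    lap = np.zeros((n, n))
    for i in range(n):                       # standard second-difference Laplacian
        lap[i, i] = -2.0
        if i > 0:
            lap[i, i - 1] = 1.0
        if i < n - 1:
            lap[i, i + 1] = 1.0
    lap[0, 1] = 2.0                          # ghost-point treatment of zero-flux boundaries
    lap[-1, -2] = 2.0
    lap /= dx ** 2
    A = D * lap - beta * np.eye(n)           # linear part of the right-hand side
    src = beta * y_ref + alpha * 0.5 * (u_n + u_np1)   # control averaged over the time step
    lhs = np.eye(n) - 0.5 * dt * A
    rhs = (np.eye(n) + 0.5 * dt * A) @ y + dt * src
    return np.linalg.solve(lhs, rhs)
```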

Our framework uses two neural networks. First, the discrete-time neural operator $\mathcal{Y}_{\theta}$ is trained as a surrogate model as described in Section 4.2. It acts as a one-step time-advancement operator, learning the mapping $(y(t_{k},x),\bm{c}(t_{k}))\rightarrow y(t_{k+1},x)$. Second, PDE-OP’s surrogate controller $\mathcal{U}_{\omega}$ is a recurrent neural network that acts as a sequential decision-maker. At each time step $t_{k}$, it takes the current state $y(t_{k},x)$ and the final target $y_{\text{target}}(x)$ to produce the control weights $\bm{c}(t_{k})$. The controller is trained end-to-end by unrolling the trajectory using the surrogate $\mathcal{Y}_{\theta}$. We benchmark our approach against a highly tuned Adjoint method and Linear MPC (LMPC). Further implementation details are in Appendices 8.3–8.4.

At inference, we evaluate the PDE-OP model on three unseen target profiles: constant ($y_{\text{target}}(x)=p_{1}$), linear ramp ($y_{\text{target}}(x)=p_{1}x+p_{2}$), and sine wave ($y_{\text{target}}(x)=p_{1}+p_{2}(\sin f_{0}x+p_{3})$).

Results. Figure 6 shows the temperature (central sub-figure of Fig. 6) estimated by PDE-OP’s dynamic component 𝒴θ\mathcal{Y}_{\theta} at test time, which is compared to the solution (left sub-figure) obtained with a numerical PDE-solver. The figure shows that also in this setting the predicted temperature aligns closely with the numerical PDE-solver solution, with an absolute error on the order of 10310^{-3} across the space–time domain.

Figure 5 illustrates the comparison of PDE-OP’s predicted terminal state $\hat{y}(T,x)$ and those obtained using the classical methods, over the same target profiles introduced in the previous experiment. As also shown in Table 1, all three methods achieve the same accuracy, reporting an MSE of $3.5\times10^{-4}$. However, PDE-OP is by far the fastest method, running in $0.124\,\mathrm{s}$. This corresponds to a $3.1\times$ speedup over LMPC (running in $0.384\,\mathrm{s}$) and a $5.3\times$ speedup over the Adjoint method (which runs in $0.662\,\mathrm{s}$). These results demonstrate that even in a more complex setting, with time-varying control actions and linear PDE dynamics, our method produces solutions that are qualitatively comparable to those of the classical methods, but at a significantly reduced computational cost.

5.3 Optimal Control of the 1D Burgers’ Equation

To test the ability of PDE-OP to handle nonlinear dynamics, we address the control of the 1D viscous Burgers’ equation with Dirichlet boundary conditions. The control input, an external forcing term $u(t,x)$, is parameterized identically to the heat equation task, using time-varying weights $\bm{c}(t_{k})$ for a set of basis functions. The optimization goal is to find the optimal trajectory of these weights subject to Burgers’ equation (17b), the initial condition (17d), Dirichlet boundary conditions (17e), and box constraints (17f).

\operatorname*{Minimize}_{\bm{c}(t),\,t\in[0,T]}\quad \mathcal{L}=\mathcal{L}_{\text{terminal}}+\lambda\mathcal{L}_{\text{running}}+\gamma\mathcal{L}_{\text{effort}} (17a)
\text{s.t.}\quad \frac{\partial y}{\partial t}+y\frac{\partial y}{\partial x}=\nu\frac{\partial^{2}y}{\partial x^{2}}+u(t,x),\qquad \forall(x,t)\in\Omega\times[0,T] (17b)
u(t,x)=\bm{c}(t)^{\top}\bm{\phi}(x),\qquad \forall(x,t)\in\Omega\times[0,T] (17c)
y(0,x)=y_{0}(x),\qquad \forall x\in\Omega (17d)
y(t,x)=0,\qquad \forall x\in\partial\Omega,\ t\in[0,T] (17e)
c_{\text{min}}\leq\bm{c}(t)\leq c_{\text{max}},\qquad \forall t\in[0,T]. (17f)

Here, $y(t,x)$ is the velocity field and $\nu$ is the viscosity parameter. The PDE in (17b) contains a convective nonlinearity ($y\frac{\partial y}{\partial x}$), which poses a significant control challenge not present in the linear heat equation. The objective retains the same structure, balancing final-state accuracy, path tracking, and control efficiency.

Datasets and methods. The ground-truth dynamics for the nonlinear Burgers’ equation are simulated using a high-fidelity numerical solver. The dataset generation process mirrors that of the heat equation: time-varying weights 𝒄(t)\bm{c}(t) are sampled from a GRF, and the force field u(t,x)u(t,x) is reconstructed using spatial basis functions, and the PDE is solved to generate trajectories. Due to the system’s high nonlinearity, we benchmark our method against a much more general and computationally expensive Nonlinear MPC (NMPC) and Adjoint-Sensitivity Method. Further implementation details are in Appendix 8.3 - 8.4.

At inference, we test our PDE-OP model on three unseen target profiles: constant ($y_{\text{target}}(x)=p_{1}$), parabolic ($y_{\text{target}}(x)=p_{1}x(1-x)$), and sine wave ($y_{\text{target}}(x)=p_{1}(\sin f_{0}x+p_{2})$). We do not use a linear target profile because of the Dirichlet boundary conditions.

Refer to caption
Figure 7: Optimal control of the 1D Burgers’ equation; comparison of PDE-OP and classical methods for three target profiles: ytarget(x)=0.8(sin(x))y_{\text{target}}(x)=0.8(\sin(x)) (left), ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x) (center), and ytarget(x)=0y_{\text{target}}(x)=0 (right).
Refer to caption
Figure 8: Optimal control of the 1D Burgers’ equation: comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

Results. Figure 8 shows the system dynamics (central sub-figure of Fig. 8) estimated by PDE-OP’s dynamic component $\mathcal{Y}_{\theta}$ at test time, which is compared to the solution (left sub-figure) obtained with a numerical PDE solver. The figure shows that the predicted velocity field aligns closely with the numerical PDE-solver solution, with an absolute error on the order of $10^{-3}$ across the space–time domain, demonstrating the ability of PDE-OP’s dynamic predictor to accurately capture system dynamics even under nonlinearities.

Figure 7 illustrates the comparison of PDE-OP’s predicted terminal state $\hat{y}(T,x)$ and those obtained using the classical methods, over the same target profiles introduced in the previous experiments. As shown in Figure 7 and Table 1, each method produces highly accurate solutions, with NMPC achieving the lowest error ($\text{MSE}=7.0\times10^{-5}$) but requiring $878.667\,\mathrm{s}$. The Adjoint method is slightly less accurate, yielding an $\text{MSE}=2.3\times10^{-4}$ in $120.696\,\mathrm{s}$. In contrast, PDE-OP produces qualitatively comparable predictions while being the fastest by a wide margin, running in $0.062\,\mathrm{s}$: approximately $1.42\times10^{4}$ times faster than NMPC and $1.95\times10^{3}$ times faster than the Adjoint method.

6 Conclusion

This work was motivated by the efficiency requirements associated with solving partial differential equations-constrained optimization problems. It introduced a novel learning-based framework, PDE-OP, which incorporates differential equation constraints into optimization tasks for real-time applications. The approach uses a dual-network architecture, with one approximating the control strategies and another capturing the associated PDEs. This architecture exploits a primal-dual method to ensure that both the dynamics dictated by the PDEs and the optimization objectives are concurrently learned and respected. This integration enables end-to-end differentiation, allowing for efficient gradient-based optimization, and, to our knowledge, solves PDE-constrained optimization problems in real-time for the first time. Empirical evaluations across a set of PDE-constrained optimization tasks illustrated PDE-OP’s capability to address these complex challenges adeptly. Our comprehensive results demonstrate the effectiveness of our approach and its broad potential applicability across various scientific and engineering domains where system dynamics arise in optimization or control processes.

7 Acknowledgments

The material is based upon work supported by National Science Foundation (NSF) awards 2533631, 2401285, 2334448, and 2334936, and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011252E005. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or DARPA.

References

  • (1) Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, March 2021.
  • (2) James Kotary, Ferdinando Fioretto, Pascal Van Hentenryck, and Bryan Wilder. End-to-end constrained optimization learning: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4475–4482, 2021.
  • (3) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations, 2017.
  • (4) Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces, 2024.
  • (5) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations, 2021.
  • (6) Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 6572–6583. Curran Associates Inc., 2018.
  • (7) David Q Mayne, James B Rawlings, CV Rao, and PO Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
  • (8) Michael Hinze, René Pinnau, Michael Ulbrich, and Stefan Ulbrich. Optimization with PDE Constraints, volume 23 of Mathematical Modelling: Theory and Applications. Springer, 2008.
  • (9) Fredi Tröltzsch. Optimal Control of Partial Differential Equations: Theory, Methods, and Applications, volume 112 of Graduate Studies in Mathematics. American Mathematical Society, 2010.
  • (10) Matthew J Zahr and Charbel Farhat. Progressive construction of a parametric reduced-order model for pde-constrained optimization. International Journal for Numerical Methods in Engineering, 2014. https://arxiv.org/abs/1407.7618.
  • (11) Markus A Dihlmann and Bernard Haasdonk. Certified pde-constrained parameter optimization using reduced basis surrogate models for evolution problems. Computational Optimization and Applications, 60:753–787, 2015.
  • (12) Rakhoon Hwang, Jae Yong Lee, Jin Young Shin, and Hyung Ju Hwang. Solving pde-constrained control problems using operator learning. Proceedings of the AAAI Conference on Artificial Intelligence, 36(4):4504–4512, June 2022.
  • (13) Karen Willcox and Omar Ghattas. The imperative of physics-based machine learning and the role of reduced-order models. Nature Computational Science, 1(3):166–168, 2021.
  • (14) Ferdinando Fioretto, Pascal Van Hentenryck, Terrence WK Mak, Cuong Tran, Federico Baldo, and Michele Lombardi. Lagrangian duality for constrained deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 118–135. Springer, 2020.
  • (15) Seonho Park and Pascal Van Hentenryck. Self-supervised primal-dual learning for constrained optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4052–4060, 2023.
  • (16) Priya L Donti, David Rolnick, and J Zico Kolter. Dc3: A learning method for optimization with hard constraints. In ICLR, 2020.
  • (17) James Kotary, Vincenzo Di Vito, Jacob Cristopher, Pascal Van Hentenryck, and Ferdinando Fioretto. Learning joint models of prediction and optimization, 2024.
  • (18) M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, Feb 2019.
  • (19) Zongyi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial differential equations. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6755–6766, 2020.
  • (20) Kaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. Model reduction and neural networks for parametric pdes. In Mathematical and Scientific Machine Learning (MSML), volume 145 of Proceedings of Machine Learning Research, pages 203–233. PMLR, 2021.
  • (21) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. The deepgreen algorithm for learning generic pdes. Journal of Computational Physics, 420:109707, 2021.
  • (22) Vincenzo Di Vito, Mostafa Mohammadian, Kyri Baker, and Ferdinando Fioretto. Learning to solve differential equation constrained optimization problems, 2024.
  • (23) Mingquan Feng, Zhijie Chen, Yixin Huang, Yizhou Liu, and Junchi Yan. Optimal control operator perspective and a neural adaptive spectral method. Proceedings of the AAAI Conference on Artificial Intelligence, 39(14):14567–14575, Apr. 2025.
  • (24) Zhexian Li, Athanassios S. Fokas, and Ketan Savla. On linear quadratic regulator for the heat equation with general boundary conditions. In 2024 IEEE 63rd Conference on Decision and Control (CDC), pages 7594–7599, 2024.
  • (25) Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320, 1969.
  • (26) John T. Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. SIAM, 2 edition, 2010.
  • (27) Yang Cao, Shengtai Li, Linda Petzold, and Radu Serban. Adjoint sensitivity analysis for differential-algebraic equations: The adjoint dae system and its numerical solution. SIAM Journal on Scientific Computing, 24(3):1076–1089, 2003.
  • (28) Max Schwenzer, Muzaffer Ay, Thomas Bergs, and Dirk Abel. Review on model predictive control: an engineering perspective. The International Journal of Advanced Manufacturing Technology, 117:1327 – 1349, 2021.
  • (29) J. Crank and P. Nicolson. A practical method for numerical evaluation of solutions of partial differential equations of the heat-conduction type. Mathematical Proceedings of the Cambridge Philosophical Society, 43(1):50–67, 1947.
  • (30) Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
  • (31) Steven Diamond and Stephen Boyd. Cvxpy: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
  • (32) Dieter Kraft. Algorithm 733: Tomp–fortran modules for optimal control calculations. ACM Transactions on Mathematical Software (TOMS), 20:262 – 281, 1994.
  • (33) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • (34) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.

Appendix

8 Classical Methods

8.1 Direct Method (Finite-Difference Approach)

The direct method treats the simulator as a black box mapping the bounded control vector 𝒖\bm{u} to the discrete objective J(𝒖)J(\bm{u}) and employs a standard optimizer to minimize it. Here the control variable is discretized at each collocation point: 𝒖Nx\bm{u}\in\mathbb{R}^{N_{x}}.

Mechanism. We use L-BFGS-B[30] with per–coordinate box bounds uminujumaxu_{\min}\leq u_{j}\leq u_{\max} to minimize (31). No analytic gradients are provided; instead, numerical derivatives are queried via finite differences. With a forward-difference method,

JujJ(𝒖+ϵj𝒆j)J(𝒖)ϵj,ϵj=η(1+|uj|),\frac{\partial J}{\partial u_{j}}\;\approx\;\frac{J(\bm{u}+\epsilon_{j}\bm{e}_{j})-J(\bm{u})}{\epsilon_{j}},\qquad\epsilon_{j}=\eta\,\bigl(1+|u_{j}|\bigr), (18)

where \bm{e}_{j} is the j-th unit vector and \eta is a small scalar (e.g., \eta=10^{-6}). For higher accuracy (at roughly double the computational cost), the central-difference method is adopted:

JhujJh(𝒖+ϵj𝒆j)Jh(𝒖ϵj𝒆j)2ϵj.\frac{\partial J_{h}}{\partial u_{j}}\;\approx\;\frac{J_{h}(\bm{u}+\epsilon_{j}\bm{e}_{j})-J_{h}(\bm{u}-\epsilon_{j}\bm{e}_{j})}{2\epsilon_{j}}. (19)

Computational cost. Each gradient evaluation requires Nx+1N_{x}\!+\!1 simulator calls with forward differences (or 2Nx2N_{x} with central differences), where NxN_{x} denotes the number of spatial collocation points. Since one simulator call is a complete rollout (27) over NtN_{t} time steps, the complexity per iteration is 𝒪((Nx+1)NtNx)\mathcal{O}\big((N_{x}\!+\!1)\,N_{t}\,N_{x}\big). This method is robust and simple, but its cost scales linearly with NxN_{x}, which becomes prohibitive for dense spatial discretizations.
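A minimal Python sketch of this procedure is given below; it assumes a black-box routine simulate(u) performing one full CN rollout and returning the discrete objective J_h(u), and uses illustrative names, bounds, and tolerances rather than the exact released implementation.

import numpy as np
from scipy.optimize import minimize

Nx, eta = 101, 1e-6          # number of collocation points, FD step scale
u_min, u_max = -1.0, 1.0     # actuator bounds

def grad_fd(u, simulate):
    # Forward differences (Eq. 18): N_x + 1 simulator calls per gradient.
    g = np.zeros_like(u)
    J0 = simulate(u)
    for j in range(u.size):
        eps = eta * (1.0 + abs(u[j]))
        u_pert = u.copy(); u_pert[j] += eps
        g[j] = (simulate(u_pert) - J0) / eps
    return g

def solve_direct(simulate):
    # L-BFGS-B with per-coordinate box bounds on the discretized control.
    u0 = np.zeros(Nx)
    res = minimize(simulate, u0, jac=lambda u: grad_fd(u, simulate),
                   method="L-BFGS-B", bounds=[(u_min, u_max)] * Nx,
                   options={"maxiter": 100, "ftol": 1e-6})
    return res.x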

8.2 Adjoint Sensitivity Method for Open-Loop Problem

The Adjoint-Sensitivity Method efficiently computes exact gradients of PDE-constrained objectives w.r.t. high-dimensional control variables. Crucially, the cost of one gradient evaluation is one forward solve plus one backward adjoint solve, regardless of the number of control parameters. We derive the continuous adjoint for the open-loop, time-invariant Voltage Control Optimization task (Section 5.1) and then present the discrete-time adjoint used in our Crank–Nicolson (CN) implementation.

Function spaces & BCs. Consider u\in L^{2}(\Omega), y\in L^{2}([0,T]\times\Omega), and impose homogeneous Neumann BCs on y so that the boundary terms vanish in the variational calculus; the adjoint inherits the same Neumann BCs.

The problem consists of finding u(x)u(x) minimizing

J(u)=12Ω(y(T,x)ytarget(x))2𝑑x+γ2Ωu(x)2𝑑x,J(u)=\tfrac{1}{2}\!\int_{\Omega}\!\!(y(T,x)-y_{\text{target}}(x))^{2}\,dx\;+\;\tfrac{\gamma}{2}\!\int_{\Omega}\!u(x)^{2}\,dx, (20)

subject to

ytD2yx2+β(yyref(x))αu(x)=0,\displaystyle\frac{\partial y}{\partial t}-D\frac{\partial^{2}y}{\partial x^{2}}+\beta\big(y-y_{\text{ref}}(x)\big)-\alpha u(x)=0, (t,x)[0,T]×Ω,\displaystyle\forall(t,x)\in[0,T]\times\Omega, (21a)
y(0,x)=y0(x),\displaystyle y(0,x)=y_{0}(x), xΩ,\displaystyle\forall x\in\Omega, (21b)
xy(t,x)=0,\displaystyle\partial_{x}y(t,x)=0, (t,x)[0,T]×Ω,\displaystyle\forall(t,x)\in[0,T]\times\partial\Omega, (21c)

where γ>0\gamma>0 is the regularization weight in the control effort.

Lagrangian. We define

(u,y,p)=J(u)+0TΩp(t,x)[ytD2yx2+β(yyref(x))αu(x)]𝑑x𝑑t,\mathcal{L}(u,y,p)=J(u)+\int_{0}^{T}\!\!\int_{\Omega}p(t,x)\!\left[\frac{\partial y}{\partial t}-D\frac{\partial^{2}y}{\partial x^{2}}+\beta\big(y-y_{\text{ref}}(x)\big)-\alpha u(x)\right]dx\,dt, (22)

where p(t,x) is the adjoint state. Imposing \nabla_{y}\mathcal{L}=0 and integrating by parts in t and x (with Neumann BCs) yields the adjoint system.

Adjoint PDE.

ptD2px2+βp=0,xp|Ω=0.-\frac{\partial p}{\partial t}-D\frac{\partial^{2}p}{\partial x^{2}}+\beta\,p=0,\qquad\partial_{x}p|_{\partial\Omega}=0. (23)

Terminal condition.

p(T,x)=y(T,x)ytarget(x).p(T,x)=y(T,x)-y_{\text{target}}(x). (24)

Gradient w.r.t. u(x)u(x). Because uu is time-invariant, variation w.r.t. uu gives

u(x)J=γu(x)α0Tp(t,x)𝑑t.\nabla_{u(x)}J\;=\;\gamma\,u(x)\;-\;\alpha\int_{0}^{T}p(t,x)\,dt. (25)

We optimize \bm{u} with L-BFGS-B under elementwise bounds u_{\min}\leq u_{i}\leq u_{\max}; no projection step is needed.

Semi-discrete forward dynamics. To simplify the notation, let \bm{y}(t)=[y(t,x_{1})\;\dots\;y(t,x_{i})\;\dots\;y(t,x_{N_{x}})]\in\mathbb{R}^{N_{x}} be the discretized version of y(t,x) on a uniform grid with Neumann BCs, let \bm{L}\in\mathbb{R}^{N_{x}\times N_{x}} be the Neumann Laplacian, and let \bm{u}\in\mathbb{R}^{N_{x}} be the static control vector with elements u(x_{i}). Define \bm{A}:=D\,\bm{L}-\beta\,\bm{I}_{N_{x}} and \bm{s}:=\beta\,\bm{y}_{\text{ref}}. Then

𝒚˙(t)=𝑨𝒚(t)+α𝒖+𝒔.\dot{\bm{y}}(t)\;=\;\bm{A}\,\bm{y}(t)\;+\;\alpha\,\bm{u}\;+\;\bm{s}. (26)

Crank–Nicolson (CN) step. With Δt=T/Nt\Delta t=T/N_{t} and tk=kΔtt_{k}=k\Delta t, CN gives

(𝑰NxΔt2𝑨)𝒚(tk+1)=(𝑰Nx+Δt2𝑨)𝒚(tk)+Δtα𝒖+Δt𝒔.\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)\bm{y}(t_{k+1})\;=\;\big(\bm{I}_{N_{x}}+\tfrac{\Delta t}{2}\bm{A}\big)\,\bm{y}(t_{k})\;+\;\Delta t\,\alpha\,\bm{u}\;+\;\Delta t\,\bm{s}. (27)

Equivalently, the affine one-step map is

𝒚(tk+1)\displaystyle\bm{y}(t_{k+1}) =𝑨CN𝒚(tk)+𝑩stat𝒖+𝒅,\displaystyle=\bm{A}_{\mathrm{CN}}\,\bm{y}(t_{k})\;+\;\bm{B}_{\mathrm{stat}}\,\bm{u}\;+\;\bm{d}, (28)
𝑨CN\displaystyle\bm{A}_{\mathrm{CN}} =(𝑰NxΔt2𝑨)1(𝑰Nx+Δt2𝑨),𝑩stat=Δt(𝑰NxΔt2𝑨)1α𝑰Nx,\displaystyle=\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-1}\!\big(\bm{I}_{N_{x}}+\tfrac{\Delta t}{2}\bm{A}\big),\qquad\bm{B}_{\mathrm{stat}}=\Delta t\,\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-1}\,\alpha\,\bm{I}_{N_{x}}, (29)
𝒅\displaystyle\bm{d} =Δt(𝑰NxΔt2𝑨)1𝒔.\displaystyle=\Delta t\,\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-1}\,\bm{s}. (30)

Discrete objective. With 𝑴=Δx𝑰Nx\bm{M}=\Delta x\,\bm{I}_{N_{x}} we have:

Jh(𝒖)=12𝒚Nt𝒚target𝑴2+γ2𝒖𝑴2=12(𝒚Nt𝒚target)𝑴(𝒚Nt𝒚target)+γ2𝒖𝑴𝒖.J_{h}(\bm{u})=\tfrac{1}{2}\,\|\bm{y}_{N_{t}}-\bm{y}_{\text{target}}\|_{\bm{M}}^{2}+\tfrac{\gamma}{2}\,\|\bm{u}\|_{\bm{M}}^{2}=\tfrac{1}{2}(\bm{y}_{N_{t}}-\bm{y}_{\text{target}})^{\top}\bm{M}(\bm{y}_{N_{t}}-\bm{y}_{\text{target}})+\tfrac{\gamma}{2}\,\bm{u}^{\top}\bm{M}\bm{u}. (31)

Discrete adjoint & gradient w.r.t. static u\bm{u}.

𝝀Nt=𝑴(𝒚Nt𝒚target),𝝀k=𝑨CN𝝀k+1,k=Nt1,,0,\bm{\lambda}_{N_{t}}=\bm{M}(\bm{y}_{N_{t}}-\bm{y}_{\text{target}}),\qquad\bm{\lambda}_{k}=\bm{A}_{\mathrm{CN}}^{\top}\,\bm{\lambda}_{k+1},\quad k=N_{t}-1,\dots,0, (32)
𝒖Jh=γ𝑴𝒖k=0Nt1𝑩stat𝝀k+1=γ𝑴𝒖αΔt(𝑰NxΔt2𝑨)(k=0Nt1𝝀k+1).\nabla_{\bm{u}}J_{h}=\gamma\,\bm{M}\,\bm{u}-\sum_{k=0}^{N_{t}-1}\bm{B}_{\mathrm{stat}}^{\top}\,\bm{\lambda}_{k+1}\;=\;\gamma\,\bm{M}\,\bm{u}-\alpha\,\Delta t\,\big(\bm{I}_{N_{x}}-\tfrac{\Delta t}{2}\bm{A}\big)^{-\!\top}\!\left(\sum_{k=0}^{N_{t}-1}\bm{\lambda}_{k+1}\right). (33)

Computational cost. One forward CN rollout (27) and one backward adjoint sweep (32) yield the exact discrete gradient (33); \bm{u} is then updated with L-BFGS-B under box bounds. In 1D, the CN matrix is time-invariant and can be prefactored once, so each step reduces to a cheap triangular solve; the total cost per iteration is \mathcal{O}(N_{t}N_{x}).
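The following Python sketch illustrates this forward/backward procedure for the static control under illustrative 1D problem data (D, beta, alpha, gamma, y_ref, y_target); it assembles a Neumann Laplacian, performs the CN rollout (27) and the backward sweep (32), and evaluates the gradient (33). It is a sketch of the equations above, not the released implementation.

import numpy as np

def neumann_laplacian(Nx, dx):
    # Standard second-difference matrix with mirrored ghost nodes (zero-flux BCs).
    L = (np.diag(-2.0 * np.ones(Nx)) + np.diag(np.ones(Nx - 1), 1)
         + np.diag(np.ones(Nx - 1), -1))
    L[0, 1] = 2.0
    L[-1, -2] = 2.0
    return L / dx**2

def objective_and_gradient(u, y0, y_target, y_ref, D, beta, alpha, gamma, T, Nt, dx):
    Nx = y0.size
    A = D * neumann_laplacian(Nx, dx) - beta * np.eye(Nx)
    dt = T / Nt
    M_L = np.eye(Nx) - 0.5 * dt * A            # CN left matrix (I - dt/2 A)
    M_R = np.eye(Nx) + 0.5 * dt * A            # CN right matrix (I + dt/2 A)
    s = beta * y_ref
    # Forward CN rollout (Eq. 27).
    y = y0.copy()
    for _ in range(Nt):
        y = np.linalg.solve(M_L, M_R @ y + dt * alpha * u + dt * s)
    M = dx * np.eye(Nx)                         # lumped mass matrix
    # Backward adjoint sweep (Eq. 32), accumulating the sum of lambda_{k+1}.
    lam = M @ (y - y_target)
    A_CN = np.linalg.solve(M_L, M_R)
    lam_sum = np.zeros(Nx)
    for _ in range(Nt):
        lam_sum += lam
        lam = A_CN.T @ lam
    # Exact discrete gradient (Eq. 33).
    grad = gamma * (M @ u) - alpha * dt * np.linalg.solve(M_L.T, lam_sum)
    J = 0.5 * (y - y_target) @ M @ (y - y_target) + 0.5 * gamma * u @ M @ u
    return J, grad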

8.3 Time-Varying Adjoint for Closed-Loop Baselines

The adjoint method extends naturally to a time-varying control u(t,x)u(t,x). The continuous adjoint PDE and terminal/boundary conditions remain as per Eq. (23)–(24), while the gradients w.r.t. u(t,x)u(t,x) are calculated as:

u(t,x)J=γu(t,x)αp(t,x).\nabla_{u(t,x)}J\;=\;\gamma\,u(t,x)\;-\;\alpha\,p(t,x). (34)

We use this time-varying adjoint for the heat and Burgers' experiments (see Sections 5.2 and 5.3); the discrete forms below match our CN implementations.

Heat optimization problem — CN discretization.

With \bm{y}(t_{k+1})=\bm{A}\bm{y}(t_{k})+\bm{B}\bm{u}(t_{k})+\bm{g} and quadratic running and terminal costs, the discrete adjoint equation and per-step gradient are

𝝀Nt\displaystyle\bm{\lambda}_{N_{t}} =𝑸(𝒚Nt𝒚target),𝝀k=𝑨𝝀k+1+𝑻Q(𝒚(tk)𝒚target),\displaystyle=\bm{Q}\,(\bm{y}_{N_{t}}-\bm{y}_{\text{target}}),\qquad\bm{\lambda}_{k}=\bm{A}^{\top}\bm{\lambda}_{k+1}+\bm{T}_{Q}\,(\bm{y}(t_{k})-\bm{y}_{\text{target}}), (35)
𝒖(tk)Jh\displaystyle\nabla_{\bm{u}(t_{k})}J_{h} =𝑹𝒖(tk)+𝑩𝝀k+1.\displaystyle=\bm{R}\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{\lambda}_{k+1}. (36)

This is the gradient used for the closed-loop heat-equation experiments in Section 5.2. \bm{Q}, \bm{T}_{Q}, and \bm{R} denote the weight matrices.
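A minimal sketch of this backward recursion, assuming the CN matrices A, B and the weight matrices Q, T_Q, R have already been assembled (all names are illustrative), is:

import numpy as np

def heat_adjoint_gradients(ys, us, A, B, Q, TQ, R, y_target):
    # ys: states y(t_0), ..., y(t_Nt); us: controls u(t_0), ..., u(t_{Nt-1}).
    Nt = len(us)
    lam = Q @ (ys[Nt] - y_target)                  # terminal adjoint (Eq. 35)
    grads = [None] * Nt
    for k in range(Nt - 1, -1, -1):
        grads[k] = R @ us[k] + B.T @ lam           # per-step gradient (Eq. 36)
        lam = A.T @ lam + TQ @ (ys[k] - y_target)  # adjoint recursion (Eq. 35)
    return grads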

Burgers’ optimization problem — implicit CN step.

We write one CN step as the implicit map 𝒚(tk+1)=𝒢(𝒚(tk),𝒖(tk);Δt)\bm{y}(t_{k+1})=\mathcal{G}(\bm{y}(t_{k}),\bm{u}(t_{k});\Delta t) with Dirichlet endpoints enforced at each step. Let

Ak=IΔt2J(𝒚(tk+1)),Bk=IΔt2J(𝒚(tk)),A_{k}\;=\;I-\tfrac{\Delta t}{2}\,J(\bm{y}(t_{k+1})),\qquad B_{k}\;=\;-I-\tfrac{\Delta t}{2}\,J(\bm{y}(t_{k})), (37)

where J(𝒚)=𝑫diag(𝒚)+ν𝑫2J(\bm{y})=-\bm{D}\,\mathrm{diag}(\bm{y})+\nu\,\bm{D}^{2} is the Jacobian of the semi-discrete residual, and 𝑩\bm{B} stacks the actuator shapes. For running cost Lk=Δt2𝒚(tk+1)𝒚target22+αΔt2𝒖(tk)22L_{k}=\tfrac{\Delta t}{2}\|\bm{y}(t_{k+1})-\bm{y}_{\text{target}}\|_{2}^{2}+\tfrac{\alpha\Delta t}{2}\|\bm{u}(t_{k})\|_{2}^{2}, the closed-loop adjoint/backward sweep is

Ak𝒒k\displaystyle A_{k}^{\top}\bm{q}_{k} =Δt(𝒚(tk+1)𝒚target),\displaystyle=\Delta t\,(\bm{y}(t_{k+1})-\bm{y}_{\text{target}}), (38)
𝒑k\displaystyle\bm{p}_{k} =Δt(𝒚(tk)𝒚target)Bk𝒒k,\displaystyle=\Delta t\,(\bm{y}(t_{k})-\bm{y}_{\text{target}})-B_{k}^{\top}\bm{q}_{k}, (39)
𝒖(tk)Jh\displaystyle\nabla_{\bm{u}(t_{k})}J_{h} =Δt(α𝒖(tk)+𝑩𝒒k),\displaystyle=\Delta t\big(\alpha\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{q}_{k}\big), (40)

which are the adjoint equations used in our NMPC implementation for the experiments in Section 5.3. Here \bm{q}_{k} and \bm{p}_{k} are the adjoint variables.
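A minimal sketch of this per-window sweep, assuming a callable jac(y) returning the Jacobian J(y) and an actuator matrix Bmat (both illustrative placeholders), is:

import numpy as np

def burgers_adjoint_gradients(ys, us, jac, Bmat, y_target, alpha, dt):
    # ys: states over the window; us: controls; returns per-step gradients (Eq. 40).
    n = ys[0].size
    I = np.eye(n)
    grads = []
    for k in range(len(us)):
        A_k = I - 0.5 * dt * jac(ys[k + 1])                        # Eq. (37)
        B_k = -I - 0.5 * dt * jac(ys[k])                           # Eq. (37)
        q_k = np.linalg.solve(A_k.T, dt * (ys[k + 1] - y_target))  # Eq. (38)
        p_k = dt * (ys[k] - y_target) - B_k.T @ q_k                # Eq. (39), kept for completeness
        grads.append(dt * (alpha * us[k] + Bmat.T @ q_k))          # Eq. (40)
    return grads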

8.4 Model Predictive Control (MPC)

In what follows, we introduce the MPC algorithm; to simplify notation, we omit the dependency on the spatial variable x. At each time instant t_{k}, MPC solves a finite-horizon problem of N_{p} steps over the inputs \{\bm{u}(t_{i})\}_{i=k}^{k+N_{p}-1}; it applies the first input \bm{u}(t_{k}), then shifts the horizon and iterates (receding horizon).

Linear MPC for the Heat equation.

With Crank–Nicolson discretization and cosine-basis expansion for the control action, the dynamics are linear:

𝒚(ti+1)=𝑨𝒚(ti)+𝑩𝒄(ti)+𝒈,𝒚(ti)Nx,𝒄(ti)M.\bm{y}(t_{i+1})\;=\;\bm{A}\,\bm{y}(t_{i})\;+\;\bm{B}\,\bm{c}(t_{i})\;+\;\bm{g},\qquad\bm{y}(t_{i})\in\mathbb{R}^{N_{x}},\;\;\bm{c}(t_{i})\in\mathbb{R}^{M}. (41)

We solve the convex QP

min{𝒄(ti)}i=kk+Np1\displaystyle\min_{\{\bm{c}(t_{i})\}_{i=k}^{k+N_{p}-1}} i=kk+Np1𝒚(ti)𝒚target𝑻Q2+𝒄(ti)𝑹2+𝒚(tk+Np)𝒚target𝑸2\displaystyle\sum_{i=k}^{k+N_{p}-1}\|\bm{y}(t_{i})-\bm{y}_{\text{target}}\|_{\bm{T}_{Q}}^{2}+\|\bm{c}(t_{i})\|_{\bm{R}}^{2}+\|\bm{y}(t_{k+N_{p}})-\bm{y}_{\text{target}}\|_{\bm{Q}}^{2} (42)
s.t. (41)\displaystyle(\ref{eq:lmpc_dyn})
cmin𝒄(ti)cmax.\displaystyle c_{\min}\leq\bm{c}(t_{i})\leq c_{\max}.

Here, 𝑨=ML1MR\bm{A}=M_{L}^{-1}M_{R}, 𝑩=ML1(Δt𝑩c)\bm{B}=M_{L}^{-1}(\Delta t\,\bm{B}_{c}), 𝐠=ML1(Δt𝐬)\mathbf{g}=M_{L}^{-1}(\Delta t\,\mathbf{s}) (Neumann BCs), 𝐁c\mathbf{B}_{c} stacks cos(jπx/X)\cos(j\pi x/X); (42) is solved with CVXPY/OSQP [31]. Here MLM_{L} and MRM_{R} are the CN matrices of the system, obtained from the semi-discrete PDE operator AcA_{c}:

ML=IΔt2Ac,MR=I+Δt2Ac.M_{L}\;=\;I-\tfrac{\Delta t}{2}\,A_{c},\qquad M_{R}\;=\;I+\tfrac{\Delta t}{2}\,A_{c}. (43)

This way one CN step writes as:

ML𝒚(tk+1)=MR𝒚(tk)+Δt𝑩c𝒄(tk)+Δt𝒔.M_{L}\,\bm{y}(t_{k+1})\;=\;M_{R}\,\bm{y}(t_{k})\;+\;\Delta t\,\bm{B}_{c}\,\bm{c}(t_{k})\;+\;\Delta t\,\bm{s}. (44)
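A minimal CVXPY sketch of one receding-horizon LMPC step, assuming the CN matrices A, B, g and positive semidefinite weights T_Q, R, Q are given (all names are illustrative), is shown below; only the first optimized input is applied before the horizon is shifted.

import cvxpy as cp
import numpy as np

def lmpc_step(y_k, A, B, g, TQ, R, Q, y_target, Np, c_min, c_max):
    M = B.shape[1]
    c = cp.Variable((Np, M))
    y, cost = y_k, 0
    for i in range(Np):
        cost += cp.quad_form(y - y_target, TQ) + cp.quad_form(c[i], R)  # running cost
        y = A @ y + B @ c[i] + g                                        # linear CN dynamics (41)
    cost += cp.quad_form(y - y_target, Q)                               # terminal cost
    prob = cp.Problem(cp.Minimize(cost), [c >= c_min, c <= c_max])
    prob.solve(solver=cp.OSQP, warm_start=True)
    return c.value[0]                                                   # apply the first input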

Nonlinear MPC for the Burgers equation.

For viscous Burgers', the CN step is nonlinear due to the convection term; we write one step as the implicit map

𝒚(ti+1)=𝒢(𝒚(ti),𝒄(ti);Δt),\bm{y}(t_{i+1})\;=\;\mathcal{G}(\bm{y}(t_{i}),\bm{c}(t_{i});\Delta t), (45)

where \mathcal{G} is the unique CN solution of the Burgers' step with input \bm{B}\bm{c}(t_{i}), computed via fsolve on the CN residual with Dirichlet endpoints imposed at each step.

min{𝒄(ti)}i=kk+Np1\displaystyle\min_{\{\bm{c}(t_{i})\}_{i=k}^{k+N_{p}-1}} i=kk+Np1𝒚(ti)𝒚target𝑻Q2+𝒄(ti)𝑹2+𝒚(tk+Np)𝒚target𝑸2\displaystyle\sum_{i=k}^{k+N_{p}-1}\|\bm{y}(t_{i})-\bm{y}_{\text{target}}\|_{\bm{T}_{Q}}^{2}+\|\bm{c}(t_{i})\|_{\bm{R}}^{2}+\|\bm{y}(t_{k+N_{p}})-\bm{y}_{\text{target}}\|_{\bm{Q}}^{2} (46)
s.t. 45,cmin𝒄(ti)cmax.\displaystyle\ref{eq:nmpc_dyn},\;\;c_{\min}\leq\bm{c}(t_{i})\leq c_{\max}.

We use single shooting over the full-order model: propagate with (45), optimize with SLSQP [32] under box bounds, apply the first input \bm{c}(t_{k}), advance one CN step, and iterate. Here \bm{B} stacks the \sin((j{+}1)\pi x/X) shapes used in the implementation. Notably, \bm{T}_{Q}, \bm{R}, and \bm{Q} represent the weight matrices, which correspond to the multipliers \bm{\lambda} in our PDE-OP method.
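A minimal single-shooting sketch of one NMPC step is given below. It assumes a user-supplied residual function cn_residual(y_next, y, u) defining the implicit CN map (45), an actuator matrix Bmat, and weight matrices TQ, R, Q; these names are illustrative placeholders rather than the released code.

import numpy as np
from scipy.optimize import fsolve, minimize

def cn_step(y, u, cn_residual):
    # Solve the implicit CN residual for the next state (the map G in Eq. 45).
    return fsolve(lambda y_next: cn_residual(y_next, y, u), y)

def nmpc_step(y_k, Bmat, y_target, TQ, R, Q, Np, c_min, c_max, cn_residual, c_warm=None):
    M = Bmat.shape[1]

    def horizon_cost(c_flat):
        c = c_flat.reshape(Np, M)
        y, J = y_k, 0.0
        for i in range(Np):
            J += (y - y_target) @ TQ @ (y - y_target) + c[i] @ R @ c[i]  # running cost
            y = cn_step(y, Bmat @ c[i], cn_residual)                     # full-order rollout
        return J + (y - y_target) @ Q @ (y - y_target)                   # terminal cost

    c0 = np.zeros(Np * M) if c_warm is None else c_warm
    res = minimize(horizon_cost, c0, method="SLSQP",
                   bounds=[(c_min, c_max)] * (Np * M))
    # Apply only the first input; res.x can warm start the next (shifted) horizon.
    return res.x.reshape(Np, M)[0], res.x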

Choice of controller.

For the voltage optimization task, we use the Direct and Adjoint methods in open loop. For the heat equation, CN yields a linear system, so the finite-horizon problem is a convex QP, which we solve with Linear MPC. For Burgers' equation, the dynamics are nonlinear and CN produces an implicit nonlinear update, so we use Nonlinear MPC, which optimizes over the implicit CN step. In both the heat and Burgers' tasks, we also run the closed-loop Adjoint method and report its results.

9 Additional Experimental Results

In this section, we evaluate the runtime and terminal tracking error (MSE) achieved by each method on polynomial target profiles, and report these metrics in Tables 2, 3, and 4. All methods use the same spatial/temporal discretization and solver settings described in Section 10.

Figure 9: Optimal Control of the 1D Burgers’ equation with direct modeling; comparison between the solution estimate (center) of PDE-OP’s dynamic predictor 𝒴θ\cal Y_{\theta} at test time with the solution obtained with a numerical algorithm (left) and related absolute error (right).

9.1 Voltage Control Optimization

We test each method on additional target profiles y_{\text{target}}(x)=p_{1}x+p_{2}: (i) constant: (p_{1},p_{2})=(0,1), so y_{\text{target}}(x)=1; (ii) ramp: (p_{1},p_{2})=(1,0.5), so y_{\text{target}}(x)=x+0.5. Since a ramp target profile is infeasible under zero-flux (Neumann) boundaries, all methods can only produce approximations.

Table 2: Voltage Optimization Task. For each method and target profile we report runtime (s) and MSE.
Methods ytarget(x)=1y_{\text{target}}(x)=1 ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5
Runtime MSE Runtime MSE
Direct (Finite Diff.) 2.422s 4.6498e-11 12.810s 0.00005
Adjoint-Method 0.049s 2.8213e-12 0.464s 0.00004
PDE-OP (ours) 0.037s 1.5275e-6 0.050s 0.00007

Results.

Constant target y_{\text{target}}(x)=1. The Adjoint method attains the highest accuracy (MSE 2.82\times 10^{-12}) in 0.049 s, while the Direct method is much slower at 2.422 s with MSE 4.65\times 10^{-11}. PDE-OP is fastest at 0.037 s (about 65.5\times faster than Direct and 1.32\times faster than Adjoint) and reports an MSE of 1.53\times 10^{-6}.

Ramp target y_{\text{target}}(x)=x+0.5. The Adjoint method yields the lowest error (4.0\times 10^{-5}), running in 0.464 s; the Direct method achieves 5.0\times 10^{-5} but takes 12.810 s. PDE-OP runs in 0.050 s, about 256\times faster than Direct and 9.3\times faster than the Adjoint method, producing an MSE of 7.0\times 10^{-5}, which is of the same order of magnitude as the baselines.

9.2 Optimal Control of the 1D Heat Equation

For the 1D heat equation (closed-loop control), we consider affine targets y_{\text{target}}(x)=p_{1}x+p_{2} with (p_{1},p_{2})=(0,1) and (1,0.5). Under zero-flux (Neumann) boundary conditions, the ramp profile is not exactly admissible, as its derivative at the boundaries is nonzero; all controllers therefore track its best achievable approximation under these constraints. Here, we compare the proposed PDE-OP model with LMPC and the Adjoint-Sensitivity Method.

Table 3: Heat Equation Task. For each method and target profile we report runtime (s) and MSE.
Methods ytarget(x)=1y_{\text{target}}(x)=1 ytarget(x)=x+0.5y_{\text{target}}(x)=x+0.5
Runtime MSE Runtime MSE
LMPC 0.443s 2.3289e-14 0.562s 0.00008
Adjoint-Method 0.485s 6.6612e-11 0.552s 0.00008
PDE-OP (ours) 0.223s 1.5210e-6 0.132s 0.00009

Results.

Constant target y_{\text{target}}(x)=1. LMPC produces a highly accurate terminal profile (MSE 2.33\times 10^{-14}) and runs in 0.443 s; the Adjoint method runs in 0.485 s with MSE 6.66\times 10^{-11}. PDE-OP is fastest at 0.223 s (about 2.0\times faster than LMPC and 2.2\times faster than Adjoint) with a still small error of 1.52\times 10^{-6}.

Ramp target y_{\text{target}}(x)=x+0.5. LMPC and Adjoint achieve the highest accuracy (MSE = 8.0\times 10^{-5}) with runtimes of 0.562 s and 0.552 s, respectively. PDE-OP runs in only 0.132 s (about 4.3\times faster than LMPC and 4.2\times faster than the Adjoint), with an MSE of 9.0\times 10^{-5}, the same order of magnitude as the baselines.

9.3 Optimal Control of the 1D Burgers’ Equation

For the 1D viscous Burgers' equation with Dirichlet boundaries y(t,0)=y(t,L)=0, admissible steady targets must satisfy these conditions. We therefore consider (i) a zero (constant) target y_{\text{target}}(x)=0, and (ii) a parabolic target y_{\text{target}}(x)=p_{1}\,x(1-x) (we use p_{1}=2.0 in the experiments). We compare our method with NMPC and the time-varying Adjoint baseline.

Table 4: Burgers' Optimization Task. For each method and target profile we report runtime (s) and MSE.
Methods ytarget(x)=0y_{\text{target}}(x)=0 ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x)
Runtime MSE Runtime MSE
NMPC 132.243s 5.6760e-14 671.033s 0.00006
Adjoint-Method 21.487s 5.8290e-11 122.116s 0.00013
PDE-OP (ours) 0.059s 0.00002 0.073s 0.00031

Results.

Zero target y_{\text{target}}(x)=0. NMPC reaches nearly machine precision (MSE 5.68\times 10^{-14}) but is slow at 132.243 s; the Adjoint baseline is faster at 21.487 s with MSE 5.83\times 10^{-11}. PDE-OP is fastest at 0.059 s (about 2.24\times 10^{3} times faster than NMPC and 3.64\times 10^{2} times faster than Adjoint) with a small, though larger, error of 2.0\times 10^{-5}.

Parabolic target ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1\!-\!x). NMPC attains the lowest error (MSE 6.0×𝟏𝟎𝟓\mathbf{6.0\times 10^{-5}}) in 671.033671.033 s; Adjoint yields 1.3×1041.3\times 10^{-4} at 122.116122.116 s. PDE-OP runs in 0.073\mathbf{0.073} s—about 9.19×1039.19\times 10^{3} faster than NMPC and 1.67×1031.67\times 10^{3} faster than Adjoint—with MSE 3.1×1043.1\times 10^{-4} (same order of magnitude).

Takeaway. For nonlinear Burgers dynamics with Dirichlet BCs, NMPC/Adjoint provide the best MSE, while PDE-OP delivers three to four orders of magnitude lower runtime with modest accuracy loss.

9.4 Discrete-Time Neural Operator with Direct Control Field u(x,t)u(x,t)

In this section, we describe a variant of our discrete-time neural operator that learns the spatio-temporal control field u(t,x) directly, rather than the basis-function weights introduced in Section 4.2. Similarly to the original DeepONet [1], PDE-OP's model \mathcal{Y}_{\theta} factors into a branch network that encodes input functions and a trunk network that encodes query coordinates. At each time t_{k}, given the current state y(t_{k},x) and the applied control u(t_{k},x), this model predicts the next state at coordinate x:

𝒚^(tk+1,x)=𝒴θ(t,x,𝒚^(tk,x),𝒖^(tk,x))\hat{\bm{y}}(t_{k+1},x)\;=\;\mathcal{Y}_{\theta}\!\big(t,\,x,\,\hat{\bm{y}}(t_{k},x),\,\hat{\bm{u}}({t_{k}},x)\big) (47)

Branch network. The branch net encodes the estimated system state 𝒚^(tk,x)\hat{\bm{y}}(t_{k},x) sampled at nn fixed sensor locations together with estimated control signals 𝒖^(tk,x)\hat{\bm{u}}(t_{k},x), and maps this concatenated input to a dd-dimensional vector of coefficients 𝒃(𝒚^(tk,x),𝒖^(tk))d\bm{b}\!\left(\hat{\bm{y}}(t_{k},x),\hat{\bm{u}}(t_{k})\right)\in\mathbb{R}^{d}.
Trunk network. The trunk net encodes the spatial query coordinate xx, producing a latent feature vector 𝜸(x)d\bm{\gamma}(x)\in\mathbb{R}^{d}.
The estimated subsequent (in time) state 𝒚^(tk+1,x)\hat{\bm{y}}(t_{k+1},x) at location xix_{i} is obtained by combining the branch coefficients and trunk features through an inner product:

y^(tk+1,xi)=j=1dbj(y^(tk,xi),𝒖^(tk,xi))γj(xi),i=1,,n.\hat{y}(t_{k+1},x_{i})\;=\;\sum_{j=1}^{d}b_{j}\!\left(\hat{{y}}(t_{k},x_{i}),\hat{\bm{u}}(t_{k},x_{i})\right)\,\gamma_{j}(x_{i}),\;\;i=1,\dots,n. (48)

Model initialization and training (direct uu). In this setting the dynamic predictor 𝒴θ\mathcal{Y}_{\theta} takes as input the control action sampled on the spatial grid, together with the current state. 𝒴θ\mathcal{Y}_{\theta} is initialized in a supervised fashion by generating the target trajectories with a numerical PDE solver under diverse time-varying controls u(t,x)u(t,x).

We sample admissible control sequences {u(tk,xi)}\{u(t_{k},x_{i})\} on the grid {xi}i=1n\{x_{i}\}_{i=1}^{n} from a zero-mean Gaussian random field (GRF) with prescribed spatial covariance (and optional temporal correlation), after which a clipping operation is applied to meet the actuator bounds uminu(tk,xi)umaxu_{\min}\leq u(t_{k},x_{i})\leq u_{\max}. Rolling the solver with these inputs produces states y(tk,)y(t_{k},\cdot) satisfying (1b)-(1f). Each trajectory is converted into one-step training pairs

(y(tk,xi),u(tk,xi))y(tk+1,xi),k=0,,T1,i=1,,n\big(y(t_{k},x_{i}),\,u(t_{k},x_{i})\big)\;\longmapsto\;y(t_{k+1},x_{i}),\qquad k=0,\dots,T-1,\quad i=1,\dots,n

Formally, the dataset consists of triples 𝒟={(𝒚(tk,x),𝒖(tk,x),𝒚(tk+1,x))}\mathcal{D}=\Big\{\big(\bm{y}(t_{k},x),\,\bm{u}(t_{k},x),\,\bm{y}(t_{k+1},x)\big)\Big\} , with 𝒚(tk,x)=[y(tk,x1),,y(tk,xn)]\bm{y}(t_{k},x)=[y(t_{k},x_{1}),\ldots,y(t_{k},x_{n})]^{\top} and 𝒖(tk,x)=[u(tk,x1),,u(tk,xn)]\bm{u}(t_{k},x)=[u(t_{k},x_{1}),\ldots,u(t_{k},x_{n})]^{\top}. 𝒴θ\mathcal{Y}_{\theta} is trained to minimize the following loss

minθ𝔼(𝒖,y)𝒟[y^(t,x)y(t,x)2],\min_{\theta}\;\mathbb{E}_{(\bm{u},y)\sim\mathcal{D}}\Big[\big\|\hat{y}(t,x)-y(t,x)\big\|^{2}\Big], (49)
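A minimal PyTorch sketch of this branch-trunk predictor (Eqs. 47-48) is shown below; the layer widths, sensor count, and feature dimension are illustrative and do not reproduce the exact architecture reported in the hyperparameter tables. Training then amounts to standard mini-batch regression of the one-step pairs in \mathcal{D} under the loss (49).

import torch
import torch.nn as nn

class DiscreteTimeOperator(nn.Module):
    def __init__(self, n_sensors, d=128, hidden=256):
        super().__init__()
        # Branch net: encodes the sampled state and control (y(t_k, .), u(t_k, .)).
        self.branch = nn.Sequential(
            nn.Linear(2 * n_sensors, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d))
        # Trunk net: encodes the spatial query coordinate x.
        self.trunk = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d))

    def forward(self, y_k, u_k, x_query):
        # y_k, u_k: (batch, n_sensors); x_query: (batch, n_query, 1).
        b = self.branch(torch.cat([y_k, u_k], dim=-1))   # (batch, d)
        gamma = self.trunk(x_query)                       # (batch, n_query, d)
        # Inner product over the feature dimension gives y_hat(t_{k+1}, x) (Eq. 48).
        return torch.einsum("bd,bqd->bq", b, gamma)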

9.5 MPC for Direct Control Field u(x,t)u(x,t)

Here we describe a “direct-u” variant for viscous Burgers' with Dirichlet boundaries, where the control action is parameterized at every spatial collocation point. At each time t_{k}, the optimization task consists of finding the control vector \bm{u}(t_{k},x)\in\mathbb{R}^{N_{x}} (one value per spatial collocation point) and applying an implicit Crank–Nicolson step to propagate the state from t_{k} to t_{k+1}.

min{𝒖(ti)}i=kk+Np1\displaystyle\min_{\{\bm{u}(t_{i})\}_{i=k}^{k+N_{p}-1}} i=kk+Np1𝒚(ti)𝒚target𝑻Q2+𝒖(ti)𝑹2+𝒚(tk+Np)𝒚target𝑸2\displaystyle\sum_{i=k}^{k+N_{p}-1}\|\bm{y}(t_{i})-\bm{y}_{\text{target}}\|_{\bm{T}_{Q}}^{2}+\|\bm{u}(t_{i})\|_{\bm{R}}^{2}+\|\bm{y}(t_{k+N_{p}})-\bm{y}_{\text{target}}\|_{\bm{Q}}^{2} (50)
s.t. 45,umin𝒖(ti)umax.\displaystyle\ref{eq:nmpc_dyn},\;\;u_{\min}\leq\bm{u}(t_{i})\leq u_{\max}.

In case of closed loop (NMPC), a horizon of length N_{p} yields N_{p}\,N_{x} decision variables per iteration; we warm start across receding horizons and use bound-constrained quasi-Newton methods (e.g., L-BFGS-B/SLSQP). Here \bm{T}_{Q}, \bm{R}, and \bm{Q} represent the weight matrices, which correspond to the Lagrange multipliers \bm{\lambda} in PDE-OP's training algorithm. To compute the gradients, we either backpropagate through the discrete CN integrator using automatic differentiation or use the fully discrete adjoint; both methods are exact but require one forward rollout and one backward sweep over the horizon. This grid-wise formulation removes any basis bias and shows the attainable tracking accuracy when the controller has full spatial freedom. However, while direct modeling of the control action does not require any basis representation, its computational cost is higher than that of the basis-based formulation: the per-step cost scales roughly with N_{p}\,N_{t}\,N_{x}, since each line-search/gradient evaluation entails multiple CN solves. We therefore include it as a comprehensive baseline to contextualize the efficiency of our reduced-parameter (mode-based) controllers.

9.6 Experimental Results For Direct Control Field u(t,x)u(t,x)

When using the direct control field modeling for u(t,x), we evaluate each method on the nonlinear, time-varying optimal control of the 1D Burgers' equation, using the same target profiles presented in Section 5.3. Tables 5-6 report the runtime (s) and terminal tracking error (MSE) of each method. All methods use the same spatial/temporal discretization and solver settings as in Section 10. Figure 9 shows the predicted state (central sub-figure of Fig. 9) of PDE-OP's dynamic component \mathcal{Y}_{\theta} at test time, compared to the solution (left sub-figure) obtained with a numerical PDE solver. Figures 10 and 11 compare PDE-OP's predicted terminal state \hat{y}(T,x) with those obtained using the baseline methods, over the same target profiles introduced in the main text. In Figure 10, we set N_{p}=1 for NMPC and N_{p}=10 for the Adjoint method, while in Figure 11, we set N_{p}=4 for NMPC and N_{p}=20 for the Adjoint method.

Table 5: PDE-OP vs. classical baselines with short horizon NpN_{p}. Here we use Np=1N_{p}{=}1 for NMPC and Np=10N_{p}{=}10 for adjoint. We report runtime (s) and MSE (lower is better).
ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x) ytarget(x)=0.8(sin(x))y_{\text{target}}(x)=0.8(\sin(x)) ytarget(x)=0y_{\text{target}}(x)=0
Method Runtime MSE Runtime MSE Runtime MSE
NMPC 39.254s 0.00002 76.368s 0.00014 36.483s 7.9141e-06
Adjoint-Method 47.805s 0.00124 79.644s 0.00317 24.956s 2.7618e-08
PDE-OP (ours) 0.267s 0.00003 0.256s 0.00008 0.257s 0.00008

Results.

Across all targets, PDE-OP is two orders of magnitude faster than the classical methods, between 100\times and 300\times faster than NMPC and the Adjoint. In terms of accuracy, for the sine target, PDE-OP achieves the lowest MSE; for the parabola, NMPC attains the best result with PDE-OP close behind; for the zero target, the Adjoint method achieves the lowest MSE, with PDE-OP producing an MSE on the order of 10^{-5}.

Table 6: PDE-OP vs. classical baselines with longer horizons. Here we use Np=4N_{p}{=}4 for NMPC and Np=20N_{p}{=}20 for adjoint. We report runtime (s) and MSE (lower is better).
ytarget(x)=2x(1x)y_{\text{target}}(x)=2x(1-x) ytarget(x)=0.8(sin(x))y_{\text{target}}(x)=0.8(\sin(x)) ytarget(x)=0y_{\text{target}}(x)=0
Method Runtime MSE Runtime MSE Runtime MSE
NMPC 826.176s 0.00016 3143.735s 0.00032 470.851s 0.00061
Adjoint-Method 186.910s 0.00129 263.891s 0.00316 51.702s 4.9213e-09
PDE-OP (ours) 0.254s 0.00003 0.277s 0.00008 0.956s 0.00008

Results.

With N_{p}=4 (NMPC) and N_{p}=20 (adjoint), PDE-OP remains dramatically faster, by three to four orders of magnitude, than the classical solvers. For the parabola target, PDE-OP is about 3250\times faster than NMPC and about 740\times faster than the adjoint, while also reporting the best MSE (3\times 10^{-5}). For the sine target, the speedups are about 11300\times vs. NMPC and 950\times vs. the adjoint, while again attaining the lowest MSE (8\times 10^{-5}). For the zero target, the Adjoint method attains near-perfect regulation (4.9\times 10^{-9}); PDE-OP is still the fastest, running in 0.956 s, approximately 54\times faster than the Adjoint method and 495\times faster than NMPC, while reporting a very small tracking error (8\times 10^{-5}).

Figure 10: Comparison between PDE-OP and each baseline method's solution on three different target profiles: y_{\text{target}}(x)=0.8\sin(x) (left), y_{\text{target}}(x)=2x(1-x) (center), and y_{\text{target}}(x)=0 (right). We set N_{p}=1 for NMPC and N_{p}=10 for the Adjoint method.
Figure 11: Comparison between PDE-OP and each baseline method's solution on three different target profiles: y_{\text{target}}(x)=0.8\sin(x) (left), y_{\text{target}}(x)=2x(1-x) (center), and y_{\text{target}}(x)=0 (right). We set N_{p}=4 for NMPC and N_{p}=20 for the Adjoint method.

10 Hyperparameter Settings

In Tables 7, 8, and 9, we provide the hyperparameter values of the PDE-OP model and of the baseline methods for the voltage optimization experiments; Tables 10, 11, and 12 report the corresponding values for the heat equation experiments; lastly, Tables 13, 14, and 15 report the values for the Burgers' equation experiments. For a fair evaluation, each method adopts the same PDE, grid, and horizon parameters n, N_{x}, N_{t}, [u_{\min},u_{\max}], [c_{\min},c_{\max}].

Table 7: Voltage equation: simulation parameters and model/training hyperparameters.
MLP [33] denotes a fully connected feedforward network.
Category Hyperparameter Value
PDE & Simulation Domain length LL 1.01.0
Final time TT 5.05.0
Diffusion DD 0.10.1
Leakage β\beta 1.01.0
Control gain α\alpha 2.02.0
Initial state V(x,0)V(x,0) 0.00.0
Reference Vref(x)V_{\text{ref}}(x) 1.01.0
Control bounds [umin,umax][u_{\min},u_{\max}] [1.0, 1.0][-1.0,\;1.0]
Discretization Solver grid (n)(n) 101101
Solver steps (Nt)(N_{t}) 101101
Number of sensors (Nx=n)(N_{x}=n) 101101
PDE-OP’s 𝒴θ\cal Y_{\theta} Learning Rate 5×1045\times 10^{-4}
Number of epochs 10001000
Optimizer Adam
Batch size 10241024
Activation function SiLU
Lagrange Multiplier ρ\rho 0.050.05
Branch net (b)(b): # layers 44
Branch net (b)(b): hidden sizes [512,512,512,512][512,512,512,512]
Trunk net (γ)(\gamma): # layers 44
Trunk net (γ)(\gamma): hidden sizes [512,512,512,512][512,512,512,512]
Feature dimension (d)(d) 512512
PDE-OP’s 𝒰ω\mathcal{U}_{\omega} Learning rate 1×1031\times 10^{-3}
Number of epochs 10001000
Optimizer AdamW
Batch size 256256
Activation function RELU
MLP: # layers 33
MLP: hidden sizes [256,256,256][256,256,256]
Table 8: Voltage equation: Direct (finite-difference) method parameters.
Category Hyperparameter Value / Note
Optimizer Algorithm L-BFGS-B (box bounds [umin,umax][u_{\min},u_{\max}])
Max iterations 100
Tolerances ftol=106\texttt{ftol}=10^{-6} (and/or gtol)
Initialization 𝒖(t0)=𝟎\bm{u}(t_{0})=\mathbf{0} (uniform)
Finite diff. Stencil forward (optionally central)
Step size ϵj=η(1+|uj|)\epsilon_{j}=\eta(1+|u_{j}|), η=106\eta=10^{-6}
Cost eval Rollout solver CN
Runtime note Nx+1N_{x}{+}1 evals/grad (or 2Nx2N_{x} for central)
Table 9: Voltage equation: Adjoint-method parameters.
Category Hyperparameter Value / Note
Optimizer Algorithm L-BFGS-B (box bounds [umin,umax][u_{\min},u_{\max}])
Max iterations 100
Tolerances ftol=106\texttt{ftol}=10^{-6} (and/or gtol)
Initialization 𝒖(t0)=𝟎\bm{u}(t_{0})=\mathbf{0}
Adjoint solves Forward stepper CN
Backward sweep discrete adjoint (one pass)
Numerics Linear solves direct (LU) or iterative tol 101010^{-10}
Table 10: Heat equation: simulation parameters and model/training hyperparameters.
LSTM [34] denotes a Long Short-Term Memory network.
Category Hyperparameter Value
PDE & Simulation Domain length LL 1.01.0
Final time TT 1.01.0
Diffusion DD 0.10.1
Leakage β\beta 0.50.5
Control gain α\alpha 2.02.0
Initial state V(x,0)V(x,0) 0.00.0
Reference Vref(x)V_{\text{ref}}(x) 0.00.0
Weight bounds [cmin,cmax][c_{\min},c_{\max}] [1.0, 1.0][-1.0,\;1.0]
Discretization Solver grid (n)(n) 4141
Solver steps (Nt)(N_{t}) 4141
Number of sensors (Nx=n)(N_{x}=n) 4141
Number of basis functions (M)(M) 66
PDE-OP’s 𝒴θ\cal Y_{\theta} Learning Rate 1×1031\times 10^{-3}
Number of epochs 20002000
Optimizer AdamW
Batch size 128128
Activation function RELU
Branch net (b)(b): # layers 44
Branch net (b)(b): hidden sizes [512,512,512,512][512,512,512,512]
Trunk net (γ)(\gamma): # layers 44
Trunk net (γ)(\gamma): hidden sizes [512,512,512,512][512,512,512,512]
Feature dimension (d)(d) 512512
PDE-OP’s 𝒰ω\mathcal{U}_{\omega} Learning rate 1×1041\times 10^{-4}
Number of epochs 50005000
Optimizer AdamW
Batch size 20482048
Activation function RELU
LSTM: # layers 22
LSTM: hidden sizes [256,256][256,256]
Table 11: Heat equation: Linear MPC (LMPC) parameters. (*) The basis functions are used by each method.
Category Parameter Value / Note
Horizon Prediction length NpN_{p} 1010
Dynamics State update 𝒚(tk+1)=𝑨𝒚(tk)+𝑩𝒖(tk)+𝒈\bm{y}(t_{k+1})=\bm{A}\bm{y}(t_{k})+\bm{B}\bm{u}(t_{k})+\bm{g}
Discretization Crank–Nicolson (linear)
Actuation Basis cos(jπx/L)\cos(j\pi x/L), j=0,,m1j=0,\dots,m-1
Solver QP solver OSQP (CVXPY), warm start
Stopping Tolerances default OSQP (primal/dual res.)
Table 12: Heat equation: Adjoint method parameters.
Category Parameter Value / Note
Horizon Prediction length NpN_{p} 1010
Forward Integrator Crank–Nicolson (linear), same 𝑨,𝑩,𝒈\bm{A},\bm{B},\bm{g}
Adjoint Backward recursion 𝝀Nt=𝑸(𝒚Nt𝒚target)\bm{\lambda}_{N_{t}}=\bm{Q}\,(\bm{y}_{N_{t}}-\bm{y}_{\text{target}})
𝝀k=𝑨𝝀k+1+𝑻Q(𝒚(tk)𝒚target)\ \bm{\lambda}_{k}=\bm{A}^{\top}\bm{\lambda}_{k+1}+\bm{T}_{Q}\,(\bm{y}(t_{k})-\bm{y}_{\text{target}})
Gradient Per step 𝒖(tk)Jh=𝑹𝒖(tk)+𝑩𝝀k+1.\nabla_{\bm{u}(t_{k})}J_{h}=\bm{R}\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{\lambda}_{k+1}.
Optimizer Algorithm L-BFGS-B, warm start
Stopping Max iters / tolerances 100; ftol=106\texttt{ftol}=10^{-6}
Table 13: Burgers’ equation: simulation parameters and model/training hyperparameters.
LSTM[34] denotes a Long Short-Term Memory network.
Category Hyperparameter Value
PDE & Simulation Domain length LL 1.01.0
Final time TT 4.04.0
Viscosity ν\nu 0.030.03
Weight bounds [cmin,cmax][c_{\min},c_{\max}] [1.0, 1.0][-1.0,\;1.0]
Discretization Solver grid (n)(n) 8181
Solver steps (Nt)(N_{t}) 201201
Number of sensors (Nx=n)(N_{x}=n) 8181
Number of basis functions (M)(M) 44
PDE-OP’s 𝒴θ\cal Y_{\theta} Learning Rate 1×1031\times 10^{-3}
Number of epochs 20002000
Optimizer AdamW
Batch size 512512
Activation function RELU
Branch net (b)(b): # layers 33
Branch net (b)(b): hidden sizes [256,256,256][256,256,256]
Trunk net (γ)(\gamma): # layers 55
Trunk net (γ)(\gamma): hidden sizes [256,256,256,256,256][256,256,256,256,256]
Feature dimension (d)(d) 128128
PDE-OP’s 𝒰ω\mathcal{U}_{\omega} Learning rate 1×1041\times 10^{-4}
Number of epochs 50005000
Optimizer AdamW
Batch size 20482048
Activation function RELU
LSTM: # layers 22
LSTM: hidden sizes [256,256][256,256]
Table 14: Burgers' equation: Nonlinear MPC (NMPC) parameters. (*) The basis functions are used by each method.
Category Parameter Value / Note
Horizon Prediction length NpN_{p} e.g. 1010 (receding horizon)
Dynamics Implicit step (CN) 𝒚(tk+1)=𝒢(𝒚(tk),𝒄(tk))\bm{y}(t_{k+1})\;=\;\mathcal{G}(\bm{y}(t_{k}),\bm{c}(t_{k}))
Actuation Basis sin((j+1)πx/X)\sin\!\big((j{+}1)\pi x/X\big), j=0,,M1j=0,\dots,M-1
Solver Nonlinear optimizer SLSQP (single-shooting), box bounds; warm start by shift
CN inner Newton solve per step tolerance 101010^{-10}, max iters 2525
Table 15: Burgers’ equation: Adjoint method parameters.
Category Parameter Value / Note
Horizon Prediction length N_{p} 10
Forward Implicit CN dynamics as in NMPC, Dirichlet enforced each step
Jacobian Semi-discrete residual J(𝒚)=𝑫diag(𝒚)+ν𝑫2J(\bm{y})=-\bm{D}\,\mathrm{diag}(\bm{y})+\nu\,\bm{D}^{2}
Adjoint Backward sweep Ak𝒒k=Δt(𝒚(tk+1)𝒚target)A_{k}^{\top}\bm{q}_{k}=\Delta t\,(\bm{y}(t_{k+1})-\bm{y}_{\text{target}});
𝒑k=Δt(𝒚(tk)𝒚target)Bk𝒒k\bm{p}_{k}=\Delta t\,(\bm{y}(t_{k})-\bm{y}_{\text{target}})-B_{k}^{\top}\bm{q}_{k}
Gradient Per-step control gradient 𝒖(tk)J=Δt(α𝒖(tk)+𝑩𝒒k),\nabla_{\bm{u}(t_{k})}J=\Delta t\big(\alpha\,\bm{u}(t_{k})+\bm{B}^{\top}\bm{q}_{k}\big),
Optimizer Outer optimizer L-BFGS-B, warm start