Intrusion Tolerance for Networked Systems through Two-Level Feedback Control

1/35
Intrusion Tolerance for Networked Systems
through Two-Level Feedback Control
IEEE DSN 2024, Brisbane, Australia
International Conference on Dependable Systems and Networks
Kim Hammar and Rolf Stadler
kimham@kth.se and stadler@kth.se
KTH Royal Institute of Technology
June 27, 2024

2/35
Use Case: Intrusion Tolerance
[Figure: clients reach service replicas 1-4 through a client interface and load balancer; the infrastructure comprises API gateways, compute nodes, and storage nodes.]
- A replicated system offers a service to a client population.
- The system should provide service without disruption.

2/35
Use Case: Intrusion Tolerance
[Figure: the same replicated system, now with an attacker alongside the clients.]
- An attacker seeks to intrude on the system and disrupt service.
- The system should tolerate intrusions.

3/35
Intrusion Tolerance (Simplified)
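[Figure: system performance over time. At the intrusion event, performance drops below its normal level and returns to it at the time of full recovery. Labeled quantities: recovery time, survivability, loss, and tolerance. The cumulative performance loss is the quantity we want to minimize.]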

4/35
Increasing Demand for Intrusion-Tolerant Systems
- As our reliance on online services grows, there is an increasing demand for intrusion-tolerant systems.
- Example applications:
  - Power grids, e.g., SCADA systems [1].
  - Safety-critical IT systems, e.g., banking systems, e-commerce applications [2], healthcare systems, etc.
  - Real-time control systems, e.g., flight control computers [3].
[Figure: a flight control computer connected to sensors and actuators.]

[1] Amy Babay et al. “Network-Attack-Resilient Intrusion-Tolerant SCADA for the Power Grid”. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 2018, pp. 255–266. doi: 10.1109/DSN.2018.00036.
[2] Jukka Soikkeli et al. “Redundancy Planning for Cost Efficient Resilience to Cyber Attacks”. In: IEEE Transactions on Dependable and Secure Computing 20.2 (2023), pp. 1154–1168. doi: 10.1109/TDSC.2022.3151462.
[3] J.H. Wensley et al. “SIFT: Design and analysis of a fault-tolerant computer for aircraft control”. In: Proceedings of the IEEE 66.10 (1978), pp. 1240–1255. doi: 10.1109/PROC.1978.11114.

5/35
Theoretical Foundations of Intrusion Tolerance
[Figure: intrusion-tolerant systems build on three fields: reliability theory, distributed systems, and cryptography. Researchers pictured: L. Lamport, W. Weibull, R.E. Barlow, W. Shewhart, L. Adleman, A. Shamir, R. Rivest, D. Dolev, N. Lynch, E. Dijkstra, J. Gray, and B. Liskov.]

6/35
Our Contribution
[Figure: this paper adds control and decision theory to reliability theory, distributed systems, and cryptography as a foundation for intrusion-tolerant systems.]

7/35
Building Blocks of An Intrusion-Tolerant System
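[Figure: node state diagram with the states Healthy (H), Compromised (C), and Crashed (∅), and the transitions compromise, recovery, and crash.]
[Figure: reliability as a function of time t for 25, 50, and 100 replicas.]
[Figure: a client interface passes requests to, and returns responses from, a replicated system of replicas 1-5 that run a consensus protocol.]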
1. Intrusion-tolerant consensus protocol: a quorum needs to reach agreement to tolerate f compromised replicas.
2. Replication strategy: cost-reliability trade-off.
3. Recovery strategy: compromises will occur as t → ∞.

8/35
Prior Work on Intrusion-Tolerant Systems
- Published 1995: fixed number of replicas; no recoveries.
- Published 1998: no recoveries.
- Published 2002: periodic recoveries.
- Published 2004.
- Published 2006: adaptive replication based on heuristics.
- Published 2006.
- Published 2007: supports both periodic and reactive recoveries; does not provide reactive recovery strategies.
- Published 2011.
- Published 2018.
- Published 2023.
- Published 2023.
Can we do better by leveraging decision theory and optimal control?

9/35
The TOLERANCE Architecture
TOLERANCE: Two-level recovery and replication control with feedback.
[Figure: the TOLERANCE architecture. Each node (1, 2, ..., N_t) runs on hardware with a virtualization layer and is split into two domains. The application domain hosts a service replica. The privileged domain hosts a node controller, which receives IDS alerts and triggers recovery of the replica. The replicas run a consensus protocol. Each node controller sends a state estimate to the system controller, which decides whether to evict or add nodes. Clients send service requests and receive responses; an attacker makes intrusion attempts.]

10/35
Definition 1 (Correct service)
The system provides correct service if the healthy replicas satisfy the following properties:
- Liveness: each request is eventually executed.
- Validity: each executed request was sent by a client.
- Safety: each replica executes the same request sequence.

11/35
Proposition 1 (Correctness of TOLERANCE)
A system that implements the TOLERANCE architecture provides correct service if
- Network links are authenticated.
- At most f nodes are compromised or crashed simultaneously.
- $N_t \geq 2f + 1$.
- The system is partially synchronous.
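As a small worked example of the replication condition $N_t \geq 2f + 1$, the sketch below computes the largest f that a given number of nodes can tolerate. The function name is hypothetical and only illustrates the arithmetic.

```python
def max_tolerated_failures(num_nodes: int) -> int:
    """Largest f such that num_nodes >= 2f + 1 (hypothetical helper)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return (num_nodes - 1) // 2


# With 3 nodes one failure is tolerated; a 4th node does not raise f.
for n in (3, 4, 9):
    print(n, max_tolerated_failures(n))
```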

12/35
Intrusion Tolerance as a Two-Level Control Problem
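[Figure: two-level control. Node controllers with strategies $\pi_1(b_1), \ldots, \pi_{N_t}(b_{N_t})$ manage the replicas of the replicated system and transmit their beliefs $b_1, \ldots, b_{N_t}$ to the system controller, whose strategy is $\pi(b_1, \ldots, b_{N_t})$. Clients and an attacker interact with the replicated system.]
- The local level models intrusion recovery.
- The global level models replication control.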

12/35
Assumption 1
The probability that the system controller fails is negligible.
Assumption 2
Compromise and crash events are statistically independent across
nodes.

13/35
The Local Control Problem
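- Partially observed Markov decision process $\Gamma_i$.
- Controller actions: (R)ecover and (W)ait, $a_{i,t} \in \{R, W\}$.
- Node states: $\mathcal{S}_N = \{H, C, \emptyset\}$, i.e., (H)ealthy, (C)ompromised, and crashed ($\emptyset$); $s_{i,t} \in \mathcal{S}_N$.
- State transition function: $f(s_{i,t+1} \mid s_{i,t}, a_{i,t})$.
- $p_{C,i}$: crash probability; $p_{A,i}$: intrusion probability.
- Observation $o_{i,t} \sim z_i(\cdot \mid s_{i,t})$, e.g., IDS alerts at time t.
[Figure: state transition diagram over Healthy (H), Compromised (C), and Crashed (∅). Both H and C move to ∅ with probability $p_{C,i}$. The controller action $a^{(C)}_i = R$ moves the node from C to H, and the attacker action $a^{(A)}_i = A$ moves it from H to C.]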
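The node model above can be sketched as a small transition function. This is a minimal sketch under stated assumptions, not the transition function from the paper: it assumes that a crash happens with probability p_c from both live states, that an attack succeeds with probability p_a on a node that is healthy or has just been recovered, and that the crashed state is absorbing. All names are hypothetical.

```python
HEALTHY, COMPROMISED, CRASHED = 0, 1, 2
WAIT, RECOVER = "W", "R"


def transition_probs(state, action, p_a, p_c):
    """Return [P(healthy), P(compromised), P(crashed)] for the next state.

    Assumed dynamics (illustrative only): crashes occur with probability
    p_c; a healthy or freshly recovered node is compromised with
    probability p_a; a compromised node stays compromised while waiting.
    """
    if state == CRASHED:
        return [0.0, 0.0, 1.0]  # assumed absorbing
    alive = 1.0 - p_c
    if state == COMPROMISED and action == WAIT:
        return [0.0, alive, p_c]
    # Healthy node, or compromised node that is being recovered.
    return [alive * (1.0 - p_a), alive * p_a, p_c]


print(transition_probs(HEALTHY, WAIT, p_a=0.1, p_c=0.01))
print(transition_probs(COMPROMISED, RECOVER, p_a=0.1, p_c=0.01))
```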

14/35
Node Controller Strategy
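- The controller computes the belief
  $b_{i,t} \triangleq \mathbb{P}[S_{i,t} = C \mid h_t]$,
  where $h_t \triangleq (b_{i,1}, a_{i,1}, o_{i,2}, a_{i,2}, o_{i,3}, \ldots, a_{i,t-1}, o_{i,t})$ is the history.
- Controller strategy:
  $\pi : [0, 1] \to \{W, R\}$.
[Figure: controller and belief.]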
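The belief can be maintained recursively with a standard Bayes filter. The sketch below assumes a two-state view (healthy, compromised) that conditions on the node not having crashed, and it assumes the observation depends on the state after the transition. The transition table, the alert model, and all names are hypothetical.

```python
def update_belief(belief, action, observation, trans, obs_prob):
    """One Bayes-filter step; `belief` is P[compromised].

    trans[action][s][s_next] and obs_prob[s_next][observation] are
    caller-supplied probability tables (index 0 = healthy, 1 = compromised).
    """
    prior = [1.0 - belief, belief]
    unnormalized = []
    for s_next in (0, 1):
        predicted = sum(prior[s] * trans[action][s][s_next] for s in (0, 1))
        unnormalized.append(obs_prob[s_next][observation] * predicted)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized[1] / total


# Hypothetical model: intrusion probability 0.1, recovery resets the node.
trans = {
    "W": [[0.9, 0.1], [0.0, 1.0]],
    "R": [[0.9, 0.1], [0.9, 0.1]],
}
obs_prob = [{"low": 0.8, "high": 0.2}, {"low": 0.3, "high": 0.7}]
print(update_belief(0.0, "W", "high", trans, obs_prob))
```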

15/35
Node Controller Objective
- Cost: $J_i \triangleq \eta T^{(R)}_i + F^{(R)}_i$.
  - $T^{(R)}_i$ is the average time-to-recovery.
  - $F^{(R)}_i$ is the recovery frequency.
  - $\eta > 1$ is a scaling factor.
- Bounded-time-to-recovery constraint: the time between two recoveries can be at most $\Delta_R$.
[Figure: failure (crash or compromise) probability as a function of time t, for p = 0.1, 0.05, 0.025, 0.01, and 0.005, where p is the failure probability per time-step.]
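The curves in the figure can be reproduced under one assumption that the slide does not state explicitly: failures in different time-steps are independent, so the probability of at least one failure within t steps is $1 - (1 - p)^t$. The function name is hypothetical.

```python
def failure_probability(p: float, t: int) -> float:
    """P[at least one failure within t steps], assuming independent steps."""
    if not 0.0 <= p <= 1.0 or t < 0:
        raise ValueError("need 0 <= p <= 1 and t >= 0")
    return 1.0 - (1.0 - p) ** t


for p in (0.1, 0.05, 0.025, 0.01, 0.005):
    print(p, round(failure_probability(p, 50), 3))
```

Under this model the failure probability approaches 1 as t grows for any p > 0, which is one way to read the motivation for bounding the time between recoveries.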

16/35
Threshold Structure of the Optimal Control Strategy
[Figure: the controller's optimal cost function $\mathbb{E}[J_i \mid b_{i,1}]$ as a function of the initial belief $b_{i,1}$, formed from alpha vectors. The threshold $\alpha^\star$ separates the wait region from the recovery region.]
Theorem 2
There exists an optimal control strategy that satisfies
$\pi^\star_{i,t}(b_{i,t}) = R \iff b_{i,t} \geq \alpha^\star_{i,t} \quad \forall t$,
where $\alpha^\star_{i,t} \in [0, 1]$ is a threshold.
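A threshold strategy of this form is cheap to evaluate at run time. The sketch below combines the threshold rule with the bounded-time-to-recovery constraint from the objective; how the constraint is counted (here: recover once `steps_since_recovery` reaches `max_steps`) is an assumed convention, and all names are hypothetical.

```python
def recovery_action(belief, threshold, steps_since_recovery, max_steps):
    """Return "R" (recover) or "W" (wait) for one time-step.

    Recovers when the belief reaches the threshold, or when the assumed
    bounded-time-to-recovery limit `max_steps` has been reached.
    """
    if steps_since_recovery >= max_steps:
        return "R"  # forced by the time bound
    return "R" if belief >= threshold else "W"


print(recovery_action(0.3, 0.7, steps_since_recovery=4, max_steps=15))   # W
print(recovery_action(0.8, 0.7, steps_since_recovery=4, max_steps=15))   # R
print(recovery_action(0.3, 0.7, steps_since_recovery=15, max_steps=15))  # R
```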

17/35
Efficient Computation of Optimal Recovery Strategies
Algorithm 1: Threshold Optimization
Input: objective function $J_i$, parametric optimizer PO.
Output: an approximately optimal control strategy $\hat{\pi}_{i,\theta}$.
  $\Theta \leftarrow [0, 1]$.
  For each $\theta \in \Theta$, define $\pi_{i,\theta}(b_{i,t}) \triangleq R$ if $b_{i,t} \geq \theta$, and $W$ otherwise.
  $J_\theta \leftarrow \mathbb{E}_{\pi_{i,\theta}}[J_i]$.
  $\hat{\pi}_{i,\theta} \leftarrow \mathrm{PO}(\Theta, J_\theta)$.
  return $\hat{\pi}_{i,\theta}$.
- Examples of parametric optimization algorithms: CEM, BO, CMA-ES, DE, SPSA, etc.
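The steps of Algorithm 1 can be sketched end to end on a toy node model. This is an illustration only: a plain grid search stands in for the parametric optimizer (it is not one of the optimizers listed above), the cost is a Monte Carlo estimate of $\eta T^{(R)} + F^{(R)}$, and every number (intrusion probability, alert model, $\eta$, horizon) is hypothetical.

```python
import random


def simulate_cost(theta, eta=2.0, p_a=0.1, steps=2000, seed=0):
    """Estimate J = eta * T_R + F_R for the threshold strategy `theta`."""
    rng = random.Random(seed)
    p_alert = (0.2, 0.8)  # assumed P(alert | healthy), P(alert | compromised)
    compromised, belief = False, 0.0
    recoveries = intrusions = time_compromised = 0
    for _ in range(steps):
        if belief >= theta:  # threshold strategy: recover
            recoveries += 1
            compromised, belief = False, 0.0
        if not compromised and rng.random() < p_a:
            compromised = True
            intrusions += 1
        time_compromised += int(compromised)
        alert = rng.random() < p_alert[int(compromised)]
        # Bayes update of P[compromised] given the alert observation.
        predicted = belief + (1.0 - belief) * p_a
        like_c = p_alert[1] if alert else 1.0 - p_alert[1]
        like_h = p_alert[0] if alert else 1.0 - p_alert[0]
        belief = like_c * predicted / (
            like_c * predicted + like_h * (1.0 - predicted)
        )
    avg_time_to_recovery = time_compromised / max(intrusions, 1)
    recovery_frequency = recoveries / steps
    return eta * avg_time_to_recovery + recovery_frequency


def optimize_threshold(candidates):
    """Grid search over candidate thresholds (stand-in for PO)."""
    return min(candidates, key=simulate_cost)


grid = [i / 20 for i in range(21)]
best = optimize_threshold(grid)
print(best, simulate_cost(best))
```

Because the search space is a single scalar in [0, 1], even a coarse search is cheap; the optimizers named on the slide would replace `optimize_threshold` in a fuller implementation.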

18/35
Efficient Computation of Optimal Recovery Strategies
[Figure: mean compute time (min, log scale) to obtain an optimal recovery strategy for different values of the bounded-time-to-recovery constraint $\Delta_R$ (5 to 25), comparing CEM, DE, BO, SPSA, and dynamic programming.]

19/35
The Benefit of Optimal Recovery Control
[Figure: cost $J_i$ of the optimal strategy and of the periodic strategy as a function of $D_{\mathrm{KL}}(\text{no intrusion} \,\|\, \text{intrusion})$.]
Key insight (benefit of optimal recovery): Optimal recovery control can significantly reduce operational cost, provided that an intrusion detection model is available.

20/35
Intrusion Tolerance as a Two-Level Control Problem

21/35
The Global Control Problem
- Constrained Markov decision process $\Gamma$.
- States: $\mathcal{S}_S = \{0, 1, \ldots, s_{\max}\}$, the number of healthy nodes.
- Controller actions: add $a^{(C)}_t \in \{0, 1\}$ nodes.
- Dynamics $f$: depend on the local nodes.
- Markov strategy: $\pi : \mathcal{S}_S \to \{0, 1\}$.

22/35
System Controller Objective
- Cost: $J \triangleq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} a_t$.
- Constraint: $T^{(A)} \geq A$, where $T^{(A)}$ is the availability.

A          Allowed service downtime per year
0.9        36 days
0.95       18 days
0.99       3 days
0.999      8 hours
0.9999     52 minutes
0.99999    5 minutes
1          0 minutes
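The downtime column follows from the availability by simple arithmetic: the allowed downtime is the fraction $1 - A$ of a year. The sketch below assumes a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability: float) -> float:
    """Allowed service downtime per year, in minutes."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


for a in (0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999, 1.0):
    minutes = allowed_downtime_minutes(a)
    print(f"{a}: {minutes:.2f} min = {minutes / 60:.2f} h = {minutes / 1440:.2f} days")
```

The table entries correspond to these values rounded down to whole units, e.g., 0.9999 gives about 52.56 minutes.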

23/35
System Reliability Analysis
- The mean time to failure (MTTF) is the mean hitting time of a state where $s_t \leq f$:
  $\mathbb{E}[T^{(F)} \mid S_1 = s_1] = \mathbb{E}_{(S_t)_{t \geq 1}}\left[\inf\{t \geq 1 \mid S_t \leq f\} \mid S_1 = s_1\right]$.
[Figure: the MTTF $\mathbb{E}[T^{(F)}]$ as a function of the number of initial nodes $N_1$ (10 to 100) for failure probabilities per node $p_i$ = 0.05, 0.025, and 0.01.]
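A mean hitting time of this kind can be computed exactly for a simple chain. The sketch below assumes a model that is simpler than the one in the talk: no recoveries and no added nodes, and each healthy node fails independently with probability p per step, so the number of healthy nodes only decreases. It returns the expected number of transitions until at most f nodes are healthy; with the indexing above ($S_1 = s_1$), $T^{(F)}$ equals one plus that number. All names are hypothetical.

```python
from math import comb


def mean_steps_to_failure(n1: int, f: int, p: float) -> float:
    """Expected transitions until at most f of n1 nodes remain healthy."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    hitting = {}  # hitting[s] = expected transitions starting from s > f
    for s in range(f + 1, n1 + 1):
        stay = (1.0 - p) ** s  # no node fails in this step
        total = 1.0
        for k in range(1, s - f):  # k failures that leave more than f nodes
            prob = comb(s, k) * p**k * (1.0 - p) ** (s - k)
            total += prob * hitting[s - k]
        hitting[s] = total / (1.0 - stay)
    return hitting.get(n1, 0.0)


for n1 in (10, 50, 100):
    print(n1, round(mean_steps_to_failure(n1, f=3, p=0.05), 1))
```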

24/35
Theorem 3 (Optimal Control Strategy Existence)
Assume that
(A) the Markov chain induced by any control strategy is unichain, and
(B) the availability constraint is feasible.
Then the following hold:
1. There exists an optimal stationary replication control strategy.
2. The optimal strategy has a threshold structure.
3. An optimal replication control strategy can be computed by using linear programming.
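One standard way to write such a linear program is in terms of occupancy measures. The formulation below is a sketch of that textbook form for a constrained average-cost MDP, not necessarily the exact program used in the paper. The symbol $\rho(s, a)$ (the long-run fraction of time spent in state $s$ taking action $a$) is introduced here for illustration, and the availability is assumed to be the long-run fraction of time with more than $f$ healthy nodes.

```latex
\begin{align*}
\min_{\rho} \quad & \sum_{s \in \mathcal{S}_S} \sum_{a \in \{0,1\}} \rho(s, a)\, a \\
\text{s.t.} \quad
& \sum_{a} \rho(s', a) = \sum_{s} \sum_{a} \rho(s, a)\, f(s' \mid s, a)
  && \forall s' \in \mathcal{S}_S, \\
& \sum_{s} \sum_{a} \rho(s, a) = 1, \qquad \rho(s, a) \geq 0
  && \forall s, a, \\
& \sum_{s > f} \sum_{a} \rho(s, a) \geq A .
\end{align*}
```

Given a solution $\rho^\star$, a stationary strategy can be read off as $\pi(a \mid s) = \rho^\star(s, a) / \sum_{a'} \rho^\star(s, a')$ for every state with positive occupancy.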

25/35
Efficient Computation of Optimal Replication Control Strategies
[Figure: mean compute time (s, log scale) to obtain an optimal replication control strategy, as a function of the maximum number of nodes $s_{\max}$ (4 to 2048).]

26/35
The Benefit of Optimal Replication Control
[Figure: availability over time t (up to 1,000) for the optimal strategy and for $N_1 = 10$ and $N_1 = 100$.]
Key insight (benefit of optimal replication control): Optimal replication control can guarantee a high service availability in expectation. The benefit of optimal replication is most prominent for long-running systems.

27/35
Summary of the Control-Theoretic Model
- Intrusion recovery control.
  - Partially observed Markov decision process.
  - Threshold structure of optimal control strategies.
  - Efficient computation through stochastic approximation.
- Replication control.
  - Constrained Markov decision process.
  - Threshold structure of optimal control strategies.
  - Efficient computation through linear programming.

28/35
Experiment Setup - Testbed

29/35
The TOLERANCE Architecture
Two-level recovery and replication control with feedback.
[Figure: the TOLERANCE architecture, as on slide 9.]
- A replicated web service which offers two operations:
  - A read operation that returns the service state.
  - A write operation that updates the state.

30/35
Intrusion-Tolerant Consensus Protocol (MinBFT)
[Figure: message patterns of the protocol among a client or controller and replicas 1-3.
a) Normal operation: request, prepare, commit, reply; replica 1 is the leader.
b) View change: after the leader of view v (replica 1) crashes, the replicas exchange view-change and new-view messages and replica 2 becomes the leader of view v + 1.
c) Checkpoint: the replicas exchange checkpoint messages.
d) State transfer: the controller recovers the compromised replica 1, which requests and receives the state from the other replicas.
e) Join: a new replica sends a join-request, followed by join, new-view, and join-reply messages.
f) Evict: the system controller sends an evict-request, followed by evict, new-view, and exit-reply messages.]

31/35
Experiment Setup - Emulated Intrusions
Replica ID    Intrusion steps
1             TCP SYN scan, FTP brute force
2             TCP SYN scan, SSH brute force
3             TCP SYN scan, Telnet brute force
4             ICMP scan, exploit of CVE-2017-7494
6             ICMP scan, exploit of CWE-89 on DVWA
9             ICMP scan, SSH brute force, exploit of CVE-2010-0426
10            ICMP scan, SSH brute force, exploit of CVE-2015-5602
Table 1: Intrusion steps.

32/35
Experiment Setup - Background Traffic
Background services                    Replica ID(s)
FTP, SSH, MongoDB, HTTP, TeamSpeak     1
SSH, DNS, HTTP                         2
SSH, Telnet, HTTP                      3
SSH, Samba, NTP                        4
SSH                                    5, 7, 8, 10
DVWA, IRC, SSH                         6
TeamSpeak, HTTP, SSH                   9
Table 2: Background services.

33/35
Estimated Distributions of Intrusion Alerts
[Figure: estimated alert distributions $\hat{z}_i(o_i \mid a^{(A)}_i)$ over $o_i \in \mathcal{O}$ for eight intrusion types: CVE-2010-0426, CVE-2015-3306, CVE-2015-5602, CVE-2016-10033, CWE-89, CVE-2017-7494, CVE-2014-6271, and FTP/SSH/Telnet brute force. Each panel compares the distribution during an attack ($a^{(A)}_i = A$) with the distribution of false alarms ($a^{(A)}_i = F$).]
- We estimate the observation distribution $z$ with the empirical distribution $\hat{z}$ based on $M$ samples.
- $\hat{z} \to z$ almost surely as $M \to \infty$ (Glivenko-Cantelli theorem).
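The empirical distribution is the relative frequency of each observed value among the M samples. The sketch below shows this for discrete alert counts; the sample values and the function name are hypothetical.

```python
from collections import Counter


def empirical_distribution(samples):
    """Map each observed value to its relative frequency in `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    m = len(samples)
    return {value: count / m for value, count in counts.items()}


# Hypothetical alert counts recorded over eight time-steps.
alerts = [0, 0, 1, 3, 0, 1, 0, 7]
print(empirical_distribution(alerts))
```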

34/35
Comparison with State-of-the-art Intrusion-Tolerant Systems
[Figure: comparison between our optimal control strategies (TOLERANCE) and the baselines no-recovery, periodic, and periodic-adaptive. Columns: average availability $T^{(A)}$, average time-to-recovery $T^{(R)}$ (log scale), and average recovery frequency $F^{(R)}$. The x-axes indicate values of the maximum time-to-recovery $\Delta_R$ (15, 25, ∞); the rows relate to the number of initial nodes $N_1$ (3, 6, 9).]

35/35
Conclusion
- We present a control-theoretic model of intrusion tolerance.
- We establish structural results.
- We evaluate the optimal control strategies on a testbed.
- Our control-theoretic strategies have stronger theoretical guarantees and significantly better practical performance than the heuristic control strategies used in state-of-the-art intrusion-tolerant systems.
