1/35
Intrusion Tolerance for Networked Systems
through Two-Level Feedback Control
IEEE DSN 2024, Brisbane, Australia
International Conference on Dependable Systems and Networks
Kim Hammar and Rolf Stadler
kimham@kth.se and stadler@kth.se
KTH Royal Institute of Technology
June 27, 2024
2/35
Use Case: Intrusion Tolerance
[Figure: replicated system. Labels: clients, API gateways, compute nodes, storage nodes, service replicas 1 to 4, client interface and load balancer.]
- A replicated system offers a service to a client population.
- The system should provide service without disruption.
2/35
Use Case: Intrusion Tolerance
[Figure: the same replicated system with an attacker added next to the clients. Labels: attacker, clients, API gateways, compute nodes, storage nodes, service replicas 1 to 4, client interface and load balancer.]
- An attacker seeks to intrude on the system and disrupt service.
- The system should tolerate intrusions.
3/35
Intrusion Tolerance (Simplified)
[Figure: system performance over time around an intrusion. Labels: intrusion event, time of full recovery, recovery time, normal performance, survivability, tolerance, loss, cumulative performance loss (want to minimize).]
4/35
Increasing Demand for Intrusion-Tolerant Systems
- As our reliance on online services grows, there is an increasing demand for intrusion-tolerant systems.
- Example applications:
  - Power grids, e.g., SCADA systems [1].
  - Safety-critical IT systems, e.g., banking systems, e-commerce applications [2], healthcare systems, etc.
  - Real-time control systems, e.g., flight control computers [3].
[Figure labels: flight control computer, sensors and actuators.]
[1] Amy Babay et al. “Network-Attack-Resilient Intrusion-Tolerant SCADA for the Power Grid”. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 2018, pp. 255–266. doi: 10.1109/DSN.2018.00036.
[2] Jukka Soikkeli et al. “Redundancy Planning for Cost Efficient Resilience to Cyber Attacks”. In: IEEE Transactions on Dependable and Secure Computing 20.2 (2023), pp. 1154–1168. doi: 10.1109/TDSC.2022.3151462.
[3] J.H. Wensley et al. “SIFT: Design and analysis of a fault-tolerant computer for aircraft control”. In: Proceedings of the IEEE 66.10 (1978), pp. 1240–1255. doi: 10.1109/PROC.1978.11114.
5/35
Theoretical Foundations of Intrusion Tolerance
[Figure: diagram relating intrusion-tolerant systems to three fields: reliability theory, distributed systems, and cryptography. Researchers named: L. Lamport, W. Weibull, R.E. Barlow, W. Shewhart, L. Adleman, A. Shamir, R. Rivest, D. Dolev, N. Lynch, E. Dijkstra, J. Gray, B. Liskov.]
6/35
Our Contribution
[Figure: diagram relating intrusion-tolerant systems to four fields: reliability theory, distributed systems, cryptography, and control and decision theory. The label "This paper" marks the contribution.]
7/35
Building Blocks of An Intrusion-Tolerant System
[Figure: three panels. (a) Node state diagram with states Healthy (H), Compromised (C), and Crashed (∅), and transitions labeled compromise, recovery, and crash. (b) Reliability versus time t for 25, 50, and 100 replicas. (c) Replicated system of replicas 1 to 5 running a consensus protocol behind a client interface that receives requests and returns responses.]
1. Intrusion-tolerant consensus protocol: a quorum needs to reach agreement to tolerate f compromised replicas.
2. Replication strategy: cost-reliability trade-off.
3. Recovery strategy: compromises will occur as t → ∞.
8/35
Prior Work on Intrusion-Tolerant Systems
- Published 1995: fixed number of replicas; no recoveries.
- Published 1998: fixed number of replicas; no recoveries.
- Published 2002: fixed number of replicas; periodic recoveries.
- Published 2004: fixed number of replicas; periodic recoveries.
- Published 2006: adaptive replication based on heuristics; periodic recoveries.
- Published 2006: fixed number of replicas; periodic recoveries.
- Published 2007: fixed number of replicas; supports both periodic and reactive recoveries, but does not provide reactive recovery strategies.
- Published 2011: fixed number of replicas; periodic recoveries.
- Published 2018: fixed number of replicas; periodic recoveries.
- Published 2023: fixed number of replicas; periodic recoveries.
Can we do better by leveraging decision theory and optimal control?
9/35
The TOLERANCE Architecture
Two-level recovery and replication control with feedback.
[Figure: the TOLERANCE architecture. Each node 1, 2, ..., Nt is layered as hardware, virtualization layer, and two domains: an application domain that runs a service replica and a privileged domain that runs a node controller. Labels on each node controller: IDS alerts, recovery. The replicas are connected by a consensus protocol. Between each node and the system controller: state estimate, evict or add. Clients send service requests and receive responses; the attacker makes intrusion attempts.]
10/35
Definition 1 (Correct service)
The system provides correct service if the healthy replicas satisfy the following properties:
- Liveness: each request is eventually executed.
- Validity: each executed request was sent by a client.
- Safety: each replica executes the same request sequence.
11/35
Proposition 1 (Correctness of TOLERANCE)
A system that implements the TOLERANCE architecture provides correct service if
- Network links are authenticated.
- At most $f$ nodes are compromised or crashed simultaneously.
- $N_t \geq 2f + 1$.
- The system is partially synchronous.
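The replication condition of Proposition 1 can be checked with a few lines of code. The sketch below is only an illustration; the function names are hypothetical and do not come from the talk.

```python
def max_tolerable_failures(num_nodes: int) -> int:
    """Largest f that satisfies num_nodes >= 2f + 1."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    return (num_nodes - 1) // 2


def replication_condition_holds(num_nodes: int, num_failed: int) -> bool:
    """True if num_nodes >= 2 * num_failed + 1 (third condition of Proposition 1)."""
    return num_nodes >= 2 * num_failed + 1
```

For example, under this condition three nodes tolerate one compromised or crashed node, and adding a fourth node does not raise that number.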
12/35
Intrusion Tolerance as a Two-Level Control Problem
[Figure: two-level control. Node controllers 1, ..., Nt apply the strategies π1(b1), π2(b2), ..., πNt(bNt) to the replicated system and transmit their beliefs b1, ..., bNt to the system controller, which applies the strategy π(b1, ..., bNt). Clients and an attacker interact with the replicated system.]
- The local level models intrusion recovery.
- The global level models replication control.
12/35
Assumption 1
The probability that the system controller fails is negligible.
Assumption 2
Compromise and crash events are statistically independent across
nodes.
13/35
The Local Control Problem
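- Partially observed Markov decision process $\Gamma_i$.
- Controller actions: (R)ecover and (W)ait, $a_{i,t} \in \{R, W\}$.
- Node states: $\mathcal{S}_N = \{\text{(H)ealthy}, \text{(C)ompromised}, \emptyset\}$, where $\emptyset$ denotes the crashed state; $s_{i,t} \in \mathcal{S}_N$.
- State transition function: $f(s_{i,t+1} \mid s_{i,t}, a_{i,t})$.
- $p_{C,i}$: crash probability; $p_{A,i}$: intrusion probability.
- Observation $o_{i,t} \sim z_i(\cdot \mid s_{i,t})$, e.g., IDS alerts at time $t$.
[Figure: state transition diagram with states Healthy (H), Compromised (C), and Crashed (∅). Edge labels: $p_{C,i}$ on the transitions from H and from C to ∅; $a^{(C)}_i = R$ on the transition from C to H; $a^{(A)}_i = A$ on the transition from H to C.]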
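To make the state transition function concrete, here is a minimal sketch of a transition matrix per controller action. The slide does not give the transition probabilities in full, so the exact rows are assumptions: a crash happens with probability `p_c` from either live state, a healthy node is compromised with probability `p_a` if it does not crash, recovery returns a compromised node to healthy, and the crashed state is absorbing.

```python
H, C, CRASHED = 0, 1, 2  # healthy, compromised, crashed (the state written as ∅)


def transition_matrix(action: str, p_a: float, p_c: float):
    """Return rows f(. | s, action) for s in (H, C, CRASHED). Assumed dynamics."""
    if action not in ("W", "R"):
        raise ValueError("action must be 'W' (wait) or 'R' (recover)")
    # From healthy: stay healthy, become compromised, or crash.
    healthy_row = [(1 - p_c) * (1 - p_a), (1 - p_c) * p_a, p_c]
    if action == "R":
        # Assumption: recovery moves a compromised node back to healthy.
        compromised_row = [1 - p_c, 0.0, p_c]
    else:
        # Waiting leaves a compromised node compromised unless it crashes.
        compromised_row = [0.0, 1 - p_c, p_c]
    crashed_row = [0.0, 0.0, 1.0]  # assumption: crashed is absorbing
    return [healthy_row, compromised_row, crashed_row]
```

Each row is a probability distribution over the next state, which is the property a POMDP solver would rely on.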
14/35
Node Controller Strategy
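- The controller computes the belief
  $b_{i,t} \triangleq \mathbb{P}[S_{i,t} = C \mid h_t]$, where
  $h_t \triangleq (b_{i,1}, a_{i,1}, o_{i,2}, a_{i,2}, o_{i,3}, \ldots, a_{i,t-1}, o_{i,t})$ is the history.
- Controller strategy: $\pi : [0, 1] \to \{W, R\}$.
[Figure: the controller selects its action based on the belief.]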
Node Controller Objective
- Cost: $J_i \triangleq \eta T^{(R)}_i + F^{(R)}_i$.
- $T^{(R)}_i$ is the average time-to-recovery.
- $F^{(R)}_i$ is the recovery frequency.
- $\eta > 1$ is a scaling factor.
- Bounded-time-to-recovery constraint: the time between two recoveries can be at most $\Delta_R$.
[Figure: failure (crash or compromise) probability versus time t for p = 0.1, 0.05, 0.025, 0.01, and 0.005, where p is the failure probability per time-step.]
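The growth of the failure probability over time motivates the bounded-time-to-recovery constraint. As a small sketch, assuming failures occur independently in each time-step with probability p (an assumption; the slide does not state the formula behind the plot), the probability that a node has failed by time t is 1 - (1 - p)^t:

```python
def failure_probability(p: float, t: int) -> float:
    """Probability of at least one failure within t steps, assuming i.i.d. steps."""
    if not 0.0 <= p <= 1.0 or t < 0:
        raise ValueError("require 0 <= p <= 1 and t >= 0")
    return 1.0 - (1.0 - p) ** t
```

Under this assumption even a small per-step probability accumulates: with p = 0.01 the failure probability exceeds one half after 69 steps.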
16/35
Threshold Structure of the Optimal Control Strategy
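[Figure: the controller's optimal cost function $\mathbb{E}[J_i \mid b_{i,1}]$ as a function of the initial belief $b_{i,1}$, shown together with the alpha vectors. The threshold $\alpha^\star$ separates the wait region from the recovery region.]
Theorem 2
There exists an optimal control strategy that satisfies
$$\pi^\star_{i,t}(b_{i,t}) = R \iff b_{i,t} \geq \alpha^\star_{i,t} \quad \forall t,$$
where $\alpha^\star_{i,t} \in [0, 1]$ is a threshold.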
17/35
Efficient Computation of Optimal Recovery Strategies
Algorithm 1: Threshold Optimization
Input: objective function $J_i$, parametric optimizer PO.
Output: an approximately optimal control strategy $\hat{\pi}_{i,\theta}$.
Algorithm:
  $\Theta \leftarrow [0, 1]$.
  For each $\theta \in \Theta$, define $\pi_{i,\theta}(b_{i,t}) \triangleq R$ if $b_{i,t} \geq \theta$, and $W$ otherwise.
  $J_\theta \leftarrow \mathbb{E}_{\pi_{i,\theta}}[J_i]$.
  $\hat{\pi}_{i,\theta} \leftarrow \mathrm{PO}(\Theta, J_\theta)$.
  return $\hat{\pi}_{i,\theta}$.
- Examples of parametric optimization algorithms: CEM, BO, CMA-ES, DE, SPSA, etc.
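The idea of Algorithm 1 can be sketched with a toy simulator. Everything below is an assumption made for illustration: the node model (binary alerts, intrusion probability `p_a`, no crashes), the parameter values, and the use of a plain grid search in place of the parametric optimizers named on the slide (CEM, BO, and so on). The cost estimate follows the slide's form, η times the average time-to-recovery plus the recovery frequency.

```python
import random


def simulate_cost(theta, p_a=0.1, eta=2.0, delta_r=20, horizon=2000, seed=0):
    """Monte Carlo estimate of J = eta * T_R + F_R for the threshold strategy theta."""
    rng = random.Random(seed)       # fixed seed: all thresholds see the same randomness
    alert_h, alert_c = 0.2, 0.7     # hypothetical P(alert | healthy), P(alert | compromised)
    compromised, belief = False, 0.0
    steps_since_recovery = recoveries = compromised_steps = intrusions = 0
    for _ in range(horizon):
        # Threshold strategy, with a forced recovery after delta_r steps.
        if belief >= theta or steps_since_recovery >= delta_r:
            recoveries += 1
            compromised, belief, steps_since_recovery = False, 0.0, 0
        # A healthy node is compromised with probability p_a.
        if not compromised and rng.random() < p_a:
            compromised = True
            intrusions += 1
        if compromised:
            compromised_steps += 1
        steps_since_recovery += 1
        # Observe an alert and update the belief with Bayes' rule.
        alert = rng.random() < (alert_c if compromised else alert_h)
        pred = belief + (1.0 - belief) * p_a
        like_c = alert_c if alert else 1.0 - alert_c
        like_h = alert_h if alert else 1.0 - alert_h
        belief = pred * like_c / (pred * like_c + (1.0 - pred) * like_h)
    avg_time_to_recovery = compromised_steps / max(intrusions, 1)
    recovery_frequency = recoveries / horizon
    return eta * avg_time_to_recovery + recovery_frequency


def optimize_threshold(candidates):
    """Grid search over thresholds; a stand-in for the parametric optimizer PO."""
    costs = {theta: simulate_cost(theta) for theta in candidates}
    best = min(costs, key=costs.get)
    return best, costs[best]


grid = [k / 10 for k in range(11)]
best_theta, best_cost = optimize_threshold(grid)
```

Because the strategy is described by a single number, the search space is the interval [0, 1] regardless of how long the planning horizon is, which is what makes black-box optimizers practical here.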
18/35
Efficient Computation of Optimal Recovery Strategies
[Figure: mean compute time in minutes (log scale) to obtain an optimal recovery strategy for different values of the bounded-time-to-recovery constraint $\Delta_R$ (5 to 25), comparing CEM, DE, BO, SPSA, and dynamic programming.]
19/35
The Benefit of Optimal Recovery Control
[Figure: benefit of optimal recovery. Cost $J_i$ of the optimal strategy and of the periodic strategy as a function of $D_{\mathrm{KL}}(\text{no intrusion} \parallel \text{intrusion})$.]
Key insight: Optimal recovery control can significantly reduce operational cost given that an intrusion detection model is available.
20/35
Intrusion Tolerance as a Two-Level Control Problem
[Figure: the two-level control diagram with node controllers π1(b1), ..., πNt(bNt), belief transmissions b1, ..., bNt, and the system controller π(b1, ..., bNt).]
21/35
The Global Control Problem
- Constrained Markov decision process $\Gamma$.
- States: $\mathcal{S}_S = \{0, 1, \ldots, s_{\max}\}$, the number of healthy nodes.
- Controller actions: add $a^{(C)}_t \in \{0, 1\}$ nodes.
- Dynamics $f$: depend on the local nodes.
- Markov strategy: $\pi : \mathcal{S}_S \to \{0, 1\}$.
22/35
System Controller Objective
- Cost: $J \triangleq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} a_t$.
- Constraint: $T^{(A)} \geq \epsilon_A$, where $T^{(A)}$ is the availability.
$\epsilon_A$ | Allowed service downtime per year
0.9 | 36 days
0.95 | 18 days
0.99 | 3 days
0.999 | 8 hours
0.9999 | 52 minutes
0.99999 | 5 minutes
1 | 0 minutes
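The downtime column follows from the availability target by simple arithmetic: the allowed downtime is the fraction of the year during which the service may be unavailable. A minimal sketch, assuming a 365-day year (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year the service may be down while meeting the availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR
```

For an availability of 0.999 this gives about 8.76 hours, and for 0.9 it gives 876 hours, or 36.5 days, which matches the table after rounding down.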
23/35
System Reliability Analysis
- The mean time to failure (MTTF) is the mean hitting time of a state where $s_t \leq f$:
$$\mathbb{E}[T^{(F)} \mid S_1 = s_1] = \mathbb{E}_{(S_t)_{t \geq 1}}\left[\inf\{t \geq 1 \mid S_t \leq f\} \mid S_1 = s_1\right].$$
[Figure: the MTTF $\mathbb{E}[T^{(F)}]$ as a function of the number of initial nodes $N_1$ (10 to 100) and the failure probability per node $p_i \in \{0.05, 0.025, 0.01\}$.]
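A mean hitting time of this kind can be computed by first-step analysis. The sketch below makes simplifying assumptions that are not taken from the talk: every healthy node fails independently with the same probability p in each time-step, there are no recoveries, and no nodes are added. The number of healthy nodes then only decreases, so the linear equations can be solved state by state.

```python
from math import comb


def expected_steps_to_failure(n_initial: int, f: int, p: float) -> float:
    """Expected number of time-steps until at most f healthy nodes remain."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # h[s] = expected remaining steps from s healthy nodes; h[s] = 0 for s <= f.
    h = [0.0] * (n_initial + 1)
    for s in range(f + 1, n_initial + 1):
        # P(s -> k): exactly k of the s nodes survive the step.
        prob = [comb(s, k) * (1 - p) ** k * p ** (s - k) for k in range(s + 1)]
        # h[s] = 1 + sum_k P(s -> k) h[k], solved for h[s].
        h[s] = (1.0 + sum(prob[k] * h[k] for k in range(f + 1, s))) / (1.0 - prob[s])
    return h[n_initial]
```

The function counts transitions; because the slide's hitting time starts counting at t = 1, the two differ by one when the initial state is above f.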
24/35
Theorem 3 (Optimal Control Strategy Existence)
Assume that
(A) the Markov chain induced by any control strategy is unichain, and
(B) the availability constraint is feasible.
Then the following hold.
1. There exists an optimal stationary replication control strategy.
2. The optimal strategy has a threshold structure.
3. An optimal replication control strategy can be computed by
using linear programming.
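One standard way to solve a constrained Markov decision process by linear programming is through occupancy measures. The formulation below is a sketch of that textbook approach written in the notation of the slides, not necessarily the exact program used by the authors. The occupancy measure $\rho(s, a)$ and the symbol $\epsilon_A$ for the minimum availability required by the constraint are notation introduced here; the service is taken to be available when more than $f$ nodes are healthy.

```latex
\begin{aligned}
\min_{\rho}\quad & \sum_{s \in \mathcal{S}_S} \sum_{a \in \{0,1\}} a\,\rho(s,a) \\
\text{s.t.}\quad & \rho(s,a) \geq 0 \quad \forall s \in \mathcal{S}_S,\ a \in \{0,1\}, \\
& \sum_{s \in \mathcal{S}_S} \sum_{a \in \{0,1\}} \rho(s,a) = 1, \\
& \sum_{a \in \{0,1\}} \rho(s',a) = \sum_{s \in \mathcal{S}_S} \sum_{a \in \{0,1\}} \rho(s,a)\, f(s' \mid s,a) \quad \forall s' \in \mathcal{S}_S, \\
& \sum_{s \geq f+1} \sum_{a \in \{0,1\}} \rho(s,a) \geq \epsilon_A .
\end{aligned}
```

Given a solution $\rho^\star$, a stationary strategy is obtained as $\pi(a \mid s) = \rho^\star(s,a) / \sum_{a'} \rho^\star(s,a')$ for every state with positive occupancy. Here $f(s' \mid s, a)$ denotes the dynamics and the plain $f$ in the last constraint denotes the number of tolerated failures, as on the slides.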
25/35
Efficient Computation of Optimal Replication Control Strategies
[Figure: mean compute time in seconds (log scale) to obtain an optimal replication control strategy as a function of the maximum number of nodes $s_{\max}$, from 4 to 2048.]
26/35
The Benefit of Optimal Replication Control
[Figure: benefit of optimal replication control. Availability over time t (200 to 1,000) for the optimal strategy and for $N_1 = 10$ and $N_1 = 100$.]
Key insight: Optimal replication control can guarantee a high service availability in expectation. The benefit of optimal replication is most pronounced for long-running systems.
27/35
Summary of the Control-Theoretic Model
- Intrusion recovery control.
  - Partially observed Markov decision process.
  - Threshold structure of optimal control strategies.
  - Efficient computation through stochastic approximation.
- Replication control.
  - Constrained Markov decision process.
  - Threshold structure of optimal control strategies.
  - Efficient computation through linear programming.
28/35
Experiment Setup - Testbed
29/35
The TOLERANCE Architecture
Two-level recovery and replication control with feedback.
[Figure: the TOLERANCE architecture with nodes 1, 2, ..., Nt, their node controllers, the consensus protocol, and the system controller, as on slide 9/35.]
- A replicated web service that offers two operations:
  - A read operation that returns the service state.
  - A write operation that updates the state.
30/35
Intrusion-Tolerant Consensus Protocol (MinBFT)
[Figure: six message sequence diagrams, each with replicas 1 to 3.
a) Normal operation: the client and replica 1 (leader); messages request, prepare, commit, reply.
b) View change: the leader of view v crashes; messages request, view-change, new-view; a new leader for view v + 1.
c) Checkpoint: messages checkpoint.
d) State transfer: the controller and replica 1 (compromised); messages recover, request, state.
e) Join: a new replica; messages join-request, join, new-view, join-reply; leaders of views v and v + 1.
f) Evict: the system controller; messages evict-request, evict, new-view, exit-reply; leaders of views v and v + 1.]
31/35
Experiment Setup - Emulated Intrusions
Replica ID | Intrusion steps
1 | TCP SYN scan, FTP brute force
2 | TCP SYN scan, SSH brute force
3 | TCP SYN scan, Telnet brute force
4 | ICMP scan, exploit of CVE-2017-7494
5 | ICMP scan, exploit of CVE-2014-6271
6 | ICMP scan, exploit of CWE-89 on DVWA
7 | ICMP scan, exploit of CVE-2015-3306
8 | ICMP scan, exploit of CVE-2016-10033
9 | ICMP scan, SSH brute force, exploit of CVE-2010-0426
10 | ICMP scan, SSH brute force, exploit of CVE-2015-5602
Table 1: Intrusion steps.
32/35
Experiment Setup - Background Traffic
Background services | Replica ID(s)
FTP, SSH, MongoDB, HTTP, TeamSpeak | 1
SSH, DNS, HTTP | 2
SSH, Telnet, HTTP | 3
SSH, Samba, NTP | 4
SSH | 5, 7, 8, 10
DVWA, IRC, SSH | 6
TeamSpeak, HTTP, SSH | 9
Table 2: Background services.
33/35
Estimated Distributions of Intrusion Alerts
[Figure: estimated distributions $\hat{z}_i(o_i \mid a^{(A)}_i)$ of the intrusion alerts $o_i \in \mathcal{O}$ during an attack ($a^{(A)}_i = A$) and for false alarms ($a^{(A)}_i = F$). Panels: CVE-2010-0426, CVE-2015-3306, CVE-2015-5602, CVE-2016-10033, CWE-89, CVE-2017-7494, CVE-2014-6271, and FTP, SSH, Telnet brute force. The x-axes range from 0 to 8000, except the last panel, which ranges from 0 to 20000.]
- We estimate the observation distribution $z$ with the empirical distribution $\hat{z}$ based on $M$ samples.
- $\hat{z} \to z$ almost surely as $M \to \infty$ (Glivenko-Cantelli theorem).
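The empirical distribution is the relative frequency of each observed value. A minimal sketch, assuming the samples are alert counts collected from the testbed (the function name and the sample values in the usage note are hypothetical):

```python
from collections import Counter


def empirical_distribution(samples):
    """Map each observed value to its relative frequency among the M samples."""
    total = len(samples)
    if total == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    return {obs: count / total for obs, count in counts.items()}
```

For instance, the four samples 1, 1, 2, 3 give the estimate 0.5 for the value 1 and 0.25 each for the values 2 and 3. In practice one estimate would be computed per condition (attack and no attack) and per replica.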
34/35
Comparison with State-of-the-art Intrusion-Tolerant Systems
[Figure: comparison between our optimal control strategies (TOLERANCE) and the baselines NO-RECOVERY, PERIODIC, and PERIODIC-ADAPTIVE. Columns show the average availability $T^{(A)}$, the average time-to-recovery $T^{(R)}$ (log scale), and the average recovery frequency $F^{(R)}$; x-axes indicate values of the maximum time-to-recovery $\Delta_R \in \{15, 25, \infty\}$; rows relate to the number of initial nodes $N_1 \in \{3, 6, 9\}$.]
35/35
Conclusion
- We present a control-theoretic model of intrusion tolerance.
- We establish structural results.
- We evaluate the optimal control strategies on a testbed.
- Our control-theoretic strategies have stronger theoretical guarantees and significantly better practical performance than the heuristic control strategies used in state-of-the-art intrusion-tolerant systems.

More Related Content

PDF
Intrusion Tolerance as a Two-Level Game (Visit to Melbourne University)
PDF
Intrusion Tolerance as a Two-Level Game - GameSec24
PDF
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
PDF
G017124045
PDF
Recovery of Data in Cluster Computing By Using Fault Tolerant Mechanisms
PDF
Checkpointing and Rollback Recovery Algorithms for Fault Tolerance in MANETs:...
PDF
Adaptive check-pointing and replication strategy to tolerate faults in comput...
PDF
E01113138
Intrusion Tolerance as a Two-Level Game (Visit to Melbourne University)
Intrusion Tolerance as a Two-Level Game - GameSec24
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
G017124045
Recovery of Data in Cluster Computing By Using Fault Tolerant Mechanisms
Checkpointing and Rollback Recovery Algorithms for Fault Tolerance in MANETs:...
Adaptive check-pointing and replication strategy to tolerate faults in comput...
E01113138

Similar to Intrusion Tolerance for Networked Systems through Two-Level Feedback Control (20)

PDF
H04553942
PPTX
Operating system.assig.ppt gokgfchvhj;;hhjcghfxgch
PDF
On Resilient Computing
PDF
Fault tolerance
PDF
Optimal Security Response to Network Intrusions in IT Systems
PPTX
Vulnerabilities of control system
PDF
50120130406041 2
PDF
Novel Perspectives in Construction of Recovery Oriented Computing
PDF
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
PDF
IEEE Networking 2016 Title and Abstract
PDF
Recovery in Distributed operating system
PDF
Dissertation Proposal Abstract
PDF
IEEE 2015 NS2 Projects
PDF
RESILIENT VOTING MECHANISMS FOR MISSION SURVIVABILITY IN CYBERSPACE: COMBININ...
PDF
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
PDF
Proposed Algorithm for Surveillance Applications
PDF
CS9222 ADVANCED OPERATING SYSTEMS
PDF
Exploring Fault Tolerance Strategies in Big Data Infrastructures and Their Im...
PPT
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
PDF
Voting protocol
H04553942
Operating system.assig.ppt gokgfchvhj;;hhjcghfxgch
On Resilient Computing
Fault tolerance
Optimal Security Response to Network Intrusions in IT Systems
Vulnerabilities of control system
50120130406041 2
Novel Perspectives in Construction of Recovery Oriented Computing
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
IEEE Networking 2016 Title and Abstract
Recovery in Distributed operating system
Dissertation Proposal Abstract
IEEE 2015 NS2 Projects
RESILIENT VOTING MECHANISMS FOR MISSION SURVIVABILITY IN CYBERSPACE: COMBININ...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Proposed Algorithm for Surveillance Applications
CS9222 ADVANCED OPERATING SYSTEMS
Exploring Fault Tolerance Strategies in Big Data Infrastructures and Their Im...
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
Voting protocol
Ad

More from Kim Hammar (20)

PDF
Approximation in Value Space using Aggregation, with Applications to POMDPs a...
PDF
Adaptive Security Policies via Belief Aggregation and Rollout
PDF
Automated Intrusion Response - CDIS Spring Conference 2024
PDF
Automated Security Response through Online Learning with Adaptive Con jectures
PDF
Självlärande System för Cybersäkerhet. KTH
PDF
Learning Automated Intrusion Response
PDF
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
PDF
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
PDF
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
PDF
Learning Optimal Intrusion Responses via Decomposition
PDF
Digital Twins for Security Automation
PDF
Självlärande system för cyberförsvar.
PDF
Intrusion Response through Optimal Stopping
PDF
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
PDF
Self-Learning Systems for Cyber Defense
PDF
Self-learning Intrusion Prevention Systems.
PDF
Learning Security Strategies through Game Play and Optimal Stopping
PDF
Intrusion Prevention through Optimal Stopping
PDF
Intrusion Prevention through Optimal Stopping and Self-Play
PDF
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Approximation in Value Space using Aggregation, with Applications to POMDPs a...
Adaptive Security Policies via Belief Aggregation and Rollout
Automated Intrusion Response - CDIS Spring Conference 2024
Automated Security Response through Online Learning with Adaptive Con jectures
Självlärande System för Cybersäkerhet. KTH
Learning Automated Intrusion Response
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Optimal Intrusion Responses via Decomposition
Digital Twins for Security Automation
Självlärande system för cyberförsvar.
Intrusion Response through Optimal Stopping
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
Self-Learning Systems for Cyber Defense
Self-learning Intrusion Prevention Systems.
Learning Security Strategies through Game Play and Optimal Stopping
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping and Self-Play
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Ad

Recently uploaded (20)

PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Approach and Philosophy of On baking technology
PDF
Encapsulation theory and applications.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Cloud computing and distributed systems.
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Network Security Unit 5.pdf for BCA BBA.
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
Spectroscopy.pptx food analysis technology
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Spectral efficient network and resource selection model in 5G networks
Approach and Philosophy of On baking technology
Encapsulation theory and applications.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Per capita expenditure prediction using model stacking based on satellite ima...
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Cloud computing and distributed systems.
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Encapsulation_ Review paper, used for researhc scholars
Review of recent advances in non-invasive hemoglobin estimation
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
NewMind AI Weekly Chronicles - August'25 Week I
Network Security Unit 5.pdf for BCA BBA.
The AUB Centre for AI in Media Proposal.docx
MIND Revenue Release Quarter 2 2025 Press Release
Spectroscopy.pptx food analysis technology
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Build a system with the filesystem maintained by OSTree @ COSCUP 2025

Intrusion Tolerance for Networked Systems through Two-Level Feedback Control

  • 1. 1/35 Intrusion Tolerance for Networked Systems through Two-Level Feedback Control IEEE DSN 2024, Brisbane, Australia International Conference on Dependable Systems and Networks Kim Hammar and Rolf Stadler kimham@kth.se and stadler@kth.se KTH Royal Institute of Technology June 27, 2024
  • 2. 2/35 Use Case: Intrusion Tolerance . . . Clients api gateways Compute nodes Storage nodes Service replica 1 Service replica 2 Service replica 3 Service replica 4 Client interface & load balancer I A replicated system offers a service to a client population. I The system should provide service without disruption.
  • 3. 2/35 Use Case: Intrusion Tolerance . . . Attacker Clients api gateways Compute nodes Storage nodes Service replica 1 Service replica 2 Service replica 3 Service replica 4 Client interface & load balancer I An attacker seeks to intrude on the system and disrupt service. I The system should tolerate intrusions.
  • 4. 3/35 Intrusion Tolerance (Simplified) Intrusion event Time of full recovery Time Recovery time Survivability Loss Normal performance System performance Tolerance Cumulative performance loss (want to minimize)
  • 5. 4/35 Increasing Demand for Intrusion-Tolerant Systems I As our reliance on online services grows, there is an increasing demand for intrusion-tolerant systems. I Example applications: Flight control computer Sensors and actuators Power grids e.g., scada systems1. Safety-critical IT systems e.g., banking systems, e-commerce applications2, healthcare systems, etc. Real-time control systems e.g., flight control computer3. 1 Amy Babay et al. “Network-Attack-Resilient Intrusion-Tolerant SCADA for the Power Grid”. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 2018, pp. 255–266. doi: 10.1109/DSN.2018.00036. 2 Jukka Soikkeli et al. “Redundancy Planning for Cost Efficient Resilience to Cyber Attacks”. In: IEEE Transactions on Dependable and Secure Computing 20.2 (2023), pp. 1154–1168. doi: 10.1109/TDSC.2022.3151462. 3 J.H. Wensley et al. “SIFT: Design and analysis of a fault-tolerant computer for aircraft control”. In: Proceedings of the IEEE 66.10 (1978), pp. 1240–1255. doi: 10.1109/PROC.1978.11114.
  • 6. 5/35 Theoretical Foundations of Intrusion Tolerance L. Lamport W. Weibull R.E. Barlow W.Shewhart L. Adleman A. Shamir R. Rivest D. Dolev N. Lynch E. Dijkstra J. Gray B. Liskov Reliabil i t y T h e o r y Distributed Systems C r y p tography Intrusion- tolerant systems
  • 8. 7/35 Building Blocks of An Intrusion-Tolerant System H C ∅ Crashed Healthy Compromised Crash Crash Recovery Compromise 20 40 60 80 100 0.2 0.4 0.6 0.8 1 25 replicas 50 replicas 100 replicas t Reliability . . . Replicated system Client interface Request Response Consensus protocol 1 2 3 4 5 1. Intrusion-tolerant consensus protocol A quorum needs to reach agreement to tolerate f compromised replicas. 2. Replication strategy Cost-reliability trade-off. 3. Recovery strategy Compromises will occur as t → ∞.
  • 9. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - No recoveries Published 1995
  • 10. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - No recoveries Published 1998
  • 11. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2002
  • 12. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2004
  • 13. 8/35 Prior Work on Intrusion-Tolerant Systems - Adaptive replication based on heuristics - Periodic recoveries Published 2006
  • 14. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2006
  • 15. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Supports both periodic and reactive recoveries - Does not provide reactive recovery strategies Published 2007
  • 16. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2011
  • 17. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2018
  • 18. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2023
  • 19. 8/35 Prior Work on Intrusion-Tolerant Systems - Fixed number of replicas - Periodic recoveries Published 2023 Can we do better by leveraging decision theory and optimal control?
  • 20. 9/35 The tolerance Architecture Two-level recovery and replication control with feedback. tolerance Node 1 Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware Node 2 Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware Node Nt Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware . . . Consensus protocol System controller State estimate Evict or add State estimate Evict or add State estimate Evict or add . . . Service requests Responses Clients Attacker Intrusion attempts
  • 21. 10/35 Definition 1 (Correct service) The system provides correct service if the healthy replicas satisfy the following properties: Each request is eventually executed. (Liveness) Each executed request was sent by a client. (Validity) Each replica executes the same request sequence. (Safety)
  • 22. 11/35 Proposition 1 (Correctness of tolerance) A system that implements the tolerance architecture provides correct service if Network links are authenticated. At most f nodes are compromised or crashed simultaneously. Nt ≥ 2f + 1. The system is partially synchronous.
  • 23. 12/35 Intrusion Tolerance as a Two-Level Control Problem . . . π1(b1) π2(b2) π3(b3) π4(b4) πNt (bNt ) Belief transmissions Node controllers Replicated system System controller π(b1, . . . , bNt ) b1 b2 b3 b4 bNt . . . Attacker Clients I The local level models intrusion recovery. I The global level models replication control.
  • 24. 12/35 Assumption 1 The probability that the system controller fails is negligible. Assumption 2 Compromise and crash events are statistically independent across nodes. . . . π1(b1) π2(b2) π3(b3) π4(b4) πNt (bNt ) Belief transmissions Node controllers Replicated system System controller π(b1, . . . , bNt ) b1 b2 b3 b4 bNt . . . Attacker Clients
  • 25. 13/35 The Local Control Problem I Partially observed Markov decision process Γi . I Controller actions: (R)ecover and (W)ait. ai,t ∈ {R, W}. I Node states: SN = {(H)ealthy, (C)ompromised, ∅}. si,t ∈ SN. I State transition function: f (si,t | si,t, ai,t). I pC,i : crash probability, pA,i : intrusion probability. I Observation oi,t ∼ zi (·|si,t): e.g., ids alerts at time t. H C ∅ Crashed Healthy Compromised pC,i pC,i a (C) i = R a (A) i = A
  • 26. 14/35 Node Controller Strategy I The controller computes the belief bi,t(s) , P[Si,t = C|ht]. ht , (bi,1, ai,1, oi,2, ai,2, oi,3, . . . , ai,t−1, oi,t). I Controller strategy: π : [0, 1] → {W, R}. Controller Belief
  • 27. 15/35 Node Controller Objective I Cost: Ji , ηT (R) i + F (R) i . I T (R) i is the average time-to-recovery. I F (R) i is the recovery frequency. I η > 1 is a scaling factor. I Bounded-time-to-recovery constraint: The time between two recoveries can be at most ∆R. 10 20 30 40 50 60 70 80 90 100 0.5 1 p = 0.1 p = 0.05 p = 0.025 p = 0.01 p = 0.005 t Failure (crash or compromise) probability. p is the failure probability per time-step.
  • 28. 16/35 Threshold Structure of the Optimal Control Strategy 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.4 0.6 0.8 1 alpha vectors E[Ji | bi,1] bi,1 α? wait region recovery region The controller’s optimal cost function. Theorem 2 There exists an optimal control strategy that satisfies π? i,t(bi,t) = R ⇐⇒ bi,t ≥ α? i,t ∀t, where α? i,t ∈ [0, 1] is a threshold.
  • 29. 17/35 Efficient Computation of Optimal Recovery Strategies Algorithm 1: Threshold Optimization 1 Input: Objective function Ji , parametric optimizer po. 2 Output: An approximate optimal control strategy π̂i,θ. 3 Algorithm 4 Θ ← [0, 1]. 5 For each θ ∈ Θ, define πi,θ(bi,t) as 6 πi,θ(bi,t) , ( R if bi,t ≥ θ W otherwise. 7 Jθ ← Eπi,θ [Ji ]. 8 π̂i,θ ← po(Θ, Jθ). 9 return π̂i,θ. I Examples of parameteric optimization algorithmns: cem, bo, cma-es, de, spsa, etc.
  • 30. 18/35 Efficient Computation of Optimal Recovery Strategies [Figure: mean compute time in minutes (log scale) to obtain an optimal recovery strategy for different values of the bounded-time-to-recovery constraint $\Delta_R$; curves for CEM, DE, BO, SPSA, and dynamic programming.]
  • 31. 19/35 The Benefit of Optimal Recovery Control [Figure: cost $J_i$ of the optimal strategy and of the periodic strategy, plotted against $D_{\mathrm{KL}}(\text{no intrusion} \parallel \text{intrusion})$.] Key insight (benefit of optimal recovery): optimal recovery control can significantly reduce operational cost, given that an intrusion detection model is available.
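  • 32. 20/35 Intrusion Tolerance as a Two-Level Control Problem [Figure: two-level control architecture. Node controllers with strategies $\pi_1(b_1), \ldots, \pi_{N_t}(b_{N_t})$ transmit their beliefs $b_1, \ldots, b_{N_t}$ to the system controller with strategy $\pi(b_1, \ldots, b_{N_t})$; the replicated system serves clients and is targeted by an attacker.]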
  • 33. 21/35 The Global Control Problem
    - Constrained Markov decision process $\Gamma$.
    - States: $\mathcal{S}_S = \{0, 1, \ldots, s_{\max}\}$, the number of healthy nodes.
    - Controller actions: add $a_t^{(C)} \in \{0, 1\}$ nodes.
    - Dynamics $f$: depend on the local nodes.
    - Markov strategy: $\pi : \mathcal{S}_S \to \{0, 1\}$.
    [Figure: two-level control architecture.]
  • 34. 22/35 System Controller Objective
    - Cost: $J \triangleq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} a_t$.
    - Constraint: $T^{(A)} \geq A$, where $T^{(A)}$ is the availability.

    A          Allowed service downtime per year
    0.9        36 days
    0.95       18 days
    0.99       3 days
    0.999      8 hours
    0.9999     52 minutes
    0.99999    5 minutes
    1          0 minutes
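    The downtime figures in the table follow from the availability by simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability):
    """Allowed service downtime per year, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Availability levels from the table above.
downtime = {a: allowed_downtime_minutes(a) for a in (0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999, 1)}
```

    For example, an availability of 0.9 allows 36.5 days per year and 0.999 allows about 8.76 hours, so the table entries appear to be these values rounded down.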
  • 35. 23/35 System Reliability Analysis
    - The mean time to failure (MTTF) is the mean hitting time of a state where $s_t \leq f$: $\mathbb{E}[T^{(F)} \mid S_1 = s_1] = \mathbb{E}_{(S_t)_{t \geq 1}}\left[\inf\{t \geq 1 \mid S_t \leq f\} \mid S_1 = s_1\right]$.
    [Figure: the MTTF $\mathbb{E}[T^{(F)}]$ as a function of the number of initial nodes $N_1$ and the failure probability per node $p_i \in \{0.05, 0.025, 0.01\}$.]
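    A mean hitting time of this kind can be computed exactly for a simplified chain. The sketch below assumes, purely for illustration, that each healthy node fails independently with probability p per step and that no nodes are recovered or added, so the number of healthy nodes can only decrease; the function name and these dynamics are assumptions, not the model from the talk.

```python
from math import comb


def mean_steps_to_failure(n_initial, f, p):
    """Expected number of transitions until at most f nodes are healthy.

    Assumes independent per-step node failures with probability p > 0 and no
    recoveries or additions. With the slide's time index starting at t = 1,
    the hitting time is this value plus one.
    """
    expected = [0.0] * (n_initial + 1)  # expected[s] = 0 for every s <= f
    for s in range(f + 1, n_initial + 1):

        def survive(k):
            # Probability that exactly k of the s healthy nodes survive one step.
            return comb(s, k) * (1.0 - p) ** k * p ** (s - k)

        # First-step analysis: E_s = 1 + sum_k P(s -> k) * E_k, solved for E_s.
        total = 1.0 + sum(survive(k) * expected[k] for k in range(f + 1, s))
        expected[s] = total / (1.0 - survive(s))
    return expected[n_initial]


# Hypothetical example: 10 initial nodes, failure once no node is healthy.
mttf_example = mean_steps_to_failure(10, 0, 0.05)
```

    Because the assumed chain is monotone, the linear system for the hitting times is triangular and can be solved in a single pass from small to large states.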
  • 36. 24/35 Theorem 3 (Optimal Control Strategy Existence). Assume that (A) the Markov chain induced by any control strategy is unichain, and (B) the availability constraint is feasible. Then the following holds:
    1. There exists an optimal stationary replication control strategy.
    2. The optimal strategy has a threshold structure.
    3. An optimal replication control strategy can be computed using linear programming.
  • 37. 25/35 Efficient Computation of Optimal Replication Control Strategies [Figure: mean compute time in seconds (log scale) to obtain an optimal replication control strategy, as a function of the maximum number of nodes $s_{\max}$ from 4 to 2048.]
  • 38. 26/35 The Benefit of Optimal Replication Control [Figure: availability over time $t$; curves labeled optimal strategy, $N_1 = 10$, and $N_1 = 100$.] Key insight (benefit of optimal replication control): optimal replication control can guarantee a high service availability in expectation. The benefit of optimal replication is most prominent for long-running systems.
  • 39. 27/35 Summary of the Control-Theoretic Model
    - Intrusion recovery control.
        - Partially observed Markov decision process.
        - Threshold structure of optimal control strategies.
        - Efficient computation through stochastic approximation.
    - Replication control.
        - Constrained Markov decision process.
        - Threshold structure of optimal control strategies.
        - Efficient computation through linear programming.
    [Figure: two-level control architecture.]
  • 41. 29/35 The TOLERANCE Architecture: two-level recovery and replication control with feedback.
    [Figure: the TOLERANCE architecture. Each node $1, \ldots, N_t$ comprises hardware, a virtualization layer, a privileged domain, and an application domain, with a node controller and a service replica; IDS alerts feed the node controller, which triggers recovery. The nodes run a consensus protocol and send state estimates to the system controller, which evicts or adds nodes. Clients send service requests and receive responses; the attacker makes intrusion attempts.]
    - A replicated web service which offers two operations:
        - A read operation that returns the service state.
        - A write operation that updates the state.
  • 42. 30/35 Intrusion-Tolerant Consensus Protocol (MinBFT) [Figure: message sequence diagrams for six protocol phases among replicas 1 to 3. a) Normal operation: client and replica 1 (leader); messages request, prepare, commit, reply. b) View change: the leader of view $v$ crashes and a leader of view $v + 1$ takes over; messages request, view-change, view-change, new-view. c) Checkpoint: messages checkpoint. d) State transfer: the controller recovers the compromised replica 1; message labels recover, request, state, state. e) Join: a new replica joins across a change from view $v$ to $v + 1$; messages join-request, join, new-view, join-reply. f) Evict: the system controller evicts a replica across a change from view $v$ to $v + 1$; messages evict-request, evict, new-view, exit-reply.]
  • 43. 31/35 Experiment Setup - Emulated Intrusions
    Table 1: Intrusion steps.

    Replica ID   Intrusion steps
    1            TCP SYN scan, FTP brute force
    2            TCP SYN scan, SSH brute force
    3            TCP SYN scan, Telnet brute force
    4            ICMP scan, exploit of CVE-2017-7494
    5            ICMP scan, exploit of CVE-2014-6271
    6            ICMP scan, exploit of CWE-89 on DVWA
    7            ICMP scan, exploit of CVE-2015-3306
    8            ICMP scan, exploit of CVE-2016-10033
    9            ICMP scan, SSH brute force, exploit of CVE-2010-0426
    10           ICMP scan, SSH brute force, exploit of CVE-2015-5602
  • 44. 32/35 Experiment Setup - Background Traffic
    Table 2: Background services.

    Background services                   Replica ID(s)
    FTP, SSH, MongoDB, HTTP, TeamSpeak    1
    SSH, DNS, HTTP                        2
    SSH, Telnet, HTTP                     3
    SSH, Samba, NTP                       4
    SSH                                   5, 7, 8, 10
    DVWA, IRC, SSH                        6
    TeamSpeak, HTTP, SSH                  9
  • 45. 33/35 Estimated Distributions of Intrusion Alerts
    [Figure: estimated alert distributions $\hat{z}_i(o_i \mid a_i^{(A)})$ over $o_i \in \mathcal{O}$, one panel per intrusion type: CVE-2010-0426, CVE-2015-3306, CVE-2015-5602, CVE-2016-10033, CWE-89, CVE-2017-7494, CVE-2014-6271, and FTP, SSH, Telnet brute force. Each panel compares attack ($a_i^{(A)} = A$) with false alarms ($a_i^{(A)} = F$).]
    - We estimate the observation distribution $z$ with the empirical distribution $\hat{z}$ based on $M$ samples.
    - $\hat{z} \xrightarrow{\text{a.s.}} z$ as $M \to \infty$ (Glivenko-Cantelli theorem).
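    The empirical estimate amounts to normalized counts over the observed alert values. A minimal sketch; the function name and the sample values are hypothetical and do not come from the testbed.

```python
from collections import Counter


def empirical_distribution(samples):
    """Empirical probability mass function of a list of discrete observations."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    m = len(samples)
    return {obs: count / m for obs, count in counts.items()}


# Hypothetical alert counts observed during M = 8 time-steps.
z_hat = empirical_distribution([0, 0, 120, 0, 4000, 120, 0, 0])
```

    With more samples the relative frequencies stabilize, which is the behavior the convergence statement above formalizes.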
  • 46. 34/35 Comparison with State-of-the-art Intrusion-Tolerant Systems [Figure: comparison between our optimal control strategies (TOLERANCE) and the baselines no-recovery, periodic, and periodic-adaptive. Columns: average availability $T^{(A)}$, average time-to-recovery $T^{(R)}$, and average recovery frequency $F^{(R)}$. The x-axes indicate values of the maximum time-to-recovery $\Delta_R \in \{15, 25, \infty\}$; rows relate to the number of initial nodes $N_1 \in \{3, 6, 9\}$.]
  • 47. 35/35 Conclusion
    - We present a control-theoretic model of intrusion tolerance.
    - We establish structural results.
    - We evaluate the optimal control strategies on a testbed.
    - Our control-theoretic strategies have stronger theoretical guarantees and significantly better practical performance than the heuristic control strategies used in state-of-the-art intrusion-tolerant systems.
    [Figure: two-level control architecture.]