UNIVERSITY OF CATANIA
MASTER’S THESIS
A Distributed Algorithm for Stateless
Load Balancing
Author:
Andrea TINO
Supervisor:
Prof. Eng. Orazio
TOMARCHIO
Assistant Supervisor:
Eng. Antonino BLANCATO
A thesis submitted in fulfillment of the requirements
for the degree of Master of Engineering
in the
Faculty of Computer Science Engineering
Department of Electrical, Electronic and Computer Science Engineering
July 21, 2017
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Declaration of Authorship
I, Andrea TINO, declare that this thesis titled, “A Distributed Algorithm for Stateless
Load Balancing” and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research de-
gree at this University.
• Where any part of this thesis has previously been submitted for a degree or
any other qualification at this University or any other institution, this has been
clearly stated.
• Where I have consulted the published work of others, this is always clearly
attributed.
• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed my-
self.
Signed:
Date:
“I like thinking that this work of mine somehow reflects my international personality. The
idea of this algorithm crossed my mind while I was working in Japan (summer 2012)
on non-deterministic mathematical models describing fast similarity-search algorithms. It’s
incredible how some ideas come to life so spontaneously! I then started developing the
foundations of the algorithm in Italy. After joining Microsoft as an employee, I kept
working on this project in Denmark. Even on vacation, I found time to work on this
thesis while roaming several areas of South Korea. It is also worth mentioning that I
worked on some chapters while I was in the Netherlands.
When I think about this, I feel happy! ”
Andrea Tino
University of Catania
Abstract
Faculty of Computer Science Engineering
Department of Electrical, Electronic and Computer Science Engineering
Master of Engineering
A Distributed Algorithm for Stateless Load Balancing
by Andrea TINO
The algorithm that is the object of this thesis addresses the problem of balancing data
units across different stations in the context of storing large amounts of information in
data stores or data centres. The approaches in use today are mainly based on a central
balancing node, which often requires information from the different stations about
their load state.
The algorithm proposed here follows the opposite strategy: data is balanced without
the use of any centralized balancing unit, thus fulfilling the distributed property, and
without gathering any information from stations about their current load state, thus
fulfilling the stateless property.
This document goes through the details of the algorithm by describing the idea and
the mathematical principles behind it. By means of an analytical proof, the balancing
equation is devised and introduced. Later on, tests and simulations, carried out with
different environments and technologies, illustrate the effectiveness of the approach.
Results are introduced and discussed in the second part of this document, together
with final notes about the current state of the art, open challenges and deployment
considerations in real scenarios.
(IT) L’algoritmo oggetto della tesi tratta il problema del bilanciamento di unità
dati all’interno di un pool di diverse stazioni, contestualmente alla necessità di man-
tenere in persistenza grandi quantità di informazione all’interno di server-farm o
data-centre. Le strategie tuttora in utilizzo sono principalmente basate sull’impiego
di un componente centrale per il bilanciamento il quale, spesso, necessita di alcune
informazioni da parte dei nodi della rete circa il loro stato attuale di carico.
L’algoritmo proposto in questa sede procede verso un approccio diametralmente
opposto per cui il bilanciamento dati viene effettuato senza l’utilizzo di alcun com-
ponente centralizzato, da cui la proprietà distributed, e senza la necessità di ottenere
alcun dato dalle stazioni relativamente al loro stato di carico, da cui la proprietà
stateless.
In questo documento, procederemo nell’esaminare i dettagli dell’algoritmo tramite
una descrizione dell’idea di fondo e dei principi matematici alla sua base. Attraverso
l’impiego di una dimostrazione analitica, verrà dedotta e analizzata l’equazione di bi-
lanciamento. Successivamente, procederemo ad esaminare i test e le simulazioni, en-
trambi condotti tramite diverse tecnologie, a supporto dell’efficacia dell’algoritmo.
I risultati verranno esaminati e discussi nella seconda parte di questo documento,
assieme alle note finali riguardo lo stato corrente della tecnologia nel campo del
bilanciamento dati. Verranno esaminate, inoltre, le problematiche e gli scenari di
utilizzo dell’algoritmo.
Contents
Declaration of Authorship
Abstract
1 Introduction
  1.1 About Balancing
      Objective
  1.2 Describing the scenario
  1.3 Characterization of balancing algorithms
      1.3.1 Randomness
      1.3.2 State
      1.3.3 Static vs. dynamic
      1.3.4 Centralization
      1.3.5 DU retrieval
  1.4 Well known balancing algorithms
      1.4.1 Round Robin
      1.4.2 Weighted Round Robin
      1.4.3 Random
      1.4.4 Source Address Hash
      1.4.5 Least Load
      1.4.6 Graph based algorithms
            Nearest Neighbour
            RAND
            Never Queue
            THRESHOLD
  1.5 System overview
2 The algorithm
  2.1 Network organization
      Station ordering
      Ring access
      Routing
  2.2 Unbalanced ring
  2.3 Balancing the ring
      2.3.1 Extending the ring
            Adapting concepts in extended ring
            Defining sizing equations
      2.3.2 Designing hash function φ
            Designing r.v. sφ’s PDF
            Designing r.v. sφ
      2.3.3 Understanding how φ works
  2.4 Ring balancing example
      2.4.1 Defining the ring
      2.4.2 Defining the formatting impulse
      2.4.3 Binding impulses to stations
      2.4.4 Calculating amplitudes
      2.4.5 Computing functions
3 Simulation results
  3.1 Small-size simulations
      3.1.1 Verifying load balance
      3.1.2 Evaluating load levels per station
  3.2 Large-size simulations
      3.2.1 Overview
      3.2.2 Evaluating the variance of hash segment amplitudes
      3.2.3 Evaluating load levels per station
            Migration flows
4 System API
  4.1 Storing data
      4.1.1 Packet fragmentation
      4.1.2 Routing
  4.2 Retrieving data
5 Dynamic conditions
  5.1 Scalability
      5.1.1 Updating φ
            Broadcasting in DHT
      5.1.2 Load re-arrangement
      5.1.3 Scaling overall impact
      5.1.4 Ring scale-down
  5.2 Fault conditions
      5.2.1 Collisions threshold
6 Conclusions and final notes
  6.1 Open issues
  6.2 What’s next
A C/C++ simulation engine’s architecture
Acknowledgements
Bibliography
List of Abbreviations
BS Balancing System
DSLB Distributed Stateless Load Balancing
PA Proposed Algorithm
SS Storage System
BA Balancing Algorithm
BP Balancing Pool
DLB Data Load Balancing
DLBA Data Load Balancing Algorithm
DU Data Unit
LD Load Distribution
SL Station Load
CDF Cumulative Distribution Function
PDF Probability Density Function
DHT Distributed Hash Table
HS Hash Segment
ID IDentifier
LS Leaf Set
LLS Lower Leaf Set
P2P Peer To Peer
ULS Upper Leaf Set
API Application Program Interface
r.v. random variable
List of Symbols
N Number of stations
Ω Balancing pool (set)
P Data Units (packets) (set)
si Station
p Data Unit (packet)
ψ Packet/station assignment application
Σ Load distribution
l Hash length (number of bits)
ξ Hash function
h Hash string (number)
hξ Regular hash string (number)
hφ φ-hash string (number)
η Station packet load (number)
η(ξ) Station packet load (via ξ) (number)
η(φ) Station packet load (via φ) (number)
πi Packet in station probability (probability)
πi(ξ) Packet in station probability (via ξ) (probability)
πi(φ) Packet in station probability (via φ) (probability)
f PDF (function)
F CDF (function)
F−1 CDF Inverse (function)
g Formatting impulse (function)
G Formatting impulse antiderivative (function)
Λ Leaf set
ΛU Upper leaf set
ΛL Lower leaf set
dedicated to my Mother and my Father
Chapter 1
Introduction
1.1 About Balancing
Under the umbrella term balancing it is possible to refer to different problems and
solutions: balancing of connections, of workloads, of tasks or of data. What tells each
single type of balancing apart from the others is what is being balanced.
Definition 1 (Balancing). In Computing and Computer Science, balancing indicates the
problem of distributing an indefinitely high number of entities across multiple subjects (stations).
The assignment is performed so as to guarantee that, at any given time, all stations hold
(roughly) the same number of entities.
From which follows:
Definition 2 (Balancing Algorithm). An algorithm, or a system based on a certain algo-
rithm, designed to solve the balancing problem.
In the context of this research work, we are going to focus on a specific type of
balancing:
Definition 3 (Data Load Balancing). A type of balancing focusing on units of data often
referred to as packets or, simply, data units.
The solutions to the latter are referred to as:
Definition 4 (Data Load Balancing Algorithm). A BA targeting DLB in order to solve
it by minimizing a certain objective function.
This research effort focuses on DLB and DLBAs in order to introduce a new al-
gorithm targeting multiple performance metrics.
Objective
The objective of this thesis is to employ this algorithm in a cloud storage
system designed to serve different applications. The system must be capable of:
1. Accepting data as input which will be stored in a pool of servers (stations).
2. Retrieving stored data on demand.
3. Removing stored data on demand.
These essential functionalities must be enabled by the BA and its design.
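As a minimal sketch of these three capabilities, the storage system can be seen as exposing a store/retrieve/remove interface. The single in-memory map below is a hypothetical stand-in for the balanced pool of servers described later; it is not the thesis's design, only an illustration of the API surface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy storage-system interface: one in-memory map stands in for the
// server pool, keyed by packet ID.
struct ToyStore {
    std::map<unsigned, std::string> pool;  // packet ID -> stored data

    void store(unsigned id, const std::string& du) { pool[id] = du; }  // (1)
    std::string retrieve(unsigned id) const { return pool.at(id); }    // (2)
    void remove(unsigned id) { pool.erase(id); }                       // (3)
};
```

In the real system, the balancing algorithm decides which server of the pool backs each key.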
1.2 Describing the scenario
Let us describe DLB and its most important aspects in formal terms.
The network A certain number N of stations is always considered; together
they form a balancing pool, a set which we will indicate as Ω. Each station si ∈ Ω
(with i = 1 . . . N) is connected to the others by a generic protocol; we do not assume
any specific communication technology. The only required assumption is that
the protocol supports direct addressing of each station (one unique address per sta-
tion).
Station assignment At any given time, a DU (or packet) p ∈ P (P being the set
of all DUs) must be stored in the BP. The system/algorithm responsible for carrying
out this activity is expressed by the application ψ : P → Ω, which assigns a DU
to a station.
The way ψ works is essentially the core of the balancing system. The application
will choose a station with the objective of guaranteeing that, at any given time, all
stations roughly have the same number of packets stored in them.
Remark. Application ψ typically receives a DU as input: ψ(p); however, it may accept
more arguments, ψ(p, ·), depending on the strategy it uses to perform the balancing.
Loads At any given time, each station si in the BP will have a certain number of
DUs assigned to it; we indicate this quantity, the station load, with the operator |si|, thus:
|si| = |{p ∈ P : ψ(p) = si}| ∈ N (1.1)
This quantity can sometimes be expressed as the number of bytes (or any of its
multiples) of the total packets stored in a station:
|si| = Σ_{p ∈ P : ψ(p)=si} |p| ∈ R (1.2)
where |p| indicates the length of a packet (typically in bytes). Unless otherwise speci-
fied, we will refer to the former definition.
To have an overview of the balancing state, another quantity is introduced: the load
distribution, indicated with the symbol Σ = (|s1|, |s2|, . . . , |sN |) and representing the
ordered vector of station loads at any given time.
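The bookkeeping above can be sketched in a few lines. Here psi is a hypothetical stand-in assignment application (a simple modulo over packet IDs, not the thesis's ψ), and loadDistribution computes Σ = (|s1|, . . . , |sN|) by counting the DUs that psi maps to each station.

```cpp
#include <cassert>
#include <vector>

// Stand-in assignment application: maps a packet ID to a station index.
int psi(int packetId, int N) { return packetId % N; }

// Computes the load distribution Sigma over N stations.
std::vector<int> loadDistribution(const std::vector<int>& packetIds, int N) {
    std::vector<int> sigma(N, 0);
    for (int p : packetIds)
        sigma[psi(p, N)]++;   // |s_i| = #{p in P : psi(p) = s_i}
    return sigma;
}
```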
Time The algorithm runtime does not require a continuous description. Therefore,
time will be considered discrete, t ∈ N, and characterized by events, an event being,
for instance, the arrival of a new DU to be routed in the network.
1.3 Characterization of balancing algorithms
DLBAs can be differentiated from several points of view. Considering this clas-
sification is important in order to locate the proposed algorithm within the taxonomy
of today’s most used systems.
1.3.1 Randomness
Algorithms can employ non-deterministic components like pseudo-random number
generators in order to pick the station to associate with a DU. This approach is
reasonable because, provided the generator has a uniform PDF, it
guarantees a fairly good level of balancing at low cost.
Proposition 1 (Determinism). The algorithm being proposed is fully deterministic.
1.3.2 State
In order to perform balancing, algorithms may require stations to keep information
regarding their load state (e.g. disk usage or residual available space). Statefulness
implies that stations communicate this state information to other stations or to special
nodes in the network; by employing this knowledge, the BA can perform a more
precise job. The downside is mostly the communication overhead, as state
information must be regularly exchanged.
Proposition 2 (Statefulness). The algorithm being proposed is stateless: it does not require
stations to send any information in order to perform balancing.
1.3.3 Static vs. dynamic
The balancing can happen at two possible points in time:
• At runtime The association to a station is performed while the DU is being
transmitted to stations. In these conditions, the same DU might be routed to
different stations depending on the contingent situation. It is also possible to
have DUs being re-routed. This behaviour makes the algorithm dynamic. As
a rule of thumb, a BA is dynamic when it is not possible to always know in
advance where a DU will be routed until the algorithm is actually run.
• Before runtime The destination station is known before the algorithm is run.
This makes the algorithm static. As a rule of thumb, a BA is static when the
same DU is always routed to the same station.
Proposition 3 (Staticity). The algorithm being proposed is static.
1.3.4 Centralization
This property determines whether the algorithm requires a central node in the net-
work used to perform the balancing. A centralized BA requires a unit, called bal-
ancer, which takes care of routing the DU to its destination station. Conversely, a
distributed BA will not need this extra component. Centralized BAs are easier to
implement, but they have 2 major downsides:
• All traffic must pass through the balancer which acts like a hub node.
• The balancer represents a single point of failure in the network: if it goes down,
the whole network is compromised. Safety mechanisms can be employed in
order to avoid network downtime by limiting the outage to the balancing fea-
ture only: if the balancer fails, traffic will still be routed to stations but no
balancing will occur.
This property also impacts the topology of the network. Typically, centralized algo-
rithms employ a star topology where the balancer is the central node.
Proposition 4 (Non-centralization). The algorithm being proposed is distributed.
4 Chapter 1. Introduction
1.3.5 DU retrieval
A very important characteristic of a BA is how a DU can be retrieved once
it has been stored in a station. Strictly speaking, this does not relate to the BA itself, as
DU retrieval is an aspect of the storage algorithm (which employs the BA);
however, the two systems are connected and will be treated as one.
An essential part of the data-retrieval story is DU and station identification. Since
a DU is assigned to a station, the association performed by si = ψ(p) must be iden-
tifiable. We have already introduced station identifiers, so we need to do the same
with DUs and introduce, for a packet p, its identifier p̂ ∈ N (the pair
(p, p̂) is unique).
A key concept to understand is that station association application ψ works both
with packets ψ : P → Ω and with packet IDs: ψ : N → Ω.
• If the algorithm is dynamic, then ψ is not a well-defined function of the packet
ID and will not always return the same station when invoked. In such a case, the
association (p̂, si) must be saved somewhere. This condition requires the algorithm
to employ a database functioning as a lookup table, which of course takes resources
and impacts memory.
• If the BA is static, then ψ is deterministic and it will always be possible to retrieve
the station where a packet p was routed by simply calculating ψ(p̂). This means
it is not necessary to store the coordinates of a packet in order to retrieve it.
When a DU is sent for storage, its ID is used by the owner as a key to retrieve it at a
later time.
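The static case can be made concrete with a toy example. Here psiStatic is a hypothetical, purely deterministic assignment application (a modulo over the packet ID, not the thesis's ψ): because it is a pure function of p̂, the destination station can be recomputed at retrieval time from the packet ID alone, with no lookup table.

```cpp
#include <cassert>

// Purely deterministic stand-in for a static assignment application:
// the same packet ID always yields the same station.
int psiStatic(unsigned packetId, int N) {
    return (int)(packetId % (unsigned)N);
}
```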
1.4 Well known balancing algorithms
The proposed algorithm competes with other algorithms available on the market
today and commonly used in many different application domains. We are going to
describe some of them in order to compare, later, how the proposed approach ranks
among them.
1.4.1 Round Robin
This class of algorithms keeps a counter c = 1 . . . N which points to the destination
station sc where the current DU will be routed to. The counter is incremented at
every new incoming packet: ct+1 = (ct + 1)%N. These BAs guarantee a very precise
balancing as fairness in packet association is their strongest point.
Such algorithms are deterministic, stateless, static and require a packet lookup
table. They are typically centralized and the balancer keeps track of the counter.
However this condition is not a limitation as it is still possible to use Round Robin
in a distributed way, though such an approach is not common in the market today.
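The counter rule above can be sketched as a tiny balancer; station indices here run from 0 to N−1 rather than 1 to N, which is an implementation convenience, not a change to the scheme.

```cpp
#include <cassert>

// Minimal Round Robin balancer: a counter c cycles over the stations;
// each incoming DU goes to station c, then c advances.
struct RoundRobin {
    int N, c = 0;
    explicit RoundRobin(int n) : N(n) {}
    int route() {               // destination station for the next DU
        int dest = c;
        c = (c + 1) % N;        // c_{t+1} = (c_t + 1) % N
        return dest;
    }
};
```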
1.4.2 Weighted Round Robin
Like Round Robin, these algorithms guarantee fairness by looping through stations.
However, the counter is used in a different way: every station si is assigned not
one number but a range of contiguous numbers ci . . . c̄i; the counter ranges
from 1 to C = maxi c̄i and is incremented according to the rule ct+1 = (ct + 1)%C.
The wider the interval c̄i − ci for a station si, the more packets that station will
receive. Although this may seem to break the balancing principle, this approach
makes it possible to take into account stations that do not have the same storage
capabilities. A possible application is assigning more DUs to stations with a larger
storage capacity.
These algorithms are typically centralized and deterministic, and they still require
a packet lookup table. Given their nature, they can be either static/stateless or dy-
namic/stateful; the latter implementation is valid when the characteristics of stations
can change over time, the state typically being related to storage capabilities.
1.4.3 Random
Random algorithms use a random number generator employing a discrete random
variable distributed in range 1 . . . N to choose the destination station si. Such algo-
rithms are non-deterministic, dynamic, stateless and always require a packet lookup
table. They can be either centralized or distributed, though the former are the most
common in the market.
One important aspect of these BAs is the requirement on the probability distri-
bution of the random variable employed by the number generator. The distribution
must be uniform in order to have proper balancing.
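A sketch of this strategy follows; the uniform distribution over the station indices is exactly the requirement stated above for proper balancing.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Random balancing: a uniform discrete random variable over the station
// indices picks the destination for each DU.
int randomRoute(std::mt19937& gen, int N) {
    std::uniform_int_distribution<int> dist(0, N - 1);  // uniform PDF
    return dist(gen);
}
```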
1.4.4 Source Address Hash
Commonly used in TCP/IP applications, this class of centralized algorithms em-
ploys a balancing unit which assigns a hash range hi . . . h̄i to each station si. Hashes
are evaluated numerically, so ranges are contiguous sequences of integer
numbers, hi, h̄i ∈ N. When a packet arrives, the balancer computes a hash h ∈ N
of the source address (the hashing function can be one of the well-known
cryptographic ones or some ad-hoc implementation) and routes the DU to the
station whose hash range includes the calculated hash.
Source Address Hash algorithms guarantee that packets coming from the same source
are routed to the same station. These algorithms are deterministic, static, stateless
and require a lookup table. The balancing relies on the hash function: a crypto-
graphic hash is necessary to guarantee that packets from sources whose addresses
differ by only a few bits do not all end up in the same station. Given their implementation,
such algorithms can offer a fairly good balancing.
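The scheme can be sketched as follows. std::hash stands in for the cryptographic hash the text recommends, and the N hash ranges are taken as an even split of the hash space, so the owning station reduces to h mod N; both choices are simplifications for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Source Address Hash: hash the source address and route to the station
// whose (even-split) hash range contains the value.
int hashRoute(const std::string& sourceAddr, int N) {
    uint64_t h = (uint64_t)std::hash<std::string>{}(sourceAddr);
    return (int)(h % (uint64_t)N);   // owner of the range containing h
}
```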
1.4.5 Least Load
These algorithms usually rely on a centralized balancer; however, it is still possible
to perform the balancing in a distributed way (though this implementation is not
common in the market). When a DU arrives, it is routed to the station with the
lowest current load. This means the balancer needs to know the load of each
station, which is why these algorithms are stateful and typically introduce
considerable overhead in the network. Least Load algorithms are deterministic,
dynamic and require a lookup table.
6 Chapter 1. Introduction
1.4.6 Graph based algorithms
There is a class of (typically) stateful, distributed and dynamic algorithms which
calculate the destination station for a DU based on a graph-search algorithm run on the
network. These algorithms are usually highly scalable, and no assumption is made
on the topology of the network (free topology).
Nearest Neighbour
A random node si is picked in the network in order to initiate the transmission of
a DU. The balancing is performed only within the subnet formed by
the chosen node and its direct neighbours sj (with j ≠ i). The algorithm picks the
station that minimizes a specific metric, usually the load state of each node.
RAND
This algorithm is non-deterministic and randomly selects a node (station) where to
route a packet p. A threshold L ∈ R is considered for a certain metric (usually the
load state). If the packet exceeds the threshold (|p| > L), the DU is re-routed to
another randomly selected station; otherwise it is stored in the current one.
The most prominent characteristic of this algorithm is its statelessness. Since
RAND does not require any information from stations, the implementation is very
easy, minimal overhead is generated (only threshold-exceeding packets need
re-transmission) and the balancing is fairly good.
CYCLIC This algorithm is a variant of RAND in which minimal state informa-
tion is kept: the last station a packet was re-transmitted to is always remembered by
the system; this guarantees that the same station is not picked twice in a row in
case of consecutive threshold-exceeded events.
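RAND's routing step can be sketched in a few lines; the station count, threshold and packet size below are illustrative values, not taken from the thesis.

```cpp
#include <cassert>
#include <random>

// RAND: the packet lands on a random station; if its size exceeds the
// threshold L there, it is re-routed once to another random station.
int randRoute(std::mt19937& gen, int N, double packetSize, double L) {
    std::uniform_int_distribution<int> dist(0, N - 1);
    int station = dist(gen);       // initial random selection
    if (packetSize > L)            // |p| > L: one stateless re-route
        station = dist(gen);
    return station;
}
```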
Never Queue
Some algorithms use state information exchanged among stations in order
to evaluate data not strictly related to nodes. Typically the state is represented by
station-specific quantities, like the current load or the residual amount of storage
memory left. Never Queue employs a different kind of state across the network, in order
to be able to evaluate, the moment a DU arrives and needs to be stored, the station
to which transmitting the packet implies the least cost. Thus, the algorithm always
transmits the packet to the fastest available node.
This balancing strategy (which requires a lookup table) is poor, as it does not guar-
antee a good quality of the overall balancing across stations. What it does guarantee,
and this is the reason we mention this approach, is good performance
in processing packets in terms of throughput and latency. However, this comes at a cost
in overhead due to the amount of information exchanged by nodes to refresh the
network state.
THRESHOLD
This class of highly dynamic algorithms uses network state information, derived
from message exchange among stations, to decide where a packet will be stored. An
incoming packet p is initially routed to a random node (this makes the algorithm
non-deterministic); that station then compares the packet size with a load threshold
L ∈ R and decides what to do. If the threshold is not exceeded (|p| < L), the
DU is stored in the current station; otherwise another station is picked via a polling
mechanism. A maximum number of attempts M ∈ N is considered, after which a
packet stays in the current station even though the threshold is exceeded.

FIGURE 1.1: Overall system architecture. The end user interacts only
with the storage system, while the balancing system is hidden from the
user and transparent to the storage system with regard to accessing
the server pool.

Even though this approach looks a lot like RAND, it differs in the way a station
is selected. Only the initial node selection is random; in case the threshold is exceeded,
the next station is picked through a process based on analysis of the network state.
LEAST This algorithm works like THRESHOLD but is limited to one single iter-
ation. When a packet arrives at a randomly selected node and the threshold is
exceeded, the algorithm polls a certain pool of stations to pick the next one in its
first and only attempt to route the packet. The station is usually picked based on its
current load (least-loaded node). After that, no further attempts are performed. Think
of LEAST as THRESHOLD with M = 1.
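The LEAST variant (THRESHOLD with M = 1) can be sketched as follows; the polled pool here is simply all stations, and the load values and threshold are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// LEAST: a random station is selected; if the packet size reaches the
// threshold L there, a single poll picks the least-loaded station.
int leastRoute(std::mt19937& gen, const std::vector<double>& loads,
               double packetSize, double L) {
    std::uniform_int_distribution<int> dist(0, (int)loads.size() - 1);
    int station = dist(gen);          // initial random selection
    if (packetSize >= L)              // threshold exceeded: one poll only
        station = (int)(std::min_element(loads.begin(), loads.end())
                        - loads.begin());
    return station;
}
```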
1.5 System overview
Before detailing how the PA works, it is important to understand the architecture
of the storage system being designed as part of the research in this thesis. From the
point of view of the API that the overall system exposes, we recognize two major
components:
• Storage system The component interacting with the end user and providing
the API for storing and retrieving data.
• Balancing system The component responsible for arranging the storage pool
and balancing data across its servers.
As shown in Figure 1.1, the architecture separates concerns by defining two
sets of APIs: one exposed to the user, for submitting and retrieving data, and an-
other one, hidden from the user, which is responsible for balancing DUs in the storage
pool.
Chapter 2
The algorithm
In chapter 1, we anticipated the most important properties of the PA.
What makes this algorithm innovative is that it is at the same time dis-
tributed, static and stateless, and it does not require a lookup table. In this chapter we
describe the mathematics behind the algorithm which makes all this
possible.
2.1 Network organization
The PA does not allow stations to be arranged in a free connection scheme. A strong
assumption is made on the topology of the network, which follows the DHT¹ scheme.
A few considerations are due about the protocol used by stations
to communicate with each other. No assumption is made on the
networking technology: any protocol can be employed in the network (several
protocols can actually coexist, as long as they are compatible with each other), with
one requirement:
Proposition 5 (Networking protocol). Stations are free to employ any arbitrary commu-
nication protocol, as long as it guarantees direct addressing and that a message can be
delivered between every pair of stations.
The real-world scenario considered here is the Internet with the TCP/IP protocol stack.
Given proposition 5, one station is capable of communicating with every other
one in the network; however, this rarely happens in practice because a node’s
address must be known (this is the reason direct addressing is essential). Networks
can be organized according to DHT specifications thanks to a limited knowledge of
other nodes’ addresses.
The direct consequence of proposition 5 is that a separation is made between sta-
tions’ physical and logical connection schemes. Physically, stations can be arranged
without any constraint; however, the limited knowledge of other nodes in the graph
allows the system to generate an overlay network, which is the one considered
here.
A station is supposed to have very limited knowledge of the other sta-
tions, keeping only a few of them in memory (its neighbours). According to
DHT specifications, a ring topology is employed: it derives from each node holding
a neighbour set of only two nodes, a predecessor and a successor, as shown in figure 2.1.
¹ Distributed Hash Tables are employed in distributed networks. This scheme was adopted by 4 major P2P systems: Chord, Pastry, CAN and Tapestry.
FIGURE 2.1: An N = 8 network example showing the logical ring topology. Each station is assigned an ID (typically the IP address hash) and packets are routed by content.
Station ordering
A natural order occurs among stations. When the ring is formed, every node si computes an identifier Id(si) ∈ ℕ, represented by the hash of the station's address. This identifier is used to build the ring, as every station needs to locate its predecessor and successor in the network:
Lemma 1 (Ring construction). As long as every station has a minimal initial set of connections which guarantees that all nodes form a connected graph, the ring can be built by having every station reshape its neighbourhood with one successor and one predecessor.

After this initialization phase, the ring is online and ready to accept packets.
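The initialization step described in lemma 1 can be sketched in a few lines: hash each station's address to obtain its ID, sort, and take adjacent entries as predecessor and successor. This is a minimal illustration, not the thesis' actual join protocol; the SHA-256 truncation, the address format and all names are assumptions.

```python
import hashlib

def station_id(address: str, bits: int = 16) -> int:
    """ID of a station: hash of its address truncated to `bits` bits."""
    digest = hashlib.sha256(address.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def build_ring(addresses):
    """Sort stations by ID; each one's neighbours are the adjacent entries,
    wrapping around so that the resulting topology is a ring."""
    ring = sorted(addresses, key=station_id)
    neighbours = {}
    for i, addr in enumerate(ring):
        predecessor = ring[i - 1]               # wraps at i = 0
        successor = ring[(i + 1) % len(ring)]   # wraps at the end
        neighbours[addr] = (predecessor, successor)
    return ring, neighbours
```

Calling build_ring on a handful of addresses yields the circular ordering of figure 2.1.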
This aspect is very important as it allows us to define the set of stations Ω as ordered
and we can define:
Definition 5 (Station preceding operator). Given a couple of stations (si, sj) ∈ Ω² (i ≠ j and i, j ≤ N), the operator ≺ : Ω × Ω → {true, false} defines the preceding relation between them. The operator works as follows:

si ≺ sj ⟺ Id(si) < Id(sj)   (2.1)
The first important result is the following:

Theorem 2 (Ring complete ordering). The set of stations Ω with precedence operator ≺ : Ω × Ω → {true, false} is a completely ordered set.

Proof. Immediate by considering that operator ≺ on Ω, because of its definition, directly maps onto operator < on ℕ, which is a fully ordered set.
Ring access
The ring is the place where data is stored, and the purpose of the PA is to help the storage system balance all DUs across stations. The first detail we focus on is how the ring is accessed when the SS needs to send a packet to be stored or retrieved. The ring has to be kept safe both from external nodes and from the very nodes that are part of the network. The way to guarantee the latter is through the following:
FIGURE 2.2: Access to the ring is guarded by proxies.
Proposition 6 (Limited knowledge principle). To guarantee safety and scalability, every
node in the system has a partial knowledge of the overall network.
In order to protect the ring from external activity, direct access to the network must be forbidden:

Proposition 7 (Zero knowledge principle). To guarantee safety from external intrusions, no node, except those in the ring, knows the address of any station in the system.

This principle, though valid, cannot be adopted as-is, because we would otherwise end up with an isolated network. However, in order to fulfil the security features promoted by proposition 7, it is possible to build a guarding system around the ring which hides it from the external world. A collection of proxy stations is employed for this purpose. As shown in figure 2.2, those stations are exposed to the external world and act as intermediaries to the ring, whose addresses are kept private (in order to fulfil proposition 6, proxy stations know the addresses of only a few stations in the ring).
Routing
The SS we are designing is distributed. The basic idea, according to the DHT specifications, is that a packet enters the ring from an arbitrary station, called the entry point, in the context of a transmission. From there, every station, which has the PA deployed, knows whether that packet should be stored there or should otherwise be routed to a different station.
Every station keeps a limited knowledge of the network. This knowledge is represented by the set of neighbour nodes one station keeps. Given the topology, we define a parameter called leaf radius, r ∈ ℕ (r < N), which represents the number of successor (or predecessor) nodes every station holds as its neighbourhood.
Definition 6 (Leaf Set). Every station si ∈ Ω keeps track of its neighbour nodes (plus itself) in an ordered set called leaf set: Λ(si) ⊂ Ω. The leaf set's cardinality is always |Λ(si)| = 2r + 1, where r is the leaf radius of the ring. The following equation holds:

Λ(si) = {sj ∈ Ω : ai,j = 1} ∪ {si}   (2.2)

Where A = [ai,j] ∈ ℕ^(N×N) is the adjacency matrix of the network.
Note how one station’s leaf set contains the station itself. Also, the leaf set is
always an ordered set as it is, by definition, a subset of Ω which we proved being
ordered in theorem 2. Unless otherwise specified, we will always consider r = 1.
Lastly, it is sometimes convenient to picture one station’s leaf set extensively as
the ordered vector of neighbour stations (including si):
Λ(si) = si, . . . , si−1, si, si+1, . . . , si
The notation above helps us detecting nodes si and si as the extreme nodes in
every stations’s leaf set. Those nodes will play an important role when defining
routing function ψ later on.
Definition 7 (Upper Leaf Set). Let si ∈ Ω be a station and Λ(si) ⊂ Ω its leaf set. The upper leaf set ΛU(si) ⊂ Λ(si) is defined as the set of all neighbours which the station precedes:

ΛU(si) = {sj ∈ Λ(si) : si ≺ sj}   (2.3)

This set always has cardinality |ΛU(si)| = r.
Definition 8 (Lower Leaf Set). Let si ∈ Ω be a station and Λ(si) ⊂ Ω its leaf set. The lower leaf set ΛL(si) ⊂ Λ(si) is defined as the set of all neighbours which the station is preceded by:

ΛL(si) = {sj ∈ Λ(si) : sj ≺ si}   (2.4)

This set always has cardinality |ΛL(si)| = r.
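Definitions 6 through 8 can be sketched as index arithmetic on the sorted ring. This is a hypothetical helper, not part of the thesis: it treats neighbourhood circularly, so at the wrap-around point ring adjacency rather than the ID-based precedence of definition 5 is used.

```python
def leaf_sets(ids, i, r=1):
    """Leaf set of station i, treating the ring circularly.

    Returns (full, lower, upper) as lists of station indices, matching
    |Λ| = 2r + 1 and |Λ_L| = |Λ_U| = r from definitions 6-8.
    """
    n = len(ids)
    lower = [(i - k) % n for k in range(r, 0, -1)]   # Λ_L(s_i): predecessors
    upper = [(i + k) % n for k in range(1, r + 1)]   # Λ_U(s_i): successors
    return lower + [i] + upper, lower, upper
```

With r = 1 on an 8-station ring, station 0's leaf set is (7, 0, 1), the extremes being its predecessor and successor.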
When a station receives a packet, it performs certain operations in order to understand whether that packet is to be stored there or elsewhere. In the latter case, the station will pick one of the stations in its leaf set and route the packet there. The next station will repeat the same sequence of operations until the packet is stored in a node. This algorithm is represented by the function ψ : P → Ω; to perform it, every station in the ring uses the same hash function:
Definition 9 (Hash function). Let P be the set of packets and H ⊆ ℕ; we define ξ : P → H as the hash function used to calculate the station where a packet is to be routed in the ring.
Not all hash functions can be used to route packets in the ring:

Proposition 8 (Cryptographic hash function). Hash function ξ : P → H is a cryptographic hash function. By definition, ξ behaves in such a way that a one-bit change in the input packet causes, on average, about 50% of the bits of the output hash to change.
It is possible to consider many cryptographic hash functions out of those currently employed in modern systems. Among the most common today we have the following families: SHA² and MD³.
As anticipated earlier, stations self-organize into a logical overlay ring by assigning IDs. One station's ID Id(si) is computed by using the same hash function ξ on the station's address. For formal consistency, we intend hash function ξ to also work on stations, ξ : Ω → H, which is perfectly valid since hash functions do not really care about the type of data fed as input, as long as it is a bitstream. The expression hi = Id(si) = ξ(si) is to be intended as hash function ξ calculated on station si's address.

² Secure Hash Algorithm. SHA-1 (160 bits), SHA-256 (256 bits), SHA-512 (512 bits).
³ Message Digest. MD2 and MD4 (128 bits), today considered unsafe; MD5 (128 bits) and MD6 (up to 512 bits).
As soon as the ring is initialized and ready to work, the topology will define the ordering of stations, s1 ≺ s2 ≺ · · · ≺ si ≺ · · · ≺ sN−1 ≺ sN, following the order of IDs (hashes): h1 < h2 < · · · < hi < · · · < hN−1 < hN. From here, the DHT assigns to each station a hash segment:
Definition 10 (Hash Segment). Let si ∈ Ω be a station in the ring and hi = Id(si) = ξ(si) its ID. Station si's hash segment Ξ(si) is defined as the set of contiguous hashes ranging from hi up to hi+1 (excluded):

Ξ(si) = {h ∈ H : hi ≤ h < hi+1}                          if i ≠ N
Ξ(si) = {h ∈ H : hN ≤ h ≤ hM} ∪ {h ∈ H : 0 ≤ h < h1}     if i = N     (2.5)

Where hM ∈ H is the highest value that hash function ξ can produce: ξ(·) ∈ [0, hM].
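Equation 2.5, including the wrap-around segment of the last station, can be sketched as follows (a hypothetical helper; half-open integer ranges are an implementation assumption):

```python
def hash_segments(ids, h_max):
    """Hash segment of every station, as in equation 2.5.

    ids   : sorted station IDs h_1 < ... < h_N
    h_max : highest hash value hM that the hash function can produce
    Returns, per station, a list of half-open (start, end) ranges; the last
    station owns two ranges because its segment wraps around zero.
    """
    segments = []
    for i, h in enumerate(ids):
        if i < len(ids) - 1:
            segments.append([(h, ids[i + 1])])               # [h_i, h_{i+1})
        else:
            segments.append([(h, h_max + 1), (0, ids[0])])   # wrap-around
    return segments

def owner(h, ids, h_max):
    """Index of the station whose hash segment contains hash h."""
    for i, ranges in enumerate(hash_segments(ids, h_max)):
        if any(lo <= h < hi for lo, hi in ranges):
            return i
    raise ValueError("hash out of range")
```

On an 8-bit ring with IDs (10, 40, 200), hash 5 falls in the wrap-around part of the last station's segment.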
Routing function ψ employs hash function ξ in order to compute the destination station for a packet. The algorithm is deployed on every station and always behaves the same:

Algorithm 1 Routing a packet in the ring
Require: Ring initialized
Require: Station si has ID hi = ξ(si)
Require: Station si has associated hash segment Ξ(si)
Require: Station si has associated leaf set Λ(si)
1: function ψ(p ∈ P)
2:     hp ← ξ(p)
3:     Λ ← ∅
4:     if hp ≥ hi then
5:         Λ ← (ΛU(si) \ {s_i^+}) ∪ {si}
6:     else
7:         Λ ← ΛL(si)
8:     end if
9:     for s ∈ Λ do
10:        if h_s ≤ hp ≤ h_s^M then          ▷ Since Ξ(s) = [h_s, h_s^M] ⊂ H
11:            return s
12:        end if
13:    end for
14:    if hp ≥ hi then                       ▷ Having that Λ(si) = (s_i^-, …, si, …, s_i^+)
15:        return s_i^+                      ▷ The packet either belongs to s_i^+ or further
16:    else
17:        return s_i^-                      ▷ The packet belongs to a station preceding si
18:    end if
19: end function
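A simplified rendition of algorithm 1 for leaf radius r = 1 may look as follows. It is a sketch, not the thesis' implementation: each station first checks its own segment and those of its two neighbours, then forwards the packet towards its hash, mirroring the fallback returns of lines 14-18; all names are invented.

```python
def seg_contains(ids, h_max, j, h):
    """True when hash h falls in station j's hash segment (equation 2.5)."""
    n = len(ids)
    if j < n - 1:
        return ids[j] <= h < ids[j + 1]
    return ids[j] <= h <= h_max or 0 <= h < ids[0]   # last segment wraps

def psi_step(ids, h_max, i, hp):
    """One application of ψ at station i (r = 1): the station the packet
    moves to next, or i itself when the packet stays here."""
    n = len(ids)
    for j in (i, (i + 1) % n, (i - 1) % n):   # self, successor, predecessor
        if seg_contains(ids, h_max, j, hp):
            return j
    # hp is in no segment known to station i: hop towards it and retry.
    return (i + 1) % n if hp >= ids[i] else (i - 1) % n

def route(ids, h_max, entry, hp):
    """Iterate ψ, as in equation 2.6, until the destination is reached."""
    i = entry
    while not seg_contains(ids, h_max, i, hp):
        i = psi_step(ids, h_max, i, hp)
    return i
```

Each hop uses only local knowledge (the station's own segment plus its neighbours' IDs), yet the iteration always terminates, as theorem 4 will show.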
We can now provide a better formal description of a ring by introducing its definition:

Definition 11 (Ring). Let Ω be the set of stations, r ∈ ℕ the leaf radius, ξ : · → H ⊆ ℕ the hash function used by each station, and ψ : P → Ω the routing function, based on ξ, used to assign packets to stations. Then we define R = (Ω, r, ξ, ψ) as a fully qualified ring overlay across N = |Ω| stations si ∈ Ω, where packets p ∈ P are routed and delivered to each station via routing function ψ employing hash function ξ.
Given algorithm 1, we have the following result:

Lemma 3. Let R = (Ω, r, ξ, ψ) be a ring; then the following holds:

ψ(p) = si ⟺ ξ(p) ∈ Ξ(si), ∀p ∈ P

Proof. Immediate by considering the first exit point of algorithm 1.
Regarding ψ, we want to describe a few more important aspects:
Definition 12 (Routing). Function ψ will be repeatedly executed for a certain number of
iterations from the moment packet p enters the ring until it finds its destination station. The
routing is over when ψ returns the same station where it is evaluated.
It is now evident that one single application of function ψ does not effectively route the packet to the correct station. It is necessary to perform a certain number of iterations and apply ψ to the same packet at different stations. This scheme generates a recursive condition which we want to make more evident. Let us denote with ψk ∈ Ω the station returned by the k-th application of ψ; the recursive definition is completed by setting the initial condition:

ψ_{k+1} = ψ(ψ_k),   ψ_1 = si   (2.6)
As for every recursive function, we ask ourselves whether the recursive definition in equation 2.6 converges to a value. As per definition 12, we expect function ψ to assume a value at a certain iteration b ∈ ℕ and keep it at every future iteration b + k, k ∈ ℕ. For this reason, it is imperative that the cyclic application of ψ does not lead to an infinite sequence of iterations, which would make the recursive definition generate an alternating sequence.
Theorem 4 (Routing is always successful). Let R = (Ω, r, ξ, ψ) be a ring and p ∈ P a packet entering it from station si ∈ Ω. Let b ∈ ℕ be the number of different applications of routing function ψ, across the different stations of the ring, before p finally reaches its destination. Then b is always bounded: b ≤ B ∈ ℕ.
Proof. We prove this by contradiction, thus assuming ∃p ∈ P : b → ∞.
By analyzing algorithm 1, in order to have an infinite number of iterations, function ψ, when evaluated on station si ∈ Ω, must never return the current station: ψ(p) ≠ si. For such a condition to hold, the following must occur:

∃p ∈ P : hp = ξ(p) ∉ Ξ(si), ∀si ∈ Ω

which translates into:

∃p ∈ P : hp = ξ(p) ∉ [hi, h_i^M], ∀si ∈ Ω

Since we do not know whether si is the last station in the ring, we use h_i^M to indicate the final hash in station si's hash segment.
However, since Ξ(si) = [hi, h_i^M] in the equations above depends only on si, and since those equations hold for all stations, we can consider the totality of the hash segments:

⋃_{s∈Ω} Ξ(s) = ⋃_{k=1}^{N} [hk, h_k^M] = [0, hM]
So we can rewrite the previous equations as:

∃p ∈ P : hp = ξ(p) ∉ [0, hM]

Which is contradictory, as hash function ξ is, by definition, limited in range, ξ(·) ∈ [0, hM]; and since hp is calculated via hash function ξ, it must fall in that range.
Theorem 4 proves that the recursive definition introduced before converges:

Corollary 4.1 (Recursive application of ψ converges). Let R = (Ω, r, ξ, ψ) be a ring and p ∈ P a packet entering it from station si ∈ Ω. Then the recursive term ψk, during the routing of the packet, converges to station sj ∈ Ω after b ∈ ℕ iterations:

lim_{k→∞} ψk = ψb = sj

In order to avoid confusion between the final computed destination station and the intermediate hopping stations calculated by the several iterations of the routing function, from now on we will indicate with the expression si = ψ(p) the station where packet p is stored at the end of the routing process. That is, we consider ψ as returning ψb = si (last iteration) unless otherwise specified.
At this point, we have completed describing and formalizing the storage system.
2.2 Unbalanced ring
We now move forward by analyzing what problems this structure presents in terms of balancing. Even though only the storage system has been covered so far, it is important to point out that, as it is now, the architecture already enables a primitive form of balancing. Packets are, in fact, distributed across different stations, and modern P2P networks are entirely based on this scheme. What kind of balancing do we end up with?
Since routing algorithm ψ employs hash function ξ, the balancing state of the ring depends entirely on ξ. Under static conditions (the ring does not change), the routing of packet p is determined the moment hp = ξ(p) is computed. Routing function ψ is executed many times because the knowledge each station has of the ring is limited, but the number of iterations does not affect the destination station being computed.
The question we want to answer is: “Given hash h ∈ H, what is the probability that, for a random input packet p ∈ P, hash function ξ computed on p returns h: Pr{ξ(p) = h}?”. Since ξ is a cryptographic hash function, it has an interesting property: the probability distribution of the generated hashes is approximately uniform, meaning that:

∀h ∈ H, ∀p ∈ P, Pr{ξ(p) = h} = 1 / (hM + 1)   (2.7)

Since ξ : · → [0, hM]. If we indicate with l ∈ ℕ the number of bits of the hash (hash length), l = ⌈log2 hM⌉, then ξ : · → [0, 2^l − 1] and we can write:

∀h ∈ H, ∀p ∈ P, Pr{ξ(p) = h} = 2^(−l)   (2.8)
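Equation 2.8 can be checked empirically: truncating a cryptographic hash such as SHA-256 to l bits and bucketing many hashes yields counts close to m · 2^(−l) per hash value. A small sketch (the packet naming scheme is arbitrary):

```python
import hashlib
from collections import Counter

def xi(packet: bytes, l: int = 8) -> int:
    """Hash function ξ truncated to l bits, so ξ(·) ∈ [0, 2^l − 1]."""
    return int.from_bytes(hashlib.sha256(packet).digest(), "big") % (1 << l)

# Hash many distinct packets and count how often each hash value shows up.
m, l = 100_000, 8
counts = Counter(xi(f"packet-{k}".encode(), l) for k in range(m))
expected = m / (1 << l)   # m * 2^-l, as per equation 2.8
# Every one of the 2^l hash values appears roughly `expected` times,
# within sampling noise: the distribution is approximately uniform.
```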
What we want to focus on is knowing, in the long run, how many packets each station gets with this configuration. From this knowledge, we will then analytically calculate the information we need about the balancing.
The problem of Packets in Stations is very close (though not entirely equivalent) to another well-known one which we will take into consideration: Balls in Bins⁴. We will see that the scenario of throwing a ball into an area full of bins and assessing in which bin the ball falls is equivalent to producing a random packet, calculating its hash and checking which station it is going to be routed to.
Our analysis starts by identifying the probability for a packet to be routed to one station of the ring:

Theorem 5 (Packet-in-station probability). Let R = (Ω, r, ξ, ψ) be a ring. Then the probability that a packet p ∈ P is routed to station si ∈ Ω is:

π_i^(ξ) = Pr{ψ_ξ(p) = si} = |Ξ(si)| / (1 + hM)

Where |Ξ(si)| is station si's hash segment's length (hash coverage).

Proof. Since each station si owns a specific hash segment Ξ(si) = [hi, h_i^M], the proof is immediate by considering that every hash h ∈ H ≡ [0, hM] has the same probability of being selected, as per equation 2.8. So we just need to multiply that probability by the length of the segment.
We use the expressions π_i^(ξ) and ψ_ξ(·) to indicate, respectively, the probability for a packet to fall into a certain station and the routing function ψ, both when hashing function ξ is employed in the ring. The reason why we want to make the hash function explicit is that, later, we are going to evaluate the same hash-based quantities with a different hashing function and compare results.
Given one station si, the length of its hash segment |Ξ(si)| is an important quantity. We can efficiently formalize its value by using the expression |Ξ(si)| = δ_{i,i+1}, and by defining the following quantity:

Definition 13 (Segment length calculator). Let Ω be the set of stations and let every station si ∈ Ω have an assigned hash segment [hi, h_i^M] ⊂ ℕ such that the hash partitioning is circular, thus last station sN has hash segment [hN, hM] ∪ [0, h1 − 1]. We define the segment length calculator function as the application returning the number of hashes in the segment assigned to one station:

δ_{i,j} = 1_{i,j} · 2^l − (h_{i mod (N+1)} − h_{j mod (N+1)})

Having ∀i, j ∈ ℕ ∧ i, j > 0 and:

1_{i,j} = 1 if i > j, 0 otherwise

Then we can express the packet-in-station probability in theorem 5 as follows:

π_i^(ξ) = Pr{ψ_ξ(p) = si} = 2^(−l) · δ_{i,i+1}   (2.9)
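Definition 13 and equation 2.9 reduce to a few lines of arithmetic. The sketch below uses 1-based station indices and a tiny 8-bit ring as a worked example (the ring values are invented):

```python
def delta(ids, i, l):
    """δ_{i,i+1}: length of station s_i's hash segment (1-based index i).

    For i < N this is h_{i+1} - h_i; the last station's segment wraps
    around zero, so its length is 2^l - h_N + h_1.
    """
    n = len(ids)
    if i < n:
        return ids[i] - ids[i - 1]           # h_{i+1} - h_i
    return (1 << l) - ids[n - 1] + ids[0]    # 2^l - h_N + h_1

def pi_xi(ids, i, l):
    """Packet-in-station probability of equation 2.9: 2^-l * delta_{i,i+1}."""
    return delta(ids, i, l) / (1 << l)
```

On an 8-bit ring with IDs (10, 40, 200), the segment lengths are 30, 160 and 66; they cover the whole hash space, so the probabilities sum to 1.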
Theorem 6 (Packets-in-station probability). Let R = (Ω, r, ξ, ψ) be a ring. Let μ_i = |si| ∈ {0, …, m} be a r.v. counting the number of packets in station si, where m ∈ ℕ represents the number of total packets sent so far to the ring. Then the probability that station si ∈ Ω holds k ∈ ℕ packets is:

Pr{μ_i^(ξ) = k} = C(m, k) · (π_i^(ξ))^k · (1 − π_i^(ξ))^(m−k)

Where C(m, k) = m!/(k!(m−k)!) is the binomial coefficient.

Proof. Immediate. We consider m packets routed into the ring and we want to calculate the probability that k among them were routed to station si. This calls for Bernoulli trials. The probability for a packet to end up in one station is given by theorem 5.

⁴ The problem describes a non-deterministic scenario where balls are thrown onto an area full of bins in a random direction, as described in (Kolchin, 1998).
Thanks to theorem 6, we know the PDF of r.v. μ_i^(ξ) and we are able to calculate how many packets, on average, one station gets:

η_i^(ξ)(m) = E[μ_i^(ξ)] = Σ_{k=0}^{m} k · Pr{μ_i^(ξ) = k} = Σ_{k=0}^{m} k · C(m, k) · (π_i^(ξ))^k · (1 − π_i^(ξ))^(m−k)   (2.10)
As described in (Kolchin, 1998), in the Balls in Bins problem the following holds:

Σ_{k=0}^{m} k · C(m, k) · p^k · (1 − p)^(m−k) = mp, ∀m ∈ ℕ, m > 0, p ∈ [0, 1] ⊆ ℝ

Which allows us to calculate the average load per station in a simpler form:

η_i^(ξ) = m · 2^(−l) · δ_{i,i+1}   (2.11)
Proposition 9 (Unbalanced ring). Let R = (Ω, r, ξ, ψ) be a ring. Given equation 2.11, the network is not balanced. The wider one station's hash segment is, the more packets that station gets:

|Ξ(si)| > |Ξ(sj)| ⟺ η_i^(ξ) > η_j^(ξ), ∀si, sj ∈ Ω

Proposition 9 is very important as it states that the system architecture, so far, achieves a very high level of decentralization, yet fails at balancing stations.
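Proposition 9 can be observed numerically: hashing many packets into a ring with deliberately uneven segments yields loads proportional to the segment widths, matching equation 2.11. A simulation sketch (ring IDs, packet names and counts are invented):

```python
import hashlib

def xi(data: bytes, l: int) -> int:
    """Hash function ξ truncated to l bits."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % (1 << l)

# A small ring with deliberately uneven hash segments (l = 8 bits).
ids, l, m = [10, 40, 200], 8, 30_000

def owner(h):
    """Index of the station owning hash h; the last segment wraps past 0."""
    for i in range(len(ids) - 1):
        if ids[i] <= h < ids[i + 1]:
            return i
    return len(ids) - 1

loads = [0] * len(ids)
for k in range(m):
    loads[owner(xi(f"packet-{k}".encode(), l))] += 1

# Expected loads per equation 2.11: m * 2^-l * delta_{i,i+1}
deltas = [30, 160, 66]
expected = [m * d / (1 << l) for d in deltas]
# The widest segment (delta = 160) hoards most of the packets.
```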
2.3 Balancing the ring
Equation 2.11 is our starting point for taking the current architecture and modifying it in order to reach load balancing. Our ideal model is one in which each station gets the same number of packets:

Definition 14 (Ideal station load). Given a network of N ∈ ℕ different stations, we define the following as the ideal load per station:

ηi = m / N, ∀si ∈ Ω

Where m ∈ ℕ is the total number of packets sent to the network.
Definition 14 points out that our final goal is having our architecture move towards the Balls in Bins model. Our goal can also be expressed by considering the probability that each single bin has of getting a ball:

πi = N^(−1), ∀si ∈ Ω   (2.12)

If we can have equation 2.9 converge to equation 2.12, our goal is reached, since both definition 14 and equation 2.12 describe a uniformly distributed r.v.
2.3.1 Extending the ring
Moving forward towards our goal, as per definition 14, we need to understand why the ring is not balanced. We can identify 2 possible causes:
1. Hash function ξ does not take into account the fact that stations have hash segments of different lengths. It actually assumes that all stations have hash segments of the same size.
2. Stations should have hash segments of the same size.
The 2 problems described above are actually 2 possible explanations of the same issue: with regards to balancing, the network structure and the hash function are not well coupled. For our solution, we choose to accept the standpoint offered by point 1, which blames the hash function rather than the stations.
Our approach is to replace hash function ξ with another one:

Definition 15 (Hash function φ). Let φ : P → [0, φM] ⊂ ℝ be a hash function. By employing function φ, a ring can achieve balancing and each station approximately receives the same number of packets:

η_i^(φ) ≈ ηi = m / N

Where m ∈ ℕ is the total number of packets sent to the network. Also, we still define l as the length (in bits) of the hashes generated by φ: l = ⌈log2 φM⌉.
Definition 15 represents a goal for us. In the next section we are going to design φ so that load balancing in the ring is achieved.
The first thing we notice about φ is that we have designed it to return real numbers, thus the hash space is no longer discrete, but continuous. We will see that the continuous characterization of φ will not be a problem when employed in the ring and in routing algorithm ψ. Furthermore, in our model, we will consider function φ to be used on packets only; we will not use this new hash function to compute the IDs of stations: for them, we will keep using hash function ξ.
Definition 16 (Extended ring). Let Ω be the set of stations and r ∈ N be the leaf radius.
Let ξ : · → H ⊆ N be the underlying hash function and φ : P → [0, φM] ⊂ R be the
balancing hash function used by each station and based on ξ. Let ψ : P → Ω be the routing
function, based on φ, used to assign packets to stations. Then we define R = (Ω, r, ξ, φ, ψ)
as the extended ring overlay where packets are balanced via hash function φ.
Given definition 16, we see that φ does not act as a replacement of ξ, so we will
actually consider the former as an extension of the latter.
Adapting concepts in the extended ring
With hash function φ in place, routing algorithm ψ needs to be slightly changed. Actually, since the whole ring structure is based on hashes, we need to adjust a few definitions so that the extended ring (Ω, r, ξ, φ, ψ) can be properly described.
The most important concept to introduce in the extended architecture is that φ acts transparently with regards to the hash space. Hash function ξ returns hashes in the discrete space H ≡ [0, hM] ⊂ ℕ, while φ generates hashes in the continuous space Φ ≡ [0, φM] ⊂ ℝ. We design φ such that φM = hM; thanks to this, one space contains the other but they are bound by the same extremes: H ⊂ Φ.
This also means that hash segments can be expressed both as enumerable sets and as real intervals. Whether the former or the latter is intended will be specified via set identities or inferred from context.
The next concept to adapt is stations and their IDs. Nothing changes in this regard: every station will keep using hash function ξ to calculate the hash of its address hi ∈ H; however, the important point here is understanding that the hash identifying one station is also contained in the space of φ-hashes: hi ∈ Φ.
The last aspect to cover is routing function ψ. Given the assumptions above, algorithm 1 remains essentially unchanged. What changes is line 2 where, instead of using hash function ξ to hash the packet, hash function φ is used: hp ← φ(p). All other operations remain unchanged because Φ extends H. The direct consequence of this last point is the following:
Lemma 7. Let R = (Ω, r, ξ, φ, ψ) be an extended ring; then the following holds:

ψ(p) = si ⟺ φ(p) ∈ Ξ(si), ∀p ∈ P

Proof. Immediate by considering lemma 3 and the fact that function φ replaces ξ in algorithm 1 at line 2.
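Lemma 7 highlights that only line 2 of algorithm 1 changes. One way to picture this is a routing function parameterized by the hash function, so the normal and extended rings share all the remaining logic (a sketch with invented names; the real-valued φ-hashes compare against integer segment bounds exactly because Φ extends H):

```python
def make_router(hash_fn, ids, h_max):
    """Build a routing function around a pluggable hash function:
    algorithm 1 is unchanged except for line 2, which becomes hp <- phi(p)."""
    n = len(ids)

    def seg_contains(j, h):
        # Station j's hash segment; the last one wraps around zero.
        if j < n - 1:
            return ids[j] <= h < ids[j + 1]
        return ids[j] <= h <= h_max or 0 <= h < ids[0]

    def route(packet):
        hp = hash_fn(packet)   # the only line that differs between rings
        return next(j for j in range(n) if seg_contains(j, hp))
    return route

# route_xi  = make_router(xi, ids, h_max)    (normal ring, integer hashes)
# route_phi = make_router(phi, ids, h_max)   (extended ring, real hashes)
```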
Defining sizing equations
Equation 2.12 describes the PDF of r.v. s ∈ Ω, which represents the bin (station, in an ideally balanced ring) where a ball (packet) falls. As that equation prescribes, if we want the ring balanced, we need to make sure that all stations get the same probability of receiving a packet, which is not the case for a normal ring (Ω, r, ξ, ψ), as per equation 2.9.
So, by employing hash function φ, r.v. sφ ∈ Ω can be defined as the station where a packet falls, assuming the extended ring (Ω, r, ξ, φ, ψ) is in place.
R.v. sφ's PDF is the starting point from which we can commence our sizing effort. Since sφ is continuous, and given lemma 7, the probability that a packet is routed to station si in the extended ring is:

π_i^(φ) = Pr{hi ≤ φ(p) ≤ h_i^M} = ∫_{Ξ(si)} fφ(r) dr   (2.13)
Where fφ : [0, hM] ⊂ ℝ → ℝ is r.v. hφ's PDF, representing the generated φ-hashes, and Ξ(si) ⊆ Φ is station si's continuous hash segment. It is important to notice how this equation relates r.v. sφ ∈ Ω (since its PDF has expression Σ_{k=1}^{N} π_k^(φ) · δ(r − k)⁵) to r.v. hφ ∈ Φ.
Recalling equation 2.12, we basically want π_i^(φ) = πi:

∫_{hi}^{h_i^M} fφ(r) dr = N^(−1), ∀si ∈ Ω   (2.14)
We start from equation 2.14. Our purpose is to design hash function φ's implementation so that this equation holds. This approach will guarantee that r.v. sφ behaves, in distribution, like r.v. s in the Balls in Bins scenario.
2.3.2 Designing hash function φ
Equation 2.14 represents a constraint on fφ. This expression points out an important relationship:

Proposition 10 (Relationship between r.v. sφ and hφ). Given equation 2.14, the effort of designing hash function φ is transferred to r.v. hφ, as its PDF fφ is the subject of such design.
Designing r.v. sφ’s PDF
Of course, we cannot extract function fφ from the integral sign in equation 2.14, so we need to make some assumptions about it.

Definition 17 (Formatting impulse). Let g : ℝ → ℝ be a continuous, domain- and value-bounded function with the following constraints:
1. g(r) ≥ 0, ∀r ∈ ℝ.
2. g is a compact-support⁶ function: ∃r1, r2 ∈ ℝ, r1 < r2 : g(r) = 0, ∀r ∉ [r1, r2].
3. ∃A ∈ ℝ, A > 0 : g(r) ≤ A, ∀r ∈ ℝ.
4. ∫_{−∞}^{+∞} g(r) dr ≤ 1.
5. It is possible to calculate g's antiderivative: ∃G : G′(r) = g(r).

We define g as fφ's formatting impulse: a function used to shape r.v. hφ's PDF and solve equation 2.14. Because of its definition, we use the expression g_{r1,r2,A}(r) to refer to an impulse with amplitude A and definition interval [r1, r2] ⊂ ℝ.
We now want to show an important result concerning definition 17, which will be useful later in this chapter:

Lemma 8 (Impulse antiderivative is invertible). Let g : ℝ → ℝ be a formatting impulse. Then its antiderivative G : ℝ → ℝ is invertible: ∃G^(−1).
Proof. The inverse function theorem⁷ states that a continuously differentiable univariate function with nonzero derivative in a certain interval is therein invertible. In our case, G is the antiderivative of a continuous function, thus it is continuous itself

⁵ Function δ is intended as a generalized function, or distribution: ⟨δ, ϕ⟩ = ϕ(0).
⁶ Compact-support functions are used in distributional calculus.
⁷ As described in (Nijenhuis, 1974), the theorem provides a sufficient condition for a function to be invertible.
FIGURE 2.3: Hash-partitioning of a ring into different segments, one per each station. For each segment, a different impulse is used; its coverage matches the segment's length.
and differentiable by definition of primitive function. Thus we meet the conditions
of invertibility.
The nonzero derivative condition is not met by g’s definition. However this does
not undermine its invertibility: rather, it does not guarantee that the inverse function
is also continuously differentiable.
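Lemma 8 matters operationally: even without a closed form for G^(−1), a monotone antiderivative can be inverted numerically, e.g. by bisection. The sketch below is not part of the thesis' formal development; the rectangular impulse used as the worked example is an assumption.

```python
def invert(G, y, lo, hi, tol=1e-9):
    """Solve G(r) = y by bisection on [lo, hi]; any antiderivative of a
    non-negative impulse g is non-decreasing, so bisection applies."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if G(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Worked example: a rectangular impulse g on [2, 6] with amplitude 0.25;
# its antiderivative is G(r) = 0.25 * clamp(r - 2, 0, 4).
G = lambda r: 0.25 * min(max(r - 2.0, 0.0), 4.0)
r_star = invert(G, 0.5, 2.0, 6.0)   # G(4) = 0.5, so r_star is close to 4
```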
Function fφ’s domain [0, hM] ⊂ R can be partitioned into N different segments:
one hash segment Ξ(si) ≡ [hi, hM
i ] per each station si.
The basic idea is having function fφ employ impulse g to cover the different seg-
ments in the whole hash space [0, hM] ⊂ R as shown in figure 2.3. So, for each hash
segment [hi, hM
i ] ⊂ R, impulse ghi,hM
i ,Ai
is considered and is employed to calculate
fφ’s values falling into that specific segment. For the last segment relative to sN ,
since it crosses the max hash value hM, we need to actually use 2 different impulses:
ghN ,hM,AN,1
and g0,h1,AN,2
. Amplitudes A1, A2, . . . , AN,1, AN,2 are sized quantities and
their values will be calculated later in this chapter.
Definition 18 (Function fφ’s structure). Let R = (Ω, r, ξ, φ, ψ) be an extended ring, let
hφ ∈ [0, hM] ⊂ R be the r.v. representing a φ-hash, then its PDF is formally defined as:
fφ(r) =
N−1
k=1
ghk,hM
k ,Ak
(r) + ghN ,hM,AN,1
(r) + g0,h1,AN,2
(r)
It is important to notice that fφ is not a regular unconstrained function; it is a PDF, thus it must meet certain requirements. Later on, we will verify that those requirements are actually in place. We can now take equation 2.14 and replace fφ with its definition:

∫_{hi}^{h_i^M} fφ(r) dr = ∫_{hi}^{h_i^M} g_{hi,h_i^M,Ai}(r) dr = N^(−1), ∀i = 1 … N − 1   (2.15)
In case the last station sN is considered, then 2.14 becomes:

∫_{Ξ(sN)} fφ(r) dr = ∫_{Ξ(sN)} [g_{hN,hM,A_{N,1}}(r) + g_{0,h1,A_{N,2}}(r)] dr
= ∫_{hN}^{hM} g_{hN,hM,A_{N,1}}(r) dr + ∫_{0}^{h1} g_{0,h1,A_{N,2}}(r) dr = N^(−1)   (2.16)
Since g has an antiderivative as per definition 17, we can proceed further in both equations:

[G_{hi,h_i^M,Ai}(r)]_{hi}^{h_i^M} = N^(−1), ∀i = 1 … N − 1   (2.17)
And:

[G_{hN,hM,A_{N,1}}(r)]_{hN}^{hM} + [G_{0,h1,A_{N,2}}(r)]_{0}^{h1} = N^(−1)   (2.18)
Equations 2.17 and 2.18 are the closed-form constraints we have just derived from equation 2.14.
Remark (Solutions of equations 2.16 and 2.18). Regarding the last station's hash segment, we have to use 2 different impulses whose amplitudes A_{N,1} and A_{N,2} can be sized via equations 2.16 and 2.18. Those equations admit a possibly infinite set of solutions where both amplitudes are interdependent. Among the possible ones, we choose to have each impulse cover half of the target value:

∫_{hN}^{hM} g_{hN,hM,A_{N,1}}(r) dr = [G_{hN,hM,A_{N,1}}(r)]_{hN}^{hM} = 1/(2N)   (2.19)

And:

∫_{0}^{h1} g_{0,h1,A_{N,2}}(r) dr = [G_{0,h1,A_{N,2}}(r)]_{0}^{h1} = 1/(2N)   (2.20)
Remark (Requirements on impulse). In definition 17, we have required formatting impulse g to have an antiderivative G whose definition is known. This assumption is pretty strong but not essential. As the calculations so far show, equations 2.15 and 2.16 can actually be used to size the value of the impulse's amplitude by using an alternative method to exact integration. Throughout the rest of our analysis, equations 2.17 and 2.18 will always be referred to as the preferred amplitude sizing method; however, it will always be implicitly intended that equations 2.15 and 2.16 can replace them.
The process of designing fφ is completed, as the equations above can be used to calculate all impulse amplitudes A1, A2, …, A_{N,1}, A_{N,2}:

Proposition 11 (Defining function fφ). In order to reach load balancing in the ring, the following operations are considered:
1. Formatting impulse g_{hi,h_i^M,Ai} : [hi, h_i^M] ⊂ ℝ → [0, Ai] ⊂ ℝ is defined for each station si ∈ Ω.
2. Function fφ is designed as per definition 18 by summing all impulses together.
3. Function fφ is parametric on the set of impulse amplitudes, hence its input space is ℝ^(N+2): fφ(A1, …, A_{N,1}, A_{N,2}, r) with r, Ak ∈ ℝ, Ak > 0, ∀k = 1 … N.
4. For each impulse, the corresponding antiderivative:

G_{hi,h_i^M,Ai}(r) = ∫_{0}^{r} g_{hi,h_i^M,Ai}(x) dx   (2.21)

is calculated.
5. For each impulse, by means of equations 2.17 and 2.18, the corresponding amplitude Ai is calculated.
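Proposition 11's recipe can be carried out end to end for the simplest admissible impulse shape, the rectangle, where equation 2.17 gives A_i = (1/N)/(h_i^M − hi) and equations 2.19 and 2.20 split the last station's mass in half. A numerical sketch with invented ring values (rectangles are an assumption; any impulse satisfying definition 17 would do):

```python
def build_f_phi(ids, h_max):
    """Assemble f_phi as per definition 18 from rectangular impulses whose
    amplitudes are sized via equations 2.17, 2.19 and 2.20."""
    N = len(ids)
    pieces = []                                        # (lo, hi, amplitude)
    for i in range(N - 1):
        lo, hi = ids[i], ids[i + 1]
        pieces.append((lo, hi, (1 / N) / (hi - lo)))   # equation 2.17
    pieces.append((ids[-1], h_max + 1,                 # equation 2.19
                   (1 / (2 * N)) / (h_max + 1 - ids[-1])))
    pieces.append((0.0, ids[0], (1 / (2 * N)) / ids[0]))  # equation 2.20

    def f_phi(r):
        # Impulse domains do not overlap, so at most one term is nonzero.
        return sum(A for lo, hi, A in pieces if lo <= r < hi)
    return f_phi, pieces

f_phi, pieces = build_f_phi([10.0, 40.0, 200.0], 255.0)
total_area = sum(A * (hi - lo) for lo, hi, A in pieces)
# Each inner segment carries probability 1/N, the last station's two
# pieces carry 1/(2N) each, and the total area is 1: f_phi is a PDF.
```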
Now that we know fφ’s formal definition, we need to verify that such expression
meets the constraints of a PDF:
Theorem 9 (Function fφ is a regular PDF). Let R = (Ω, r, ξ, φ, ψ) be an extended ring
with N = Ω stations and let ghi,hM
i ,Ai
: [hi, hM
i ] ⊂ R → [0, Ai] ⊂ R be the formatting
impulse for each station si ∈ Ω such that equations 2.17 and 2.18 hold. Then function fφ,
as per definition 18, is a regular PDF.
Proof. The 3 basic properties of a PDF must be met:

1. $f_\phi$ is positive and bounded, given the definition of impulse $g$:
\[
0 \le g_{h_i,h_i^M,A_i}(r) \le A_i, \quad \forall i = 1 \dots N, \; r \in \mathbb{R}
\]
2. Given its definition, $f_\phi$'s domain is the union of all the non-overlapping domains of the impulses:
\[
\bigcup_{k=1}^{N-1} [h_k, h_k^M] \cup [h_N, h_M] \cup [0, h_1] \equiv [0, h_M] \subset \mathbb{R}
\]
Thus the function is 0 outside of its definition range:
\[
f_\phi(r) = 0, \; \forall r < 0 \lor r > h_M \implies \lim_{r \to \pm\infty} f_\phi(r) = 0
\]
3. $f_\phi$'s area is unitary because of equations 2.17 and 2.18:
\[
\int_{-\infty}^{+\infty} f_\phi(r)\,dr = \int_0^{h_M} f_\phi(r)\,dr = N \cdot N^{-1} = 1
\]
The following comes as a direct consequence of theorem 9:

Corollary 9.1 (Function $F_\phi$ is a regular CDF). Function $F_\phi : \mathbb{R} \to \mathbb{R}$ has the following form:
\[
F_\phi(r) = \sum_{k=1}^{N-1} G_{h_k,h_k^M,A_k}(r) + G_{h_N,h_M,A_{N,1}}(r) + G_{0,h_1,A_{N,2}}(r) \tag{2.22}
\]
It is a regular CDF, and it is r.v. $h_\phi$'s CDF.
Proof. Immediate by considering r.v. $h_\phi$'s CDF's definition:
\[
\begin{aligned}
F_\phi(r) &= \int_{-\infty}^{r} f_\phi(x)\,dx = \int_{-\infty}^{r} \left[ \sum_{k=1}^{N-1} g_{h_k,h_k^M,A_k}(x) + g_{h_N,h_M,A_{N,1}}(x) + g_{0,h_1,A_{N,2}}(x) \right] dx \\
&= \int_{-\infty}^{r} \sum_{k=1}^{N-1} g_{h_k,h_k^M,A_k}(x)\,dx + \int_{-\infty}^{r} g_{h_N,h_M,A_{N,1}}(x)\,dx + \int_{-\infty}^{r} g_{0,h_1,A_{N,2}}(x)\,dx \\
&= \sum_{k=1}^{N-1} \int_{-\infty}^{r} g_{h_k,h_k^M,A_k}(x)\,dx + \int_{-\infty}^{r} g_{h_N,h_M,A_{N,1}}(x)\,dx + \int_{-\infty}^{r} g_{0,h_1,A_{N,2}}(x)\,dx \\
&= \sum_{k=1}^{N-1} \Big[ G_{h_k,h_k^M,A_k}(x) \Big]_{-\infty}^{r} + \Big[ G_{h_N,h_M,A_{N,1}}(x) \Big]_{-\infty}^{r} + \Big[ G_{0,h_1,A_{N,2}}(x) \Big]_{-\infty}^{r}
\end{aligned}
\]
Where $G_{h_k,h_k^M,A_k}$ is defined according to equation 2.21. Given the impulse's definition, its antiderivative and the binding to hash segments as per definition 18, we know that the following holds:
\[
\lim_{r \to \infty} g_{r_1,r_2,A}(r) = 0 \;\land\; r_1, r_2 \ge 0 \implies \lim_{r \to -\infty} G_{r_1,r_2,A}(r) = 0
\]
Hence, leading to the following result:
\[
\Big[ G_{r_1,r_2,A}(x) \Big]_{-\infty}^{r} = G_{r_1,r_2,A}(r) - \lim_{r \to -\infty} G_{r_1,r_2,A}(r) = G_{r_1,r_2,A}(r), \quad \forall r \in \mathbb{R}
\]
Which leads us to equation 2.22. Finally, theorem 9 proved that $f_\phi(x)$ is a regular PDF, which covers all aspects of the thesis.
Designing r.v. $h_\phi$

Although proposition 11 describes the procedure for calculating $f_\phi$, it does not provide a way to build r.v. $h_\phi$ such that it generates hashes according to that function. This problem is known in the literature as random variable generation⁸. By employing this technique, we are able to get the algorithm for calculating φ-hashes, which allows us to balance load in the ring. For the sake of completeness, the proof of this process is described below:
Theorem 10 (R.v. generation via inverse transform). Let $F : \mathbb{R} \to [0, 1] \subset \mathbb{R}$ be a continuous invertible function meeting the characteristics of a CDF:

1. Bounded in $[0, 1]$: $0 \le F(r) \le 1, \; \forall r \in \mathbb{R}$.
2. $\lim_{r \to -\infty} F(r) = 0$.
3. $\lim_{r \to +\infty} F(r) = 1$.
4. Monotone increasing: $r_1 < r_2 \implies F(r_1) \le F(r_2), \; \forall r_1, r_2 \in \mathbb{R}$.

Let $U \in \mathbb{R}$ be a continuous uniformly distributed r.v. over $[0, 1] \subset \mathbb{R}$. Define $X \in \mathbb{R}$ as a r.v. such that the following transformation holds:
\[
X = F^{-1}(U) \tag{2.23}
\]
Where $F^{-1} : [0, 1] \subset \mathbb{R} \to \mathbb{R}$ denotes $F$'s inverse function. Then $X$ is distributed as $F$:
\[
\Pr\{X \le x\} = F(x), \quad \forall x \in \mathbb{R} \tag{2.24}
\]
Proof. We need to prove that equation 2.23 causes r.v. $X$ to be distributed according to CDF $F$. Starting from equation 2.24, we need to prove the following:
\[
\Pr\left\{ F^{-1}(U) \le x \right\} = F(x), \quad \forall x \in \mathbb{R}
\]
Since $F$ is invertible, the function is both injective and surjective. Also, $F$ is, by hypothesis, a continuous function. Thanks to those 2 conditions, we can apply $F$ to both sides of the inequality under the probability sign:
\[
\Pr\left\{ F\left(F^{-1}(U)\right) \le F(x) \right\} = F(x) \implies \Pr\{U \le F(x)\} = F(x), \quad \forall x \in \mathbb{R}
\]
The sign of the inequality is left unchanged because $F$ is monotone increasing. Let now $a = F(x)$; as $x$ ranges in $\mathbb{R}$, $a$ ranges in $[0, 1]$ because $F$ is a CDF and emits values in that interval. So the previous equation becomes:
\[
\Pr\{U \le a\} = a, \quad \forall a \in [0, 1] \subset \mathbb{R}
\]
⁸ As described in (Haugh, 2004), r.v. transformation via inverse transform is a known technique which makes it possible to define a random variable by using the inverse of its CDF.
But we have that $\Pr\{U \le a\} = F_U(a)$, since $U$ is a continuous uniformly distributed r.v. over $[0, 1]$. The previous equation is actually r.v. $U$'s CDF's formal definition, which proves the thesis.
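Theorem 10 translates directly into code: draw a uniform value and push it through the inverse CDF. The sketch below is purely illustrative; the exponential CDF $F(x) = 1 - e^{-x}$ stands in for $F_\phi$ only because it has a simple closed-form inverse (all names are our own, not from the thesis):

```python
import math
import random

def inverse_transform_sample(f_inv, rng):
    """Draw one sample distributed as F by applying theorem 10: X = F^-1(U)."""
    u = rng.random()  # U ~ Uniform[0, 1]
    return f_inv(u)

# Stand-in CDF: F(x) = 1 - e^(-x), whose inverse is F^-1(u) = -ln(1 - u).
exp_inv_cdf = lambda u: -math.log(1.0 - u)

rng = random.Random(42)
samples = [inverse_transform_sample(exp_inv_cdf, rng) for _ in range(100_000)]

# For an Exponential(1) r.v. the mean is 1; the empirical mean should be close.
mean = sum(samples) / len(samples)
print(round(mean, 2))
```

The same recipe applies verbatim once $F_\phi^{-1}$ is available, as established by lemma 11 below.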
Theorem 10 answers our question about the implementation of hash function φ. When considering a ring $(\Omega, r, \xi, \phi, \psi)$, we can use hash function ξ to build hash function φ in order to reach balancing. Before detailing this process, we need to make sure we meet all the conditions required by the theorem above:

Lemma 11 (Function $F_\phi$ is invertible). Let $f_\phi$ be r.v. $h_\phi$'s PDF, defined as per proposition 11. Let $F_\phi$ be its CDF. Then $F_\phi$ is invertible, and we will indicate its inverse with $F_\phi^{-1}$.
Proof. The proof is almost immediate. We consider $F_\phi$'s definition, as per corollary 9.1, and note that it is basically built up from many different impulse antiderivatives $G_i = G_{h_i,h_i^M,A_i}$, with $i = 1 \dots N$. It is possible to invert the function piecewise; given $F_\phi$'s structure, we can prove its invertibility by proving that each impulse antiderivative $G_i$ is itself invertible. Thanks to lemma 8, every impulse antiderivative is actually invertible, and this proves the thesis.
The process for achieving this result is as follows:

Proposition 12 (Hash function φ's implementation). Let $R = (\Omega, r, \xi, \phi, \psi)$ be an extended ring. In order to build hash function φ, the following operations must be performed at ring initialization time:

1. For each station $s_i \in \Omega$, compute its hash identifier $h_i = \xi(s_i)$ and sort all ID hashes by increasing value: $h_1, h_2, \dots, h_N$.
2. Define a formatting impulse $g(r, r_1, r_2, A)$ to use, parametric in $(r_1, r_2, A) \in \mathbb{R}^3$. It is possible to use one impulse definition for all stations, or a different impulse definition for each.
3. Bind each station's associated impulse $g_{r_1,r_2,A}$ to the station's hash segment $\Xi(s_i) = [h_i, h_i^M]$, thus obtaining an impulse $g_i = g_{h_i,h_i^M,A_i}$ parametric in amplitude $A_i$.
4. Use sizing equations 2.17 and 2.18 to compute, for each impulse $g_i$, the value of amplitude $A_i$ which allows balancing to be achieved, thus obtaining an array of fully qualified impulses (no longer parametric): $g_1, g_2, \dots, g_{N-1}, g_{N,1}, g_{N,2}$.
5. Build PDF function $f_\phi$ as per definition 18.
6. Compute CDF function $F_\phi$ as per corollary 9.1.
7. As per theorem 10, compute φ as:
\[
\phi(\cdot) = F_\phi^{-1}\left( \left( 2^l - 1 \right)^{-1} \cdot \xi(\cdot) \right) \tag{2.25}
\]
Equation 2.25 represents hash function φ's formal definition. We will refer to it as the Balancing Equation.
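When $F_\phi$ has no convenient closed-form inverse, step 7 can still be implemented numerically: since $F_\phi$ is monotone increasing, $F_\phi^{-1}$ can be evaluated by bisection. The sketch below is our own illustration (function names and tolerances are assumptions, not from the thesis); `F` stands for any callable monotone CDF over $[0, h_M]$:

```python
def inverse_cdf(F, target, lo, hi, tol=1e-9):
    """Evaluate F^-1(target) by bisection; F must be monotone increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if F(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def phi(xi_hash, F, l, h_max):
    """Balancing equation 2.25: phi(.) = F^-1( (2^l - 1)^-1 * xi(.) )."""
    u = xi_hash / (2**l - 1)          # normalize the regular hash into [0, 1]
    return inverse_cdf(F, u, 0.0, h_max)

# Toy check with a uniform CDF over [0, h_max]: phi reduces to the identity rescaling.
h_max = 1023.0
F_uniform = lambda r: min(max(r / h_max, 0.0), 1.0)
print(round(phi(511, F_uniform, 10, h_max), 3))  # ≈ 511.0
```

Bisection costs $O(\log(h_M/\text{tol}))$ CDF evaluations per hash, which is why a closed-form piecewise inverse (as in the example of section 2.4) is preferable when available.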
FIGURE 2.4: Hash segments mapped onto φ-segments, illustrating how hash function φ works. The top part of the diagram shows the φ hash-space (the interval [0, 1]): each station is assigned a segment of width 1/N, except station $s_N$, whose coverage is split into two half segments of width 1/(2N) at the two extremes. The bottom part shows the ξ hash-space (the interval $[0, h_M]$ with unevenly sized segments $h_1, h_2 - h_1, \dots, h_N - h_{N-1}, h_M - h_N$).
2.3.3 Understanding how φ works

Equation 2.25 allows us to balance the ring. Before moving on, we would like to point out a few important facts regarding the balancing equation to better understand how it works.

Figure 2.4 clearly demonstrates the basic principle behind φ: the balancing equation essentially maps an unevenly distributed space (regular hashes, that is, ξ hashes) onto an evenly distributed space (φ hashes). The even partitioning of the φ space is based on the number of stations N.

We remark that, given its definition (equation 2.25), hash function φ operates over an interval spanning $[0, 1] \subset \mathbb{R}$. This interval is subdivided into equal segments, each one assigned to a station. We will call them φ-segments, and we will later use the term φ-coverage to indicate the same concept.
2.4 Ring balancing example

For the sake of completeness, we are going to provide an example of how to build hash function φ for a small ring consisting of a few stations. Since our purpose here is to provide a realistic, easy-to-understand scenario, we are going to consider the following conditions:

1. The ring is made of N = 6 stations.
2. The leaf set radius is the minimum: r = 1.
3. We consider extremely short hashes with l = 10 bits. This means that $h_M = 2^{10} - 1 = 1023$.

Remark. Hash function ξ will be considered but not defined, as we are not going to use it explicitly in our calculations.
We will now follow proposition 12’s prescriptions.
Station Hash identifier HS (Ξ(si) ⊂ N) HS (Ξ(si) ⊂ R)
s1 = St. 1 h1 = 101 {101 . . . 209} [101, 210)
s2 = St. 2 h2 = 210 {210 . . . 339} [210, 340)
s3 = St. 3 h3 = 340 {340 . . . 552} [340, 553)
s4 = St. 4 h4 = 553 {553 . . . 700} [553, 701)
s5 = St. 5 h5 = 701 {701 . . . 997} [701, 998)
s6 = St. 6 h6 = 998 {998 . . . 1023} ∪ {0 . . . 100} [998, 1023] ∪ [0, 101)
TABLE 2.1: Showing, in the example, values of hash identifiers and
hash segments for each station.
2.4.1 Defining the ring
We must first define ring R = (Ω, r, ξ, φ, ψ). Remember that hash function φ is the
last quantity we will define.
Each station si ∈ Ω must first define an identifier and compute hash function ξ
on that in order to calculate its hash identifier hi.
As table 2.1 shows, once each station receives its hash identifier, hash segments are defined so that routing is possible in the ring.
2.4.2 Defining the formatting impulse

According to the second point of proposition 12, we must now define the formatting impulse to use in order to achieve balancing. We can choose to either define one impulse type for all stations or a different impulse type for each one of them. We are going to choose the first option for 2 reasons:

1. Choosing one single impulse type is easier from a computational point of view, as it requires formulating the parametric antiderivative only once.
2. Choosing more impulse types has not proved, so far, to be any more beneficial than using a single one. The quality of the balancing is not impacted by this choice⁹.
For the sake of simplicity, we are going to consider a very simple impulse type: the rectangular impulse:
\[
g_{r_1,r_2,A}(r) = A \cdot \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \tag{2.26}
\]
Having:
\[
\Pi(r) =
\begin{cases}
1 & 0 \le r \le 1 \\
0 & \text{otherwise}
\end{cases}
\]
The impulse we have chosen in equation 2.26 complies with definition 17. Since the proof is immediate, we will not cover it.
⁹ This is based on observations from the simulations run so far. However, no proof either supports or denies this theory.
2.4.3 Binding impulses to stations

Now we need to bind every station's HS to an impulse in order to get a collection of impulses, all parametric with respect to their amplitudes:
\[
\begin{aligned}
s_1 &\implies g_1 = g_{101,210,A_1} \\
s_2 &\implies g_2 = g_{210,340,A_2} \\
s_3 &\implies g_3 = g_{340,553,A_3} \\
s_4 &\implies g_4 = g_{553,701,A_4} \\
s_5 &\implies g_5 = g_{701,998,A_5} \\
s_6 &\implies g_{6,1} = g_{998,1023,A_{6,1}}, \; g_{6,2} = g_{0,101,A_{6,2}}
\end{aligned}
\]
Remark (Impulse for the last station). The last station in the ring receives special treatment because it may span two hash intervals, since it includes both the highest hash $h_M$ and the lowest one (the null hash). Thus 2 impulses are actually used.
2.4.4 Calculating amplitudes

At the moment, we have a collection of N + 1 = 7 impulses, all parametric with respect to their amplitudes. We need to find the values of those amplitudes in order to have these impulses define $f_\phi$ in a way that hash function φ can balance the load in the ring.

Equations 2.17 and 2.18 will be used to size those impulses. Since the sizing equations require the computation of the impulse's antiderivative, we need to calculate it first. Given its extremely simple definition, the calculation is almost immediate:
\[
G_{r_1,r_2,A}(r) = \int_0^r g_{r_1,r_2,A}(x)\,dx = \int_0^r A \cdot \Pi\left( \frac{x - r_1}{r_2 - r_1} \right) dx
= A \cdot \left[ \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \cdot (r - r_1) + (r_2 - r_1) \cdot H(r - r_2) \right]
\]
Where $H(r)$ is Heaviside's step function¹⁰:
\[
H(r) =
\begin{cases}
1 & r > 0 \\
0 & \text{otherwise}
\end{cases}
\]
In order to apply equations 2.17 and 2.18, we first need to calculate the following quantity; the binding to hash segments will be done later:
\[
\Big[ G_{r_1,r_2,A} \Big]_{r_1}^{r_2}
= A \cdot \left[ \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \cdot (r - r_1) + (r_2 - r_1) \cdot H(r - r_2) \right]_{r_1}^{r_2}
= A \cdot (r_2 - r_1)
\]
We can now apply equation 2.17 to $g_1 \dots g_5$:
¹⁰ Heaviside's function exists in the literature in different forms; here we consider the variation where the function assumes only 2 values: 0 and 1.
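The rectangular impulse and its antiderivative are simple enough to code up directly. The sketch below is our own illustrative translation of equation 2.26 and the antiderivative above (function names are assumptions); it checks the sizing identity $[G]_{r_1}^{r_2} = A \cdot (r_2 - r_1)$:

```python
def rect(r):
    """Unit rectangle Pi(r): 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= r <= 1.0 else 0.0

def g(r, r1, r2, A):
    """Rectangular formatting impulse: g_{r1,r2,A}(r) = A * Pi((r - r1)/(r2 - r1))."""
    return A * rect((r - r1) / (r2 - r1))

def heaviside(r):
    """Two-valued Heaviside step: 1 for r > 0, else 0."""
    return 1.0 if r > 0 else 0.0

def G(r, r1, r2, A):
    """Antiderivative of g: ramps linearly from 0 up to A * (r2 - r1) across [r1, r2]."""
    return A * (rect((r - r1) / (r2 - r1)) * (r - r1) + (r2 - r1) * heaviside(r - r2))

# Sizing identity used by equations 2.17/2.18, on station s_1's segment [101, 210):
r1, r2, A = 101.0, 210.0, 1.0 / 654.0
print(G(r2, r1, r2, A) - G(r1, r1, r2, A))  # equals A * (r2 - r1) = 1/6
```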
\[
\Big[ G_{h_k,h_k^M,A_k} \Big]_{h_k}^{h_k^M} = N^{-1} \implies A_k = \left( h_k^M - h_k \right)^{-1} \cdot N^{-1}, \quad \forall k = 1 \dots 5
\]
The same goes for $g_{6,1}$ and $g_{6,2}$, where we apply equations 2.19 and 2.20:
\[
\Big[ G_{h_N,h_M,A_{N,1}} \Big]_{h_N}^{h_M} = \frac{1}{2N} \implies A_{N,1} \cdot (h_M - h_N) = \frac{1}{2N} \implies A_{N,1} = \frac{1}{2(h_M - h_N) \cdot N}
\]
\[
\Big[ G_{0,h_1,A_{N,2}} \Big]_{0}^{h_1} = \frac{1}{2N} \implies A_{N,2} \cdot h_1 = \frac{1}{2N} \implies A_{N,2} = \frac{1}{2 h_1 \cdot N}
\]
We now have the values of all the impulse amplitudes:
\[
\begin{aligned}
A_1 &= (h_2 - h_1)^{-1} \cdot N^{-1} = (210 - 101)^{-1} \cdot 6^{-1} = 654^{-1} \\
A_2 &= (h_3 - h_2)^{-1} \cdot N^{-1} = (340 - 210)^{-1} \cdot 6^{-1} = 780^{-1} \\
A_3 &= (h_4 - h_3)^{-1} \cdot N^{-1} = (553 - 340)^{-1} \cdot 6^{-1} = 1278^{-1} \\
A_4 &= (h_5 - h_4)^{-1} \cdot N^{-1} = (701 - 553)^{-1} \cdot 6^{-1} = 888^{-1} \\
A_5 &= (h_6 - h_5)^{-1} \cdot N^{-1} = (998 - 701)^{-1} \cdot 6^{-1} = 1782^{-1} \\
A_{6,1} &= (h_M - h_6)^{-1} \cdot (2N)^{-1} = (1023 - 998)^{-1} \cdot 12^{-1} = 300^{-1} \\
A_{6,2} &= h_1^{-1} \cdot (2N)^{-1} = 101^{-1} \cdot 12^{-1} = 1212^{-1}
\end{aligned}
\]
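These amplitudes can be checked mechanically: each full impulse must contribute an area of 1/N and each of the two half impulses 1/(2N), so the total area of $f_\phi$ must be 1. A minimal sketch of this verification using exact rational arithmetic (variable names are ours):

```python
from fractions import Fraction

N = 6
ids = [101, 210, 340, 553, 701, 998]  # sorted hash identifiers h_1..h_6
h_max = 1023

# Full impulses for s_1..s_5: A_k = 1 / ((h_{k+1} - h_k) * N)
amps = [(ids[k], ids[k + 1], Fraction(1, (ids[k + 1] - ids[k]) * N)) for k in range(5)]
# Two half impulses for s_6, over [h_6, h_M] and [0, h_1), each sized for 1/(2N)
amps.append((ids[5], h_max, Fraction(1, (h_max - ids[5]) * 2 * N)))
amps.append((0, ids[0], Fraction(1, ids[0] * 2 * N)))

print([a for _, _, a in amps])  # 1/654, 1/780, 1/1278, 1/888, 1/1782, 1/300, 1/1212

# Total area of f_phi: sum of amplitude * width over all rectangular impulses
area = sum(a * (r2 - r1) for r1, r2, a in amps)
print(area)  # 1
```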
2.4.5 Computing functions

Now that all impulses have been properly sized and we have their amplitude values, function $f_\phi$ is fully defined. As a direct result, function $F_\phi$ is also fully defined. Given its simplicity, $F_\phi$ can easily be inverted piecewise.
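For the rectangular impulse, $F_\phi$ is piecewise linear, so its piecewise inverse is straightforward. The sketch below is our own illustrative code built on the example's numbers (names and the routing helper are assumptions): it inverts $F_\phi$ segment by segment and applies the balancing equation to uniformly distributed ξ-hashes, after which each station should receive roughly 1/N of the load:

```python
import random

N, h_max = 6, 1023
ids = [101, 210, 340, 553, 701, 998]   # sorted station hash identifiers

# (r1, r2, probability mass) per rectangular impulse, in increasing-r order:
# wrap-around half segment [0, h_1) first, the five full segments, then [h_6, h_M].
pieces = [(0, ids[0], 1 / (2 * N))]
pieces += [(ids[k], ids[k + 1], 1 / N) for k in range(5)]
pieces += [(ids[5], h_max, 1 / (2 * N))]

def F_inv(u):
    """Piecewise-linear inverse of F_phi: locate the piece holding mass u."""
    acc = 0.0
    for r1, r2, mass in pieces:
        if u <= acc + mass:
            return r1 + (u - acc) / mass * (r2 - r1)  # linear interpolation
        acc += mass
    return float(h_max)

def station_of(h):
    """s_i owns [h_i, h_{i+1}); s_6 owns [h_6, h_M] plus the wrap-around [0, h_1)."""
    for i in reversed(range(6)):
        if h >= ids[i]:
            return i + 1 if i < 5 else 6
    return 6

rng = random.Random(7)
loads = {s: 0 for s in range(1, 7)}
for _ in range(60_000):
    xi_hash = rng.randrange(2**10)      # a uniformly distributed xi-hash
    h = F_inv(xi_hash / (2**10 - 1))    # balancing equation 2.25
    loads[station_of(h)] += 1
print(loads)  # each station receives close to 60_000 / 6 = 10_000 packets
```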
Chapter 3

Simulation results

In this chapter we describe the simulations which were performed in order to validate, from a practical standpoint, all the results analytically achieved in chapter 2.

As a generic overview, two different simulation systems were designed and developed:

• Regular simulations. A high-level engine developed in Matlab¹ and Mathematica², targeting small-size simulations in order to produce real data to validate the whole system.

• High performance simulations. A low-level engine developed in C/C++, targeting large-size simulations in order to produce high-fidelity data to validate the system in real-life conditions.

Both solutions were used to prove that all the analytical results in chapter 2 do provide a valid description of the system's behaviour.
3.1 Small-size simulations

This simulation engine was developed to generate results in the context of a controlled environment where conditions are similar to those in real life. The main features of this simulation set are:

• Functional definition of impulses and functions.
• Real hashes are calculated using the standard Crypto³ library.
• All big integers are normalized into a smaller interval.

Of course, given its nature, the engine comes with some limitations and downsides too:

• Even though real hashes are used, their values are normalized to fit a smaller interval. Thus, these values cannot be considered high fidelity.
• Simulations are slow. Given the application of functional calculus, impulse functions and their antiderivatives are defined in open form, thus requiring numerical integration to be performed every time.
¹ Mathworks Matlab: https://www.mathworks.com/products/matlab.html.
² Wolfram Mathematica: https://www.wolfram.com/mathematica/.
³ The OpenSSL library was used to compute regular hashes. More information is available in appendix A.
FIGURE 3.1: Showing the Polar Hash Coverage Plot (PHCP) of a simulation on an N = 10 station ring after sending m = 10³ packets (left panel: "PHCP non-balanced case"; right panel: "PHCP in balanced case"). Both plots show the configuration of the station hash segments together with the final load levels at the end of the simulation. The plot on the left refers to a normal ring (hash function ξ applied); the one on the right refers to an extended ring where hash function φ, based on the same ξ, is considered. The same packets were sent in both rings.
• Given the different subsystems being used, numerical accuracy is not guaranteed.

On the one hand, this set of simulations is characterized by a relatively easy implementation; on the other, it comes with certain intrinsic limitations (mainly related to the subsystems being used). The other set of simulations is meant to target those issues and provide better numerical fidelity.
3.1.1 Verifying load balance

One of the most basic sets of simulations is used to verify that the algorithm effectively helps the ring achieve load balance given different hash segment distributions among stations in the hash range interval $[0, h_M]$. These simulations perform the following operations:

1. The hash space is divided into N random parts, each assigned to one station.
2. A total of m packets (random numeric vectors) are generated and fed to the hash function, which can be either ξ or φ (both are considered in order to compare per-station loads at the end of one simulation).
3. Packets are assigned to stations according to routing function ψ, based on the selected hash function.
4. Final results are collected: the number of packets per station is tracked.
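The four steps above can be sketched as follows. This is an illustrative stand-in, not the thesis engine: SHA-256 truncated to l bits plays the role of ξ, and routing simply assigns each hash to the station owning the segment it falls in (all names are our own assumptions):

```python
import hashlib
import random

l = 16                      # hash length in bits (illustrative choice)
h_max = 2**l - 1
N, m = 6, 10_000

rng = random.Random(0)
# Step 1: split the hash space into N random parts (segment start points).
starts = sorted(rng.randrange(h_max) for _ in range(N))

def xi(packet: bytes) -> int:
    """Stand-in for hash function xi: SHA-256 truncated to l bits."""
    return int.from_bytes(hashlib.sha256(packet).digest(), "big") % (h_max + 1)

def route(h: int) -> int:
    """Step 3: station k owns [starts[k], starts[k+1]); hashes below starts[0] wrap."""
    for k in reversed(range(N)):
        if h >= starts[k]:
            return k
    return N - 1  # wrap-around: lowest hashes belong to the last station

# Steps 2 and 4: generate packets, hash, route, and track the load per station.
loads = [0] * N
for _ in range(m):
    packet = rng.randbytes(32)
    loads[route(xi(packet))] += 1
print(sum(loads))  # all m packets were routed
```

Running the same loop with φ in place of ξ (same seed, hence the same packets) yields the side-by-side comparison described in the text.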
FIGURE 3.2: Showing the load state $|s_k|$ (in blue) of each station $s_k$ (St. 1 through St. 9) as time grows. In this simulation, hash function ξ is used (normal ring). The green line shows the expected (uniform) load state for each point in time.
Definition 19 (Polar Hash Coverage Plot (PHCP)). Let $R = (\Omega, r, \xi, \phi, \psi)$ be a ring with $|\Omega| = N$ stations. The Polar Hash Coverage Plot is a set of N vectors in the 2D space:
\[
E = \left\{ A_k \cdot e^{i \omega_k}, \; k = 1 \dots N \right\}
\]
Each vector's amplitude indicates the packet load of the station it refers to, while its phase indicates the station's hash-segment amplitude and position in the ring:
\[
A_k = \frac{\eta_k}{m} \qquad \omega_k = \frac{|\Xi(s_k)|}{h_M} + \omega_{k,0}
\]
Where $\omega_{k,0} = \sum_{j=0}^{k-1} \omega_j$ indicates the phase shift due to all stations preceding $s_k$.

Figure 3.1 shows the PHCP of the same simulation in which the same m packets have been sent to the network, with and without load-balancing hash function φ in place. As it is possible to see, the vectors in the second plot (on the right) have roughly equal amplitudes, in contrast with the first diagram (on the left), indicating that hash function φ is effectively able to provide balancing on the same set of packets across the stations in the ring.
3.1.2 Evaluating load levels per station

Another set of simulations is used to measure the difference between the final load state of each station and the expected (uniform) one after the network has been fed a certain number of packets.
FIGURE 3.3: Showing the load state $|s_k|$ (in blue) of each station $s_k$ (St. 1 through St. 9) as time grows. In this simulation set (same as in figure 3.1), hash function φ is used (extended ring). The green line shows the expected (uniform) load state for each point in time.
These simulations also have the objective of showing how the network behaves with and without balancing hash function φ in place. These normal vs. extended ring scenarios are important as they allow us to visually assess the work done by φ in reshaping the load distribution in the network. For this analysis to be effective, it is crucial that both scenarios are evaluated on the exact same set of generated packets. To guarantee this condition, when randomly generating packets, the same seed is used when evaluating the normal and the extended ring during one simulation session.

Figures 3.2 and 3.3 show, respectively, the same simulation session first conducted on the ring and then again on the same ring but extended (φ in place). As it is possible to see, each station reports its load level as time grows. In the normal ring, station load levels do not all meet the expected load level $\eta = \frac{m}{N}$. On the other hand, when φ is in place (figure 3.3), all station loads tend to match the expected levels.

Remark (Discrete time). This set of simulations is very important as the load state in each station is evaluated over the whole duration. In this context, time is considered discrete and time instants are associated to events. The only event being considered here is the generation of a random packet.
3.2 Large-size simulations

This simulation engine was developed for two reasons: getting high-fidelity simulation data, and providing an initial implementation of the algorithm. As a direct result, we could deliver the first implementation of the algorithm described in the previous chapter. The main features of this system are the following:

• Being developed in C/C++, the application is very fast at computing regular hashes and performing φ-hash processing.
• Simulations can be run sequentially or in parallel (packet generation).
• The standard Crypto library is used, therefore all generated hashes are real hashes and not simulated quantities.
• Big integers are employed, so no scaling is performed to adapt real data to simulation artifacts, hence providing more fidelity to real scenarios.
Simulation flow. In the context of this simulation effort, several computation- and memory-intensive runs were scheduled on a dedicated pool of servers. A detailed description of the infrastructure being used is available in appendix A; here we provide a brief synopsis of how these simulations work:

1. When the engine starts, an initialization phase ensures memory and other preconditions.
2. Random packets are generated, as random bitstreams of specific size. Different sizes can be specified, and during one simulation the size can range in a certain interval.
3. Hashes (using ξ and φ) are computed.
4. Routing of packets is performed for each hash by using application ψ.
5. All results are persisted in memory. Data manipulation is then performed in order to extract the information of interest.
6. Simulation output files are generated.
7. Post-processing is performed by generating diagrams and aggregated quantities from the output files.
3.2.1 Overview

Many simulations have been run, all targeting different network structures and conditions. Before showing results, we need to provide a synopsis of which configurations were considered, in order to understand what was actually simulated. Every simulation run is characterized by the following properties:

• Number of stations in the ring: N. This parameter directly impacts the size of the network.
• Number of generated packets: m.
• Leaf radius r. For all simulations, the radius is unitary: r = 1.
• Packet size S ∈ {100Kb, 1Mb, 3Mb, 10Mb}.

Since the same configuration can be run multiple times with different seed values, aggregate properties describing one simulation group/batch include:

• Number of simulations in the batch: $C \in \mathbb{N}$.
• Overall simulation time of the batch: $T \in \mathbb{R}$.
Grouping simulations. Simulations conducted in the context of this research can be classified using the parameters described above. The following batches were run:
[Diagram: a grid of simulation-batch tiles, one per configuration. The batches include: runs with S = 100Kb, m = 13.1M, N = 10; runs with S = 1Mb, m = 13.1M, N = 10; parallel runs with S = 1Mb, m = 89M, N = 30; parallel runs with S = 3Mb, m = 89M, N = 30; a parallel run with S = 10Mb, m = 89M, N = 30; parallel runs with S = 1Mb, m = 10M, N = 50; and parallel runs with S = 1Mb, m = 10M, N = 100.]
The diagram above illustrates the different configurations used to run the simulations. Each tile reports: the hash functions computed (ξ, φ, top-left corner), a seed marker (#, bottom-left corner), the packet size S, the number of generated packets m, and the number of stations N; tiles carrying the PAR badge denote parallel runs, and tiles are classified as regular, parallel, intensive, or parallel intensive. For each simulation, a different seed was used (thus the #-symbol in the bottom-left corner) and both regular and φ-hashes were computed (top-left corner).
FIGURE 3.4: Plotting the standard deviation σ vs the dispersion factor σ²/η of generated ξ-hashes (top row) and φ-hashes (bottom row) during simulation batches (from left to right): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).
3.2.2 Evaluating the variance of hash segment amplitudes

Two pieces of information were of interest and, accordingly, two different types of data were extracted from every simulation:

1. The statistical variation of regular hash values and φ-hash values, in order to see whether patterns exist.
2. The statistical relation between the distribution of hash segment amplitudes and the distribution of φ-hash values. Since more φ-hashes are routed into a specific segment if that segment has a small amplitude, we want to assess whether special patterns arise in φ-hash values in case of high variance in segment amplitudes.

Figure 3.4 reports possible patterns between the variations of regular and φ-hashes. In general, we can conclude that φ-hashes have a more localized behaviour, as their variations are more contained than those of regular hashes computed via hash function ξ.

This is expected: if we consider the whole hash space $[0, h_M] \subset \mathbb{R}$, hash function ξ has a uniform distribution over that range; on the other hand, φ is characterized by a distribution which allocates hashes with different probabilities in different sub-intervals of the overall hash range. This last observation is the main reason why we want to investigate the relation between the variance of segment lengths and the variance of φ-hashes.

Hashes and segment amplitudes. As anticipated, the following questions were of interest with regards to the behaviour of φ-hashes and the distribution of hash segment lengths $|\Xi(s_i)|, \forall s_i \in \Omega$:

1. If all stations in the ring are arranged in a way such that the distribution of hash segment lengths is approximately uniform, what behaviour should we expect from φ-hashes?
FIGURE 3.5: Plotting the standard deviation of hash segment lengths (HS amplitudes) against the standard deviation of φ-hashes during each simulation in the batches (from top to bottom): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).
2. If all stations in the ring define very different hash segments (some very wide and some very short), what behaviour should we expect from φ-hashes?

The diagrams in figure 3.5 try to capture such behaviour and describe it from a statistical point of view. Both questions raised above can be mathematically mapped onto one statistical descriptor which, therefore, becomes of high interest in this context: the standard deviation of hash segment amplitudes and of φ-hashes.

By looking at those diagrams, we can assess a very weak trend: the variance of φ-hashes tends to be higher the higher the variance of hash segment lengths is. As pointed out, this is classifiable as a pattern, but only in a very prudent way, as the trend is not immediately evident and there are some cases where such a trend
FIGURE 3.6: Plotting station loads $\eta_k^{(\xi)}$ (no balancing) and $\eta_k^{(\phi)}$ (balanced ring) at the end of four N = 30 simulations with different seeds.
does not show up. Our conclusion is that a correlation between segment lengths $|\Xi(s_i)|$ and hashes $h_\phi$ is probably present; however, more variables are involved and more investigation in this regard is necessary.

3.2.3 Evaluating load levels per station

This set of high-performance simulations has been used, of course, to verify the quality of the balancing performed by hash function φ.

Figure 3.6 shows station loads in the context of four different simulations with 30 stations. As it is possible to see, packets are balanced across stations, and the balancing is evident when comparing loads to simulations where no balancing is performed.
Migration flows

An extremely important concept in the context of these simulations and, more generally, of this research effort, is the following:

Definition 20 (Migration flow ξ-φ). Let $R = (\Omega, r, \xi, \phi, \psi)$ be a ring and $p \in P$ a packet. Let $s_i = \psi^{(\xi)}(p)$ be the station the packet is routed to when using hash function ξ, and
FIGURE 3.7: Migration flows in an N = 30 ring (stations S0 through S29).
let sj = ψ(φ)(p) be the station where the packet is routed to by using hash function φ. The
virtual transition that packet p experiences from si to sj is called transition flow.
Definition 20 is the foundation of a differential analysis conducted during all
simulations. By collecting all hashes and mapping them to stations, it is possible, at
the end of a simulation, to extract all migration flows. This identifies as many flows
as generated packets; the final step is aggregating this information and counting
duplicates.
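The aggregation step just described can be sketched as follows; the station indices and the per-packet (ξ-station, φ-station) pairs are placeholders for what a simulation run would record:

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// A migration flow is the pair (xi-selected station, phi-selected station)
// that a packet virtually experiences when routing switches from xi to phi.
using Flow = std::pair<std::size_t, std::size_t>;

// Aggregate per-packet flows into a count per (si, sj) pair: the data
// behind a diagram like figure 3.7.
std::map<Flow, std::size_t> aggregate_flows(const std::vector<Flow>& per_packet) {
    std::map<Flow, std::size_t> counts;
    for (const Flow& f : per_packet) ++counts[f];  // count duplicates
    return counts;
}
```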
Visualizing migration flows is difficult using tables, thus circo-diagrams4 are
employed instead. Figure 3.7 shows migration flows for a simulation with 30 stations
and 1M generated packets. The diagram clearly describes how packets virtually
move from one station to another when hash function φ is used for routing.
Thanks to these diagrams, it is possible to state the following:
Proposition 13 (Packet migrations). Let R = (Ω, r, ξ, φ, ψ) be a ring, then stations be-
have in two different ways:
• Wide-coverage stations are more likely to donate packets to other stations.
• Narrow-coverage stations are more likely to accept packets from other stations.
4 Circo-diagrams have been generated using the software Circos: http://circos.ca/. To read
these diagrams, see the on-line documentation.
It is also worth noticing that all migration flows are localized to adjacent stations
when considering one node in the ring. This pattern is interesting because, contrary
to expectations, the re-arrangement performed by φ does not move packets far from
their ξ-selected station.
Chapter 4
System API
In this chapter we want to provide a description of the different interactions the
system exposes to the end user for storing and retrieving data, and what protocols
are used in the ring to ensure those services.
In chapter 1 we covered the system architecture. As we recall, the end user
is able to interact with the storage system in order to take advantage of its services;
what happens on the other side, in the ring, is not known to them. The questions
we want to answer are: "What happens when the user sends data to be
stored?" and "How can the user retrieve data they previously stored?".
The overall system exposes a minimal API consisting of 4 primitives:
1. Store By means of this functionality, the user can transmit a DU and have it
persisted in the system. Typically, upon invoking this API, the user receives
some data in return, a token, which will be used later for retrieving that same
data.
2. Retrieve By invoking this API, the user can retrieve data previously stored in
the system. If the operation is successful, the user receives his data in return.
3. Remove The user can remove data they previously stored by invoking this
primitive. No data is returned except a status code indicating whether the
removal was successful, plus optional additional information (e.g. the total
amount of data deleted).
4. Update This functionality is used to update existing data. It effectively
results in the sequential application of a remove and a store invocation.
We are going to examine the first 2 primitives, store and retrieve, in detail, as
the others are largely based on them.
4.1 Storing data
The API for transmitting a DU and having it persisted in the system requires the user
to provide the byte stream as input. An identification token is returned if the call is
successful:

t_token store(stream_t& input)
The moment the user invokes store on DU p ∈ P, 2 things happen in sequence:
1. The token is computed by calculating the hash of the input packet h = ξ(p).
2. DU p’s size is compared with fragmentation threshold c ∈ N. If p exceeds
the threshold, |p| > c, the packet is fragmented into smaller units.
The token is returned to the user in case the storage process is successful. Given
the DHT and content addressing, it is possible to retrieve the DU later by using that
specific hash.
4.1.1 Packet fragmentation
The fragmentation process is necessary for several important reasons:
• The ring has a high level of control traffic. Given the DHT and routing al-
gorithm ψ, many transmissions occur between contiguous stations in the net-
work. In order to reduce the latency of communications, the network tends to
favour quantity over size, thus allowing many packets to be exchanged as long
as their size is small enough.
• When a packet reaches a station which is not its final destination, another
routing iteration is necessary. This means another communication must be
performed with one of the contiguous stations in the ring; however, if that
link is in use, because another packet is being transmitted, the incoming one
must be queued. To reduce the time packets spend traversing stations while
hopping (because of routing), packet size is kept reasonably low.
• By dealing with small DUs, it is possible to ensure better balancing over time.
If data were stored without being broken down into smaller pieces, units of
differing sizes would end up stored across stations; this goes against one of
the assumptions of our balancing algorithm: all data units have the same size.
As a packet is submitted for storage, the fragmentation process breaks it down
into n smaller units:
n = ⌈|p| / c⌉
In case the original DU is fragmented into smaller units, the final returned token
is still the hash of the original packet. Later on we will see that, by using the same
token, all fragments can be retrieved back. For this to happen, it is necessary to
create fragments in a specific way:
Definition 21 (Packet fragmentation). Given packet p ∈ P and fragmentation thresh-
old c ∈ N (number of bytes), then application ζ : P → 2P returns the set of fragments
{p1, p2, . . . , pn} given input p. Every returned fragment has the following format:
1. The hash of the original packet.
2. The sequence number of the fragment (needed when re-constructing the packet).
3. The hash of this fragment.
4. The data stream (up to c bytes).
As shown in figure 4.1.
FIGURE 4.1: Data unit format: control info [ξ(p) | Seq. k | ξ(pk) | …] followed by up to c bytes of data.
Remark. The frame format for non-fragmented packets, called whole packets, is the same;
however, the first field (parent hash) is null (all zeros) and the sequence number is −1, the
value used to distinguish whole packets from fragments.
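Definition 21 and the remark above can be sketched as follows; `std::hash` stands in for the cryptographic hash ξ, and all type names are illustrative assumptions, not the thesis implementation:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Sketch of fragmentation zeta : P -> 2^P from definition 21.
struct Fragment {
    std::uint64_t parent_hash; // xi(p); 0 (null) for whole packets
    int           seq;         // fragment sequence number; -1 for whole packets
    std::uint64_t hash;        // xi(p_k), this fragment's own hash
    std::string   data;        // up to c bytes of payload
};

// Placeholder for xi; a real system would use a cryptographic hash.
std::uint64_t xi(const std::string& s) { return std::hash<std::string>{}(s); }

std::vector<Fragment> fragment(const std::string& p, std::size_t c) {
    std::vector<Fragment> out;
    if (p.size() <= c) {
        // Whole packet: null parent hash and sequence number -1.
        out.push_back({0, -1, xi(p), p});
        return out;
    }
    std::uint64_t hp = xi(p);  // parent hash shared by all fragments
    int k = 0;
    for (std::size_t off = 0; off < p.size(); off += c) {
        std::string chunk = p.substr(off, c);  // at most c bytes
        out.push_back({hp, k++, xi(chunk), chunk});
    }
    return out;
}
```

The number of produced fragments matches n = ⌈|p|/c⌉.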
4.1.2 Routing
After the fragmentation phase, which produces no fragments if the original packet's
size does not exceed threshold c, every fragment pk ∈ P is sent to the ring to be
routed by routing function ψ based on balancing hash function φ.
As anticipated in chapter 1, the ring is never directly accessed by users; proxies
are employed instead. A proxy station serves as an intermediary, balancing access
to the ring and hiding all the ring's stations from the outside world. When the
store primitive is invoked, the system sends, from the user's computer, a store
request SReq to one of the known available proxies. On receiving a request, the
proxy decides which station of the ring to pick for letting SReq enter the network.
The decision is made by a balancing algorithm based on each known station's link
usage: the proxy's goal is to avoid overloading any one station with incoming
traffic.
Once the request reaches one of the stations, algorithm 1 will do the job and
guarantee that a station is found for the packet. The system client ensures that all
packets undergo the same process. If every fragment is successfully stored, the orig-
inal packet’s hash is returned to the user as a token for retrieving all fragments.
Asynchronous communications For performance reasons, the best approach is
to make every communication asynchronous. This means that when a node (one of
the stations, a proxy or the user client node) sends an SReq, it does not keep the
connection open, waiting for the final response, until the packet is successfully
routed; it is much better to send the request as a datagram transmission. The sender
will receive a store response SRes when its request has been processed. Every
intermediate node that forwards the request waits for the corresponding response
and, after receiving it, constructs its own response for the node that sent it the
request in the first place. This improves bandwidth utilization, as links are not
held for long times.
Employing asynchronous transmissions complicates the communication protocol
but allows better performance. One complication is the timers every station has to
implement to raise an error when a response is not delivered within a reasonable
time (request transmission failure). In a synchronous scheme, timers are handled
by the transmission protocol (e.g. TCP/IP) transparently to the caller; in
asynchronous scenarios, the station has to implement a timer on its own for each
sent request. Figure 4.2 shows both
FIGURE 4.2: Synchronous vs. asynchronous communication model when storing a single packet (User, Proxy, Entry station and Dst station exchanging SReq/SRes messages).
communication schemes.
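The per-request timers just described might be tracked as in this sketch; the class and method names are illustrative assumptions, not part of the actual WCF implementation:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>

using Clock = std::chrono::steady_clock;

// Each outgoing SReq is remembered with a deadline; requests whose response
// has not arrived by the deadline are reported as transmission failures.
class PendingRequests {
public:
    void on_sent(std::uint64_t req_id, Clock::time_point now,
                 std::chrono::milliseconds timeout) {
        deadlines_[req_id] = now + timeout;
    }
    // A matching SRes arrived in time: stop tracking the request.
    void on_response(std::uint64_t req_id) { deadlines_.erase(req_id); }

    // Returns how many tracked requests timed out (and forgets them).
    std::size_t expire(Clock::time_point now) {
        std::size_t failed = 0;
        for (auto it = deadlines_.begin(); it != deadlines_.end();) {
            if (it->second <= now) { it = deadlines_.erase(it); ++failed; }
            else ++it;
        }
        return failed;
    }
private:
    std::map<std::uint64_t, Clock::time_point> deadlines_;
};
```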
As part of the effort in writing tests and simulations of the algorithm, an actual
implementation of the ring has been developed in Microsoft .NET using the
communication library WCF1. Today it is possible to implement asynchronous
transmissions quite easily, as the IT industry has moved in that direction,
providing developers with the APIs required to implement such protocols.
4.2 Retrieving data
The other side of the story, a little more complicated, is about getting data back. We
are going to cover this topic by considering the 2 possible scenarios here:
1. Retrieving a whole packet.
2. Retrieving a fragmented packet.
In both cases, the process always starts with the same set of operations: the user
has the token received when storing the data in the past and uses it to retrieve
that stream back via the retrieve primitive:
1 Microsoft's Windows Communication Foundation: a library providing a collection of highly
customizable and flexible network protocols.
FIGURE 4.3: Packet info format: [ξ(p) | Total n | ξ(p1) | … | ξ(pn)].

stream_t& retrieve(t_token t)
Retrieving a whole packet As soon as the user invokes the retrieve primitive,
through the proxy, a retrieve request RReq message is built and routed in the ring.
The token is the hash of the original DU, so, by following DHT retrieval, the
request is routed to the destination station. Once there, the station searches its
database to find the stored stream.
In order to have great performance in the packet search process, a dictionary can
be used inside every station. Since DUs are saved according to the format shown in
figure 4.1, the hash of the stream is always available and can be used for looking up
that specific packet when a RReq is routed to a station.
As soon as the stream is retrieved, it can be sent back to the request originator:
the end user, who will receive the DU in return from its retrieve call.
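The per-station dictionary suggested above could look like this minimal sketch; `StationStore` and its types are hypothetical names, not the thesis implementation:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Per-station index keyed by the stream's hash, as suggested for the
// whole-packet retrieval path.
class StationStore {
public:
    void store(std::uint64_t hash, std::string stream) {
        by_hash_[hash] = std::move(stream);
    }
    // Lookup used when an RReq is routed to this station; empty if the
    // hash is not covered here.
    std::optional<std::string> retrieve(std::uint64_t hash) const {
        auto it = by_hash_.find(hash);
        if (it == by_hash_.end()) return std::nullopt;
        return it->second;
    }
private:
    std::unordered_map<std::uint64_t, std::string> by_hash_;
};
```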
Retrieving a fragmented packet If the packet was fragmented when it was
stored in the ring, a problem occurs. When the request is sent and reaches
the destination station based on the token (the original packet's hash), nothing is
found. The original packet has been fragmented and each fragment has a different
hash, completely unrelated to the token (since we use cryptographic hashing, there
is no way to get the original stream from the hash).
One way to solve this issue would be to store all the fragments' hashes in the token,
which would then become an array of hashes and grow in size. Although this
solution might work, it is not desirable: the user should still be able to locate all
fragments with just the original packet's hash. To do so, we need to modify the
store protocol.
After a packet p ∈ P has been fragmented into n units pk ∈ P, before
transmitting them, a packet info unit is constructed:
Definition 22 (Packet info DU). Given packet p ∈ P such that its size exceeds the frag-
menting threshold: |p| > c, a special data unit is built to track information about it and all
its fragments. The stream contains the following fields:
1. The original packet’s hash h = ξ(p).
2. The number of fragments n.
3. The hash of each single fragment pk (k = 1 . . . n) in order (from first to last).
As shown in figure 4.3.
In the revised store protocol, before each fragment is sent to be stored (i.e. before
store is called on each fragment), the same primitive is called on the packet info
DU, built right after computing all hashes (original packet and fragments). This
initial call routes the packet info to a station keyed by the original packet's hash.
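The ordering of the revised store protocol (packet info first, then the fragments) can be sketched as follows; `Du`, `store_fragmented` and the textual packet-info encoding are illustrative assumptions:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A data unit as it is handed to the routed store primitive: a routing key
// (the hash) plus a payload.
struct Du { std::uint64_t key; std::string payload; };

// store_du stands in for the routed store primitive of section 4.1.
template <typename StoreFn>
std::uint64_t store_fragmented(std::uint64_t parent_hash,
                               const std::vector<Du>& fragments,
                               StoreFn store_du) {
    // 1. Build the packet info (definition 22) and store it first, keyed by
    //    the original packet's hash xi(p).
    std::string info = "n=" + std::to_string(fragments.size());
    for (const Du& f : fragments) info += ";" + std::to_string(f.key);
    store_du(Du{parent_hash, info});
    // 2. Store every fragment under its own hash xi(p_k).
    for (const Du& f : fragments) store_du(f);
    return parent_hash;  // the token returned to the user
}
```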
Thanks to this approach, when retrieving a DU, the first RReq will reach the
station where the system will find the packet info. Using that, the system will then
FIGURE 4.4: Sequence diagram showing the retrieval protocol in case of a fragmented packet: the first RReq/RRes round-trip returns the packet info; then, for each fragment pk, a retrieve(token pk) call fetches the fragment, and aggregate(pk) rebuilds packet p.
issue n retrieve calls in order to fetch each fragment. After getting all streams,
the original DU can be rebuilt; the order in which the fragments are combined is
given by the sequence number in each retrieved fragment packet.
As shown in figure 4.4, the process of retrieving and rebuilding a stored packet
might take some time: not only does the ring size influence this latency, but the
number of fragments also plays a significant role. It goes without saying that a
larger packet requires more time to be fully retrieved.
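The reassembly step can be sketched as follows, assuming each retrieved fragment carries its sequence number as in figure 4.1 (types are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Fragments may arrive in any order, so they are sorted by their sequence
// number before concatenation rebuilds the original stream.
std::string reassemble(std::vector<std::pair<int, std::string>> fragments) {
    // Pairs compare by first element (the sequence number) first.
    std::sort(fragments.begin(), fragments.end());
    std::string packet;
    for (const auto& f : fragments) packet += f.second;
    return packet;
}
```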
Chapter 5
Dynamic conditions
In the previous chapters we have described and analyzed the behaviour of the ring
under static conditions.
Definition 23 (Dynamic conditions). Let R = (Ω, r, ξ, φ, ψ) be a ring. We say the net-
work is under dynamic conditions when any of its characterizing elements changes:
1. Stations si ∈ Ω. Stations might disconnect or new stations might extend the ring.
This possibility also covers the event of stations faulting and becoming off-line.
2. Leaf set radius r changes.
3. Any of the connections in the overlay ring changes.
4. Hash function ξ or φ changes.
5. Routing strategy ψ changes.
Static conditions are the opposite of dynamic: the ring does not change and re-
mains the same. So, why do we need to talk about dynamic conditions? Why should
the ring change?
Ideally, if well designed, the system can be configured with a certain number
of stations, a certain radius and work optimally under static conditions. However,
today every system is exposed to dynamic conditions as many different planned or
unplanned events may occur:
1. One station enters a faulty state. This can happen for any reason, like a hardware
issue (e.g. hard disk failure, data corruption, etc.) or a software problem (e.g.
system failure, emergency system reboot, etc.).
2. Stations can experience network issues. This can cause either a permanent
offline state or a temporary one, if machines have a way to automatically recover
from these types of failures.
3. More stations are required because the system needs to serve a higher volume
of data (planned scale-up).
4. One or more stations need to undergo planned or unplanned maintenance.
5. Security related issues force some stations to be pulled away from the ring.
Those enumerated above are only a few possibilities. The point is that a storage
system must take into account such circumstances, which are part of the real world
of connected systems.
When dynamic conditions are in place, the ring structure and the balancing
algorithm described so far need to be revised in order to avoid performance
degradation and, in more critical cases, service outage. We are going to examine
the following dynamic cases:
• Scalability The ability of the ring to grow or shrink in a flexible way causing
the least possible performance degradation.
– Station join A station joins the ring causing it to expand.
– Station removal A station is pulled off the ring, causing it to shrink.
• Fault conditions One station experiences internal problems which cause it to
be unresponsive.
5.1 Scalability
What happens when a station joins the ring? When such an event occurs, there are a
few operations that need to be considered to re-initialize the ring:
1. The new station needs to build its leaf-set in order to identify its successors
and predecessors.
2. All nodes in the neighbourhood of the new station must re-arrange their leaf-
sets in order to update their successors or predecessors depending on the leaf-
set radius r.
3. Balancing hash function φ must be re-designed as now the ring has changed.
Since we have more stations, we have different hash segments and this impacts
function φ’s implementation.
The first 2 operations are infrastructural and can be addressed through well-known
protocols currently employed in DHT-based networks; since the problem is nothing
new, we are not going to spend more time on it. The 3rd point, though, is a
different story, as it poses a new situation inside our network architecture:
stations must be synchronized to use a new balancing hash function φ′.
Lemma 12 (Balancing hash function φ's outdatedness upon ring scaling). Let R =
(Ω, r, ξ, φ, ψ) be a ring with N = |Ω| stations. At any point in time, consider one
station joining R or being pulled out of it, causing the number of stations to become
N′ = N ± 1. Then hash function φ is no longer suited for balancing the ring.
Proof. Immediate by considering proposition 12. According to that, hash function φ
depends on the number of stations in the ring; if that number changes, the hash
segments affecting Fφ's codomain change too, hence the original hash function φ no
longer reflects the state of the network.
The main problem we want to face here is the process of synchronizing stations
in the ring, and it consists of:
1. Computing new balancing hash function φ′.
2. Updating all stations to use new hash function φ′.
3. Rearranging packets across stations to restore the balanced state of the ring.
The last point is actually crucial. We are going to assume that a station joining the
ring comes with no packets stored in it, since any other scenario does not make
sense. When the new station s∗ comes on-line in the ring, the load distribution
changes from:

Σ = (|s1|, |s2|, …, |sN|),  |si| ≈ m · n⁻¹, ∀i = 1 … N

to this form:

Σ′ = (|s1|, |s2|, …, |s∗| = 0, …, |sN|),  |si| ≈ m · n⁻¹, ∀si ∈ Ω \ {s∗}

This implies that the ring is no longer balanced, hence the last point of the
synchronization process introduced earlier, which looks more and more expensive
as we investigate the challenges introduced by these dynamic conditions.
If the operations required to synchronize the ring get too expensive (time-wise),
then the proposed algorithm has a serious scalability issue, as it makes the network
adapt poorly under dynamic conditions. Our purpose is, therefore, to understand
how expensive it actually is to scale the ring.
Given our analysis so far, we have been able to break the scalability issue down
into 2 sub-problems:
1. Updating hash function φ and aligning all stations in the ring to use it.
2. Re-arranging existing stored packets across stations in order to bring the load
distribution in the ring back to its balanced state.
We are going to look at these two problems separately and evaluate the final
performance impact later.
Conjecture 1 (Scaling overall impact). Let R = (Ω, r, ξ, φ, ψ) be a ring experiencing a
scaling process due to one station joining or leaving the network.
• Let τφ ∈ R measure the performance impact (latency) of the process of updating hash
function φ to φ′ on all stations in the ring.
• Let τψ ∈ R measure the performance impact of the process of rearranging packets
among stations in order to take the ring back to its balanced condition.
• Let τS ∈ R measure the overall latency experienced by the system while carrying out
the two operations above in order to scale the ring.
We expect the following relation to hold:

τS ≤ τφ + τψ (5.1)
Conjecture 1 expresses our expectation that the overall performance impact caused
by the two scaling operations need not equal the sum of the latencies introduced
by each of them, as the two operations can be carried out in parallel rather than
sequentially. We try to prove this throughout the rest of this chapter.
FIGURE 5.1: Message format: header [MT | hsrc | ξ(Data) | …] followed by up to c bytes of data.
5.1.1 Updating φ
Hash function φ is a global contract in the network.
Definition 24 (Global contract). A variable or, more generally, a piece of information
shared by all stations in the ring. The main assumption is that all stations keep an exact
copy of the same value.
The protocol we need to design for updating hash function φ to φ′ on all stations
is, more generally speaking, a protocol to update a global contract in the network.
Since a DHT is designed for distributed scenarios, any condition implying a certain
level of centrality lowers the system's performance, and this is the case here. The
PA is based on a distributed approach; however, the balancing process is carried
out through a global contract, hash function φ, which explains why we should
expect this process to be relatively expensive.
Broadcasting in DHT
In order to have a global contract updated, we basically need to broadcast a
message containing the updated contract on the network, because we need to reach
every single node. The message to be sent, in terms of the balancing system's API,
is the PUM (φ Update Message). The cost of updating φ is therefore equal to the
cost of broadcasting a message in the ring.
Since the broadcast occurs in the context of a network overlay, we need to create
a protocol specific for message broadcasting. Generally speaking, we can create a
message format which all transmissions between stations in the ring must comply
to. The message must contain, at least, the following information:
1. Message Type An enumeration indicating the type of communication (e.g.
RReq, RRes, etc.)
2. Source hash The hash of the source station (not strictly needed but nice to have
for performance reasons, as a station receiving a message knows its neighbours
and it is able to generate the hash of their IP addresses).
3. Body hash The hash of field Body. This is used for routing the message (des-
tination hash).
4. Body The content to transmit.
A possible implementation for the broadcasting protocol has to occur at station
level. Since the architecture is distributed, we cannot employ any centralized entity.
Algorithm 2 Message broadcasting in the ring
Require: Ring initialized
Require: Station si has ID hi = ξ(si)
Require: Station si has an associated leaf set Λ(si)
Require: Station si receives packet p ∈ P from station ssrc ∈ Λ(si)
Require: Global variable hp ∈ N is available
Require: Global variable d ∈ {−1, 0, 1} ⊂ N is available and initially set to 0
1: function BROADCAST(p ∈ P)
2:   hsrc ← ξ(ssrc)                    ▷ Actually computed or taken from message
3:   if hsrc < hi then
4:     d∗ ← −1                         ▷ Message from LLS
5:   else
6:     d∗ ← 1                          ▷ Message from ULS
7:   end if
8:   if ξ(p) = hp ∧ d + d∗ = 0 then    ▷ Same message from opposite side of ring
9:     return                          ▷ Abort. End condition reached
10:  else if ξ(p) = hp then            ▷ Duplicate message from same side of ring
11:    return                          ▷ Don't send again
12:  end if
13:  hp ← ξ(p)
14:  d ← d∗
15:  Λ ← ∅
16:  if hsrc < hi then
17:    Λ ← ΛU(si)                      ▷ Message from LLS ⟹ Send to ULS
18:  else
19:    Λ ← ΛL(si)                      ▷ Message from ULS ⟹ Send to LLS
20:  end if
21:  for s ∈ Λ do
22:    Send p to s
23:  end for
24: end function
Stations can recognize this type of communication by inspecting the content of
the message; a possible solution is using a flag field in the message or, better, a
special value in the destination address field1.
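The duplicate and end-condition checks of algorithm 2 can be sketched as per-station state; leaf-set transmission is omitted and all names are illustrative:

```cpp
#include <cstdint>

// Which side of the leaf set, if any, the message must be forwarded to.
enum class Forward { None, ToUpper, ToLower };

// Per-station state: the hash of the last broadcast seen (0 acts as "none
// seen yet") and the side d it arrived from, as in algorithm 2.
struct BroadcastState {
    std::uint64_t last_hash = 0;
    int d = 0;  // -1: last copy came from LLS, +1: from ULS, 0: none yet

    // d_star is -1 if this copy arrived from the LLS, +1 from the ULS.
    Forward decide(std::uint64_t msg_hash, int d_star) {
        if (msg_hash == last_hash && d + d_star == 0)
            return Forward::None;  // same message from opposite side: end condition
        if (msg_hash == last_hash)
            return Forward::None;  // duplicate from the same side: don't resend
        last_hash = msg_hash;
        d = d_star;
        // A message from the lower leaf set is pushed to the upper one and
        // vice versa, so the broadcast travels around the ring.
        return d_star == -1 ? Forward::ToUpper : Forward::ToLower;
    }
};
```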
Lemma 13 (Message broadcasting complexity). Let R = (Ω, r, ξ, φ, ψ) be a ring with
N = |Ω| stations. The cost of transmitting a message in broadcast, in the best-case
scenario, is:

Θ∗_B = N / (2r)

where Θ_B is expressed in number of message hops2.
Proof. Without loss of generality, we indicate with sA ∈ Ω the station initiating the
broadcast transmission in the ring. As soon as a station receives a broadcast message,
it consumes the content and then forwards it to the opposite side of its own leaf-set
in relation to which node it received the message from, as per algorithm 2. If sA
starts the protocol by sending the message to only one side of its own leaf-set, then
1 Usually protocols use the all-1 string to indicate a broadcast address.
2 A hop, in the context of message routing, is a single direct transmission from one node to another.
the maximum number of hops required to cover the whole ring is:

Θ_B = N / r

because one station forwards the message in one go to its r neighbours. However,
initiator sA can be smarter and send the message to all nodes in its own leaf-set
(both sides). This triggers a symmetric chain on both sides of the ring, leading to
the thesis: the best-case scenario is when all messages travel at the same speed
and the last transmission occurs at the very opposite side of the ring (under the
hypothesis that no delayed transmission occurs).
Finger tables A well-known routing-enhancing technique, often used in DHTs,
is the employment of finger tables. Briefly, it consists in arranging leaf-sets in the
ring so that the LLS is empty and the ULS contains the successors at relative
positions 2^0, 2^1, 2^2, … up to 2^l. This way of linking stations implies a higher
cost from a control point of view, because it takes more time to re-arrange those
links at initialization time and when dynamic conditions are in place (e.g. one
station joining or leaving the ring).
That being said, this pattern also ensures better performance from a routing
standpoint, hence guaranteeing an even better complexity than the one considered
in lemma 13 in message-broadcasting scenarios.
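As a sketch of the arrangement just described, assuming stations indexed 0 … N−1 around the ring (an illustrative simplification of hash-based positioning):

```cpp
#include <cstddef>
#include <vector>

// Finger table of station k in a ring of n_stations: links to the successors
// at ring distances 2^0, 2^1, 2^2, ... (the LLS stays empty). Indices wrap
// modulo the ring size.
std::vector<std::size_t> finger_table(std::size_t k, std::size_t n_stations) {
    std::vector<std::size_t> fingers;
    for (std::size_t step = 1; step < n_stations; step *= 2)
        fingers.push_back((k + step) % n_stations);
    return fingers;
}
```

With power-of-two strides, any station is reachable in O(log N) hops instead of O(N/r).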
As we can see, the cost of updating global contract φ is acceptable, and many
well-known approaches in the literature can be considered. Therefore, we will not
detail this issue any further.
5.1.2 Load re-arrangement
The part of the scaling cost we are most worried about is the re-arrangement of
packets. This operation is not required just from a balancing point of view; there
is a more critical aspect which needs to be addressed as soon as one station joins
the ring: packet retrieval.
Let us consider a scenario where station s∗ has joined the ring and hash function
φ has been updated. Say s∗'s predecessor is now station si. If no load
re-arrangement is performed, then RReq messages targeting a packet p whose hash
hp = ξ(p) is now covered by s∗ (hp ∈ Ξ(s∗)) will find nothing, as such packets are
actually still stored in si, since that station was covering hp before the ring scaled
up.
The question we want to answer is: "How badly is the retrieve primitive impacted
by the ring scaling up?". The example we just considered suggests that only a
portion of the ring is affected, so the packet re-arrangement should only occur
between 2 stations; however, this is something that needs to be proved.
Theorem 14 (Load re-arrangement upon scale-up by 1 station). Let R = (Ω, r, ξ, φ, ψ)
be a ring. Let s∗ be a station joining the ring, causing hash function φ to be updated on all
stations to φ′. Let us also consider station si now becoming s∗'s predecessor, so that the new
station's hash segment is Ξ(s∗) = [h∗, h_i^M], assuming h∗ = ξ(s∗). Then the packet
re-arrangement effort required to make all packets in the new network retrievable, and to
re-balance the ring, impacts all stations in the network.
Proof. Recalling how hash function φ works, as described in section 2.3.3, we need
to understand whether the joining of a station causes the φ segments in its domain
to change boundaries (see figure 2.4). We can visualize the impact on the domain
by considering the domain mapping diagram under dynamic conditions. For
simplicity and without loss of generality, we consider si = s1:
[Domain mapping diagram. In the ξ space [0, hM] (φ's codomain), new hash h∗ is inserted between h1 and h2, while the other hashes h1, h2, …, hN keep their positions. In φ's domain [0, 1], every segment boundary 1/(2N) + k/N (k = 0 … N−1) moves to 1/(2(N+1)) + k/(N+1) (k = 0 … N): the inner segments of amplitude 1/N shrink to amplitude 1/(N+1), and a new φ segment of amplitude 1/(N+1) appears for s∗.]
As it is possible to see, additional station s∗ causes only one change in the ξ space
(φ's codomain), but it causes all φ segments to resize in order to make room for an
additional interval of amplitude 1/(N+1).
As we can see, our initial assumption was unfortunately not quite right. The
whole domain of hash function φ is impacted and, possibly, all packets need to be
re-routed according to new hash function φ′. Nonetheless, we still lack basic
information telling us how bad the re-arrangement effort is, such as:
1. Are packets re-routed to new stations completely unrelated to the original one,
or is there a pattern?
2. Do all packets require re-routing, or does a percentage of them remain in their
current station when hash function φ transitions to φ′?
These two questions are crucial to evaluate the cost of the re-arrangement effort,
so we need more investigation.
Lemma 15 (Station transition direction upon packet re-arrangement). Under the
same hypothesis and conditions of theorem 14, any packet p ∈ P stored in any station
sj ∈ Ω of the ring, if moved because of the re-arrangement, is moved either:
• to one of sj's successors, if sj ≺ s∗;
• to one of sj's predecessors, if s∗ ≺ sj.
Proof. To show this, we consider the ring in its 2 different configurations (before
the scale-up and after). The diagram below shows hash function φ's domain in both
conditions (N stations at the bottom and N + 1 at the top), and also plots the
location of 2 hashes h and h′ therein.
56 Chapter 5. Dynamic conditions
[Diagram: hash function φ's domain with |Ω| = N stations (bottom) and |Ω| = N + 1 stations (top). Bottom boundaries 1/(2N) + (k−1)/N, 1/(2N) + k/N and 1/(2N) + (k+1)/N delimit segments of amplitude 1/N covered by . . . , sk−1, sk, sk+1, . . . Top boundaries 1/(2(N+1)) + (k−1)/(N+1) through 1/(2(N+1)) + (k+2)/(N+1) delimit segments of amplitude 1/(N+1) covered by . . . , sk−1, s∗, sk, sk+1, . . . Two hashes h′ and h″ are plotted in both spaces.]
As we can see, hash h′ falls initially in station sk−1's coverage, but after the transition it ends up in station s∗'s coverage. In the same way, hash h″ falls initially in station sk+1's coverage, but after the transition it ends up in station sk's coverage. The formulation of the lemma implies that packets can also remain in the same station: this is indeed possible, as the diagram shows regions of the top and bottom hash spaces which have values in common.
We now know that a minimal pattern is present when re-routing packets. However, the information provided by lemma 15 is limited. A more interesting result can be derived, but first we need to introduce a quantity:
Definition 25 (Packet's station transition delta). Under the hypothesis of dynamic conditions originating from the ring scaling up by one station, let si and sj be the original station and the new station (after re-arrangement) for any packet p; then quantity ∆(p) ∈ N represents the number of stations packet p had to be moved across:

∆(p) = i − j if |i − j| ≤ N/2, and ∆(p) = sign(i − j) · N − i + j otherwise.
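A literal transcription of definition 25 as code may help; this is only an illustrative sketch, assuming 1-based station indices, with `transition_delta` and `sign` as hypothetical names:

```cpp
#include <cstdlib>

// sign(x): -1, 0 or +1.
static int sign(int x) { return (x > 0) - (x < 0); }

// Transition delta of a packet moved from station i to station j on an
// N-station ring (definition 25): the plain index difference when the move
// spans at most half the ring, the wrap-around expression otherwise.
int transition_delta(int i, int j, int N) {
    if (2 * std::abs(i - j) <= N)
        return i - j;
    return sign(i - j) * N - i + j;
}
```

For instance, on a 10-station ring a move from s10 to s1 across the wrap yields a delta of 1, consistent with theorem 16 below.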
∆(p) provides information about whether a packet was moved or not from its
original station (∆(p) = 0), and also about the direction of the move (∆(p) < 0 or
∆(p) > 0). The value of ∆(p) for each packet is the main subject of the next important
result:
Theorem 16 (Packets station transition delta upon scale-up by 1 station). Under the
hypothesis and conditions of theorem 14, the transition delta ∆(p) of any packet p ∈ P stored
in any station si ∈ Ω of the ring is, at most, unitary in absolute value: |∆(p)| ≤ 1.
Proof. The theorem basically states that if a packet is moved, it is moved to one of the 2 directly contiguous stations. To prove this statement, we want to re-formulate the thesis using an equivalent definition. In conjunction with lemma 15, we need to prove that:
1. A packet hosted in a station preceding s∗ is re-routed, at most, to its immediate successor.
2. A packet hosted in a station preceded by s∗ is re-routed, at most, to its immediate predecessor.
We will initially prove the first point, and later the second, by considering it as a mirrored condition of the former.
Let us consider a linear bounded real space divided into N ∈ N equal parts. Every part is marked with an identifying number k = 1 . . . N. Then we let N grow to N + 1: every existing segment shrinks down in order to make space for segment N + 1, which is added as the last one. This scenario abstracts the condition where each station's φ coverage is shrunk to lower φ values due to s∗ joining the ring, in the specific case where all stations being considered are predecessors (down to s1) of s∗.
[Diagram: a point a ∈ [0, 1] plotted against the two partitions. With N parts the boundaries k/N, (k+1)/N, (k+2)/N delimit segments of amplitude 1/N; with N + 1 parts the boundaries k/(N+1) through (k+3)/(N+1) delimit segments of amplitude 1/(N+1).]
Without loss of generality, we consider a point a ∈ [0, 1] ⊂ R and re-express the thesis as follows: "Is it possible to find any combination of a, k and N such that, after the shift from N to N + 1, a falls into a segment further than its original segment's successor?". Formally, this question is stated as follows:

∃a ∈ [0, 1] ⊂ R, N ∈ N, N > 0, k ∈ N, k = 1 . . . N :  { a < (k+1)/N ;  a ≥ (k+2)/(N+1) }
If that system of inequalities has no solution, then the thesis is confirmed. By developing both inequalities we get the following:

{ a − (k+1)/N < 0 ;  a − (k+2)/(N+1) ≥ 0 }  =⇒  { aN − k − 1 < 0 ;  a(N + 1) − k − 2 ≥ 0 }  =⇒  { aN − k − 1 < 0 ;  aN + a − k − 2 ≥ 0 }
By isolating N, we get:

{ aN < k + 1 ;  aN ≥ k + 2 − a }  =⇒  { N < (k+1)/a ;  N ≥ (k+2−a)/a }  ∨  { −k − 1 < 0 ;  −k − 2 ≥ 0 }
The first system arises from dividing both members, in both inequalities, by a; the second accounts for the solutions the system might present in case a = 0. This last system is easily proved to be impossible:

{ k + 1 > 0 ;  k + 2 ≤ 0 }  =⇒  { k > −1 ;  k ≤ −2 }
Returning to the former system and considering, from now on, a ∈ (0, 1] ⊂ R, we can develop further and get:

(k + 2 − a)/a ≤ N < (k + 1)/a  =⇒  (k + 2 − a)/a < (k + 1)/a  =⇒  k + 2 − a < k + 1  =⇒  a > 1

The system has solutions only for a > 1; however, this contradicts our hypothesis that a ∈ (0, 1] ⊂ R, thus the system has no solutions within the definition boundaries of a, N and k.
We still need to prove the symmetric case of stations that are successors of s∗.
However it is possible to skip this by considering that such a scenario is the mirror
of the one just proved.
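The impossibility shown above can also be checked numerically: a small brute-force sweep over a grid of values of a ∈ [0, 1], N and k finds no combination satisfying both inequalities. This is only a sanity check of the proof, not part of it; the function name is illustrative:

```cpp
// Returns true if some a in [0, 1], N <= maxN, k in 1..N satisfies
// a < (k+1)/N together with a >= (k+2)/(N+1); the proof says it never does.
bool system_has_solution(int maxN, int steps) {
    for (int N = 1; N <= maxN; ++N)
        for (int k = 1; k <= N; ++k)
            for (int s = 0; s <= steps; ++s) {
                double a = static_cast<double>(s) / steps;  // samples [0, 1]
                if (a < (k + 1.0) / N && a >= (k + 2.0) / (N + 1))
                    return true;
            }
    return false;
}
```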
As a direct result, we have the following:
Corollary 16.1 (Packet lookup failure at re-arrangement time). Under the hypothesis and conditions of theorem 14, if packet p ∈ P is not found in station si ∈ Ω while the system is in the process of re-arranging packets, then it will be found in the previous or next station, depending on whether s∗ ≺ si or si ≺ s∗.
Lemma 15, theorem 16 and corollary 16.1 provide the answers to our initial questions. To draw our conclusions: the ring is not perfectly scalable, as all stations need to rearrange their packets under dynamic conditions; however, the effort is extremely localized within each station.
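The lookup behaviour implied by corollary 16.1 can be sketched as follows: during a re-arrangement, a packet missing from its expected station can only be in the immediate predecessor or successor, so a lookup only ever probes three stations. Names and data structures here are illustrative, not the thesis' actual API:

```cpp
#include <initializer_list>
#include <unordered_set>
#include <vector>

using Station = std::unordered_set<long>;  // hashes of the packets held

// Index of the station holding hash h: the expected station first, then its
// successor and predecessor (ring indices wrap modulo N). Returns -1 if the
// hash is nowhere among the three candidates.
int locate(const std::vector<Station>& ring, int expected, long h) {
    const int N = static_cast<int>(ring.size());
    for (int d : {0, 1, -1}) {
        int idx = ((expected + d) % N + N) % N;
        if (ring[idx].count(h) > 0)
            return idx;
    }
    return -1;
}
```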
5.1.3 Scaling overall impact
We now have more information with which to evaluate conjecture 1. Considering the characteristics of the operations of updating hash function φ across stations and redistributing packets, we now understand that they can be executed in parallel. As soon as the joining station computes φ′, it commences the protocol for broadcasting this knowledge in the ring. At the same time, the same station can start going through all its packets and evaluating the new hash function on them in order to re-route its DUs. This process can be started in every station the moment φ′ is available and traversing the ring.
That being said, the packet moving operations are more expensive than the operation of computing the new hash function or receiving it from other stations, so the time needed for re-routing DU loads in the network is far higher: τψ ≫ τφ. The overall scaling time is therefore essentially defined by τψ.
5.1.4 Ring scale-down
All the considerations made so far regarding the ring scaling up can be transferred to the opposite case, where a station leaves the network. A few considerations must be made, though, in relation to this dynamic condition:
• When a station leaves the network, the physical detachment from the other nodes is not performed until all packets are re-routed. This is crucial, and different from the scenario of a station joining the ring: here we cannot afford to lose a whole bucket of packets.
• A station leaving the network is not the same scenario as a station abandoning the ring. The former is a controlled process happening through a specific protocol and requires time; the latter is a sudden event and cannot be controlled. Its nature is described later in this chapter.
5.2 Fault conditions
As anything can happen, stations in the ring might enter faulty states. The reasons for such a scenario can be many, hardware or software related, and adequate countermeasures can be devised. Nonetheless, when it comes to disaster recovery, what matters is not so much all the possible cases we know, but rather everything we don't know. So, we will now consider the possibility of a station becoming unavailable without asking ourselves why. What we ask instead is: "How do we guarantee data retrieval services and the balancing in such conditions?".
FIGURE 5.2: Multiple hashing mechanism for achieving safe redundancy. The data stream S is hashed into hS, then recursively into h(1)S, h(2)S, . . . ; each hash is concatenated to the data stream, hence generating packets p1, p2, p3, . . . ready to be sent.
When a station goes down, the first issue is infrastructural. If the ring is set to have leaf-set radius r = 1, then we have a problem, as the ring basically breaks apart and messages cannot be routed across stations. If the radius is higher, r > 1, then no immediate consequences are experienced in terms of message routing. In both cases, DHT networks have existing protocols in the literature to fix dangling links and isolate the unavailable station; the only difference is that a unitary-radius ring will experience some downtime until links are fixed. This is one of the reasons why non-unitary-radius rings are more robust to disasters.
The second issue concerns data retrieval. A station went down unexpectedly, so there was no time to apply any scale-down protocol (in fact, the scenario here is not a station leaving the ring, but a station disappearing from it). The direct consequence is virtual data loss: all packets stored in that station are now unavailable, and when any RReq is sent to the ring targeting one of those DUs, the destination station will not find the packet hash in its database.
It is clear that, to solve this issue, something has to be done before the station goes down. However, we cannot make any assumption about this condition and its timing. So we need to change the data storage protocol to target situations where emergency packet retrieval is needed, as we cannot afford, for any reason, the possibility of data becoming unavailable to users.
In chapter 4 we described the API for storing a packet in the ring. Our intention is to modify the storage protocol (primitive store) in order to save one packet in multiple locations in the ring without losing balancing. The procedure applies to either packets or fragments; in general, we consider a certain stream of data to be sent for storage:
1. The data stream S to send is processed and its hash computed: hS = φ(S).
2. Another hash is computed, using previously computed hash hS as input: h(1)S = φ(hS).
3. The same recursive operation is repeated ℓ ∈ N times, computing a chain of hashes: h(k)S = φ(h(k−1)S).
4. ℓ different packets are generated by constructing a frame with the same body (the data stream) but a different associated hash, as per figure 4.1, and then sent to the ring.
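The key-derivation part of the steps above can be sketched as follows, with std::hash standing in for the ring's cryptographic hash φ and ℓ (`ell`) as the redundancy factor. This is a sketch of the chaining only, under those assumptions, not of the actual store primitive:

```cpp
#include <functional>
#include <string>
#include <vector>

// Storage keys for a data stream: h_S = phi(S), then h^(k) = phi(h^(k-1))
// for k = 1..ell, yielding ell + 1 keys (original packet plus ell clones).
std::vector<std::size_t> storage_keys(const std::string& stream, int ell) {
    std::hash<std::string> phi;            // stand-in for the ring's phi
    std::vector<std::size_t> keys;
    std::size_t h = phi(stream);           // hash of the (long) stream
    keys.push_back(h);
    for (int k = 1; k <= ell; ++k) {
        h = phi(std::to_string(h));        // hash of a hash: very cheap
        keys.push_back(h);
    }
    return keys;
}
```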
[Sequence diagram: User → Proxy → Entry station → Dst station 1 / Dst station 2. The first RReq, retrieve(φ(p)), reaches Dst station 1, whose lookup returns null; an RRes carrying an error travels back. The second RReq, retrieve(φ(φ(p))), reaches Dst station 2, whose lookup finds packet p; a successful RRes returns it.]
FIGURE 5.3: Packet retrieval session under the hypothesis of one station down. The diagram illustrates how a failed RReq triggers the emergency retrieval process.
The procedure just described will generate ℓ different copies of the same DU, and they will all be sent to different locations in the ring. Thanks to the Lamport scheme³, we can compute multiple hashes of the same initial stream and use them as storage keys.
Remark. Generating the first hash hS is potentially expensive because the input stream can be long (although bounded to a certain level, considering fragmentation threshold c). The same cannot be said for the other hashes h(k)S, because they are computed on another hash (a very short string). So the process of computing the redundant hashes is very cheap.
³ The process of generating the hash of a hash is used today in security-related scenarios to generate ephemeral keys. The scheme has been proved to be safe and, when a secure cryptographic hash function is used, irreversible.
How can this procedure help us when attempting to retrieve a DU stored in an unavailable station? We consider again the broken scenario from before, where station si suddenly becomes unavailable:
1. The system tries to retrieve packet p via its hash h = φ(p).
2. The RReq message reaches station si−1, as it now covers the hash segment si had when it was on-line. However, station si−1 cannot find hash h in its database, thus returns an error in the RRes.
3. The system acknowledges the first RReq was not successful, so it tries again to retrieve the packet by computing φ(h).
4. The second RReq now reaches another station sj, where the packet is found and returned.
Figure 5.3 illustrates the protocol just described.
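The retrieval side can be sketched symmetrically: on a failed RReq the key is re-hashed and the request retried, walking the same chain used at store time. `lookup` below is an illustrative stand-in for issuing an RReq for a given key, and std::hash again stands in for φ:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Emergency retrieval: try the primary key, then up to `ell` re-hashed
// fallback keys (h^(k+1) = phi(h^(k))), mirroring the redundant store.
std::optional<std::string> retrieve(
        const std::function<std::optional<std::string>(std::size_t)>& lookup,
        std::size_t h, int ell) {
    std::hash<std::string> phi;
    for (int k = 0; k <= ell; ++k) {
        if (auto pkt = lookup(h))      // RRes carrying the packet
            return pkt;
        h = phi(std::to_string(h));    // failed RReq: derive the next key
    }
    return std::nullopt;               // every copy is unavailable
}
```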
5.2.1 Collisions threshold
As we promote the idea of introducing packet redundancy in the network as a means to achieve good levels of disaster recovery, we should be careful to make this effort as efficient as possible, therefore avoiding unnecessary cost. Since we are routing the same packet in the ring with different hashes, we want to make sure the copies do not end up being routed to the same station. If we generated one copy of a packet and both were routed to the same station, our effort would be pointless: the moment that station goes down, our emergency retrieval procedure would fail. On the other hand, we don't want to generate too many copies of the same packet, as we would waste precious memory in our stations. How do we find a good balance? Let's start by considering collisions in the ring:
Lemma 17 (Packets collision probability). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations, and let p1 ∈ P and p2 ∈ P be two packets. Then the probability that they collide (are routed onto the same station) is:

γ = 1/N   (5.2)
Proof. A collision occurs when p1 and p2 are routed to the same station si ∈ Ω: ψ(p1) = ψ(p2) = si. The probability of this event can be defined as follows:

γ = Pr{ψ(p1) = ψ(p2) = si}, ∀p1, p2 ∈ P, ∀si ∈ Ω

We first consider packet p1 routed into the ring to some station sk ∈ Ω, and then consider packet p2 being processed: the probability of a collision with p1 is the probability of p2 being routed to sk, where sk can be any station of the ring. This calls for the Law of Total Probability:

γ = Σ_{k=1}^{N} Pr{ψ(p2) = sk | ψ(p1) = sk} · Pr{ψ(p1) = sk}   (5.3)

Since ψ is based on hash function φ, which is based on ξ, a cryptographic hash function, consecutive applications of ψ are stochastically independent, which means that:

Pr{ψ(p2) = sk | ψ(p1) = sk} = Pr{ψ(p2) = sk}, ∀p1, p2 ∈ P, ∀sk ∈ Ω
We can rewrite equation 5.3 as follows:

γ = Σ_{k=1}^{N} Pr{ψ(p2) = sk} · Pr{ψ(p1) = sk}

Recalling theorem 5 and the definition of πk ∈ [0, 1] ⊂ R as the packet-in-station-k probability, we can write our equation as follows:

γ = Σ_{k=1}^{N} πk · πk = Σ_{k=1}^{N} πk²
We are under the hypothesis of a balanced ring, since hash function φ is applied; so, according to equation 2.12, we have:

γ = Σ_{k=1}^{N} 1/N² = (1/N²) · Σ_{k=1}^{N} 1 = (1/N²) · N = 1/N

Which proves the thesis.
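Lemma 17 is easy to check numerically. The sketch below replaces the ψ pipeline with a uniform random station choice, which is what the balanced-ring hypothesis amounts to; the function name is illustrative:

```cpp
#include <random>

// Empirical collision probability of two packets over N stations: route both
// uniformly at random and count how often they land on the same station.
double estimate_gamma(int N, int trials, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> station(0, N - 1);
    int collisions = 0;
    for (int t = 0; t < trials; ++t)
        if (station(gen) == station(gen))   // psi(p1) == psi(p2)
            ++collisions;
    return static_cast<double>(collisions) / trials;
}
```

With N = 10 the estimate settles around γ = 1/N = 0.1.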
The follow-up to lemma 17 is calculating the average number of collisions experienced in the ring when sending packets. Remember that sending copies p1 . . . pℓ of packet p does not create a correlation between the different instances being sent. This is due to the fact that we are sending different hashes hS, h(1)S, . . . , h(ℓ)S, related to each other by the Lamport chain, which actually guarantees that all the hashes are (stochastically) independent.
Lemma 18 (Average number of collisions). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations. Then, when generating m ∈ N packets, the average number of collisions experienced between different couples of units is:

ηγ = (m(m − 1)/2) · 1/N   (5.4)

where m(m − 1)/2 is the binomial coefficient counting the pairs of packets.
Proof. We introduce r.v. y ∈ N counting the number of collisions between couples of the m packets. This variable can range from 0 up to the number of possible combinations of two different packets: |C(m, 2)| = m(m − 1)/2. We also introduce the indicator r.v. χ ∈ {0, 1} ⊂ N defined as follows:

χ(p1, p2) = 1 if a collision occurs between the packets, 0 otherwise

Remembering that C(m, 2) enumerates all possible combinations of packets (order does not matter), we can define y as follows:

y = Σ_{(p1,p2)∈C(m,2)} χ(p1, p2)
R.v. y's mean value can then be calculated as:

ηγ = E[y] = E[ Σ_{(p1,p2)∈C(m,2)} χ(p1, p2) ]

Since operator E[·] is linear, we have that:

E[ Σ_{(p1,p2)∈C(m,2)} χ(p1, p2) ] = Σ_{(p1,p2)∈C(m,2)} E[χ(p1, p2)]
R.v. χ is discrete and distributed on two values only, so its mean can be easily calculated:

E[χ(p1, p2)] = 1 · Pr{χ = 1} + 0 · Pr{χ = 0} = Pr{χ = 1} = γ, ∀p1, p2 ∈ P

So, back to r.v. y's mean value:

ηγ = Σ_{(p1,p2)∈C(m,2)} E[χ(p1, p2)] = |C(m, 2)| · γ = (m(m − 1)/2) · 1/N

Proving the thesis.
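Equation 5.4 can be checked the same way: route m packets uniformly at random over N stations, count colliding pairs, and average over many runs; the result approaches m(m − 1)/2 · 1/N. A sketch under the same uniformity assumption as above:

```cpp
#include <random>
#include <vector>

// Average number of colliding pairs when m packets are routed uniformly at
// random to N stations, estimated over `runs` independent simulations.
double mean_collisions(int N, int m, int runs, unsigned seed = 7) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> station(0, N - 1);
    long long total = 0;
    for (int r = 0; r < runs; ++r) {
        std::vector<int> s(m);
        for (int& v : s) v = station(gen);
        for (int i = 0; i < m; ++i)          // enumerate C(m, 2) pairs
            for (int j = i + 1; j < m; ++j)
                if (s[i] == s[j]) ++total;
    }
    return static_cast<double>(total) / runs;
}
```

For N = 10 and m = 5 the expected value is C(5, 2)/10 = 1.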
Thanks to lemma 18, we can now try to calculate a reasonable value for ℓ and decide how many clones of a packet we should send into the network to ensure an effective level of redundancy.
Theorem 19 (Optimal ℓ). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations and let p ∈ P be a packet sent with redundancy factor ℓ ∈ N. Then, in order to guarantee that at least 50 % of sent packets do not collide, the optimal redundancy factor is:

ℓ < ℓopt = N
Proof. Let β ∈ [0, 1] ⊂ R be the fraction of collisions that we allow on the total number of packets ℓ + 1 (the original packet and its clones) sent to the network. So, the following must hold:

((ℓ + 1)ℓ/2) · 1/N < β(ℓ + 1)  =⇒  (ℓ + 1)ℓ/2 < βN(ℓ + 1)
=⇒  ((ℓ + 1)ℓ/2) · 1/(ℓ + 1) < βN
=⇒  ((ℓ + 1)!/(2!(ℓ − 1)!)) · 1/(ℓ + 1) < βN
=⇒  ((ℓ + 1) ℓ (ℓ − 1)!/(2(ℓ − 1)!)) · 1/(ℓ + 1) < βN
=⇒  ℓ/2 < βN  =⇒  ℓ < 2βN

Which proves the thesis by considering β = 1/2.
Chapter 6
Conclusions and final notes
Simulations have shown the effectiveness of the balancing performed by the algorithm; together with the use of known distributed architectures (DHT networks), the proposed balancing approach is feasible and potentially employable in real-case scenarios.
6.1 Open issues
The algorithm currently presents some challenges which must be addressed in order to make the architecture more flexible and less costly from a network performance standpoint (traffic and control overhead).
Scalability is the first priority. The analysis performed so far has provided good upper bounds on the cost of scaling up the ring by one station; however, more is to be investigated. More simulations should be run on scaling rings, and a differential analysis must be carried out to identify possible patterns which can be taken advantage of.
6.2 What’s next
As a continuation of the effort described in this document, the next action items to
focus on are:
1. Improving C/C++ simulations to target more advanced scenarios.
2. Performing more simulations on very large networks (up to 1000 stations and
more) and higher traffic volumes.
3. Developing simulations targeting traffic handling in the ring, in order to get
more information about the impact on network performance introduced by
the PA.
4. Enriching simulations with more features addressing differential analysis on
scaling rings.
The next iteration should focus on collecting more information regarding the performance of the algorithm, with special focus on high-variance conditions in the amplitudes of hash segments. Furthermore, it can be beneficial to evaluate migration flows in scaling scenarios.
Appendix A
C/C++ simulation engine’s
architecture
The C/C++ simulation engine has been developed with the following technologies:
• Intel’s Threading Building Blocks (TBB1) for parallel packet generation and
hash computation.
• GNU C/C++ compiler.
• Boost2 C++ libraries for big integers and other utilities.
• Tina’s Random Number Generator (TRNG3) library for randomizers.
• OpenSSL4 cryptographic library for hash computation.
• Circos5 library for circo-diagrams generation (migration flows).
Simulation steps Simulations can run sequentially or in parallel. When running in parallel, a Monte Carlo approach is used so that packet generation and hash computation can be performed much faster. When running a simulation, the following steps are performed:
1. Pre-compilation configuration Compilation variables are assigned. The engine is based on the STL6, and parameters such as the number of stations N and the number of generated packets m are defined as compile-time constants; thus they need to be set.
2. Compilation The simulation engine undergoes compilation in order to produce simulation executables.
3. Post-compilation configuration Simulation input files are prepared in order to
specify hash segments and other network descriptive variables.
4. Execution Simulations run.
5. Data extraction Output data is generated in order to get aggregated information and markup files to be used for generating circo-diagrams.
1 Intel's library for multi-threaded processing. https://www.threadingbuildingblocks.org/.
2 Boost libraries. http://www.boost.org/.
3 Random number generator library. https://www.numbercrunch.de/trng/.
4 Standard SSL implementation. https://www.openssl.org/.
5 Circos. http://circos.ca/.
6 The C++ Standard Template Library allows the use of generic types and compile-time constants.
Every simulation generates 3 files:
• A data file tracking hash segments per each station and all generated packets,
hashes and φ-hashes.
• A table file containing a matrix used by Circos to generate migration flows.
• A Karyotype file used by Circos to generate other diagrams (for the future).
Infrastructure All simulations mentioned in this document have been run against
a pool of Intel 4-core machines: HP ProLiant DL180 G6 (64 bit) on CentOS 6 (RHEL).
List of Figures
1.1 Overall system architecture. The end user interacts only with the storage system, while the balancing system is hidden from the user and transparent to the storage system with regards to accessing the server pool. (p. 7)
2.1 An N = 8 network example showing the logical ring topology. Each station is assigned an ID (typically the IP address hash) and packets are routed by content. (p. 10)
2.2 Access to the ring is guarded by proxies. (p. 11)
2.3 Hash-partitioning of a ring into different segments, one per station. For each segment, a different impulse is used; its coverage matches the segment's length. (p. 21)
2.4 Hash segments mapped onto φ segments, illustrating how hash function φ works. The top part of the diagram shows the φ hash-space, the bottom part the ξ hash-space. (p. 26)
3.1 Polar Hash Coverage Plot (PHCP) of a simulation on an N = 10 station ring after sending m = 10³ packets. Both plots show the configuration of the station hash segments together with the final load levels at the end of the simulation. The plot on the left refers to a normal ring (hash function ξ applied), the one on the right to an extended ring where hash function φ based on the same ξ is considered. The same packets were sent in both rings. (p. 32)
3.2 Load state |sk| (in blue) in each station sk as time grows. In this simulation, hash function ξ is used (normal ring). The green line shows the expected (uniform) load state for each point in time. (p. 33)
3.3 Load state |sk| (in blue) in each station sk as time grows. In this simulation set (same as in figure 3.1), hash function φ is used (extended ring). The green line shows the expected (uniform) load state for each point in time. (p. 34)
3.4 Standard deviation vs dispersion factor of generated ξ-hashes and φ-hashes during simulation batches (from left to right): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations). (p. 37)
3.5 Standard deviation of hash segment lengths and standard deviation of φ-hashes during each simulation in batches (from top to bottom): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations). (p. 38)
3.6 Station loads ηk(ξ) (no balancing) and ηk(φ) (balanced ring) at the end of four N30 simulations with different seeds. (p. 39)
3.7 Migration flows in an N30 ring. (p. 40)
4.1 Data unit format. (p. 45)
4.2 Synchronous vs. asynchronous communication model when storing a single packet. (p. 46)
4.3 Packet info format. (p. 47)
4.4 Sequence diagram showing the retrieval protocol in case of a fragmented packet. (p. 48)
5.1 Message format. (p. 52)
5.2 Multiple hashing mechanism for achieving safe redundancy. Hashes are computed and then concatenated to the data stream, hence generating packets ready to be sent. (p. 59)
5.3 Packet retrieval session under the hypothesis of one station down. The diagram illustrates how a failed RReq triggers the emergency retrieval process. (p. 61)
List of Tables
2.1 Showing, in the example, values of hash identifiers and hash segments for each station. (p. 27)
Acknowledgements
Thanks to my supervisor: Prof. Eng. O. Tomarchio, for having enough patience and
waiting a few more years for me to finish this research while working in Denmark.
Thanks to Medilink srl, my host company during my master's traineeship, in which this research effort was started and its first step completed. They provided everything I needed (resources, infrastructure) to complete my work.
Thanks to my Team Lead at Microsoft: Horina, for her flexibility and availability,
allowing me to submit this work in time.
Graphics and artwork Icons and graphics in figures created by Katemangostar -
Freepik.com.
Last but not least, thanks to all the amazing public libraries in Copenhagen which
have hosted me and my work during many weekends spent on this thesis.
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
How Creative Agencies Leverage Project Management Software.pdf
Upgrade and Innovation Strategies for SAP ERP Customers
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...

Master Thesis - A Distributed Algorithm for Stateless Load Balancing

  • 1. UNIVERSITY OF CATANIA MASTER’S THESIS A Distributed Algorithm for Stateless Load Balancing Author: Andrea TINO Supervisor: Prof. Eng. Orazio TOMARCHIO Assistant Supervisor: Eng. Antonino BLANCATO A thesis submitted in fulfillment of the requirements for the degree of Master of Engineering in the Faculty of Computer Science Engineering Department of Electrical, Electronic and Computer Science Engineering July 21, 2017
  • 3. Declaration of Authorship I, Andrea TINO, declare that this thesis titled, “A Distributed Algorithm for Stateless Load Balancing”, and the work presented in it are my own. I confirm that: • This work was done wholly or mainly while in candidature for a research degree at this University. • Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated. • Where I have consulted the published work of others, this is always clearly attributed. • Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. • I have acknowledged all main sources of help. • Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself. Signed: Date:
  • 5. “I like thinking that this work of mine kind of reflects my international personality. The idea of this algorithm has crossed my mind while I was working in Japan (summer 2012) on non-deterministic mathematical models to describe fast similarity search algorithms. It’s incredible how some ideas come to life so spontaneously! Then, I started working and developing the foundations of the algorithm in Italy. After I started as an employee in Microsoft, I kept on working on this project in Denmark. Even on vacation, I found time to work on this thesis while roaming in several areas of South Korea. It is also worth mentioning that I worked on some chapters while I was in The Netherlands. When I think about this, I feel happy!” Andrea Tino
  • 7. University of Catania Abstract Faculty of Computer Science Engineering Department of Electrical, Electronic and Computer Science Engineering Master of Engineering A Distributed Algorithm for Stateless Load Balancing by Andrea TINO The algorithm that is the object of this thesis deals with the problem of balancing data units across different stations, in the context of storing large amounts of information in data stores or data centres. The approaches in use today are mainly based on employing a central balancing node, which often requires information from the different stations about their load state. The algorithm proposed here follows the opposite strategy, whereby data is balanced without the use of any centralized balancing unit, thus fulfilling the distributed property, and without gathering any information from stations about their current load state, hence the stateless property. This document goes through the details of the algorithm by describing the idea and the mathematical principles behind it. By means of an analytical proof, the balancing equation will be devised and introduced. Later on, tests and simulations, carried out in different environments and with different technologies, will illustrate the effectiveness of the approach. Results will be introduced and discussed in the second part of this document, together with final notes about the current state of the art, challenges and deployment considerations in real scenarios. (IT, translated) The algorithm that is the object of this thesis addresses the problem of balancing data units within a pool of different stations, in the context of the need to persist large amounts of information in server farms or data centres. The strategies currently in use are mainly based on the employment of a central balancing component which often needs some information from the network nodes about their current load state. The algorithm proposed here takes a diametrically opposite approach, in which data balancing is performed without the use of any centralized component, hence the distributed property, and without the need to obtain any data from the stations about their load state, hence the stateless property. In this document, we examine the details of the algorithm through a description of the underlying idea and of the mathematical principles at its foundation. By means of an analytical proof, the balancing equation will be derived and analysed. We will then examine the tests and simulations, both conducted with different technologies, supporting the effectiveness of the algorithm. The results will be examined and discussed in the second part of this document, together with final notes on the current state of the technology in the field of data balancing. The open problems and the usage scenarios of the algorithm will also be examined.
  • 9. Contents
  Declaration of Authorship
  Abstract
  1 Introduction
    1.1 About Balancing
      Objective
    1.2 Describing the scenario
    1.3 Characterization of balancing algorithms
      1.3.1 Randomness
      1.3.2 State
      1.3.3 Static vs. dynamic
      1.3.4 Centralization
      1.3.5 DU retrieval
    1.4 Well known balancing algorithms
      1.4.1 Round Robin
      1.4.2 Weighted Round Robin
      1.4.3 Random
      1.4.4 Source Address Hash
      1.4.5 Least Load
      1.4.6 Graph based algorithms
        Nearest Neighbour
        RAND
        Never Queue
        THRESHOLD
    1.5 System overview
  2 The algorithm
    2.1 Network organization
      Station ordering
      Ring access
      Routing
    2.2 Unbalanced ring
    2.3 Balancing the ring
      2.3.1 Extending the ring
        Adapting concepts in extended ring
        Defining sizing equations
      2.3.2 Designing hash function φ
        Designing r.v. s_φ’s PDF
        Designing r.v. s_φ
      2.3.3 Understanding how φ works
    2.4 Ring balancing example
      2.4.1 Defining the ring
  • 10.
      2.4.2 Defining the formatting impulse
      2.4.3 Binding impulses to stations
      2.4.4 Calculating amplitudes
      2.4.5 Computing functions
  3 Simulation results
    3.1 Small-size simulations
      3.1.1 Verifying load balance
      3.1.2 Evaluating load levels per station
    3.2 Large-size simulations
      3.2.1 Overview
      3.2.2 Evaluating the variance of hash segment amplitudes
      3.2.3 Evaluating load levels per station
        Migration flows
  4 System API
    4.1 Storing data
      4.1.1 Packet fragmentation
      4.1.2 Routing
    4.2 Retrieving data
  5 Dynamic conditions
    5.1 Scalability
      5.1.1 Updating φ
        Broadcasting in DHT
      5.1.2 Load re-arrangement
      5.1.3 Scaling overall impact
      5.1.4 Ring scale-down
    5.2 Fault conditions
      5.2.1 Collisions threshold
  6 Conclusions and final notes
    6.1 Open issues
    6.2 What’s next
  A C/C++ simulation engine’s architecture
  Acknowledgements
  Bibliography
  • 11. List of Abbreviations
  BS: Balancing System
  DSLB: Distributed Stateless Load Balancing
  PA: Proposed Algorithm
  SS: Storage System
  BA: Balancing Algorithm
  BP: Balancing Pool
  DLB: Data Load Balancing
  DLBA: Data Load Balancing Algorithm
  DU: Data Unit
  LD: Load Distribution
  SL: Station Load
  CDF: Cumulative Distribution Function
  PDF: Probability Density Function
  DHT: Distributed Hash Table
  HS: Hash Segment
  ID: IDentifier
  LS: Leaf Set
  LLS: Lower Leaf Set
  P2P: Peer To Peer
  ULS: Upper Leaf Set
  API: Application Program Interface
  r.v.: random variable
  • 13. List of Symbols
  N: Number of stations
  Ω: Balancing pool (set)
  P: Data Units (packets) (set)
  s_i: Station
  p: Data Unit (packet)
  ψ: Packet/station assignment application
  Σ: Load distribution
  l: Hash length (number of bits)
  ξ: Hash function
  h: Hash string (number)
  h_ξ: Regular hash string (number)
  h_φ: φ-hash string (number)
  η: Station packet load (number)
  η^(ξ): Station packet load (via ξ) (number)
  η^(φ): Station packet load (via φ) (number)
  π_i: Packet-in-station probability
  π_i^(ξ): Packet-in-station probability (via ξ)
  π_i^(φ): Packet-in-station probability (via φ)
  f: PDF (function)
  F: CDF (function)
  F^(-1): Inverse CDF (function)
  g: Formatting impulse (function)
  G: Formatting impulse antiderivative (function)
  Λ: Leaf set
  Λ_U: Upper leaf set
  Λ_L: Lower leaf set
  • 15. dedicated to my Mother and my Father
  • 17. Chapter 1 Introduction 1.1 About Balancing Under the umbrella term balancing it is possible to refer to different problems and solutions: balancing of connections, of workloads, of tasks or of data. What distinguishes one type of balancing from another is the entity being balanced. Definition 1 (Balancing). In Computing and Computer Science, balancing indicates the problem of distributing an indefinitely high number of entities across multiple subjects (stations). The assignment is performed so as to guarantee that, at any given time, all stations hold (roughly) the same amount of entities. From which follows: Definition 2 (Balancing Algorithm). An algorithm, or a system based on a certain algorithm, designed to solve the balancing problem. In the context of this research work, we focus on a specific type of balancing: Definition 3 (Data Load Balancing). A type of balancing focusing on units of data, often referred to as packets or, simply, data units. The solutions to the latter are referred to as: Definition 4 (Data Load Balancing Algorithm). A BA targeting DLB and solving it by minimizing a certain objective function. This research effort focuses on DLB and DLBAs in order to introduce a new algorithm targeting multiple performance metrics. Objective The objective of this thesis is to employ this algorithm in a cloud storage system designed to serve different applications. The system must be capable of: 1. Accepting data as input, to be stored in a pool of servers (stations). 2. Retrieving stored data on demand. 3. Removing stored data on demand. These essential functionalities must be enabled by the BA and its design.
  • 18. 1.2 Describing the scenario Let us describe DLB and its most important aspects in formal terms. The network A certain number N of stations is always considered; together they form a balancing pool, a set which we will indicate as Ω. Each station si ∈ Ω (with i = 1 . . . N) is connected to the others by a generic protocol; we do not consider any specific communication technology, the only required assumption being that the protocol employs direct addressing of each station (one unique address per station). Station assignment At any given time, a DU (or packet) p ∈ P (P being the set of all DUs) must be stored in the BP. The system/algorithm responsible for carrying out this activity is expressed by the application ψ : P → Ω, which assigns a DU to a station. The way ψ works is essentially the core of the balancing system. The application chooses a station with the objective of guaranteeing that, at any given time, all stations hold roughly the same number of packets. Remark. Application ψ typically receives a DU as input: ψ(p); however, it may accept more arguments, ψ(p, ·), depending on the strategy it uses to perform the balancing. Loads At any given time, each station si in the BP has a certain number of DUs assigned to it; we indicate this quantity, the station load, with the operator |si|, thus: |si| = |{p ∈ P : ψ(p) = si}| ∈ N (1.1) This quantity can also be expressed as the number of bytes (or any of its multiples) of the total packets stored in a station: |si| = Σ_{p ∈ P : ψ(p) = si} |p| ∈ R (1.2) where |p| indicates the length of a packet (typically in bytes). Unless otherwise specified, we will refer to the former definition. To have an overview of the balancing state, another quantity is introduced: the load distribution, indicated with the symbol Σ = (|s1|, |s2|, . . . , |sN|), representing the ordered vector of station loads at any given time.
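The load quantities just defined can be sketched in a few lines of code. This is an illustrative sketch only: the names `station_load`, `station_load_bytes` and `load_distribution`, and the dictionary-based representation of ψ, are assumptions made for the example, not part of the thesis.

```python
# Sketch of the load quantities |s_i| and Sigma (illustrative names).
# psi maps each packet ID to a station index; sizes maps packet ID to |p| in bytes.

def station_load(assignments, i):
    """|s_i| as a packet count: number of DUs assigned to station i (eq. 1.1)."""
    return sum(1 for s in assignments.values() if s == i)

def station_load_bytes(assignments, sizes, i):
    """|s_i| as total bytes: sum of |p| over packets assigned to station i (eq. 1.2)."""
    return sum(sizes[p] for p, s in assignments.items() if s == i)

def load_distribution(assignments, n):
    """Sigma = (|s_1|, ..., |s_N|), the ordered vector of station loads."""
    return tuple(station_load(assignments, i) for i in range(n))

# Example: 3 stations, 5 packets already assigned by some psi.
psi = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1}
sizes = {0: 100, 1: 250, 2: 80, 3: 40, 4: 10}
print(load_distribution(psi, 3))          # (2, 2, 1)
print(station_load_bytes(psi, sizes, 0))  # 140
```

A perfectly balanced state is one where the components of the returned tuple are (roughly) equal.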
Time The algorithm's runtime does not require a continuous description. Time will therefore be considered discrete, t ∈ N, and characterized by events: an event being, for instance, the arrival of a new DU to route in the network. 1.3 Characterization of balancing algorithms DLBAs can be differentiated from several points of view. Considering this classification is important in order to locate the proposed algorithm within the taxonomy of today's most used systems. 1.3.1 Randomness Algorithms can employ non-deterministic components, like pseudo-random number generators, in order to pick the station to associate with a DU. This approach is
  • 19. not a bad one because, provided the generator is characterized by a uniform PDF, it guarantees a fairly good level of balancing at low cost. Proposition 1 (Determinism). The algorithm being proposed is fully deterministic. 1.3.2 State In order to perform balancing, algorithms may require stations to keep information regarding their load state (e.g. disk usage or residual available space). Statefulness implies that stations communicate the state information to other stations or to special nodes in the network; by employing this knowledge, the BA can perform a more precise job. The downside is mostly related to communication overhead, as state information must be regularly exchanged. Proposition 2 (Statefulness). The algorithm being proposed is stateless: it does not require any information to be sent by stations in order to perform balancing. 1.3.3 Static vs. dynamic The balancing can happen at two possible points in time: • At runtime The association to a station is performed while the DU is being transmitted to stations. In these conditions, the same DU might be routed to different stations depending on the contingent situation. It is also possible to have DUs re-routed. This behaviour makes the algorithm dynamic. As a rule of thumb, a BA is dynamic when it is not always possible to know in advance where a DU will be routed until the algorithm is actually run. • Before runtime The destination station is known before the algorithm is run. This makes the algorithm static. As a rule of thumb, a BA is static when the same DU is always routed to the same station. Proposition 3 (Staticity). The algorithm being proposed is static. 1.3.4 Centralization This property determines whether the algorithm requires a central node in the network to perform the balancing. A centralized BA requires a unit, called the balancer, which takes care of routing the DU to its destination station.
Conversely, a distributed BA does not need this extra component. Centralized BAs are easier to implement, but they have two major downsides: • All traffic must pass through the balancer, which acts like a hub node. • The balancer represents a single point of failure in the network. If it goes down, the whole network is compromised. Safety mechanisms can be employed to avoid network downtime by limiting the outage to the balancing feature only: if the balancer fails, traffic is still routed to stations but no balancing occurs. This property also impacts the topology of the network. Typically, centralized algorithms employ a star topology where the balancer is the central node. Proposition 4 (Non-centralization). The algorithm being proposed is distributed.
  • 20. 1.3.5 DU retrieval A very important characteristic of a BA is the way a DU can be retrieved once it has been stored in a station. This does not strictly relate to the BA itself, as DU retrieval is more an aspect of the storage algorithm (which employs the BA); however, the two systems are connected and will be treated as one. An essential part of the data retrieval story is DU and station identification. Since a DU is assigned to a station, the association performed by si = ψ(p) must be identifiable. We have already introduced station identifiers, so we need to do the same for DUs and introduce, for a packet p, its identifier, indicated as p̂ ∈ N (the couple (p, p̂) is unique). A key concept to understand is that the station association application ψ works both with packets, ψ : P → Ω, and with packet IDs, ψ : N → Ω. • If the algorithm is dynamic, then ψ does not always return the same station when invoked. In such a case, the association (p̂, si) must be saved somewhere. This condition requires the algorithm to employ a database functioning as a lookup table, which of course takes resources and impacts memory. • If the BA is static, then ψ always returns the same station for the same ID, and it will always be possible to retrieve the station where a packet p was routed by simply calculating ψ(p̂). It means that it is not necessary to store the coordinates of a packet in order to retrieve it. When a DU is sent for storage, its ID is used by the owner as a key to retrieve it at a later time. 1.4 Well known balancing algorithms The proposed algorithm competes with other algorithms available in the market today and commonly used in many different application domains. We are going to describe some of these in order to compare, later, how the proposed approach ranks among them. 1.4.1 Round Robin This class of algorithms keeps a counter c = 1 . . .
N, which points to the destination station sc where the current DU will be routed. The counter is incremented at every new incoming packet: c_{t+1} = (c_t + 1) % N. These BAs guarantee very precise balancing, as fairness in packet association is their strongest point. Such algorithms are deterministic, stateless, static and require a packet lookup table. They are typically centralized, with the balancer keeping track of the counter. However, this condition is not a limitation, as it is still possible to use Round Robin in a distributed way, though such an approach is not common in the market today.
  • 21. 1.4.2 Weighted Round Robin Like Round Robin, these algorithms guarantee fairness by looping through stations. However, the counter is used in a different way: every station si is assigned not a single number but a range of contiguous numbers c_i . . . C_i; the counter ranges from 1 to C = max_i {C_i} and is incremented according to the rule c_{t+1} = (c_t + 1) % C. The wider the interval C_i − c_i for a station si, the more packets that station will receive. Although this seems to break the balancing principle, this approach makes it possible to account for stations that do not have the same storage capabilities. A possible application is assigning more DUs to stations with a larger storage capacity. These algorithms are typically centralized, they are deterministic and still require a packet lookup table. Given their nature, they can be either static/stateless or dynamic/stateful; the latter implementation applies when the characteristics of stations can change over time, the state typically relating to storage capabilities. 1.4.3 Random Random algorithms use a random number generator, employing a discrete random variable distributed over the range 1 . . . N, to choose the destination station si. Such algorithms are non-deterministic, dynamic, stateless and always require a packet lookup table. They can be either centralized or distributed, though the former are the most common in the market. One important aspect of these BAs is the requirement on the probability distribution of the random variable employed by the number generator: the distribution must be uniform in order to achieve proper balancing. 1.4.4 Source Address Hash Commonly used in TCP/IP applications, this class of centralized algorithms employs a balancing unit which assigns a hash range h_i . . . H_i to each station si. Hashes are evaluated numerically, so ranges are basically contiguous sequences of integer numbers, h_i, H_i ∈ N.
When a packet arrives, the balancer computes a hash h ∈ N on the source address (the hashing function can either be one of the well-known cryptographic ones or some ad-hoc implementation) and routes the DU to the station whose hash range includes the calculated hash. Source Address Hash algorithms guarantee that packets coming from the same sources are routed to the same stations. These algorithms are deterministic, static, stateless and require a lookup table. The balancing relies on the hash function: a cryptographic hash is necessary to guarantee that packets from sources whose addresses differ by only a few bits do not all end up in the same station. Given their implementation, such algorithms can offer fairly good balancing. 1.4.5 Least Load These algorithms usually rely on a centralized balancer; however, it is still possible to perform the balancing in a distributed way (though not common in the market for this implementation). When a DU arrives, it is routed to the station with the lowest current load. This means that the balancer needs to know the load of each station, which is the reason why these algorithms are stateful and typically introduce considerable overhead in the network. Least Load algorithms are deterministic, dynamic and require a lookup table.
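The classic strategies described in this section (Round Robin, Weighted Round Robin, Random, Source Address Hash) can be sketched as follows. All function names are illustrative, and the modulo mapping used for the hash ranges is a simplification of the contiguous-range scheme described above.

```python
import hashlib
import random

N = 4  # number of stations in the pool (example value)

def round_robin():
    """Round Robin: c_{t+1} = (c_t + 1) % N, one station per incoming packet."""
    c = 0
    while True:
        yield c
        c = (c + 1) % N

def weighted_round_robin(weights):
    """Weighted Round Robin: station i owns a contiguous counter range whose
    width is proportional to weights[i] (e.g. its storage capacity)."""
    while True:
        for i, w in enumerate(weights):
            for _ in range(w):
                yield i

def random_station(rng):
    """Random: pick a station uniformly; balance relies on a uniform PDF."""
    return rng.randrange(N)

def source_address_hash(addr):
    """Source Address Hash: hash the source address and map it onto the N
    contiguous hash ranges (here simplified to a modulo over stations)."""
    h = int.from_bytes(hashlib.sha256(addr.encode()).digest()[:8], "big")
    return h % N

rr = round_robin()
print([next(rr) for _ in range(6)])   # [0, 1, 2, 3, 0, 1]
wrr = weighted_round_robin([2, 1, 1, 1])
print([next(wrr) for _ in range(5)])  # [0, 0, 1, 2, 3]
# Same source always maps to the same station (static, stateless):
assert source_address_hash("10.0.0.1") == source_address_hash("10.0.0.1")
```

Note how Round Robin and Source Address Hash are deterministic, while `random_station` requires an external generator, matching the classification given above.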
  • 22. 1.4.6 Graph based algorithms There is a class of (typically) stateful, distributed and dynamic algorithms which calculate the destination station for a DU based on a graph search performed on the network. These algorithms are usually highly scalable; also, no assumption is made on the topology of the network (free topology). Nearest Neighbour A random node si is picked in the network in order to initiate the transmission of a DU. The balancing is performed only in the context of the subnet represented by the chosen node and its direct neighbours sj (with j ≠ i). The algorithm picks the station which minimizes a specific metric, usually the load state of each node. RAND This algorithm is non-deterministic and randomly selects the node (station) to which a packet p is routed. A threshold L ∈ R is considered for a certain metric (usually the load state); if the packet exceeds the threshold, |p| > L, then the DU is re-routed to another randomly selected station; otherwise it is stored in the current one. The most prominent characteristic of this algorithm is its statelessness. Since RAND does not require any information from stations, the implementation is very easy, minimal overhead is generated (only due to threshold-exceeding packets which need re-transmission) and the balancing is fairly good. CYCLIC This algorithm is a variant of RAND in which minimal state information is kept: the last station a packet was re-transmitted to is always remembered by the system; this guarantees that the same station is not picked twice in a row in case of consecutive threshold-exceeding occurrences. Never Queue There are algorithms which use state information exchanged among stations in order to evaluate data not strictly relating to nodes. Typically the state is represented by station-specific quantities, like the current load or the residual amount of storage memory left.
Never Queue employs a different kind of state across the network, in order to be able to evaluate, the moment a DU arrives and needs to be stored, the station to which transmitting the packet implies the least cost. Thus, the algorithm always transmits the packet to the fastest available node. This balancing strategy (which requires a lookup table) is poor, as it does not guarantee a good quality of the overall balancing across stations. What it does guarantee, and this is the reason why we mention this approach, is good performance in processing packets in terms of throughput and latency. However, this has a cost in overhead due to the amount of information exchanged by nodes to refresh the network state. THRESHOLD This class of highly dynamic algorithms uses network state information, derived from message exchange among stations, to decide where a packet will be stored. An incoming packet p is initially routed to a random node (this makes the algorithm non-deterministic); that station then compares the packet size with a load threshold
L ∈ R and decides what to do. If the threshold is not exceeded, |p| < L, then the DU is stored in the current station; otherwise another station is picked via a polling mechanism. A maximum number of attempts M ∈ N is considered, after which a packet stays in the current station even though the threshold is exceeded. Even though this approach looks a lot like RAND, it differs in the way a station is selected: only the initial node selection is random; in case the threshold is exceeded, the next station is picked through a process based on analysis of the network state. LEAST This algorithm works like THRESHOLD but is limited to one single iteration. When a packet arrives at a randomly selected node and the threshold is exceeded, the algorithm polls a certain pool of stations to pick the next one in its first attempt to route the packet. The station is usually picked based on its current load (least loaded node). After that, no further attempt is performed. Think of LEAST as THRESHOLD with M = 1.
  • 23. 1.5 System overview Before detailing how the PA works, it is important to understand the architecture of the storage system being designed as part of the research in this thesis. FIGURE 1.1: Overall system architecture. The end user interacts only with the storage system, while the balancing system is hidden from the user and transparent to the storage system with regard to accessing the server pool. From a point of view based on the API that the overall system exposes, we recognize two major components: • Storage system The component interacting with the end user and providing the API for storing and retrieving data. • Balancing system The component responsible for arranging the storage pool and balancing data across its servers.
As pointed out by Figure 1.1, the architecture separates concerns by defining two sets of APIs: one exposed to the user, for submitting and retrieving data, and another one, hidden from the user, which is responsible for balancing DUs in the storage pool.
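The two-component split of Figure 1.1 can be sketched as a minimal facade. Class and method names here are illustrative, and the modulo rule inside `BalancingSystem` is only a placeholder for a static, stateless assignment; it is not the thesis' algorithm.

```python
# Minimal sketch of the two-component architecture of Figure 1.1
# (class and method names are illustrative, not the thesis API).

class BalancingSystem:
    """Hidden component: maps a packet ID to a station in the pool."""
    def __init__(self, n_stations):
        self.n = n_stations

    def station_for(self, packet_id):
        # Placeholder static, stateless rule; the thesis replaces this
        # with the phi-based hash assignment developed in chapter 2.
        return packet_id % self.n

class StorageSystem:
    """User-facing component: store / retrieve / remove, as per the objective."""
    def __init__(self, balancer):
        self.balancer = balancer
        self.pool = [dict() for _ in range(balancer.n)]  # one dict per station

    def store(self, packet_id, data):
        self.pool[self.balancer.station_for(packet_id)][packet_id] = data

    def retrieve(self, packet_id):
        return self.pool[self.balancer.station_for(packet_id)].get(packet_id)

    def remove(self, packet_id):
        self.pool[self.balancer.station_for(packet_id)].pop(packet_id, None)

storage = StorageSystem(BalancingSystem(3))
storage.store(7, b"payload")
print(storage.retrieve(7))  # b'payload'
```

Because the assignment is static and keyed only on the packet ID, retrieval needs no lookup table: recomputing `station_for(packet_id)` always locates the right station.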
Chapter 2

The algorithm

In chapter 1, we have anticipated the most important properties that the PA shows. What makes this algorithm innovative is the fact that it is at the same time distributed, static, stateless and does not require a lookup table. In this chapter we are going to describe the mathematics behind the algorithm which makes all this possible.

2.1 Network organization

The PA does not allow stations to be arranged in a free connection scheme. A strong assumption is made on the topology of the network according to the DHT¹ scheme. It is important to make a few considerations about the protocol used by stations to communicate with each other: in this specific case, no assumption is made on the networking technology, as any protocol can be employed in the network (several protocols can actually be used as long as they are compatible with each other) except for one requirement:

Proposition 5 (Networking protocol). Stations are free to employ any arbitrary communication protocol, as long as it guarantees direct addressing and that a message can be delivered from / to every pair of stations.

The real-world scenario, the one considered here, is the Internet and the TCP/IP protocol. Given proposition 5, one station is actually capable of communicating with every other one in the network; however, this rarely happens because, and this is the reason why direct addressing is essential, a node's address must be known. Networks can be organized according to DHT specifications thanks to a limited knowledge of other nodes' addresses.

The direct consequence of proposition 5 is that a separation is made between stations' physical and logical connection schemes. Physically, stations can be arranged without any constraint; however, the limited knowledge of other nodes in the graph allows the system to generate an overlay network, which is the one being considered here.
A station is supposed to have a very limited knowledge of other stations, thus keeping in memory only a few of them (considered as neighbours). According to DHT specifications, a ring topology is employed; it derives from nodes holding a neighbour set of only 2 nodes: a predecessor and a successor, as shown in figure 2.1.

¹ Distributed Hash Tables are employed in distributed networks. This network protocol was adopted by 4 major P2P systems: Chord, Pastry, CAN and Tapestry.
FIGURE 2.1: A N = 8 network example showing the logical ring topology. Each station is assigned an ID (typically the IP address hash) and packets are routed by content.

Station ordering

A natural order occurs among stations. When the ring is formed, every node s_i computes an identifier Id(s_i) ∈ ℕ represented by the hash of the station's address. This identifier is used to build the ring, as every station needs to locate its predecessor and successor in the network:

Lemma 1 (Ring construction). As long as every station has a minimal initial set of connections which guarantees that all nodes form a connected graph, the ring can be built by having every station reshape its neighbourhood with one successor and one predecessor.

After this initialization phase, the ring is on-line and ready to accept packets. This aspect is very important as it allows us to define the set of stations Ω as ordered, and we can define:

Definition 5 (Station preceding operator). Given a couple of stations (s_i, s_j) ∈ Ω² (i ≠ j and i, j ≤ N), operator ≺ : Ω × Ω → {true, false} defines the preceding relation among them. The operator works as follows:

s_i ≺ s_j ⟺ Id(s_i) < Id(s_j)    (2.1)

The first important result is the following:

Theorem 2 (Ring complete ordering). The set of stations Ω with precedence operator ≺ : Ω × Ω → {true, false} is a completely ordered set.

Proof. Immediate by considering that operator ≺ on Ω, because of its definition, directly maps onto operator < on ℕ, which is a fully ordered set.

Ring access

The ring is the place where data is stored, and the purpose of the PA is to help the storage system balance all DUs across stations. The first detail we focus on is how the ring is accessed when the SS needs to send a packet to be stored or retrieved. The ring has to be kept safe both from external nodes and from the same nodes that are part of the network.
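Lemma 1 and definition 5 can be illustrated with a short sketch: station IDs are obtained by hashing addresses, and sorting the IDs yields each node's predecessor and successor. SHA-256 truncated to 32 bits stands in for the generic hash function ξ; this choice is an assumption for illustration, not the thesis's prescribed one.

```python
import hashlib

def station_id(address: str) -> int:
    """Id(s) = hash of the station's address, truncated to 32 bits."""
    return int(hashlib.sha256(address.encode()).hexdigest(), 16) % 2**32

def build_ring(addresses):
    """Sort stations by ID; predecessor and successor of every node follow
    from the sorted order, wrapping circularly (the ring of figure 2.1)."""
    ring = sorted(addresses, key=station_id)
    n = len(ring)
    neighbours = {addr: (ring[(i - 1) % n], ring[(i + 1) % n])  # (pred, succ)
                  for i, addr in enumerate(ring)}
    return ring, neighbours
```

Sorting by ID is exactly the complete ordering of theorem 2: the precedence operator ≺ reduces to `<` on the integer IDs.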
The way to guarantee the latter is through the following:
FIGURE 2.2: Access to the ring is guarded by proxies.

Proposition 6 (Limited knowledge principle). To guarantee safety and scalability, every node in the system has a partial knowledge of the overall network.

In order to protect the ring from external activity, direct access to the network must be forbidden:

Proposition 7 (Zero knowledge principle). To guarantee safety from external intrusions, no node, except those in the ring, knows the address of any station in the system.

This principle, though valid, cannot be adopted as-is because we would otherwise end up with an isolated network. However, in order to fulfil the security features promoted by proposition 7, it is possible to build a guarding system around the ring which hides it from the external world. A collection of proxy stations is employed for this purpose. As shown in figure 2.2, those stations are exposed to the external world and act as intermediaries to the ring, whose addresses are kept private (in order to fulfil proposition 6, proxy stations will know the addresses of only a few stations in the ring).

Routing

The SS we are designing is distributed. The basic idea, according to the DHT specifications, is that a packet enters the ring from an arbitrary station, called the entry point, in the context of a transmission. From there, every station, which has the BA deployed, knows whether that packet should be stored there or otherwise routed to a different station.

Every station keeps a limited knowledge of the network. This knowledge is represented by the set of neighbour nodes one station keeps. Given the topology, we define a parameter called leaf radius: r ∈ ℕ (r < N), which represents the number of successor (or predecessor) nodes every station holds as its neighbourhood.

Definition 6 (Leaf Set).
Every station s_i ∈ Ω keeps track of its neighbour nodes (plus itself) in an ordered set called leaf set: Λ(s_i) ⊂ Ω. The leaf set's cardinality is always |Λ(s_i)| = 2r + 1, where r is the leaf radius of the ring. The following equation holds:

Λ(s_i) = {s_j ∈ Ω : a_{i,j} = 1} ∪ {s_i}    (2.2)

Where A = [a_{i,j}] ∈ ℕ^{N×N} is the adjacency matrix of the network.
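Given the ordered ring, the leaf set of definition 6 and its upper/lower halves (definitions 7 and 8 below) can be computed directly from a station's index; a small sketch, assuming the ring is available as a sorted list of stations:

```python
def leaf_set(ring, i, r=1):
    """Λ(s_i): the r predecessors, s_i itself and the r successors,
    taken circularly, so that |Λ(s_i)| = 2r + 1."""
    n = len(ring)
    return [ring[(i + k) % n] for k in range(-r, r + 1)]

def upper_leaf_set(ring, i, r=1):
    """Λ_U(s_i): the r neighbours that s_i precedes."""
    n = len(ring)
    return [ring[(i + k) % n] for k in range(1, r + 1)]

def lower_leaf_set(ring, i, r=1):
    """Λ_L(s_i): the r neighbours preceding s_i."""
    n = len(ring)
    return [ring[(i - k) % n] for k in range(1, r + 1)]
```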
Note how one station's leaf set contains the station itself. Also, the leaf set is always an ordered set as it is, by definition, a subset of Ω, which we proved to be ordered in theorem 2. Unless otherwise specified, we will always consider r = 1. Lastly, it is sometimes convenient to picture one station's leaf set extensively as the ordered vector of neighbour stations (including s_i):

Λ(s_i) = (s_{i−r}, …, s_{i−1}, s_i, s_{i+1}, …, s_{i+r})

The notation above helps us identify nodes s_{i−r} and s_{i+r} as the extreme nodes in every station's leaf set. Those nodes will play an important role when defining routing function ψ later on.

Definition 7 (Upper Leaf Set). Let s_i ∈ Ω be a station and Λ(s_i) ⊂ Ω its leaf set. The upper leaf set Λ_U(s_i) ⊂ Λ(s_i) is defined as the set of all neighbours which the station precedes:

Λ_U(s_i) = {s_j ∈ Λ(s_i) : s_i ≺ s_j}    (2.3)

This set always has cardinality |Λ_U(s_i)| = r.

Definition 8 (Lower Leaf Set). Let s_i ∈ Ω be a station and Λ(s_i) ⊂ Ω its leaf set. The lower leaf set Λ_L(s_i) ⊂ Λ(s_i) is defined as the set of all neighbours by which the station is preceded:

Λ_L(s_i) = {s_j ∈ Λ(s_i) : s_j ≺ s_i}    (2.4)

This set always has cardinality |Λ_L(s_i)| = r.

When a station receives a packet, it performs certain operations in order to understand whether that packet is to be stored there or elsewhere. In the latter case, the station will pick one of the stations in its leaf set and route the packet there. The next station repeats the same sequence of operations until the packet is stored in a node. This algorithm is represented by function ψ : P → Ω; to perform it, every station in the ring uses the same hash function:

Definition 9 (Hash function). Let P be the set of packets and H ⊆ ℕ; we define ξ : P → H as a hash function used to calculate the station where to route a packet in the ring.

Not all hash functions can be used to route packets in the ring:

Proposition 8 (Cryptographic hash function).
Hash function ξ : P → H is a cryptographic hash function. By definition, ξ behaves in such a way that a one-bit change in the input packet causes, on average, about 50% of the output hash's bits to change.

It is possible to consider many cryptographic hash functions out of those currently employed in modern systems. Among the most common today we have the following families: SHA² and MD³. As anticipated earlier, stations self-organize in a logical overlay ring by assigning IDs. One station's ID Id(s_i) is computed by using the same hash function ξ on the station's address. For formal consistency, we intend hash function ξ to also work on stations: ξ : Ω → H, which is perfectly valid as hash functions do not really care about the type of data fed as input, as long as it is a bitstream. The following expression: h_i = Id(s_i) = ξ(s_i) is to be intended as hash function ξ calculated on station

² Secure Hash Algorithm. SHA-1 (160 bits), SHA-256 (256 bits), SHA-512 (512 bits).
³ Message Digest Hash. MD2 and MD4 (128 bits), today considered unsafe. MD5 (128 bits) and MD6 (512 bits).
s_i's address. As soon as the ring is initialized and ready to work, the topology defines the ordering of stations: s_1 ≺ s_2 ≺ · · · ≺ s_i ≺ · · · ≺ s_{N−1} ≺ s_N, following the order of IDs (hashes): h_1 < h_2 < · · · < h_i < · · · < h_{N−1} < h_N. From here the DHT assigns to each station a hash segment:

Definition 10 (Hash Segment). Let s_i ∈ Ω be a station in the ring and h_i = Id(s_i) = ξ(s_i) its ID. Station s_i's hash segment Ξ(s_i) is defined as the set of contiguous hashes ranging from h_i up to h_{i+1} (excluded):

Ξ(s_i) = {h ∈ H : h_i ≤ h < h_{i+1}}  if i ≠ N
Ξ(s_i) = {h ∈ H : h_N ≤ h ≤ h^M} ∪ {h ∈ H : 0 ≤ h < h_1}  if i = N    (2.5)

Where h^M ∈ H is the highest value that hash function ξ can produce: ξ(·) ∈ [0, h^M].

Routing function ψ employs hash function ξ in order to compute the destination station for a packet. The algorithm is deployed on every station and always behaves the same:

Algorithm 1 Routing a packet in the ring
Require: Ring initialized
Require: Station s_i has ID h_i = ξ(s_i)
Require: Station s_i has associated hash segment Ξ(s_i)
Require: Station s_i has associated leaf set Λ(s_i)
1: function ψ(p ∈ P)
2:   h_p ← ξ(p)
3:   Λ ← ∅
4:   if h_p ≥ h_i then
5:     Λ ← Λ_U(s_i) ∖ {s_{i+r}} ∪ {s_i}
6:   else
7:     Λ ← Λ_L(s_i)
8:   end if
9:   for s ∈ Λ do
10:     if h′ ≤ h_p ≤ h″ then    ▷ Since Ξ(s) = [h′, h″] ⊂ H
11:       return s
12:     end if
13:   end for
14:   if h_p ≥ h_i then    ▷ Having that Λ(s_i) = (s_{i−r}, …, s_i, …, s_{i+r})
15:     return s_{i+r}    ▷ The packet either belongs to s_{i+r} or further
16:   else
17:     return s_{i−r}    ▷ The packet belongs to a station preceding s_{i−r}
18:   end if
19: end function

We can now provide a better formal description of a ring by introducing its definition:

Definition 11 (Ring). Let Ω be the set of stations, r ∈ ℕ the leaf radius, ξ : · → H ⊆ ℕ the hash function used by each station and ψ : P → Ω the routing function, based on ξ, used to assign packets to stations. Then we define R = (Ω, r, ξ, ψ) as a fully qualified ring
overlay across N = |Ω| stations s_i ∈ Ω where packets p ∈ P are routed and delivered to each station via routing function ψ employing hash function ξ.

Given algorithm 1, we have the following result:

Lemma 3. Let R = (Ω, r, ξ, ψ) be a ring; then the following holds:

ψ(p) = s_i ⟺ ξ(p) ∈ Ξ(s_i), ∀p ∈ P

Proof. Immediate by considering the first exit point of algorithm 1.

Regarding ψ, we want to describe a few more important aspects:

Definition 12 (Routing). Function ψ will be repeatedly executed for a certain number of iterations from the moment packet p enters the ring until it finds its destination station. The routing is over when ψ returns the same station where it is evaluated.

It is now evident that one single application of function ψ does not effectively route the packet to the correct station. It is necessary to perform a certain number of iterations and apply ψ to the same packet in different stations. This scheme generates a recursive condition which we want to make more evident. Let us denote with ψ_k ∈ Ω the station returned by the k-th application of ψ; the recursive definition is completed by setting the initial condition:

ψ_{k+1} = ψ(ψ_k), ψ_1 = s_i    (2.6)

As for every recursive function, we ask ourselves whether the recursive definition in equation 2.6 converges to a value. As per definition 12, we expect function ψ to assume a value at a certain iteration b ∈ ℕ and keep it in every future iteration b + k, k ∈ ℕ. For this reason, it is imperative that the cyclic application of ψ does not lead to an infinite sequence of iterations, which would make the recursive definition generate an alternating sequence.

Theorem 4 (Routing is always successful). Let R = (Ω, r, ξ, ψ) be a ring and p ∈ P a packet entering it from station s_i ∈ Ω. Let b ∈ ℕ be the number of different applications of routing function ψ, across the different stations of the ring, before p finally reaches its destination.
Then b is always limited: b ≤ B ∈ ℕ.

Proof. We prove this by contradiction, thus assuming ∃p ∈ P : b → ∞. By analyzing algorithm 1, in order to have an infinite number of iterations, we need function ψ, when evaluated on station s_i ∈ Ω, to never return the current station: ψ(p) ≠ s_i. For such a condition to hold, the following must occur:

∃p ∈ P : h_p = ξ(p) ∉ Ξ(s_i), ∀s_i ∈ Ω

which translates into:

∃p ∈ P : h_p = ξ(p) ∉ [h_i, h_i^M], ∀s_i ∈ Ω

Since we do not know whether s_i is the last station in the ring, we use h_i^M to indicate the final hash in station s_i's hash segment. However, since Ξ(s_i) = [h_i, h_i^M] in the equations above depends only on s_i, and since those equations hold for all stations, we can consider the totality of the hash segments:

⋃_{s ∈ Ω} Ξ(s) = ⋃_{k=1}^{N} [h_k, h_k^M] = [0, h^M]
So we can re-write the previous equations as:

∃p ∈ P : h_p = ξ(p) ∉ [0, h^M]

Which is contradictory, as hash function ξ is, by definition, limited in range: ξ(·) ∈ [0, h^M], and since h_p is calculated via hash function ξ, it must fall in that range.

Theorem 4 proves that the recursive definition introduced before converges:

Corollary 4.1 (Recursive application of ψ converges). Let R = (Ω, r, ξ, ψ) be a ring and p ∈ P a packet entering it from station s_i ∈ Ω. Then the recursive term ψ_k during the routing of the packet converges to station s_j ∈ Ω after b ∈ ℕ iterations:

lim_{k→∞} ψ_k = ψ_b = s_j

In order to avoid confusion between the final computed destination station and the intermediate hopping stations calculated by the several iterations of the routing function, from now on we will indicate with expression s_i = ψ(p) the station where packet p is stored at the end of the routing process. That is, we consider ψ as returning ψ_b = s_i (last iteration) unless otherwise specified. At this point, we have completed describing and formalizing the storage system.

2.2 Unbalanced ring

We now move forward by analyzing what problems this structure presents in terms of balancing. Even though only the storage system has been covered so far, it is important to point out that, as it is now, the architecture already enables a primitive form of balancing. Packets are, in fact, distributed across different stations, and modern P2P networks are entirely based on this scheme. What kind of balancing do we end up with? Since routing algorithm ψ employs hash function ξ, the balancing state of the ring depends entirely on ξ! Under static conditions (the ring does not change), the routing of packet p is decided the moment h_p = ξ(p) is computed. Routing function ψ is executed many times because the knowledge that each station has of the ring is limited, but that number of iterations does not affect the destination station being computed.
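The routing procedure of algorithm 1 and its recursive application (equation 2.6) can be sketched as below. The sketch simplifies the thesis pseudocode in one stated respect: each hop first checks its own segment, which also covers the wrap-around segment of the last station; leaf-set members are derived on the fly from the sorted list of IDs, which stands in for the per-station limited knowledge.

```python
def in_segment(ids, i, hp, h_max):
    """True when hp falls in Ξ(s_i); the last station's segment wraps."""
    n = len(ids)
    if i < n - 1:
        return ids[i] <= hp < ids[i + 1]
    return ids[i] <= hp <= h_max or 0 <= hp < ids[0]

def psi(ids, i, hp, r=1, h_max=2**32 - 1):
    """One application of ψ at station index i: scan the relevant side of
    the leaf set for the owner of hp, else forward to the extreme leaf."""
    n = len(ids)
    if in_segment(ids, i, hp, h_max):
        return i
    step = 1 if hp >= ids[i] else -1
    for k in range(1, r + 1):
        j = (i + step * k) % n
        if in_segment(ids, j, hp, h_max):
            return j
    return (i + step * r) % n  # extreme leaf: keep moving in that direction

def deliver(ids, start, hp, r=1, h_max=2**32 - 1):
    """Iterate ψ until it returns the station it is evaluated at
    (definition 12); theorem 4 guarantees this terminates."""
    i = start
    while True:
        j = psi(ids, i, hp, r=r, h_max=h_max)
        if j == i:
            return i
        i = j
```

With r = 1 the packet moves one hop per iteration toward its owner, which is exactly why the number of applications b in theorem 4 is bounded by the ring size.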
The question we want to answer is: "Given hash h ∈ H, what is the probability that, given a random input packet p ∈ P, hash function ξ computed on p returns h: Pr{ξ(p) = h}?". Since ξ is a cryptographic hash function, it has an interesting property: the probability distribution of the hashes being generated is approximately uniform, which means that:

∀h ∈ H, ∀p ∈ P, Pr{ξ(p) = h} = 1 / (h^M + 1)    (2.7)

Since ξ : · → [0, h^M]. If we indicate with l ∈ ℕ the number of bits of the hash (hash length): l = ⌈log₂ h^M⌉, then ξ : · → [0, 2^l − 1] and we can write:

∀h ∈ H, ∀p ∈ P, Pr{ξ(p) = h} = 2^{−l}    (2.8)

What we want to focus on is knowing, in the long run, how many packets each station gets with this configuration. From this knowledge, we will then analytically
calculate the information we need about the balancing. The problem of Packets in Stations is very close (though not entirely equivalent) to another well-known one which we will take into consideration: Balls in Bins⁴. We will see that the scenario of throwing a ball into an area full of bins and assessing into which bin the ball falls is equivalent to producing a random packet, calculating its hash and checking into which station it is going to be routed. Our analysis starts with identifying the probability for a packet to be routed into one station of the ring:

Theorem 5 (Packet-in-station probability). Let R = (Ω, r, ξ, ψ) be a ring. Then the probability that a packet p ∈ P is routed to station s_i ∈ Ω is:

π_i^{(ξ)} = Pr{ψ_ξ(p) = s_i} = |Ξ(s_i)| / (1 + h^M)

Where |Ξ(s_i)| is station s_i's hash segment's length (hash coverage).

Proof. Since each station s_i owns a specific hash segment Ξ(s_i) = [h_i, h_i^M], the proof is immediate by considering that every hash h ∈ H ≡ [0, h^M] has the same probability of being selected, as per equation 2.8. So we just need to multiply that probability by the length of the segment.

We use expressions π_i^{(ξ)} and ψ_ξ(·) to indicate the probability for a packet to fall into a certain station and the routing function ψ, both when hashing function ξ is employed in the ring. The reason why we want to make the hash function explicit is that, later, we are going to evaluate the same hash-based quantities with a different hashing function and compare results.

Given one station s_i, the length of its hash segment |Ξ(s_i)| is an important quantity. We can efficiently formalize its value by using expression |Ξ(s_i)| = δ_{i,i+1} and by defining the following quantity:

Definition 13 (Segment length calculator). Let Ω be the set of stations and let every station s_i ∈ Ω have an assigned hash segment [h_k, h_k^M] ⊂ ℕ such that the hash partitioning is circular; thus the last station s_N has hash segment [h_N, h^M] ∪ [0, h_1 − 1].
We define the segment length calculator function as the application returning the number of hashes in the segment assigned to one station:

δ_{i,j} = 1_{i,j} · 2^l − h_{i mod (N+1)} + h_{j mod (N+1)}

Having ∀i, j ∈ ℕ ∧ i, j > 0, and:

1_{i,j} = 1 if i > j (indices taken circularly), 0 otherwise

so that, for consecutive stations, δ_{i,i+1} = h_{i+1} − h_i when i < N, while the circular segment of the last station yields δ_{N,N+1} = 2^l − h_N + h_1.

Then we can express the packet-in-station probability in theorem 5 as follows:

π_i^{(ξ)} = Pr{ψ_ξ(p) = s_i} = 2^{−l} · δ_{i,i+1}    (2.9)

Theorem 6 (Packets-in-station probability). Let R = (Ω, r, ξ, ψ) be a ring. Let µ_i = |s_i| = 0 … m be a r.v. counting the number of packets in station s_i, where m ∈ ℕ represents

⁴ The problem describes a non-deterministic scenario where balls are thrown into an area full of bins in a random direction, as described in: (Kolchin, 1998).
the number of total packets sent so far to the ring. Then the probability that station s_i ∈ Ω has k ∈ ℕ packets is:

Pr{µ_i^{(ξ)} = k} = (m choose k) · (π_i^{(ξ)})^k · (1 − π_i^{(ξ)})^{m−k}

Proof. Immediate. We consider m packets routed into the ring and we want to calculate the probability that k among them were routed to station s_i. This calls for Bernoulli trials. The probability for a packet to end up in one station is given by theorem 5.

Thanks to theorem 6, we know the PDF of r.v. µ_i^{(ξ)} and we are able to calculate how many packets on average one station gets:

η_i^{(ξ)}(m) = E[µ_i^{(ξ)}] = Σ_{k=0}^{m} k · Pr{µ_i^{(ξ)} = k} = Σ_{k=0}^{m} k · (m choose k) · (π_i^{(ξ)})^k · (1 − π_i^{(ξ)})^{m−k}    (2.10)

As described in (Kolchin, 1998), in the Balls in Bins problem the following holds:

Σ_{k=0}^{m} k · (m choose k) · p^k (1 − p)^{m−k} = mp, ∀m ∈ ℕ, m > 0, p ∈ [0, 1] ⊆ ℝ

Which allows us to calculate the average load per station in a simpler form:

η_i^{(ξ)} = m · 2^{−l} · δ_{i,i+1}    (2.11)

Proposition 9 (Unbalanced ring). Let R = (Ω, r, ξ, ψ) be a ring. Given equation 2.11, the network is not balanced. The wider one station's hash segment, the more packets that station gets:

|Ξ(s_i)| > |Ξ(s_j)| ⟺ η_i^{(ξ)} > η_j^{(ξ)}, ∀s_i, s_j ∈ Ω

Proposition 9 is very important as it states that the system architecture, so far, is able to achieve a very high level of decentralization; however, it fails in balancing stations.

2.3 Balancing the ring

Equation 2.11 is our starting point to take the current architecture and modify it in order to reach load balancing. Our ideal model is one in which each station gets the same number of packets:

Definition 14 (Ideal station load). Given a network of N ∈ ℕ different stations, we define the following as the ideal load per station:

η_i = m / N, ∀s_i ∈ Ω

Where m ∈ ℕ is the total number of packets sent to the network.
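Equations 2.9 through 2.11 can be checked numerically. The sketch below computes δ_{i,i+1} from consecutive IDs (wrapping for the last station), the binomial law of theorem 6, and the mean load η_i^{(ξ)} = m · 2^{−l} · δ_{i,i+1}; function names are illustrative.

```python
from math import comb

def segment_lengths(ids, l):
    """delta_{i,i+1} for each station: difference of consecutive sorted IDs;
    the last segment wraps past 2^l back to h_1."""
    n = len(ids)
    return [ids[i + 1] - ids[i] if i < n - 1 else 2**l - ids[-1] + ids[0]
            for i in range(n)]

def pmf(pi, m, k):
    """Theorem 6: Pr{mu_i = k} over m packets, per-packet probability pi."""
    return comb(m, k) * pi**k * (1 - pi) ** (m - k)

def mean_load(ids, l, m):
    """Equation 2.11: eta_i = m * 2^{-l} * delta_{i,i+1} for every station."""
    return [m * d / 2**l for d in segment_lengths(ids, l)]
```

Running it on an uneven partition makes proposition 9 concrete: the station with the widest segment accumulates proportionally more packets than the ideal m/N.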
Definition 14 points out that our final goal is having our architecture move towards the Balls in Bins model. Our goal can also be expressed by considering the probability that each single bin has of getting a ball:

π_i = N^{−1}, ∀s_i ∈ Ω    (2.12)

If we can have equation 2.9 converge to equation 2.12, our goal is reached, since both definition 14 and equation 2.12 describe a uniformly distributed r.v.

2.3.1 Extending the ring

Moving forward to our goal, as per definition 14, we need to understand why the ring is not balanced. We can identify 2 possible causes:
1. Hash function ξ does not take into account the fact that stations have hash segments of different lengths. It actually assumes that all stations have hash segments of the same size.
2. Stations should have hash segments of the same size.

The 2 problems described above are actually 2 possible explanations of the same issue: with regard to balancing, the network structure and the hash function are not well coupled together. For our solution, we choose to accept the standpoint offered by point 1, which blames the hash function rather than the stations. Our approach is replacing hash function ξ with another one:

Definition 15 (Hash function φ). Let φ : P → [0, φ^M] ⊂ ℝ be a hash function. By employing function φ, a ring can achieve balancing and each station approximately receives the same number of packets:

η_i^{(φ)} ≈ η_i = m / N

Where m ∈ ℕ is the total number of packets sent to the network. Also, we still define l as the length (in bits) of hashes generated by φ: l = ⌈log₂ φ^M⌉.

Definition 15 represents a goal for us. In the next section we are going to design φ so that load balancing in the ring is achieved. The first thing we notice about φ is that we have designed it to return real numbers; thus the hash space is no longer discrete, but continuous. We will see that the continuous characterization of φ will not be a problem when employed in the ring and in routing algorithm ψ.
Furthermore, in our model, we will consider function φ to be used on packets only; we will not be using this new hash function to compute the IDs of stations: for them, we will keep using hash function ξ.

Definition 16 (Extended ring). Let Ω be the set of stations and r ∈ ℕ the leaf radius. Let ξ : · → H ⊆ ℕ be the underlying hash function and φ : P → [0, φ^M] ⊂ ℝ the balancing hash function used by each station and based on ξ. Let ψ : P → Ω be the routing function, based on φ, used to assign packets to stations. Then we define R = (Ω, r, ξ, φ, ψ) as the extended ring overlay where packets are balanced via hash function φ.

Given definition 16, we see that φ does not act as a replacement of ξ, so we will actually consider the former as an extension of the latter.
Adapting concepts in the extended ring

With hash function φ in place, routing algorithm ψ needs to be slightly changed. Actually, since the whole ring structure is based on hashes, we need to adjust a few definitions so that the extended ring (Ω, r, ξ, φ, ψ) can be properly described. The most important concept to introduce in the extended architecture is that φ acts transparently with regard to the hash space. Hash function ξ returns hashes in the discrete space H ≡ [0, h^M] ⊂ ℕ, while φ generates hashes in the continuous space Φ ≡ [0, φ^M] ⊂ ℝ. We design φ such that φ^M = h^M; thanks to this, one space contains the other but they are bound by the same extremes: H ⊂ Φ. This also means that hash segments can be expressed both as enumerable sets and as real intervals. Whether the former or the latter is meant will be specified via set identities or inferred from context.

The next concept to adapt is stations and their IDs. Nothing changes in this regard: every station will keep using hash function ξ to calculate the hash of its address h_i ∈ H; however, the important point here is understanding that the hash identifying one station is also contained in the space of φ-hashes: h_i ∈ Φ.

The last aspect to cover is routing function ψ. Given the assumptions above, algorithm 1 remains essentially unchanged. What changes is line 2 where, instead of using hash function ξ to hash the packet, hash function φ is used: h_p ← φ(p). All other operations remain unchanged because Φ extends H. The direct consequence of this last point is the following:

Lemma 7. Let R = (Ω, r, ξ, φ, ψ) be an extended ring; then the following holds:

ψ(p) = s_i ⟺ φ(p) ∈ Ξ(s_i), ∀p ∈ P

Proof. Immediate by considering lemma 3 and the fact that function φ replaces ξ in algorithm 1 at line 2.

Defining sizing equations

Equation 2.12 describes the PDF of r.v. s ∈ Ω, which represents the bin (station in an ideally balanced ring) where a ball (packet) falls.
As that equation prescribes, if we want to have the ring balanced, we need to make sure that all stations get the same probability of receiving a packet, which is not the case for a normal ring (Ω, r, ξ, ψ), as per equation 2.9. So, by employing hash function φ, r.v. s_φ ∈ Ω can be defined as the station where a packet falls, assuming the extended ring (Ω, r, ξ, φ, ψ) is in place. R.v. s_φ's PDF is the starting point from where we can commence our sizing effort. Since s_φ is continuous, and given lemma 7, the probability that a packet is routed to station s_i in the extended ring is:

π_i^{(φ)} = Pr{h_i ≤ φ(p) ≤ h_i^M} = ∫_{Ξ(s_i)} f_φ(r) dr    (2.13)

Where f_φ : [0, h^M] ⊂ ℝ → ℝ is r.v. h_φ's PDF (it represents the generated φ-hashes) and Ξ(s_i) ⊆ Φ is station s_i's continuous hash segment. It is important to notice how
this equation relates r.v. s_φ ∈ Ω (since its PDF has expression Σ_{k=1}^{N} π_k^{(φ)} · δ(r − k)⁵) together with r.v. h_φ ∈ Φ. Recalling equation 2.12, we basically want:

π_i^{(φ)} = π_i : ∫_{h_i}^{h_i^M} f_φ(r) dr = N^{−1}, ∀s_i ∈ Ω    (2.14)

We start from equation 2.14. Our purpose is designing hash function φ's implementation so that this equation holds. This approach will guarantee that r.v. s_φ behaves like the continuously distributed r.v. s in the Balls in Bins scenario.

2.3.2 Designing hash function φ

Equation 2.14 represents a constraint on f_φ. This expression points out an important relationship:

Proposition 10 (Relationship between r.v. s_φ and h_φ). Given equation 2.14, the effort of designing hash function φ is transferred onto r.v. h_φ, as its PDF f_φ is the subject of such design.

Designing r.v. s_φ's PDF

Of course, we cannot extract function f_φ from the integral sign in equation 2.14, so we need to make some assumptions on it.

Definition 17 (Formatting impulse). Let g : ℝ → ℝ be a continuous, domain- and value-bounded function with the following constraints:
1. g(r) ≥ 0, ∀r ∈ ℝ.
2. g is a compact-support⁶ function: ∃r_1, r_2 ∈ ℝ, r_1 < r_2 : g(r) = 0, ∀r ∉ [r_1, r_2].
3. ∃A ∈ ℝ, A > 0 : g(r) ≤ A, ∀r ∈ ℝ.
4. ∫_{−∞}^{+∞} g(r) dr ≤ 1.
5. It is possible to calculate g's antiderivative: ∃G(r) : G′(r) = g(r).

We define g as f_φ's formatting impulse: a function used to shape r.v. h_φ's PDF and solve equation 2.14. Because of its definition, we use expression g_{r_1,r_2,A}(r) to refer to an impulse with amplitude A and definition interval [r_1, r_2] ⊂ ℝ.

We want to show right now an important result concerning definition 17 which will be useful later on in this chapter:

Lemma 8 (Impulse antiderivative is invertible). Let g : ℝ → ℝ be a formatting impulse. Then its antiderivative G : ℝ → ℝ is invertible: ∃G^{−1}.

Proof. The inverse function theorem⁷ states that a continuously differentiable univariate function with nonzero derivative in a certain interval is therein invertible.
In our case, G is the antiderivative of a continuous function, thus it is continuous itself

⁵ Function δ is intended as a generalized function, or distribution: ⟨δ, ϕ⟩ = ϕ(0).
⁶ Compact-support functions are used in distributional calculus.
⁷ As described in (Nijenhuis, 1974), the theorem provides a sufficient condition for a function to be invertible.
FIGURE 2.3: Hash-partitioning of a ring into different segments, one per station. For each segment, a different impulse is used, whose coverage matches the segment's length.

and differentiable by definition of primitive function. Thus we meet the conditions of invertibility. The nonzero-derivative condition is not met by g's definition. However, this does not undermine invertibility: rather, it does not guarantee that the inverse function is also continuously differentiable.

Function f_φ's domain [0, h^M] ⊂ ℝ can be partitioned into N different segments: one hash segment Ξ(s_i) ≡ [h_i, h_i^M] per station s_i. The basic idea is having function f_φ employ impulse g to cover the different segments in the whole hash space [0, h^M] ⊂ ℝ, as shown in figure 2.3. So, for each hash segment [h_i, h_i^M] ⊂ ℝ, impulse g_{h_i,h_i^M,A_i} is considered and employed to calculate f_φ's values falling into that specific segment. For the last segment, relative to s_N, since it crosses the max hash value h^M, we actually need to use 2 different impulses: g_{h_N,h^M,A_{N,1}} and g_{0,h_1,A_{N,2}}. Amplitudes A_1, A_2, …, A_{N,1}, A_{N,2} are sized quantities and their values will be calculated later in this chapter.

Definition 18 (Function f_φ's structure). Let R = (Ω, r, ξ, φ, ψ) be an extended ring and let h_φ ∈ [0, h^M] ⊂ ℝ be the r.v. representing a φ-hash; then its PDF is formally defined as:

f_φ(r) = Σ_{k=1}^{N−1} g_{h_k,h_k^M,A_k}(r) + g_{h_N,h^M,A_{N,1}}(r) + g_{0,h_1,A_{N,2}}(r)

It is important to notice that f_φ is not a regular unconstrained function; it is a PDF, thus it must meet certain requirements. Later on, we will verify that those requirements are actually in place. We can now take equation 2.14 and replace f_φ with its definition:

∫_{h_i}^{h_i^M} f_φ(r) dr = ∫_{h_i}^{h_i^M} g_{h_i,h_i^M,A_i}(r) dr = N^{−1}, ∀i = 1 …
N − 1    (2.15)

In case the last station s_N is considered, then 2.14 becomes:

∫_{Ξ(s_N)} [g_{h_N,h^M,A_{N,1}}(r) + g_{0,h_1,A_{N,2}}(r)] dr = ∫_{h_N}^{h^M} g_{h_N,h^M,A_{N,1}}(r) dr + ∫_{0}^{h_1} g_{0,h_1,A_{N,2}}(r) dr = N^{−1}    (2.16)

Since g has an antiderivative as per definition 17, we can proceed further in both equations:

[G_{h_i,h_i^M,A_i}(r)]_{h_i}^{h_i^M} = N^{−1}, ∀i = 1 … N − 1    (2.17)

And:
[G_{h_N,h^M,A_{N,1}}(r)]_{h_N}^{h^M} + [G_{0,h_1,A_{N,2}}(r)]_{0}^{h_1} = N^{−1}    (2.18)

Equations 2.17 and 2.18 are the closed-form constraints we have just derived from equation 2.14.

Remark (Solutions of equations 2.16 and 2.18). Regarding the last station's hash segment, we have to use 2 different impulses whose amplitudes A_{N,1} and A_{N,2} can be sized via equations 2.16 and 2.18. Those equations provide a possibly infinite set of solutions where both amplitudes are interdependent. Among the possible ones, we choose to have each impulse cover half of the target value:

∫_{h_N}^{h^M} g_{h_N,h^M,A_{N,1}}(r) dr = [G_{h_N,h^M,A_{N,1}}(r)]_{h_N}^{h^M} = 1/(2N)    (2.19)

And:

∫_{0}^{h_1} g_{0,h_1,A_{N,2}}(r) dr = [G_{0,h_1,A_{N,2}}(r)]_{0}^{h_1} = 1/(2N)    (2.20)

Remark (Requirements on impulse). In definition 17, we required formatting impulse g to have an antiderivative G whose definition is known. This assumption is pretty strong but not essential. As we could see from the calculations so far, equations 2.15 and 2.16 can actually be used to size the value of the impulse's amplitude by using an alternative method to exact integration. Throughout the rest of our analysis, equations 2.17 and 2.18 will always be referred to as the preferred amplitude sizing method; however, it will always be implicitly intended that equations 2.15 and 2.16 can replace them.

The process of designing f_φ is complete, as the equations above can be used to calculate all impulse amplitudes A_1, A_2, …, A_{N,1}, A_{N,2}:

Proposition 11 (Defining function f_φ). In order to reach load balancing in the ring, the following operations are considered:
1. Formatting impulse g_{h_i,h_i^M,A_i} : [h_i, h_i^M] ⊂ ℝ → [0, A_i] ⊂ ℝ is defined for each station s_i ∈ Ω.
2. Function f_φ is designed as per definition 18 by summing all impulses together.
3. Function f_φ will be parametric on the set of impulse amplitudes, hence its input space will be ℝ^{N+2}: f_φ(A_1, …, A_{N,1}, A_{N,2}, r) with r, A_k ∈ ℝ, A_k > 0, ∀k = 1 … N.
4.
For each impulse, the corresponding antiderivative:

G_{h_i,h_i^M,A_i}(r) = ∫_{0}^{r} g_{h_i,h_i^M,A_i}(x) dx  (2.21)

is calculated.
5. For each impulse, by means of equations 2.17 and 2.18, the corresponding amplitude A_i is calculated.

Now that we know f_φ's formal definition, we need to verify that such an expression meets the constraints of a PDF:

Theorem 9 (Function f_φ is a regular PDF). Let R = (Ω, r, ξ, φ, ψ) be an extended ring with N = |Ω| stations and let g_{h_i,h_i^M,A_i}: [h_i, h_i^M] ⊂ ℝ → [0, A_i] ⊂ ℝ be the formatting impulse for each station s_i ∈ Ω such that equations 2.17 and 2.18 hold. Then function f_φ, as per definition 18, is a regular PDF.
Proof. The 3 basic properties of PDF functions must be met:

1. f_φ is positive and bounded given the definition of impulse g:

0 ≤ g_{h_i,h_i^M,A_i}(r) ≤ A_i, ∀i = 1 … N, r ∈ ℝ

2. Given its definition, f_φ's domain is the union of all non-overlapping domains of the impulses:

⋃_{k=1}^{N−1} [h_k, h_k^M] ∪ [h_N, h^M] ∪ [0, h_1] ≡ [0, h^M] ⊂ ℝ

Thus the function is 0 outside its definition range:

f_φ(r) = 0, ∀r < 0 ∨ r > h^M ⟹ lim_{r→±∞} f_φ(r) = 0

3. f_φ's area is unitary because of equations 2.17 and 2.18:

∫_{−∞}^{+∞} f_φ(r) dr = ∫_{0}^{h^M} f_φ(r) dr = N · N^{-1} = 1

The following comes as a direct consequence of theorem 9:

Corollary 9.1 (Function F_φ is a regular CDF). Function F_φ: ℝ → ℝ has the following form:

F_φ(r) = ∑_{k=1}^{N−1} G_{h_k,h_k^M,A_k}(r) + G_{h_N,h^M,A_{N,1}}(r) + G_{0,h_1,A_{N,2}}(r)  (2.22)

And it is a regular CDF, namely r.v. h_φ's CDF.

Proof. Immediate by considering r.v. h_φ's CDF's definition:

F_φ(r) = ∫_{−∞}^{r} f_φ(x) dx
= ∫_{−∞}^{r} [∑_{k=1}^{N−1} g_{h_k,h_k^M,A_k}(x) + g_{h_N,h^M,A_{N,1}}(x) + g_{0,h_1,A_{N,2}}(x)] dx
= ∑_{k=1}^{N−1} ∫_{−∞}^{r} g_{h_k,h_k^M,A_k}(x) dx + ∫_{−∞}^{r} g_{h_N,h^M,A_{N,1}}(x) dx + ∫_{−∞}^{r} g_{0,h_1,A_{N,2}}(x) dx
= ∑_{k=1}^{N−1} [G_{h_k,h_k^M,A_k}(x)]_{−∞}^{r} + [G_{h_N,h^M,A_{N,1}}(x)]_{−∞}^{r} + [G_{0,h_1,A_{N,2}}(x)]_{−∞}^{r}

Where G_{h_k,h_k^M,A_k} is defined according to equation 2.21. Given the impulse's definition, its antiderivative and its binding to hash segments as per definition 18, we know that the following holds:

lim_{r→−∞} g_{r_1,r_2,A}(r) = 0 ∧ r_1, r_2 ≥ 0 ⟹ lim_{r→−∞} G_{r_1,r_2,A}(r) = 0
Hence, leading to the following result:

[G_{r_1,r_2,A}(x)]_{−∞}^{r} = G_{r_1,r_2,A}(r) − lim_{x→−∞} G_{r_1,r_2,A}(x) = G_{r_1,r_2,A}(r), ∀r ∈ ℝ

Which leads us to equation 2.22. Finally, theorem 9 has proved that f_φ is a regular PDF, which completes the proof.

Designing r.v. h_φ

Although proposition 11 describes the procedure for calculating f_φ, it does not provide a way to build r.v. h_φ so that it generates hashes according to that function. This problem is known in the literature as random variable generation⁸. By employing this technique, we are able to get the algorithm for calculating φ-hashes which allows us to balance load in the ring. For the sake of completeness, the proof of this process is described below:

Theorem 10 (R.v. generation via inverse transform). Let F: ℝ → [0, 1] ⊂ ℝ be a continuous invertible function meeting the characteristics of a CDF:

1. Bounded in [0, 1]: 0 ≤ F(r) ≤ 1, ∀r ∈ ℝ.
2. lim_{r→−∞} F(r) = 0.
3. lim_{r→+∞} F(r) = 1.
4. Monotone increasing: r_1 < r_2 ⟹ F(r_1) ≤ F(r_2), ∀r_1, r_2 ∈ ℝ.

Let U ∈ ℝ be a continuous uniformly distributed r.v. over [0, 1] ⊂ ℝ. Define X ∈ ℝ as a r.v. such that the following transformation holds:

X = F^{-1}(U)  (2.23)

Where F^{-1}: [0, 1] ⊂ ℝ → ℝ denotes F's inverse function. Then X is distributed as F:

Pr{X ≤ x} = F(x), ∀x ∈ ℝ  (2.24)

Proof. We need to prove that equation 2.23 causes r.v. X to be distributed according to CDF F. Starting from equation 2.24, we need to prove the following:

Pr{F^{-1}(U) ≤ x} = F(x), ∀x ∈ ℝ

Since F is invertible, the function is both injective and surjective. Also, F is, by hypothesis, a continuous function. Thanks to those 2 conditions, we can apply F to both sides of the inequality under the probability sign:

Pr{F(F^{-1}(U)) ≤ F(x)} = F(x) ⟹ Pr{U ≤ F(x)} = F(x), ∀x ∈ ℝ

The sign of the inequality is left unchanged because F is monotone increasing.
Let now a = F(x); as x ranges over ℝ, a ranges over [0, 1], because F is a CDF and emits values in that interval. So the previous equation becomes:

Pr{U ≤ a} = a, ∀a ∈ [0, 1] ⊂ ℝ

⁸ As described in (Haugh, 2004), r.v. transformation via inverse transform is a known technique which makes it possible to define a random variable by using the inverse of its CDF.
But we have that Pr{U ≤ a} = F_U(a). U is, by hypothesis, a continuous uniformly distributed r.v. over [0, 1], and the previous equation is exactly r.v. U's CDF's formal definition, which proves the thesis.

Theorem 10 answers our question about the implementation of hash function φ. When considering a ring (Ω, r, ξ, φ, ψ), we can use hash function ξ to build hash function φ in order to reach balancing. Before detailing this process, we need to make sure we meet all conditions defined by the theorem above:

Lemma 11 (Function F_φ is invertible). Let f_φ be r.v. h_φ's PDF defined as per proposition 11. Let F_φ be its CDF. Then F_φ is invertible and we will indicate with F_φ^{-1} its inverse.

Proof. The proof is almost immediate. We consider F_φ's definition, as per corollary 9.1, and note that it is basically built up from the different impulse antiderivatives G_i = G_{h_i,h_i^M,A_i} with i = 1 … N. It is possible to invert a function piecewise; given F_φ's structure, we can prove its invertibility by proving that each impulse antiderivative G_i is itself invertible. Thanks to lemma 8, every impulse antiderivative is actually invertible, and this proves the thesis.

The process for achieving this result is as follows:

Proposition 12 (Hash function φ's implementation). Let R = (Ω, r, ξ, φ, ψ) be an extended ring. In order to build hash function φ, the following operations must be performed at ring initialization time:

1. For each station s_i ∈ Ω, compute its hash identifier h_i = ξ(s_i) and sort all ID hashes by increasing value: h_1, h_2, …, h_N.
2. Define a formatting impulse g(r, r_1, r_2, A) to use, parametric in (r_1, r_2, A) ∈ ℝ³. It is possible to use one impulse definition for all stations or a different definition for each.
3. Bind each station's associated impulse g_{r_1,r_2,A} to the station's hash segment Ξ(s_i) = [h_i, h_i^M], thus obtaining an impulse g_i = g_{h_i,h_i^M,A_i} parametric in amplitude A_i.
4.
Use sizing equations 2.17 and 2.18 to compute, for each impulse g_i, the value of amplitude A_i which allows balancing to be achieved, thus computing an array of fully qualified impulses: g_1, g_2, …, g_{N−1}, g_{N,1}, g_{N,2} (no longer parametric).
5. Build PDF function f_φ as per definition 18.
6. Compute CDF function F_φ as per corollary 9.1.
7. As per theorem 10, compute φ as:

φ(·) = F_φ^{-1}((2^l − 1)^{-1} · ξ(·))  (2.25)

Equation 2.25 represents hash function φ's formal definition. We will refer to this equation as the: Balancing Equation.
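The steps of proposition 12 can be sketched end to end for the special case of rectangular formatting impulses, for which F_φ is piecewise linear and therefore invertible by simple per-segment interpolation. This is an illustrative sketch, not the thesis' implementation; all function names are ours, and the station identifiers are those of the worked example in section 2.4.

```python
import bisect

def build_phi(station_ids, h_max, l_bits):
    """Sketch of proposition 12 for rectangular formatting impulses:
    F_phi is then piecewise linear, so its inverse is a per-segment
    linear interpolation. `station_ids` are the sorted hash IDs h_1..h_N."""
    h = sorted(station_ids)
    N = len(h)
    # F_phi breakpoints: F(0) = 0, F(h_1) = 1/2N, F(h_k) = 1/2N + (k-1)/N,
    # F(h_max) = 1 (the phi-segment boundaries visible in figure 2.4).
    bp = [0.0] + [float(x) for x in h] + [float(h_max)]
    cv = [0.0] + [1 / (2 * N) + k / N for k in range(N)] + [1.0]

    def F_inv(u):
        j = min(bisect.bisect_right(cv, u), len(bp) - 1) - 1
        t = (u - cv[j]) / (cv[j + 1] - cv[j])   # linear inside segment j
        return bp[j] + t * (bp[j + 1] - bp[j])

    # Balancing equation 2.25: phi = F_inv((2^l - 1)^-1 * xi).
    return lambda xi_hash: F_inv(xi_hash / (2 ** l_bits - 1))

# Demo with the example ring of section 2.4 (N = 6, l = 10, h_M = 1023):
h = [101, 210, 340, 553, 701, 998]
phi = build_phi(h, 1023, 10)

def station(r):
    """Route a hash to the station owning the segment it falls into."""
    if r < h[0] or r >= h[-1]:
        return 5                                # last station wraps around
    return bisect.bisect_right(h, r) - 1

counts = [0] * 6
for x in range(1024):                           # every possible 10-bit hash
    counts[station(phi(x))] += 1
print(max(counts) - min(counts) <= 2)  # True: near-perfect balance
```

Feeding every possible ξ-hash through φ assigns each station an almost identical number of hashes, whereas routing the raw ξ-hashes would load each station proportionally to its segment length.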
FIGURE 2.4: Hash segments mapped onto φ segments illustrating how hash function φ works. The top part of the diagram shows the φ hash-space (the interval [0, 1] split into equal φ-segments of width 1/N, with half-width segments of 1/(2N) at the two ends), while the bottom part shows the ξ hash-space (segments of widths h_1, h_2 − h_1, …, h^M − h_N).

2.3.3 Understanding how φ works

Equation 2.25 allows us to balance the ring. Before moving on, we would like to point out a few important facts regarding the balancing equation to better understand how it works.

Figure 2.4 clearly demonstrates the basic principle behind φ: the balancing application basically maps an unevenly distributed space (regular hashes, that is ξ hashes) onto an evenly distributed space (φ hashes). The even partitioning of the φ space is based on the number of stations N. We want to remark that, given its definition (equation 2.25), hash function φ works on a normalized domain spanning the interval [0, 1] ⊂ ℝ. This interval is subdivided into equal segments, each one assigned to a station. We will call them φ-segments, and we will use the term φ-coverage, later on, to indicate the same concept.

2.4 Ring balancing example

For the sake of completeness, we are going to provide an example of how to build hash function φ for a simple small ring consisting of a few stations. Since our purpose here is to provide a realistic, easy-to-understand scenario, we are going to consider the following conditions:

1. The ring will be made of N = 6 stations.
2. Leaf set radius is the minimum: r = 1.
3. We are going to consider extremely short hashes with l = 10 bits. This means that h^M = 2^{10} − 1 = 1023.

Remark. Hash function ξ will be considered but not defined, as we are not going to physically use it in our calculations.

We will now follow proposition 12's prescriptions.
Station      Hash identifier   HS (Ξ(s_i) ⊂ ℕ)              HS (Ξ(s_i) ⊂ ℝ)
s_1 = St. 1  h_1 = 101         {101 … 209}                   [101, 210)
s_2 = St. 2  h_2 = 210         {210 … 339}                   [210, 340)
s_3 = St. 3  h_3 = 340         {340 … 552}                   [340, 553)
s_4 = St. 4  h_4 = 553         {553 … 700}                   [553, 701)
s_5 = St. 5  h_5 = 701         {701 … 997}                   [701, 998)
s_6 = St. 6  h_6 = 998         {998 … 1023} ∪ {0 … 100}      [998, 1023] ∪ [0, 101)

TABLE 2.1: Values of hash identifiers and hash segments for each station in the example.

2.4.1 Defining the ring

We must first define ring R = (Ω, r, ξ, φ, ψ). Remember that hash function φ is the last quantity we will define. Each station s_i ∈ Ω must first define an identifier and compute hash function ξ on it in order to calculate its hash identifier h_i. As shown in table 2.1, once each station receives its hash identifier, hash segments are defined so that routing is possible in the ring.

2.4.2 Defining the formatting impulse

According to the second point of proposition 12, we must move on to defining the formatting impulse to use in order to achieve balancing. We can choose to either define one impulse type for all stations or a different impulse type for each one of them. We are going to choose the first option for 2 reasons:

1. Choosing one single impulse type is easier from a computational point of view, as it requires formulating its parametric antiderivative only once.
2. Choosing more impulse types has not proved, so far, to be any more beneficial than using a single one. The quality of the balancing is not impacted by this choice⁹.

For the sake of simplicity, we are going to consider a very simple impulse type, the rectangular impulse:

g_{r_1,r_2,A}(r) = A · Π((r − r_1)/(r_2 − r_1))  (2.26)

Having:

Π(r) = { 1  if 0 ≤ r ≤ 1
         0  otherwise

The impulse we have chosen in equation 2.26 is compliant with definition 17. Since the proof is immediate, we will not cover it.

⁹ This is based on observations from simulations run so far.
No formal proof supports or denies this claim, though.
2.4.3 Binding impulses to stations

Now we need to bind every station's HS to an impulse in order to get a collection of impulses, all parametric with respect to their amplitudes:

s_1 ⟹ g_1 = g_{101,210,A_1}
s_2 ⟹ g_2 = g_{210,340,A_2}
s_3 ⟹ g_3 = g_{340,553,A_3}
s_4 ⟹ g_4 = g_{553,701,A_4}
s_5 ⟹ g_5 = g_{701,998,A_5}
s_6 ⟹ g_{6,1} = g_{998,1023,A_{6,1}}, g_{6,2} = g_{0,101,A_{6,2}}

Remark (Impulse for last station). The last station in the ring gets special treatment because it may span two hash intervals, since it includes both the highest hash h^M and the lowest one (the null hash). Thus 2 impulses are actually used.

2.4.4 Calculating amplitudes

At the moment, we have a collection of N + 1 = 7 impulses, all parametric with respect to amplitudes. We need to find the values of those amplitudes so that the impulses define f_φ in a way that lets hash function φ balance the load in the ring. Equations 2.17 and 2.18 will be used to size those impulses. Since the sizing equations require the computation of the impulse's antiderivative, we need to calculate it first. Given its extremely simple definition, the calculation is almost immediate:

G_{r_1,r_2,A}(r) = ∫_{0}^{r} g_{r_1,r_2,A}(x) dx = A · [Π((r − r_1)/(r_2 − r_1)) · (r − r_1) + (r_2 − r_1) · H(r − r_2)]

Where H(r) is Heaviside's step function¹⁰:

H(r) = { 1  if r > 0
         0  otherwise

In order to apply equations 2.17 and 2.18, we need to calculate the following quantity (we will do the binding to hash segments later):

[G_{r_1,r_2,A}]_{r_1}^{r_2} = [A · [Π((r − r_1)/(r_2 − r_1)) · (r − r_1) + (r_2 − r_1) · H(r − r_2)]]_{r_1}^{r_2} = A · (r_2 − r_1)

We can now apply equation 2.17 to g_1 … g_5:

¹⁰ Heaviside's function exists in the literature in different forms; here we consider the variation where the function assumes only 2 values: 0 and 1.
[G_{h_k,h_k^M,A_k}]_{h_k}^{h_k^M} = N^{-1} ⟹ A_k = (h_k^M − h_k)^{-1} · N^{-1}, ∀k = 1 … 5

The same goes for g_{6,1} and g_{6,2}, where we apply equations 2.19 and 2.20:

[G_{h_N,h^M,A_{N,1}}]_{h_N}^{h^M} = 1/(2N) ⟹ A_{N,1} · (h^M − h_N) = 1/(2N) ⟹ A_{N,1} = 1/(2(h^M − h_N) · N)
[G_{0,h_1,A_{N,2}}]_{0}^{h_1} = 1/(2N) ⟹ A_{N,2} · h_1 = 1/(2N) ⟹ A_{N,2} = 1/(2h_1 · N)

We now have the values of all impulses:

A_1 = (h_2 − h_1)^{-1} · N^{-1} = (210 − 101)^{-1} · 6^{-1} = 654^{-1}
A_2 = (h_3 − h_2)^{-1} · N^{-1} = (340 − 210)^{-1} · 6^{-1} = 780^{-1}
A_3 = (h_4 − h_3)^{-1} · N^{-1} = (553 − 340)^{-1} · 6^{-1} = 1278^{-1}
A_4 = (h_5 − h_4)^{-1} · N^{-1} = (701 − 553)^{-1} · 6^{-1} = 888^{-1}
A_5 = (h_6 − h_5)^{-1} · N^{-1} = (998 − 701)^{-1} · 6^{-1} = 1782^{-1}
A_{6,1} = (h^M − h_6)^{-1} · (2N)^{-1} = (1023 − 998)^{-1} · 12^{-1} = 300^{-1}
A_{6,2} = h_1^{-1} · (2N)^{-1} = 101^{-1} · 12^{-1} = 1212^{-1}

2.4.5 Computing functions

Now that all impulses have been properly sized and we have their values, function f_φ is fully defined. As a direct result, function F_φ is also fully defined. Given its simplicity, F_φ can be easily inverted piecewise.
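The amplitude values above can be cross-checked numerically. The sketch below assumes the rectangular impulse of equation 2.26 and its antiderivative; the helper names are ours.

```python
def G(r, r1, r2, A):
    """Antiderivative of the rectangular impulse g_{r1,r2,A} (eq. 2.26)."""
    if r < r1:
        return 0.0
    if r > r2:
        return A * (r2 - r1)
    return A * (r - r1)

h = [101, 210, 340, 553, 701, 998]   # hash identifiers of the example
h_max, N = 1023, 6

# Sizing equation 2.17 for stations 1..5: A_k = ((h_{k+1} - h_k) N)^-1.
A = [1.0 / ((h[k + 1] - h[k]) * N) for k in range(5)]
# Equations 2.19/2.20 for the last station's two impulses:
A61 = 1.0 / (2 * (h_max - h[5]) * N)
A62 = 1.0 / (2 * h[0] * N)

assert [round(1 / a) for a in A] == [654, 780, 1278, 888, 1782]
assert round(1 / A61) == 300 and round(1 / A62) == 1212

# Total area of f_phi must be 1 (theorem 9):
area = sum(G(h[k + 1], h[k], h[k + 1], A[k]) for k in range(5)) \
     + G(h_max, h[5], h_max, A61) + G(h[0], 0, h[0], A62)
print(round(area, 12))  # 1.0
```

The five full-width impulses each contribute an area of 1/6 and the two half-impulses contribute 1/12 each, so f_φ integrates to 1 as theorem 9 requires.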
31

Chapter 3

Simulation results

In this chapter we are going to describe the simulations which were performed in order to validate, from a practical standpoint, the results analytically achieved in chapter 2.

As a generic overview, two different simulation systems were designed and developed:

• Regular simulations A high-level engine developed in Matlab¹ and Mathematica² targeting small-size simulations in order to produce real data to validate the whole system.
• High performance simulations A low-level engine developed in C/C++ and targeting large-size simulations in order to produce high-fidelity data to validate the system in real-life conditions.

Both solutions were used to prove that the analytical results of chapter 2 do provide a valid description of the system's behaviour.

3.1 Small-size simulations

This simulation engine was developed to generate results in the context of a controlled environment where conditions are similar to those in real life. The main features of this simulation set are:

• Functional definition of impulses and functions.
• Real hashes are calculated using the standard Crypto library³.
• All big integers are normalized into a smaller interval.

Of course, given its nature, the engine comes with some limitations and some downsides too:

• Even though real hashes are used, their values are normalized to fit a smaller interval. Thus, these values cannot be considered high fidelity.
• Simulations are slow. Given the application of functional calculus, impulse functions and their antiderivatives are defined in open form, thus requiring numerical integration to be performed every time.

¹ Mathworks Matlab https://www.mathworks.com/products/matlab.html.
² Wolfram Mathematica https://www.wolfram.com/mathematica/.
³ The OpenSSL library was used to compute regular hashes. More information available in appendix A.
• Given the different subsystems being used, numerical accuracy is not guaranteed.

On one side, this set of simulations is characterized by a relatively easy implementation; thus it comes with certain intrinsic limitations (mainly related to the subsystems being used). The other set of simulations is meant to target those issues and provide better numerical fidelity.

FIGURE 3.1: The Polar Hash Coverage Plot (PHCP) of a simulation on an N = 10 station ring after sending m = 10³ packets ("PHCP non-balanced case" on the left, "PHCP in balanced case" on the right). Both plots show the configuration of the station hash segments together with the final load levels at the end of the simulation. The plot on the left refers to a normal ring (hash function ξ applied), the one on the right refers to an extended ring where hash function φ based on the same ξ is considered. The same packets were sent in both rings.

3.1.1 Verifying load balance

One of the most basic sets of simulations is used to verify that the algorithm effectively helps the ring achieve load balance given different hash segment distributions among stations in the hash range interval [0, h^M]. These simulations perform the following operations:

1. The hash space is divided into N random parts, each assigned to one station.
2. A total number of m packets (random numeric vectors) are generated and fed to the hash function, which can be either ξ or φ (both are considered in order to compare loads per station at the end of one simulation).
3. Packets are assigned to stations according to routing function ψ based on the selected hash function.
4. Final results are collected: the number of packets per station is tracked.
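The four steps above can be sketched as a toy simulation loop. SHA-256 (truncated) stands in for the unspecified hash function ξ, and all names, sizes and seeds below are illustrative assumptions of ours, not the engine's actual parameters.

```python
import hashlib
import random

def simulate(N, m, l_bits=32, seed=7):
    """Toy version of the small-size simulation loop: random hash
    segments, m random packets, routing by xi-hash only (the
    unbalanced baseline)."""
    rng = random.Random(seed)
    h_max = 2 ** l_bits - 1
    # Step 1: split the hash space at N random points (station identifiers).
    ids = sorted(rng.randrange(h_max) for _ in range(N))
    loads = [0] * N
    # Steps 2-3: generate random packets, hash them, route each one
    # to the station owning the segment its hash falls into.
    for _ in range(m):
        pkt = rng.randbytes(64)
        xi = int.from_bytes(hashlib.sha256(pkt).digest()[: l_bits // 8], "big")
        if xi < ids[0] or xi >= ids[-1]:
            k = N - 1                  # last station wraps around
        else:
            k = next(i for i in range(N - 1) if ids[i] <= xi < ids[i + 1])
        loads[k] += 1
    # Step 4: final result, packets per station.
    return loads

loads = simulate(N=6, m=3000)
print(sum(loads))  # 3000
```

Because the baseline routes by ξ only, each station's load is roughly proportional to its segment length, which is exactly the imbalance the φ runs are compared against.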
FIGURE 3.2: Load state |s_k| (in blue) in each station s_k as time grows, for stations St. 1 through St. 9. In this simulation, hash function ξ is used (normal ring). The green line shows the expected load state (uniform) for each point in time.

Definition 19 (Polar Hash Coverage Plot (PHCP)). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations. The Polar Hash Coverage Plot is a set of N vectors in the 2D space:

E = { A_k · e^{i·ω_k}, k = 1 … N }

Every vector's amplitude indicates the packet load relative to the station it refers to, while the phase indicates the station's hash segment amplitude and position in the ring:

A_k = η_k / m
ω_k = |Ξ(s_k)| / h^M + ω_{k,0}

Where ω_{k,0} = ∑_{j=0}^{k−1} ω_j indicates the phase shift due to all stations preceding s_k.

Figure 3.1 shows the PHCP of the same simulation in which the same m packets have been sent to the network, with and without load balancing hash function φ in place. As it is possible to see, the vectors in the second plot (on the right) have roughly the same amplitude in comparison with the first diagram (on the left), indicating that hash function φ is effectively able to provide balancing on the same set of packets across the stations in the ring.

3.1.2 Evaluating load levels per station

Another set of simulations is used to measure the difference between the final load state in each station and the expected one (uniform) after the network has been fed with a certain number of packets.
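Definition 19 can be turned into a small computation. The phase formula below follows our reading of the definition (segment length normalized by h^M, plus the accumulated shift of all preceding stations), which is an assumption; the function name and the toy numbers are ours.

```python
import cmath

def phcp_vectors(loads, seg_lengths, m, h_max):
    """Compute the PHCP vectors of definition 19: amplitude = relative
    load eta_k / m; phase = normalized segment length plus the shift
    accumulated over all preceding stations."""
    vectors, omegas = [], []
    for eta, seg in zip(loads, seg_lengths):
        omega_k = seg / h_max + sum(omegas)   # omega_{k,0} = sum of previous omegas
        vectors.append((eta / m) * cmath.exp(1j * omega_k))
        omegas.append(omega_k)
    return vectors

# Toy usage: 3 stations, 100 packets, h_max = 1023.
E = phcp_vectors([30, 50, 20], [300, 500, 223], 100, 1023)
print(round(abs(E[1]), 9))  # 0.5: the vector amplitude is the relative load
```

In a perfectly balanced ring all vectors have amplitude 1/N, which is why the balanced PHCP in figure 3.1 looks like a set of equal-length spokes.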
FIGURE 3.3: Load state |s_k| (in blue) in each station s_k as time grows, for stations St. 1 through St. 9. In this simulation set (same as in figure 3.1), hash function φ is used (extended ring). The green line shows the expected load state (uniform) for each point in time.

These simulations also have the objective of showing how the network behaves with and without balancing hash function φ in place. These normal vs. extended ring scenarios are important as they allow us to visually assess the work done by φ in reshaping the load distribution in the network. For this analysis to be effective, it is crucial that both scenarios are evaluated on the exact same set of generated packets. To guarantee this condition, when randomly generating packets, the same seed is used when evaluating the normal and the extended ring during one simulation session.

Figures 3.2 and 3.3 show, respectively, the same simulation session first conducted on the normal ring and then again on the same ring but extended (φ in place). As time grows, each station reports its load level. In the normal ring, station load levels do not all meet the expected load level η = m/N. On the other hand, when φ is in place (figure 3.3), all station loads tend to match the expected levels.

Remark (Discrete time). This set of simulations is very important as the load state in each station is evaluated over time. In this context, time is considered discrete and time instants are associated with events. The only event being considered here is the generation of a random packet.
3.2 Large-size simulations This simulation engine was developed for two reasons: getting high fidelity simu- lation data, and providing an initial implementation of the algorithm. As a direct
result, we could deliver the first implementation of the algorithm described in the previous chapter. The main features of this system are the following:

• Being developed in C/C++, the application is very fast at computing regular hashes and performing φ-hash processing.
• Simulations can be run sequentially or in parallel (packet generation).
• The standard Crypto library is used, therefore all generated hashes are real hashes and not simulated quantities.
• Big integers are employed, so no scaling is performed to adapt real data to simulation artifacts, hence providing more fidelity to real scenarios.

Simulation flow In the context of this simulation effort, several computation- and memory-intensive runs have been scheduled on a dedicated pool of servers. A detailed description of the infrastructure being used is available in appendix A; here we provide a brief synopsis of how these simulations work:

1. When the engine starts, an initialization phase sets up memory and other preconditions.
2. Random packets are generated as random bitstreams of specific size. Different sizes can be specified, and during one simulation the size can range in a certain interval.
3. Hashes (using ξ and φ) are computed.
4. Routing of packets is performed for each hash by using application ψ.
5. All results are persisted in memory. Data manipulation is then performed in order to extract the information of interest.
6. Simulation output files are generated.
7. Post-processing is performed by generating diagrams and aggregated quantities from the output files.

3.2.1 Overview

Many simulations have been run, all targeting different network structures and conditions. Before showing results, we need to provide a synopsis of which configurations have been considered in order to understand what was actually simulated. Every simulation run is characterized by the following properties:

• Number of stations in the ring: N.
This parameter directly impacts the size of the network.
• Number of generated packets: m.
• Leaf radius r. For all simulations, the radius is unitary: r = 1.
• Packet size S ∈ {100 Kb, 1 Mb, 3 Mb, 10 Mb}.

Since the same configuration can be run multiple times with different seed values, aggregate properties describing one simulation group/batch include:
• Number of simulations in batch: C ∈ ℕ.
• Overall simulation time of the batch: T ∈ ℝ.

Grouping simulations The simulations conducted in the context of this research can be classified using the parameters described above. Each tile of the original configuration diagram encodes the hash functions computed (ξ, φ, top-left corner), the seed (#-symbol, bottom-left corner), the packet size S, the number of generated packets m and the station count N, plus a PAR marker for the parallel intensive runs. The batches that were run are summarized below:

Type                  Packet size S   Gen. packets m   Stations N   Runs
regular               100 Kb          13.1M            10           30
regular               1 Mb            13.1M            10           10
parallel intensive    1 Mb            89M              30           49
parallel intensive    3 Mb            89M              30           10
parallel intensive    10 Mb           89M              30           1
parallel intensive    1 Mb            10M              50           10
parallel intensive    1 Mb            10M              100          10

For each simulation, a different seed was used, and both regular and φ-hashes were computed.
FIGURE 3.4: Standard deviation σ vs dispersion factor σ²/η of generated ξ-hashes (top row) and φ-hashes (bottom row) during simulation batches (from left to right): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).

3.2.2 Evaluating the variance of hash segment amplitudes

Two pieces of information were of interest and, accordingly, two different types of data were extracted from every simulation:

1. The statistical variation of regular hash values and φ-hash values, in order to see whether patterns exist.
2. The statistical relation between the distribution of hash segment amplitudes and the distribution of φ-hash values. Since more φ-hashes are routed into a specific segment if that segment has a small amplitude, we want to assess whether special patterns arise, in case of high variance in segment amplitudes, when observing φ-hash values.

Figure 3.4 reports possible patterns between variations of regular and φ-hashes. In general we can conclude that φ-hashes have a more localized behaviour, as their variations are more contained than those of regular hashes obtained via hash function ξ. This is expected: if we consider the whole hash space [0, h^M] ⊂ ℝ, hash function ξ has a uniform distribution over that range; on the other side, φ is characterized by a distribution which allocates hashes with different probabilities in different sub-intervals of the overall hash range. This last observation is the main reason why we want to investigate the relation between the variance of segment lengths and the variance of φ-hashes.
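The two quantities plotted in figure 3.4 can be computed as follows. Here the dispersion factor σ²/η is interpreted as variance over mean (an assumption on our part), and the two hash samples are synthetic stand-ins rather than simulation output.

```python
import random
import statistics

def dispersion_stats(hashes):
    """Standard deviation and dispersion factor (variance over mean,
    our reading of sigma^2 / eta) of a sample of hash values."""
    var = statistics.pvariance(hashes)
    return var ** 0.5, var / statistics.fmean(hashes)

# Synthetic stand-ins: xi-hashes uniform over the whole space, phi-hashes
# concentrated in a narrower sub-interval by the balancing map.
rng = random.Random(1)
xi_sample = [rng.uniform(0, 1023) for _ in range(10_000)]
phi_sample = [rng.uniform(400, 600) for _ in range(10_000)]

sd_xi, _ = dispersion_stats(xi_sample)
sd_phi, _ = dispersion_stats(phi_sample)
print(sd_phi < sd_xi)  # True: phi-hashes are more localized
```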
Hashes and segment amplitudes As anticipated, the following questions were of interest with regard to the behaviour of φ-hashes and the distribution of hash segment lengths |Ξ(s_i)|, ∀s_i ∈ Ω:

1. If all stations in the ring are arranged in a way such that the distribution of hash segment lengths is approximately uniform, what behaviour should we expect from φ-hashes?
FIGURE 3.5: Standard deviation of hash segment lengths and standard deviation of φ-hashes during each simulation in batches (from top to bottom): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).

2. If all stations in the ring define very different hash segments (some very wide and some very short), what behaviour should we expect from φ-hashes?

The diagrams in figure 3.5 try to capture such behaviour and describe it from a statistical point of view. Both questions raised above can be mathematically mapped onto one statistical descriptor which, therefore, becomes of high interest in this context: the standard deviation of hash segment amplitudes and of φ-hashes.

By looking at those diagrams, we can observe a very weak trend: the variance of φ-hashes tends to be higher as the variance of hash segment lengths grows. As pointed out, this is classifiable as a pattern only in a very prudent way, as the trend is not immediately evident and there are some cases where such a trend
does not show up. Our conclusion is that a correlation between segment lengths |Ξ(s_i)| and hashes h_φ is probably present; however, more variables are involved and more investigation in this regard is necessary.

FIGURE 3.6: Station loads η_k^{(ξ)} (no balancing) and η_k^{(φ)} (balanced ring) at the end of four N = 30 simulations with different seeds.

3.2.3 Evaluating load levels per station

This set of high performance simulations has been used, of course, to verify the quality of the balancing performed by hash function φ. Figure 3.6 shows station loads in the context of four different simulations with 30 stations. As it is possible to see, packets are balanced across stations and the balancing is evident when comparing loads to simulations where no balancing is performed.

Migration flows

A concept extremely important in the context of these simulations and, more generally, in the context of this research effort, is the following:

Definition 20 (Migration flow ξ-φ). Let R = (Ω, r, ξ, φ, ψ) be a ring and p ∈ P a packet. Let s_i = ψ^{(ξ)}(p) be the station where the packet is routed to by using hash function ξ, and
let s_j = ψ^{(φ)}(p) be the station where the packet is routed to by using hash function φ. The virtual transition that packet p experiences from s_i to s_j is called a migration flow.

FIGURE 3.7: Migration flows in an N = 30 ring.

Definition 20 is the foundation of a differential analysis conducted during all simulations. By collecting all hashes and mapping them to stations, it is possible, at the end of the simulation, to extract all migration flows. There are as many flows as generated packets; the final step is aggregating this information and counting duplicates.
Visualizing migration flows is difficult using tables, thus circo-diagrams4 are employed instead. Figure 3.7 shows migration flows for a simulation with 30 stations and 1M packets generated. The diagram provides a clear description of how packets virtually move from one station to another when hash function φ is used for routing. Thanks to these diagrams it is possible to state the following:
Proposition 13 (Packet migrations). Let R = (Ω, r, ξ, φ, ψ) be a ring; then stations behave in two different ways:
• Wide-coverage stations are more likely to donate packets to other stations.
• Narrow-coverage stations are more likely to accept packets from other stations.
4 Circo-diagrams have been generated using the software Circos: http://circos.ca/. To read these diagrams, see the on-line documentation.
  • 57. 3.2. Large-size simulations 41
It is also worth noticing that all migration flows are localized in adjacent stations when considering one node in the ring. This pattern is interesting because, differently from expectations, the re-arrangement performed by φ does not move packets far from their ξ-selected station.
  • 59. 43 Chapter 4 System API
In this chapter we describe the different interactions the system exposes to the end user for storing and retrieving data, and the protocols used in the ring to provide those services. In chapter 1 we covered the system architecture. As we recall, the end user is able to interact with the storage system in order to take advantage of its services; what happens on the other side, in the ring, is not known to the user. The questions we want to answer are: "What happens when the user sends data to be stored?", "How can the user retrieve data he previously stored?".
The overall system exposes a minimal API consisting of 4 primitives:
1. Store By means of this functionality, the user can transmit a DU and have it persisted in the system. Typically, upon invoking this API, the user receives some data in return, a token, which will be used later for retrieving that same data.
2. Retrieve By invoking this API, the user can retrieve data previously stored in the system. If the operation is successful, the user receives his data in return.
3. Remove The user can decide to remove data he previously stored by invoking this primitive. No data is returned after invoking the API except a status code indicating whether the removal was successful, plus optional additional information (like the amount of total data that was deleted).
4. Update This functionality is used to update existing data. It effectively results in the sequential application of a remove and a store invocation.
We are going to examine in detail the first 2 primitives, store and retrieve, as the others are largely based on this pair.
4.1 Storing data
The API for transmitting a DU and having it persisted in the system requires the user to provide, as input, the byte stream.
An identification token is returned if the call is successful:
t_token store(stream_t& input)
The moment the user invokes store on DU p ∈ P, 2 things happen in sequence:
1. The token is computed by calculating the hash of the input packet: h = ξ(p).
  • 60. 44 Chapter 4. System API
2. DU p's size is considered together with fragmentation threshold c ∈ N. If p exceeds the threshold, |p| > c, then the packet is fragmented into smaller units.
The token is returned to the user in case the storage process is successful. Given the DHT and content addressing, it is possible to retrieve the DU later by using that specific hash.
4.1.1 Packet fragmentation
The fragmentation process is necessary for some important reasons:
• The ring has a high level of control traffic. Given the DHT and routing algorithm ψ, many transmissions occur between contiguous stations in the network. In order to reduce the latency of communications, the network tends to favour quantity over size, thus allowing many packets to be exchanged as long as their size is small enough.
• When a packet reaches a station which is not the final destination, another routing iteration is necessary. This means that another communication must be performed with one of the contiguous stations in the ring. However, if that link is in use, because another packet is being transmitted, then the incoming one must be queued. In order to reduce station traversal times for packets (while hopping because of routing), packet size is kept to a reasonably low level.
• By dealing with small DUs, it is possible to ensure better balancing over time. If data were stored without being broken down into smaller pieces, we could not ensure that units of the same size are stored across stations; this goes against one of the assumptions of our balancing algorithm: all data units have the same size.
As a packet is submitted for storage, the fragmentation process breaks it down into n smaller units:
n = ⌈|p| / c⌉
In case the original DU is fragmented into smaller units, the final returned token is still the hash of the original packet. Later on we will see that, by using the same token, all fragments can be retrieved back.
For this to happen, it is necessary to create fragments in a specific way:
Definition 21 (Packet fragmentation). Given packet p ∈ P and fragmentation threshold c ∈ N (number of bytes), the application ζ : P → 2^P returns the set of fragments {p_1, p_2, . . . , p_n} for input p. Every returned fragment has the following format:
1. The hash of the original packet.
2. The sequence number of the fragment (needed when re-constructing the packet).
3. The hash of this fragment.
4. The data stream (up to c bytes).
As shown in figure 4.1.
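The application ζ of Definition 21 can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 stands in for the document's hash function ξ, and the `Fragment`/`zeta` names are ours, not the thesis' API.

```python
import hashlib
from dataclasses import dataclass

def xi(data: bytes) -> bytes:
    """Stand-in for hash function xi (here: SHA-256)."""
    return hashlib.sha256(data).digest()

@dataclass
class Fragment:
    parent_hash: bytes  # 1. hash of the original packet, xi(p)
    seq: int            # 2. sequence number, for re-construction
    frag_hash: bytes    # 3. hash of this fragment, xi(p_k)
    data: bytes         # 4. up to c bytes of payload

def zeta(p: bytes, c: int) -> list[Fragment]:
    """Application zeta: split packet p into ceil(|p| / c) fragments
    of at most c bytes each, per Definition 21."""
    parent = xi(p)
    chunks = [p[i:i + c] for i in range(0, len(p), c)]
    return [Fragment(parent, k, xi(chunk), chunk)
            for k, chunk in enumerate(chunks, start=1)]

frags = zeta(b"x" * 10, c=4)  # 10 bytes, threshold 4 -> 3 fragments
```

Note how every fragment carries the parent hash, so that all fragments of one packet can later be related back to the single token ξ(p).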
  • 61. 4.1. Storing data 45
[Figure 4.1: frame layout — control info fields ξ(p), seq. k, ξ(p_k), followed by up to c bytes of data.]
FIGURE 4.1: Data unit format.
Remark. The frame format for non-fragmented packets, called whole packets, is the same; however, the first field (parent hash) is null (all zeros) and the sequence number is −1, which is the value that can be inspected to distinguish fragments from whole packets.
4.1.2 Routing
After the fragmentation phase, which may produce no fragments if the original packet's size does not exceed threshold c, every fragment p_k ∈ P is sent to the ring to be routed according to routing function ψ, based on balancing hash function φ.
As anticipated in chapter 1, the ring is never directly accessed by users. Proxies are employed instead. A proxy station serves as an intermediary entity to balance access to the ring and to hide all of the ring's stations from the outside world. When the store primitive is invoked, the system, from the user's computer, sends a store request SReq to one of the known available proxies. When receiving a request, the proxy decides which station of the ring to pick for letting SReq enter the network. The decision is made by a balancing algorithm based on each known station's link usage: the proxy's goal is to avoid overloading one station with incoming traffic.
Once the request reaches one of the stations, algorithm 1 will do the job and guarantee that a station is found for the packet. The system client ensures that all packets undergo the same process. If every fragment is successfully stored, the original packet's hash is returned to the user as a token for retrieving all fragments.
Asynchronous communications
For performance reasons, the best approach is making every communication asynchronous.
It means that when a node (one of the stations, a proxy or the user client node) sends a SReq, it does not keep the connection open, waiting for the final response, until the packet is successfully routed. It is much better to send the request as a datagram transmission: the sender will receive a store response SRes when its request has been processed. Every intermediate node that passes the request forward will wait for its response and, after receiving it, will construct its own response for the node that sent the request to it in the first place. This increases the available bandwidth, as links will not be held for long times.
Employing asynchronous transmissions complicates the communication protocol but allows better performance. One of the complications is represented by the timers which every station has to implement in order to raise an error when a response does not get delivered within a reasonable time (request transmission failure). In a synchronous scheme, timers are handled by the transmission protocol (e.g. TCP/IP) transparently to the caller; in asynchronous scenarios, however, the station has to implement timers on its own for each sent request. Figure 4.2 shows both
  • 62. 46 Chapter 4. System API
[Figure 4.2: sequence diagrams (User, Proxy, Entry station, Dst station) of the SReq/SRes exchanges for store(stream) returning a token, in the synchronous and asynchronous cases.]
FIGURE 4.2: Synchronous vs. asynchronous communication model when storing a single packet.
communication schemes.
As part of the effort of writing tests and simulations for the algorithm, an actual implementation of the ring has been developed in Microsoft .NET using the communication library WCF1. Today it is possible to implement asynchronous transmissions quite easily, as the IT industry has moved in that direction, providing developers with the set of APIs required to implement such protocols.
4.2 Retrieving data
The other side of the story, a little more complicated, is about getting data back. We are going to cover this topic by considering the 2 possible scenarios:
1. Retrieving a whole packet.
2. Retrieving a fragmented packet.
In both cases, the process always starts with the same set of operations: the user has a token received when storing data in the past and utilizes it to retrieve that stream back via the retrieve primitive:
1 Microsoft's Windows Communication Foundation: a library consisting of a collection of highly customizable and flexible network protocols.
  • 63. 4.2. Retrieving data 47
[Figure 4.3: packet info layout — ξ(p), total n, followed by the fragment hashes ξ(p_1) . . . ξ(p_n).]
FIGURE 4.3: Packet info format.
stream_t& retrieve(t_token t)
Retrieving a whole packet
As soon as the user invokes the retrieve primitive, a retrieve request RReq message is built, through the proxy, and routed in the ring. The token is the hash of the original DU, so, by following the DHT retrieval, the request is routed to the destination station. Once there, the station will search its database to find the stored stream. In order to achieve good performance in the packet search process, a dictionary can be used inside every station. Since DUs are saved according to the format shown in figure 4.1, the hash of the stream is always available and can be used for looking up that specific packet when a RReq is routed to a station.
As soon as the stream is retrieved, it can be sent back to the request originator: the end user, who will receive the DU in return from the retrieve call.
Retrieving a fragmented packet
When the packet was fragmented at the time it was stored in the ring, a problem occurs. When the request is sent and reaches the destination station based on the token (the original packet's hash), nothing is found. The original packet has been fragmented and each fragment has a different hash, completely unrelated to the token (since we use cryptographic hashing, there is no way to get the original stream back from the hash).
In order to solve this issue we could store all of the packet's fragments' hashes into the token, which would then become an array of hashes and grow in size. Although this solution might work, it is not desirable: the user should be able to locate all fragments just by holding the original packet's hash. In order to do so, we need to modify the store protocol. After a packet p ∈ P has been fragmented into n units p_k ∈ P, before transmitting them, a packet info unit is constructed:
Definition 22 (Packet info DU).
Given packet p ∈ P such that its size exceeds the fragmentation threshold, |p| > c, a special data unit is built to track information about the packet and all its fragments. The stream contains the following fields:
1. The original packet's hash h = ξ(p).
2. The number of fragments n.
3. The hash of each single fragment p_k (k = 1 . . . n), in order (from first to last).
As shown in figure 4.3.
In the revised store protocol, before sending each single fragment to be stored, thus before calling store on each single fragment, the same primitive is called on the packet info DU, which has been built right after computing all hashes (original packet and its fragments). This initial call routes the packet info to a station by using the original packet's hash.
Thanks to this approach, when retrieving a DU, the first RReq will reach the station where the system will find the packet info. Using that, the system will then
  • 64. 48 Chapter 4. System API
[Figure 4.4: sequence diagram (User, Proxy, Entry station, Dst station, Station) — retrieve(token) first looks up the pkt-info, then loops over all fragments p_k with retrieve(token p_k), finally aggregating them into packet p.]
FIGURE 4.4: Sequence diagram showing the retrieval protocol in case of a fragmented packet.
issue n retrieve calls in order to fetch each single fragment. Later, after getting all streams, the original DU can be built; the order in which fragments are combined is given by the sequence number in each retrieved fragment packet.
As shown in figure 4.4, the process of retrieving and building a stored packet might take some time: not only does the ring size influence this latency, the number of fragments plays a significant role too. It goes without saying that a larger packet requires more time to be fully retrieved.
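The revised store/retrieve protocol can be sketched end-to-end. This is a simplification for illustration, not the thesis' implementation: an in-memory dictionary stands in for the whole DHT (so routing, proxies and SReq/RReq messages are elided), and the names `store_fragmented`/`retrieve_fragmented` are ours.

```python
import hashlib

ring = {}  # stand-in for the DHT: hash -> stored value

def xi(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # stand-in for hash function xi

def store_fragmented(p: bytes, c: int) -> bytes:
    """Store packet p: fragments keyed by their own hash, plus a
    packet-info entry (Definition 22) keyed by the original hash."""
    token = xi(p)
    chunks = [p[i:i + c] for i in range(0, len(p), c)]
    info = [xi(ch) for ch in chunks]   # fragment hashes, in order
    ring[token] = ("pkt-info", info)   # routed via the original hash
    for h, ch in zip(info, chunks):
        ring[h] = ("fragment", ch)
    return token

def retrieve_fragmented(token: bytes) -> bytes:
    """First RReq fetches the packet info; then one retrieve per
    fragment; finally the fragments are aggregated in order."""
    _, info = ring[token]
    return b"".join(ring[h][1] for h in info)

token = store_fragmented(b"hello world", c=4)
assert retrieve_fragmented(token) == b"hello world"
```

The sketch makes the latency argument visible: retrieving a fragmented packet costs one pkt-info lookup plus n fragment lookups, each of which, in the real ring, is a routed RReq/RRes round trip.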
  • 65. 49 Chapter 5 Dynamic conditions
In the previous chapters we have described and analyzed the behaviour of the ring under static conditions.
Definition 23 (Dynamic conditions). Let R = (Ω, r, ξ, φ, ψ) be a ring. We say the network is under dynamic conditions when any of its characterizing elements changes:
1. Stations s_i ∈ Ω. Stations might disconnect or new stations might extend the ring. This possibility also covers the event of stations faulting and going off-line.
2. Leaf-set radius r changes.
3. Any of the connections in the overlay ring changes.
4. Hash function ξ or φ changes.
5. Routing strategy ψ changes.
Static conditions are the opposite of dynamic: the ring does not change and remains the same. So, why do we need to talk about dynamic conditions? Why should the ring change? Ideally, if well designed, the system can be configured with a certain number of stations and a certain radius, and work optimally under static conditions. However, today every system is exposed to dynamic conditions, as many different planned or unplanned events may occur:
1. One station enters a faulty state. This can happen for any reason, like a hardware issue (e.g. hard disk failure, data corruption, etc.) or a software problem (e.g. system failure, emergency system reboot, etc.).
2. Stations can experience network issues. This can cause either a permanent offline state or a temporary one, if machines have a way to automatically recover from these types of failures.
3. More stations are required because the system needs to serve a higher volume of data (planned scale-up).
4. One or more stations need to undergo planned or unplanned maintenance.
5. Security-related issues force some stations to be pulled away from the ring.
Those enumerated above are only a few possibilities. The point here is that a storage system must take into account such circumstances, which are part of the real world of connected systems.
  • 66. 50 Chapter 5. Dynamic conditions
When dynamic conditions are in place, the ring structure and the balancing algorithm described so far need to be revised and modified in order to avoid performance degradation and, in more critical cases, service outage. We are going to examine the following dynamic cases:
• Scalability The ability of the ring to grow or shrink in a flexible way, causing the least possible performance degradation.
– Station join A station joins the ring, causing it to expand.
– Station removal A station is pulled off the ring, causing it to shrink.
• Fault conditions One station experiences internal problems which cause it to be unresponsive.
5.1 Scalability
What happens when a station joins the ring? When such an event occurs, there are a few operations that need to be considered to re-initialize the ring:
1. The new station needs to build its leaf-set in order to identify its successors and predecessors.
2. All nodes in the neighbourhood of the new station must re-arrange their leaf-sets in order to update their successors or predecessors, depending on the leaf-set radius r.
3. Balancing hash function φ must be re-designed, as the ring has now changed. Since we have more stations, we have different hash segments, and this impacts function φ's implementation.
The first 2 operations are infrastructural and can be addressed through well-known protocols currently employed in DHT-style networks; since the problem is nothing new, we are not going to spend more time on it. The 3rd point, though, is a different story, as it poses a new situation inside our network architecture: stations must be synchronized to use a new balancing hash function φ.
Lemma 12 (Balancing hash function φ's outdatedness upon ring scaling). Let R = (Ω, r, ξ, φ, ψ) be a ring with N = |Ω| stations. Consider, at any point in time, one station joining R or being pulled out of it, causing the number of stations to become N′ = N ± 1.
Then hash function φ is no longer suited for balancing the ring.
Proof. Immediate by considering proposition 12. According to it, hash function φ depends on the number of stations in the ring; if that changes, the hash segments affecting F_φ's codomain change too, hence the original hash function φ no longer reflects the state of the network.
The main problem we want to face here is the process of synchronizing stations in the ring, and it consists of:
1. Computing the new balancing hash function φ′.
2. Updating all stations to use the new hash function φ′.
3. Re-arranging packets across stations to restore the balancing state of the ring.
  • 67. 5.1. Scalability 51
The last point is actually crucial. We are going to assume that a station joining the ring comes with no packets stored in it, because any other scenario does not make sense. When the new station s∗ is on-line in the ring, the load distribution changes from:
Σ = (|s_1|, |s_2|, . . . , |s_N|), |s_i| ≈ m · N⁻¹, ∀i = 1 . . . N
to this form:
Σ′ = (|s_1|, |s_2|, . . . , |s∗| = 0, . . . , |s_N|), |s_i| ≈ m · N⁻¹, ∀s_i ∈ Ω \ {s∗}
This implies that the ring is no longer balanced, hence the last point mentioned in the synchronization process introduced earlier, a process which looks more and more expensive as we investigate the challenges introduced by the dynamic conditions just taken into consideration. If the operations required to synchronize the ring get too expensive (time-wise), then the proposed algorithm has a serious scalability issue, as it makes the network adapt poorly under dynamic conditions. Our purpose is, therefore, to understand how expensive it actually is to scale the ring.
Given our analysis so far, we have been able to break the scalability issue down into 2 sub-problems:
1. Updating hash function φ and aligning all stations in the ring to use it.
2. Re-arranging existing stored packets across stations in order to bring the load distribution of the ring back to its balanced state.
We are going to look at these two problems separately and evaluate the final performance impact later.
Conjecture 1 (Scaling overall impact). Let R = (Ω, r, ξ, φ, ψ) be a ring experiencing a scaling process due to one station joining or leaving the network.
• Let τ_φ ∈ R measure the performance impact (latency) of the process of updating hash function φ to φ′ on all stations in the ring.
• Let τ_ψ ∈ R measure the performance impact of the process of re-arranging packets among stations in order to take the ring back to its balanced condition.
• Let τ_S ∈ R measure the overall latency experienced by the system while carrying out the two operations above in order to scale the ring.
We expect the following inequality to hold:
τ_S ≤ τ_φ + τ_ψ (5.1)
Conjecture 1 expresses our expectation that the overall performance impact caused by the two scaling operations is not simply the sum of the latencies introduced by each one of them, as the two operations can be carried out in parallel rather than sequentially. We try to prove this throughout the rest of this chapter.
  • 68. 52 Chapter 5. Dynamic conditions
[Figure 5.1: message layout — header fields MT, h_src, ξ(Data), followed by up to c bytes of data.]
FIGURE 5.1: Message format.
5.1.1 Updating φ
Hash function φ is a global contract in the network.
Definition 24 (Global contract). A variable or, more generally, a piece of information shared by all stations in the ring. The main assumption is that all stations keep an exact copy of the same value.
The protocol we need to design for updating hash function φ to φ′ on all stations is, more generally, a protocol to update a global contract in the network. Since a DHT is designed for distributed scenarios, every condition implying a certain level of centrality makes the system perform worse, and this is the case here. The PA is based on a distributed approach; however, the balancing process is carried out through a global contract, hash function φ, which explains why we should expect this process to be relatively expensive.
Broadcasting in DHT
In order to have a global contract updated, we basically need to transmit a message containing the updated contract in broadcast on the network, because we need to reach every single node. The message that needs to be sent, in terms of the API of the balancing system, is PUM (φ Update Message). The cost of updating φ is equal to the cost of sending a broadcast message in the ring.
Since the broadcast occurs in the context of a network overlay, we need to create a protocol specific to message broadcasting. Generally speaking, we can create a message format which all transmissions between stations in the ring must comply with. The message must contain, at least, the following information:
1. Message Type An enumeration indicating the type of communication (e.g. RReq, RRes, etc.).
2. Source hash The hash of the source station (not strictly needed, but nice to have for performance reasons, as a station receiving a message knows its neighbours and is able to generate the hash of their IP addresses).
3.
Body hash The hash of field Body. This is used for routing the message (destination hash).
4. Body The content to transmit.
A possible implementation of the broadcasting protocol has to work at station level. Since the architecture is distributed, we cannot employ any centralized entity.
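The four-field message just listed can be sketched as a plain record. The field and type names below are ours, chosen for illustration; only the four fields themselves come from the text.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MessageType(Enum):
    """Kinds of communication exchanged between stations."""
    SREQ = auto()   # store request
    SRES = auto()   # store response
    RREQ = auto()   # retrieve request
    RRES = auto()   # retrieve response
    PUM = auto()    # phi Update Message (global contract update)

@dataclass
class Message:
    mtype: MessageType  # 1. Message Type
    source_hash: bytes  # 2. hash of the source station
    body_hash: bytes    # 3. hash of field Body, used as destination hash
    body: bytes         # 4. the content to transmit

# A PUM carrying the serialized new contract (placeholder payload):
msg = Message(MessageType.PUM, b"\x01" * 4, b"\x02" * 4, b"new-phi")
```

A station inspecting `mtype` (or a special destination value, as suggested above) can tell a broadcast PUM apart from ordinary routed traffic.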
  • 69. 5.1. Scalability 53
Algorithm 2 Message broadcasting in the ring
Require: Ring initialized
Require: Station s_i has ID h_i = ξ(s_i)
Require: Station s_i has an associated leaf set Λ(s_i)
Require: Station s_i receives packet p ∈ P from station s_src ∈ Λ(s_i)
Require: Global variable h_p ∈ N is available
Require: Global variable d ∈ {−1, 0, 1} ⊂ Z is available and initially set to 0
1: function BROADCAST(p ∈ P)
2:   h_src ← ξ(s_src)  ▷ Actually computed or taken from message
3:   if h_src < h_i then
4:     d∗ ← −1  ▷ Message from LLS
5:   else
6:     d∗ ← 1  ▷ Message from ULS
7:   end if
8:   if ξ(p) = h_p ∧ d + d∗ = 0 then  ▷ Same message from opposite side of ring
9:     return  ▷ Abort: end condition reached
10:  else if ξ(p) = h_p then  ▷ Duplicate message from same side of ring
11:    return  ▷ Don't send again
12:  end if
13:  h_p ← ξ(p)
14:  d ← d∗
15:  Λ ← ∅
16:  if h_src < h_i then
17:    Λ ← Λ_U(s_i)  ▷ Message from LLS =⇒ send to ULS
18:  else
19:    Λ ← Λ_L(s_i)  ▷ Message from ULS =⇒ send to LLS
20:  end if
21:  for s ∈ Λ do
22:    Send p to s
23:  end for
24: end function
Stations can recognize this type of communication by inspecting the content of the message; a possible solution is using a flag field in the message or, better, a special value in the destination address field1.
Lemma 13 (Message broadcasting complexity). Let R = (Ω, r, ξ, φ, ψ) be a ring with N = |Ω| stations. The cost of transmitting a broadcast message, in the best-case scenario, is:
Θ∗_B = N / (2r)
where Θ∗_B is expressed in number of message hops2.
Proof. Without loss of generality, we indicate with s_A ∈ Ω the station initiating the broadcast transmission in the ring. As soon as a station receives a broadcast message, it consumes the content and then forwards it to the opposite side of its own leaf-set relative to the node it received the message from, as per algorithm 2.
If s_A starts the protocol by sending the message to only one side of its own leaf-set, then
1 Protocols usually use the all-1 string to indicate a broadcast address.
2 A hop, in the scenario of message routing, is a single direct transmission from one node to another.
  • 70. 54 Chapter 5. Dynamic conditions
the maximum number of hops required to cover the whole ring is:
Θ_B = N / r
because one station forwards the message in one go to r neighbours. However, initiator s_A can be smarter and send the message to all nodes in its own leaf-set (both sides). This triggers a symmetric chain on both sides of the ring, thus leading to the thesis, as the best-case scenario is when all messages travel at the same speed and the last transmission occurs at the very opposite side of the ring (the hypothesis is that no delayed transmission occurs).
Finger tables
A well-known routing-enhancing technique, often used in DHTs, is the employment of finger tables. Briefly, it consists in arranging leaf-sets in the ring in such a way that the LLS is empty and the ULS contains all successors in the ring according to the relative position sequence 2^0, 2^1, 2^2, until reaching 2^l. This way of linking stations implies a higher cost from a control point of view, because it takes more time to re-arrange those links at initialization time and when dynamic conditions are in place (e.g. one station joining or pulling off the ring).
That being said, on the other hand, this pattern also ensures better performance from a routing standpoint, hence guaranteeing an even better complexity than the one considered in lemma 13 in message-broadcasting scenarios. So, as we can see, the cost of updating global contract φ is acceptable, and many well-known approaches in the literature can be considered. Therefore, we have no interest in detailing this issue any further.
5.1.2 Load re-arrangement
The part of the scaling cost we are most worried about is the re-arrangement of packets. This operation is not required just from a balancing point of view; there is a more critical aspect which needs to be addressed as soon as one station joins the ring: packet retrieval.
Let us consider a scenario where station s∗ has joined the ring and hash function φ has been updated. Consider that s∗'s predecessor is now station s_i. If no load re-arrangement is performed, then RReq messages targeting a packet p whose hash h_p = ξ(p) is now covered by s∗, h_p ∈ Ξ(s∗), will find nothing, as the packet is actually still stored in s_i, since that station was covering h_p before the ring scaled up.
The question we want to answer is: "How badly is the retrieve primitive impacted by the ring scaling up?". The example we just considered suggests that only a portion of the ring is impacted by the scale-up, so the packet re-arrangement should only occur between 2 stations; however, this is something that needs to be proved.
Theorem 14 (Load re-arrangement upon scale-up by 1 station). Let R = (Ω, r, ξ, φ, ψ) be a ring. Let s∗ ∈ Ω be a station joining the ring, causing hash function φ to be updated on all stations to φ′. Let us also consider station s_i now becoming s∗'s predecessor, so that its hash segment is Ξ(s∗) = [h∗, h_i^M], assuming that h∗ = ξ(s∗). Then the packet re-arrangement effort required to make all packets in the new network retrievable and to re-balance the ring impacts all stations in the network.
  • 71. 5.1. Scalability 55
Proof. Recalling how hash function φ works, as described in section 2.3.3, we need to understand whether the joining of a station causes the φ segments in its domain to change boundaries (see figure 2.4). We can try to visualize the impact on the domain by considering the domain mapping diagram while taking dynamic conditions into consideration. We consider, for simplicity and without loss of generality, s_i = s_1:
[Diagram: domain mapping before and after the join — station hashes h_1, h_2, . . . , h_N (plus h∗ for s∗) in the ξ space [0, h_M], mapped to φ segments of amplitude 1/N at boundaries 1/(2N) + k/N, which after the join shrink to amplitude 1/(N+1) at boundaries 1/(2(N+1)) + k/(N+1), making room for the new φ segment.]
As the diagram shows, additional station s∗ causes only one change in the ξ space (φ's codomain), but it causes all φ segments to resize in order to make room for an additional interval of amplitude 1/(N+1). As we can see, our initial assumption was, unfortunately, not quite right. The whole domain of hash function φ is impacted and, possibly, all packets need to be re-routed according to the new hash function φ′.
Nonetheless, we still lack basic information which could tell us how bad the re-arrangement effort really is:
1. Are packets re-routed to new stations completely unrelated to the original one? Or is there a pattern?
2. Do all packets require re-routing? Is there a percentage of them that remains in their current station when hash function φ transitions to φ′?
These two questions are crucial to evaluate the cost of the re-arrangement effort, so we need more investigation.
Lemma 15 (Station transition direction upon packet re-arrangement). Under the same hypothesis and conditions of theorem 14, any packet p ∈ P stored in any station s_j ∈ Ω of the ring, if moved because of the re-arrangement, is moved either:
• To any of s_j's successors, if s_j precedes s∗.
• To any of s_j's predecessors, if s∗ precedes s_j.
Proof. To show this, we consider the ring in the 2 different configurations (before the scale-up and after). The diagram below shows hash function φ's domain in both conditions (N stations at the bottom and N + 1 at the top), and also plots the location of 2 hashes therein.
  • 72. 56 Chapter 5. Dynamic conditions
[Diagram: φ's domain partitioned for |Ω| = N (bottom: segments of amplitude 1/N at boundaries 1/(2N) + k/N, owned by . . . , s_{k−1}, s_k, s_{k+1}, . . .) and for |Ω| = N + 1 (top: segments of amplitude 1/(N+1) at boundaries 1/(2(N+1)) + k/(N+1), owned by . . . , s_{k−1}, s∗, s_k, s_{k+1}, . . .), with two sample hashes h′ and h″ plotted in both.]
As we can see, hash h′ falls initially in station s_{k−1}, but after the transition it ends up in station s∗'s coverage. In the same way, hash h″ falls initially in station s_{k+1}, but after the transition it ends up in station s_k's coverage. The formulation of the lemma implies that packets can also remain in the same station; this is indeed possible, as the diagram shows regions of the top and bottom hash spaces which have values in common.
We now know that a minimal pattern is present in packet re-routing. However, the information provided by lemma 15 is limited. A more interesting result can be considered but, before that, a new quantity must be introduced:
Definition 25 (Packet's station transition delta). Under the hypothesis of dynamic conditions originating from the ring scaling up by one station, let s_i and s_j be the original station and the new station (after re-arrangement) for any packet p; then quantity ∆(p) ∈ Z represents the number of stations packet p had to be moved across:
∆(p) = i − j, if |i − j| ≤ N/2; sign(i − j) · N − i + j, otherwise.
∆(p) provides information about whether a packet was moved from its original station (∆(p) = 0 when not moved), and also about the direction of the move (∆(p) < 0 or ∆(p) > 0). The value of ∆(p) for each packet is the main subject of the next important result:
Theorem 16 (Packet station transition delta upon scale-up by 1 station). Under the hypothesis and conditions of theorem 14, the transition delta ∆(p) of any packet p ∈ P stored in any station s_i ∈ Ω of the ring is, at most, unitary in absolute value: |∆(p)| ≤ 1.
Proof. The theorem basically states that if a packet is moved, it is moved to one of the 2 directly contiguous stations.
To prove this statement, we re-formulate the thesis using an equivalent definition. In conjunction with lemma 15, we need to prove that:

1. A packet hosted in a station preceding s∗ is re-routed, at most, to its immediate successor.
2. A packet hosted in a station preceded by s∗ is re-routed, at most, to its immediate predecessor.

We will prove the first point; the second follows by considering it as a mirrored condition of the former. Let us consider a linear bounded real space divided into N ∈ N equal parts, each marked with an identifying number k = 1 . . . N. We then let N rise to N + 1: every existing segment shrinks in order to make
space for segment N + 1 which is, therefore, supposed to be added as the last one. This scenario abstracts the condition in which each station's φ coverage is shrunk towards lower φ values due to s∗ joining the ring, in the specific case where all stations being considered are predecessors (down to s1) of s∗.

[Diagram: the unit interval divided into N segments (boundaries at k/N, (k+1)/N, (k+2)/N, each of length 1/N) and into N + 1 segments (boundaries at k/(N+1), . . . , (k+3)/(N+1), each of length 1/(N+1)), with a point a marked.]

Without loss of generality, we consider a point a ∈ [0, 1] ⊂ R and re-express the thesis as follows: "Is it possible to find any combination of a, k and N such that, after the shift from N to N + 1, a falls into a segment further than its original segment's successor?". Formally, this question is stated as follows:

    ∃ a ∈ [0, 1] ⊂ R, N ∈ N, N > 0, k ∈ N, k = 1 . . . N :
        a < (k + 1)/N   ∧   a ≥ (k + 2)/(N + 1)

If this system of inequalities has no solution, then the thesis is confirmed. By developing both inequalities we get the following:

    a − (k + 1)/N < 0
    a − (k + 2)/(N + 1) ≥ 0
  ⟹
    aN − k − 1 < 0
    a(N + 1) − k − 2 ≥ 0
  ⟹
    aN − k − 1 < 0
    aN + a − k − 2 ≥ 0

By isolating N, we get:

    aN < k + 1
    aN ≥ k + 2 − a
  ⟹
    N < (k + 1)/a
    N ≥ (k + 2 − a)/a
  ∨
    −k − 1 < 0
    −k − 2 ≥ 0

The first system arises from dividing both members, in both inequalities, by a; the second covers the case a = 0, for which that division is not allowed. This last system is easily proved to be impossible:

    k + 1 > 0
    k + 2 ≤ 0
  ⟹
    k > −1
    k ≤ −2

Resuming the former system and considering from now on a ∈ (0, 1] ⊂ R, we can develop further and get:

    (k + 2 − a)/a ≤ N < (k + 1)/a  ⟹  (k + 2 − a)/a < (k + 1)/a  ⟹  k + 2 − a < k + 1  ⟹  a > 1

The system has solutions only for a > 1; however, this contradicts our hypothesis a ∈ (0, 1] ⊂ R, thus the system has no solutions within the definition boundaries of a, N and k. We still need to prove the symmetric case of stations that are successors of s∗; however, this can be skipped by considering that such a scenario is the mirror of the one just proved. □
As a direct result, we have the following:
Corollary 16.1 (Packet lookup failure at re-arrangement time). Under the hypothesis and conditions of theorem 14, if packet p ∈ P is not found in station si ∈ Ω while the system is in the process of re-arranging packets, then it will be found in the previous or next node, depending on whether s∗ ≺ si or si ≺ s∗.

Lemma 15, theorem 16 and corollary 16.1 provide the answers to our initial questions. To draw our conclusions: the ring is not perfectly scalable, as all stations need to re-arrange their packets under dynamic conditions; however, the effort is extremely localized in the context of each station.

5.1.3 Scaling overall impact

We now have more information with which to evaluate conjecture 1. Considering the characteristics of the operations of updating hash function φ across stations and redistributing packets, we now understand that they can be executed in parallel. As soon as the joining station computes φ′, it commences the protocol for broadcasting this knowledge in the ring. At the same time, that station can start going through all its packets and evaluating the new hash function on them in order to re-route its DUs. This process can be started in every station the moment φ′ is available and traversing the ring.

That being said, moving packets is more expensive than computing the new hash function or receiving it from other stations, thus the time needed for re-routing DU loads in the network is far higher: τψ ≫ τφ, so the overall scaling time is essentially defined by τψ.

5.1.4 Ring scale-down

All the considerations made so far regarding the ring scaling up can be transferred to the opposite case, where a station leaves the network. A few considerations must be made, though, in relation to this dynamic condition:

• When a station leaves the network, the physical detachment from the other nodes is not performed until all packets are re-routed.
This is crucial, and different in comparison to the scenario of a station joining the ring: here we cannot afford to lose a whole bucket of packets.

• A station leaving the network is not the same scenario as a station abandoning the ring. The former is a controlled process which happens through a specific protocol and requires time; the latter is a sudden event which cannot be controlled, and its nature is described later in this chapter.

5.2 Fault conditions

As anything can happen, stations in the ring might enter anomalous states. The reasons for such a scenario can be many, hardware or software related, and adequate countermeasures can be considered. Nonetheless, when it comes to disaster recovery, it is not so much about all the possible cases we know, but rather about everything we don't know. So, we will now consider the possibility of a station becoming unavailable, and we are not going to ask ourselves why. What we ask instead is: "How do we guarantee data retrieval services and balancing in such conditions?".
[FIGURE 5.2: Multiple hashing mechanism for achieving safe redundancy. Hashes hS, h_S^(1), h_S^(2), . . . are computed and then concatenated to the data stream, hence generating packets p1, p2, p3, . . . ready to be sent.]

When a station goes down, the first issue is infrastructural. If the ring is set to have leaf-set radius r = 1, then we have a problem, as the ring basically breaks apart and messages cannot be routed across stations. Of course, if the radius is higher (r > 1), then no immediate consequences are experienced in terms of message routing. In both cases, the literature on DHT networks provides existing protocols to fix dangling links and isolate the unavailable station; the only difference is that a unitary-radius ring will experience some downtime until links are fixed. This is one of the reasons for which non-unitary-radius rings are more robust to disasters.

The second issue to solve is from a data retrieval perspective. A station went down unexpectedly, thus there was no time to apply any scale-down protocol (in fact, the scenario here is not a station leaving the ring, but a station disappearing from it). The direct consequence is virtual data loss: all packets stored in that station are now unavailable and, when any RReq is sent to the ring targeting one of those DUs, the destination station will not find the packet hash in its database.

It is clear that, to solve this issue, something has to be done before the station goes down. However, we cannot make any assumption on this condition and its timing. So we need to change the data storage protocol to target situations where emergency packet retrieval is needed, as we cannot afford, for any reason, the possibility of data becoming unavailable to users.

In chapter 4 we described the API for storing a packet in the ring. Our intention is to modify the storage protocol (primitive store) in order to save one packet in multiple locations in the ring without losing balancing.
The procedure applies to either packets or fragments; in general, we consider a certain stream of data to be sent for storage:

1. The data stream S to send is processed and its hash computed: hS = φ(S).
2. Another hash is computed, using previously computed hash hS as input: h_S^(1) = φ(hS).
3. The same recursive operation is repeated ℓ ∈ N times, computing several hashes in a chain: h_S^(k) = φ(h_S^(k−1)).
4. ℓ different packets are generated by constructing a frame with the same body (the data stream) but a different associated hash, as per figure 4.1, and then sent to the ring.
[FIGURE 5.3: Packet retrieval session under the hypothesis of one station down. The diagram illustrates how a failed RReq (a lookup of φ(p) returning null and an error RRes) triggers the emergency retrieval process, in which a second retrieve(φ(φ(p))) reaches a different destination station and returns packet p.]

The procedure just described will generate ℓ different copies of the same DU, and they will all be sent to different locations in the ring. Thanks to the Lamport scheme³, we can compute more hashes of the same initial stream and use them as storage keys.

Remark. Generating the first hash hS is potentially expensive, because the input stream can be long (however bounded to a certain level considering fragmentation threshold c). The same cannot be said for the other hashes h_S^(k), because they are computed on another hash (a very short string). So the process of computing the redundant hashes is very cheap.

³ The process of generating the hash of a hash is used today in security-related scenarios in order to generate ephemeral keys. The scheme has been proved to be safe and, when using a secure cryptographic hash function, irreversible.
How can this procedure help us when attempting to retrieve a DU stored in an unavailable station? We consider again the broken scenario from before, where station si suddenly became unavailable:

1. The system tries to retrieve packet p via its hash h = φ(p).
2. The RReq message reaches station si−1, as it now covers the hash segment si held when it was on-line. However, station si−1 cannot find hash h in its database, thus it returns an error in the RRes.
3. The system acknowledges that the first RReq was not successful, so it tries again to retrieve the packet by computing φ(h).
4. The second RReq now reaches another station sj, where the packet is found and returned.

Figure 5.3 illustrates the protocol just described.

5.2.1 Collisions threshold

As we promote the idea of introducing packet redundancy in the network as a means to achieve good levels of disaster recovery, we should be careful to make this effort as efficient as possible, therefore avoiding unnecessary cost. Since we are routing the same packet in the ring with different hashes, we want to make sure the copies do not all end up being routed to the same station. If we generated one copy of a packet and both units were routed to the same station, our effort would be pointless: the moment that station goes down, our emergency retrieval procedure would fail. On the other hand, we don't want to generate too many copies of the same packet, as we would waste precious memory in our stations. How do we find a good balance? Let's start by considering collisions in the ring:

Lemma 17 (Packets collision probability). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations, and let p1 ∈ P and p2 ∈ P be two packets. Then the probability that they collide onto (are routed to) the same station is:

    γ = 1/N    (5.2)

Proof. A collision occurs when p1 and p2 are routed to the same station si ∈ Ω: ψ(p1) = ψ(p2) = si.
The probability of this event can be defined as follows:

    γ = Pr{ψ(p1) = ψ(p2) = si},   ∀p1, p2 ∈ P, ∀si ∈ Ω

We first consider packet p1 routed into the ring to station sk ∈ Ω, and then consider packet p2 being processed: the probability of a collision with p1 is the probability of p2 being routed to sk, considering that sk can be any station of the ring. This calls for the Law of Total Probability:

    γ = Σ_{k=1}^{N} Pr{ψ(p2) = sk | ψ(p1) = sk} · Pr{ψ(p1) = sk}    (5.3)

Since ψ is based on hash function φ, which is based on ξ, which is a cryptographic hash function, consecutive applications of ψ are not interdependent; it means that:

    Pr{ψ(p2) = sk | ψ(p1) = sk} = Pr{ψ(p2) = sk},   ∀p1, p2 ∈ P, ∀sk ∈ Ω
We can rewrite equation 5.3 as follows:

    γ = Σ_{k=1}^{N} Pr{ψ(p2) = sk} · Pr{ψ(p1) = sk}

Recalling theorem 5 and the definition of πk ∈ [0, 1] ⊂ R as the packet-in-station-k probability, we can write our equation as follows:

    γ = Σ_{k=1}^{N} πk · πk = Σ_{k=1}^{N} πk²

We are under the hypothesis of a balanced ring, since hash function φ is applied; so, according to equation 2.12, we have:

    γ = Σ_{k=1}^{N} 1/N² = (1/N²) · Σ_{k=1}^{N} 1 = (1/N²) · N = 1/N

which proves the thesis. □

The follow-up to lemma 17 is calculating the average number of collisions experienced in the ring when sending packets. Remember that sending copies p1 . . . pℓ of packet p does not create a correlation between the different instances being sent. This is due to the fact that we are sending different hashes hS, h_S^(1) . . . h_S^(ℓ), related to each other by the Lamport chain, which actually guarantees that the hashes are not (stochastically) interdependent.

Lemma 18 (Average number of collisions). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations. Then, when generating m ∈ N packets, the average number of collisions experienced between different couples of units is:

    ηγ = (m(m − 1)/2) · (1/N)    (5.4)

Proof. We introduce r.v. y ∈ N counting the number of collisions between couples out of m packets. This variable can range from 0 up to the number of possible combinations of two different packets: |C(m, 2)| = m(m − 1)/2. We also introduce r.v. χ ∈ {0, 1} ⊂ N defined as follows:

    χ(p1, p2) = 1 if a collision occurs between the two packets, 0 otherwise

Remembering that C(m, 2) enumerates all possible combinations of packets (order does not matter), we can define y as follows:

    y = Σ_{(p1,p2)∈C(m,2)} χ(p1, p2)

R.v. y's mean value can then be calculated as:

    ηγ = E[y] = E[ Σ_{(p1,p2)∈C(m,2)} χ(p1, p2) ]
Since operator E[·] is linear, we have that:

    E[ Σ_{(p1,p2)∈C(m,2)} χ(p1, p2) ] = Σ_{(p1,p2)∈C(m,2)} E[χ(p1, p2)]

R.v. χ is discrete and distributed over two values only, so its mean is easily calculated:

    E[χ(p1, p2)] = 1 · Pr{χ = 1} + 0 · Pr{χ = 0} = Pr{χ = 1} = γ,   ∀p1, p2 ∈ P

So, back to r.v. y's mean value:

    ηγ = Σ_{(p1,p2)∈C(m,2)} E[χ(p1, p2)] = Σ_{k=1}^{|C(m,2)|} γ = |C(m, 2)| · (1/N) = (m(m − 1)/2) · (1/N)

proving the thesis. □

Thanks to lemma 18, we can now try to calculate a reasonable value for ℓ and decide how many clones of a packet we should send into the network to ensure an effective level of redundancy.

Theorem 19 (Optimal ℓ). Let R = (Ω, r, ξ, φ, ψ) be a ring with |Ω| = N stations and let p ∈ P be a packet sent with redundancy factor ℓ ∈ N. Then, in order to guarantee that at least 50% of sent packets do not collide, the optimal redundancy factor is: ℓ < ℓopt = N.

Proof. Let β ∈ [0, 1] ⊂ R be the fraction of collisions that we allow on the total number ℓ + 1 of packets (the original packet and its ℓ clones) sent to the network. So, the following must hold:

    |C(ℓ + 1, 2)| · (1/N) < β(ℓ + 1)
  ⟹ |C(ℓ + 1, 2)| < βN(ℓ + 1)
  ⟹ |C(ℓ + 1, 2)| · 1/(ℓ + 1) < βN
  ⟹ (ℓ + 1)!/(2!(ℓ − 1)!) · 1/(ℓ + 1) < βN
  ⟹ (ℓ + 1)ℓ(ℓ − 1)!/(2(ℓ − 1)!) · 1/(ℓ + 1) < βN
  ⟹ ℓ/2 < βN
  ⟹ ℓ < 2βN

which proves the thesis by considering β = 1/2. □
Chapter 6

Conclusions and final notes

Simulations have shown the effectiveness of the balancing performed by the algorithm; together with the use of known distributed architectures (DHT networks), the proposed balancing approach is feasible and potentially employable in real-case scenarios.

6.1 Open issues

The algorithm currently presents some challenges which must be addressed in order to make the architecture more flexible and less costly from a network performance standpoint (traffic and control overhead).

Scalability is the first priority. The analysis performed so far has provided good upper bounds on the cost of scaling up the ring by one station; however, more is to be investigated. More simulations should be run on scaling rings, and a differential analysis must be carried out to identify possible patterns which can be taken advantage of.

6.2 What's next

As a continuation of the effort described in this document, the next action items to focus on are:

1. Improving the C/C++ simulations to target more advanced scenarios.
2. Performing more simulations on very large networks (up to 1000 stations and more) and higher traffic volumes.
3. Developing simulations targeting traffic handling in the ring, in order to get more information about the impact on network performance introduced by the PA.
4. Enriching simulations with more features addressing differential analysis on scaling rings.

The next iteration should focus on collecting more information regarding the performance of the algorithm, with special focus on high-variance conditions in the amplitudes of hash segments. Furthermore, it can be beneficial to evaluate migration flows in scaling scenarios.
Appendix A

C/C++ simulation engine's architecture

The C/C++ simulation engine has been developed with the following technologies:

• Intel's Threading Building Blocks (TBB¹) for parallel packet generation and hash computation.
• The GNU C/C++ compiler.
• Boost² C++ libraries for big integers and other utilities.
• Tina's Random Number Generator (TRNG³) library for randomizers.
• The OpenSSL⁴ cryptographic library for hash computation.
• The Circos⁵ library for circo-diagram generation (migration flows).

Simulation steps. Simulations can run sequentially or in parallel. When running in parallel, a Monte Carlo approach is used so that packet generation and hash computation can be performed much faster. When running a simulation, the following steps are performed:

1. Pre-compilation configuration. Compilation variables are assigned. The engine is based on the STL⁶, and parameters such as the number of stations N and the number of generated packets m are all defined as compile-time constants; thus they need to be set.
2. Compilation. The simulation engine undergoes compilation in order to produce the simulation executables.
3. Post-compilation configuration. Simulation input files are prepared in order to specify hash segments and other network descriptive variables.
4. Execution. Simulations run.
5. Data extraction. Output data is generated in order to get aggregated information and markup files to be used for generating circo-diagrams.

¹ Intel's library for multi-threaded processing. https://www.threadingbuildingblocks.org/.
² Boost libraries. http://www.boost.org/.
³ Random number generator library. https://www.numbercrunch.de/trng/.
⁴ Standard SSL implementation. https://www.openssl.org/.
⁵ Circos. http://circos.ca/.
⁶ The C++ Standard Template Library allows the use of generic types and compile-time constants.
Every simulation generates 3 files:

• A data file tracking hash segments per station and all generated packets, hashes and φ-hashes.
• A table file containing a matrix used by Circos to generate migration flows.
• A karyotype file used by Circos to generate other diagrams (for the future).

Infrastructure. All simulations mentioned in this document have been run against a pool of Intel 4-core machines: HP ProLiant DL180 G6 (64 bit) on CentOS 6 (RHEL).
List of Figures

1.1 Overall system architecture. The end user interacts only with the storage system, while the balancing system is hidden from the user and transparent to the storage system with regards to accessing the server pool. — 7
2.1 An N = 8 network example showing the logical ring topology. Each station is assigned an ID (typically the IP address hash) and packets are routed by content. — 10
2.2 Access to the ring is guarded by proxies. — 11
2.3 Hash-partitioning of a ring into different segments, one per station. For each segment, a different impulse is used; its coverage matches the segment's length. — 21
2.4 Hash segments mapped onto φ segments, illustrating how hash function φ works. The top part of the diagram shows the φ hash-space, the bottom part the ξ hash-space. — 26
3.1 The Polar Hash Coverage Plot (PHCP) of a simulation on an N = 10 station ring after sending m = 10³ packets. Both plots show the configuration of the station hash segments together with the final load levels at the end of the simulation. The plot on the left refers to a normal ring (hash function ξ applied), the one on the right to an extended ring where hash function φ, based on the same ξ, is considered. The same packets were sent in both rings. — 32
3.2 Load state (in blue) |sk| in each station sk as time grows. In this simulation, hash function ξ is used (normal ring). The green line shows the expected load state (uniform) for each point in time. — 33
3.3 Load state (in blue) |sk| in each station sk as time grows. In this simulation set (same as in figure 3.1), hash function φ is used (extended ring). The green line shows the expected load state (uniform) for each point in time. — 34
3.4 Standard deviation vs. dispersion factor of generated ξ-hashes and φ-hashes during simulation batches (from left to right): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations). — 37
3.5 Standard deviation of hash segment lengths and standard deviation of φ-hashes during each simulation in batches (from top to bottom): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations). — 38
3.6 Station loads η_k^(ξ) (no balancing) and η_k^(φ) (balanced ring) at the end of four N30 simulations with different seeds. — 39
3.7 Migration flows in an N30 ring. — 40
4.1 Data unit format. — 45
4.2 Synchronous vs. asynchronous communication model when storing a single packet. — 46
4.3 Packet info format. — 47
4.4 Sequence diagram showing the retrieval protocol in case of a fragmented packet. — 48
5.1 Message format. — 52
5.2 Multiple hashing mechanism for achieving safe redundancy. Hashes are computed and then concatenated to the data stream, hence generating packets ready to be sent. — 59
5.3 Packet retrieval session under the hypothesis of one station down. The diagram illustrates how a failed RReq triggers the emergency retrieval process. — 61
List of Tables

2.1 Values of hash identifiers and hash segments for each station in the example. — 27
Acknowledgements

Thanks to my supervisor, Prof. Eng. O. Tomarchio, for having enough patience to wait a few more years for me to finish this research while I was working in Denmark.

Thanks to Medilink srl, my host company for my master traineeship, during which this research effort was started and completed in its first step. They provided everything I needed (resources, infrastructure) to complete my work.

Thanks to my Team Lead at Microsoft, Horina, for her flexibility and availability, allowing me to submit this work in time.

Graphics and artwork. Icons and graphics in figures created by Katemangostar - Freepik.com.

Last but not least, thanks to all the amazing public libraries in Copenhagen which have hosted me and my work during the many weekends spent on this thesis.