Utilisation of ℓ-Diversity and Differential Privacy in the
Anonymisation of Network Traces
Shankar Lal
Aalto University, Finland
shankar.lal@aalto.fi
Ian Oliver, Yoan Miche
Security Research
Nokia Networks, Finland
first.last@nokia.com
Abstract: Noise addition is a known technique for increasing the privacy of a data
set through anonymisation. However, such techniques are often presented individually
and independently, or merely stated as techniques to be applied. This increases the
danger of their misapplication, resulting in an anonymised data set that is open to
relatively easy re-identification or reconstruction. To better understand the application
of these techniques, we demonstrate their use in a specific domain - that of
network trace anonymisation.
1 Introduction
Privacy and especially anonymisation of data sets is a hot topic. It, therefore, comes as
no surprise that data sets such as network traces which contain large amounts of sensitive
information about the behavior of users on a network require such treatment.
Techniques such as suppression, hashing and encryption of fields suffice up to a point. In
the case of suppression, information is lost, while with hashing or encryption the
information content is transformed from, say, an IP address which identifies a particular
machine into some kind of generic identifier. In many cases a pattern of behavior
remains recognizable: for example, hashing source and target IP addresses still reveals a
unique pattern of communication, if not the precise identities [1] [2].
More advanced techniques such as κ-anonymity [3], ℓ-diversity [4] and differential privacy
[5] (amongst others) have been developed; κ-anonymity in particular has been successfully
used with medical data. These techniques are now being recommended, if not mandated,
for use in the process of anonymisation.
In this paper, we present techniques for anonymising network traces that preserve some de-
gree of statistical properties such that some meaningful analysis can still be made. Working
in this specific domain means that we can carefully tailor techniques such as differential
privacy such that a reasonable degree of privacy is assured.
2 Network trace files
A network trace file contains sensitive fields such as the source and destination IP addresses,
protocol type, packet lengths, etc. Some of these can further act as quasi-identifiers, whose
combination can lead to the identification of an individual despite the seemingly identifying
fields having been removed or anonymised [6]. The source/destination IP address and time-stamp
fields can disclose who is talking to whom, and also prove that communication
existed between the parties during a certain period of time. The protocol field is crucial in the sense that
certain protocols can identify the nature of the traffic.
The packet length field refers to the total length of an IP packet, including payload and
header information. This field is also very important from a security point of view,
since certain security incidents have a fixed packet length: for example, the Slammer and
Nachi network worms have fixed packet lengths of 404 bytes and 92 bytes
respectively [16] [10]. The packet length field is also notable in that the major transport
protocols, TCP and UDP, mostly carry larger packets, e.g. around 1500 bytes,
whereas management protocols such as ICMP and DNS mostly have packet lengths
below 200 bytes. Due to this structure, it can sometimes be
easy to guess the protocol type from the packet length alone.
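As a rough illustration of this inference risk, the length-based guesswork above can be sketched as a simple heuristic; the thresholds and labels below are assumptions for illustration, not part of the paper:

```python
# Illustrative heuristic only: the thresholds below are assumptions drawn
# from the rough size ranges described above, not a real classifier.
def guess_protocol_family(packet_length: int) -> str:
    """Guess a protocol family from the packet length field alone."""
    if packet_length in (404, 92):   # fixed-size worm traffic (Slammer, Nachi)
        return "possible worm"
    if packet_length >= 1000:        # bulk TCP/UDP payloads cluster near 1500 bytes
        return "transport (TCP/UDP)"
    if packet_length < 200:          # ICMP, DNS and other management traffic
        return "management (ICMP/DNS)"
    return "unknown"
```

This is precisely the kind of inference that anonymisation of the length field must blunt.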
3 Overview of anonymisation techniques
Data anonymity cannot be ensured by employing any single anonymisation technique,
as each technique has its advantages and disadvantages. All data sets should instead
be processed through a combination of the techniques presented here and the many
variations thereof [7].
3.1 Differential Privacy
The notion of adding noise and randomized values to the data in a controlled manner is
known as differential privacy and provides a technique suited to continuous data sets such
as location data, or in our case, data such as packet length and time-stamp data.
Consider two neighboring data sets D1 and D2, i.e. data sets which differ in only one
entry, row or record. An output S is produced when a mechanism K satisfying
ε-differential privacy is applied. The mechanism K can be, for example, a randomized
function that adds jitter to some data field, and must bound the information disclosed
about any individual: the probability of data set D1 producing output S must be nearly
the same as the probability of data set D2 producing the same output.
Dwork’s [5] definition of differential privacy is the following:
A mechanism K satisfies ε-differential privacy if for all pairs of adjacent databases D and
D′, and all S ⊆ Range(K),
Pr[K(D) ∈ S] ≤ e^ε × Pr[K(D′) ∈ S] (1)
Here, ε is known as the privacy parameter. The ε-value corresponds to the strength of the
privacy: a smaller value of ε usually gives better privacy. Differential privacy uses Laplace
noise, whose scale is given by the formula ∆f/ε.
Here, ∆f is the sensitivity of the function, defined as the maximum amount by which the
output can change when a single record is added to or removed from the data set. This
measures how much the output can be altered if one response is either included in or
excluded from the result. To get an idea of typical sensitivity values: a query of the form
"how many rows have a certain property?" has a sensitivity of 1. For our network trace,
we first issue the query "what is the value of the packet length/time-stamp field?", which
has a sensitivity of 1, and then add Laplace noise to the queried field using a
suitable ε-value.
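The noise-addition step above can be sketched as follows; this is a minimal illustration assuming sensitivity ∆f = 1, per the queries described, with the Laplace draw implemented by inverse-CDF sampling:

```python
import math
import random

def laplace_noise(sensitivity: float, epsilon: float) -> float:
    """Draw one sample of Laplace noise with scale ∆f/ε, via inverse-CDF sampling."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5            # uniform on (-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    # inverse CDF of the Laplace distribution centred at 0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def obfuscate_field(values, epsilon, sensitivity=1.0):
    """Perturb each record of a continuous field (e.g. packet length)."""
    return [v + laplace_noise(sensitivity, epsilon) for v in values]
```

Note how the scale ∆f/ε grows as ε shrinks: smaller ε means larger noise and stronger privacy, matching the behaviour reported below for ε = 0.01 versus ε = 0.1.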
3.2 ℓ-diversity
The technique of ℓ-diversity involves partitioning fields within the data set such that within
each group there exists a balanced representation [8]. This addresses a number of weaknesses
in the κ-anonymity techniques, such as homogeneity attacks [9].
Discrete fields in a network trace, such as the protocol, are also sensitive and need to be anonymised.
For example, specific protocols like BitTorrent, used for file sharing, or UDP, mostly
used for video streaming, can reveal the nature of the traffic. Machanavajjhala et
al. [4] define the ℓ-diversity principle as:
”A q-block is ℓ-diverse if it contains at least ℓ ”well-represented” values for the sensitive
attribute S. A table is ℓ-diverse if every q-block is ℓ-diverse.”
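The principle can be sketched in code using its simplest instantiation, distinct ℓ-diversity, where "well-represented" is read as "at least ℓ distinct sensitive values per q-block" (an assumption for illustration; the paper does not fix an instantiation):

```python
def is_l_diverse(block, l):
    """Distinct ℓ-diversity: the q-block must contain at least l distinct
    values of the sensitive attribute (here, e.g., the protocol field)."""
    return len(set(block)) >= l

def table_is_l_diverse(blocks, l):
    """A table is ℓ-diverse iff every q-block is ℓ-diverse."""
    return all(is_l_diverse(block, l) for block in blocks)
```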
4 Implementation of anonymisation techniques
In this section we apply the differential privacy and ℓ-diversity techniques to a network
trace, along with guidelines on how the techniques are to be applied in this domain.
4.1 Differential Privacy over Continuous Fields
Differential privacy is suitable for continuous numerical fields, so we add random Laplace
noise to the time-stamp and packet length fields in our network trace. We have tried a range
of different ε-values to obfuscate the data fields but, for the sake of simplicity, we present
results based on two values in this paper. We first select an ε-value of 0.01 and plot histograms
of both the original and obfuscated time-stamp and packet length data; we then
plot the same histograms with an ε-value of 0.1 and compare the distributions. In figures
1 and 2, blue bars represent the original distribution of the data and green bars represent
the distribution of the obfuscated data.
(a) Length Distribution with ε = 0.1 (b) Length Distribution with ε = 0.01
Figure 1: Packet Length Distributions
From figures 1(b) and 2(b), we can infer that the noise magnitude with ε = 0.01 has heavily
modified the data, and the obfuscated data no longer follows the distribution of the original
data. This type of noise destroys the useful features of the data and makes
statistical analysis useless. In contrast, as seen in figures 1(a) and 2(a), noise addition with ε =
0.1 produces a distribution nearly identical to the original one, implying that even if individual
records of the data are perturbed, the overall distribution of the data is preserved.
(a) Packet Frequency with Noise, ε = 0.1 (b) Packet Frequency with Noise, ε = 0.01
Figure 2: Time-stamp Distribution
The box plot in figure 3 compares the statistical parameters of the original data and of sets
of obfuscated data with different ε-values. It can be seen in figure 3 that the obfuscated packet
length field with ε-value 0.1 maintains almost the same box plot features, such as the
minimum, maximum, median, Q1 and Q3 values. In our experiment, an ε-value
of 0.1 turned out to be the most suitable for our network trace.
Figure 3: Distribution Spread of Packet Length
4.2 ℓ-Diversity over Discrete Fields
The protocol field can be grouped into family equivalence classes, where the family name
represents the characteristics of its members. To create the protocol equivalence classes, we first
examine the types of protocol present in the network trace, then group protocols
of similar functionality into their respective equivalence classes. To
obfuscate the protocol field, we can then replace it with its equivalence class name. The benefit
of doing so is to avoid the inference attacks that might occur if the original protocol field
were exposed.
Our anonymised network trace consists of 5 equivalence classes, namely Transport protocols,
Management protocols, Security protocols, Mobile network protocols and Other
protocols. Each equivalence class contains protocols with similar functionality; for example,
major transport protocols such as TCP and UDP are placed in the Transport protocols
equivalence class. Although replacing the protocol field with its equivalence class destroys
some amount of information, it still provides enough knowledge about the types of protocol
being anonymised; this is the trade-off between privacy and data utility.
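The replacement step might be sketched as a simple lookup; the class memberships below are assumptions for illustration, as the paper does not list the full membership of each equivalence class:

```python
# Hypothetical class memberships, assumed for illustration only.
EQUIVALENCE_CLASS = {
    "TCP": "Transport protocols",
    "UDP": "Transport protocols",
    "ICMP": "Management protocols",
    "DNS": "Management protocols",
    "TLS": "Security protocols",
    "IPsec": "Security protocols",
    "GTP": "Mobile network protocols",
}

def anonymise_protocol(protocol: str) -> str:
    """Replace a protocol name by the name of its equivalence class;
    anything unlisted falls into the catch-all class."""
    return EQUIVALENCE_CLASS.get(protocol, "Other protocols")
```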
After replacing the original protocol field with its respective equivalence class, the percentage
of each equivalence class present in the anonymised trace can easily be calculated and
plotted as a pie chart, as shown in Figure 4. Figure 5 presents a sample of a 5-diverse
network trace, with each block of data containing 5 diverse values of the equivalence class.
Figure 4: Protocol Equivalence Class Distribution
4.3 IP address anonymisation
There exist a number of ways to anonymise IP addresses. Traditionally, IP addresses are
anonymised using hashing methods or by converting them to some real number. The
problem with these methods is that they do not provide any means to carry out statistical
analysis over the anonymised IP addresses.
We tried two different methods to anonymise IP addresses. The first method suppresses the last
octet of the IP address while keeping the other octets intact, e.g. 192.168.1.∗. This technique
ensures that the user who generated the data packet cannot be traced back, while on the other
hand it preserves information about the network topology, which might be useful in certain
analyses. In the second method, we replaced IP addresses with their corresponding class
type; for example, the IP address 192.168.1.10 is replaced by Class C, and so on. Although this
technique destroys the information about the network ID and subnet mask, it still provides
some knowledge about the IP address class and the range of the addresses.

Figure 5: A sample of 5-diverse anonymised network trace
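Both methods can be sketched as small helper functions (the function names are hypothetical); the class boundaries follow the standard classful first-octet ranges:

```python
def suppress_last_octet(ip: str) -> str:
    """Method 1: suppress the host octet, keeping the network part intact."""
    octets = ip.split(".")
    return ".".join(octets[:3]) + ".*"

def ip_class(ip: str) -> str:
    """Method 2: replace the address by its classful range (A-E),
    determined by the first octet."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "Class A"
    if first < 192:
        return "Class B"
    if first < 224:
        return "Class C"
    if first < 240:
        return "Class D"
    return "Class E"
```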
5 Clustering analysis of obfuscated data
In this section, we experiment with the obfuscated data obtained after applying the above
anonymisation techniques, in order to observe its usefulness. The experiment uses packet flow
records computed from the obfuscated data, which then become available for clustering.
Flow statistics [11], [12] are a set of measurements of the traffic data. Analyzing the traffic
at the flow level (and not at the packet level) allows for several types of analysis, typically
usage-based charging, network anomaly detection, identification of the heavy users, etc.
In practice, and for on-line systems and analysis, there is usually a mandatory sampling of
the packets, as direct measurement at the flow level on a high-speed link is not possible in
terms of CPU requirements, as well as memory and storage matters. Ideally, one would
want to use all packet data to compute the flows for higher precision of the calculated
statistics [13], [14]. In the case of this paper, this is actually possible, as we only consider
off-line network traces for the experiment presented. We have used a NetMate/flowcalc-based
approach; the software, named flowtbag, is specifically designed for off-line
flow statistics extraction.
5.1 Practicalities about data processing
The overall methodology for this clustering-based analysis is described in Figure 6, and
in more detail in the following.
Figure 6: Overall block diagram of the data processing flow used.
Given a certain network traffic interface G, the traffic passing through this interface over
a certain period of time ∆t is sent to a computer running WireShark (latest development
version) and dumping the traffic G(∆t) to a PCAP file F.
This PCAP data file F is then sent directly to flow statistics extraction, which will compute
features that are directly usable for clustering. The resulting clustering is denoted Clus(F).
The very same file F is also run through a set of functions, each of which is parametrized by
θ and obfuscates some of the data, leading to the file Fθ. The θ parameter controls the
amount of obfuscation applied to the traffic data. The file Fθ then needs to be re-ordered
if noise has been applied to the time field of the PCAP records, as this will have put
the packets out of order and therefore rendered the data unusable for computing the flow
statistics. The obfuscated data Fθ is re-ordered using the development version (1.99) of
WireShark, which allows doing this directly on the PCAP file (using the associated
reordercap tool).
The re-ordered obfuscated PCAP file is then sent through the very same clustering as the
original file (non-obfuscated). This results in a certain clustering of the data which we will
denote Clus(Fθ).
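The obfuscate-then-reorder step for the time field can be sketched as follows; for illustration this uses an assumed uniform jitter of magnitude θ rather than the Laplace noise applied earlier, and represents packets as plain dictionaries:

```python
import random

def obfuscate_and_reorder(packets, theta, seed=None):
    """Sketch of F -> F_theta -> re-ordering: jitter each time-stamp by
    uniform noise in [-theta, theta] (an assumed noise model for
    illustration), then sort by the noisy time-stamp so that flow
    statistics remain computable."""
    rng = random.Random(seed)
    noisy = [dict(pkt, timestamp=pkt["timestamp"] + rng.uniform(-theta, theta))
             for pkt in packets]
    return sorted(noisy, key=lambda pkt: pkt["timestamp"])
```

The sort mirrors the re-ordering performed on the actual PCAP file before flow statistics extraction.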
5.2 Remarks
Using flowtbag on a Linux-based computer with 6 GB RAM and an i5-4300U @ 1.9 GHz
CPU, we obtained the following processing speeds for a PCAP network trace:
• 14.7 secs to extract about 106124 packets for flow statistics;
• About 8200 packets/sec;
• About 0.00012 sec/packet.
Figures 7 (a) and (b) show the effect of the noise and of the ε-value on the flow records
available for clustering. It can be noted from Figure 7 (a) that the number of flow statistics
records available for analysis depends heavily on the noise value used. Indeed, if the noise
introduced into the time-stamps of the packets is too large, the flow statistics, which are
directly based on them, become impossible to compute for a large number of packets.
As expected from Figure 7 (b), the dependency of the number of available flow records on
the ε-value is almost non-existent compared to that on the noise.
In order to have meaningful results for the clustering part, it is thus necessary to have
reasonable values for both the noise amount and ε. In the following experiment, ε is varied
to observe its influence on the clustering performed.
Figure 7: Number of available flow records as a function of the noise level (a) and of the ε-value (b)
Figure 8: Overall block diagram of the data processing flow used.
6 Discussion
In this paper, we have presented anonymisation techniques that can be applied specifically
in the network trace domain. In particular, we have emphasised differential privacy and
ℓ-diversity, as these are as yet lightly used for anonymisation (with respect to privacy)
while also being techniques promoted by the privacy community.
One of the dangers with anonymisation is that techniques are either applied to single
fields, ignoring the presence of functional dependencies and quasi-identifiers, or applied
without regard to the semantic domain of the data. In this paper we have shown the
application of differential privacy and ℓ-diversity to obfuscate data in the network
trace domain.
Plain statistical analysis is just one mechanism for understanding the underlying data;
machine learning provides a more sophisticated manner in which re-identification might
be made. While this work is still at a relatively early stage, understanding how
differential privacy and other noise addition techniques affect analyses such
as clustering will become critical for preserving privacy.
Acknowledgements
This paper was partially funded by the TEKES CyberTrust Programme.
References
[1] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models
and algorithms," in Privacy-Preserving Data Mining, ser. The Kluwer International Series on
Advances in Database Systems, C. C. Aggarwal, P. S. Yu, and A. K. Elmagarmid, Eds. Springer
US, 2008, vol. 34, pp. 11–52.
[2] P. Langendorfer, M. Maaser, K. Piotrowski, and S. Peter, "Privacy Enhancing Techniques: A
Survey and Classification." Information Science Reference, 2008.
[3] L. Sweeney, "κ-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness
Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, Oct. 2002.
[4] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "ℓ-diversity: Privacy
beyond κ-anonymity," in Proceedings of the IEEE International Conference on Data Engineering
(ICDE), p. 24, 2006.
[5] C. Dwork, "Differential privacy: A survey of results," in Theory and Applications of Models of
Computation, ser. Lecture Notes in Computer Science, vol. 4978. Springer Verlag, April 2008,
pp. 1–19.
[6] R. Motwani and Y. Xu, "Efficient algorithms for masking and finding quasi-identifiers," in
Proceedings of the Conference on Very Large Data Bases (VLDB), 2007.
[7] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the Crowd: The
privacy bounds of human mobility," Scientific Reports, vol. 3, March 2013.
[8] N. Li and T. Li, "t-closeness: Privacy beyond κ-anonymity and ℓ-diversity," in Proc. of IEEE
23rd International Conference on Data Engineering (ICDE'07), 2007.
[9] A. Friedman, R. Wolff, and A. Schuster, "Providing κ-anonymity in data mining," The VLDB
Journal, vol. 17, no. 4, pp. 789–804, Jul. 2008.
[10] W. Yurcik, C. Woolam, G. Hellings, L. Khan, and B. M. Thuraisingham, "Toward trusted
sharing of network packet traces using anonymization: Single-field privacy/analysis tradeoffs,"
CoRR, vol. abs/0710.3979, 2007.
[11] S. Handelman, S. Stibler, N. Brownlee, and G. Ruth, "RFC 2724: RTFM: New attributes for
traffic flow measurement," 1999. [Online]. Available: http://tools.ietf.org/html/rfc2724
[12] N. Brownlee, "Network management and realtime traffic flow measurement," Journal of Network
and Systems Management, vol. 6, no. 2, pp. 223–228, 1998.
[13] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," SIG-
COMM Comput. Commun. Rev., vol. 32, no. 4, pp. 323–336, Aug. 2002. [Online]. Available:
http://doi.acm.org/10.1145/964725.633056
[14] N. Duffield, C. Lund, and M. Thorup, "Properties and prediction of flow statistics from sampled
packet streams," in Proc. ACM SIGCOMM Internet Measurement Workshop, 2002, pp.
159–171.
[15] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol.
28, no. 2, pp. 129–137, Mar 1982.
[16] T. M. Chen, "Intrusion detection for viruses and worms," IEC Annual Review of Communications,
vol. 57, 2004.

More Related Content

PDF
Prevention of Packet Hiding Methods In Selective Jamming Attack
PDF
SOM-PAD: Novel Data Security Algorithm on Self Organizing Map
PDF
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
PDF
Cv4201644655
PDF
Review on Encrypted Image with Hidden Data Using AES Algorithm
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
W4301117121
PDF
Stegonoraphy
Prevention of Packet Hiding Methods In Selective Jamming Attack
SOM-PAD: Novel Data Security Algorithm on Self Organizing Map
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
Cv4201644655
Review on Encrypted Image with Hidden Data Using AES Algorithm
International Journal of Computational Engineering Research(IJCER)
W4301117121
Stegonoraphy

What's hot (19)

PDF
Hiding text in audio using lsb based steganography
PDF
Audio steganography using r prime rsa and ga based lsb algorithm to enhance s...
PDF
Secure multipath routing scheme using key
PDF
Name a naming mechanism for delay disruption tolerant network
PDF
Paper id 212014145
DOCX
Audio Steganography synopsis
PDF
A NOVEL APPROACHES TOWARDS STEGANOGRAPHY
PDF
A Novel IP Traceback Scheme for Spoofing Attack
PDF
A0 ad276c eacf-6f38-e32efa1adf1e36cc
PDF
RSA Based Secured Image Steganography Using DWT Approach
PDF
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
PDF
File Encryption and Hiding Application Based on AES and Append Insertion Steg...
PDF
A New Design of Algorithm for Enhancing Security in Bluetooth Communication w...
PDF
A NOVEL APPROACH FOR CONCEALED DATA SHARING AND DATA EMBEDDING FOR SECURED CO...
PDF
Hiding Image within Video Clip
PDF
Hiding Text within Image Using LSB Replacement
PDF
A Novel Steganography Technique that Embeds Security along with Compression
PDF
A RSA- DWT Based Visual Cryptographic Steganogrphy Technique by Mohit Goel
Hiding text in audio using lsb based steganography
Audio steganography using r prime rsa and ga based lsb algorithm to enhance s...
Secure multipath routing scheme using key
Name a naming mechanism for delay disruption tolerant network
Paper id 212014145
Audio Steganography synopsis
A NOVEL APPROACHES TOWARDS STEGANOGRAPHY
A Novel IP Traceback Scheme for Spoofing Attack
A0 ad276c eacf-6f38-e32efa1adf1e36cc
RSA Based Secured Image Steganography Using DWT Approach
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
File Encryption and Hiding Application Based on AES and Append Insertion Steg...
A New Design of Algorithm for Enhancing Security in Bluetooth Communication w...
A NOVEL APPROACH FOR CONCEALED DATA SHARING AND DATA EMBEDDING FOR SECURED CO...
Hiding Image within Video Clip
Hiding Text within Image Using LSB Replacement
A Novel Steganography Technique that Embeds Security along with Compression
A RSA- DWT Based Visual Cryptographic Steganogrphy Technique by Mohit Goel
Ad

Viewers also liked (14)

PPT
Maintenance-Free Exteriors - Simonton Windows
PPTX
Privacy Preserving Log File Processing in Mobile Network Environment
PPTX
Must Know Things About Extended Tax Deadline Oct 15
PPTX
Luis alberto garcia de la cruz
PPTX
"Святая Екатерина" в Святом Власе (Болгария)
DOCX
MUSICA
DOCX
PPTX
Social entrepreneurs
ODP
Cours Histoire et communication Mélanie Roche : Gutenberg et l'imprimerie
PDF
Chamiluda guatemala-mayo2013
PPT
CESG - ACCOR - César Gonçalves
PPT
Gestão de Armazéns
PDF
Bolero Crowdfunding Inspiratiesessie Gent - 20 april 2016
PDF
Differential privacy and applications to location privacy
Maintenance-Free Exteriors - Simonton Windows
Privacy Preserving Log File Processing in Mobile Network Environment
Must Know Things About Extended Tax Deadline Oct 15
Luis alberto garcia de la cruz
"Святая Екатерина" в Святом Власе (Болгария)
MUSICA
Social entrepreneurs
Cours Histoire et communication Mélanie Roche : Gutenberg et l'imprimerie
Chamiluda guatemala-mayo2013
CESG - ACCOR - César Gonçalves
Gestão de Armazéns
Bolero Crowdfunding Inspiratiesessie Gent - 20 april 2016
Differential privacy and applications to location privacy
Ad

Similar to Utilisation of l-Diversity and Differential Privacy in the Anonymisation of Network Traces (20)

PDF
Hiding message from hacker using novel network techniques
DOC
Unit2[1]
DOC
Unit2[1]
PDF
Secure data transmission by using steganography
PDF
11.secure data transmission by using steganography
PDF
Implementation of De-Duplication Algorithm
PDF
Efficient security approaches in mobile ad hoc networks a survey
PDF
A novel cloud storage system with support of sensitive data application
PPT
Steganography
PDF
Wireless Network Security Architecture with Blowfish Encryption Model
PDF
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY
PDF
A Review Paper On Steganography Techniques
PDF
An Effective Semantic Encrypted Relational Data Using K-Nn Model
PDF
An Effective Semantic Encrypted Relational Data Using K-Nn Model
PPT
File transfer with multiple security mechanism
DOCX
Packet switching
PDF
Implementation on Data Security Approach in Dynamic Multi Hop Communication
PDF
Ijcnc050208
PDF
Elgamal signature for content distribution with network coding
PDF
data communication and network models two marks with answers
Hiding message from hacker using novel network techniques
Unit2[1]
Unit2[1]
Secure data transmission by using steganography
11.secure data transmission by using steganography
Implementation of De-Duplication Algorithm
Efficient security approaches in mobile ad hoc networks a survey
A novel cloud storage system with support of sensitive data application
Steganography
Wireless Network Security Architecture with Blowfish Encryption Model
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY
A Review Paper On Steganography Techniques
An Effective Semantic Encrypted Relational Data Using K-Nn Model
An Effective Semantic Encrypted Relational Data Using K-Nn Model
File transfer with multiple security mechanism
Packet switching
Implementation on Data Security Approach in Dynamic Multi Hop Communication
Ijcnc050208
Elgamal signature for content distribution with network coding
data communication and network models two marks with answers

Recently uploaded (20)

PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
additive manufacturing of ss316l using mig welding
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
DOCX
573137875-Attendance-Management-System-original
PPTX
Welding lecture in detail for understanding
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
web development for engineering and engineering
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
Geodesy 1.pptx...............................................
PPT
Project quality management in manufacturing
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Sustainable Sites - Green Building Construction
PPTX
OOP with Java - Java Introduction (Basics)
PDF
PPT on Performance Review to get promotions
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Foundation to blockchain - A guide to Blockchain Tech
additive manufacturing of ss316l using mig welding
bas. eng. economics group 4 presentation 1.pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
573137875-Attendance-Management-System-original
Welding lecture in detail for understanding
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
web development for engineering and engineering
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
R24 SURVEYING LAB MANUAL for civil enggi
Geodesy 1.pptx...............................................
Project quality management in manufacturing
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Sustainable Sites - Green Building Construction
OOP with Java - Java Introduction (Basics)
PPT on Performance Review to get promotions

Utilisation of l-Diversity and Differential Privacy in the Anonymisation of Network Traces

  • 1. Utilisation of -Diversity and Differential Privacy in the Anonymisation of Network Traces Shankar Lal Aalto University, Finland shankar.lal@aalto.fi Ian Oliver, Yoan Miche Security Research Nokia Neworks, Finland first.last@nokia.com Abstract: Noise addition for anonymisation is a known technique for increasing the privacy of a data sets. However this technique is often presented as individual and independent, or, just stated as techniques to be applied. This increases the danger of misapplication of these techniques and a resulting anonymised data set that is open to relatively easy re-identification or reconstruction. To better understand the application of these techniques we demonstrate their application to a specific domain - that of network trace anonymisation. 1 Introduction Privacy and especially anonymisation of data sets is a hot topic. It, therefore, comes as no surprise that data sets such as network traces which contain large amounts of sensitive information about the behavior of users on a network require such treatment. Techniques such as suppression, hashing and encryption of fields suffice to a point. In the case of suppression information is lost, while in hashing or encryption of data, the information content is transformed, from say an IP address which identifies a particular machine, into just some kind of generic identifier. In many cases a pattern of behavior is still recognizable, for example, hashing source and target IP addresses still reveals a unique pattern of communication if not the precise identities [1] [2]. More advanced techniques such as κ-anonymity [3], -diversity [4] and differential privacy [5] (amongst others) have been developed; κ-anonymity in particular has been successfully used with medical data. These techniques are now being recommended, if not mandated, to be used in the process of anonymisation. 
In this paper, we present techniques for anonymising network traces that preserve some de- gree of statistical properties such that some meaningful analysis can still be made. Working in this specific domain means that we can carefully tailor techniques such as differential privacy such that a reasonable degree of privacy is assured.
  • 2. 2 Network trace files A network trace file contains sensitive fields such as source and destination IP addresses, protocol type, packet lengths etc. Some of these can further act as quasi identifiers whose combination can lead to the identification of an individual despite seemingly identifying fields being removed or anonymised [6]. The source/destination IP address and time-stamp field can disclose who is talking to whom and also provide proof that communication ex- isted between the parties in certain period of time. Protocol field is crucial in the sense that certain protocols can identify the nature of traffic. Packet length field refers to the total length of an IP packet which includes payload and header information. This field is also very important from the security point of view, since certain security incidents have fixed packet length for example some network worms i.e. Slammer worm and Nachi worm have fixed packet length of 404 bytes and 92 bytes respectively [16] [10]. The packet length field is also vital in the sense that major transport protocols like TCP and UDP, mostly have packets of larger length e.g. around 1500 bytes. The other management protocols for example ICMP, DNS etc. have packet lengths mostly fewer than 200 bytes. Therefore due to this structure of the fields, it, sometime, can be easy to guess the protocol type by checking its packet length. 3 Overview of anonymisation techniques Data anonymity can not be ensured by employing any single anonymisation technique as each technique has its advantages and disadvantages. All data sets are required to be processed through any combination of the techniques presented here and the many variations thereof [7]. 
3.1 Differential Privacy

The notion of adding noise and randomized values to the data in a controlled manner is known as differential privacy, and it provides a technique suited to continuous data sets such as location data or, in our case, fields such as packet length and time-stamp. Consider two neighboring data sets D1 and D2, i.e. data sets which differ in only one entry (one row or one record). They produce output S when a mechanism K satisfying ε-differential privacy is applied. The mechanism K, for example a randomized function that adds jitter to some data field, bounds the information disclosure related to any individual. Differential privacy states that the probability of data set D1 producing output S is nearly the same as the probability of data set D2 producing the same output.
Dwork's [5] definition of differential privacy is the following: A mechanism K satisfies ε-differential privacy if for all pairs of adjacent databases D1 and D2, and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ e^ε × Pr[K(D2) ∈ S]   (1)

Here, ε is known as the privacy parameter. The ε-value corresponds to the strength of the privacy; a smaller value of ε usually yields better privacy. Differential privacy uses Laplace noise, whose scale is given by the formula ∆f/ε, where ∆f is the sensitivity of the function, defined as the maximum amount by which the output can be perturbed by adding or removing a single record from the data set. This measures how much the output can be altered if a response is either included in or excluded from the result. To get an idea of the value of the sensitivity: querying a data set for "how many rows have a certain property" yields a sensitivity of 1. In our network trace, we first issue the query "what is the value of the packet length/time-stamp field", which gives a sensitivity of 1, and then we add Laplace noise to the queried field using a suitable ε-value.

3.2 ℓ-diversity

The technique of ℓ-diversity involves partitioning fields within the data set such that within each group there exists a balanced representation [8]. This addresses a number of weaknesses in the κ-anonymity technique, such as homogeneity attacks [9]. Discrete fields in a network trace, such as the protocol, are also sensitive and need to be anonymised. For example, some specific protocols, like BitTorrent used for file sharing or UDP mostly used for video streaming, can identify the nature of the traffic. Machanavajjhala et al. [4] define the ℓ-diversity principle as: "A q-block is ℓ-diverse if it contains at least ℓ 'well-represented' values for the sensitive attribute S.
A table is ℓ-diverse if every q-block is ℓ-diverse."

4 Implementation of anonymisation techniques

In this section we apply the differential privacy and ℓ-diversity techniques to a network trace, along with guidelines on how the techniques are to be applied in this domain.
4.1 Differential Privacy over Continuous Fields

Differential privacy is suitable for continuous numerical fields, so we add random Laplace noise to the time-stamp and packet length fields in our network trace. We have tried a range of different values of ε to obfuscate the data fields, but for the sake of simplicity we present results based on two ε values in this paper. We first select ε = 0.01 and plot histograms of both the original and obfuscated data of the time-stamp and packet length fields; we then plot the same histograms with ε = 0.1 and compare both distributions. In figures 1 and 2, blue bars represent the original distribution of the data and green bars represent the distribution of the obfuscated data.

(a) Length Distribution with ε = 0.1 (b) Length Distribution with ε = 0.01
Figure 1: Packet Length Distributions

From figures 1(b) and 2(b), we can infer that the noise magnitude given by ε = 0.01 has heavily modified the data, and the obfuscated data no longer follows the distribution of the original data. This type of noise destroys the useful features of the data and makes statistical analysis useless. By contrast, as seen in figures 1(a) and 2(a), noise addition with ε = 0.1 produces nearly the same distribution as the original, implying that even though individual records are perturbed, the overall distribution of the data is preserved.

(a) Packet Frequency with Noise, ε = 0.1 (b) Packet Frequency with Noise, ε = 0.01
Figure 2: Time-stamp Distribution
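The noise addition described above can be sketched as follows. This is an illustrative Python implementation, not the code used for the experiments: Laplace samples are drawn by inverse-CDF sampling, and the example packet lengths are hypothetical.

```python
import math
import random

def laplace_noise(scale):
    """Sample zero-mean Laplace noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def obfuscate(value, epsilon, sensitivity=1.0):
    """Perturb a single numeric field with Laplace(sensitivity / epsilon) noise.

    Sensitivity is 1 for the per-field query used in the text."""
    return value + laplace_noise(sensitivity / epsilon)

# Hypothetical packet lengths perturbed with the two epsilon values from the text.
lengths = [1500, 1500, 404, 92, 60]
mild = [obfuscate(v, epsilon=0.1) for v in lengths]    # noise scale 10
heavy = [obfuscate(v, epsilon=0.01) for v in lengths]  # noise scale 100
```

A smaller ε means a larger noise scale ∆f/ε, which is why ε = 0.01 destroys the length distribution while ε = 0.1 largely preserves it.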
The box plot in figure 3 compares the statistical parameters of the original data and a set of obfuscated data with different ε-values. It can be seen in figure 3 that the obfuscated packet length field with ε = 0.1 maintains almost the same features of the box plot, such as the minimum, maximum, median, Q1 and Q3 values. In our experiment, it turned out that an ε value of 0.1 is the most suitable for our network trace.

Figure 3: Distribution Spread of Packet Length

4.2 ℓ-Diversity over Discrete Fields

The protocol field can be grouped into family equivalence classes, where the family name represents the characteristics of its members. To create the protocol equivalence classes, we first examine the types of protocols present in the network trace, and then group protocols of similar functionality into their respective equivalence classes. To obfuscate the protocol field, we can replace it with its equivalence class name. The benefit of doing so is to avoid inference attacks which might occur if the original protocol field were exposed.
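A minimal sketch of this replacement in Python; the class membership shown below is an illustrative assumption, not the exact grouping used in the experiments.

```python
# Illustrative protocol-to-equivalence-class mapping (assumed membership).
PROTOCOL_CLASSES = {
    "TCP": "Transport protocols",
    "UDP": "Transport protocols",
    "ICMP": "Management protocols",
    "DNS": "Management protocols",
    "TLS": "Security protocols",
    "IPsec": "Security protocols",
}

def anonymise_protocol(protocol):
    """Replace a concrete protocol name with its equivalence class name."""
    return PROTOCOL_CLASSES.get(protocol, "Other protocols")
```

Protocols not covered by the mapping fall into a catch-all class, so the anonymised field never exposes an unexpected original value.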
Our anonymised network trace consists of 5 equivalence classes, namely transport protocols, management protocols, security protocols, mobile network protocols and other protocols. Each equivalence class contains protocols with similar functionality; for example, the major transport protocols TCP and UDP are placed in the transport protocol equivalence class. Although replacing the protocol field with its equivalence class destroys some information, it still provides enough information about the type of the anonymised protocol; this is the trade-off between privacy and data utility. After replacing the original protocol field with its respective equivalence class, the percentage of each equivalence class present in the anonymised trace can easily be calculated and plotted as a pie chart, as shown in Figure 4. Figure 5 presents one sample of a 5-diverse network trace, with each block of data containing 5 diverse values of the equivalence class.

Figure 4: Protocol Equivalence Class Distribution

4.3 IP address anonymisation

There exist a number of ways to anonymise IP addresses. Traditionally, IP addresses are anonymised by using hashing methods or by converting them to some real number. The problem with these methods is that they do not provide any means to carry out statistical analysis over the anonymised IP addresses. We tried two different methods to anonymise IP addresses. The first method suppresses the last octet of the IP address while keeping the other octets intact, e.g. 192.168.1.∗. This technique
Figure 5: A sample of 5-diverse anonymised network trace

ensures that the user who generated the data packet cannot be traced back, while on the other hand it preserves information about the network topology, which might be useful in certain analyses. In the second method, we replace IP addresses with their corresponding class type; for example, the IP address 192.168.1.10 is replaced by Class C, and so on. Although this technique destroys the information about the network ID and subnet mask, it still provides some knowledge about the IP address class and the range of the addresses.

5 Clustering analysis of obfuscated data

In this section, we experiment with the obfuscated data obtained after applying the above anonymisation techniques, to observe its usefulness. This experiment uses packet flow records calculated from the obfuscated data, which become available for clustering.
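The two IP address anonymisation methods of Section 4.3 can be sketched as follows. This is a simple IPv4-only illustration; the classful boundaries follow the standard first-octet rule.

```python
def suppress_last_octet(ip):
    """Method 1: drop the host octet, keep the network part intact."""
    octets = ip.split(".")
    return ".".join(octets[:3]) + ".*"

def ip_to_class(ip):
    """Method 2: replace the address by its classful range (first-octet rule)."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "Class A"
    if first < 192:
        return "Class B"
    if first < 224:
        return "Class C"
    if first < 240:
        return "Class D"
    return "Class E"
```

For example, `suppress_last_octet("192.168.1.10")` gives `"192.168.1.*"` and `ip_to_class("192.168.1.10")` gives `"Class C"`, matching the behaviour described in Section 4.3.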
Flow statistics [11], [12] are a set of measurements over the traffic data. Analyzing the traffic at the flow level (and not at the packet level) allows for several types of analysis, typically usage-based charging, network anomaly detection, identification of heavy users, etc. In practice, and for on-line systems and analysis, a sampling of the packets is usually mandatory, as direct measurement at the flow level on a high-speed link is not possible in terms of CPU requirements, as well as memory and storage. Ideally, one would want to use all packet data to compute the flows for higher precision of the calculated statistics [13], [14]. In the case of this paper, this is actually possible, as we only consider off-line network traces for the experiment presented. We have used a NetMate/flowcalc-based approach; the software, named flowtbag, is specifically designed for off-line flow statistics extraction.

5.1 Practicalities about data processing

The overall methodology for this clustering-based analysis is described in Figure 6, and in more detail in the following.

Figure 6: Overall block diagram of the data processing flow used.

Given a certain network traffic interface G, the traffic passing through this interface over a certain period of time ∆t is sent to a computer running Wireshark (latest development version), dumping the traffic G(∆t) to a PCAP file F. This PCAP data file F is then sent directly to flow statistics extraction, which computes features that are directly usable for clustering. The resulting clustering is denoted Clus(F). The very same file F is also run through a set of functions, each of which is parametrized by θ and obfuscates some of the data, leading to the file Fθ. The θ parameter controls the amount of obfuscation applied to the traffic data.
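As a rough illustration of what flow statistics extraction does (flowtbag computes far more features per flow), packets can be grouped into flows keyed by the usual 5-tuple. The packet record layout used here is a hypothetical simplification:

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packet records into 5-tuple flows with simple per-flow statistics:
    packet count, byte count, and first/last timestamps (flow duration)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        f = flows[key]
        f["packets"] += 1
        f["bytes"] += p["length"]
        f["start"] = p["time"] if f["start"] is None else min(f["start"], p["time"])
        f["end"] = p["time"] if f["end"] is None else max(f["end"], p["time"])
    return dict(flows)
```

Because the start/end timestamps drive the flow duration, any noise added to the time field propagates directly into these statistics, which is why re-ordering and bounded noise matter in the next steps.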
The Fθ file then needs to be re-ordered if noise has been applied to the time field of the PCAP records, as this will have put the packets out of order and therefore rendered the data unusable for computing the flow statistics. The obfuscated data Fθ is re-ordered using the development version (1.99) of Wireshark, which allows for doing this directly on the PCAP file (using the associated reordercap tool). The re-ordered obfuscated PCAP file is then sent through the very same clustering as the original (non-obfuscated) file. This results in a certain clustering of the data, which we denote Clus(Fθ).
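When working on decoded packet records rather than on the PCAP file itself, this re-ordering step reduces to a sort on the (noised) timestamp; reordercap performs the equivalent operation directly on a PCAP file. A minimal sketch, assuming the same record layout as above:

```python
def reorder_by_timestamp(records):
    """Restore chronological order after noise has shuffled packet times."""
    return sorted(records, key=lambda r: r["time"])
```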
5.2 Remarks

Using flowtbag on a Linux-based computer with 6 GB RAM and an i5-4300U@1.9GHz CPU, we obtained the following processing speeds for a PCAP network trace:

• 14.7 secs to extract flow statistics from about 106124 packets;
• about 8200 packets/sec;
• about 0.00012 sec/packet.

Figures 7 (a) and (b) show the effect of the noise and of the ε-value on the flow records available for clustering. It can be noted from Figure 7 (a) that the number of available flow statistics records for analysis depends heavily on the noise value used. Indeed, if the noise introduced in the time-stamps of the packets is too large, the flow statistics, which are directly based on them, become impossible to compute for a large number of packets. As expected, from Figure 7 (b), the dependency of the number of available flow records on the ε-value is almost non-existent compared to that on the noise. In order to have meaningful results for the clustering part, it is thus necessary to have reasonable values for both the noise amount and ε. In the following experiment, ε is varied to observe its influence on the clustering performed.

Figure 7: Effect of the noise amount and ε-value on the number of flow records available for clustering.
Figure 8: Overall block diagram of the data processing flow used.

6 Discussion

In this paper, we have presented anonymisation techniques that can be applied specifically in the network trace domain for the practice of anonymisation. Specifically, we have emphasized differential privacy and ℓ-diversity, as these are little used in practice for anonymisation (with respect to privacy) while being techniques that are promoted by the privacy community. One of the dangers of any type of anonymisation is that techniques are either applied to single fields, ignoring the presence of functional dependencies and quasi-identifiers, or are applied without regard to the semantic domain of the data. In this paper we have shown the application of differential privacy and ℓ-diversity to obfuscate data in the network trace domain. As plain statistical analysis is just one mechanism for understanding the underlying data, machine learning provides a more sophisticated manner in which re-identification might be attempted. This work is still at a relatively early stage; understanding how differential privacy and other forms of noise addition affect analyses such as clustering will become critical to preserving privacy.

Acknowledgements

This paper was partially funded by the TEKES CyberTrust Programme.
References

[1] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models and algorithms," in Privacy-Preserving Data Mining, ser. The Kluwer International Series on Advances in Database Systems, C. C. Aggarwal, P. S. Yu, and A. K. Elmagarmid, Eds. Springer US, 2008, vol. 34, pp. 11–52.
[2] P. Langendorfer, M. Maaser, K. Piotrowski, and S. Peter, "Privacy Enhancing Techniques: A Survey and Classification." Information Science Reference, 2008.
[3] L. Sweeney, "κ-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, Oct. 2002.
[4] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "ℓ-diversity: Privacy beyond κ-anonymity," in Proc. 22nd International Conference on Data Engineering (ICDE), 2006, p. 24.
[5] C. Dwork, "Differential privacy: A survey of results," in Theory and Applications of Models of Computation, ser. Lecture Notes in Computer Science, vol. 4978. Springer Verlag, April 2008, pp. 1–19.
[6] R. Motwani and Y. Xu, "Efficient algorithms for masking and finding quasi-identifiers," in Proceedings of the Conference on Very Large Data Bases (VLDB), 2007.
[7] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the Crowd: The privacy bounds of human mobility," Scientific Reports, vol. 3, March 2013.
[8] N. Li and T. Li, "t-closeness: Privacy beyond κ-anonymity and ℓ-diversity," in Proc. of the IEEE 23rd International Conference on Data Engineering (ICDE 2007).
[9] A. Friedman, R. Wolff, and A. Schuster, "Providing κ-anonymity in data mining," The VLDB Journal, vol. 17, no. 4, pp. 789–804, Jul. 2008.
[10] W. Yurcik, C. Woolam, G. Hellings, L. Khan, and B. M. Thuraisingham, "Toward trusted sharing of network packet traces using anonymization: Single-field privacy/analysis tradeoffs," CoRR, vol. abs/0710.3979, 2007.
[11] S. Handelman, S. Stibler, N. Brownlee, and G. Ruth, "RFC 2724: RTFM: New attributes for traffic flow measurement," 1999. [Online]. Available: http://tools.ietf.org/html/rfc2724
[12] N. Brownlee, "Network management and realtime traffic flow measurement," Journal of Network and Systems Management, vol. 6, no. 2, pp. 223–228, 1998.
[13] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," SIGCOMM Comput. Commun. Rev., vol. 32, no. 4, pp. 323–336, Aug. 2002. [Online]. Available: http://doi.acm.org/10.1145/964725.633056
[14] N. Duffield, C. Lund, and M. Thorup, "Properties and prediction of flow statistics from sampled packet streams," in Proc. ACM SIGCOMM Internet Measurement Workshop, 2002, pp. 159–171.
[15] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[16] T. M. Chen, "Intrusion detection for viruses and worms," IEC Annual Review of Communications, vol. 57, 2004.