Utilisation of ℓ-Diversity and Differential Privacy in the
Anonymisation of Network Traces
Shankar Lal
Aalto University, Finland
shankar.lal@aalto.fi
Ian Oliver, Yoan Miche
Security Research
Nokia Networks, Finland
first.last@nokia.com
Abstract: Noise addition is a known technique for increasing the privacy of a data
set through anonymisation. However, such techniques are often presented individually
and independently, or merely stated as techniques to be applied. This increases the
danger of their misapplication, resulting in an anonymised data set that is open to
relatively easy re-identification or reconstruction. To better understand the application
of these techniques, we demonstrate their use in a specific domain - that of
network trace anonymisation.
1 Introduction
Privacy and especially anonymisation of data sets is a hot topic. It, therefore, comes as
no surprise that data sets such as network traces which contain large amounts of sensitive
information about the behavior of users on a network require such treatment.
Techniques such as suppression, hashing and encryption of fields suffice up to a point. In
the case of suppression, information is lost, while with hashing or encryption the
information content is transformed from, say, an IP address which identifies a particular
machine into some kind of generic identifier. In many cases a pattern of behavior
remains recognizable: for example, hashing source and target IP addresses still reveals a
unique pattern of communication, if not the precise identities [1] [2].
More advanced techniques such as κ-anonymity [3], ℓ-diversity [4] and differential privacy
[5] (amongst others) have been developed; κ-anonymity in particular has been successfully
used with medical data. These techniques are now being recommended, if not mandated,
for use in the process of anonymisation.
In this paper, we present techniques for anonymising network traces that preserve some de-
gree of statistical properties such that some meaningful analysis can still be made. Working
in this specific domain means that we can carefully tailor techniques such as differential
privacy such that a reasonable degree of privacy is assured.
2 Network trace files
A network trace file contains sensitive fields such as the source and destination IP addresses,
protocol type, packet lengths, etc. Some of these can further act as quasi-identifiers, whose
combination can lead to the identification of an individual despite the seemingly identifying
fields having been removed or anonymised [6]. The source/destination IP address and time-stamp
fields can disclose who is talking to whom, and also prove that communication
existed between the parties during a certain period of time. The protocol field is crucial in the sense that
certain protocols can identify the nature of the traffic.
The packet length field refers to the total length of an IP packet, including payload and
header information. This field is also very important from a security point of view,
since certain security incidents have a fixed packet length: for example, the Slammer and
Nachi network worms have fixed packet lengths of 404 bytes and 92 bytes
respectively [16] [10]. The packet length field is also notable in that the major transport
protocols, TCP and UDP, mostly carry larger packets, e.g. around 1500 bytes,
whereas management protocols such as ICMP and DNS mostly have packet lengths
below 200 bytes. Due to this structure, it can sometimes be
easy to guess the protocol type from the packet length alone.
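As a rough illustration of this inference risk, the length-based guesswork above can be sketched as a simple heuristic; the thresholds and labels below are assumptions for illustration, not part of the paper:

```python
# Illustrative heuristic only: the thresholds below are assumptions drawn
# from the rough size ranges described above, not a real classifier.
def guess_protocol_family(packet_length: int) -> str:
    """Guess a protocol family from the packet length field alone."""
    if packet_length in (404, 92):   # fixed-size worm traffic (Slammer, Nachi)
        return "possible worm"
    if packet_length >= 1000:        # bulk TCP/UDP payloads cluster near 1500 bytes
        return "transport (TCP/UDP)"
    if packet_length < 200:          # ICMP, DNS and other management traffic
        return "management (ICMP/DNS)"
    return "unknown"
```

This is precisely the kind of inference that anonymisation of the length field must blunt.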
3 Overview of anonymisation techniques
Data anonymity cannot be ensured by employing any single anonymisation technique,
as each technique has its advantages and disadvantages. All data sets should instead
be processed through a combination of the techniques presented here and the many
variations thereof [7].
3.1 Differential Privacy
The notion of adding noise and randomized values to the data in a controlled manner is
known as differential privacy and provides a technique suited to continuous data sets such
as location data, or in our case, data such as packet length and time-stamp data.
Consider two neighboring data sets D1 and D2, i.e. data sets which differ in only one
entry, row or record. An output S is produced when a mechanism K satisfying
ε-differential privacy is applied. The mechanism K can be, for example, a randomized
function that adds jitter to some data field, and must bound the information disclosed
about any individual: the probability of data set D1 producing output S must be nearly
the same as the probability of data set D2 producing the same output.
Dwork’s [5] definition of differential privacy is the following:
A mechanism K satisfies ε-differential privacy if for all pairs of adjacent databases D and
D′, and all S ⊆ Range(K),
Pr[K(D) ∈ S] ≤ e^ε × Pr[K(D′) ∈ S] (1)
Here, ε is known as the privacy parameter. The ε-value corresponds to the strength of the
privacy: a smaller value of ε usually gives better privacy. Differential privacy uses Laplace
noise, whose scale is given by the formula ∆f/ε.
Here, ∆f is the sensitivity of the function, defined as the maximum amount by which the
output can change when a single record is added to or removed from the data set. This
measures how much the output can be altered if one response is either included in or
excluded from the result. To get an idea of typical sensitivity values: a query of the form
"how many rows have a certain property?" has a sensitivity of 1. For our network trace,
we first issue the query "what is the value of the packet length/time-stamp field?", which
has a sensitivity of 1, and then add Laplace noise to the queried field using a
suitable ε-value.
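The noise-addition step above can be sketched as follows; this is a minimal illustration assuming sensitivity ∆f = 1, per the queries described, with the Laplace draw implemented by inverse-CDF sampling:

```python
import math
import random

def laplace_noise(sensitivity: float, epsilon: float) -> float:
    """Draw one sample of Laplace noise with scale ∆f/ε, via inverse-CDF sampling."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5            # uniform on (-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    # inverse CDF of the Laplace distribution centred at 0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def obfuscate_field(values, epsilon, sensitivity=1.0):
    """Perturb each record of a continuous field (e.g. packet length)."""
    return [v + laplace_noise(sensitivity, epsilon) for v in values]
```

Note how the scale ∆f/ε grows as ε shrinks: smaller ε means larger noise and stronger privacy, matching the behaviour reported below for ε = 0.01 versus ε = 0.1.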
3.2 ℓ-diversity
The technique of ℓ-diversity involves partitioning fields within the data set such that within
each group there exists a balanced representation [8]. This addresses a number of weaknesses
in the κ-anonymity techniques, such as homogeneity attacks [9].
Discrete fields in a network trace, such as the protocol, are also sensitive and need to be anonymised.
For example, specific protocols like BitTorrent, used for file sharing, or UDP, mostly
used for video streaming, can reveal the nature of the traffic. Machanavajjhala et
al. [4] define the ℓ-diversity principle as:
”A q-block is ℓ-diverse if it contains at least ℓ ”well-represented” values for the sensitive
attribute S. A table is ℓ-diverse if every q-block is ℓ-diverse.”
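The principle can be sketched in code using its simplest instantiation, distinct ℓ-diversity, where "well-represented" is read as "at least ℓ distinct sensitive values per q-block" (an assumption for illustration; the paper does not fix an instantiation):

```python
def is_l_diverse(block, l):
    """Distinct ℓ-diversity: the q-block must contain at least l distinct
    values of the sensitive attribute (here, e.g., the protocol field)."""
    return len(set(block)) >= l

def table_is_l_diverse(blocks, l):
    """A table is ℓ-diverse iff every q-block is ℓ-diverse."""
    return all(is_l_diverse(block, l) for block in blocks)
```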
4 Implementation of anonymisation techniques
In this section we apply the differential privacy and ℓ-diversity techniques to a network
trace, along with guidelines on how the techniques are to be applied in this domain.
4.1 Differential Privacy over Continuous Fields
Differential privacy is suitable for continuous numerical fields, so we add random Laplace
noise to the time-stamp and packet length fields in our network trace. We have tried a range
of different ε-values to obfuscate the data fields but, for the sake of simplicity, we present
results based on two values in this paper. We first select an ε-value of 0.01 and plot histograms
of both the original and obfuscated time-stamp and packet length data; we then
plot the same histograms with an ε-value of 0.1 and compare the distributions. In figures
1 and 2, blue bars represent the original distribution of the data and green bars represent
the distribution of the obfuscated data.
(a) Length Distribution with ε = 0.1 (b) Length Distribution with ε = 0.01
Figure 1: Packet Length Distributions
From figures 1(b) and 2(b), we can infer that the noise magnitude with ε = 0.01 has heavily
modified the data, and the obfuscated data no longer follows the distribution of the original
data. This type of noise destroys the useful features of the data and makes
statistical analysis useless. In contrast, as seen in figures 1(a) and 2(a), noise addition with ε =
0.1 produces a distribution nearly identical to the original one, implying that even if individual
records of the data are perturbed, the overall distribution of the data is preserved.
(a) Packet Frequency with Noise, ε = 0.1 (b) Packet Frequency with Noise, ε = 0.01
Figure 2: Time-stamp Distribution
The box plot in figure 3 compares the statistical parameters of the original data and of sets
of obfuscated data with different ε-values. It can be seen in figure 3 that the obfuscated packet
length field with ε-value 0.1 maintains almost the same box plot features, such as the
minimum, maximum, median, Q1 and Q3 values. In our experiment, an ε-value
of 0.1 turned out to be the most suitable for our network trace.
Figure 3: Distribution Spread of Packet Length
4.2 ℓ-Diversity over Discrete Fields
The protocol field can be grouped into family equivalence classes, where the family name
represents the characteristics of its members. To create the protocol equivalence classes, we first
examine the types of protocol present in the network trace, then group protocols
of similar functionality into their respective equivalence classes. To
obfuscate the protocol field, we can then replace it with its equivalence class name. The benefit
of doing so is to avoid the inference attacks that might occur if the original protocol field
were exposed.
Our anonymised network trace consists of 5 equivalence classes, namely Transport protocols,
Management protocols, Security protocols, Mobile network protocols and Other
protocols. Each equivalence class contains protocols with similar functionality; for example,
major transport protocols such as TCP and UDP are placed in the Transport protocols
equivalence class. Although replacing the protocol field with its equivalence class destroys
some amount of information, it still provides enough knowledge about the types of protocol
being anonymised; this is the trade-off between privacy and data utility.
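The replacement step might be sketched as a simple lookup; the class memberships below are assumptions for illustration, as the paper does not list the full membership of each equivalence class:

```python
# Hypothetical class memberships, assumed for illustration only.
EQUIVALENCE_CLASS = {
    "TCP": "Transport protocols",
    "UDP": "Transport protocols",
    "ICMP": "Management protocols",
    "DNS": "Management protocols",
    "TLS": "Security protocols",
    "IPsec": "Security protocols",
    "GTP": "Mobile network protocols",
}

def anonymise_protocol(protocol: str) -> str:
    """Replace a protocol name by the name of its equivalence class;
    anything unlisted falls into the catch-all class."""
    return EQUIVALENCE_CLASS.get(protocol, "Other protocols")
```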
After replacing the original protocol field with its respective equivalence class, the percentage
of each equivalence class present in the anonymised trace can easily be calculated and
plotted as a pie chart, as shown in Figure 4. Figure 5 presents a sample of a 5-diverse
network trace, with each block of data containing 5 diverse values of the equivalence class.
Figure 4: Protocol Equivalence Class Distribution
4.3 IP address anonymisation
There exist a number of ways to anonymise IP addresses. Traditionally, IP addresses are
anonymised using hashing methods or by converting them to some real number. The
problem with these methods is that they do not provide any means to carry out statistical
analysis over the anonymised IP addresses.
We tried two different methods to anonymise IP addresses. The first method suppresses the last
octet of the IP address while keeping the other octets intact, e.g. 192.168.1.∗. This technique
ensures that the user who generated the data packet cannot be traced back, while on the other
hand it preserves information about the network topology, which might be useful in certain
analyses. In the second method, we replaced IP addresses with their corresponding class
type; for example, the IP address 192.168.1.10 is replaced by Class C, and so on. Although this
technique destroys the information about the network ID and subnet mask, it still provides
some knowledge about the IP address class and the range of the addresses.

Figure 5: A sample of 5-diverse anonymised network trace
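Both methods can be sketched as small helper functions (the function names are hypothetical); the class boundaries follow the standard classful first-octet ranges:

```python
def suppress_last_octet(ip: str) -> str:
    """Method 1: suppress the host octet, keeping the network part intact."""
    octets = ip.split(".")
    return ".".join(octets[:3]) + ".*"

def ip_class(ip: str) -> str:
    """Method 2: replace the address by its classful range (A-E),
    determined by the first octet."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "Class A"
    if first < 192:
        return "Class B"
    if first < 224:
        return "Class C"
    if first < 240:
        return "Class D"
    return "Class E"
```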
5 Clustering analysis of obfuscated data
In this section, we experiment with the obfuscated data obtained after applying the above
anonymisation techniques, in order to observe its usefulness. The experiment uses packet flow
records computed from the obfuscated data, which then become available for clustering.
Flow statistics [11], [12] are a set of measurements of the traffic data. Analyzing the traffic
at the flow level (and not at the packet level) allows for several types of analysis, typically
usage-based charging, network anomaly detection, identification of the heavy users, etc.
In practice, and for on-line systems and analysis, there is usually a mandatory sampling of
the packets, as direct measurement at the flow level on a high-speed link is not possible in
terms of CPU requirements, as well as memory and storage matters. Ideally, one would
want to use all packet data to compute the flows for higher precision of the calculated
statistics [13], [14]. In the case of this paper, this is actually possible, as we only consider
off-line network traces for the experiment presented. We have used a NetMate/flowcalc-based
approach; the software, named flowtbag, is specifically designed for off-line
flow statistics extraction.
5.1 Practicalities about data processing
The overall methodology for this clustering-based analysis is described in Figure 6, and
in more detail in the following.
Figure 6: Overall block diagram of the data processing flow used.
Given a certain network traffic interface G, the traffic passing through this interface over
a certain period of time ∆t is sent to a computer running WireShark (latest development
version) and dumping the traffic G(∆t) to a PCAP file F.
This PCAP data file F is then sent directly to flow statistics extraction, which will compute
features that are directly usable for clustering. The resulting clustering is denoted Clus(F).
The very same file F is also run through a set of functions, each of which is parametrized by
θ and obfuscates some of the data, leading to the file Fθ. The θ parameter controls the
amount of obfuscation applied to the traffic data. The file Fθ then needs to be re-ordered
if noise has been applied to the time field of the PCAP records, as this will have put
the packets out of order and therefore rendered the data unusable for computing the flow
statistics. The obfuscated data Fθ is re-ordered using the development version (1.99) of
WireShark, which allows doing this directly on the PCAP file (using the associated
reordercap tool).
The re-ordered obfuscated PCAP file is then sent through the very same clustering as the
original file (non-obfuscated). This results in a certain clustering of the data which we will
denote Clus(Fθ).
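The obfuscate-then-reorder step for the time field can be sketched as follows; for illustration this uses an assumed uniform jitter of magnitude θ rather than the Laplace noise applied earlier, and represents packets as plain dictionaries:

```python
import random

def obfuscate_and_reorder(packets, theta, seed=None):
    """Sketch of F -> F_theta -> re-ordering: jitter each time-stamp by
    uniform noise in [-theta, theta] (an assumed noise model for
    illustration), then sort by the noisy time-stamp so that flow
    statistics remain computable."""
    rng = random.Random(seed)
    noisy = [dict(pkt, timestamp=pkt["timestamp"] + rng.uniform(-theta, theta))
             for pkt in packets]
    return sorted(noisy, key=lambda pkt: pkt["timestamp"])
```

The sort mirrors the re-ordering performed on the actual PCAP file before flow statistics extraction.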
5.2 Remarks
Using flowtbag on a Linux-based computer with 6 GB RAM and an i5-4300U @ 1.9 GHz
CPU, we obtained the following processing speeds for a PCAP network trace:
• 14.7 secs to extract about 106124 packets for flow statistics;
• About 8200 packets/sec;
• About 0.00012 sec/packet.
Figures 7 (a) and (b) show the effect of the noise and of the ε-value on the flow records
available for clustering. It can be noted from Figure 7 (a) that the number of flow statistics
records available for analysis depends heavily on the noise value used. Indeed, if the noise
introduced into the time-stamps of the packets is too large, the flow statistics, which are
directly based on them, become impossible to compute for a large number of packets.
As expected from Figure 7 (b), the dependency of the number of available flow records on
the ε-value is almost non-existent compared to that on the noise.
In order to have meaningful results for the clustering part, it is thus necessary to have
reasonable values for both the noise amount and ε. In the following experiment, ε is varied
to observe its influence on the clustering performed.
Figure 7: Number of available flow records as a function of the noise level (a) and of the ε-value (b)
Figure 8: Overall block diagram of the data processing flow used.
6 Discussion
In this paper, we have presented anonymisation techniques that can be applied specifically
in the network trace domain. In particular, we have emphasised differential privacy and
ℓ-diversity, as these are as yet lightly used for anonymisation (with respect to privacy)
while also being techniques promoted by the privacy community.
One of the dangers with anonymisation is that techniques are either applied to single
fields, ignoring the presence of functional dependencies and quasi-identifiers, or applied
without regard to the semantic domain of the data. In this paper we have shown the
application of differential privacy and ℓ-diversity to obfuscate data in the network
trace domain.
Plain statistical analysis is just one mechanism for understanding the underlying data;
machine learning provides a more sophisticated manner in which re-identification might
be made. While this work is still at a relatively early stage, understanding how
differential privacy and other noise addition techniques affect analyses such
as clustering will become critical for preserving privacy.
Acknowledgements
This paper was partially funded by the TEKES CyberTrust Programme.
References
[1] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models
and algorithms," in Privacy-Preserving Data Mining, ser. The Kluwer International Series on
Advances in Database Systems, C. C. Aggarwal, P. S. Yu, and A. K. Elmagarmid, Eds. Springer
US, 2008, vol. 34, pp. 11–52.
[2] P. Langendorfer, M. Maaser, K. Piotrowski, and S. Peter, "Privacy Enhancing Techniques: A
Survey and Classification." Information Science Reference, 2008.
[3] L. Sweeney, "κ-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness
Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, Oct. 2002.
[4] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "ℓ-diversity: Privacy
beyond κ-anonymity," in Proceedings of the IEEE International Conference on Data Engineering
(ICDE), p. 24, 2006.
[5] C. Dwork, "Differential privacy: A survey of results," in Theory and Applications of Models of
Computation, ser. Lecture Notes in Computer Science, vol. 4978. Springer Verlag, April 2008,
pp. 1–19.
[6] R. Motwani and Y. Xu, "Efficient algorithms for masking and finding quasi-identifiers," in
Proceedings of the Conference on Very Large Data Bases (VLDB), 2007.
[7] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the Crowd: The
privacy bounds of human mobility," Scientific Reports, vol. 3, March 2013.
[8] N. Li and T. Li, "t-closeness: Privacy beyond κ-anonymity and ℓ-diversity," in Proc. of IEEE
23rd International Conference on Data Engineering (ICDE'07), 2007.
[9] A. Friedman, R. Wolff, and A. Schuster, "Providing κ-anonymity in data mining," The VLDB
Journal, vol. 17, no. 4, pp. 789–804, Jul. 2008.
[10] W. Yurcik, C. Woolam, G. Hellings, L. Khan, and B. M. Thuraisingham, "Toward trusted
sharing of network packet traces using anonymization: Single-field privacy/analysis tradeoffs,"
CoRR, vol. abs/0710.3979, 2007.
[11] S. Handelman, S. Stibler, N. Brownlee, and G. Ruth, "RFC 2724: RTFM: New attributes for
traffic flow measurement," 1999. [Online]. Available: http://tools.ietf.org/html/rfc2724
[12] N. Brownlee, "Network management and realtime traffic flow measurement," Journal of Network
and Systems Management, vol. 6, no. 2, pp. 223–228, 1998.
[13] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," SIG-
COMM Comput. Commun. Rev., vol. 32, no. 4, pp. 323–336, Aug. 2002. [Online]. Available:
http://doi.acm.org/10.1145/964725.633056
[14] N. Duffield, C. Lund, and M. Thorup, "Properties and prediction of flow statistics from sampled
packet streams," in Proc. ACM SIGCOMM Internet Measurement Workshop, 2002, pp.
159–171.
[15] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol.
28, no. 2, pp. 129–137, Mar 1982.
[16] T. M. Chen, "Intrusion detection for viruses and worms," IEC Annual Review of Communications,
vol. 57, 2004.

More Related Content

PDF
Prevention of Packet Hiding Methods In Selective Jamming Attack
PDF
SOM-PAD: Novel Data Security Algorithm on Self Organizing Map
PDF
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
PDF
Cv4201644655
PDF
Review on Encrypted Image with Hidden Data Using AES Algorithm
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
W4301117121
PDF
Stegonoraphy
Prevention of Packet Hiding Methods In Selective Jamming Attack
SOM-PAD: Novel Data Security Algorithm on Self Organizing Map
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
Cv4201644655
Review on Encrypted Image with Hidden Data Using AES Algorithm
International Journal of Computational Engineering Research(IJCER)
W4301117121
Stegonoraphy

What's hot (19)

PDF
Hiding text in audio using lsb based steganography
PDF
Audio steganography using r prime rsa and ga based lsb algorithm to enhance s...
PDF
Secure multipath routing scheme using key
PDF
Name a naming mechanism for delay disruption tolerant network
PDF
Paper id 212014145
DOCX
Audio Steganography synopsis
PDF
A NOVEL APPROACHES TOWARDS STEGANOGRAPHY
PDF
A Novel IP Traceback Scheme for Spoofing Attack
PDF
A0 ad276c eacf-6f38-e32efa1adf1e36cc
PDF
RSA Based Secured Image Steganography Using DWT Approach
PDF
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
PDF
File Encryption and Hiding Application Based on AES and Append Insertion Steg...
PDF
A New Design of Algorithm for Enhancing Security in Bluetooth Communication w...
PDF
A NOVEL APPROACH FOR CONCEALED DATA SHARING AND DATA EMBEDDING FOR SECURED CO...
PDF
Hiding Image within Video Clip
PDF
Hiding Text within Image Using LSB Replacement
PDF
A Novel Steganography Technique that Embeds Security along with Compression
PDF
A RSA- DWT Based Visual Cryptographic Steganogrphy Technique by Mohit Goel
Hiding text in audio using lsb based steganography
Audio steganography using r prime rsa and ga based lsb algorithm to enhance s...
Secure multipath routing scheme using key
Name a naming mechanism for delay disruption tolerant network
Paper id 212014145
Audio Steganography synopsis
A NOVEL APPROACHES TOWARDS STEGANOGRAPHY
A Novel IP Traceback Scheme for Spoofing Attack
A0 ad276c eacf-6f38-e32efa1adf1e36cc
RSA Based Secured Image Steganography Using DWT Approach
FAST DETECTION OF DDOS ATTACKS USING NON-ADAPTIVE GROUP TESTING
File Encryption and Hiding Application Based on AES and Append Insertion Steg...
A New Design of Algorithm for Enhancing Security in Bluetooth Communication w...
A NOVEL APPROACH FOR CONCEALED DATA SHARING AND DATA EMBEDDING FOR SECURED CO...
Hiding Image within Video Clip
Hiding Text within Image Using LSB Replacement
A Novel Steganography Technique that Embeds Security along with Compression
A RSA- DWT Based Visual Cryptographic Steganogrphy Technique by Mohit Goel
Ad

Viewers also liked (14)

PPT
Maintenance-Free Exteriors - Simonton Windows
PPTX
Privacy Preserving Log File Processing in Mobile Network Environment
PPTX
Must Know Things About Extended Tax Deadline Oct 15
PPTX
Luis alberto garcia de la cruz
PPTX
"Святая Екатерина" в Святом Власе (Болгария)
DOCX
MUSICA
DOCX
PPTX
Social entrepreneurs
ODP
Cours Histoire et communication Mélanie Roche : Gutenberg et l'imprimerie
PDF
Chamiluda guatemala-mayo2013
PPT
CESG - ACCOR - César Gonçalves
PPT
Gestão de Armazéns
PDF
Bolero Crowdfunding Inspiratiesessie Gent - 20 april 2016
PDF
Differential privacy and applications to location privacy
Maintenance-Free Exteriors - Simonton Windows
Privacy Preserving Log File Processing in Mobile Network Environment
Must Know Things About Extended Tax Deadline Oct 15
Luis alberto garcia de la cruz
"Святая Екатерина" в Святом Власе (Болгария)
MUSICA
Social entrepreneurs
Cours Histoire et communication Mélanie Roche : Gutenberg et l'imprimerie
Chamiluda guatemala-mayo2013
CESG - ACCOR - César Gonçalves
Gestão de Armazéns
Bolero Crowdfunding Inspiratiesessie Gent - 20 april 2016
Differential privacy and applications to location privacy
Ad

Similar to Utilisation of l-Diversity and Differential Privacy in the Anonymisation of Network Traces (20)

PDF
Hiding message from hacker using novel network techniques
DOC
Unit2[1]
DOC
Unit2[1]
PDF
Secure data transmission by using steganography
PDF
11.secure data transmission by using steganography
PDF
Implementation of De-Duplication Algorithm
PDF
Efficient security approaches in mobile ad hoc networks a survey
PDF
A novel cloud storage system with support of sensitive data application
PPT
Steganography
PDF
Wireless Network Security Architecture with Blowfish Encryption Model
PDF
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY
PDF
A Review Paper On Steganography Techniques
PDF
An Effective Semantic Encrypted Relational Data Using K-Nn Model
PDF
An Effective Semantic Encrypted Relational Data Using K-Nn Model
PPT
File transfer with multiple security mechanism
DOCX
Packet switching
PDF
Implementation on Data Security Approach in Dynamic Multi Hop Communication
PDF
Ijcnc050208
PDF
Elgamal signature for content distribution with network coding
PDF
data communication and network models two marks with answers
Hiding message from hacker using novel network techniques
Unit2[1]
Unit2[1]
Secure data transmission by using steganography
11.secure data transmission by using steganography
Implementation of De-Duplication Algorithm
Efficient security approaches in mobile ad hoc networks a survey
A novel cloud storage system with support of sensitive data application
Steganography
Wireless Network Security Architecture with Blowfish Encryption Model
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY
A Review Paper On Steganography Techniques
An Effective Semantic Encrypted Relational Data Using K-Nn Model
An Effective Semantic Encrypted Relational Data Using K-Nn Model
File transfer with multiple security mechanism
Packet switching
Implementation on Data Security Approach in Dynamic Multi Hop Communication
Ijcnc050208
Elgamal signature for content distribution with network coding
data communication and network models two marks with answers

Recently uploaded (20)

PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
additive manufacturing of ss316l using mig welding
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
DOCX
573137875-Attendance-Management-System-original
PPTX
Welding lecture in detail for understanding
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
web development for engineering and engineering
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
Geodesy 1.pptx...............................................
PPT
Project quality management in manufacturing
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Sustainable Sites - Green Building Construction
PPTX
OOP with Java - Java Introduction (Basics)
PDF
PPT on Performance Review to get promotions
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Foundation to blockchain - A guide to Blockchain Tech
additive manufacturing of ss316l using mig welding
bas. eng. economics group 4 presentation 1.pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
573137875-Attendance-Management-System-original
Welding lecture in detail for understanding
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
web development for engineering and engineering
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
R24 SURVEYING LAB MANUAL for civil enggi
Geodesy 1.pptx...............................................
Project quality management in manufacturing
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Sustainable Sites - Green Building Construction
OOP with Java - Java Introduction (Basics)
PPT on Performance Review to get promotions

Utilisation of l-Diversity and Differential Privacy in the Anonymisation of Network Traces

  • 1. Utilisation of -Diversity and Differential Privacy in the Anonymisation of Network Traces Shankar Lal Aalto University, Finland shankar.lal@aalto.fi Ian Oliver, Yoan Miche Security Research Nokia Neworks, Finland first.last@nokia.com Abstract: Noise addition for anonymisation is a known technique for increasing the privacy of a data sets. However this technique is often presented as individual and independent, or, just stated as techniques to be applied. This increases the danger of misapplication of these techniques and a resulting anonymised data set that is open to relatively easy re-identification or reconstruction. To better understand the application of these techniques we demonstrate their application to a specific domain - that of network trace anonymisation. 1 Introduction Privacy and especially anonymisation of data sets is a hot topic. It, therefore, comes as no surprise that data sets such as network traces which contain large amounts of sensitive information about the behavior of users on a network require such treatment. Techniques such as suppression, hashing and encryption of fields suffice to a point. In the case of suppression information is lost, while in hashing or encryption of data, the information content is transformed, from say an IP address which identifies a particular machine, into just some kind of generic identifier. In many cases a pattern of behavior is still recognizable, for example, hashing source and target IP addresses still reveals a unique pattern of communication if not the precise identities [1] [2]. More advanced techniques such as κ-anonymity [3], -diversity [4] and differential privacy [5] (amongst others) have been developed; κ-anonymity in particular has been successfully used with medical data. These techniques are now being recommended, if not mandated, to be used in the process of anonymisation. 
In this paper, we present techniques for anonymising network traces that preserve some de- gree of statistical properties such that some meaningful analysis can still be made. Working in this specific domain means that we can carefully tailor techniques such as differential privacy such that a reasonable degree of privacy is assured.
  • 2. 2 Network trace files A network trace file contains sensitive fields such as source and destination IP addresses, protocol type, packet lengths etc. Some of these can further act as quasi identifiers whose combination can lead to the identification of an individual despite seemingly identifying fields being removed or anonymised [6]. The source/destination IP address and time-stamp field can disclose who is talking to whom and also provide proof that communication ex- isted between the parties in certain period of time. Protocol field is crucial in the sense that certain protocols can identify the nature of traffic. Packet length field refers to the total length of an IP packet which includes payload and header information. This field is also very important from the security point of view, since certain security incidents have fixed packet length for example some network worms i.e. Slammer worm and Nachi worm have fixed packet length of 404 bytes and 92 bytes respectively [16] [10]. The packet length field is also vital in the sense that major transport protocols like TCP and UDP, mostly have packets of larger length e.g. around 1500 bytes. The other management protocols for example ICMP, DNS etc. have packet lengths mostly fewer than 200 bytes. Therefore due to this structure of the fields, it, sometime, can be easy to guess the protocol type by checking its packet length. 3 Overview of anonymisation techniques Data anonymity can not be ensured by employing any single anonymisation technique as each technique has its advantages and disadvantages. All data sets are required to be processed through any combination of the techniques presented here and the many variations thereof [7]. 
3.1 Differential Privacy

The notion of adding noise and randomized values to the data in a controlled manner is known as differential privacy, and it provides a technique suited to continuous data sets such as location data or, in our case, fields such as packet length and time-stamp. Consider two neighboring data sets D1 and D2, i.e. data sets which differ in only one entry (one row or one record). They produce output S when a mechanism K satisfying ε-differential privacy is applied. The mechanism K, for example a randomized function that adds jitter to some data field, bounds the information disclosure related to any individual. Differential privacy states that the probability of data set D1 producing output S is nearly the same as the probability of data set D2 producing the same output.
Dwork's [5] definition of differential privacy is the following: A mechanism K satisfies ε-differential privacy if for all pairs of adjacent databases D1 and D2, and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ e^ε × Pr[K(D2) ∈ S]   (1)

Here, ε is known as the privacy parameter. The ε-value corresponds to the strength of the privacy; a smaller value of ε usually yields better privacy. Differential privacy uses Laplace noise, whose scale is given by the formula ∆f/ε, where ∆f is the sensitivity of the function, defined as the maximum amount by which the output can be perturbed by adding or removing a single record from the data set. This measures how much the output can be altered if a response is either included in or excluded from the result. To get an idea of the value of the sensitivity: querying a data set for "how many rows have a certain property" yields a sensitivity of 1. In our network trace, we first issue the query "what is the value of the packet length/time-stamp field", which gives a sensitivity of 1, and then we add Laplace noise to the queried field using a suitable ε-value.

3.2 ℓ-diversity

The technique of ℓ-diversity involves partitioning fields within the data set such that within each group there exists a balanced representation [8]. This addresses a number of weaknesses in the κ-anonymity technique, such as homogeneity attacks [9]. Discrete fields in a network trace, such as the protocol, are also sensitive and need to be anonymised. For example, some specific protocols, like BitTorrent used for file sharing or UDP mostly used for video streaming, can identify the nature of the traffic. Machanavajjhala et al. [4] define the ℓ-diversity principle as: "A q-block is ℓ-diverse if it contains at least ℓ 'well-represented' values for the sensitive attribute S.
A table is ℓ-diverse if every q-block is ℓ-diverse."

4 Implementation of anonymisation techniques

In this section we apply the differential privacy and ℓ-diversity techniques to a network trace, along with guidelines on how the techniques are to be applied in this domain.
4.1 Differential Privacy over Continuous Fields

Differential privacy is suitable for continuous numerical fields, so we add random Laplace noise to the time-stamp and packet length fields in our network trace. We have tried a range of different values of ε to obfuscate the data fields, but for the sake of simplicity we present results based on two ε values in this paper. We first select ε = 0.01 and plot histograms of both the original and obfuscated data of the time-stamp and packet length fields; we then plot the same histograms with ε = 0.1 and compare both distributions. In figures 1 and 2, blue bars represent the original distribution of the data and green bars represent the distribution of the obfuscated data.

(a) Length Distribution with ε = 0.1 (b) Length Distribution with ε = 0.01
Figure 1: Packet Length Distributions

From figures 1(b) and 2(b), we can infer that the noise magnitude given by ε = 0.01 has heavily modified the data, and the obfuscated data no longer follows the distribution of the original data. This type of noise destroys the useful features of the data and makes statistical analysis useless. By contrast, as seen in figures 1(a) and 2(a), noise addition with ε = 0.1 produces nearly the same distribution as the original, implying that even though individual records are perturbed, the overall distribution of the data is preserved.

(a) Packet Frequency with Noise, ε = 0.1 (b) Packet Frequency with Noise, ε = 0.01
Figure 2: Time-stamp Distribution
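The noise addition described above can be sketched as follows. This is an illustrative Python implementation, not the code used for the experiments: Laplace samples are drawn by inverse-CDF sampling, and the example packet lengths are hypothetical.

```python
import math
import random

def laplace_noise(scale):
    """Sample zero-mean Laplace noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def obfuscate(value, epsilon, sensitivity=1.0):
    """Perturb a single numeric field with Laplace(sensitivity / epsilon) noise.

    Sensitivity is 1 for the per-field query used in the text."""
    return value + laplace_noise(sensitivity / epsilon)

# Hypothetical packet lengths perturbed with the two epsilon values from the text.
lengths = [1500, 1500, 404, 92, 60]
mild = [obfuscate(v, epsilon=0.1) for v in lengths]    # noise scale 10
heavy = [obfuscate(v, epsilon=0.01) for v in lengths]  # noise scale 100
```

A smaller ε means a larger noise scale ∆f/ε, which is why ε = 0.01 destroys the length distribution while ε = 0.1 largely preserves it.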
The box plot in figure 3 compares the statistical parameters of the original data and a set of obfuscated data with different ε-values. It can be seen in figure 3 that the obfuscated packet length field with ε = 0.1 maintains almost the same features of the box plot, such as the minimum, maximum, median, Q1 and Q3 values. In our experiment, it turned out that an ε value of 0.1 is the most suitable for our network trace.

Figure 3: Distribution Spread of Packet Length

4.2 ℓ-Diversity over Discrete Fields

The protocol field can be grouped into family equivalence classes, where the family name represents the characteristics of its members. To create the protocol equivalence classes, we first examine the types of protocols present in the network trace, and then group protocols of similar functionality into their respective equivalence classes. To obfuscate the protocol field, we can replace it with its equivalence class name. The benefit of doing so is to avoid inference attacks which might occur if the original protocol field were exposed.
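A minimal sketch of this replacement in Python; the class membership shown below is an illustrative assumption, not the exact grouping used in the experiments.

```python
# Illustrative protocol-to-equivalence-class mapping (assumed membership).
PROTOCOL_CLASSES = {
    "TCP": "Transport protocols",
    "UDP": "Transport protocols",
    "ICMP": "Management protocols",
    "DNS": "Management protocols",
    "TLS": "Security protocols",
    "IPsec": "Security protocols",
}

def anonymise_protocol(protocol):
    """Replace a concrete protocol name with its equivalence class name."""
    return PROTOCOL_CLASSES.get(protocol, "Other protocols")
```

Protocols not covered by the mapping fall into a catch-all class, so the anonymised field never exposes an unexpected original value.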
Our anonymised network trace consists of 5 equivalence classes, namely transport protocols, management protocols, security protocols, mobile network protocols and other protocols. Each equivalence class contains protocols with similar functionality; for example, the major transport protocols TCP and UDP are placed in the transport protocol equivalence class. Although replacing the protocol field with its equivalence class destroys some information, it still provides enough information about the type of the anonymised protocol; this is the trade-off between privacy and data utility. After replacing the original protocol field with its respective equivalence class, the percentage of each equivalence class present in the anonymised trace can easily be calculated and plotted as a pie chart, as shown in Figure 4. Figure 5 presents one sample of a 5-diverse network trace, with each block of data containing 5 diverse values of the equivalence class.

Figure 4: Protocol Equivalence Class Distribution

4.3 IP address anonymisation

There exist a number of ways to anonymise IP addresses. Traditionally, IP addresses are anonymised by using hashing methods or by converting them to some real number. The problem with these methods is that they do not provide any means to carry out statistical analysis over the anonymised IP addresses. We tried two different methods to anonymise IP addresses. The first method suppresses the last octet of the IP address while keeping the other octets intact, e.g. 192.168.1.∗. This technique
Figure 5: A sample of 5-diverse anonymised network trace

ensures that the user who generated the data packet cannot be traced back, while on the other hand it preserves information about the network topology, which might be useful in certain analyses. In the second method, we replace IP addresses with their corresponding class type; for example, the IP address 192.168.1.10 is replaced by Class C, and so on. Although this technique destroys the information about the network ID and subnet mask, it still provides some knowledge about the IP address class and the range of the addresses.

5 Clustering analysis of obfuscated data

In this section, we experiment with the obfuscated data obtained after applying the above anonymisation techniques, to observe its usefulness. This experiment uses packet flow records calculated from the obfuscated data, which become available for clustering.
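The two IP address anonymisation methods of Section 4.3 can be sketched as follows. This is a simple IPv4-only illustration; the classful boundaries follow the standard first-octet rule.

```python
def suppress_last_octet(ip):
    """Method 1: drop the host octet, keep the network part intact."""
    octets = ip.split(".")
    return ".".join(octets[:3]) + ".*"

def ip_to_class(ip):
    """Method 2: replace the address by its classful range (first-octet rule)."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "Class A"
    if first < 192:
        return "Class B"
    if first < 224:
        return "Class C"
    if first < 240:
        return "Class D"
    return "Class E"
```

For example, `suppress_last_octet("192.168.1.10")` gives `"192.168.1.*"` and `ip_to_class("192.168.1.10")` gives `"Class C"`, matching the behaviour described in Section 4.3.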
Flow statistics [11], [12] are a set of measurements over the traffic data. Analyzing the traffic at the flow level (and not at the packet level) allows for several types of analysis, typically usage-based charging, network anomaly detection, identification of heavy users, etc. In practice, and for on-line systems and analysis, a sampling of the packets is usually mandatory, as direct measurement at the flow level on a high-speed link is not possible in terms of CPU requirements, as well as memory and storage. Ideally, one would want to use all packet data to compute the flows for higher precision of the calculated statistics [13], [14]. In the case of this paper, this is actually possible, as we only consider off-line network traces for the experiment presented. We have used a NetMate/flowcalc-based approach; the software, named flowtbag, is specifically designed for off-line flow statistics extraction.

5.1 Practicalities about data processing

The overall methodology for this clustering-based analysis is described in Figure 6, and in more detail in the following.

Figure 6: Overall block diagram of the data processing flow used.

Given a certain network traffic interface G, the traffic passing through this interface over a certain period of time ∆t is sent to a computer running Wireshark (latest development version), dumping the traffic G(∆t) to a PCAP file F. This PCAP data file F is then sent directly to flow statistics extraction, which computes features that are directly usable for clustering. The resulting clustering is denoted Clus(F). The very same file F is also run through a set of functions, each of which is parametrized by θ and obfuscates some of the data, leading to the file Fθ. The θ parameter controls the amount of obfuscation applied to the traffic data.
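As a rough illustration of what flow statistics extraction does (flowtbag computes far more features per flow), packets can be grouped into flows keyed by the usual 5-tuple. The packet record layout used here is a hypothetical simplification:

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packet records into 5-tuple flows with simple per-flow statistics:
    packet count, byte count, and first/last timestamps (flow duration)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        f = flows[key]
        f["packets"] += 1
        f["bytes"] += p["length"]
        f["start"] = p["time"] if f["start"] is None else min(f["start"], p["time"])
        f["end"] = p["time"] if f["end"] is None else max(f["end"], p["time"])
    return dict(flows)
```

Because the start/end timestamps drive the flow duration, any noise added to the time field propagates directly into these statistics, which is why re-ordering and bounded noise matter in the next steps.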
The Fθ file then needs to be re-ordered if noise has been applied to the time field of the PCAP records, as this will have put the packets out of order and therefore rendered the data unusable for computing the flow statistics. The obfuscated data Fθ is re-ordered using the development version (1.99) of Wireshark, which allows for doing this directly on the PCAP file (using the associated reordercap tool). The re-ordered obfuscated PCAP file is then sent through the very same clustering as the original (non-obfuscated) file. This results in a certain clustering of the data, which we denote Clus(Fθ).
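When working on decoded packet records rather than on the PCAP file itself, this re-ordering step reduces to a sort on the (noised) timestamp; reordercap performs the equivalent operation directly on a PCAP file. A minimal sketch, assuming the same record layout as above:

```python
def reorder_by_timestamp(records):
    """Restore chronological order after noise has shuffled packet times."""
    return sorted(records, key=lambda r: r["time"])
```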
5.2 Remarks

Using flowtbag on a Linux-based computer with 6 GB RAM and an i5-4300U@1.9GHz CPU, we obtained the following processing speeds for a PCAP network trace:

• 14.7 secs to extract flow statistics from about 106124 packets;
• about 8200 packets/sec;
• about 0.00012 sec/packet.

Figures 7 (a) and (b) show the effect of the noise and of the ε-value on the flow records available for clustering. It can be noted from Figure 7 (a) that the number of available flow statistics records for analysis depends heavily on the noise value used. Indeed, if the noise introduced in the time-stamps of the packets is too large, the flow statistics, which are directly based on them, become impossible to compute for a large number of packets. As expected, from Figure 7 (b), the dependency of the number of available flow records on the ε-value is almost non-existent compared to that on the noise. In order to have meaningful results for the clustering part, it is thus necessary to have reasonable values for both the noise amount and ε. In the following experiment, ε is varied to observe its influence on the clustering performed.

Figure 7: Effect of the noise amount and ε-value on the number of flow records available for clustering.
Figure 8: Overall block diagram of the data processing flow used.

6 Discussion

In this paper, we have presented anonymisation techniques that can be applied specifically in the network trace domain for the practice of anonymisation. Specifically, we have emphasized differential privacy and ℓ-diversity, as these are little used in practice for anonymisation (with respect to privacy) while being techniques that are promoted by the privacy community. One of the dangers of any type of anonymisation is that techniques are either applied to single fields, ignoring the presence of functional dependencies and quasi-identifiers, or are applied without regard to the semantic domain of the data. In this paper we have shown the application of differential privacy and ℓ-diversity to obfuscate data in the network trace domain. As plain statistical analysis is just one mechanism for understanding the underlying data, machine learning provides a more sophisticated manner in which re-identification might be attempted. This work is still at a relatively early stage; understanding how differential privacy and other forms of noise addition affect analyses such as clustering will become critical to preserving privacy.

Acknowledgements

This paper was partially funded by the TEKES CyberTrust Programme.
References

[1] C. C. Aggarwal and P. S. Yu, "A general survey of privacy-preserving data mining models and algorithms," in Privacy-Preserving Data Mining, ser. The Kluwer International Series on Advances in Database Systems, C. C. Aggarwal, P. S. Yu, and A. K. Elmagarmid, Eds. Springer US, 2008, vol. 34, pp. 11–52.
[2] P. Langendorfer, M. Maaser, K. Piotrowski, and S. Peter, "Privacy Enhancing Techniques: A Survey and Classification." Information Science Reference, 2008.
[3] L. Sweeney, "κ-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, Oct. 2002.
[4] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "ℓ-diversity: Privacy beyond κ-anonymity," in Proc. 22nd International Conference on Data Engineering (ICDE), 2006, p. 24.
[5] C. Dwork, "Differential privacy: A survey of results," in Theory and Applications of Models of Computation, ser. Lecture Notes in Computer Science, vol. 4978. Springer Verlag, April 2008, pp. 1–19.
[6] R. Motwani and Y. Xu, "Efficient algorithms for masking and finding quasi-identifiers," in Proceedings of the Conference on Very Large Data Bases (VLDB), 2007.
[7] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the Crowd: The privacy bounds of human mobility," Scientific Reports, vol. 3, March 2013.
[8] N. Li and T. Li, "t-closeness: Privacy beyond κ-anonymity and ℓ-diversity," in Proc. of the IEEE 23rd International Conference on Data Engineering (ICDE 2007).
[9] A. Friedman, R. Wolff, and A. Schuster, "Providing κ-anonymity in data mining," The VLDB Journal, vol. 17, no. 4, pp. 789–804, Jul. 2008.
[10] W. Yurcik, C. Woolam, G. Hellings, L. Khan, and B. M. Thuraisingham, "Toward trusted sharing of network packet traces using anonymization: Single-field privacy/analysis tradeoffs," CoRR, vol. abs/0710.3979, 2007.
[11] S. Handelman, S. Stibler, N. Brownlee, and G. Ruth, "RFC 2724: RTFM: New attributes for traffic flow measurement," 1999. [Online]. Available: http://tools.ietf.org/html/rfc2724
[12] N. Brownlee, "Network management and realtime traffic flow measurement," Journal of Network and Systems Management, vol. 6, no. 2, pp. 223–228, 1998.
[13] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," SIGCOMM Comput. Commun. Rev., vol. 32, no. 4, pp. 323–336, Aug. 2002. [Online]. Available: http://doi.acm.org/10.1145/964725.633056
[14] N. Duffield, C. Lund, and M. Thorup, "Properties and prediction of flow statistics from sampled packet streams," in Proc. ACM SIGCOMM Internet Measurement Workshop, 2002, pp. 159–171.
[15] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[16] T. M. Chen, "Intrusion detection for viruses and worms," IEC Annual Review of Communications, vol. 57, 2004.