International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 02 | Feb 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2243
RANDOM DATA PERTURBATION TECHNIQUES IN PRIVACY PRESERVING DATA MINING
Sangavi N1, Jeevitha R2, Kathirvel P3, Dr. Premalatha K4
1,2,3PG Scholars, Bannari Amman Institute of Technology, Sathyamangalam
4Professor, Bannari Amman Institute of Technology, Sathyamangalam
----------------------------------------------------------------------***---------------------------------------------------------------------
ABSTRACT-Data mining strategies have faced a
serious challenge in recent years due to heightened privacy
concerns, i.e. protecting the privacy of
important and sensitive data. Data perturbation is a
common privacy technique in data mining. Data perturbation's
biggest challenge is to balance privacy protection and data
quality, which are normally considered a pair of
contradictory factors. The geometric data perturbation
technique combines rotation perturbation,
translation, and noise addition. Publishing data while
protecting privacy-sensitive details is particularly useful for
data owners. Typical examples include publishing microdata
for research purposes or contracting the data to third parties
that provide data mining services. In this paper we
explore the latest trends in geometric data
perturbation.
Keywords: Data mining, Privacy preserving, data
perturbation, randomization, cryptography, Geometric
Data Perturbation.
INTRODUCTION
Enormous volumes of detailed personal data are routinely
collected and analyzed using data mining tools. These data
include, among others, shopping habits, criminal records,
medical history, and credit records. Such data, on the one
hand, is an important asset for business organizations and
governments, both in decision-making processes and in
providing social benefits such as medical research, crime
reduction, national security, etc. On the other hand, data
mining techniques are capable of deriving highly sensitive
information from unclassified data that is not even known to
database holders. Worse is the privacy invasion caused by
secondary use of data, when people are unaware that data
mining techniques are being used "behind the scenes" [3].
The daunting problem: how can we defend against the
misuse of information uncovered through
secondary data use while still meeting the needs of
organizations and governments to facilitate decision-making
or even promote social benefits? A solution to such
a problem is claimed to involve two essential techniques:
anonymization, as the first step of privacy protection, to
remove identifiers (e.g. names, social insurance numbers,
addresses, etc.), and data transformation to protect
sensitive attributes (e.g. income, age, etc.), since the
released data, even after removal of identifiers, may
contain other information that can be
linked with other datasets to re-identify individuals or
entities [4].
We cannot effectively safeguard data privacy against naive
estimation. Rotation perturbation and random projection
perturbation are both threatened by prior-knowledge-
enabled Independent Component Analysis.
Multidimensional anonymization is only intended for
general-purpose utility preservation and may result in
low-quality data mining models. In this paper we propose a new
multidimensional data perturbation technique, geometric
data perturbation, that can be applied to several categories
of popular data mining models with better utility
preservation and privacy preservation [5].
Need for Privacy in Data Mining
Information is arguably the most important and
in-demand resource today. We live in an online society that
relies on the dissemination and sharing of information in
the private as well as the public and government sectors.
State, federal, and private entities are increasingly being
required to make their data electronically available [5][6].
This requires protecting the privacy of respondents
(individuals, groups, associations, businesses, etc.).
Though technically anonymous, de-identified data may
include other data, such as ethnicity,
birth date, gender and ZIP code, which may be unique or
nearly unique. By finding these characteristics in publicly
available databases that associate them with
the identity of the respondent, data recipients may determine
to which respondent each piece of released data belongs, or
narrow their uncertainty to a specific subset of persons.
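The re-identification risk described above can be made concrete with a small check. The following is a minimal sketch, not a method from this paper: the records, field names, and the helper unique_fraction are hypothetical, and the check simply counts how many released records carry a combination of quasi-identifiers (ZIP code, birth date, gender) that no other record shares.

```python
from collections import Counter

# Hypothetical de-identified records: names removed, quasi-identifiers kept.
records = [
    {"zip": "638401", "birth_date": "1990-04-12", "gender": "F", "diagnosis": "asthma"},
    {"zip": "638401", "birth_date": "1990-04-12", "gender": "F", "diagnosis": "diabetes"},
    {"zip": "638402", "birth_date": "1985-11-03", "gender": "M", "diagnosis": "flu"},
    {"zip": "638455", "birth_date": "1972-01-30", "gender": "F", "diagnosis": "migraine"},
]
quasi_identifiers = ("zip", "birth_date", "gender")


def unique_fraction(rows, keys):
    """Fraction of rows whose quasi-identifier combination occurs exactly once."""
    combos = [tuple(row[k] for k in keys) for row in rows]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(rows)


print(unique_fraction(records, quasi_identifiers))  # 0.5
```

In this toy data, half of the records are unique on the three attributes, so anyone holding a public list with the same attributes and names could link them directly.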
DATA PERTURBATION
Data-perturbation-based approaches fall into two main
categories, which we call the probability distribution
category and the fixed data perturbation category [8]. The
probability distribution category considers the database as
a sample from a given population with a given probability
distribution. In this case, the security control method
replaces the original data with another sample from the
same distribution, or with the distribution itself. In the
fixed data perturbation category, the values of the
attributes in the database that are used to calculate
statistics are perturbed once and for all. Fixed data
perturbation methods have been developed exclusively for
either numerical or categorical data [9].
Within the probability distribution category, two
methods can be identified. The first is called "data swapping"
or "multidimensional transformation". This approach
replaces the original database with a randomly generated
database of exactly the same probability distribution as the
original database [10]. Whenever a new entity is added or a
current entity is deleted, the relationship between this
entity and the rest of the database must be taken into
account when calculating a new perturbation. A one-to-one
mapping between the
original database and the perturbed database is needed.
The precision resulting from this method may be
considered unacceptable, since in some cases the method
may have an error of up to 50 per cent. The second method is
called the probability distribution method. The method
consists of three steps: (1) Identify the underlying density
function of the attribute values, and estimate the
parameters of that function. (2) Generate a sample series
of confidential attribute data from the estimated
density function. The new sample should be
the same size as the database. (3) Substitute the generated
values for the original values in rank order. In
other words, the smallest value of the new sample replaces
the smallest value in the original data, and so on.
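The three steps of the probability distribution method can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the published procedure: the density is assumed to be normal, the column of incomes is invented, and the function name perturb_by_fitted_distribution is hypothetical.

```python
import random
import statistics


def perturb_by_fitted_distribution(values, seed=0):
    """Replace each value by the same-rank value of a sample from a fitted normal."""
    rng = random.Random(seed)
    # Step 1: assume a normal density and estimate its parameters.
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    # Step 2: draw a new sample of the same size as the original column.
    sample = sorted(rng.gauss(mu, sigma) for _ in values)
    # Step 3: the i-th smallest generated value replaces the i-th smallest original.
    order = sorted(range(len(values)), key=lambda i: values[i])
    perturbed = [0.0] * len(values)
    for rank, idx in enumerate(order):
        perturbed[idx] = sample[rank]
    return perturbed


incomes = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8]  # hypothetical confidential column
released = perturb_by_fitted_distribution(incomes)
```

The released column keeps the rank order of the original records while every individual value comes from the fitted distribution rather than from a respondent.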
Data perturbation is a popular privacy technique in data
mining. Its biggest challenge is to
balance privacy protection and data quality, which are
normally considered a pair of contradictory factors [11].
In this method, the distribution of each data dimension is
reconstructed independently. This means that
any data mining algorithm based on the distribution works
under an implicit assumption that each dimension is
treated independently.
Approaches to data perturbation are divided into two: the
probability distribution approach and the
value distortion approach. The probability distribution
approach replaces the data with another sample from the same
distribution or with the distribution itself, whereas the
value distortion approach perturbs data elements or
attributes directly by additive noise, multiplicative noise, or
other randomization procedures. There are three types
of approaches to data perturbation: rotation perturbation,
projection perturbation, and geometric data
perturbation.
DIFFERENT METHODS OF DATA
PERTURBATION
3.1 Noise Additive Perturbation
The standard technique of additive perturbation [13] is
column-based additive randomization. This type of
technique relies on two facts: 1) data owners may
not want to protect all values in a record equally, so a
column-based value distortion can be applied to
perturb some sensitive columns; 2) the data classification
models to be used do not necessarily require the individual
records, but only the column value distributions,
assuming independent columns. The basic method is to
disguise the original values by injecting a certain amount of
random additive noise, while specific information, such
as the column distribution, can still be effectively
reconstructed from the perturbed data.
We treat the original values (x1, x2, ..., xn) from a column as
randomly drawn from a random variable X, which has
some kind of distribution. By adding random noise R to
the original data values, the randomization process
changes the original data and generates a perturbed data
column Y, Y = X + R. The resulting record (x1+r1,
x2+r2, ..., xn+rn) and the distribution of R are published.
The key to random noise addition is the distribution
reconstruction algorithm, which recovers the column
distribution of X from the perturbed data and the
distribution of R.
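A minimal sketch of Y = X + R follows. It assumes zero-mean Gaussian noise and an invented column of ages, and it recovers only the mean and variance of X from Y and the published noise variance; a full distribution reconstruction algorithm is more involved and is not shown here. The function name add_noise is hypothetical.

```python
import random
import statistics


def add_noise(column, noise_std, seed=0):
    """Return Y = X + R, with R drawn from a zero-mean Gaussian (assumed noise model)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in column]


# Hypothetical sensitive column X: 5000 ages drawn uniformly from [20, 60].
rng = random.Random(1)
ages = [rng.uniform(20, 60) for _ in range(5000)]

noise_std = 10.0                      # published together with Y
published = add_noise(ages, noise_std)

# X and R are independent, so E[Y] = E[X] and Var(Y) = Var(X) + Var(R).
est_mean = statistics.mean(published)
est_var = statistics.variance(published) - noise_std ** 2

print(round(est_mean, 1), round(statistics.mean(ages), 1))
print(round(est_var, 1), round(statistics.variance(ages), 1))
```

Individual published values can be far from the true ages, yet the column-level statistics remain close, which is the property the classification models rely on.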
3.2 Condensation-based Perturbation:
The condensation approach is a typical
multidimensional perturbation technique, which aims at
preserving the covariance matrix for multiple columns.
Thus, some geometric properties, such as the shape of the
decision boundary, are well preserved. Unlike the
randomization approach, it perturbs multiple columns as a
whole to generate the entire perturbed dataset. Because the
perturbed dataset preserves the covariance matrix,
many existing data mining algorithms can be applied
directly to the perturbed dataset without requiring
algorithm modifications or new development.
It begins by partitioning the original data into groups of k
records. Each group is formed in two steps: randomly choosing
a record from the current records as the center of the
group, and then identifying the (k − 1) nearest neighbors
of the center as the other (k − 1) members. The selected k
records are removed from the original dataset before the
next group is formed. Since each group has a small
locality, a set of k records can be regenerated that
approximately preserves the distribution and covariance.
The record regeneration algorithm aims to preserve each
group's eigenvectors and eigenvalues, as shown in Figure 1.
Fig. 1 Eigenvalues of each group
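The grouping stage of condensation can be sketched as below. This is a simplified illustration, not the published algorithm: it covers only the partitioning into groups of k records, attaches any leftover records to the last group (an assumption made here), and omits the regeneration step based on eigenvectors and eigenvalues. The names condense_groups and group_mean and the sample points are hypothetical.

```python
import math
import random


def condense_groups(records, k, seed=0):
    """Partition records into groups: a random centre plus its k-1 nearest neighbours."""
    rng = random.Random(seed)
    remaining = list(records)
    groups = []
    while len(remaining) >= k:
        centre = remaining.pop(rng.randrange(len(remaining)))
        # Order the remaining records by Euclidean distance to the chosen centre.
        remaining.sort(key=lambda r: math.dist(r, centre))
        groups.append([centre] + remaining[:k - 1])
        remaining = remaining[k - 1:]  # remove the group before forming the next one
    # Assumption of this sketch: fewer than k leftovers join the last group.
    if remaining:
        if groups:
            groups[-1].extend(remaining)
        else:
            groups.append(remaining)
    return groups


def group_mean(group):
    """Per-dimension mean of a group, one of the statistics kept per group."""
    dims = len(group[0])
    return [sum(r[j] for r in group) / len(group) for j in range(dims)]


points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0),
          (5.1, 4.8), (4.9, 5.2), (9.0, 1.0)]
groups = condense_groups(points, k=3)
print([len(g) for g in groups])  # [3, 4]
```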
3.3 Random Projection Perturbation:
Random projection perturbation (Liu, Kargupta and Ryan,
2006) refers to the technique of projecting a set of data
points from the original multidimensional space to another
randomly chosen space. Let P be a k × d
random projection matrix whose rows are
orthonormal [14].
G(X) = is applied to perturb the dataset X.
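The projection formula is not legible in this text, so the sketch below rests on an assumption: it uses the form G(X) = sqrt(d/k) P X that is commonly cited for random projection perturbation, where d is the original dimension and k the reduced one. The matrix P with orthonormal rows is built here by Gram-Schmidt orthogonalization of Gaussian vectors, which is one possible construction and not necessarily the one used by the cited authors. All function names are hypothetical.

```python
import math
import random


def random_orthonormal_rows(k, d, seed=0):
    """Return k orthonormal row vectors in R^d (Gram-Schmidt on Gaussian vectors)."""
    if k > d:
        raise ValueError("k must not exceed d")
    rng = random.Random(seed)
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove the components along the rows found so far
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the unlikely degenerate draw
            rows.append([a / norm for a in v])
    return rows


def project(P, x):
    """Map a d-dimensional record x to k dimensions, scaled by sqrt(d/k) (assumed form)."""
    k, d = len(P), len(x)
    scale = math.sqrt(d / k)
    return [scale * sum(p * xi for p, xi in zip(row, x)) for row in P]


P = random_orthonormal_rows(k=2, d=4)
record = [1.0, 2.0, 3.0, 4.0]  # hypothetical 4-attribute record
print(project(P, record))
```

Because k is smaller than d, the original record cannot be recovered exactly from its projection, while distances are preserved only approximately and on average.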
3.4 Geometric data perturbation:
Def: Geometric data perturbation consists of a sequence of
random geometric transformations, including
multiplicative transformation (R), translation
transformation (Ψ), and distance perturbation ∆.
G(X) = RX + Ψ + ∆ [15]
The data is assumed to be a matrix A of size p × q, where
each of the p rows is an observation, Oi, and each
observation contains a value for each of the q
attributes, Ai. The matrix
may include both numerical and categorical attributes.
However, our geometric data transformation methods
rely on the d numerical attributes, such that d <= q. Thus, in
Euclidean space, the p × d matrix, which is subject to
transformation, can be regarded as a vector subspace V, so
that each vector vi ∈ V has the form vi = (a1, ..., ad),
1 <= i <= d, where ai is one instance of Ai, ai ∈ R, and R is
the set of real numbers. Before releasing the data for
clustering analysis, the vector subspace V must be
transformed to preserve the privacy of the individual data
records. To transform V into a distorted vector subspace V',
we need to add or even multiply a constant noise term e to
each element vi of V.
Translation Transformation: A constant is added to all
attribute values. The constant may be negative or
positive. Although its degree of privacy protection
is 0 according to the formula for calculating the degree of
privacy protection, it prevents the raw data from being read
directly from the transformed data, so the translation
transformation can also contribute to privacy protection.
Translation is the task of moving a point with coordinates
(X, Y) through displacements (X0, Y0) to a new location.
This is easily achieved using a matrix representation
v' = Tv, where T is the 2 × 3
transformation matrix depicted in Figure 2(a), v is the
column vector containing the original coordinates, and v' is
a column vector containing the transformed
coordinates. This matrix form is also
applicable to scaling and rotation.
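The matrix form v' = Tv can be illustrated with a short sketch. Since the figure is not reproduced in this text, the entries of T used below are an assumption: the usual 2 × 3 translation matrix acting on the homogeneous column vector (X, Y, 1). The function names are hypothetical.

```python
def translation_matrix(x0, y0):
    """2 x 3 matrix that shifts a point by (x0, y0) in homogeneous coordinates."""
    return [[1.0, 0.0, x0],
            [0.0, 1.0, y0]]


def transform(T, point):
    """Compute v' = T v for v = (X, Y, 1)."""
    x, y = point
    v = (x, y, 1.0)
    return tuple(sum(t * c for t, c in zip(row, v)) for row in T)


moved = transform(translation_matrix(3.0, -2.0), (1.0, 5.0))
print(moved)  # (4.0, 3.0)
```

Every point is shifted by the same displacement, so differences between points, and therefore distances, are unchanged.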
Rotation Transformation: For a pair of arbitrarily selected
attributes, regard their values as points in
two-dimensional space and rotate them by a given angle
about the origin. If the angle is positive, the points are
rotated in the anti-clockwise direction; otherwise they are
rotated in the clockwise direction.
Rotation is a more challenging transformation. In its
simplest form, this transformation rotates a
point about the coordinate axes. Rotation of a point by
an angle θ in a discrete 2D space is achieved using the
transformation matrix shown in Figure 2(b). The rotation
angle θ is measured clockwise and this transformation affects
the values of the X and Y coordinates.
Fig. 2 (a) Translation Matrix (b) Rotation Matrix
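A rotation of a pair of attributes can be sketched as follows. The matrix of Fig. 2(b) is not reproduced in this text, so the entries below are an assumption: the standard 2 × 2 matrix for a clockwise rotation by θ about the origin, following the statement that the angle is measured clockwise. A negative angle then gives an anti-clockwise rotation. The function names and sample points are hypothetical.

```python
import math


def rotation_matrix(theta_deg):
    """2 x 2 matrix for a clockwise rotation by theta degrees about the origin."""
    t = math.radians(theta_deg)
    return [[math.cos(t), math.sin(t)],
            [-math.sin(t), math.cos(t)]]


def rotate(point, theta_deg):
    """Apply the rotation matrix to a 2-D point (X, Y)."""
    R = rotation_matrix(theta_deg)
    x, y = point
    return (R[0][0] * x + R[0][1] * y, R[1][0] * x + R[1][1] * y)


p, q = (1.0, 0.0), (4.0, 4.0)
p2, q2 = rotate(p, 90), rotate(q, 90)

# Both coordinates change, but the distance between the two points does not.
print(math.dist(p, q), math.dist(p2, q2))
```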
The two components above, translation and rotation, preserve
the distance relationships. By preserving distances, a
number of important classification models become
"perturbation-invariant", which is the core of
geometric perturbation. In some situations, distance-
preserving perturbation may be subject to distance-
inference attacks. The objective of distance perturbation is
to preserve distances approximately while effectively
increasing resilience to distance-inference attacks. We
define the third component as a random matrix ∆, where
each entry is an independent sample drawn from the same
distribution with zero mean and small
variance. Adding this
component slightly perturbs the distance between any pair
of points.
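The three components can be combined in a short sketch of G(X) = RX + Ψ + ∆ for two-dimensional records. The choices below are assumptions made for illustration: R is a rotation by a uniformly random angle, Ψ is a single random shift shared by all records, and each entry of ∆ is Gaussian with zero mean and a small standard deviation. The function name and sample records are hypothetical.

```python
import math
import random


def geometric_perturbation(records, noise_std=0.05, seed=0):
    """Apply G(x) = R x + psi + delta to each 2-D record (illustrative sketch)."""
    rng = random.Random(seed)
    theta = rng.uniform(0.0, 2.0 * math.pi)            # random rotation R
    c, s = math.cos(theta), math.sin(theta)
    psi = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))  # shared translation
    out = []
    for x, y in records:
        # delta: independent zero-mean noise for every entry
        dx, dy = rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std)
        out.append((c * x + s * y + psi[0] + dx,
                    -s * x + c * y + psi[1] + dy))
    return out


pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
exact = geometric_perturbation(pts, noise_std=0.0)   # rotation + translation only
noisy = geometric_perturbation(pts, noise_std=0.05)  # with distance perturbation

print(math.dist(pts[0], pts[1]))      # 5.0
print(math.dist(exact[0], exact[1]))  # equal to 5.0 up to rounding
print(math.dist(noisy[0], noisy[1]))  # close to, but not exactly, 5.0
```

Without ∆ the pairwise distances are preserved exactly, which is what a distance-inference attack exploits; with ∆ they are only approximately preserved.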
CONCLUSIONS
This paper focuses mainly on the random geometric
perturbation approach to privacy-preserving data
classification. Random
geometric perturbation, G(X) = RX + Ψ + ∆, is a
linear combination of three components: rotation
perturbation, translation perturbation, and distance
perturbation. Geometric perturbation preserves
important geometric properties, so most data mining
models that search for geometric class boundaries are well
preserved on the perturbed data.
Geometric perturbation perturbs multiple columns in one
transformation, which introduces new challenges in
evaluating the privacy guarantee for multi-dimensional
perturbation.
REFERENCES
1. Chhinkaniwala H. and Garg S., "Privacy Preserving
Data Mining Techniques: Challenges and Issues",
CSIT, 2011.
2. L. Golab and M. T. Ozsu, "Data Stream Management
Issues - A Survey", Technical Report, 2003.
3. Majid, M. Asger, Rashid Ali, "Privacy Preserving Data
Mining Techniques: Current Scenario and Future
Prospects", IEEE, 2012.
4. Aggarwal C. C. and Yu P. S., "A Condensation
Approach to Privacy Preserving Data Mining", Proc.
of Int. Conf. on Extending Database
Technology (EDBT), 2004.
5. Chen K. and Liu L., "Privacy Preserving Data
Classification with Rotation Perturbation",
Proc. ICDM, 2005, pp. 589-592.
6. K. Liu, H. Kargupta, and J. Ryan, "Random
Projection-Based Multiplicative Data Perturbation for
Privacy Preserving Distributed Data Mining", IEEE
Transactions on Knowledge and Data Engineering, Jan
2006, pp. 92-106.
7. Keke Chen, Gordon Sun, and Ling Liu, "Towards
Attack-Resilient Geometric Data Perturbation", in
Proceedings of the 2007 SIAM International
Conference on Data Mining, April 2007.
8. M. Reza, Somayyeh Seifi, "Classification and
Evaluation the PPDM Techniques by Using a Data
Modification-Based Framework", IJCSE, Vol. 3, No. 2,
Feb 2011.
9. Vassilios S. Verykios, E. Bertino, Igor N.,
"State-of-the-Art in Privacy Preserving Data Mining",
published in SIGMOD, 2004, pp. 121-154.
10. Ching-Ming, Po-Zung and Chu-Hao, "Privacy
Preserving Clustering of Data Streams", Tamkang
Journal of Sc. & Engg, Vol. 13, No. 3, pp. 349-358.
11. Jie Liu, Yifeng Xu, "Privacy Preserving Clustering by
Random Response Method of
Geometric Transformation", IEEE, 2010.
12. Keke Chen, Ling Liu, "Privacy Preserving Multiparty
Collaborative Mining with Geometric Data
Perturbation", IEEE, January 2009.

More Related Content

PDF
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
PDF
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
PDF
Cancer data partitioning with data structure and difficulty independent clust...
PDF
IRJET - Survey on Clustering based Categorical Data Protection
PDF
winbis1005
PDF
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
PDF
Paper id 212014109
PDF
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
Cancer data partitioning with data structure and difficulty independent clust...
IRJET - Survey on Clustering based Categorical Data Protection
winbis1005
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Paper id 212014109
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub...

What's hot (19)

PDF
Saif_CCECE2007_full_paper_submitted
PDF
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
PDF
G44093135
PDF
Efficient classification of big data using vfdt (very fast decision tree)
PDF
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
PDF
An Analysis of Outlier Detection through clustering method
PDF
Privacy Preserving Clustering on Distorted data
PPTX
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
PPTX
Cluster analysis
PDF
Volume 2-issue-6-2143-2147
PDF
Feature Subset Selection for High Dimensional Data using Clustering Techniques
PDF
Digital image hiding algorithm for secret communication
PPTX
Protection models
PDF
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
PDF
IRJET- A Detailed Study on Classification Techniques for Data Mining
PDF
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
PDF
Combined mining approach to generate patterns for complex data
PDF
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
Saif_CCECE2007_full_paper_submitted
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
G44093135
Efficient classification of big data using vfdt (very fast decision tree)
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
An Analysis of Outlier Detection through clustering method
Privacy Preserving Clustering on Distorted data
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
Cluster analysis
Volume 2-issue-6-2143-2147
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Digital image hiding algorithm for secret communication
Protection models
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
IRJET- A Detailed Study on Classification Techniques for Data Mining
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA
Combined mining approach to generate patterns for complex data
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
Ad

Similar to IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mining (20)

PDF
A Survey on Features and Techniques Description for Privacy of Sensitive Info...
PDF
Survey paper on Big Data Imputation and Privacy Algorithms
PDF
Distance based transformation for privacy preserving data mining using hybrid...
PDF
Enhanced Privacy Preserving Accesscontrol in Incremental Datausing Microaggre...
PDF
Additive gaussian noise based data perturbation in multi level trust privacy ...
PDF
Using Randomized Response Techniques for Privacy-Preserving Data Mining
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
PDF
PRIVACY PRESERVING DATA MINING BY USING IMPLICIT FUNCTION THEOREM
PDF
journal for research
PDF
Framework to Avoid Similarity Attack in Big Streaming Data
PDF
Cluster Based Access Privilege Management Scheme for Databases
PDF
An efficient algorithm for privacy
PDF
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PDF
Different Classification Technique for Data mining in Insurance Industry usin...
PDF
Privacy preserving clustering on centralized data through scaling transf
PDF
Data mining techniques
PDF
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
PDF
A Survey on Privacy-Preserving Data Aggregation Without Secure Channel
PDF
MDAV2K: A VARIABLE-SIZE MICROAGGREGATION TECHNIQUE FOR PRIVACY PRESERVATION
A Survey on Features and Techniques Description for Privacy of Sensitive Info...
Survey paper on Big Data Imputation and Privacy Algorithms
Distance based transformation for privacy preserving data mining using hybrid...
Enhanced Privacy Preserving Accesscontrol in Incremental Datausing Microaggre...
Additive gaussian noise based data perturbation in multi level trust privacy ...
Using Randomized Response Techniques for Privacy-Preserving Data Mining
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
PRIVACY PRESERVING DATA MINING BY USING IMPLICIT FUNCTION THEOREM
journal for research
Framework to Avoid Similarity Attack in Big Streaming Data
Cluster Based Access Privilege Management Scheme for Databases
An efficient algorithm for privacy
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
Different Classification Technique for Data mining in Insurance Industry usin...
Privacy preserving clustering on centralized data through scaling transf
Data mining techniques
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
A Survey on Privacy-Preserving Data Aggregation Without Secure Channel
MDAV2K: A VARIABLE-SIZE MICROAGGREGATION TECHNIQUE FOR PRIVACY PRESERVATION
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPT
Project quality management in manufacturing
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
additive manufacturing of ss316l using mig welding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPT
Mechanical Engineering MATERIALS Selection
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
DOCX
573137875-Attendance-Management-System-original
PDF
Well-logging-methods_new................
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
Artificial Intelligence
PDF
composite construction of structures.pdf
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Project quality management in manufacturing
OOP with Java - Java Introduction (Basics)
Embodied AI: Ushering in the Next Era of Intelligent Systems
additive manufacturing of ss316l using mig welding
Foundation to blockchain - A guide to Blockchain Tech
Mechanical Engineering MATERIALS Selection
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
UNIT-1 - COAL BASED THERMAL POWER PLANTS
573137875-Attendance-Management-System-original
Well-logging-methods_new................
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Artificial Intelligence
composite construction of structures.pdf
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Operating System & Kernel Study Guide-1 - converted.pdf
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks

IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mining

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 02 | Feb 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2243 RANDOM DATA PERTURBATION TECHNIQUES IN PRIVACY PRESERVING DATA MINING Sangavi N1, Jeevitha R2, Kathirvel P3, Dr. Premalatha K4 1,2,3PG Scholars, Bannari Amman Institute of Technology, Sathyamangalam 4Professor, Bannari Amman Institute of Technology, Sathyamangalam ----------------------------------------------------------------------***--------------------------------------------------------------------- ABSTRACT-Data mining strategies have been facing a serious challenge in recent years due to heightened privacy concerns and concerns, i.e. protecting the privacy of important and sensitive data. Data perturbation is a common Data Mining privacy technique. Dataperturbation's biggest challenge is to balance privacy protection and data quality, which is normally considered to be a pair of contradictory factors. Geometric perturbation techniquefor data is a combination of perturbationtechniqueforrotation, translation, and noise addition. Publishing data while protecting privacy–sensitive details–isparticularlyusefulfor data owners. Typical examplesincludepublishingmicrodata for research purposes or contractingthe datatothirdparties providing services for data mining. In this paper we are trying to explore the latest trends in the technique of perturbation of geometric results. Keywords: Data mining, Privacy preserving, data perturbation, randomization, cryptography, Geometric Data Perturbation. INTRODUCTION Enormous volumes of extensivepersonal data areroutinely collected and analyzed using data mining tools. These data include, among others, shopping habits, criminal records, medical history, and credit records. 
Such data, on the one hand, is an important asset for business organizations and governments, both in decision-making processes and in providing social benefits such as medical research, crime reduction, national security, etc. Data mining techniques are capable of deriving highly sensitive information from unclassified data which is not even exposed to database holders. Worse is the privacy invasion triggered by secondary data use when people are unaware of usingdata mining techniques "behind the scenes"[3]. The daunting problem: how can we defend against the misuse of information that has been uncovered from secondary data use and meet the needs of organizations and governments to facilitate decision-making or even promote social benefits? They claim that a solution to such a problem involves two essential techniques: anonymityin the first step of privacy protection to delete identifiers (e.g. names, social insurance numbers, addresses, etc.) anddata transformation to preserve those sensitive attributes (e.g. income, age, etc.) since the release of data, after removal of data. identifiers, may contain other information thatcanbe linked with other datasets to re-identify individuals or entities [4]. We cannot effectively safeguard dataprivacyagainstnaive estimation. Rotation perturbation and random projection perturbation are all threatened by prior knowledge allowed Independent Component Analysis Multidimensional-anonymization is only intended for general-purpose utility preservationandmayresultinlow- quality data mining models.Inthispaperweproposea new multidimensional data perturbation technique: geometric data perturbation that can be appliedforseveral categories of popular data mining models with better utility preservation and privacy preservation[5]. Need for Privacy in Data Mining Presumably information is the most important and demanded resource today. 
We live in an online society that relies on the dissemination and sharing of information in the private as well as the public and government sectors. State, federal, and private entities are increasingly being required to make their data electronically available [5][6]. Protecting the privacy of respondents (individuals, groups, associations, businesses, etc.) therefore becomes essential. Though technically anonymous, de-identified data may include other data, such as ethnicity, birth date, gender and ZIP code, which may be unique or nearly unique. By identifying these characteristics in publicly available databases that associate them with the identity of the respondent, data recipients may determine which respondent each piece of released data belongs to, or narrow their uncertainty to a specific subset of persons.

DATA PERTURBATION
Data-perturbation-based approaches fall into two main categories, which we call the probability distribution category and the fixed data perturbation category [8]. The probability distribution category considers the database as a sample from a given population with a given probability distribution. In this case, the security control method replaces the original data with another sample from the same distribution, or with the distribution itself. In fixed data perturbation, the values of the attributes in the database that are used to calculate statistics are perturbed once and for all. Fixed data perturbation methods have been developed solely for numerical data or solely for categorical data [9].

Within the probability distribution category two methods can be identified. The first is called "data swapping" or "multidimensional transformation". This approach replaces the original database with a randomly generated database that has exactly the same probability distribution as the original database [10]. Whenever a new entity is added or a current entity is deleted, the relationship between this entity and the rest of the database must be taken into account when calculating a new perturbation. A one-to-one mapping between the original database and the perturbed database is needed. The precision resulting from this method may be considered unacceptable, since in some cases the method may have an error of up to 50 per cent.

The second method is called the probability distribution method. It consists of three steps:
(1) Identify the underlying density function of the attribute values, and estimate the parameters of that function.
(2) Generate a sample series of confidential attribute data from the estimated density function. The new sample should be the same size as the database.
(3) Substitute the generated data for the original confidential data in rank order. In other words, the smallest value of the new sample replaces the smallest value in the original data, and so on.

Data perturbation is a popular data mining privacy technique. Its biggest challenge is to balance privacy protection and data quality, which are normally considered a pair of contradictory factors [11]. In this method the distribution is reconstructed independently for each data dimension. This means that any data mining algorithm based on the reconstructed distribution works under the implicit assumption that each dimension is treated independently. Data perturbation approaches are divided into two groups: the probability distribution approach and the value distortion approach.
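The three steps of the probability distribution method described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure as implemented in the cited work: it assumes a normal density for the confidential attribute (the text does not fix a distribution family), and the names `distribution_swap` and `ages` are hypothetical.

```python
import random
import statistics


def distribution_swap(values, rng):
    """Replace a confidential column by a same-size sample, matched by rank."""
    # Step 1: estimate the parameters of the assumed (normal) density.
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)

    # Step 2: generate a new sample of the same size as the original column.
    sample = sorted(rng.gauss(mu, sigma) for _ in values)

    # Step 3: substitute in rank order, so that the smallest generated value
    # replaces the smallest original value, and so on.
    order = sorted(range(len(values)), key=values.__getitem__)
    perturbed = [0.0] * len(values)
    for rank, index in enumerate(order):
        perturbed[index] = sample[rank]
    return perturbed


rng = random.Random(7)  # fixed seed so the sketch is repeatable
ages = [23.0, 45.0, 31.0, 67.0, 52.0, 38.0]  # made-up confidential attribute
perturbed = distribution_swap(ages, rng)
```

Because substitution is by rank, the record with the smallest original value receives the smallest generated value, so the ordering of records is kept while the individual values change.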
The probability distribution approach replaces the data with another sample from the same distribution or with the distribution itself, whereas the value distortion approach perturbs data elements or attributes directly by additive noise, multiplicative noise, or other randomization procedures. Three types of perturbation are commonly distinguished: rotation perturbation, projection perturbation, and geometric data perturbation.

DIFFERENT METHODS OF DATA PERTURBATION

3.1 Noise Additive Perturbation
The standard additive perturbation technique [13] is column-based additive randomization. This type of technique relies on two facts: (1) data owners may not need to protect all values in a record equally, so column-based value distortion can be applied to perturb only some sensitive columns; (2) the data classification models to be used do not necessarily require the individual records, but only the column value distributions, under the assumption that the columns are independent. The basic method is to disguise the original values by injecting a certain amount of additive random noise, while specific information, such as the column distribution, can still be effectively reconstructed from the perturbed data.

We treat the original values (x1, x2, ..., xn) of a column as randomly drawn from a random variable X that has some distribution. The randomization process changes the original data by adding random noise R to the original values, generating a perturbed data column Y = X + R. The resulting record (x1 + r1, x2 + r2, ..., xn + rn) and the distribution of R are published. The key to random noise addition is the distribution reconstruction algorithm, which recovers the column distribution of X from the perturbed data and the distribution of R.

3.2 Condensation-based Perturbation
The condensation approach is a standard multidimensional perturbation technique that aims to preserve the covariance matrix for multiple columns.
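For contrast with the multi-column approach introduced here, the column-wise additive scheme of Section 3.1, Y = X + R, can be sketched as follows. The sketch assumes zero-mean Gaussian noise that is independent of X, and it recovers only the mean and variance of X rather than running a full distribution reconstruction algorithm. The function `additive_perturb`, the column `incomes`, and all numeric parameters are illustrative assumptions.

```python
import random
import statistics


def additive_perturb(column, noise_sigma, rng):
    """Return Y = X + R, with R drawn from N(0, noise_sigma^2) per value."""
    return [x + rng.gauss(0.0, noise_sigma) for x in column]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Made-up sensitive column X (for example, incomes).
incomes = [rng.gauss(50_000.0, 8_000.0) for _ in range(20_000)]

# The data owner publishes the perturbed column and the noise distribution.
noise_sigma = 5_000.0
published = additive_perturb(incomes, noise_sigma, rng)

# A recipient who knows only `published` and `noise_sigma` can estimate
# aggregate properties of X, because for independent zero-mean noise
# E[Y] = E[X] and Var(Y) = Var(X) + Var(R).
estimated_mean = statistics.fmean(published)
estimated_variance = statistics.pvariance(published) - noise_sigma ** 2
```

Individual published values differ from the originals by several thousand units on average, yet the column-level statistics remain close to those of the original data, which is the property a classifier built on column distributions relies on.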
Thus, some geometric properties, such as the shape of the decision boundary, are well preserved. Unlike the randomization approach, multiple columns are perturbed as a whole to generate the entire perturbed dataset. Because the perturbed dataset preserves the covariance matrix, many existing data mining algorithms can be applied directly to it without algorithm modifications or new development.

The approach begins by partitioning the original data into k-record groups. Each group is formed in two steps: a record is randomly chosen from the current records as the center of the group, and then the (k − 1) nearest neighbors of the center are identified as the other (k − 1) members. The selected k records are removed from the original dataset before the next group is formed. Since each group has small locality, a set of k records can be regenerated that approximately preserves the distribution and covariance. The record regeneration algorithm aims to preserve the eigenvectors and eigenvalues of each group, as shown in Figure 1.

Fig. 1 Eigen values of each group

3.3 Random Projection Perturbation
Random projection perturbation (Liu, Kargupta and Ryan, 2006) refers to the technique of projecting a set of data points from the original multidimensional space to another, randomly chosen space. Let P be a random projection matrix whose rows are orthonormal [14]. The perturbation G(X), defined in terms of P, is then applied to the dataset X.

3.4 Geometric data perturbation
Definition: Geometric data perturbation consists of a sequence of random geometric transformations, including multiplicative transformation (R), translation transformation (Ψ), and distance perturbation (∆):

G(X) = RX + Ψ + ∆ [15]

The data is assumed to be a matrix A of size p × q, where each of the p rows is an observation Oi, and each observation contains values for each of the q attributes Ai. The matrix may include both numerical and categorical attributes. However, our geometric data transformation methods rely on d numerical attributes, such that d <= q. Thus the p × d matrix that is subject to transformation can be regarded as a vector subspace V in Euclidean space, such that each vector vi ∈ V has the form vi = (a1, ..., ad), 1 <= i <= d, where ai is one instance of Ai, ai ∈ ℝ, and ℝ is the set of real numbers. Before releasing the data for clustering analysis, the vector subspace V must be transformed to preserve the privacy of the individual data records. To transform V into a distorted vector subspace V', we need to add or even multiply a constant noise term e to each element vi of V.

Translation Transformation: A constant is added to all values of an attribute. The constant may be negative or positive. Although its degree of privacy protection is 0 according to the formula for calculating the degree of privacy protection, it prevents the raw data from being seen directly in the transformed data, so translation can also play a role in privacy protection. Translation is the task of moving a point with coordinates (X, Y) through displacements (X0, Y0) to a new location.
This is easily achieved using the matrix representation v' = Tv, where T is the 2 × 3 transformation matrix depicted in Figure 2(a), v is the column vector containing the original coordinates, and v' is the column vector containing the transformed coordinates. This matrix form also extends to scaling and rotation.

Rotation Transformation: For a pair of arbitrarily selected attributes, regard their values as points in two-dimensional space and rotate them by a given angle θ with the origin as the center. If θ is positive, the points are rotated anti-clockwise; otherwise they are rotated clockwise. Rotation is a more challenging transformation. In its simplest form, it rotates a point about the coordinate axes. Rotation of a point by an angle θ in a discrete 2D space is achieved using the transformation matrix shown in Figure 2(b). In that matrix the rotation angle θ is measured clockwise, and the transformation affects the values of both the X and Y coordinates.

Fig. 2 (a) Translation Matrix (b) Rotation Matrix

The two components above, translation and rotation, preserve distance relationships. By preserving distances, many important classification models become "perturbation-invariant", which is the core of geometric perturbation. In some situations, however, distance-preserving perturbation may be subject to distance-inference attacks. The goal of distance perturbation is to preserve distances approximately while effectively increasing resilience to distance-inference attacks. We define the third component as a random matrix in which each entry is an independent sample drawn from the same distribution with zero mean and small variance. Adding this component slightly perturbs the distance between any pair of points.
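The three components can be combined in a short two-dimensional sketch of G(X) = RX + Ψ + ∆. It is an illustration under stated assumptions rather than the authors' implementation: R is a plain 2D rotation with positive angles taken as anti-clockwise, Ψ is a single translation vector applied to every point, ∆ is independent zero-mean Gaussian noise, and the names `rotation_matrix_2d`, `geometric_perturb`, and `dist` are hypothetical.

```python
import math
import random


def rotation_matrix_2d(theta):
    """2 x 2 rotation matrix R; positive theta rotates anti-clockwise."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def geometric_perturb(points, theta, translation, noise_sigma, rng):
    """Apply G(x) = R x + psi + delta to every 2D point."""
    R = rotation_matrix_2d(theta)
    result = []
    for x, y in points:
        rx = R[0][0] * x + R[0][1] * y  # rotation component
        ry = R[1][0] * x + R[1][1] * y
        dx = rng.gauss(0.0, noise_sigma)  # distance perturbation component
        dy = rng.gauss(0.0, noise_sigma)
        result.append((rx + translation[0] + dx, ry + translation[1] + dy))
    return result


def dist(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]  # made-up attribute pairs

# Rotation and translation only: pairwise distances are preserved exactly
# (up to floating-point rounding).
exact = geometric_perturb(points, math.pi / 3, (10.0, -5.0), 0.0, rng)

# With a small noise term: distances are only approximately preserved.
noisy = geometric_perturb(points, math.pi / 3, (10.0, -5.0), 0.05, rng)

original_distance = dist(points[0], points[1])
exact_distance = dist(exact[0], exact[1])
noisy_distance = dist(noisy[0], noisy[1])
```

Comparing `original_distance` with `exact_distance` and `noisy_distance` shows the trade-off described above: without ∆ the distances are unchanged, which helps distance-based models but also an attacker, while a small ∆ blurs them slightly.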
CONCLUSIONS
This paper focuses mainly on the random geometric perturbation approach to privacy-preserving data classification. Random geometric perturbation, G(X) = RX + Ψ + ∆, is the linear combination of three components: rotation perturbation, translation perturbation, and distance perturbation. Geometric perturbation preserves the important geometric properties, so most data mining models that search for geometric class boundaries are well preserved on the perturbed data. Geometric perturbation perturbs multiple columns in one transformation, which introduces new challenges in evaluating the privacy guarantee for multidimensional perturbation.
REFERENCES
1. Chhinkaniwala H. and Garg S., "Privacy Preserving Data Mining Techniques: Challenges and Issues", CSIT, 2011.
2. L. Golab and M. T. Ozsu, "Data Stream Management Issues: A Survey", Technical Report, 2003.
3. Majid, M. Asger, Rashid Ali, "Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects", IEEE, 2012.
4. Aggarwal, C. C. and Yu, P. S., "A condensation approach to privacy preserving data mining", Proc. of Int. Conf. on Extending Database Technology (EDBT), 2004.
5. Chen K. and Liu, "Privacy Preserving Data Classification with Rotation Perturbation", Proc. ICDM, 2005, pp. 589-592.
6. K. Liu, H. Kargupta, and J. Ryan, "Random projection-based multiplicative data perturbation for privacy preserving distributed data mining", IEEE Transactions on Knowledge and Data Engineering, Jan 2006, pp. 92-106.
7. Keke Chen, Gordon Sun, and Ling Liu, "Towards attack-resilient geometric data perturbation", in Proceedings of the 2007 SIAM International Conference on Data Mining, April 2007.
8. M. Reza, Somayyeh Seifi, "Classification and Evaluation the PPDM Techniques by using a data Modification-based framework", IJCSE, Vol. 3, No. 2, Feb 2011.
9. Vassilios S. Verykios, E. Bertino, Igor N., "State-of-the-art in Privacy Preserving Data Mining", SIGMOD, 2004, pp. 121-154.
10. Ching-Ming, Po-Zung and Chu-Hao, "Privacy Preserving Clustering of Data Streams", Tamkang Journal of Sc. & Engg, Vol. 13, No. 3, pp. 349-358.
11. Jie Liu, Yifeng Xu, "Privacy Preserving Clustering by Random Response Method of Geometric Transformation", IEEE, 2010.
12. Keke Chen, Ling Liu, "Privacy Preserving Multiparty Collaborative Mining with Geometric Data Perturbation", IEEE, January 2009.