International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2835
Survey on Clustering based Categorical Data Protection
Amrutha HJ¹, Anu A Kittur², Chaitra MS³, Gowri M⁴, Sowmya SR⁵
¹,²,³,⁴ BE Student, Department of Information Science and Engineering
⁵ Professor, Dept. of ISE, Dayananda Sagar Academy of Technology & Management, Karnataka, India
----------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The number of publicly accessible datasets is rising every day. Improving data privacy therefore becomes mandatory. This has motivated sustained research into effective protection techniques that prevent the disclosure of entities in the datasets while conserving data utility. A comprehensive approach to categorical data protection is carried out by clustering the dataset and then protecting every data segment.
Key Words: Categorical Data, Clustering, Data mining,
Data privacy
1. INTRODUCTION
Providing the requisite privacy is the main goal when protecting data or information. All clients who enter data expect it to be protected. Data mining is a method that transforms raw data into finished data; it examines the vast quantity of data collected in order to obtain trends. Categorical data is statistical data consisting of categorical values.
There are three major kinds of attributes to consider in a dataset, namely identifiers, quasi-identifiers and confidential attributes. Quasi-identifiers are pieces of information that are not by themselves distinct identifiers but that can identify an individual, with some degree of uncertainty, when combined.
Confidential attributes include information such as employment, health issues or religion. Clustering can be defined as the process in which abstract objects are grouped into classes of similar objects within the set. Clustering is applied in areas such as market surveys, data analysis, pattern recognition and image processing.
Protection approaches are evaluated on the basis of two important measures: information loss and disclosure risk.
Information loss is calculated by comparing statistical parameters of the anonymized data table with those of the original data table. Protection approaches can be classified into two general categories: perturbative and non-perturbative.
A perturbative technique replaces an attribute's sensitive value with a new value.
A non-perturbative technique does not change the attribute's sensitive value; rather, it suppresses or deletes certain data.
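The distinction between the two categories, and the idea of measuring information loss by comparing a statistic of the masked table with the original, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the functions `perturb`, `suppress` and `information_loss`, the suppression marker `*`, the toy column, and the choice of total variation distance between category frequencies as the compared statistic. The surveyed methods may use other measures.

```python
import random
from collections import Counter

def perturb(values, domain, p, rng):
    """Perturbative masking (illustrative): with probability p, replace a
    value by a different category drawn at random from the domain."""
    out = []
    for v in values:
        if rng.random() < p:
            out.append(rng.choice([d for d in domain if d != v]))
        else:
            out.append(v)
    return out

def suppress(values, min_count, marker="*"):
    """Non-perturbative masking (illustrative): values are never altered,
    but categories seen fewer than min_count times are suppressed."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else marker for v in values]

def information_loss(original, masked):
    """Assumed measure: total variation distance between the category
    frequency distributions of the two columns (0 means identical)."""
    n = len(original)
    f_orig, f_mask = Counter(original), Counter(masked)
    cats = set(f_orig) | set(f_mask)
    return 0.5 * sum(abs(f_orig[c] - f_mask[c]) / n for c in cats)

# Toy confidential column with hypothetical category labels.
domain = ["A", "B", "C", "D"]
column = ["A", "A", "A", "B", "B", "C", "A", "B", "D", "A"]

perturbed = perturb(column, domain, p=0.3, rng=random.Random(0))
suppressed = suppress(column, min_count=2)

print("loss after perturbation:", information_loss(column, perturbed))
print("loss after suppression: ", information_loss(column, suppressed))
```

In this toy column the rare categories C and D are the ones that suppression hides, which is also where the frequency distribution changes most.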
2. METHODOLOGY
2.1 Subtractive Clustering:
The existing subtractive clustering approach can be used only for numerical data; it cannot be applied directly to categorical values, since categorical data has no natural ordering. In the conventional mountain-clustering process, a grid is laid over the data space and the grid point with the maximum mountain-function value is taken as a cluster centre. Because the number of grid points grows with the dimension of the data, this approach can make the computation increasingly complex, so the subtractive method, which evaluates the data points themselves as candidate centres, has been proposed. Although k-means clustering is more efficient, subtractive clustering is powerful.
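For numerical data, the subtractive procedure can be sketched as follows. This is a minimal illustration of the standard potential-based formulation, not the categorical variant surveyed here; the function name `subtractive_clustering`, the radius `ra`, the penalty radius of 1.5 times `ra`, the stopping ratio and the toy points are all assumptions chosen for the example.

```python
import math

def subtractive_clustering(points, ra=1.0, stop_ratio=0.15):
    """Return cluster centres chosen among the data points themselves."""
    rb = 1.5 * ra                      # assumed penalty radius, larger than ra
    alpha, beta = 4.0 / ra ** 2, 4.0 / rb ** 2

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Potential of each point: high when many other points lie nearby.
    potential = [sum(math.exp(-alpha * sqdist(p, q)) for q in points)
                 for p in points]
    first_peak = max(potential)
    centres = []
    while True:
        best = max(range(len(points)), key=potential.__getitem__)
        peak = potential[best]
        if peak < stop_ratio * first_peak:
            break                      # remaining peaks are too small
        centres.append(points[best])
        # Subtract the influence of the new centre from every potential.
        potential = [pot - peak * math.exp(-beta * sqdist(p, points[best]))
                     for p, pot in zip(points, potential)]
    return centres

# Two well-separated groups of toy points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centres = subtractive_clustering(points)
print(centres)
```

The squared Euclidean distance used in the potential is exactly what has no direct counterpart for categorical values, which is why the numerical method cannot be reused unchanged.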
2.2 Robust Hierarchical Clustering (RHC):
Hierarchical clustering is a popular unsupervised technique used for metabolomics data. Conventional hierarchical clustering is highly sensitive to outliers, and in the presence of outliers it can produce misleading clustering results. The Two-Stage Generalized S-estimator (TSGS) is used to robustify hierarchical clustering by supplying a robust covariance matrix.
There are three major steps in the robust hierarchical clustering methodology.
1. Estimation of the robust covariance matrix:
The biggest hurdle here is to estimate an appropriate correlation or dispersion matrix in the presence of both case-wise and cell-wise outliers.
2. Robust estimation of the correlation-based dissimilarity matrix using the TSGS covariance matrix.
3. Estimation of the proposed RHC with the TSGS dispersion matrix.
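Steps 2 and 3 can be sketched once a covariance matrix is available. The sketch below does not implement TSGS; it starts from a hypothetical covariance matrix that stands in for the robust estimate of step 1. The dissimilarity 1 - |r|, the average-linkage rule, and the function names are assumptions made for illustration and may differ from the choices in the surveyed RHC method.

```python
import math

def covariance_to_correlation(cov):
    """Scale a covariance matrix to a correlation matrix."""
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

def correlation_dissimilarity(corr):
    """Assumed dissimilarity: strongly correlated variables are close."""
    return [[1.0 - abs(r) for r in row] for row in corr]

def average_linkage(dis, n_clusters):
    """Agglomerative clustering on a dissimilarity matrix: repeatedly
    merge the two clusters with the smallest average dissimilarity."""
    clusters = [[i] for i in range(len(dis))]

    def link(a, b):
        return sum(dis[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda xy: link(clusters[xy[0]], clusters[xy[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical covariance matrix; in RHC it would come from TSGS.
cov = [[4.0, 3.6, 0.2],
       [3.6, 4.0, 0.1],
       [0.2, 0.1, 1.0]]
corr = covariance_to_correlation(cov)
dis = correlation_dissimilarity(corr)
clusters = average_linkage(dis, n_clusters=2)
print(clusters)
```

Because every later step depends only on the covariance matrix, replacing a classical estimate with a robust one changes the input of the pipeline and leaves the clustering code itself untouched.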
2.3 Decision Tree Categorical Value Clustering
Data perturbation methods add noise to the data to prevent the true confidential values from being revealed. Categorical values of attributes are clustered in the beginning, and these clusters are then used in the later stages to create noise.
The decision-tree-based categorical value clustering and perturbation technique perturbs one non-class categorical attribute of a dataset. Therefore, we apply it once for each non-class categorical attribute of the original dataset in order to perturb all of them. Each application generates a dataset with one perturbed attribute. Lastly, we construct a dataset (combining all perturbed datasets) in which each non-class categorical attribute is perturbed and all other attributes are left unchanged.
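The per-attribute procedure can be sketched as follows, assuming the clusters of categorical values are already given. How the decision tree produces those clusters is not shown; the function names, the swap probability `p`, and the toy records and value clusters are all hypothetical.

```python
import random

def perturb_attribute(records, attr, value_clusters, p, rng):
    """Return a copy of records in which only `attr` is perturbed: with
    probability p a value is swapped for another value of its cluster."""
    cluster_of = {v: c for c in value_clusters for v in c}
    out = []
    for rec in records:
        new = dict(rec)
        peers = [v for v in cluster_of[rec[attr]] if v != rec[attr]]
        if peers and rng.random() < p:
            new[attr] = rng.choice(peers)
        out.append(new)
    return out

def perturb_all(records, clusters_by_attr, p, rng):
    """Apply the single-attribute step once per non-class attribute and
    combine the perturbed columns into one dataset."""
    combined = [dict(rec) for rec in records]
    for attr, value_clusters in clusters_by_attr.items():
        single = perturb_attribute(records, attr, value_clusters, p, rng)
        for target, source in zip(combined, single):
            target[attr] = source[attr]
    return combined

# Toy dataset: "class" is the class attribute and is never perturbed.
records = [
    {"job": "nurse", "city": "Mysuru", "class": "yes"},
    {"job": "clerk", "city": "Delhi", "class": "no"},
    {"job": "doctor", "city": "Bengaluru", "class": "yes"},
]
# Hypothetical value clusters for each non-class attribute.
clusters_by_attr = {
    "job": [["nurse", "doctor"], ["clerk", "teller"]],
    "city": [["Mysuru", "Bengaluru"], ["Delhi"]],
}
masked = perturb_all(records, clusters_by_attr, p=1.0, rng=random.Random(0))
print(masked)
```

Restricting each swap to values of the same cluster is what keeps the added noise plausible: a value is only ever replaced by one that the clustering treats as similar.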
2.4 Outlier Diagnosis:
An outlier is a data item that does not adhere to the pattern in the dataset or to any other expected behaviour. Outliers may be identified using anomaly detection methods. These items can also be called anomalies, novelties, noise, or deviations.
Anomaly detection methods come in three different types:
1. Supervised anomaly detection
2. Semi-supervised anomaly detection
3. Unsupervised anomaly detection
Unsupervised anomaly detection identifies anomalies in an unlabeled test data set, under the assumption that the majority of instances in the data set are normal, by searching for the instances that appear to fit the rest of the data set least.
2.4.1 Outlier Detection Techniques:
A. Statistical outlier detection:
It assumes that all data points are generated by a statistical distribution and estimates the parameters of that distribution; points that are unlikely under the fitted distribution are flagged as outliers.
B. Depth based outlier detection:
Depth-based methods search for outliers at the border of the data space. They are independent of the statistical distribution of the data.
C. Distance based outlier detection:
This judges a point based on the distance to its neighbours.
D. Density based outlier detection:
It uses the density of data elements within the data set: a point whose local density is much lower than that of its neighbours is considered an outlier.
E. Deviation based outlier detection:
The data elements may be scattered sparsely in the data set, which complicates the analysis of results. Points that deviate from the general characteristics of the data are considered anomalies.
Table 1: Comparison table for outlier algorithms
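As an illustration of the distance-based technique (C) on categorical records, the following sketch scores each record by the distance to its k-th nearest neighbour. The Hamming distance, the value of k, the threshold of 3, the function names and the toy records are assumptions chosen for the example and are not taken from the surveyed work.

```python
def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_outlier_scores(records, k):
    """Score of a record = distance to its k-th nearest neighbour.
    A large score means the record has few close neighbours."""
    scores = []
    for i, rec in enumerate(records):
        dists = sorted(hamming(rec, other)
                       for j, other in enumerate(records) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy records: (job, marital status, area); the last one is unusual.
records = [
    ("clerk", "single", "urban"),
    ("clerk", "single", "urban"),
    ("clerk", "married", "urban"),
    ("clerk", "single", "rural"),
    ("pilot", "widowed", "remote"),
]
scores = knn_outlier_scores(records, k=2)
outliers = [r for r, s in zip(records, scores) if s >= 3]
print(scores)    # [1, 1, 1, 1, 3]
print(outliers)  # [('pilot', 'widowed', 'remote')]
```

The method is unsupervised in the sense described above: no labels are used, and a record is flagged only because it fits the rest of the data set least.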
2.5 Evolutionary Optimization Approach
An evolutionary approach to data protection is based on an evolutionary algorithm driven by a combination of information loss and disclosure risk measures. Such algorithms are used to find exact or approximate solutions to optimization and search problems. The algorithm uses two simple genetic operators: mutation and crossover. It uses state-of-the-art techniques for categorical stability.
Mutation: Parts of an individual are randomly changed to obtain a new offspring.
Crossover: Two parent chromosomes are recombined to produce two new offspring.
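The two operators can be sketched on a chromosome that encodes a protected categorical column. The encoding, the one-point form of crossover, the mutation rate, and the `fitness` function that averages information loss and disclosure risk are all assumptions made for illustration; the surveyed algorithm may define them differently.

```python
import random

def mutate(chromosome, domain, rate, rng):
    """Each gene is replaced by a random category with probability rate."""
    return [rng.choice(domain) if rng.random() < rate else g
            for g in chromosome]

def crossover(parent_a, parent_b, rng):
    """One-point crossover: swap the tails to produce two offspring."""
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def fitness(info_loss, disclosure_risk):
    """Hypothetical score combining both measures; lower is better."""
    return (info_loss + disclosure_risk) / 2

rng = random.Random(42)
parent_a = list("AAAAAA")
parent_b = list("BBBBBB")
child_1, child_2 = crossover(parent_a, parent_b, rng)
mutant = mutate(child_1, domain=["A", "B", "C"], rate=0.2, rng=rng)
print(child_1, child_2, mutant)
```

In a full algorithm these operators would be applied repeatedly to a population of protected datasets, keeping the offspring with the best combined score.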
2.6 L-Diversity
Anonymity models based on generalization can shield the confidentiality of individuals but often lead to information loss. (k, l, al)-diversity reduces information loss and ensures data quality. This method protects against attribute disclosure even when the adversary's background knowledge is not known. In this case sensitive attributes are well represented. The technique is a modification of k-anonymity. Given a set of n records, (k, l, range)-diversity is defined such that each cluster contains at least k (k ≤ n) data elements and at least l distinct sensitive values, and the sum of all intra-cluster distances is minimized.
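The two counting conditions of this definition can be checked directly. The sketch below covers only the "at least k records" and "at least l distinct sensitive values" requirements; it does not minimize intra-cluster distance or handle the range condition. The function name and the toy clusters are hypothetical.

```python
def satisfies_k_l(clusters, sensitive, k, l):
    """True if every cluster has at least k records and at least l
    distinct values of the sensitive attribute."""
    return all(len(c) >= k and len({r[sensitive] for r in c}) >= l
               for c in clusters)

# Toy clusters of records sharing the same quasi-identifier value.
clusters = [
    [{"age": "30-39", "disease": "flu"},
     {"age": "30-39", "disease": "asthma"}],
    [{"age": "40-49", "disease": "flu"},
     {"age": "40-49", "disease": "flu"}],
]

print(satisfies_k_l(clusters, "disease", k=2, l=1))  # True
print(satisfies_k_l(clusters, "disease", k=2, l=2))  # False
```

The second cluster shows why k-anonymity alone is not enough: it contains two records, yet both share the same disease, so the sensitive value is disclosed for everyone in that cluster.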
3. RESULT COMPARISON
3.1 Clustering Algorithms
Table 2: Comparison table for clustering algorithms
Algorithm | Benefit | Drawback
Subtractive clustering | An efficient method is available in this case. Investigations carried out on several UCI datasets show experimentally that the given approach can attain better clustering accuracy than the k-modes algorithm. | Unsupervised clustering is not clear.
Robust hierarchical clustering | Simulation studies clearly show that the proposed approach improves performance considerably over conventional hierarchical clustering. | 1. A preceding step cannot be undone. 2. Time complexity: not suitable for large datasets.
3.2 Outlier Algorithms
Table 3: Comparison table for outlier algorithms
3.3 Protection Algorithms
Table 4: Comparison table for protection algorithms
Algorithm | Advantage | Disadvantage
L-Diversity | 1. Makes the distribution within the category of critical attributes more robust, thereby increasing data protection. 2. Protects against attribute disclosure. | 1. This can be redundant and laborious to achieve. 2. Prone to attacks such as the skewness attack.
Evolutionary Optimization Approach | Performs better on high-dimensional problems. | Although robust to noisy evaluation functions, it may not yield any sensible outcome within a given stipulated amount of time.
4. CONCLUSIONS
In this paper, a new approach is used to deal with categorical data confidentiality using the SCCA clustering algorithm, which can achieve more satisfactory clustering accuracy than the traditional k-modes algorithm on each dataset. The efficiency of the TSGS algorithm is greater than that of other robust estimation techniques.
L-diversity strengthens the privacy of the respondent, but it is not sufficient to protect critical attributes. Hence, the evolutionary optimization strategy is a better method of protection.
REFERENCES
[1]H. Zhao and Z. Qi, "Hierarchical Agglomerative Clustering
with Ordering Constraints," 2010 Third International
Conference on Knowledge Discovery and Data Mining,
Phuket, 2010, pp. 195-199.
doi:10.1109/WKDD.2010.123
[2] Lei Gu, "A novel locality sensitive k-means clustering
algorithm based on subtractive clustering," 2016 7th IEEE
International Conference on Software Engineering and
Service Science (ICSESS), Beijing, 2016, pp. 836-839.
doi:10.1109/ICSESS.2016.7883196
[3] Jiang Chundong, Jia Haipeng, Du Taihang, Zhang Lei and
Chunbo Jiang, "Evolutionary algorithm and its application in
structural topology optimization," 2008 27th Chinese
Control Conference,Kunming,2008, pp.10-14.
doi:10.1109/CHICC.2008.4605057
[4] Marés J., Torra V. (2012) Clustering-Based Categorical
Data Protection. In: Domingo-Ferrer J., Tinnirello I . (eds)
Privacy in Statistical Databases PSD 2012.Lecture Notes in
Computer Science,vol 7556.Springer,Berlin,Heidelberg
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2838
[5] Wanliang Fu, "Multi-media data mining technology for
the systematic framework," 2012 IEEE International
Conference on Computer Science and Automation
Engineering, Beijing, 2012, pp. 570-572.
doi:10.1109/ICSESS.2012.6269531
[6] H. C. Mandhare and S. R. Idate, "A comparative study of
cluster based outlier detection, distance based outlier
detection and density based outlier detection techniques,"
2017 International ConferenceonIntelligentComputingand
Control Systems (ICICCS),Madurai,2017,pp.931-935.
[7] B. M. Varghese and U. A., "Recursive Decision Tree
Induction Based on Homogeneousness for Data Clustering,"
2008 International Conference on Cyberworlds,
Hangzhou,2008,pp.754-758.
doi:10.1109/CW.2008.56
[8] Han Jianmin, Cen Tingting and Yu Juan, "An l-MDAV
microaggregation algorithm for sensitive attribute l-
diversity," 2008 27th Chinese Control Conference,Kunming,
2008, pp. 713-718.
doi:10.1109/CHICC.2008.4605421
[9] S. Banerjee, A. Choudhary and S. Pal, "Empirical
evaluation of K-Means, Bisecting K-Means, Fuzzy C-Means
and Genetic K-Means clustering algorithms," 2015 IEEE
International WIE Conference on Electrical and Computer
Engineering (WIECON-ECE), Dhaka, 2015, pp. 168-172.
doi:10.1109/WIECON-ECE.2015.7443889
[10] Fayyoumi and O.Nofal,"ApplyingGenetic Algorithmson
Multi-level Micro-Aggregation Techniques for Secure
Statistical Databases," 2018 IEEE/ACS 15th International
Conference on Computer Systems and Applications
(AICCSA),Aqaba,2018,pp.1-6.
doi: 10.1109/AICCSA.2018.8612813

More Related Content

PDF
IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mi...
PDF
IRJET- Classification of Crops and Analyzing the Acreages of the Field
PDF
IRJET- Missing Data Imputation by Evidence Chain
PDF
Feature Subset Selection for High Dimensional Data using Clustering Techniques
PDF
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
PDF
G046024851
PDF
Data mining techniques
PDF
IRJET - Finger Vein Extraction and Authentication System for ATM
IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mi...
IRJET- Classification of Crops and Analyzing the Acreages of the Field
IRJET- Missing Data Imputation by Evidence Chain
Feature Subset Selection for High Dimensional Data using Clustering Techniques
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
G046024851
Data mining techniques
IRJET - Finger Vein Extraction and Authentication System for ATM

What's hot (20)

PDF
Survey on semi supervised classification methods and
PDF
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
PDF
Improved correlation analysis and visualization of industrial alarm data
PDF
IRJET- Plant Disease Detection and Classification using Image Processing a...
PDF
Comparative study of various supervisedclassification methodsforanalysing def...
PDF
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
PDF
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
PDF
Survey on semi supervised classification methods and feature selection
PDF
IRJET- Agricultural Crop Classification Models in Data Mining Techniques
PDF
Data mining techniques a survey paper
PDF
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
PDF
IRJET - A Survey on Machine Learning Intelligence Techniques for Medical ...
PDF
Fault detection of imbalanced data using incremental clustering
PDF
Iaetsd a survey on one class clustering
PDF
Decision Tree Based Algorithm for Intrusion Detection
PDF
Correlation of artificial neural network classification and nfrs attribute fi...
PDF
Multi sensor-fusion
PDF
IRJET- Detection and Classification of Leaf Diseases
PDF
Data Analysis and Prediction System for Meteorological Data
PDF
Comparison of Data Mining Techniques used in Anomaly Based IDS
Survey on semi supervised classification methods and
Hybrid Model using Unsupervised Filtering Based on Ant Colony Optimization an...
Improved correlation analysis and visualization of industrial alarm data
IRJET- Plant Disease Detection and Classification using Image Processing a...
Comparative study of various supervisedclassification methodsforanalysing def...
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
Survey on semi supervised classification methods and feature selection
IRJET- Agricultural Crop Classification Models in Data Mining Techniques
Data mining techniques a survey paper
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
IRJET - A Survey on Machine Learning Intelligence Techniques for Medical ...
Fault detection of imbalanced data using incremental clustering
Iaetsd a survey on one class clustering
Decision Tree Based Algorithm for Intrusion Detection
Correlation of artificial neural network classification and nfrs attribute fi...
Multi sensor-fusion
IRJET- Detection and Classification of Leaf Diseases
Data Analysis and Prediction System for Meteorological Data
Comparison of Data Mining Techniques used in Anomaly Based IDS
Ad

Similar to IRJET - Survey on Clustering based Categorical Data Protection (20)

PDF
An Analysis of Outlier Detection through clustering method
PDF
A Survey on Features and Techniques Description for Privacy of Sensitive Info...
PDF
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
PDF
G44093135
PPT
DM_clustering.ppt
PDF
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
PDF
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
PDF
A Novel Filtering based Scheme for Privacy Preserving Data Mining
PPT
DM UNIT_4 PPT for btech final year students
PDF
Cancer data partitioning with data structure and difficulty independent clust...
PPT
Cs501 cluster analysis
PPT
clustering.ppt
PDF
Privacy preservation techniques in data mining
PDF
Privacy preservation techniques in data mining
DOCX
AnomalyOutlier DetectionWhat are anomaliesoutliersThe set.docx
PDF
Outlier Detection Approaches in Data Mining
PDF
84cc04ff77007e457df6aa2b814d2346bf1b
PDF
winbis1005
PPT
Chapter 07
PDF
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
An Analysis of Outlier Detection through clustering method
A Survey on Features and Techniques Description for Privacy of Sensitive Info...
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
G44093135
DM_clustering.ppt
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET
A Novel Filtering based Scheme for Privacy Preserving Data Mining
DM UNIT_4 PPT for btech final year students
Cancer data partitioning with data structure and difficulty independent clust...
Cs501 cluster analysis
clustering.ppt
Privacy preservation techniques in data mining
Privacy preservation techniques in data mining
AnomalyOutlier DetectionWhat are anomaliesoutliersThe set.docx
Outlier Detection Approaches in Data Mining
84cc04ff77007e457df6aa2b814d2346bf1b
winbis1005
Chapter 07
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
DOCX
573137875-Attendance-Management-System-original
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
PPTX
Internet of Things (IOT) - A guide to understanding
PPT
Mechanical Engineering MATERIALS Selection
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
Geodesy 1.pptx...............................................
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
Digital Logic Computer Design lecture notes
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPT
Project quality management in manufacturing
PDF
PPT on Performance Review to get promotions
CYBER-CRIMES AND SECURITY A guide to understanding
573137875-Attendance-Management-System-original
Arduino robotics embedded978-1-4302-3184-4.pdf
bas. eng. economics group 4 presentation 1.pptx
Strings in CPP - Strings in C++ are sequences of characters used to store and...
Internet of Things (IOT) - A guide to understanding
Mechanical Engineering MATERIALS Selection
Operating System & Kernel Study Guide-1 - converted.pdf
Lesson 3_Tessellation.pptx finite Mathematics
Foundation to blockchain - A guide to Blockchain Tech
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Geodesy 1.pptx...............................................
Model Code of Practice - Construction Work - 21102022 .pdf
Digital Logic Computer Design lecture notes
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Project quality management in manufacturing
PPT on Performance Review to get promotions

IRJET - Survey on Clustering based Categorical Data Protection

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2835 Amrutha HJ1, Anu A Kittur2, Chaitra MS3, Gowri M4, Sowmya SR5 1,2,3,4BE Student, Department of Information Science and Engineering 5Professor, Dept. of ISE, Dayananda Sagar Academy of Technology & Management, Karnataka, India ----------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - The amount of publicly accessible datasets isrising every day in the present age. Improving data privacy erefore becomes mandatory. This has become a major reason why prolonged research has been undertaken to deliver effective fortification techniques that obstruct the revelationofentities in the datasets by conserving the data utility. Acomprehensive attachement for categorical data protection is carried out by applying clusters to the dataset and then safeguarding every data segment. Key Words: Categorical Data, Clustering, Data mining, Data privacy 1. INTRODUCTION Providing the requisite privacy is the mainagenda toprotect the data or information. All the clients who entered the data would expect their data to be protected. Data mining is a method in which it transforms the base data tofinisheddata. It is approach that calls for and examines thevastquantityof dat collected to obtain trends. Categorical data can also be known as statistical data consisting of categorical values. There are three major attributes to reflect when consideringa dataset, namely confidential, identifiers and Quasiidentifiers. Quasiidentifiers are pieces of information with some degree of uncertainty that are not by themselves distinct identifiers. In the case of confidential attributes, it includes information of employment, health issues or religion. 
Clustering can be defined as the process in which the abstract objects become an interconnected class of objects withintheset.Thestudy of clustering takes into account in applications such as market survey, data-analysis, pattern recognition and image processing. Protection approaches are tested on the basis of two important measures they threaten the loss and disclosure of information. The information loss is calculated by comparing the statistical parameter between the anonymousone and the original data table. Security approachescanbeclassifiedinto two general categories: disruptive and non-perturbatory. Perturbative is a technique for changing the attribute’s sensitive value via a new value. NonPerturbative technique does not change the attribute's sensitive value, rather it attribute’s sensitive value, rather it suppresses or deletes certain datasets. 2. METHODOLOGY 2.1 Subtractive Clustering: The currently in effect subtractive clustering approach can be used only for numerical data that cannot be used for data with categorical values. Many cluster grids have a maximum value in the conventional mountain-clustering process. But this mountain clusteringapproachcansometimestrigger the computation's increasing complexity, so one subtractive method to clustering has been proposed. This approach can be used only in numerical data since there is no natural ordering of the categorical data. Though clustering using kmeans gives better efficiency, subtractive clustering is powerful. 2.2 Robust Hierarchical Clustering (RHC): Hierarchical clustering is the popular unsupervised technique used for the Metabolomics data. In the case of conventional hierarchical clustering system, it is highly reactive to outliers and if there is the existenceofmisleading clustering tests, those outliers exist. Two Stage Generalized S-estimator (TSGS) is used to robustify hierarchical clustering which allows use of the covariance matrix. 
There are 3 major steps in robust hierarchical data segmentation methodology. 1. Estimation of Robust covariance matrix: The biggest hurdle here is to estimate an appropriatematrix of correlation or dispersion at a time in the presence of cell- wise anomalies or outliers in case-wise and cell-wise. 2. Robust evaluation of correlation matrix based on dissimilarity using the TSGS covariance matrix. 3. Estimate of RHC proposed with TSGS dispersion matrix. 2.3 Decision Tree Categorical Value Clustering Data breakdown methods add noise to the data to avoid correct confidential values beingrevealed.Categorical values Survey on Clustering based Categorical Data Protection
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2836 of attributes are clustered in the beginning, and these clusters are then used in the later stages to create noise. Categorical value clustering and disruption technique of the decision-tree disturbs a non-class categorical feature of a dataset. Therefore, we apply it once for each non-class attribute specified on the original dataset to agitate all non- class categorical attributes.Everytimea datasetisgenerated with one disturbed attribute within it. Lastly, we constructa dataset (combining all disturbed data sets) where each non- class categorical attribute is disturbed and all other attributes s are not disturbed. 2.4 Outlier Diagnosis: Outlier is one that does not adhere to the pattern in the dataset or any other feature expected. This may be diagonalised using anomaly detection methods. These phenomena can also be called outliers, novelties, noise, or variations. They come in three different types: 1. Supervised anomaly detection 2. Semi supervised anomaly detection 3. Unsupervised anomaly detection Unmonitored detections of anomalies identify anomalies in an unlabeled test data data set under which the data collection standard of events is considered normal by searching for instances that appear to conform to the rest of the data set atleast . 2.4.1 Outlier Detection Techniques: A. Statistical outlier detection: It calculates the arguments in the case of statistical distribution by imagining all the data points produced by statistical dispersion B. Depth based outlier detection: Depth based search originality at data space cap for outlier detection. They're autonomous regarding statistical data distribution. C. Distance based outlier detection: This judges a point based on separation of neighborhoods. D. 
Density based outlier detection: It practices the distribution of data element density into the set of data. E. Deviation based outlier detection: The data components are scattered as a sparse matrix in the data set which creates confusion over the analysis ofresults. When departing from standard points some points are considered anomalies. Table 1: Comparison table for outlier algorithms 2.5 Evolutionary Optimization Approach A progressive accession to protection of data is based on an evolutionary algorithm, driven by the amalgamation of loss in information and threat disclosure procedures. This algorithm is dedicated to discover precise or approximate results to simplify or explore problems. The algorithm uses two simple genetic operators: mutation and crossover. It uses state-of-the-art techniques for categorical stability. Mutation: The pieces are randomly arrangedtoobtaina new offspring in case of mutation. Crossover: Consists of 2 chromosomal recombined values which also produce two new off springs. 2.6 L-Diversity The anonymity models through generalizationcanshieldthe confidentiality of individuals but often lead to information loss. (K, l, al)-variety diminishes knowledgelossandensures data quality. This method ensures data privacyevenwithout the knowledge of the opponent’s background to avoid disclosure of attributes. In this case sensitive attributes are well represented. That technique is a k-anonymity modification. A definition from a set of n records (k, l, range) diversity is used in such a way that the data segment cluster includes at least k (k = n) data elements as well as at least 1 dissimilar sensitive characteristics and the sum of all intra cluster distance is reduced.
3. RESULT COMPARISON

3.1 Clustering Algorithms
Table 2: Comparison table for clustering algorithms
Algorithm | Benefit | Drawback
Subtractive clustering | An efficient method. Experiments on several UCI datasets indicate that the approach can attain better clustering accuracy than the k-modes algorithm. | Unsupervised clustering is not clear.
Robust hierarchical clustering | Simulation studies show that the proposed approach improves performance considerably over conventional hierarchical clustering. | 1. A preceding step cannot be undone. 2. Time complexity: not suitable for large datasets.

3.2 Outlier Algorithms
Table 3: Comparison table for outlier algorithms

3.3 Protection Algorithms
Table 4: Comparison table for protection algorithms
Algorithm | Advantage | Disadvantage
L-Diversity | 1. Makes the distribution within the category of critical attributes more robust, thereby increasing data protection. 2. Protects against attribute disclosure. | 1. Can be redundant and laborious to achieve. 2. Prone to attacks such as the skewness attack.
Evolutionary Optimization Approach | Performs better on higher-dimensional problems. Robust with respect to noisy evaluation functions. | May not yield a sensible outcome within a stipulated amount of time.

4. CONCLUSIONS
In this paper, categorical data confidentiality is addressed with a clustering technique based on the SCCA algorithm, which can achieve more satisfactory clustering accuracy than the traditional k-modes algorithm on each dataset. The efficiency of TSGS algorithms is greater than that of robust estimation techniques. L-diversity strengthens the privacy of the respondent, but it is not sufficient to protect critical attributes. Hence, the evolutionary optimization approach is a better method of protection.