International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 4320
Privacy preservation using Apache Spark
Sumedha Shenoy K1, Thamatam Bhavana2, S. Lokesh3
1Student, CSE/The National Institute of Engineering, Mysuru, Karnataka, India
2Student, CSE/The National Institute of Engineering, Mysuru, Karnataka, India
3Associate Professor, Dept. of Computer Science Engineering, The National Institute of Engineering, Mysuru, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - With huge amounts of data now available, preserving the
privacy of that data is difficult. In medical data, the privacy of
patients is of utmost importance. A patient dataset includes
sensitive properties such as name, age, and disease. To prevent
revealing a person's identity, big data anonymization techniques are
used. Previous implementations of these techniques used Apache
Hadoop. In this study, the Spark framework is chosen for its high
processing speed through in-memory computation. It caches data in
memory for further iterations, which enhances overall performance.
Faster data anonymization techniques using Spark are proposed to
address privacy problems in medical datasets.
Key Words: Anonymization, big data, Spark, k-anonymity, l-diversity,
t-closeness, privacy preservation.
1. INTRODUCTION
Privacy and confidentiality are important aspects of social life,
and personal data is always at risk of misuse. In real-life
situations, a lot of personal data is shared by entrusting the
people around us to keep it safe and away from misuse. In education,
this includes data on students and their academic records; in the
economic domain, bank details, salary information, and share and
stock information; in the medical field, patients' personal data
such as address and cell-phone number. These are some of the
sensitive attributes. Such data should be withheld from the public
domain; otherwise there can be severe consequences of privacy breach
and data abuse.
Data that can be sensitive to a person but does not directly
identify that person are called quasi-identifiers (Q.I). When
analyzed in a particular manner, these quasi-identifiers can point
to the person. For example, a person of age 30 suffering from cancer
is living in a city (say A). Several people may match this
description, but the person's identity can be found if some other
quasi-identifiers are put together to zero in on a single match. The
Q.I values therefore also play a role in protecting or disclosing a
person's privacy.
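The narrowing described in the example above can be illustrated with a small sketch. The records, the extra attributes (gender, pincode), and the helper name `matching` are all hypothetical and chosen only for illustration; they are not taken from the dataset used in this work.

```python
# Hypothetical records: no names, only quasi-identifiers plus a disease.
records = [
    {"age": 30, "city": "A", "gender": "F", "pincode": "570001", "disease": "cancer"},
    {"age": 30, "city": "A", "gender": "M", "pincode": "570001", "disease": "cancer"},
    {"age": 30, "city": "A", "gender": "M", "pincode": "570008", "disease": "cancer"},
    {"age": 45, "city": "A", "gender": "F", "pincode": "570001", "disease": "flu"},
]


def matching(rows, **known):
    """Return the rows whose attributes equal every value the adversary knows."""
    return [r for r in rows if all(r[attr] == value for attr, value in known.items())]


# Knowing only age, city and disease leaves several candidates.
broad = matching(records, age=30, city="A", disease="cancer")

# Adding two more quasi-identifiers zeroes in on a single record.
narrow = matching(records, age=30, city="A", disease="cancer",
                  gender="M", pincode="570008")

print(len(broad), len(narrow))
```

Each additional quasi-identifier the adversary knows can only shrink the candidate set, which is why these attributes need protection even though none of them is an identifier on its own.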
Anonymization handles these sensitive attributes by limiting the
data that is made available, so that privacy is preserved. The
approach is to make records difficult to tell apart, so that picking
out one individual's data is not possible. Big data is a collection
of growing datasets that inevitably includes many sensitive
attributes. When processing these large datasets, anonymization
algorithms can be applied to preserve privacy adequately.
2. EXISTING APPROACH
Data anonymization of medical data using Hadoop was proposed in [1].
Health-care data includes many tuples containing attributes
sensitive enough to compromise privacy. Using Hadoop for
computation, anonymization algorithms such as k-anonymity and
l-diversity were implemented to obtain partitioned datasets. A
scalable two-phase top-down specialization approach using MapReduce
was considered in [1]. The first phase partitions the dataset into
smaller subsets to obtain intermediate anonymized results, and the
second phase merges the subsets for further anonymization. More
demonstrations of anonymization algorithms can be found in [2].
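The partition-then-merge structure of the two phases can be sketched in plain Python. This is only an illustration of the data flow under simplifying assumptions, not the top-down specialization algorithm of [1]: the function names, the round-robin partitioning, and the sample rows are hypothetical.

```python
from collections import Counter
from functools import reduce


def partition(rows, n_parts):
    """Phase 1a: split the rows round-robin into n_parts smaller subsets."""
    return [rows[i::n_parts] for i in range(n_parts)]


def local_counts(subset, quasi_identifiers):
    """Phase 1b (map side): count rows per quasi-identifier combination in one subset."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in subset)


def merge_counts(partials):
    """Phase 2 (reduce side): merge the per-subset counts into global counts."""
    return reduce(lambda a, b: a + b, partials, Counter())


rows = [
    {"age": "30-39", "city": "A"},
    {"age": "30-39", "city": "A"},
    {"age": "30-39", "city": "A"},
    {"age": "40-49", "city": "A"},
]
partials = [local_counts(p, ["age", "city"]) for p in partition(rows, 2)]
totals = merge_counts(partials)

# Groups smaller than k (here k = 2) would need further anonymization after the merge.
needs_more_generalization = [group for group, n in totals.items() if n < 2]
print(totals, needs_more_generalization)
```

The merge step matters because a group that looks too small inside one subset may be large enough once the subsets are combined, and the reverse check can only be made on the merged counts.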
2.1 Drawbacks of the existing system
1) Hadoop is not well suited to small data. HDFS has a high-capacity
design, which restricts it from random reading of small volumes of
data [3].
2) MapReduce works in two processing phases, Map and Reduce.
MapReduce takes a lot of time to perform these tasks, which
significantly increases latency and thereby reduces processing
speed [3].
3) Hadoop supports only batch processing; it is not suitable for
streaming data, and real-time processing is not available in
Hadoop [3].
3. PROPOSED APPROACH
The Apache Spark system is proposed in order to curb some of the
drawbacks of the Hadoop implementations. Similar algorithms are run
on a Spark cluster, specifically with PySpark, to achieve similar
results with faster processing. PySpark is the Python API for Apache
Spark; Spark also provides APIs for other languages such as Scala
and Java [4]. Python is used in this system because it is easy to
implement in. The ARX anonymization tool is used for analysis.
3.1 Advantages of the proposed system
1) Apache Spark processes data in memory. This avoids moving the
data to and from disk between processing steps, which can make
Apache Spark up to 100 times faster than MapReduce.
2) Spark is suitable for stream processing. Streaming provides
continuous input and output of data, and the data is processed in
less time.
3) In Spark, data is cached in memory for further iterations, which
increases performance.
4. IMPLEMENTATION
4.1 K-Anonymity
k-Anonymity is a property of a dataset that describes its level of
anonymity. A dataset is k-anonymous if every combination of
identity-revealing characteristics occurs in at least k different
rows of the dataset. Achieving it involves increasing the similarity
between rows so that each row has at least k matches in the dataset.
The probability that a record can be linked to a given individual is
then at most 1/k. Given a dataset and a parameter k, the generalized
form of the table should keep this probability <= 1/k while
minimizing information loss. The information loss depends on the
number of tuples that are given the same value for an attribute.
k-Anonymity is therefore an optimization problem: maximize the
utility of the data while minimizing the information loss. It is an
NP-hard problem and becomes polynomially solvable if the number of
quasi-identifiers is 1.
There are two approaches to generalizing the dataset. The first is
homogeneous generalization, in which clusters are created and the
tuples in each cluster are assigned the same generalized value to
show that they belong to the same group. Original values and
anonymized values can be represented as a bipartite graph, and the
order of the tuples is changed so that a tuple cannot be recognized.
Each edge in the graph denotes a possible identity. The other
approach is heterogeneous generalization. In this approach, not all
values in a column are modified to satisfy anonymity. The dataset is
anonymized with a lower value of k. This method results in less
inaccuracy and hence less information loss. In the bipartite graph,
the numbers of incoming and outgoing edges of each node should be
equal to each other and at least k.
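The definition and the homogeneous approach above can be made concrete with a minimal sketch. It assumes a hypothetical generalization rule (ages replaced by ten-year ranges) and hypothetical sample rows; it is not the implementation used in this work.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39' (hypothetical rule)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return all(n >= k for n in sizes.values())


raw = [
    {"age": 31, "city": "A", "disease": "cancer"},
    {"age": 34, "city": "A", "disease": "flu"},
    {"age": 38, "city": "A", "disease": "asthma"},
    {"age": 42, "city": "A", "disease": "flu"},
    {"age": 45, "city": "A", "disease": "cancer"},
    {"age": 47, "city": "A", "disease": "diabetes"},
]
qi = ["age", "city"]

# Homogeneous generalization: every row in a cluster gets the same age range.
generalized = [{**row, "age": generalize_age(row["age"])} for row in raw]

sizes = equivalence_class_sizes(generalized, qi)
# Worst-case chance of linking a record to one person: 1 / smallest class size.
worst_case_probability = 1 / min(sizes.values())

print(is_k_anonymous(raw, qi, 3), is_k_anonymous(generalized, qi, 3))
```

In this sketch the raw table fails the check for k = 3 because every age is unique, while the generalized table passes with two classes of three rows each. On a Spark cluster the same grouping could be expressed with PySpark's `DataFrame.groupBy(...).count()`, with the smallest count compared against k.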
Fig -1: Bipartite view for k=3
The generalization graph must have k disjoint assignments, and every
edge of the bipartite graph should be in exactly one of those
assignments, so as to make the probability 1/k. For k disjoint
assignments to exist, the indegree should be equal to the outdegree
for each node in the graph; that is, the bipartite graph should be
k-regular.
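The relationship between k disjoint assignments and a k-regular bipartite graph can be checked on a small example. The ring construction below, in which assignment j links original record i to anonymized record (i + j) mod n, is a hypothetical construction chosen only because it is easy to verify; it is not the method used to build the graph in Fig -1.

```python
from collections import Counter


def ring_assignments(n, k):
    """Build k assignments; assignment j maps original i to anonymized (i + j) % n."""
    return [[(i, (i + j) % n) for i in range(n)] for j in range(k)]


def is_k_regular(edges, n, k):
    """True if every original node has outdegree k and every anonymized node indegree k."""
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    return all(out_deg[i] == k and in_deg[i] == k for i in range(n))


n, k = 6, 3
assignments = ring_assignments(n, k)

# Union of all assignments gives the edge set of the bipartite graph.
edges = [edge for assignment in assignments for edge in assignment]

# Disjoint assignments share no edge, so the union contains no duplicates.
edges_are_disjoint = len(edges) == len(set(edges))

print(edges_are_disjoint, is_k_regular(edges, n, k))
```

Each assignment is a one-to-one mapping between original and anonymized records, so an adversary who sees the graph cannot tell which of the k assignments is the true one.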
So the idea of k-anonymity is not just to prevent certainty about
the data but to create ambiguity in the actual data, so that an
adversary cannot be confident of having found the match for a
person's data. The algorithm should be secure enough that, even
knowing the algorithm, the adversary is not able to reverse-engineer
the anonymized data.
4.2 L-Diversity
The l-diversity technique can be applied after k-anonymity has been
applied to the dataset. It is an extension of k-anonymity in which
the number of partitions in the representation of the data is
reduced. Sensitive attributes are made diverse within each equivalence
class (the group of k matching rows). This ensures that each
equivalence class has at least l distinct values for a sensitive
attribute [5].
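A minimal sketch of the distinct-values check described above follows. The sample rows and function names are hypothetical, and the parameter is spelled `ell` only to avoid the ambiguous single-letter name.

```python
from collections import defaultdict


def distinct_sensitive_values(rows, quasi_identifiers, sensitive):
    """Map each equivalence class to its number of distinct sensitive values."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].add(row[sensitive])
    return {key: len(values) for key, values in classes.items()}


def is_l_diverse(rows, quasi_identifiers, sensitive, ell):
    """True if every equivalence class has at least ell distinct sensitive values."""
    counts = distinct_sensitive_values(rows, quasi_identifiers, sensitive)
    return all(n >= ell for n in counts.values())


# A table that is already 3-anonymous on (age, city).
rows = [
    {"age": "30-39", "city": "A", "disease": "cancer"},
    {"age": "30-39", "city": "A", "disease": "flu"},
    {"age": "30-39", "city": "A", "disease": "asthma"},
    {"age": "40-49", "city": "A", "disease": "flu"},
    {"age": "40-49", "city": "A", "disease": "flu"},
    {"age": "40-49", "city": "A", "disease": "flu"},
]
qi = ["age", "city"]
counts = distinct_sensitive_values(rows, qi, "disease")

print(counts, is_l_diverse(rows, qi, "disease", 2))
```

The second class shows why k-anonymity alone is not enough: all three rows in the 40-49 group share the same disease, so anyone known to be in that group has their disease revealed even though the group is 3-anonymous.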
4.3 T-Closeness
t-Closeness is a model that extends l-diversity; it treats the
values of a sensitive attribute distinctly by considering the
distribution of data values for that attribute. There is a threshold
value t, and for every equivalence class (a group of k matches) the
distance between the distribution of the sensitive attribute in that
class and the distribution of the attribute in the whole table
should be at most t [6]. For numerical values of the tuples, the
t-closeness anonymizing algorithm is more effective than many other
privacy-preserving data mining methods.
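The threshold check can be sketched as follows. The t-closeness paper [6] measures the distance between distributions with the Earth Mover's Distance; this sketch swaps in the simpler variational distance (half the sum of absolute differences), which is an assumption made here for brevity. The sample rows and function names are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Half the L1 difference between two distributions (stand-in for EMD)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def max_class_distance(rows, quasi_identifiers, sensitive):
    """Largest distance between any class distribution and the whole-table one."""
    overall = distribution([row[sensitive] for row in rows])
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].append(row[sensitive])
    return max(variational_distance(distribution(vals), overall)
               for vals in classes.values())


def is_t_close(rows, quasi_identifiers, sensitive, t):
    """True if no equivalence class deviates from the table by more than t."""
    return max_class_distance(rows, quasi_identifiers, sensitive) <= t


rows = [
    {"age": "30-39", "city": "A", "disease": "cancer"},
    {"age": "30-39", "city": "A", "disease": "flu"},
    {"age": "30-39", "city": "A", "disease": "asthma"},
    {"age": "40-49", "city": "A", "disease": "flu"},
    {"age": "40-49", "city": "A", "disease": "flu"},
    {"age": "40-49", "city": "A", "disease": "flu"},
]
qi = ["age", "city"]
worst = max_class_distance(rows, qi, "disease")

print(worst, is_t_close(rows, qi, "disease", 0.5))
```

A smaller t forces every class to resemble the whole table more closely, which strengthens privacy but usually requires more generalization and therefore more information loss.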
Fig -2: Flow chart of implementation
5. CONCLUSIONS
The system provides a faster anonymization approach and avoids some
major disadvantages of the Hadoop implementation. Spark also
provides ease of use and access.
The anonymization can be further improved to reduce information
loss and improve efficiency. Since anonymization is not just the
removal of Q.I but also the preservation of utility, it is a
significant factor in many big-data issues such as scalability and
dimensionality. In the future, this system can be integrated with
other systems in order to make the best use of privacy
preservation. Analysis of the output can also be improved.
REFERENCES
[1] Balaji K Bodkhe and Dr. Sanjay P Sood, "Privacy preservation for
medical dataset using Hadoop".
[2] Karim Abouelmehdi, Abderrahim Beni-Hessane, and Hayat Khaloufi,
"Big healthcare data: preserving security and privacy".
[3] Blog reference:
https://data-flair.training/blogs/hadoop-tutorial/
[4] Apache Spark Documentation:
https://spark.apache.org/docs/latest/
[5] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and
Muthuramakrishnan Venkitasubramaniam (March 2007). "L-diversity:
Privacy Beyond K-anonymity". ACM Trans. Knowl. Discov. Data.
[6] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian (2007).
"t-Closeness: Privacy beyond k-anonymity and l-diversity".
