Cynthia Selvi P & Mohamed Shanavas A.R
International Journal of Data Engineering (IJDE), Volume (5) : Issue (3) : 2014 22
An Optimal Approach For Knowledge Protection In Structured
Frequent Patterns
Cynthia Selvi P pselvi1501@gmail.com
Associate Professor, Dept. of Computer Science
Kunthavai Naacchiyar Govt. Arts College for Women(Autonomous), Thanjavur 613007
Affiliated to Bharathidasan University, Tiruchirapalli, TamilNadu, India.
Mohamed Shanavas A.R vas0699@yahoo.co.in
Associate Professor, Dept. of Computer Science,
Jamal Mohamed College, Tiruchirapalli 620 020
Affiliated to Bharathidasan University, Tiruchirapalli, TamilNadu, India.
Abstract
Data mining is a valuable technology that facilitates the extraction of useful patterns and trends from
large volumes of data. When these patterns are to be shared in a collaborative environment, they
must be shared protectively among the parties concerned in order to preserve the confidentiality
of the sensitive data. Information may be shared in the form of datasets or in any of the
structured patterns such as trees, graphs and lattices. This paper proposes a sanitization algorithm for
protecting sensitive data in a structured frequent pattern (tree).
Keywords: Rank Function, Restricted Node, Sanitization, Structured Pattern, Victim States.
1. INTRODUCTION
Data mining is an emerging technology that provides various means for identifying interesting
and important knowledge in large data collections. When this knowledge is to be shared
among various parties in decision-making activities, the sensitive data must be protected by the
parties concerned. In particular, when multiple companies want to share their customers' buying
behavior in a collaborative environment that promotes business, the sensitive information
of each individual company should be protected against sharing. The information to be shared may
be in the form of datasets, frequent itemsets, structured patterns or subsequences. Here
a structured pattern refers to a substructure such as a graph, tree or lattice that contains frequent
itemsets[1]. Various approaches have been proposed so far to address this problem of preserving
sensitive patterns. This paper proposes an algorithm that sanitizes sensitive information
in a frequent pattern tree (because trees exhibit the relationships among the itemsets more
clearly) and that leaves no trace for the counterpart or an adversary to recover the hidden
information, by blocking all possible inferences.
In this article, Section 2 briefly reviews the literature; Section 3 states the definitions needed for the
sanitization approach and the algorithm presented in this article; and Section 4 gives the proposed
algorithm. Section 5 illustrates the approach with sample trees, and Section 6 discusses the
performance metrics with sample results.
2. LITERATURE REVIEW
Owing to the wide applicability of data mining, and of association rule mining in particular,
attention has focused on the problem of protecting sensitive knowledge against
inference, which has been addressed by various researchers[2-12]. This task is referred to as
sanitization in [2]; it blocks the inference of sensitive rules while allowing collaborators to mine
their own data independently and then share some of the resulting patterns. The above work
concentrates on hiding frequent itemsets in databases based on the support and/or confidence
framework. In many situations, it is more convenient to share the information in the form
of structured patterns such as graphs, trees and lattices instead of sharing the entire databases.
In structured patterns, when a particular sensitive pattern is to be removed, its supersets and
subsets should also be removed in order to block the forward and backward inferences. The work
presented in [7] proposes an algorithm (DSA) that sanitizes sensitive information in graphs and
compares its efficiency with that of the Naïve approach. Naïve blocks only the forward
inference attack, whereas DSA blocks both forward and backward inference attacks. In DSA, however, the
subsets of the sensitive pattern are chosen at random. This can lead to the
removal of a larger number of patterns, because when a subset pattern is removed, the other
patterns associated with it must also be removed; otherwise a forward trace would remain for the
counterpart to infer the details of the hidden pattern. Hence, this random removal reduces
the data utility of the source dataset. To overcome this problem, this article
presents an algorithm (RSS) for sanitizing sensitive information in a structured pattern tree that uses
a rank function to reduce the computational complexity and the loss of legitimate information. In
comparison with DSA, this algorithm completely blocks the forward and backward inference
attacks by removing the sensitive information and its associated information in an optimized way
by means of the rank function.
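The forward trace discussed above can be illustrated with a small sketch. It assumes that the shared pattern is simply a collection of itemsets and that an adversary suspects a hidden itemset whenever all of its immediate subsets are still present; the function name `forward_inferable` and the sample itemsets are hypothetical and are not taken from [7].

```python
from itertools import combinations

def forward_inferable(shared, candidate):
    """Return True if every immediate subset (one item fewer) of `candidate`
    is present in `shared`, i.e. the hidden itemset leaves a forward trace."""
    candidate = frozenset(candidate)
    subsets = [frozenset(c)
               for c in combinations(sorted(candidate), len(candidate) - 1)]
    return all(s in shared for s in subsets)

# Hypothetical shared patterns: abc was removed, but its 2-item subsets remain.
shared = {frozenset(s) for s in ["a", "b", "c", "ab", "ac", "bc"]}
trace_before = forward_inferable(shared, "abc")

# Removing one subset (here ac) breaks the trace.
shared_sanitized = shared - {frozenset("ac")}
trace_after = forward_inferable(shared_sanitized, "abc")

print(trace_before, trace_after)
```

In this toy setting, removing a single well-chosen subset is enough to break the trace, which is the intuition behind selecting subsets by rank rather than at random.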
3. BASIC DEFINITIONS
Tree: A tree is a finite set of one or more nodes such that there is a specially designated node
called the root and the remaining nodes are partitioned into n ≥ 0 disjoint sets T1, ..., Tn, where
each of these sets is a tree. The sets T1, ..., Tn are called the subtrees of the root[13].
Set-Enumeration (SE)-tree: It is a tool for representing and/or enumerating sets in a best-first
fashion. The complete SE-tree systematically enumerates elements of a power-set using a pre-
imposed order on the underlying set of elements.
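A minimal sketch of a complete SE-tree is given below. For simplicity it visits the nodes level by level rather than best-first, and it assumes that the imposed order is the order in which the items are passed in; the function name `se_tree_levels` is hypothetical.

```python
from collections import deque

def se_tree_levels(items):
    """Enumerate the power set of `items` as a complete SE-tree, level by level.
    A node is extended only with items that come after its last item in the
    imposed order, so every subset is generated exactly once."""
    order = list(items)
    levels = []
    queue = deque([((), 0)])   # (node, index of the first item allowed next)
    while queue:
        node, start = queue.popleft()
        if len(node) == len(levels):
            levels.append([])          # first node seen with this many items
        levels[len(node)].append("".join(node))
        for i in range(start, len(order)):
            queue.append((node + (order[i],), i + 1))
    return levels

levels = se_tree_levels("abc")
print(levels)
```

The first list holds the empty root, and each later list holds the itemsets with one more item than the list before it.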
Structured Pattern Tree: A structured pattern tree, denoted by T=(N, L), consists of a nonempty set
of frequent itemsets N and a set of links L that are ordered pairs of the elements of N such that, for all
nodes a, b ∈ N, there is a link from a to b if a ∩ b = a and |b| - |a| = 1, where |x| is the size of the
itemset x.
Level: Let T=(N, L) be a structured pattern (frequent itemset) tree. The level k of an itemset x
∈ N is the length of the path connecting a 1-itemset (usually at level-0) to x.
Height: Let T=(N, L) be a structured pattern tree. The height h of T is the length of the longest
path connecting a 1-itemset a with any other itemset b, such that a, b ∈ N and a ⊂ b.
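The three definitions above can be sketched as follows, using the itemsets of the sample tree in Fig.1. The representation of nodes as `frozenset` objects and the helper names `build_links`, `level` and `height` are assumptions made for illustration only.

```python
def build_links(nodes):
    """Links (a, b) of T = (N, L): a is a subset of b and |b| - |a| = 1."""
    return {(a, b) for a in nodes for b in nodes
            if a < b and len(b) - len(a) == 1}

def level(itemset):
    """Each link adds one item, so a k-itemset lies k - 1 links above level-0."""
    return len(itemset) - 1

def height(nodes):
    """Height of T: the largest level reached by any node."""
    return max(level(x) for x in nodes)

# Itemsets of the sample tree in Fig.1 (the empty root is left out).
N = {frozenset(s) for s in
     "a b c d e ab ac ad ae bc bd be cd de "
     "abc abd abe acd ade bcd bde abcd abde".split()}
L = build_links(N)

children_of_ab = sorted("".join(sorted(b)) for a, b in L if a == frozenset("ab"))
print(height(N), children_of_ab)
```

Here `a < b` is Python's proper-subset test, which together with the size condition matches a ∩ b = a and |b| - |a| = 1.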
Delete: The deletion of a node x from Ti is denoted as Del(x). The resulting Ti’ is the same as Ti
without the node x. In particular, if p1, ..., pm, x, s1, ..., sn is the sibling sequence in a level of Ti, then
p1, ..., pm, s1, ..., sn is the sibling sequence in Ti’.
Negative Border Nodes: Negative border nodes have the property that all their members
(proper subsets) are frequent.
Problem state: Each node in the tree is a problem state.
R-nodes: Nodes that are sensitive and must be restricted before the structured pattern is
shared.
P-nodes: Predecessor nodes (subsets) of R-nodes, which are to be identified (in order to block
forward inference) before selecting the particular nodes that are to be deleted.
S-nodes: Successor nodes (supersets) of R-nodes, which are to be identified (in order to block
backward inference) before selecting the particular nodes that are to be deleted.
Victim states: Problem states on the paths from P-node(s) at level-1 (containing 2-itemsets)
to the R-node and/or to S-node(s), which are searched to select nodes for deletion.
V-node(s): Node(s) to be deleted selectively (based on the rank function) among the victim states.
Rank function - r(.): Choose the P-node (of an R-node) that leads to only one S-node (with a single
Primary Link); choose one at random when a tie occurs. The search for victim nodes can often be
sped up by using the ranking function r(.) for all P-nodes. The ideal way to assign ranks would
be on the basis of the minimum additional computational cost incurred when this P-node is
removed.
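One possible reading of the rank function is sketched below. The text does not define a primary link formally, so the sketch assumes the SE-tree convention that a primary link joins a node to a child formed by appending one item that sorts after all of the node's items. The names `primary_children`, `rank` and `choose_p_node` are hypothetical.

```python
import random

def primary_children(node, nodes):
    """Assumed convention: a primary link joins `node` to a child formed by
    appending one item that sorts after every item already in `node`."""
    return [b for b in nodes
            if node < b and len(b) == len(node) + 1
            and min(b - node) > max(node)]

def rank(node, nodes):
    """r(.): number of primary links leaving a P-node (fewer is cheaper)."""
    return len(primary_children(node, nodes))

def choose_p_node(p_nodes, nodes, rng=random):
    """Pick the P-node with the lowest rank; break ties at random."""
    best = min(rank(p, nodes) for p in p_nodes)
    tied = sorted((p for p in p_nodes if rank(p, nodes) == best), key=sorted)
    return rng.choice(tied)

# Itemsets of the sample tree in Fig.1.
N = {frozenset(s) for s in
     "a b c d e ab ac ad ae bc bd be cd de "
     "abc abd abe acd ade bcd bde abcd abde".split()}
ab, ac = frozenset("ab"), frozenset("ac")
chosen = choose_p_node([ab, ac], N)
print(rank(ab, N), rank(ac, N), "".join(sorted(chosen)))
```

Under this assumption ab has three primary links and ac has one, so ac is preferred, which agrees with the choice made in the illustration of Section 5.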
4. ALGORITHM
Rank-based Structured-pattern Sanitization (RSS):
Input: Frequent Pattern Tree (T), Set of Restricted Nodes (R-nodes)
Output: Sanitized Tree (T’)
Begin
obtain height h of the input tree;
identify ri ∈ R (R-nodes to be restricted);
// Select victim nodes
for each ri ∈ R
{
    find level k of ri;
    V_nodes[ ] ← ri;
    if k > 0
    {
        j ← k + 1;                  // levels above ri, up to the height
        do while (j <= h)
        {
            obtain S-nodes of ri at level j (i.e., supersets);
            V_nodes[ ] ← V_nodes[ ] + S-nodes(ri);
            j ← j + 1;
        }
        j ← k - 1;                  // levels below ri, down to level-1
        do while (j >= 1)
        {
            obtain P-nodes of ri at level j (i.e., subsets);
            V_nodes[ ] ← V_nodes[ ] + P-node that satisfies r(.) with all its successors;
            j ← j - 1;
        }
    }
    delete V_nodes[ ];
}
T’ ← T;
End
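The steps above can be sketched in Python as follows. This is only an illustrative reading of the pseudocode under stated assumptions: nodes are `frozenset` itemsets, a primary link follows the SE-tree convention (appending an item that sorts after all items of the node), every subset of the R-node at a given level is treated as a candidate P-node, and ties in the rank are broken lexicographically instead of at random so that the run is reproducible. All function names are hypothetical.

```python
def level(x):
    return len(x) - 1

def supersets(x, nodes):
    """All proper supersets of x present in the tree (its successors)."""
    return {b for b in nodes if x < b}

def primary_link_count(x, nodes):
    # Assumed SE-tree convention for primary links (see the lead-in).
    return sum(1 for b in nodes
               if x < b and len(b) == len(x) + 1 and min(b - x) > max(x))

def rss(nodes, restricted):
    """Return the node set of the sanitized tree T' for the given R-nodes."""
    remaining = set(nodes)
    for r in restricted:
        victims = {r}
        k = level(r)
        if k > 0:
            # S-nodes: every superset of r, up to the height of the tree.
            victims |= supersets(r, nodes)
            # P-nodes: at each preceding level j >= 1, pick one subset by rank.
            for j in range(k - 1, 0, -1):
                p_nodes = [a for a in nodes if a < r and level(a) == j]
                if not p_nodes:
                    continue
                victim = min(p_nodes, key=lambda a: (primary_link_count(a, nodes),
                                                     sorted(a)))
                victims |= {victim} | supersets(victim, nodes)
        remaining -= victims
    return remaining

# Sample tree of Fig.1 with abc as the only R-node.
N = {frozenset(s) for s in
     "a b c d e ab ac ad ae bc bd be cd de "
     "abc abd abe acd ade bcd bde abcd abde".split()}
sanitized = rss(N, [frozenset("abc")])
removed = sorted("".join(sorted(x)) for x in N - sanitized)
print(removed)
```

With these assumptions the sketch removes abc, abcd, ac and acd from the sample tree, and the level-0 nodes are never touched because the P-node loop stops at level-1.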
5. ILLUSTRATION
A sample frequent pattern SE-tree is given in Fig.1. Let the node to be protected (R-node) be the
one with itemset abc (dark-filled in Fig.2). When it is marked for deletion (V-node),
it is conventionally treated as infrequent. By the anti-monotone property of frequent itemsets, if a set
cannot pass a test, all of its supersets will fail the same test as well. Hence all of its supersets
(S-nodes) are to be identified and deleted, up to the level that equals the height of the tree. In this
example, node abcd (shaded) is the superset of abc and so it is marked for deletion (V-node).
Deleting the R-node and its supersets hides the details of the sensitive
data (restricted nodes) and blocks the backward inference of the R-node.
Moreover, the negative border nodes must also be deleted to completely block future
inference of the sensitive data. This is achieved by identifying the predecessor nodes
(P-nodes) of the R-node and suitably removing them by means of the rank function r(.) defined earlier. In this
example, abc has two P-nodes with primary links, ab and ac; since
ac (shaded) has only one primary link, it is marked for deletion (V-node) together with all its successors (in
this case, acd). Refer to Fig.2.
Finally, all victim nodes are deleted, which yields the sanitized frequent pattern tree to be shared
(given in Fig.3); hence forward inference is also blocked.
In contrast, if ab had been chosen as the victim node, three more nodes would
have been removed, resulting in greater information loss and utility loss.
Hence the rank function used in this approach sanitizes the structured frequent pattern tree with
reduced information loss and utility loss.
However, the nodes at level-0 (1-itemsets) are never deleted, which preserves the
distinct items in the given structured frequent pattern tree.
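The comparison between the two candidate victims can be checked with a short sketch. It assumes that choosing a P-node removes that node, the R-node and every superset of either; the helper name `removal_set` is hypothetical.

```python
def removal_set(r_node, p_node, nodes):
    """Nodes deleted when `p_node` is the chosen victim for `r_node`:
    the R-node, the P-node, and every superset of either."""
    return {x for x in nodes if r_node <= x or p_node <= x}

def labels(itemsets):
    return sorted("".join(sorted(x)) for x in itemsets)

# Itemsets of the sample tree in Fig.1.
N = {frozenset(s) for s in
     "a b c d e ab ac ad ae bc bd be cd de "
     "abc abd abe acd ade bcd bde abcd abde".split()}
abc, abcd = frozenset("abc"), frozenset("abcd")
ab, ac = frozenset("ab"), frozenset("ac")

via_ac = removal_set(abc, ac, N)
via_ab = removal_set(abc, ab, N)

# Nodes lost beyond the R-node, its superset and the chosen P-node itself.
extra_ac = labels(via_ac - {abc, abcd, ac})
extra_ab = labels(via_ab - {abc, abcd, ab})
print(extra_ac, extra_ab)
```

Choosing ac drags along only acd, whereas choosing ab drags along abd, abe and abde, so the lower-ranked P-node costs fewer legitimate patterns.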
FIGURE 1: Frequent Pattern Tree before Sanitization.
Root: ɸ
Level 0: a, b, c, d, e
Level 1: ab, ac, ad, ae, bc, bd, be, cd, de
Level 2: abc, abd, abe, acd, ade, bcd, bde
Level 3: abcd, abde
Legend: Primary Link, Secondary Link
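FIGURE 2: Frequent Pattern Tree with Victim Nodes.
Root: ɸ
Level 0: a, b, c, d, e
Level 1: ab, ac, ad, ae, bc, bd, be, cd, de
Level 2: abc, abd, abe, acd, ade, bcd, bde
Level 3: abcd, abde
Marked nodes: abc (R-node, dark-filled); abcd and ac (shaded)
FIGURE 3: Frequent Pattern Tree after Sanitization.
Root: ɸ
Level 0: a, b, c, d, e
Level 1: ab, ad, ae, bc, bd, be, cd, de
Level 2: abd, abe, ade, bcd, bde
Level 3: abde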
6. EXPERIMENTAL ANALYSIS
The algorithm was tested on the dataset T10I4D100K[14] with the number of transactions ranging
from 1K to 10K and the number of restricted nodes from 1 to 5. The test runs were made on an Intel Core
i5 processor at 2.3 GHz with 4GB RAM running a 32-bit OS. The
proposed algorithm was implemented with Windows 7, NetBeans 6.9.1 and SQL 2005. The frequent
patterns were obtained using Matrix Apriori[15], which requires only two scans of the original
database and uses simpler data structures.
The efficiency of this approach is studied using the measures given below and is
compared (Figures 4 to 7) with the previously proposed algorithms IMA, PMA and TMA[9-12], which
sanitize the sensitive patterns (itemsets) in the source datasets.
Dissimilarity (dif): The dissimilarity between the original (D) and sanitized (D’) databases is
measured in terms of their contents by the formula
dif(D, D’) = (1 / Σ_{i=1}^{n} f_D(i)) × Σ_{i=1}^{n} [f_D(i) − f_D’(i)]
where f_X(i) represents the frequency of the i-th item in the dataset X and n is the number of
distinct items. This approach has a very low percentage of
dissimilarity, which shows that the information loss is very low and so the utility is well preserved.
From Fig.4 and Fig.5, it is observed that the proposed algorithm, RSS, has very low dissimilarity in
comparison with the previous algorithms. However, when the number of transactions is increased, the
dissimilarity increases; this is due to the removal of subsets (with their associated nodes) of the
sensitive nodes for blocking the forward inference attack, and it is observed to be less than 5%.
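A minimal sketch of the dissimilarity measure is given below. It assumes the frequency-based form of dif(D, D’), in which the item frequencies lost by sanitization are summed and divided by the total item frequency of D, and it assumes that a database is a list of transactions. The function names and the toy transactions are hypothetical and are not drawn from T10I4D100K.

```python
from collections import Counter

def item_frequencies(transactions):
    """f_X(i): number of transactions of X that contain item i."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts

def dissimilarity(original, sanitized):
    """dif(D, D'): share of item occurrences of D that are missing from D'."""
    f_d = item_frequencies(original)
    f_s = item_frequencies(sanitized)
    total = sum(f_d.values())
    if total == 0:
        return 0.0                      # empty database: nothing to lose
    return sum(f_d[i] - f_s[i] for i in f_d) / total

# Toy databases: item c was removed from the first transaction.
D = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]
D_sanitized = [{"a", "b"}, {"a", "c"}, {"b", "d"}]
dif = dissimilarity(D, D_sanitized)
print(f"{dif:.2%}")
```

One of the seven item occurrences in D is missing from the sanitized copy, so the measure is 1/7.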
CPU Time: The execution time of the proposed algorithm is tested by varying the number of
nodes to be restricted. Fig.6 and Fig.7 show that the execution time required by the RSS algorithm is low
in comparison with the other algorithms. It is also observed that the execution time is lowest when
the number of transactions in the source dataset is larger. However, time is not a significant criterion, as
the sanitization is done offline.
FIGURE 4: Dissimilarity (varying no. of rules).
FIGURE 5: Dissimilarity (varying no. of transactions).
FIGURE 6: Execution Time (varying no. of rules).
FIGURE 7: Execution Time (varying no. of transactions).
Scalability: In order to hide sensitive knowledge in patterns effectively, sanitization
algorithms must be efficient and scalable, which means that the running time of any sanitizing
algorithm must be predictable and acceptable. The efficiency and scalability of the proposed
approach are established below.
Theorem: The running time of the Rank-based Structured-pattern Sanitization (RSS) approach is
at least O[r(l+s)], where r is the number of restrictive nodes (R-nodes), l is the number of
preceding levels that have subsets of R-nodes and s is the number of S-nodes (supersets) of
R-nodes.
Proof: Let T be a given structured frequent pattern tree with N being the total number of nodes in
T; let r be the number of sensitive nodes (R-nodes) to be restricted among N, l the number of
preceding levels of R-nodes and s the number of S-nodes (supersets) of R-nodes in T.
The proposed approach first finds the height of the given tree. For every given R-node, it finds the victim
states, which are the collection of its S-nodes (supersets) and those P-nodes (subsets) that have only
one primary link to their own successors. As this approach satisfies the anti-monotone property, all s
S-nodes are victim nodes and are deleted to block the backward inferences. Among
the P-nodes (subsets), however, at each preceding level (other than level-0) the node (subset)
that has a single primary link to its successors is obtained and deleted (with all its
successors) in order to block all forward inferences; this selection process is quite straightforward
and is repeated for all R-nodes.
This algorithm makes use of both depth-wise and breadth-wise search, which requires at least
O(l+s) computational complexity for every R-node.
Hence, the running time of the proposed algorithm for r R-nodes is at least O[r(l+s)], which is linear
and better than O(n^2), O(n^3), O(2^n), O(n log N).
7. CONCLUSION
The algorithm proposed in this work sanitizes the structured frequent pattern tree in an optimal
way by using a rank function that reduces the computational complexity as well as the
information loss and utility loss. Moreover, this approach blocks all the inference channels of the
restrictive patterns in both forward and backward directions, leaving no trace of the nodes that are
restricted (removed) before sharing. This simulation process facilitates the task of sanitizing the
structured pattern with different sets of restricted information when it is to be shared between
different sets of collaborators. However, when the database is large, it is sometimes unrealistic to
construct a main-memory-based pattern tree. The proposed algorithm sanitizes patterns in a static
dataset, and the sanitization is done offline because the decision analysis of the
restricted rules is performed offline. Further effort is being made to apply an optimized heuristic
approach to sanitize continuous and dynamic datasets.
8. REFERENCES
[1] J.Han, M.Kamber, Data Mining Concepts and Techniques, Oxford University Press, 2009.
[2] M.Atallah, E.Bertino, A.Elmagarmid, M.Ibrahim and V.Verykios “Disclosure Limitation of
Sensitive Rules”, Proc. of IEEE Knowledge and Data Engineering Workshop, pages 45–52,
Chicago, Illinois, Nov 1999.
[3] E.Dasseni, V.S.Verykios, A.K.Elmagarmid & E.Bertino, “Hiding Association Rules by Using
Confidence and Support”, Proc. of the 4th Information Hiding Workshop, pages 369–383,
Pittsburgh, PA, Apr 2001.
[4] Y.Saygin, V.S.Verykios, and C.Clifton, “Using Unknowns to Prevent Discovery of Association
Rules”, SIGMOD Record, 30(4):45–54, Dec 2001.
[5] S.R.M.Oliveira, and O.R.Zaiane, “Privacy preserving Frequent Itemset Mining”, Proc. of the
IEEE ICDM Workshop on Privacy, Security, and Data Mining, Pages 43-54, Maebashi City,
Japan, Dec 2002.
[6] S.R.M.Oliveira, and O.R.Zaiane, “An Efficient One-Scan Sanitization for Improving the
Balance between Privacy and Knowledge Discovery”, Technical Report TR 03-15, Jun 2003.
[7] S.R.M.Oliveira, and O.R.Zaiane, “Secure Association Rule Mining”, Proc. of the 8th Pacific-Asia
Conference on Knowledge Discovery and Data Mining(PAKDD’04), Pages 74-85,
Sydney, Australia, May 2004.
[8] B.Yildz, and B.Ergenc, “Hiding Sensitive Predictive Frequent Itemsets”, Proc. of the
International MultiConference of Engineers and Computer Scientists 2011, Vol-I.
[9] P.Cynthia Selvi, A.R.Mohamed Shanavas, “An effective Heuristic Approach for Hiding
Sensitive Patterns in Databases”, International Organization of Scientific Research-Journal
of Computer Engineering(IOSRJCE) Vol. 5, Issue 1(Sep-Oct 2012), PP 06-11.
[10] P.Cynthia Selvi, A.R.Mohamed Shanavas, “An Improved Item-based Maxcover Algorithm to
protect Sensitive Patterns in Large Databases”, International Organization of Scientific
Research-Journal of Computer Engineering(IOSRJCE) Vol.14, Issue 4, Oct 2013, Pages 1-5.
[11] P.Cynthia Selvi, A.R.Mohamed Shanavas, “Output Privacy Protection With Pattern-Based
Heuristic Algorithm”, International Journal of Computer Science & Information
Technology(IJCSIT) Vol 6, No 2, Apr 2014, Pages 141 – 152.
[12] P.Cynthia Selvi, A.R.Mohamed Shanavas, “Towards Information Privacy Using Transaction-
Based Maxcover Algorithm”, World Applied Sciences Journal 29 (Data Mining and Soft
Computing Techniques): 06-11, 2014.
[13] Ellis Horowitz, Sartaj Sahni, Sanguthevar Rajasekaran, Fundamentals of Computer
Algorithms, Galgotia Pub. Pvt. Ltd, Delhi, 1999.
[14] The dataset used in this work for experimental analysis was generated using the generator
from the IBM Almaden Quest research group and is publicly available from
http://fimi.ua.ac.be/data/.
[15] J.Pavon, S.Viana, S.Gomez, “Matrix Apriori: speeding up the search for frequent patterns”,
Proc. 24th IASTED International Conference on Databases and Applications 2006, pp. 75-82.

More Related Content

PDF
Comparative study of ksvdd and fsvm for classification of mislabeled data
PDF
Analysis of Classification Algorithm in Data Mining
DOC
report.doc
PPT
15857 cse422 unsupervised-learning
PPTX
Data mining: Classification and prediction
PPTX
lazy learners and other classication methods
PDF
Dimensionality reduction by matrix factorization using concept lattice in dat...
PPT
1.7 data reduction
Comparative study of ksvdd and fsvm for classification of mislabeled data
Analysis of Classification Algorithm in Data Mining
report.doc
15857 cse422 unsupervised-learning
Data mining: Classification and prediction
lazy learners and other classication methods
Dimensionality reduction by matrix factorization using concept lattice in dat...
1.7 data reduction

What's hot (20)

PDF
M08 BiasVarianceTradeoff
PPTX
Decision Tree - C4.5&CART
PPTX
Data Compression in Data mining and Business Intelligencs
PDF
Decision tree lecture 3
PPTX
Data Mining: Mining stream time series and sequence data
PPT
Data Mining
PPTX
Cluster Analysis
PPTX
Chapter 4 Classification
PPTX
Data discretization
PDF
2018 p 2019-ee-a2
PPT
1.8 discretization
PPTX
Data mining technique (decision tree)
PPTX
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
PDF
Chapter 05 k nn
PDF
Polikar10missing
PDF
Associative Classification: Synopsis
PDF
Classifiers
PDF
Sparse inverse covariance estimation using skggm
PPT
Health-e-Child CaseReasoner
PDF
Hypothesis on Different Data Mining Algorithms
M08 BiasVarianceTradeoff
Decision Tree - C4.5&CART
Data Compression in Data mining and Business Intelligencs
Decision tree lecture 3
Data Mining: Mining stream time series and sequence data
Data Mining
Cluster Analysis
Chapter 4 Classification
Data discretization
2018 p 2019-ee-a2
1.8 discretization
Data mining technique (decision tree)
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Chapter 05 k nn
Polikar10missing
Associative Classification: Synopsis
Classifiers
Sparse inverse covariance estimation using skggm
Health-e-Child CaseReasoner
Hypothesis on Different Data Mining Algorithms
Ad

Viewers also liked (18)

PDF
Tutorial de eclipse
PPTX
русский писатель в гостях у китая
DOCX
Test 7ºAB_May2012
PPTX
PDF
Overview_complete
PDF
Sample econ 3rd
PPTX
PPT
под прицелом зависти
PDF
Coccyx Cushion
PPTX
христианство. иконопись
PPTX
New zealand
PDF
Decanter centrifuge
PPTX
тhе new zealand
PDF
Performance Assessment of Faculties of Management Discipline From Student Per...
PDF
Sponsorship & Marketing Plan - AC Savoia
PPTX
capacidad del proceso de produccion , analisis de los cuellos de botella
PPT
FARPI-FRANCE vous présente ses convoyeurs standards VIRGINIO
PPSX
Geografi Tingkatan 1: Mukabumi tanah tinggi
Tutorial de eclipse
русский писатель в гостях у китая
Test 7ºAB_May2012
Overview_complete
Sample econ 3rd
под прицелом зависти
Coccyx Cushion
христианство. иконопись
New zealand
Decanter centrifuge
тhе new zealand
Performance Assessment of Faculties of Management Discipline From Student Per...
Sponsorship & Marketing Plan - AC Savoia
capacidad del proceso de produccion , analisis de los cuellos de botella
FARPI-FRANCE vous présente ses convoyeurs standards VIRGINIO
Geografi Tingkatan 1: Mukabumi tanah tinggi
Ad

Similar to An Optimal Approach For Knowledge Protection In Structured Frequent Patterns (20)

PDF
FINDING FREQUENT SUBPATHS IN A GRAPH
PDF
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
PDF
A genetic algorithm coupled with tree-based pruning for mining closed associa...
PDF
Literature Survey of modern frequent item set mining methods
PDF
Graph Machine Learning - Past, Present, and Future -
PDF
Topics In Rough Set Theory Current Applications To Granular Computing Seiki A...
PDF
PDF
Fast Sequential Rule Mining
PDF
An Effective Heuristic Approach for Hiding Sensitive Patterns in Databases
PDF
Association Rule Hiding using Hash Tree
PDF
Analysis of Pattern Transformation Algorithms for Sensitive Knowledge Protect...
PDF
K017167076
PDF
Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
PDF
Effieient Algorithms to Find Frequent Itemset using Data Mining
PPTX
FPPM algorithm
PPTX
An efficient approach to mine flexible periodic patterns in time series datab...
PDF
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
PDF
call for papers, research paper publishing, where to publish research paper, ...
PDF
A Study of Various Projected Data Based Pattern Mining Algorithms
PDF
A novel approach for text extraction using effective pattern matching technique
FINDING FREQUENT SUBPATHS IN A GRAPH
An Robust Outsourcing of Multi Party Dataset by Utilizing Super-Modularity an...
A genetic algorithm coupled with tree-based pruning for mining closed associa...
Literature Survey of modern frequent item set mining methods
Graph Machine Learning - Past, Present, and Future -
Topics In Rough Set Theory Current Applications To Granular Computing Seiki A...
Fast Sequential Rule Mining
An Effective Heuristic Approach for Hiding Sensitive Patterns in Databases
Association Rule Hiding using Hash Tree
Analysis of Pattern Transformation Algorithms for Sensitive Knowledge Protect...
K017167076
Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
Effieient Algorithms to Find Frequent Itemset using Data Mining
FPPM algorithm
An efficient approach to mine flexible periodic patterns in time series datab...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
call for papers, research paper publishing, where to publish research paper, ...
A Study of Various Projected Data Based Pattern Mining Algorithms
A novel approach for text extraction using effective pattern matching technique

More from Waqas Tariq (20)

PDF
The Use of Java Swing’s Components to Develop a Widget
PDF
3D Human Hand Posture Reconstruction Using a Single 2D Image
PDF
Camera as Mouse and Keyboard for Handicap Person with Troubleshooting Ability...
PDF
A Proposed Web Accessibility Framework for the Arab Disabled
PDF
Real Time Blinking Detection Based on Gabor Filter
PDF
Computer Input with Human Eyes-Only Using Two Purkinje Images Which Works in ...
PDF
Toward a More Robust Usability concept with Perceived Enjoyment in the contex...
PDF
Collaborative Learning of Organisational Knolwedge
PDF
A PNML extension for the HCI design
PDF
Development of Sign Signal Translation System Based on Altera’s FPGA DE2 Board
PDF
An overview on Advanced Research Works on Brain-Computer Interface
PDF
Exploring the Relationship Between Mobile Phone and Senior Citizens: A Malays...
PDF
Principles of Good Screen Design in Websites
PDF
Progress of Virtual Teams in Albania
PDF
Cognitive Approach Towards the Maintenance of Web-Sites Through Quality Evalu...
PDF
USEFul: A Framework to Mainstream Web Site Usability through Automated Evalua...
PDF
Robot Arm Utilized Having Meal Support System Based on Computer Input by Huma...
PDF
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
PDF
An Improved Approach for Word Ambiguity Removal
PDF
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
The Use of Java Swing’s Components to Develop a Widget
3D Human Hand Posture Reconstruction Using a Single 2D Image
Camera as Mouse and Keyboard for Handicap Person with Troubleshooting Ability...
A Proposed Web Accessibility Framework for the Arab Disabled
Real Time Blinking Detection Based on Gabor Filter
Computer Input with Human Eyes-Only Using Two Purkinje Images Which Works in ...
Toward a More Robust Usability concept with Perceived Enjoyment in the contex...
Collaborative Learning of Organisational Knolwedge
A PNML extension for the HCI design
Development of Sign Signal Translation System Based on Altera’s FPGA DE2 Board
An overview on Advanced Research Works on Brain-Computer Interface
Exploring the Relationship Between Mobile Phone and Senior Citizens: A Malays...
Principles of Good Screen Design in Websites
Progress of Virtual Teams in Albania
Cognitive Approach Towards the Maintenance of Web-Sites Through Quality Evalu...
USEFul: A Framework to Mainstream Web Site Usability through Automated Evalua...
Robot Arm Utilized Having Meal Support System Based on Computer Input by Huma...
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
An Improved Approach for Word Ambiguity Removal
Parameters Optimization for Improving ASR Performance in Adverse Real World N...

Recently uploaded (20)

PPTX
master seminar digital applications in india
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Complications of Minimal Access Surgery at WLH
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
01-Introduction-to-Information-Management.pdf
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
Lesson notes of climatology university.
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
Cell Types and Its function , kingdom of life
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
master seminar digital applications in india
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Complications of Minimal Access Surgery at WLH
VCE English Exam - Section C Student Revision Booklet
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
01-Introduction-to-Information-Management.pdf
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Lesson notes of climatology university.
Final Presentation General Medicine 03-08-2024.pptx
Abdominal Access Techniques with Prof. Dr. R K Mishra
Module 4: Burden of Disease Tutorial Slides S2 2025
O5-L3 Freight Transport Ops (International) V1.pdf
Cell Types and Its function , kingdom of life
FourierSeries-QuestionsWithAnswers(Part-A).pdf
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
102 student loan defaulters named and shamed – Is someone you know on the list?
Chapter 2 Heredity, Prenatal Development, and Birth.pdf

An Optimal Approach For Knowledge Protection In Structured Frequent Patterns

  • 1. Cynthia Selvi P & Mohamed Shanavas A.R International Journal of Data Engineering (IJDE), Volume (5) : Issue (3) : 2014 22 An Optimal Approach For Knowledge Protection In Structured Frequent Patterns Cynthia Selvi P pselvi1501@gmail.com Associate Professor, Dept. of Computer Science Kunthavai Naacchiyar Govt. Arts College for Women(Autonomous), Thanjavur 613007 Affiliated to Bharathidasan University, Tiruchirapalli, TamilNadu, India. Mohamed Shanavas A.R vas0699@yahoo.co.in Associate Professor, Dept. of Computer Science, Jamal Mohamed College, Tiruchirapalli 620 020 Affiliated to Bharathidasan University, Tiruchirapalli, TamilNadu, India. Abstract Data mining is valuable technology to facilitate the extraction of useful patterns and trends from large volume of data. When these patterns are to be shared in a collaborative environment, they must be protectively shared among the parties concerned in order to preserve the confidentiality of the sensitive data. Sharing of information may be in the form of datasets or in any of the structured patterns like trees, graphs, lattices, etc., This paper propose a sanitization algorithm for protecting sensitive data in a structured frequent pattern(tree). Keywords: Rank Function, Restricted Node, Sanitization, Structured Pattern, Victim States. 1. INTRODUCTION Data mining is an emerging technology to provide various means for identifying the interesting and important knowledge from large data collections. When this knowledge is to be shared among various parties in decision making activities, the sensitive data is to be preserved by the parties concerned. In particular, when multiple companies want to share the customer’s buying behavior in a collaborative business environment that promote business, the sensitive information of the individual company should be protected against sharing. The information to be shared may be in the form of datasets, frequent itemsets, structured patterns or subsequences. 
Here, a structured pattern refers to substructures such as graphs, trees or lattices that contain frequent itemsets[1]. Various approaches have been proposed so far to address the problem of preserving sensitive patterns. This paper proposes an algorithm that sanitizes sensitive information in a frequent pattern tree (trees exhibit the relationships among the itemsets more clearly) and, by blocking all possible inferences, leaves no trace from which a counterpart or an adversary could recover the hidden information. In this article, Section 2 briefly reviews the literature; Section 3 states the definitions needed for the sanitization approach and algorithm; Section 4 gives the proposed algorithm; Section 5 illustrates it with sample graphs; and Section 6 discusses the performance metrics with sample results.
2. LITERATURE REVIEW
Owing to the wide applicability of data mining, and of association rule mining in particular, the problem of protecting sensitive knowledge against inference has received specific attention and has been addressed by various researchers[2-12]. This task is referred to as sanitization in [2]; it blocks inference of sensitive rules, which allows collaborators to mine their own data independently and then share some of the resulting patterns. The above work
  • 2. concentrates on hiding frequent itemsets in databases based on the support and/or confidence framework. In many situations, it is more convenient to share the information in the form of structured patterns such as graphs, trees and lattices instead of sharing the entire databases. In structured patterns, when a particular sensitive pattern is to be removed, its supersets and subsets should also be removed in order to block the forward and backward inferences. The work presented in [7] proposes an algorithm (DSA) that sanitizes sensitive information in graphs and compares its efficiency with that of a Naïve approach. Naïve blocks only the forward inference attack, whereas DSA blocks both forward and backward inference attacks. In DSA, however, the subsets of the sensitive pattern are chosen at random. This can lead to the removal of more patterns than necessary: when a subset pattern is removed, the other patterns associated with it must also be removed, since otherwise they would leave a forward trace from which the counterpart could infer the details of the hidden pattern. Hence, this random removal reduces the data utility of the source dataset. To overcome this problem, this article presents an algorithm (RSS) for sanitizing sensitive information in a structured pattern tree that uses a rank function to reduce the computational complexity and the loss of legitimate information. In contrast to the random selection in DSA, this algorithm blocks the forward and backward inference attacks by removing the sensitive information and its associated information in an optimized way, by means of the rank function.
3. BASIC DEFINITIONS
Tree: A tree is a finite set of one or more nodes such that there is a specially designated node called the root and the remaining nodes are partitioned into n ≥ 0 disjoint sets T1, …, Tn, where each of these sets is a tree. The sets T1, …, Tn are called the subtrees of the root[13].
Set-Enumeration (SE)-tree: A tool for representing and/or enumerating sets in a best-first fashion. The complete SE-tree systematically enumerates the elements of a power set using a pre-imposed order on the underlying set of elements.
Structured Pattern Tree: A structured pattern tree, denoted by T=(N, L), consists of a nonempty set of frequent itemsets N and a set of links L that are ordered pairs of the elements of N such that, for all nodes a, b ∈ N, there is a link from a to b if a ∩ b = a and |b| - |a| = 1, where |x| is the size of the itemset x.
Level: Let T=(N, L) be a structured pattern (frequent itemset) tree. The level k of an itemset x ∈ N is the length of the path connecting a 1-itemset (usually at level-0) to x.
Height: Let T=(N, L) be a structured pattern tree. The height h of T is the length of the maximum path connecting a 1-itemset a with any other itemset b, such that a, b ∈ N and a ⊂ b.
Delete: The deletion of a node x from Ti is denoted as Del(x). The resulting Ti' is the same as Ti without the node x. In particular, if p1, ..., pm, x, s1, ..., sn is the sibling sequence in a level of Ti, then p1, ..., pm, s1, ..., sn is the sibling sequence in Ti'.
Negative Border Nodes: Nodes that have the property that all of their proper subsets are frequent.
Problem state: Each node in the tree is a problem state.
R-nodes: Nodes that are sensitive and are to be restricted before the structured pattern is allowed to be shared.
P-nodes: Predecessor nodes (subsets) of R-nodes, which are to be identified (in order to block forward inference) before selecting the particular nodes that are to be deleted.
  • 3. S-nodes: Successor nodes (supersets) of R-nodes, which are to be identified (in order to block backward inference) before selecting the particular nodes that are to be deleted.
Victim states: Problem states for which the path from the P-node(s) at level-1 (containing 2-itemsets) to the R-node and/or to the S-node(s) is to be searched to select nodes for deletion.
V-node(s): Node(s) to be deleted selectively (based on the rank function) among the victim states.
Rank function - r(.): Choose the P-node (of the R-node) that leads to only one S-node (with a single primary link); choose one at random when a tie occurs. The search for victim nodes can often be sped up by applying the ranking function r(.) to all P-nodes. Ideally, ranks would be assigned on the basis of the minimum additional computational cost incurred when the P-node is removed.
4. ALGORITHM
Rank-based Structured-pattern Sanitization (RSS):
Input: Frequent Pattern Tree (T), Set of Restricted Nodes (R-nodes)
Output: Sanitized Tree (T')
Begin
    Obtain height h of the input tree;
    identify ri ∈ R (R-nodes to be restricted);
    //Select victim nodes//
    for each ri ∈ R {
        find level k of ri;
        V_nodes[ ] ← ri;
        if k > 0 {
            for j ← k+1 to h {
                obtain S-nodes of ri at level j (i.e. supersets);
                V_nodes[ ] ← V_nodes[ ] + S-nodes(ri);
            }
            for j ← k-1 down to 1 {
                obtain P-nodes of ri at level j (i.e. subsets);
                V_nodes[ ] ← V_nodes[ ] + P-node that satisfies r(.) + all its successors;
            }
        }
        delete V_nodes[ ];
    }
    T' ← T;
End
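The RSS steps above can be sketched in Python as follows. This is only an illustrative reading of the pseudocode, not the authors' implementation: the function names (`supersets`, `rank`, `rss`) are hypothetical, nodes are modelled as `frozenset`s of single-character items, and the rank r(.) is assumed to be "the number of additional nodes lost if this P-node is removed", with ties broken by item order as a deterministic stand-in for the random choice.

```python
from itertools import combinations


def supersets(node, nodes):
    """All proper supersets of `node` present in the tree (its successors)."""
    return {n for n in nodes if node < n}


def rank(p_node, nodes, victims):
    """Assumed rank r(.): how many extra nodes are lost if `p_node` goes.

    Lower is better. Ties are broken by sorted item order, a deterministic
    stand-in for the random tie-break described in the text.
    """
    extra = ({p_node} | supersets(p_node, nodes)) - victims
    return (len(extra), sorted(p_node))


def rss(nodes, r_nodes):
    """Return the sanitized node set T' for the frequent itemsets `nodes`."""
    nodes = set(nodes)              # work on a copy; the input stays intact
    for r in r_nodes:
        if r not in nodes:          # already removed for an earlier R-node
            continue
        victims = {r}
        level = len(r) - 1          # 1-itemsets sit at level 0
        if level > 0:               # mirrors the `if k > 0` guard
            # S-nodes: every superset is deleted (blocks backward inference).
            victims |= supersets(r, nodes)
            # One P-node per preceding level, never level 0 (1-itemsets).
            for size in range(len(r) - 1, 1, -1):
                p_nodes = [frozenset(c)
                           for c in combinations(sorted(r), size)
                           if frozenset(c) in nodes]
                if not p_nodes:
                    continue
                best = min(p_nodes, key=lambda p: rank(p, nodes, victims))
                # The chosen P-node is deleted with all its successors.
                victims |= {best} | supersets(best, nodes)
        nodes -= victims
    return nodes
```

Under these assumptions, calling `rss` on the 23 non-root itemsets of the sample tree in Section 5 with `abc` as the only R-node removes `abc`, `abcd`, `ac` and `acd`.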
  • 4. 5. ILLUSTRATION
A sample frequent pattern SE-tree is given in Fig.1. Let the node to be protected (R-node) be the one with itemset abc (dark-filled in Fig.2). When it is marked for deletion (V-node), it is conventionally treated as infrequent. By the anti-monotone property of frequent itemsets, if a set cannot pass a test, all of its supersets will fail the same test as well. Hence all of its supersets (S-nodes) are to be identified and deleted, up to the level that equals the height of the tree. In this example, node abcd (shaded) is the superset of abc and so it is marked for deletion (V-node). Deleting the R-node and its supersets hides the details of the sensitive data (restricted nodes) and ensures that backward inference of the R-node is blocked.
Moreover, the negative border nodes are also to be deleted to completely block forward inference of the sensitive data. This is achieved by identifying the predecessor nodes (P-nodes) of the R-node and removing them selectively by means of the rank function r(.) defined earlier. In this example, abc has two P-nodes with primary links, ab and ac; of these, ac (shaded) has only one primary link, so it is marked for deletion (V-node) together with all its successors (in this case, acd); see Fig.2. Finally, all victim nodes are deleted, which yields the sanitized frequent pattern tree to be shared (Fig.3), and hence forward inference is also blocked. Had ab been chosen as the victim node instead, three more nodes would have been removed, resulting in more information loss and utility loss. Hence the rank function used in this approach sanitizes the structured frequent pattern tree with reduced information loss and utility loss.
However, the nodes at level-0 (1-itemsets) are never deleted, which preserves the distinct items in the given structured frequent pattern tree.
FIGURE 1: Frequent Pattern Tree before Sanitization. Nodes by level: ɸ (root); a, b, c, d, e; ab, ac, ad, ae, bc, bd, be, cd, de; abc, abd, abe, acd, ade, bcd, bde; abcd, abde. Legend: Primary Link, Secondary Link.
  • 5. FIGURE 2: Frequent Pattern Tree with Victim Nodes. Nodes by level: ɸ (root); a, b, c, d, e; ab, ac, ad, ae, bc, bd, be, cd, de; abc, abd, abe, acd, ade, bcd, bde; abcd, abde.
FIGURE 3: Frequent Pattern Tree after Sanitization. Nodes by level: ɸ (root); a, b, c, d, e; ab, ad, ae, bc, bd, be, cd, de; abd, abe, ade, bcd, bde; abde.
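To make the link structure behind Figures 1 to 3 concrete, the short sketch below rebuilds the links of the sample tree from the Structured Pattern Tree definition (a ∩ b = a and |b| - |a| = 1) and lists the immediate subsets and supersets of abc. The helper names are hypothetical, and `is_primary` encodes an assumption: that a primary link is the usual SE-tree prefix extension, where the added item is ordered after every item already in the node. The paper does not spell this rule out.

```python
# Non-root nodes of the sample tree in Fig.1, one string per itemset.
FIG1 = ("a b c d e ab ac ad ae bc bd be cd de "
        "abc abd abe acd ade bcd bde abcd abde")


def build_links(nodes):
    """Links (a, b) with a ∩ b = a and |b| - |a| = 1, as in the definition."""
    return {(a, b) for a in nodes for b in nodes
            if a & b == a and len(b) - len(a) == 1}


def is_primary(a, b):
    """Assumed SE-tree rule: b extends a by an item ordered after all of a."""
    (extra,) = b - a
    return extra > max(a)


def label(itemset):
    """Render a frozenset of items as a compact string such as 'abc'."""
    return "".join(sorted(itemset))


nodes = {frozenset(x) for x in FIG1.split()}
links = build_links(nodes)
r = frozenset("abc")

# Immediate subsets and supersets of the R-node.
p_nodes = sorted(label(a) for a, b in links if b == r)
s_nodes = sorted(label(b) for a, b in links if a == r)

# Number of primary links leaving each immediate subset.
primary_out = {
    p: sum(1 for a, b in links if a == frozenset(p) and is_primary(a, b))
    for p in p_nodes
}

print(p_nodes)      # ['ab', 'ac', 'bc']
print(s_nodes)      # ['abcd']
print(primary_out)  # {'ab': 3, 'ac': 1, 'bc': 1}
```

Under this reading, ab carries three primary links and ac only one, which matches the choice of ac in the illustration. The plain subset test also returns bc, which the illustration does not discuss.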
  • 6. 6. EXPERIMENTAL ANALYSIS
The algorithm was tested on the synthetic benchmark dataset T10I4D100K[14], with the number of transactions ranging from 1K to 10K and the number of restricted nodes from 1 to 5. The test run was made on an Intel Core i5 processor at 2.3 GHz with 4 GB RAM running a 32-bit OS; the proposed algorithm was implemented with Windows 7, NetBeans 6.9.1 and SQL 2005. The frequent patterns were obtained using Matrix Apriori[15], which requires only two scans of the original database and uses simpler data structures. The efficiency of this approach is studied using the measures given below, and it is compared (Figures 4 to 7) with the previously proposed algorithms IMA, PMA and TMA[9-12], which sanitize the sensitive patterns (itemsets) in the source datasets.
Dissimilarity (dif): The dissimilarity between the original (D) and sanitized (D') databases is measured in terms of their contents by the formula
$\mathrm{dif}(D, D') = \frac{1}{\sum_{i=1}^{n} f_D(i)} \times \sum_{i=1}^{n} \left[ f_D(i) - f_{D'}(i) \right]$,
where $f_X(i)$ represents the frequency of the i-th item in the dataset X and n is the number of distinct items in the original dataset. This approach has a very low percentage of dissimilarity, which shows that the information loss is very low and so the utility is well preserved. From Fig.4 & 5, it is observed that the proposed algorithm, RSS, has very low dissimilarity in comparison with the previous algorithms. However, as the number of transactions increases, the dissimilarity increases; this is due to the removal of subsets (with their associated nodes) of the sensitive nodes for blocking the forward inference attack, and it is observed to remain below 5%.
CPU Time: The execution time of the proposed algorithm is tested by varying the number of nodes to be restricted. Fig.6 & 7 show that the execution time required by the RSS algorithm is low in comparison with the other algorithms. It is also observed that the execution time is minimum when the number
of transactions in the source dataset is larger. However, time is not a significant criterion, as the sanitization is done offline.
FIGURE 4: Dissimilarity (varying no. of rules).
FIGURE 5: Dissimilarity (varying no. of transactions).
FIGURE 6: Execution Time (varying no. of rules).
FIGURE 7: Execution Time (varying no. of transactions).
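The dissimilarity measure used in this section can be sketched as below. The code assumes the commonly used form of the measure, in which the per-item frequency differences between D and D' are summed and divided by the total item frequency in D; the function names and the toy transactions are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def item_frequencies(db):
    """Count, for every item, the number of transactions that contain it."""
    return Counter(item for transaction in db for item in set(transaction))


def dissimilarity(original, sanitized):
    """dif(D, D') = sum_i [f_D(i) - f_D'(i)] / sum_i f_D(i) (assumed form)."""
    f_d = item_frequencies(original)
    f_s = item_frequencies(sanitized)
    total = sum(f_d.values())
    if total == 0:
        return 0.0
    # A Counter returns 0 for items that no longer occur in D'.
    return sum(f_d[i] - f_s[i] for i in f_d) / total


# Toy data: item c is removed from two of the four transactions.
D = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
D_prime = [{"a", "b"}, {"a", "c"}, {"b", "d"}, {"a", "b", "d"}]

print(f"{dissimilarity(D, D_prime):.2%}")
```

In the toy data, D holds 11 item occurrences and D' holds 9, so the measure is 2/11, roughly 18%. A value of 0 means that the sanitized data has the same item frequencies as the original.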
  • 7. Scalability: To hide sensitive knowledge in patterns effectively, sanitization algorithms must be efficient and scalable, which means that the running time of any sanitizing algorithm must be predictable and acceptable. The efficiency and scalability of the proposed approach are established below.
Theorem: The running time of the Rank-based Structured-pattern Sanitization (RSS) approach is O[r(l+s)], where r is the number of restrictive nodes (R-nodes), l is the number of preceding levels that have subsets of R-nodes and s is the number of S-nodes (supersets) of R-nodes.
Proof: Let T be a given structured frequent pattern tree with N nodes in total; let r be the number of sensitive nodes (R-nodes) to be restricted among the N nodes; let l be the number of preceding levels of R-nodes and s the number of S-nodes (supersets) of R-nodes in T. The proposed approach first finds the height of the given tree. For every given R-node, it finds the victim states, which are the collection of its S-nodes (supersets) and those P-nodes (subsets) that lead to their own successors through only one primary link. By the anti-monotone property, all s S-nodes are victim nodes and are deleted to block backward inferences. Among the P-nodes (subsets), at each preceding level (other than level-0) the node (subset) that forms a single primary link to its successors is obtained and deleted (with all its successors) in order to block all forward inferences; this selection process is quite straightforward and is repeated for all R-nodes. The algorithm uses both depth-wise and breadth-wise search, which requires O(l+s) computation for every R-node. Hence, the running time of the proposed algorithm for r R-nodes is O[r(l+s)], which is linear and better than O(n^2), O(n^3), O(2^n) and O(n log n).
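As a worked instance of the bound, consider the illustration of Section 5 under the theorem's own definitions: there is one R-node (abc), one preceding level other than level-0 (level-1, holding the 2-itemsets) and one S-node (abcd). The count below is only a reading of the stated parameters, not a measured cost.

```latex
r = 1,\quad l = 1,\quad s = 1
\;\Longrightarrow\;
r\,(l + s) = 1 \cdot (1 + 1) = 2
```

The search for victim states in that example is therefore bounded by a constant multiple of two steps, independent of the total number of nodes N in the tree.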
7. CONCLUSION
The algorithm proposed in this work sanitizes the structured frequent pattern tree in an optimal way, using a rank function that reduces the computational complexity as well as the information loss and utility loss. Moreover, this approach blocks all the inference channels of the restrictive patterns in both forward and backward directions, leaving no trace of the nodes that are restricted (removed) before sharing. This process facilitates the task of sanitizing the structured pattern with a different set of restricted information when it is to be shared between different sets of collaborators. However, when the database is large, it is sometimes unrealistic to construct a main-memory-based pattern tree. The proposed algorithm sanitizes patterns in a static dataset, and the sanitization is done offline because the decision analysis of the restricted rules is done offline. Further effort is being made to apply an optimized heuristic approach to sanitize continuous and dynamic datasets.
8. REFERENCES
[1] J.Han, M.Kamber, Data Mining Concepts and Techniques, Oxford University Press, 2009.
[2] M.Atallah, E.Bertino, A.Elmagarmid, M.Ibrahim and V.Verykios, “Disclosure Limitation of Sensitive Rules”, Proc. of IEEE Knowledge and Data Engineering Workshop, pages 45–52, Chicago, Illinois, Nov 1999.
[3] E.Dasseni, V.S.Verykios, A.K.Elmagarmid & E.Bertino, “Hiding Association Rules by Using Confidence and Support”, Proc. of the 4th Information Hiding Workshop, pages 369–383, Pittsburgh, PA, Apr 2001.
[4] Y.Saygin, V.S.Verykios, and C.Clifton, “Using Unknowns to Prevent Discovery of Association Rules”, SIGMOD Record, 30(4):45–54, Dec 2001.
  • 8. [5] S.R.M.Oliveira, and O.R.Zaiane, “Privacy preserving Frequent Itemset Mining”, Proc. of the IEEE ICDM Workshop on Privacy, Security, and Data Mining, pages 43-54, Maebashi City, Japan, Dec 2002.
[6] S.R.M.Oliveira, and O.R.Zaiane, “An Efficient One-Scan Sanitization for Improving the Balance between Privacy and Knowledge Discovery”, Technical Report TR 03-15, Jun 2003.
[7] S.R.M.Oliveira, and O.R.Zaiane, “Secure Association Rule Mining”, Proc. of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’04), pages 74-85, Sydney, Australia, May 2004.
[8] B.Yildz, and B.Ergenc, “Hiding Sensitive Predictive Frequent Itemsets”, Proc. of the International MultiConference of Engineers and Computer Scientists 2011, Vol-I.
[9] P.Cynthia Selvi, A.R.Mohamed Shanavas, “An effective Heuristic Approach for Hiding Sensitive Patterns in Databases”, International Organization of Scientific Research-Journal of Computer Engineering (IOSRJCE) Vol. 5, Issue 1 (Sep-Oct 2012), pages 06-11.
[10] P.Cynthia Selvi, A.R.Mohamed Shanavas, “An Improved Item-based Maxcover Algorithm to protect Sensitive Patterns in Large Databases”, International Organization of Scientific Research-Journal of Computer Engineering (IOSRJCE) Vol. 14, Issue 4, Oct 2013, pages 1-5.
[11] P.Cynthia Selvi, A.R.Mohamed Shanavas, “Output Privacy Protection With Pattern-Based Heuristic Algorithm”, International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 2, Apr 2014, pages 141-152.
[12] P.Cynthia Selvi, A.R.Mohamed Shanavas, “Towards Information Privacy Using Transaction-Based Maxcover Algorithm”, World Applied Sciences Journal 29 (Data Mining and Soft Computing Techniques): 06-11, 2014.
[13] Ellis Horowitz, Sartaj Sahni, Sanguthevar Rajasekaran, Fundamentals of Computer Algorithms, Galgotia Pub. Pvt. Ltd, Delhi, 1999.
[14] The dataset used in this work for experimental analysis was generated using the generator from the IBM Almaden Quest research group and is publicly available from http://fimi.ua.ac.be/data/.
[15] J.Pavon, S.Viana, S.Gomez, “Matrix Apriori: speeding up the search for frequent patterns”, Proc. 24th IASTED International Conference on Databases and Applications 2006, pp. 75-82.