International Journal of Computational Engineering Research||Vol, 03||Issue, 8||
||Issn 2250-3005 || ||August||2013|| Page 43
Protecting Frequent Item sets Disclosure in Data Sets and
Preserving Item Sets Mining
Shaik Mahammad Rafi (1), M. Suman, M.Tech (2), Majjari Sudhakar, M.Tech (3), P. Ramesh (4),
P. Venkata Ramanaiah, M.Tech (5)
(1) PG (M.Tech) Student in CSE Department, Global College of Engineering and Technology, Kadapa, YSR (Dt.).
(2,3,4) Assistant Professors in CSE Department, MRRITS, Udayagiri, SPSR Nellore (Dt.).
(5) Assistant Professor in CSE Department, GCET, Kadapa, YSR (Dt.).
I. INTRODUCTION
Data mining technologies have become important for discovering previously unknown and potentially
useful information in large data sets or databases. They can be applied to various domains, such as Web
commerce, crime investigation, health care, and customer consumption analysis. Although these technologies
are useful, they also pose a threat to data privacy. For example, association rule analysis is a powerful and
popular tool for discovering relationships hidden in large data sets, so some private information could easily be
discovered with this kind of tool. Protecting the confidentiality of sensitive information in a database has
therefore become a critical issue.
The relationships discovered from a database can be represented in the form of frequent itemsets or
association rules. A rule is categorized as sensitive if its disclosure risk is above a given threshold. With an
association analyzer, an itemset whose support is at or above a given minimum support is called a frequent
itemset.
The problem of finding an optimal sanitization of a source database under association rule analysis has
been proven to be NP-hard [1]. In [2,3,4,5], the authors presented different heuristic algorithms that modify
transactions by inserting or deleting items in order to hide sensitive rules or itemsets.
Vassilios S. Verykios et al. [2] presented algorithms to hide sensitive association rules, but they
generate many side effects and require multiple database scans. Instead of hiding sensitive association rules,
Shyue-Liang Wang [3] proposed algorithms to hide sensitive items. These algorithms need fewer database
scans, but they generate more side effects. Ali Amiri [4] also presented heuristic algorithms to hide
sensitive items. Finally, Yi-Hung Wu et al. [5] proposed a heuristic method that can hide sensitive association
rules with limited side effects. However, it spends a lot of time comparing and checking whether the sensitive
rules are hidden and whether side effects are produced. Moreover, it can fail to hide some sensitive rules.
ABSTRACT:
With the spread of networking and data mining techniques, protecting the confidentiality of
sensitive information in a database has become a critical issue. Association analysis is a
powerful and popular tool for discovering relationships hidden in large data sets. The relationships
can be represented in the form of frequent itemsets or association rules. A rule is categorized as
sensitive if its disclosure risk is above a given threshold. Privacy-preserving data mining is an
important issue that arises in various domains, such as Web commerce, crime investigation,
health care, and customer consumption analysis.
The main approach to hiding sensitive frequent itemsets is to reduce the support of each given
sensitive itemset. This is done by modifying transactions or items in the database. However, the
modifications generate side effects, i.e., nonsensitive frequent itemsets that are falsely hidden (the lost
itemsets) and spurious frequent itemsets that are falsely generated (the new itemsets). There is a trade-off
between the sensitive frequent itemsets hidden and the side effects generated. Furthermore, solving the
problem typically requires a large amount of computing time.
In this study, we propose a novel algorithm, FHSFI, for fast hiding of sensitive frequent
itemsets (SFI). FHSFI achieves the following goals: 1) all SFI can be completely hidden
without generating all frequent itemsets; 2) limited side effects are generated; 3) any minimum
support threshold is allowed; and 4) only one database scan is required.
In this study, we propose a novel algorithm, FHSFI, for fast hiding of sensitive frequent itemsets (SFI).
FHSFI achieves the following goals: 1) all SFI can be completely hidden without generating all
frequent itemsets; 2) limited side effects are generated; 3) any minimum support threshold is allowed; and 4)
only one database scan is required.
The remainder of this paper is organized as follows. Section 2 presents the problem formulation and
notation. Section 3 introduces the proposed algorithm for fast hiding of sensitive frequent itemsets and gives
examples to illustrate it. Section 4 reports the experimental results, which show the performance and the side
effects of the proposed algorithm. Section 5 gives the conclusions and further work.
II. PROBLEM FORMULATION AND NOTATIONS
In Table 1, we summarize the notation used in this paper. Let I = {i1, i2, ..., im} be the set of items in a
transaction database D = {t1, t2, …, tn}, where every transaction ti is a subset of I, i.e., ti ⊆ I. An example
database is shown in Table 2. Let X be a set of items in I. If X ⊆ ti, we say that the transaction ti supports X.
In the example there are nine items, |I| = 9, and five transactions, |D| = 5, in the database.
Table 1. Definitions of variables used in this paper
Variable   Definition
D          the original database
D'         the released database which is transformed from D
U          the set of frequent itemsets generated from D
U'         the set of frequent itemsets generated from D'
ti         a transaction in database D
|ti|       the number of items in ti
TID        a unique identifier of each transaction
SFI        the set of sensitive frequent itemsets to be hidden
SFI.tj     a sensitive frequent itemset in the SFI
||.||      the support count of an itemset, i.e., the number of transactions that support the itemset
wi         prior weight of ti
PWT        a table for storing TID and wi for each transaction in an order decreasing by wi
MICi       the maximal number of itemsets in SFI that contain an item ik, where ik ∈ ti, SFI.tj ⊆ ti
SFI.t.i    transaction to be modified
The support of an itemset X can be computed by equation (1). An association rule is an implication of the
form X→Y, where X ⊂ I, Y ⊂ I and X ∩ Y = Ø. A rule X→Y will be extracted from a database if
1) support(X∪Y) ≥ min_support (a given minimum support threshold) and
2) confidence(X∪Y) ≥ min_confidence (a given minimum confidence threshold),
where support(X∪Y) and confidence(X∪Y) are given by equations (2) and (3).
support(X) = ||X|| / |D| (1)
support(X∪Y) = ||X∪Y|| / |D| (2)
confidence(X∪Y) = ||X∪Y|| / ||X|| (3)
Table 2. Database D
TID   Transaction
1     1,2,4,5,7
2     1,4,5,7
3     1,4,6,7,8
4     1,2,5,9
5     6,7,8
Table 3. Frequent itemsets
Itemset   Support
1         80%
4         60%
5         60%
7         80%
1,4       60%
1,5       60%
1,7       60%
1,4,7     60%
4,7       60%
In equation (1), ||X|| denotes the number of transactions in the database that contain the itemset X, and
|D| denotes the number of transactions in the database D. If support(X) ≥ min_support, we call X a
frequent itemset. Table 3 shows the frequent itemsets for min_support = 60%.
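As an illustration of equation (1) and the frequent-itemset definition, the following is a minimal Python sketch that enumerates candidate itemsets of the Table 2 database by brute force. The function names `support_count` and `frequent_itemsets` are chosen here for illustration and do not come from the paper; a real miner would use a dedicated algorithm instead of exhaustive enumeration.

```python
from fractions import Fraction
from itertools import combinations

# Example database from Table 2: TID -> set of items.
D = {
    1: {1, 2, 4, 5, 7},
    2: {1, 4, 5, 7},
    3: {1, 4, 6, 7, 8},
    4: {1, 2, 5, 9},
    5: {6, 7, 8},
}

def support_count(itemset, db):
    """||X||: number of transactions in db that contain every item of itemset."""
    needed = set(itemset)
    return sum(1 for t in db.values() if needed <= t)

def frequent_itemsets(db, min_support):
    """Brute-force enumeration of all itemsets X with support(X) >= min_support."""
    items = sorted(set().union(*db.values()))
    found = {}
    for size in range(1, len(items) + 1):
        level = {}
        for candidate in combinations(items, size):
            support = Fraction(support_count(candidate, db), len(db))
            if support >= min_support:
                level[candidate] = support
        if not level:
            # No frequent itemset of this size, so none of a larger size exists.
            break
        found.update(level)
    return found

frequent = frequent_itemsets(D, Fraction(60, 100))
for itemset, support in frequent.items():
    print(itemset, f"{float(support):.0%}")
```

Exact fractions are used so that the comparison with a threshold such as 60% is not affected by floating-point rounding.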
For example, for X = {1,4,7}, since X ⊆ t1, X ⊆ t2 and X ⊆ t3, we obtain ||X|| = 3. Therefore,
support({1,4,7}) = 60%. Using the form X→Y (support, confidence) for association rules, the rules generated
from the itemset {1,4,7} are 1→4,7 (60%,75%), 4→1,7 (60%,100%), 7→1,4 (60%,75%),
1,4→7 (60%,100%), 1,7→4 (60%,100%) and 4,7→1 (60%,100%).
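The rule figures above can be checked with a short Python sketch that applies equations (2) and (3) to every split of an itemset into a left-hand and a right-hand side. The helper name `rules_from` is an assumption made for this illustration; confidence is computed as the support count of the whole itemset divided by the support count of the left-hand side.

```python
from fractions import Fraction
from itertools import combinations

# Transactions of Table 2, in TID order.
D = [
    {1, 2, 4, 5, 7},
    {1, 4, 5, 7},
    {1, 4, 6, 7, 8},
    {1, 2, 5, 9},
    {6, 7, 8},
]

def support_count(itemset, db):
    """||X||: number of transactions that contain every item of itemset."""
    needed = set(itemset)
    return sum(1 for t in db if needed <= t)

def rules_from(itemset, db):
    """Map (lhs, rhs) -> (support, confidence) for every rule lhs -> rhs
    whose two sides are non-empty, disjoint, and together equal itemset."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, db)
    rules = {}
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            rhs = tuple(sorted(itemset - set(lhs)))
            support = Fraction(whole, len(db))
            confidence = Fraction(whole, support_count(lhs, db))
            rules[(lhs, rhs)] = (support, confidence)
    return rules

rules = rules_from({1, 4, 7}, D)
for (lhs, rhs), (support, confidence) in rules.items():
    print(lhs, "->", rhs, f"({float(support):.0%}, {float(confidence):.0%})")
```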
Figure 1 shows the relationships among the sets U, U', and SFI. The goal of the study is to hide all SFI
while minimizing the lost itemsets. That is, U'∩SFI = Ø, and the set U–U'–SFI should be minimized.
Figure 1. The relationships among the sets U, U', and SFI
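The set relationships above translate directly into set operations. The following Python sketch is only an illustration: the function name `side_effects`, the choice of {1,4,7} as the single sensitive itemset, and the hand-picked modification (deleting item 7 from transaction 1 of Table 2, with min_support = 60%) are all assumptions made for this example and are not the output of FHSFI.

```python
def side_effects(U, U_prime, SFI):
    """Compare the frequent itemsets before (U) and after (U_prime) sanitization."""
    U, U_prime, SFI = set(U), set(U_prime), set(SFI)
    return {
        "not_hidden": U_prime & SFI,  # must be empty: U' ∩ SFI = Ø
        "lost": U - U_prime - SFI,    # nonsensitive itemsets falsely hidden
        "new": U_prime - U,           # spurious itemsets falsely generated
    }

def fs(*items):
    return frozenset(items)

# U: the frequent itemsets of Table 3 (min_support = 60%).
U = {fs(1), fs(4), fs(5), fs(7), fs(1, 4), fs(1, 5), fs(1, 7), fs(4, 7), fs(1, 4, 7)}
# Assumed sensitive set for this illustration only.
SFI = {fs(1, 4, 7)}
# U': frequent itemsets after deleting item 7 from transaction 1 (hand-picked change).
U_prime = {fs(1), fs(4), fs(5), fs(7), fs(1, 4), fs(1, 5)}

report = side_effects(U, U_prime, SFI)
print(report)
```

In this hand-made case the sensitive itemset is hidden, but {1,7} and {4,7} are lost as a side effect, which is the quantity the study aims to minimize.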
III. THE PROPOSED ALGORITHM
We now present the FHSFI algorithm. Given D, SFI, and min_support, the algorithm generates a
database to be released, D', in which the sensitive frequent itemsets are hidden and the side effects
generated are minimized.
The sketch of the FHSFI algorithm is shown in Figure 2 and can be described in the following stages.
Stage 2 modifies transactions one by one until all SFI have been hidden. The order of the
modifications follows the prior weight associated with each transaction. The following tasks are
repeated until SFI is empty.
• Select a transaction tk from PWT such that wk is maximal.
• Select the item to be deleted according to the heuristic shown in Figure 4, and delete it.
• Recompute wk after each item is modified, and reinsert tk into the PWT so that the order is maintained.
• Subtract 1 from ||SFI.tj|| if SFI.tj contains the deleted item and is supported by tk.
• Remove SFI.tj from SFI if (||SFI.tj|| / |D|) < min_support.
Figure 3. The correlation between t1 and SFI
        If ||SFI.tj|| / |D| < min_support then
            Remove SFI.tj from SFI;
    End;
End;
Figure 2. The pseudo code of the FHSFI algorithm
Table 4. An example of sensitive frequent itemsets, SFI
No.   Itemset
1     1,2,5
2     1,4,7
3     1,5,7
4     6,8
Table 5. The support count for each itemset in SFI
No.   Itemset   ||.||
1     1,2,5     2
2     1,4,7     3
3     1,5,7     2
4     6,8       2
In stage 1, FHSFI scans the database once and collects, for each transaction, all the useful information
about its correlation with SFI, including ||SFI.tj|| and wi. ||SFI.tj|| is used to check whether SFI.tj has been
hidden. wi is the prior weight of a transaction ti, which provides a heuristic for estimating side effects and
can be computed by equation (4).
wi = 1 / [2^(|ti| - 1) / MICi] = MICi / 2^(|ti| - 1) (4)
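Equation (4) can be evaluated with exact fractions, as in the minimal Python sketch below. The function name `prior_weight` is an assumption made for this illustration, and the (|ti|, MICi) pairs are the values listed for the example database in Table 6.

```python
from fractions import Fraction

def prior_weight(length, mic):
    """Equation (4): wi = MICi / 2^(|ti| - 1), for a transaction with |ti| >= 1 items."""
    return Fraction(mic, 2 ** (length - 1))

# (|ti|, MICi) for each transaction of the example database, keyed by TID.
table6 = {1: (5, 3), 2: (4, 2), 3: (5, 1), 4: (4, 1), 5: (3, 1)}

weights = {tid: prior_weight(length, mic) for tid, (length, mic) in table6.items()}
for tid, w in weights.items():
    print(tid, w)
```

The weight grows with MICi, so a single deletion can lower the support of more sensitive itemsets, and shrinks as the transaction gets longer.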
Table 4 shows an example of sensitive frequent itemsets. Let t1 = {1,2,4,5,7}, which supports SFI.t1,
SFI.t2 and SFI.t3. As shown in Figure 3, the correlation between t1 and the SFI can be represented by a graph
G = <V,E>. Each node stands for an item ik in t1; the weight associated with each edge in E denotes the number
of itemsets in SFI that contain both nodes connected by the edge. Each node can be represented as
({SFI.tj | SFI.tj ⊆ ti, ik ∈ SFI.tj}, item_countSFI.t). For example, the node <{1,2,3}, 3> for item '1' indicates
that three itemsets in SFI contain the item '1', namely SFI.t1, SFI.t2, and SFI.t3. As shown in Figure 3, item
'1' has the maximum item_countSFI.t, which is equal to 3. Hence, we obtain MIC1 = 3 and w1 = 3/16.
Figure 4 shows the heuristic procedure for determining which item is to be modified and for computing MIC
for transaction ti.
Heuristic();
Input: TID, SFI;
Output: the item to be modified, MICi;
Begin
    For each SFI.tj in SFI do
    Begin
        If the transaction tTID fully supports SFI.tj then
        Begin
            For each item SFI.tj.i in SFI.tj do
                item_countSFI.t.i = item_countSFI.t.i + 1;
        End;
    End;
    Select the SFI.t.i with maximum item_count as the item of tTID to be modified;
    Return(SFI.t.i, item_count);
End;
Figure 4. The pseudo code of the heuristic procedure
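The heuristic procedure can be sketched in Python roughly as follows. This is an illustrative reading of the pseudo code, not the authors' implementation: the function name `heuristic` is hypothetical, and ties between items with the same count are broken here by choosing the smallest item so that the result is deterministic, whereas the paper's example selects among tied items at random.

```python
from collections import Counter

# Sensitive frequent itemsets of Table 4.
SFI = [frozenset({1, 2, 5}), frozenset({1, 4, 7}), frozenset({1, 5, 7}), frozenset({6, 8})]

def heuristic(transaction, sfi):
    """Return (item to delete, MIC) for one transaction.

    item_count[i] is the number of itemsets in sfi that are fully supported
    by the transaction and contain item i; MIC is the largest such count.
    Returns (None, 0) when the transaction supports no sensitive itemset.
    """
    transaction = set(transaction)
    item_count = Counter()
    for itemset in sfi:
        if set(itemset) <= transaction:
            item_count.update(itemset)
    if not item_count:
        return None, 0
    # Assumption: ties are broken by the smallest item identifier.
    item = min(item_count, key=lambda i: (-item_count[i], i))
    return item, item_count[item]

print(heuristic({1, 2, 4, 5, 7}, SFI))  # transaction t1 of Table 2
```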
Table 6. The MIC and prior weight for each transaction in D
TID   Transaction   |ti|   MIC   w
1     1,2,4,5,7     5      3     3/16
2     1,4,5,7       4      2     2/8
3     1,4,6,7,8     5      1     1/16
4     1,2,5,9       4      1     1/8
5     6,7,8         3      1     1/4
Table 7. The example PWT
No.   TID   w
1     2     2/8
2     5     1/4
3     1     3/16
4     4     1/8
5     3     1/16
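The ordering of the PWT can be reproduced with a short Python sketch. The function name `build_pwt` is hypothetical, and breaking ties between equal weights by the smaller TID is an assumption that happens to match the order shown in Table 7.

```python
from fractions import Fraction

# (|ti|, MICi) per TID, as listed in Table 6.
table6 = {1: (5, 3), 2: (4, 2), 3: (5, 1), 4: (4, 1), 5: (3, 1)}

def build_pwt(rows):
    """Return a list of (TID, wi) sorted by decreasing wi.

    Assumption: equal weights are ordered by increasing TID.
    """
    pwt = [(tid, Fraction(mic, 2 ** (length - 1))) for tid, (length, mic) in rows.items()]
    pwt.sort(key=lambda row: (-row[1], row[0]))
    return pwt

pwt = build_pwt(table6)
for position, (tid, w) in enumerate(pwt, start=1):
    print(position, tid, w)
```

Note that `Fraction` reduces 2/8 to 1/4, so transactions 2 and 5 carry the same weight and only the tie-breaking rule decides which is modified first.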
Table 8. Experiment results for |SFI|=5
|D|     CPU time (ms)   |U|   |U'|    #lost itemsets   #modified entries
5000    326.6           439   428.6   5.4              143
10000   454.2           417   406.4   5.6              307.2
15000   701             426   415.6   5.4              513
20000   905             442   431     6                711.6
25000   1183.6          432   421.2   5.8              902.8
30000   1502            443   432.4   5.6              863.8
We now use the following example to illustrate the proposed algorithm FHSFI.
Example 1. Let D and SFI be as shown in Tables 2 and 4, and let min_support = 40%. As shown in Table 5,
the support count for each SFI.tj can be obtained from D and SFI. For example, SFI.t2, {1,4,7}, is supported by
t1, t2, and t3, so ||SFI.t2|| = 3. Table 6 lists the length, MIC, and the prior weight for each transaction in the
database. The PWT, shown in Table 7, is obtained by sorting Table 6 in decreasing order of w. Then
the first transaction in PWT, i.e., t2, is chosen to be modified. According to the heuristic shown in Figure 4, the
item '1' in t2 is removed. Hence, ||SFI.t2|| and ||SFI.t3|| are each reduced by 1. SFI.t3 is removed from SFI
because (||SFI.t3|| / |D|) < min_support. The process is repeated until the SFI is empty. In the end, the FHSFI
algorithm removes the item '1' in t2, the item '6' or '8' in t5 (selected randomly), and the item '1' in t1. Now all
sensitive frequent itemsets in SFI have been hidden. ■
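The walk-through of Example 1 can be replayed with the Python sketch below. It is a simplified reading of the two stages under stated assumptions, not the authors' implementation: the names `heuristic`, `weight`, and `fhsfi_sketch` are hypothetical, the PWT is kept as a plain dictionary that is searched for its maximum instead of a sorted table, and ties are broken by the smallest item and the smallest TID, whereas the paper picks between items '6' and '8' at random.

```python
from collections import Counter
from fractions import Fraction

def heuristic(transaction, sfi):
    """Item contained in the most supported sensitive itemsets, and that count (MIC)."""
    counts = Counter()
    for itemset in sfi:
        if itemset <= transaction:
            counts.update(itemset)
    if not counts:
        return None, 0
    item = min(counts, key=lambda i: (-counts[i], i))  # assumed tie-break: smallest item
    return item, counts[item]

def weight(transaction, sfi):
    """Equation (4): MIC / 2^(|t| - 1); zero if no sensitive itemset is supported."""
    _, mic = heuristic(transaction, sfi)
    return Fraction(mic, 2 ** (len(transaction) - 1)) if mic else Fraction(0)

def fhsfi_sketch(db, sfi, min_support):
    db = {tid: set(t) for tid, t in db.items()}
    n = len(db)
    # Stage 1: support counts of the sensitive itemsets and prior weights.
    counts = {s: sum(1 for t in db.values() if s <= t) for s in sfi}
    remaining = {s: c for s, c in counts.items() if Fraction(c, n) >= min_support}
    pwt = {tid: weight(t, remaining) for tid, t in db.items()}
    deletions = []
    # Stage 2: modify transactions in order of decreasing prior weight.
    while remaining:
        tid = min(pwt, key=lambda k: (-pwt[k], k))  # assumed tie-break: smallest TID
        item, _ = heuristic(db[tid], remaining)
        if item is None:
            if pwt[tid] == 0:
                break  # no transaction can lower any remaining support count
            pwt[tid] = Fraction(0)  # stored weight was stale
            continue
        supported = [s for s in remaining if s <= db[tid]]
        db[tid].discard(item)
        deletions.append((tid, item))
        for s in supported:
            if item in s:
                remaining[s] -= 1
                if Fraction(remaining[s], n) < min_support:
                    del remaining[s]  # this sensitive itemset is now hidden
        pwt[tid] = weight(db[tid], remaining)
    return db, deletions

D = {1: {1, 2, 4, 5, 7}, 2: {1, 4, 5, 7}, 3: {1, 4, 6, 7, 8}, 4: {1, 2, 5, 9}, 5: {6, 7, 8}}
SFI = [frozenset({1, 2, 5}), frozenset({1, 4, 7}), frozenset({1, 5, 7}), frozenset({6, 8})]

released, deletions = fhsfi_sketch(D, SFI, Fraction(40, 100))
print(deletions)
```

With these tie-breaking assumptions the sketch deletes item 1 from t2, item 6 from t5, and item 1 from t1, which agrees with the sequence described in Example 1.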
IV. PERFORMANCE EVALUATION
We performed our experiments on a notebook with a 1.5 GHz processor and 512 MB of memory,
running the Windows XP operating system. The IBM data generator [11] was used to synthesize the databases
for the experiments. Databases with sizes 5K, 10K, 15K, 20K, 25K, and 30K were generated for the series of
experiments. Each generated database has an average transaction length of 10 and contains 50 items.
The minimum support threshold is 30%. The experimental results are averaged over 5
independent trials with different SFIs.
The performance of the FHSFI algorithm has been measured according to three criteria: CPU time
requirements, side effects produced, and the number of entries modified. Tables 8 and 9 present the
experimental results for |SFI|=5 and |SFI|=10, respectively.
The CPU time requirements, side-effect evaluation, and the number of entries modified for varied |D|
and |SFI| are shown in Figures 6, 7, and 8, respectively.
Table 9. Experiment results for |SFI|=10
Figure 6. CPU time requirements
Figure 7. The side-effect evaluation
Figure 8. The number of entries modified
The experimental results for FHSFI can be summarized as follows:
• As shown in Figure 6, the CPU time grows linearly with the size of the database and scales with the size
of SFI.
• The number of lost itemsets is independent of the size of the database but grows linearly with the size of
SFI, as can be seen in Figure 7.
• The number of modified entries depends on the size of the database and the size of SFI. However, since
the heuristic procedures are used to determine the order of modifications, we can observe in Figure 8 that
only a small part of the transactions in the database are modified. For |D|=10000, only 600 transactions are
modified to completely hide the 10 itemsets in SFI.
V. CONCLUSIONS AND FURTHER WORK
In this paper, we have presented the FHSFI algorithm for fast hiding of sensitive frequent itemsets
with limited side effects. The correlations between the sensitive itemsets and each transaction in the original
database are analyzed, and a heuristic function gives a prior weight for each transaction. The order of
transactions to be modified can then be decided efficiently from these weights. This reduces the
time spent on transactions whose modification does not help to hide the given sensitive frequent
itemsets. In other words, the number of transactions in D that have to be processed is also reduced.
Our approach has achieved the following goals: 1) all SFI can be completely hidden without
generating all frequent itemsets; 2) limited side effects are generated; 3) any minimum support threshold is
allowed; and 4) only one database scan is required. One of our goals in this research is to hide all SFI with
limited side effects, but our algorithm still causes some itemsets to be lost. We are currently considering
extensions of the algorithm to address this problem. Another direction is to apply the ideas introduced in this
paper to the fast hiding of sensitive association rules. These issues could be studied in the future.
REFERENCES
[1] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure limitation of sensitive rules," Knowledge and
Data Engineering Exchange, pp. 45-52, 1999.
[2] Vassilios S. Verykios, A.K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni, "Association Rule Hiding," IEEE Transactions on
Knowledge and Data Engineering, vol. 16, no. 4, pp. 434-447, 2004.
[3] Shyue-Liang Wang, "Hiding sensitive predictive association rules," Systems, Man and Cybernetics, 2005 IEEE International
Conference on Information Reuse and Integration, vol. 1, pp. 164-169, 2005.
[4] Ali Amiri, "Dare to share: Protecting sensitive knowledge with data sanitization," Decision Support Systems, vol. 43, issue 1,
pp. 181-191, 2007.
[5] Yi-Hung Wu, Chia-Ming Chiang, and Arbee L.P. Chen, "Hiding Sensitive Association Rules with Limited Side Effects," IEEE
Transactions on Knowledge and Data Engineering, vol. 19, issue 1, pp. 29-42, 2007.
Authors Profile
Mr. Shaik Mahammad Rafi: He was born in Rajampet, Kadapa, A.P., India in 1990. He is studying for an
M.Tech in the Department of Computer Science and Engineering at Global College of Engineering
and Technology, Kadapa. He received his Bachelor of Technology in Information Technology from
JNTUA University in 2011.
Mr. M. Suman: He was born in Rajampet, Kadapa, A.P., India in 1985. He received his Master of Technology
in Computer Science and Engineering from JNTUH University in 2013 at Vathsalya Institute
of Technology & Science, Anantharam Bhongiri, Nalgonda. He has guided many M.Tech students
in their thesis work and has contributed papers to research on image processing.
He is presently working as an Assistant Professor at Mekapati Rajamohan Reddy Institute of
Technology and Science, Udayagiri, SPSR Nellore. He received his Bachelor of Technology in Computer
Science & Engineering from JNTUH University in 2006 at Annamacharya Institute of
Technology & Science, Rajampet, Kadapa.
Mr. Majjari Sudhakar: He was born in Raghunathapuram, Badvel, Kadapa, A.P., India in
1983. He received his Master of Technology in Information Technology from JNTUH University in
2010 at Aurora's Scientific Technological & Research Academy, Bandlaguda, Hyderabad. He
has 3 years of teaching experience and has guided many M.Tech students in their thesis work.
He has also contributed papers to research on software engineering. He is
presently working as an Assistant Professor at Mekapati Rajamohan Reddy Institute of Technology
and Science, Udayagiri, SPSR Nellore. He received his Bachelor of Technology in Computer Science &
Engineering from JNTUH University in 2006 at Sri Venkateswara College of
Engineering & Technology, R.V.S Nagar, Chittur.
Mr. P. Ramesh: He was born in Naidupet, S.P.S.R. Nellore, A.P., India in 1989. He is studying for an
M.Tech in the Department of Computer Science and Engineering at Sir C.V. Raman Institute
of Technology & Science, Tadipatri, Anantapur. He received his Bachelor of Technology in Computer
Science & Engineering from JNTUA University in 2011.
Mr. P. Venkata Ramanaiah: He was born in Khajipet, Kadapa, A.P., India. He received his Master of
Technology in Computer Science and Engineering at Madanapalle Institute of Technology and
Science from JNTUA University in 2012. He has guided many M.Tech students in their
thesis work and has contributed papers to research on data mining.
He is presently working as an Assistant Professor at Global College of Engineering and
Technology (GCET), Kadapa.

More Related Content

PPTX
Lesson 1 introduction_to_time_series
PDF
IRJET-Handwritten Digit Classification using Machine Learning Models
PDF
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
PDF
Modeling Crude Oil Prices (CPO) using General Regression Neural Network (GRNN)
PDF
Comparative study of frequent item set in data mining
PPS
Ds 4
PDF
Average sort
PDF
Haoying1999
Lesson 1 introduction_to_time_series
IRJET-Handwritten Digit Classification using Machine Learning Models
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
Modeling Crude Oil Prices (CPO) using General Regression Neural Network (GRNN)
Comparative study of frequent item set in data mining
Ds 4
Average sort
Haoying1999

Viewers also liked (20)

PDF
International Journal of Computational Engineering Research(IJCER)
PDF
Ng2421772180
PDF
A new area and power efficient single edge triggered flip flop structure for ...
PDF
En34855860
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
PDF
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
PDF
Mp3422592263
PDF
Is3515091514
PDF
Ad4101172176
PDF
Fc36951956
PDF
Df34655661
PDF
A Single-Phase Clock Multiband Low-Power Flexible Divider
PDF
VHDL Implementation of Flexible Multiband Divider
PDF
H33038041
PDF
Design and Implementation of Digital PLL using Self Correcting DCO System
PDF
Design of multi bit multi-phase vco-based adc
PDF
J42046469
PDF
20120140505003 2-3
International Journal of Computational Engineering Research(IJCER)
Ng2421772180
A new area and power efficient single edge triggered flip flop structure for ...
En34855860
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
Mp3422592263
Is3515091514
Ad4101172176
Fc36951956
Df34655661
A Single-Phase Clock Multiband Low-Power Flexible Divider
VHDL Implementation of Flexible Multiband Divider
H33038041
Design and Implementation of Digital PLL using Self Correcting DCO System
Design of multi bit multi-phase vco-based adc
J42046469
20120140505003 2-3
Ad

Similar to International Journal of Computational Engineering Research(IJCER) (20)

PDF
F04713641
PDF
F04713641
PDF
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
PDF
An apriori based algorithm to mine association rules with inter itemset distance
PDF
Comparative analysis of association rule generation algorithms in data streams
PDF
Efficient Temporal Association Rule Mining
PDF
PDF
Output Privacy Protection With Pattern-Based Heuristic Algorithm
PDF
I43055257
PDF
Association Rule Hiding using Hash Tree
PDF
Ijcet 06 06_003
PDF
An Effective Heuristic Approach for Hiding Sensitive Patterns in Databases
PDF
A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET
PDF
IRJET- Effecient Support Itemset Mining using Parallel Map Reducing
PDF
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
PDF
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
PDF
Dy33753757
PDF
Dy33753757
PDF
Comparison Between High Utility Frequent Item sets Mining Techniques
F04713641
F04713641
A Survey on Identification of Closed Frequent Item Sets Using Intersecting Al...
An apriori based algorithm to mine association rules with inter itemset distance
Comparative analysis of association rule generation algorithms in data streams
Efficient Temporal Association Rule Mining
Output Privacy Protection With Pattern-Based Heuristic Algorithm
I43055257
Association Rule Hiding using Hash Tree
Ijcet 06 06_003
An Effective Heuristic Approach for Hiding Sensitive Patterns in Databases
A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET
IRJET- Effecient Support Itemset Mining using Parallel Map Reducing
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Dy33753757
Dy33753757
Comparison Between High Utility Frequent Item sets Mining Techniques
Ad

Recently uploaded (20)

PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Approach and Philosophy of On baking technology
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
Big Data Technologies - Introduction.pptx
PDF
Encapsulation theory and applications.pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
KodekX | Application Modernization Development
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Approach and Philosophy of On baking technology
Dropbox Q2 2025 Financial Results & Investor Presentation
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Understanding_Digital_Forensics_Presentation.pptx
Big Data Technologies - Introduction.pptx
Encapsulation theory and applications.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Chapter 3 Spatial Domain Image Processing.pdf
Electronic commerce courselecture one. Pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Unlocking AI with Model Context Protocol (MCP)
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
CIFDAQ's Market Insight: SEC Turns Pro Crypto
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
KodekX | Application Modernization Development
Advanced methodologies resolving dimensionality complications for autism neur...

International Journal of Computational Engineering Research(IJCER)

  • 1. International Journal of Computational Engineering Research||Vol, 03||Issue, 8|| ||Issn 2250-3005 || ||August||2013|| Page 43 Protecting Frequent Item sets Disclosure in Data Sets and Preserving Item Sets Mining 1, Shaik Mahammad Rafi. 2, M.Suman ,M.Tech 3, Majjari Sudhakar,M.Tech 4, P.Ramesh 5, P.Venkata Ramanaiah.,M.Tech 1 PG (M.Tech) Student in CSE Department, Global College of Engineering and Technology, Kadapa, YSR (D.t). 2,3,4 Assistant Professors in CSE Department, MRRITS ,Udayagiri,SPSR Nellore(D.t). 5 Assistant Professor in CSE Department, GCET, Kadapa,YSR(D.t). I. INTRODUCTION The data mining technologies have been an important technology for discovering previously unknown and potentially useful information from large data sets or databases. They canbe applied to various domains, such as Web commerce, crime reconnoitering, health care, and customer's consumption analysis. Although these are useful technologies, there is also a threat to data privacy. For example, the association rule analysis is a powerful and popular tool for discovering relationships hidden in large data sets. Therefore, some private information could be easily discovered by this kind of tools. The protection of the confidentiality of sensitive information in a database becomes a critical issue to be resolved. The relationships discovered from a database can be represented in a form of frequent itemsets or association rules. One rule is categorized as sensitive if its disclosure risk is above some given threshold. With an association analyzer, if an itemset with support above a given minimal support, we call the itemset as a frequent itemset. The problem for finding an optimal sanitization of a source database with association rule analysis has been proven to be NP-Hard [1]. In [2,3,4,5] the authors presented different heuristic algorithms that modify transactions via inserting or deleting items for hiding sensitive rules or itemsets. Vassilios S. Verykios et al. 
[2] presented algorithms to hide sensitive association rules, but they generate high side effects and require multiple database scans. Instead of hiding sensitive association rules, Shyue-Liang Wang [3] proposed algorithms to hide sensitive items. The algorithm needs less number of database scans but the side effects generated is higher. Ali Amiri [4] also presented heuristic algorithms to hide sensitive items. Finally, Yi-Hung Wu et al. [5] proposed a heuristic method that could hide sensitive association rules with limited side effects. However, it spent a lot of time on comparing and checking if the sensitive rules are hidden and if side effects are produced. Besides, it could fail to hide some sensitive rules in some cases. ABSTRACT: Based on the network and data mining techniques, the protection of the confidentiality of sensitive information in a database becomes a critical issue to be resolved. Association analysis is a powerful and popular tool for discovering relationships hidden in large data sets. The relationships can be represented in a form of frequent itemsets or association rules. One rule is categorized as sensitive if its disclosure risk is above some given threshold. Privacy-preserving data mining is an important issue which can be applied to various domains, such as Web commerce, crime reconnoitering, health care, and customer's consumption analysis. The main approach to hide sensitive frequent itemsets is to reduce the support of each given sensitive itemsets. This is done by modifying transactions or items in the database. However, the modifications will generate side effects, i.e., nonsensitive frequent itemsets falsely hidden (the loss itemsets) and spurious frequent itemsets falsely generated (the new itemsets). There is a trade-off between sensitive frequent itemsets hidden and side effects generated. Furthermore, it should always take huge computing time to solve the problem. 
In this study, we propose a novel algorithm, FHSFI, for fast hiding sensitive frequent itemsets (SFI). The FHSFI has achieved the following goals: 1) all SFI can be completely hidden while without generating all frequent itemsets; 2) limited side effects are generated; 3) any minimum support thresholds are allowed, and 4) only one database scan is required.
The remainder of this paper is organized as follows. Section 2 presents the problem formulation and notation. Section 3 introduces the proposed algorithm for fast hiding of sensitive frequent itemsets and illustrates it with examples. Section 4 reports experimental results on the performance and side effects of the proposed algorithm. Section 5 gives the conclusions and further work.

II. PROBLEM FORMULATION AND NOTATIONS

Table 1 summarizes the notation used in this paper. Let I = {i1, i2, ..., im} be the set of items in a transaction database D = {t1, t2, …, tn}, where every transaction ti is a subset of I, i.e., ti ⊆ I. An example database is shown in Table 2; it has nine items, |I| = 9, and five transactions, |D| = 5. Let X be a set of items in I. If X ⊆ ti, we say that the transaction ti supports X. The support of itemset X can be computed by equation (1).

Table 1. Definitions of variables used in this paper
Variable   Definition
D          the original database
D'         the released database, which is transformed from D
U          the set of frequent itemsets generated from D
U'         the set of frequent itemsets generated from D'
ti         a transaction in database D
|ti|       the number of items in ti
TID        a unique identifier of each transaction
SFI        the set of sensitive frequent itemsets to be hidden
SFI.tj     a sensitive frequent itemset in the SFI
||.||      the support count of an itemset, i.e., the number of transactions that support the itemset
wi         prior weight of ti
PWT        a table storing TID and wi for each transaction, in decreasing order of wi
MICi       the maximal number of itemsets in SFI that contain an item ik, where ik ∈ ti, SFI.tj ⊆ ti
SFI.t.i    transaction to be modified

An association rule is an implication of the form X→Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø. A rule X→Y is extracted from a database if 1) support(X∪Y) ≥ min_support (a given minimum support threshold) and 2) confidence(X→Y) ≥ min_confidence (a given minimum confidence threshold), where support(X∪Y) and confidence(X→Y) are given by equations (2) and (3).

support(X) = ||X|| / |D|              (1)
support(X∪Y) = ||X∪Y|| / |D|          (2)
confidence(X→Y) = ||X∪Y|| / ||X||     (3)

Table 2. Database D
TID   Transaction
1     1,2,4,5,7
2     1,4,5,7
3     1,4,6,7,8
4     1,2,5,9
5     6,7,8

Table 3. Frequent itemsets
Itemset   Support
1         80%
4         60%
5         60%
7         80%
1,4       60%
1,5       60%
1,7       60%
1,4,7     60%
4,7       60%

In equation (1), ||X|| denotes the number of transactions in the database that contain the itemset X, and |D| denotes the number of transactions in the database D. If support(X) ≥ min_support, we call X a frequent itemset. Table 3 shows the frequent itemsets for min_support = 60%. For example, for X = {1,4,7}, since X ⊆ t1, X ⊆ t2, and X ⊆ t3, we obtain ||X|| = 3. Therefore, support({1,4,7}) = 3/5 = 60%. Using the form X→Y (support, confidence) for association rules, the rules generated from the itemset {1,4,7} are 1→4,7 (60%, 75%), 4→1,7 (60%, 100%), 7→1,4 (60%, 75%), 1,4→7 (60%, 100%), 1,7→4 (60%, 100%), and 4,7→1 (60%, 100%).

Figure 1 shows the relationships among the sets U, U', and SFI. The goal of this study is to hide all SFI and to minimize the loss itemsets. That is, U' ∩ SFI = Ø and the set U – U' – SFI should be minimized.

Figure 1. The relationships among the sets U, U', and SFI

III. THE PROPOSED ALGORITHM

We now present the FHSFI algorithm. Given D, SFI, and min_support, the algorithm generates a database to be released, D', in which the sensitive frequent itemsets are hidden and the side effects generated are minimized. A sketch of the FHSFI algorithm is shown in Figure 2; it consists of the following stages.
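As a quick check of the definitions in equations (1) to (3), the following Python sketch recomputes Table 3 and one rule confidence from the database in Table 2. It is an illustration only: the helper names (support_count, support, confidence, frequent_itemsets) are not from the paper, and the frequent itemsets are found by brute-force enumeration, which is only practical for a toy database such as this one.

```python
from fractions import Fraction
from itertools import combinations

# Database D from Table 2, keyed by TID.
D = {
    1: {1, 2, 4, 5, 7},
    2: {1, 4, 5, 7},
    3: {1, 4, 6, 7, 8},
    4: {1, 2, 5, 9},
    5: {6, 7, 8},
}


def support_count(itemset, db):
    """||X||: number of transactions that contain every item of X."""
    x = set(itemset)
    return sum(x <= t for t in db.values())


def support(itemset, db):
    """Equation (1): ||X|| / |D|, kept exact with Fraction."""
    return Fraction(support_count(itemset, db), len(db))


def confidence(x, y, db):
    """Equation (3) for the rule X -> Y: ||X u Y|| / ||X||.

    Assumes X is supported by at least one transaction.
    """
    return Fraction(support_count(set(x) | set(y), db), support_count(x, db))


def frequent_itemsets(db, min_support):
    """Brute-force enumeration (illustrative; exponential in the item count)."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, db)
            if s >= min_support:
                result[candidate] = s
    return result


table3 = frequent_itemsets(D, Fraction(3, 5))  # min_support = 60%
for itemset, s in table3.items():
    print(itemset, f"{float(s):.0%}")
print("confidence(1 -> 4,7) =", confidence({1}, {4, 7}, D))
```

With min_support = 60% the sketch yields the nine itemsets of Table 3, and the rule 1→4,7 has confidence 3/4 = 75%, as stated in Section II.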
  ...
      If ||SFI.tj|| / |D| < min_support then
        Remove SFI.tj from SFI;
    End;
  End;
Figure 2. The pseudo code of the FHSFI algorithm

In stage 1, FHSFI scans the database once and collects, for each transaction, the information it needs about the transaction's correlation with SFI, including ||SFI.tj|| and wi. ||SFI.tj|| is used to check whether SFI.tj has been hidden. wi is the prior weight of transaction ti; it provides a heuristic for estimating side effects and is computed by equation (4).

wi = 1 / [2^(|ti| - 1) / MICi] = MICi / 2^(|ti| - 1)     (4)

Thus the weight grows with MICi and halves with each additional item in ti.

Table 4. An example of sensitive frequent itemsets, SFI
     Itemset
1    1,2,5
2    1,4,7
3    1,5,7
4    6,8

Table 5. The support count for each itemset in SFI
     Itemset   ||.||
1    1,2,5     2
2    1,4,7     3
3    1,5,7     2
4    6,8       2

Table 4 shows an example set of sensitive frequent itemsets. Let t1 = {1,2,4,5,7}, which supports SFI.t1, SFI.t2, and SFI.t3. As shown in Figure 3, the correlation between t1 and SFI can be represented by a graph G = <V,E>. Each node stands for an item ik in t1; the weight of each edge in E is the number of itemsets in SFI that contain both nodes connected by the edge. Each node can be represented as ({SFI.tj | SFI.tj ⊆ ti, ik ∈ SFI.tj}, item_countSFI.t). For example, the node <{1,2,3}, 3> for item '1' indicates that three itemsets in SFI contain the item '1', namely SFI.t1, SFI.t2, and SFI.t3. As shown in Figure 3, item '1' has the maximum item_countSFI.t, which is equal to 3. Hence, we obtain MIC1 = 3 and w1 = 3/2^4 = 3/16.

Figure 3. The correlation between t1 and SFI

Stage 2 modifies transactions one by one until all SFI have been hidden. Transactions are modified in the order of their prior weights. The following tasks are repeated until SFI is empty.
• Select a transaction tk from PWT such that wk is maximal.
• Select the item to be deleted according to the heuristic shown in Figure 4, and delete it.
• Recompute wk after the modification, and reinsert tk into the PWT so that the order is maintained.
• Subtract 1 from ||SFI.tj|| if SFI.tj contains the deleted item and is supported by tk.
• Remove SFI.tj from SFI if ||SFI.tj|| / |D| < min_support.

Figure 4 shows the heuristic procedure for determining which item to modify and for computing MIC for transaction ti.
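The correlation graph of Figure 3 and the prior weight of equation (4) can be reproduced for t1 with a short Python sketch. The variable names are illustrative and not from the paper, and the sketch assumes that only sensitive itemsets fully supported by t1 contribute to node counts and edge weights.

```python
from fractions import Fraction
from itertools import combinations

t1 = {1, 2, 4, 5, 7}  # transaction t1 from Table 2
SFI = {1: {1, 2, 5}, 2: {1, 4, 7}, 3: {1, 5, 7}, 4: {6, 8}}  # Table 4

# Assumption: only sensitive itemsets fully supported by t1 take part.
supported = {j: s for j, s in SFI.items() if s <= t1}

# One node per item: the ids of the supported itemsets containing the item.
nodes = {item: {j for j, s in supported.items() if item in s}
         for item in sorted(t1)}

# Edge weight: number of supported itemsets containing both end items.
edges = {}
for a, b in combinations(sorted(t1), 2):
    weight = sum(1 for s in supported.values() if a in s and b in s)
    if weight:
        edges[(a, b)] = weight

# MIC is the largest node count; equation (4) then gives the prior weight.
mic = max(len(ids) for ids in nodes.values())
w1 = Fraction(mic, 2 ** (len(t1) - 1))

for item, ids in nodes.items():
    print(f"item {item}: <{sorted(ids)}, {len(ids)}>")
print("edges:", edges)
print("MIC1 =", mic, "w1 =", w1)
```

For t1 the node of item 1 carries the itemset ids {1, 2, 3}, so MIC1 = 3 and w1 = 3/16, in agreement with the text.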
  Heuristic();
  Input: TID, SFI;
  Output: the item to be modified, MICi;
  Begin
    For each SFI.tj in SFI do
    Begin
      If the transaction tTID fully supports SFI.tj then
      Begin
        For each item SFI.tj.i in SFI.tj do
          item_countSFI.t.i = item_countSFI.t.i + 1;
      End;
    End;
    Select the SFI.t.i with maximum item_count as the item of tTID to be modified;
    Return(SFI.t.i, item_count);
  End;
Figure 4. The pseudo code of the heuristic procedure

Table 6. The MIC and prior weight for each transaction in D
TID   Transaction   |ti|   MIC   w
1     1,2,4,5,7     5      3     3/16
2     1,4,5,7       4      2     2/8
3     1,4,6,7,8     5      1     1/16
4     1,2,5,9       4      1     1/8
5     6,7,8         3      1     1/4

Table 7. The example PWT
      TID   w
1     2     2/8
2     5     1/4
3     1     3/16
4     4     1/8
5     3     1/16

Table 8. Experiment results for |SFI|=5
|D|     CPU time (ms)   |U|   |U'|    #loss itemsets   #modified entries
5000    326.6           439   428.6   5.4              143
10000   454.2           417   406.4   5.6              307.2
15000   701             426   415.6   5.4              513
20000   905             442   431     6                711.6
25000   1183.6          432   421.2   5.8              902.8
30000   1502            443   432.4   5.6              863.8

We now illustrate the proposed FHSFI algorithm with an example.

Example 1. Let D and SFI be as shown in Tables 2 and 4, and let min_support = 40%. As shown in Table 5, the support count of each SFI.tj can be obtained from D and SFI. For example, SFI.t2 = {1,4,7} is supported by t1, t2, and t3, so ||SFI.t2|| = 3. Table 6 lists the length, MIC, and prior weight of each transaction in the database. The PWT, shown in Table 7, is obtained by sorting Table 6 in decreasing order of w. The first transaction in the PWT, i.e., t2, is chosen to be modified. According to the heuristic shown in Figure 4, the item '1' in t2 is removed. Hence, ||SFI.t2|| and ||SFI.t3|| are each reduced by 1. SFI.t3 is then removed from SFI because ||SFI.t3|| / |D| = 1/5 < min_support. The process is repeated until SFI is empty.
Finally, the FHSFI algorithm removes the item '1' from t2, the item '6' or '8' from t5 (selected randomly), and the item '1' from t1. All sensitive frequent itemsets in SFI have now been hidden. ■
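The two stages and Example 1 can be traced with the following Python sketch. It is a reading of the description above, not the authors' implementation: the function names are hypothetical, the PWT is kept as a plain dictionary rather than a sorted table, and ties are broken deterministically (smallest TID, smallest item) where the paper chooses arbitrarily or at random.

```python
from fractions import Fraction


def mic_and_item(transaction, sfi):
    """Figure 4 heuristic: per item, count the sensitive itemsets that the
    transaction fully supports; return (item with the largest count, count)."""
    counts = {}
    for itemset in sfi.values():
        if itemset <= transaction:
            for item in itemset:
                counts[item] = counts.get(item, 0) + 1
    if not counts:
        return None, 0
    # Assumption: ties go to the smallest item id.
    item = min(counts, key=lambda i: (-counts[i], i))
    return item, counts[item]


def prior_weight(transaction, sfi):
    """Equation (4): w = MIC / 2^(|t| - 1); zero when nothing is supported."""
    _, mic = mic_and_item(transaction, sfi)
    return Fraction(mic, 2 ** (len(transaction) - 1)) if mic else Fraction(0)


def fhsfi(database, sfi, min_support):
    """Return (sanitized database, list of (TID, deleted item))."""
    db = {tid: set(t) for tid, t in database.items()}
    sfi = {j: frozenset(s) for j, s in sfi.items()}
    n = len(db)
    # Stage 1: collect support counts and prior weights for every transaction.
    count = {j: sum(s <= t for t in db.values()) for j, s in sfi.items()}
    pwt = {tid: prior_weight(t, sfi) for tid, t in db.items()}
    # Itemsets already below the threshold need no hiding.
    for j in [j for j in sfi if Fraction(count[j], n) < min_support]:
        del sfi[j]
    deleted = []
    # Stage 2: modify the heaviest transaction until SFI is empty.
    while sfi:
        # Assumption: weight ties go to the smallest TID.
        tid = min(pwt, key=lambda k: (-pwt[k], k))
        if pwt[tid] == 0:
            break  # no transaction supports a remaining itemset
        item, _ = mic_and_item(db[tid], sfi)
        if item is not None:
            affected = [j for j, s in sfi.items() if item in s and s <= db[tid]]
            for j in affected:
                count[j] -= 1
                if Fraction(count[j], n) < min_support:
                    del sfi[j]
            db[tid].discard(item)
            deleted.append((tid, item))
        # Only the modified transaction has its weight recomputed.
        pwt[tid] = prior_weight(db[tid], sfi)
    return db, deleted


# Example 1: D from Table 2, SFI from Table 4, min_support = 40%.
D = {1: {1, 2, 4, 5, 7}, 2: {1, 4, 5, 7}, 3: {1, 4, 6, 7, 8},
     4: {1, 2, 5, 9}, 5: {6, 7, 8}}
SFI = {1: {1, 2, 5}, 2: {1, 4, 7}, 3: {1, 5, 7}, 4: {6, 8}}
released, deleted = fhsfi(D, SFI, Fraction(2, 5))
print(deleted)
print(released)
```

On the data of Example 1 this sketch deletes item 1 from t2, item 6 from t5, and item 1 from t1, which agrees with the modifications listed above (item 6 being one of the two admissible choices in t5).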
IV. PERFORMANCE EVALUATION

We performed our experiments on a notebook with a 1.5 GHz processor and 512 MB of memory, running the Windows XP operating system. The IBM data generator [11] was used to synthesize the databases for the experiments. Databases with 5K, 10K, 15K, 20K, 25K, and 30K transactions were generated for the series of experiments. The average transaction length in each database is 10, and each generated database contains 50 items. The minimum support threshold is 30%. The experimental results are averages over 5 independent trials with different SFIs.

The performance of the FHSFI algorithm was measured according to three criteria: CPU time requirements, side effects produced, and the number of entries modified. Tables 8 and 9 present the experimental results for |SFI| = 5 and |SFI| = 10, respectively. The CPU time requirements, the side-effect evaluation, and the number of entries modified for varied |D| and |SFI| are shown in Figures 6, 7, and 8, respectively.

Table 9. Experiment results for |SFI|=10

Figure 6. CPU time requirements

Figure 7. The side-effect evaluation
Figure 8. The number of entries modified

The experimental results for FHSFI can be summarized as follows:
• As shown in Figure 6, the CPU time grows linearly with the size of the database and scales with the size of SFI.
• As shown in Figure 7, the number of loss itemsets is independent of the size of the database but grows linearly with the size of SFI.
• The number of modified entries depends on the size of the database and the size of SFI. However, since heuristic procedures are used to determine the order of modifications, we can observe in Figure 8 that only a small part of the transactions in the database are modified. For |D| = 10000, only 600 transactions are modified to completely hide the 10 itemsets in SFI.

V. CONCLUSIONS AND FURTHER WORK

In this paper, we have presented the FHSFI algorithm for fast hiding of sensitive frequent itemsets with limited side effects. The correlations between the sensitive itemsets and each transaction in the original database are analyzed, and a heuristic function gives a prior weight for each transaction. The order in which transactions are modified can then be decided efficiently from these weights. This reduces the time spent on transactions whose modification does not help to hide the given sensitive frequent itemsets. In other words, the number of transactions in D that have to be processed is also reduced. Our approach has achieved the following goals: 1) all SFI can be completely hidden without generating all frequent itemsets; 2) limited side effects are generated; 3) any minimum support threshold is allowed; and 4) only one database scan is required. One of our goals in this research is to hide all SFI with limited side effects, but our algorithm still causes some loss itemsets.
We are currently considering extensions of the algorithm to address this problem. Another direction is to apply the ideas introduced in this paper to the fast hiding of sensitive association rules. These issues could be studied in the future.

REFERENCES
[1] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, V. Verykios, "Disclosure limitation of sensitive rules," Knowledge and Data Engineering Exchange, pp. 45-52, 1999.
[2] Vassilios S. Verykios, A.K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni, "Association Rule Hiding," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 434-447, 2004.
[3] Shyue-Liang Wang, "Hiding sensitive predictive association rules," Systems, Man and Cybernetics, 2005 IEEE International Conference on Information Reuse and Integration, vol. 1, pp. 164-169, 2005.
[4] Ali Amiri, "Dare to share: Protecting sensitive knowledge with data sanitization," Decision Support Systems archive, vol. 43, issue 1, pp. 181-191, 2007.
[5] Yi-Hung Wu, Chia-Ming Chiang, and Arbee L.P. Chen, "Hiding Sensitive Association Rules with Limited Side Effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, issue 1, pp. 29-42, 2007.
Authors Profile

Mr. Shaik Mahammad Rafi was born in Rajampet, Kadapa, A.P., India, in 1990. He is studying for an M.Tech in the Department of Computer Science and Engineering at Global College of Engineering and Technology, Kadapa. He received his Bachelor of Technology in Information Technology from JNTUA University in 2011.

Mr. M. Suman was born in Rajampet, Kadapa, A.P., India, in 1985. He received his Master of Technology in Computer Science and Engineering from JNTUH University in 2013 at Vathsalya Institute of Technology & Science, Anantharam Bhongiri, Nalgonda. He has guided many M.Tech students in their thesis work and has contributed papers to research on Image Processing. He is presently working as an Assistant Professor at Mekapati Rajamohan Reddy Institute of Technology and Science, Udayagiri, SPSR Nellore. He received his Bachelor of Technology in Computer Science & Engineering from JNTUH University in 2006 at Annamacharya Institute of Technology & Science, Rajampet, Kadapa.

Mr. Majjari Sudhakar was born in Raghunathapuram, Badvel, Kadapa, A.P., India, in 1983. He received his Master of Technology in Information Technology from JNTUH University in 2010 at Aurora's Scientific Technological & Research Academy, Bandlaguda, Hyderabad. He has 3 years of teaching experience and has guided many M.Tech students in their thesis work. He has also contributed papers to research on Software Engineering. He is presently working as an Assistant Professor at Mekapati Rajamohan Reddy Institute of Technology and Science, Udayagiri, SPSR Nellore. He received his Bachelor of Technology in Computer Science & Engineering from JNTUH University in 2006 at Sri Venkateswara College of Engineering & Technology, R.V.S Nagar, Chittur.

Mr. P. Ramesh was born in Naidupet, S.P.S.R. Nellore, A.P., India, in 1989. He is studying for an M.Tech in the Department of Computer Science and Engineering at Sir C.V. Raman Institute of Technology & Science, Tadipatri, Anantapur. He received his Bachelor of Technology in Computer Science & Engineering from JNTUA University in 2011.

Mr. P. Venkata Ramanaiah was born in Khajipet, Kadapa, A.P., India. He received his Master of Technology in Computer Science and Engineering at Madanapalle Institute of Technology and Science from JNTUA University in 2012. He has guided many M.Tech students in their thesis work and has contributed papers to research on Data Mining. He is presently working as an Assistant Professor at Global College of Engineering and Technology (GCET), Kadapa.