IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 4 (Nov. - Dec. 2013), PP 20-23
www.iosrjournals.org
Mining Top-k Closed Sequential Patterns in Sequential Databases
K. Sohini¹, Mr. V. Purushothama Raju²
¹ Dept of CSE, Shri Vishnu Engineering College for Women, Bhimavaram, A.P, India
² Associate Professor, Dept of CSE, Shri Vishnu Engineering College for Women, Bhimavaram, A.P, India
Abstract: Sequential pattern mining has been studied extensively in the data mining community. Most studies
require the user to specify a minimum support threshold in order to mine sequential patterns. However, it is
difficult for users to provide an appropriate threshold in practice. To overcome this, we propose mining top-k
closed sequential patterns of length no less than min_l, where k is the number of closed sequential patterns to be
mined and min_l is the minimum length of each pattern. We mine closed patterns because they are compact
representations of frequent patterns.
Keywords: closed pattern, data mining, sequential pattern, scalability
I. Introduction
Sequential pattern mining is an important data mining task that has been studied extensively. Given a
set of sequences, each of which consists of a list of itemsets, and a user-specified minimum support threshold
(min_support), sequential pattern mining finds all frequent subsequences, that is, those whose frequency is no
less than min_support. This mining task leads to the following two problems that may hinder its popular use.
First, sequential pattern mining often generates an exponential number of patterns, which is unavoidable when
the database contains long frequent sequences. Similar phenomena also exist in itemset and graph pattern mining
when the patterns are large. For example, if the database contains a frequent sequence <(a1)(a2)…(a100)>,
it will generate $2^{100} - 1$ frequent subsequences. It is quite possible that some of these subsequences share
exactly the same support as the long sequence, in which case they are essentially redundant patterns.
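The count in this example can be checked on a small scale. The following is a minimal sketch, assuming every itemset holds a single distinct item; the helper name `count_nonempty_subsequences` is hypothetical and not part of any algorithm in this paper. It enumerates the non-empty subsequences of a sequence of n items by brute force and compares the count with $2^n - 1$.

```python
from itertools import combinations


def count_nonempty_subsequences(sequence):
    """Count distinct non-empty subsequences by brute-force enumeration.

    Assumes each element of `sequence` is a single distinct item.
    """
    seen = set()
    for size in range(1, len(sequence) + 1):
        # combinations() preserves the original order of the items,
        # so every tuple it yields is a subsequence of `sequence`.
        for sub in combinations(sequence, size):
            seen.add(sub)
    return len(seen)


for n in (3, 5, 10):
    items = [f"a{i}" for i in range(1, n + 1)]
    assert count_nonempty_subsequences(items) == 2 ** n - 1

# For n = 100 enumeration is infeasible, but the closed form is cheap to evaluate.
print(2 ** 100 - 1)
```

The brute-force check is practical only for small n, which is the point of the example: the number of frequent subsequences doubles with every additional item.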
Second, setting min_support is a subtle task: a value that is too small may lead to the generation of
thousands of patterns, whereas one that is too large may lead to no answer being found. To come up with an
appropriate min_support, one needs prior knowledge about the mining query and the task-specific data, and must
be able to guess beforehand how many patterns will be generated with a particular threshold.
A solution to the first problem was proposed by Yan et al. [1]. Their algorithm, called
CloSpan, mines closed sequential patterns. A sequential pattern s is closed if there exists no super-pattern of
s with the same support in the database. Mining closed patterns reduces the number of patterns generated and
is information lossless, because the closed patterns can be used to derive the complete set of sequential
patterns.
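The definitions of support and closedness given above can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not CloSpan: sequences are represented as lists of frozensets, the helper names (`seq`, `is_subsequence`, `support`, `closed_patterns`) are hypothetical, and closedness is checked only against the candidate list that is passed in rather than against every possible super-pattern.

```python
def seq(*itemsets):
    """Build a sequence as a list of frozensets, e.g. seq("a", "bc") is <(a)(bc)>."""
    return [frozenset(s) for s in itemsets]


def is_subsequence(pattern, sequence):
    """True if the itemsets of `pattern` are contained, in order, in distinct
    itemsets of `sequence` (greedy left-to-right matching)."""
    pos = 0
    for itemset in pattern:
        # Advance to the first itemset of `sequence` that contains `itemset`.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(pattern, database):
    """Number of database sequences that contain `pattern`."""
    return sum(1 for s in database if is_subsequence(pattern, s))


def closed_patterns(candidates, database):
    """Keep candidates that have no proper super-pattern of equal support
    among the candidates themselves."""
    result = []
    for p in candidates:
        sup_p = support(p, database)
        has_equal_super = any(
            q != p and is_subsequence(p, q) and support(q, database) == sup_p
            for q in candidates
        )
        if not has_equal_super:
            result.append(p)
    return result


database = [seq("a", "b", "c"), seq("a", "b", "c"), seq("a", "c")]
candidates = [seq("a"), seq("a", "c"), seq("a", "b", "c")]
closed = closed_patterns(candidates, database)
```

In this toy database, <(a)> and <(a)(c)> both have support 3, so <(a)> is not closed; <(a)(c)> and <(a)(b)(c)> are kept.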
As to the second problem, a similar situation occurs in frequent itemset mining. As proposed in [3], a
good solution is to change the task of mining frequent patterns to mining top-k frequent closed patterns of
minimum length min_l, where k is the number of closed patterns to be mined, top-k refers to the k most
frequent patterns, and min_l is the minimum length of the closed patterns. This setting is also desirable in the
context of sequential pattern mining. Unfortunately, most of the techniques developed in [3] cannot be directly
applied to sequence mining, because subsequence testing requires order matching, which is more difficult
than subset testing. Moreover, the search space of sequences is much larger than that of itemsets. Nevertheless,
some ideas developed in [3] are still influential in our algorithm design.
II. Related Work
Efficient algorithms such as PrefixSpan [5], SPADE [6], and GSP [7] were developed for mining sequential
patterns. Because sequential pattern mining produces many patterns, closed sequential pattern mining algorithms
such as CloSpan [1] and BIDE [8] were developed. These algorithms deliver fewer patterns than sequential pattern
mining without losing any information. Top-k closed pattern mining reduces the number of patterns further
by mining only the most frequent ones. With respect to mining top-k patterns, CloSpan [1] and TFP [3] are the
most closely related work: CloSpan mines frequent closed sequential patterns, while TFP discovers top-k closed
itemsets.
III. Method Development
We first introduce projection-based sequential pattern mining (PrefixSpan [5]), which provides the
background for our method. We then present a new Top-k mining algorithm that builds on CloSpan and top-k
mining.
For each discovered sequence s and its projected database Ds, PrefixSpan performs itemset-extension and
sequence-extension recursively until all the frequent sequences with prefix s are discovered.
Given a sequence $s = \langle t_1, \ldots, t_m \rangle$ and an item a, sa denotes the concatenation of s with a.
It can be an I-Step extension, $s \diamond_i a = \langle t_1, \ldots, t_m \cup \{a\} \rangle$,
or an S-Step extension, $s \diamond_s a = \langle t_1, \ldots, t_m, \{a\} \rangle$.
Example: <(ae)> is an I-Step extension of <(a)>, and
<(a)(c)> is an S-Step extension of <(a)>.
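The two extension steps can be illustrated with a short sketch. It assumes a sequence is stored as a list of frozensets; the function names `i_step` and `s_step` are hypothetical and chosen only to mirror the terminology above.

```python
def i_step(s, a):
    """I-Step extension: add item `a` to the last itemset of sequence `s`."""
    return s[:-1] + [s[-1] | {a}]


def s_step(s, a):
    """S-Step extension: append a new itemset {a} to the end of sequence `s`."""
    return s + [frozenset({a})]


prefix = [frozenset({"a"})]      # the sequence <(a)>
extended_i = i_step(prefix, "e")  # <(ae)>
extended_s = s_step(prefix, "c")  # <(a)(c)>
```

Both functions return a new list and leave the prefix unchanged, which suits a recursive search that extends the same prefix in several ways.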
Algorithm: Top-k mining
Input: a sequence s, a projected database Ds, minimum length min_l, histograms H, and constant factor f
Output: the top-k closed sequence set k
1. if support(s) < min_support then return;
2. if length(s) = min_l then
3.     call Prefix Span With Support Raising (s, Ds, min_support, k);
4.     return;
5. scan the database Ds once and find every frequent item a such that s can be extended to sa;
6. insert a into histogram H[l(s)+1];
7. next_level_top_support ← Get Top Support From Histogram (f, H[l(s)+1]);
8. for each a such that support(a) >= next_level_top_support do
9.     call Top-k mining (sa, Dsa, min_l);
10. return;
PrefixSpan finds all the frequent sequences, including both closed and non-closed
ones. Since our task is to mine top-k closed sequential patterns without a min_support threshold, the mining
process should start with min_support = 1, raise it progressively during the process, and then use the raised
min_support to prune the search space. As soon as at least k closed sequential patterns with length no less than
min_l are found, min_support can be set to the support of the least frequent of them, and this min_support raising
process continues throughout the mining process. This min_support raising technique is simple and can lead to
efficient mining. The closure checking of CloSpan [1] is used to guarantee that at least k closed sequential
patterns have been found, so that min_support can be raised during the mining.
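The min_support raising idea described above can be sketched with a bounded min-heap. This is an illustration only, not the authors' implementation: the class name `TopKSupportRaiser` and its methods are hypothetical, every pattern passed to `offer` is assumed to have been verified as closed already, and the caller supplies the pattern length.

```python
import heapq


class TopKSupportRaiser:
    """Track the k most frequent patterns seen so far and raise min_support."""

    def __init__(self, k, min_l):
        self.k = k
        self.min_l = min_l
        self.min_support = 1  # mining starts with the lowest possible threshold
        self._heap = []       # min-heap of (support, insertion order, pattern)
        self._count = 0

    def offer(self, pattern, support, length):
        """Consider a pattern (assumed closed) for the current top-k result."""
        if length < self.min_l or support < self.min_support:
            return
        self._count += 1
        entry = (support, self._count, pattern)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif support > self._heap[0][0]:
            # Evict the least frequent stored pattern.
            heapq.heapreplace(self._heap, entry)
        if len(self._heap) == self.k:
            # k qualifying patterns are known, so the threshold can be raised
            # to the support of the least frequent of them.
            self.min_support = self._heap[0][0]

    def patterns(self):
        """Stored patterns as (pattern, support), most frequent first."""
        return [(p, s) for s, _, p in sorted(self._heap, reverse=True)]


raiser = TopKSupportRaiser(k=2, min_l=2)
raiser.offer(("a", "b"), support=5, length=2)
raiser.offer(("a",), support=9, length=1)      # shorter than min_l, ignored
raiser.offer(("a", "c"), support=3, length=2)  # heap is full, threshold becomes 3
raiser.offer(("b", "c"), support=4, length=2)  # replaces ("a", "c"), threshold becomes 4
raiser.offer(("c", "d"), support=2, length=2)  # below the raised threshold, ignored
```

A search procedure would read `raiser.min_support` before extending a prefix and skip any prefix whose support falls below it. Ties with the k-th pattern are dropped in this sketch; a real miner would need a policy for them.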
The Top-k mining procedure works as follows. If the support of s is less than min_support, it returns. If the
length of s is equal to min_l, it calls Prefix Span With Support Raising and returns. Otherwise, it scans the
projected database Ds once to find every frequent item a such that s can be extended to sa, and inserts a into the
histogram. Finally, for each a whose support is greater than or equal to next_level_top_support, it calls Top-k
mining recursively.
IV. Performance Evaluation
This section reports the performance of Top-k closed sequential pattern mining on large data sets. We
compare the performance of Top-k with that of CloSpan. For the comparison, CloSpan is assigned the optimal
min_support, so that it generates the same set of top-k closed sequential patterns for the specified values of k.
The experiments were performed on a 2.10 GHz Intel Core Duo PC with 2 GB of main memory, running Windows XP/7
Professional. The performance of the algorithms was compared by varying k. When k is fixed, its value is set to
either 50 or 500, which covers the range of typical values for this parameter. Fig. 1 shows the performance on
the sample dataset. This dataset consists of relatively short sequences: each sequence contains 6 itemsets on
average, and each itemset has 3 items on average. This experiment shows that Top-k mines the data efficiently
and with less running time than CloSpan. There are two main reasons for the better performance of Top-k on this
dataset. First, it uses the min_l condition to prune short sequences during the mining process, which reduces
the search space and improves performance. Second, Top-k has efficient pattern verification and stores a result
set that contains only a small number of patterns.
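The comparison protocol in this section gives CloSpan the optimal min_support for a given k. One way to read this, sketched below under stated assumptions, is that the optimal threshold equals the k-th highest support among closed patterns of length at least min_l. The function name `optimal_min_support` and the input format are hypothetical, and the supports of the closed patterns are assumed to be known in advance, which is what makes the threshold an oracle setting.

```python
def optimal_min_support(closed_supports, k, min_l):
    """Return the k-th highest support among closed patterns of length >= min_l.

    `closed_supports` is an iterable of (length, support) pairs, one per
    closed pattern. If fewer than k patterns qualify, the lowest qualifying
    support is returned.
    """
    eligible = sorted(
        (sup for length, sup in closed_supports if length >= min_l),
        reverse=True,
    )
    if not eligible:
        raise ValueError("no closed pattern satisfies min_l")
    return eligible[min(k, len(eligible)) - 1]


# (length, support) pairs for a hypothetical set of closed patterns.
closed_supports = [(2, 9), (3, 7), (1, 12), (4, 7), (2, 3)]
threshold = optimal_min_support(closed_supports, k=2, min_l=2)
```

With this threshold a support-based miner still returns patterns shorter than min_l and any patterns tied at the threshold, so its output would have to be filtered before it matches the top-k set exactly.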
Fig. 1: Performance of CloSpan and Top-k when k is fixed: (a) k = 50; (b) k = 500.
These graphs were obtained by running CloSpan and Top-k on the sample database; their axes are min_l
and running time.
V. Conclusions
In this paper, we have studied the problem of mining top-k closed sequential patterns with length no
less than min_l, and we have proposed an efficient algorithm with the following distinct features: it implements
a new Top-k algorithm that mines sequential patterns quickly and raises support dynamically, it performs
efficient verification during the mining process, and it combines optimization techniques with the
minimum length constraint (min_l).
[Fig. 1 plot: running time versus min_l, with one curve each for CloSpan and Top-k]
References
[1] X. Yan, R. Afshar, and J. Han. CloSpan: Mining closed sequential patterns in large datasets. In SDM, California, May 2003.
[2] C. J. Hsiao and M. J. Zaki. CHARM: An efficient algorithm for closed itemset mining. In SDM, Arlington, VA, April 2002.
[3] J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. In ICDM, Maebashi, Japan, Dec. 2002.
[4] R. Agrawal and R. Srikant. Mining sequential patterns. In ICDE, Taipei, Taiwan, Mar. 1995.
[5] J. Pei, J. Han, B. Mortazavi-Asl, M. C. Hsu, Q. Chen, U. Dayal, and H. Pinto. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In ICDE, Germany, April 2001.
[6] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. 2001.
[7] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. Avignon, France, Mar. 1996.
[8] J. Wang and J. Han. BIDE: Efficient mining of frequent closed sequences. In ICDE, 2004.
