International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.4, July 2013
DOI : 10.5121/ijist.2013.3402
Space Efficient Suffix Array Construction using
Induced Sorting LMS Substrings
Rajesh. Yelchuri¹, Nagamalleswara Rao.N²
Department of Computer Science and Engineering, R.V.R & J.C College of Engg.
Chowdavaram, Guntur, Andhra Pradesh -522119, India
¹rajesh.yelchuri@gmail.com
²nnmr_m@yahoo.com
ABSTRACT
This paper presents a space-efficient algorithm for linear-time suffix array construction. The
algorithm uses the techniques of divide-and-conquer and recursion. What differentiates the proposed
algorithm from the existing induced sorting algorithm for variable-length leftmost S-type (LMS)
substrings is its more economical use of memory while constructing the suffix array: the modified
algorithm needs less working space than the existing one.
KEYWORDS
Divide and Conquer, Suffix Array.
1. INTRODUCTION
The concept of suffix arrays was introduced by Manber and Myers in SODA’90 [4] and SICOMP’93 [3]
as a space-efficient alternative to suffix trees. The suffix array is well recognized as a
fundamental data structure, useful for a broad range of applications, e.g., string search, data
indexing, searching for patterns in DNA or protein sequences, data compression, and the
Burrows-Wheeler transform. For an n-character string, denoted by STR, its suffix array, denoted by
SAR(STR), is an array of indices pointing to all the suffixes of STR, sorted in ascending (or
descending) lexicographical order. The suffix array of STR itself requires only n⌈log n⌉ bits of
space. However, different suffix array construction algorithms have different space and time
requirements. During the past decade, many researchers have developed suffix array construction
algorithms that are both time and space efficient; for a detailed survey we refer to Puglisi et
al. [5]. Such algorithms have become popular because suffix arrays are so widely used. Suffix
arrays must be constructed for large-scale applications, e.g., biological genome databases and web
searching, where the size of a data set is measured in billions of characters [6], [7], [8], [9],
[10]. Time- and space-efficient linear-time algorithms are crucial for such applications to have
predictable worst-case performance. The three known linear-time algorithms, KSP [1], KA [12],
[13], and KS [11], [2], were all reported in 2003.
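As a point of reference, the definition of SAR(STR) given above can be illustrated by a direct
construction that simply sorts the suffixes. This is a minimal sketch for illustration only: it is
not linear time, and the function name naive_suffix_array is hypothetical, not part of the paper.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the suffix array of text by sorting all suffixes directly.

    Reference construction only: sorting whole suffixes is far from linear time.
    The text is assumed to end with a sentinel '$' smaller than every other character.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


# Suffixes of "banana$" in ascending order: $, a$, ana$, anana$, banana$, na$, nana$
print(naive_suffix_array("banana$"))  # prints [6, 5, 3, 1, 0, 4, 2]
```

Such a slow construction is mainly useful as an oracle for checking a linear-time implementation
on small inputs.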
2. BASIC NOTATIONS
In this section we introduce the basic terminology used in the presentation of the algorithm. Let
STR be a string of n characters stored in an array [0..n-1], and let ∑(STR) be the alphabet of STR.
A substring of STR running from position i to position j, where i and j range from 0 to n-1 and
i<j, is denoted by STR[i..j]. For simplicity, STR is assumed to be terminated by a sentinel
character, represented by $, which is the unique lexicographically smallest character in STR. Let
suffix(STR, i) be the suffix of STR starting at STR[i] and running to the end of the character
array, i.e., to the sentinel.
A suffix suffix(STR, i) is called S-type or L-type if suffix(STR, i) < suffix(STR, i+1) or
suffix(STR, i) > suffix(STR, i+1), respectively. The last suffix suffix(STR, n-1), consisting of
only the single character $ (the sentinel), is defined to be S-type. A character STR[i] is
classified as S-type or L-type according to the type of suffix(STR, i). To store the type of every
character/suffix, we introduce an n-bit Boolean array b, where b[i] records the type of character
STR[i] as well as of suffix(STR, i): 1 for S-type and 0 for L-type. From the definitions of S-type
and L-type, we observe the following properties:
Property 1: STR[i] is S-type if STR[i] < STR[i+1], or if
STR[i] = STR[i+1] and suffix(STR, i+1) is S-type.
Property 2: STR[i] is L-type if STR[i] > STR[i+1], or if
STR[i] = STR[i+1] and suffix(STR, i+1) is L-type.
By reading STR once from right to left, we can store the type of each character/suffix in the type
array b in O(n) time.
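The right-to-left scan based on Properties 1 and 2 can be sketched as follows. This is a minimal
illustration in Python, assuming the input already ends with the sentinel $; the names
classify_types and types_string are hypothetical.

```python
def classify_types(text: str) -> list[bool]:
    """Return b, where b[i] is True for S-type and False for L-type.

    One right-to-left pass; text must end with the sentinel '$'.
    """
    n = len(text)
    b = [False] * n
    b[n - 1] = True  # the sentinel suffix is S-type by definition
    for i in range(n - 2, -1, -1):
        if text[i] < text[i + 1]:
            b[i] = True          # Property 1, first case
        elif text[i] == text[i + 1]:
            b[i] = b[i + 1]      # equal characters: inherit the type of suffix i+1
        # otherwise text[i] > text[i + 1]: L-type, b[i] stays False (Property 2)
    return b


def types_string(text: str) -> str:
    """Render the type array as a string of 'S' and 'L' for easy reading."""
    return "".join("S" if s else "L" for s in classify_types(text))


print(types_string("mmiissiissiippii$"))  # prints LLSSLLSSLLSSLLLLS
```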
As defined earlier, SAR(STR), i.e., the suffix array of STR (written simply SAR when there is no
confusion in the context), stores the indices of all the suffixes of STR in their lexicographical
order. We observe that the pointers for all the suffixes beginning with the same character occupy
consecutive positions. We call the sub-array of SAR holding all the suffixes with the same first
character a bucket, where the head and the tail of a bucket refer to its first and last items.
There is no tie between two suffixes that share the same first character but differ in type:
within a bucket, all the suffixes of the same type are grouped together, and the S-type suffixes
lie to the right of the L-type suffixes [12], [13]. Therefore, each bucket can be divided into two
sub-buckets according to the types of the suffixes inside, i.e., the L-type and S-type buckets,
where the S-type bucket is on the right of the L-type bucket.
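The head and tail of every bucket follow from the character counts alone. The sketch below is an
illustration under that observation; bucket_bounds is a hypothetical helper name.

```python
from collections import Counter


def bucket_bounds(text: str) -> dict[str, tuple[int, int]]:
    """Map each character to the (head, tail) positions of its bucket in the suffix array."""
    counts = Counter(text)
    bounds = {}
    head = 0
    for ch in sorted(counts):
        # A bucket spans as many consecutive slots as the character has occurrences.
        bounds[ch] = (head, head + counts[ch] - 1)
        head += counts[ch]
    return bounds


print(bucket_bounds("banana$"))
# prints {'$': (0, 0), 'a': (1, 3), 'b': (4, 4), 'n': (5, 6)}
```

For "banana$" the suffix array is [6, 5, 3, 1, 0, 4, 2]. Positions 1 to 3 form the bucket of 'a':
the L-type suffix 5 comes first, followed by the S-type suffixes 3 and 1, in line with the
sub-bucket order described above.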
3. Existing Algorithm: INDUCED SORTING VARIABLE LENGTH LMS
SUBSTRINGS
A. Algorithm Framework
The framework of the existing linear-time suffix array construction algorithm SAR-IS [15], which
samples and sorts the variable-length LMS-substrings, is given in Section 3.C. Steps 1 to 4 produce
the reduced problem, which is then solved, recursively if necessary, by Steps 5 to 8; finally,
Step 9 induces the solution of the original problem from the solution of the reduced problem.
B. Basic Definitions
We start by introducing the terms leftmost S-type (LMS) character, suffix, and substring as
follows:
Definition 1: (LMS Character/Suffix) A character STR[i], i ∈ [1, n-1], is called LMS if STR[i] is
S-type and STR[i-1] is L-type. A suffix suffix(STR, i) is called LMS if STR[i] is an LMS character.
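Definition 1 translates directly into a scan over the type array. The following sketch is
illustrative only; classify_types and lms_positions are hypothetical names, and the input is
assumed to end with the sentinel $.

```python
def classify_types(text: str) -> list[bool]:
    """b[i] is True for S-type, False for L-type (right-to-left scan)."""
    n = len(text)
    b = [False] * n
    b[n - 1] = True  # sentinel is S-type
    for i in range(n - 2, -1, -1):
        b[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and b[i + 1])
    return b


def lms_positions(text: str) -> list[int]:
    """Return all i in [1, n-1] such that STR[i] is S-type and STR[i-1] is L-type."""
    b = classify_types(text)
    return [i for i in range(1, len(text)) if b[i] and not b[i - 1]]


print(lms_positions("mmiissiissiippii$"))  # prints [2, 6, 10, 16]
```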
Definition 2: (LMS-Substring) An LMS-substring is (i) a substring STR[i..j], i ≠ j, with both
STR[i] and STR[j] being LMS characters and no other LMS character in between; or (ii) the sentinel
itself. If we treat the LMS-substrings as the elementary blocks of the string, we can sort all the
LMS-substrings efficiently, use the order index of each LMS-substring as its name, and replace
every LMS-substring in STR by its name. The string STR can then be represented by a shortened
string, denoted by R1, so the problem size is reduced and the problem can be solved faster in a
divide-and-conquer manner.
Definition 3: (Order of Substring) To determine the order of any two LMS-substrings, we compare
their corresponding characters from left to right. For each pair of characters, we compare their
lexicographical values first and, if the two characters have the same lexicographical value, their
types, where S-type has higher priority than L-type. From this definition, we see that two
LMS-substrings have the same order index, i.e., the same name, if they are the same in terms of
length, characters, and types. Assigning the S-type character a higher priority is based on a
property that follows directly from the definitions of L-type and S-type suffixes in [12]:
suffix(STR, i) > suffix(STR, j) if (1) STR[i] > STR[j], or (2) STR[i] = STR[j] and suffix(STR, i)
and suffix(STR, j) are S-type and L-type, respectively. Sorting the LMS-substrings requires no
extra physical space for storing them. We simply maintain a pointer array, denoted by P1, which
contains the pointers to all the LMS-substrings in STR and can be built by scanning STR or by
reading the Boolean array b once from right to left in O(n) time.
Definition 4: (Pointer Array P1) P1 is an array that contains the pointers to all the
LMS-substrings in STR, with their original positional order preserved. Once all the LMS-substrings
are sorted into buckets in lexicographical order, where all the LMS-substrings in a bucket are
identical, we name each item of the pointer array P1 by the index of its bucket, which yields a
new string R1. We say that two equal-size substrings STR[i..j] and STR[i′..j′] are identical if
and only if STR[i+k] = STR[i′+k] and b[i+k] = b[i′+k] for k ∈ [0, j-i].
C. Algorithm
SAR-IS(STR, SAR)
STR: input string;
SAR: output suffix array of STR;
b: array [0..n-1] of Boolean;
P1, R1: array [0..n1] of integer; n1 = ||R1||
BKT: array [0..||∑(STR)||-1] of integer;
Step 1. Scan STR once to classify all the characters as L-type or S-type into b;
Step 2. Scan b once to find all the LMS-substrings in STR into P1;
Step 3. Induced sort all the LMS-substrings using P1 and BKT;
Step 4. Name each LMS-substring in STR by its bucket index to get a new shortened string R1;
Step 5. if each character in R1 is unique then
Step 6. Directly compute SAR1 from R1;
Step 7. else
Step 8. SAR-IS(R1, SAR1); //Recursive call
Step 9. Induce SAR from SAR1;
Step 10. return
The above mentioned algorithm is the existing one.
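The final inducing step (Step 9) can be sketched as follows. This is an illustrative
reconstruction of the usual induced-sorting passes, not the authors' code: it assumes the LMS
suffixes are already sorted (here they are obtained with a plain sort as a stand-in for the
recursive call), and all function names are hypothetical.

```python
def classify_types(text):
    """b[i] is True for S-type, False for L-type (right-to-left scan)."""
    n = len(text)
    b = [False] * n
    b[n - 1] = True
    for i in range(n - 2, -1, -1):
        b[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and b[i + 1])
    return b


def bucket_heads_tails(text):
    """Return two dicts: first and last suffix-array slot of every character's bucket."""
    heads, tails, pos = {}, {}, 0
    for ch in sorted(set(text)):
        heads[ch] = pos
        pos += text.count(ch)
        tails[ch] = pos - 1
    return heads, tails


def induce(text, sorted_lms):
    """Induce the full suffix array from the sorted LMS suffixes."""
    n = len(text)
    b = classify_types(text)
    sar = [-1] * n  # -1 marks an empty slot, as in the existing algorithm
    # Pass A: place the sorted LMS suffixes at the tails of their buckets, keeping their order.
    _, tails = bucket_heads_tails(text)
    for i in reversed(sorted_lms):
        sar[tails[text[i]]] = i
        tails[text[i]] -= 1
    # Pass B: left-to-right scan; each entry j+1 places the L-type suffix j at a bucket head.
    heads, _ = bucket_heads_tails(text)
    for k in range(n):
        j = sar[k] - 1
        if sar[k] > 0 and not b[j]:
            sar[heads[text[j]]] = j
            heads[text[j]] += 1
    # Pass C: right-to-left scan; each entry j+1 places the S-type suffix j at a bucket tail.
    _, tails = bucket_heads_tails(text)
    for k in range(n - 1, -1, -1):
        j = sar[k] - 1
        if sar[k] > 0 and b[j]:
            sar[tails[text[j]]] = j
            tails[text[j]] -= 1
    return sar


text = "banana$"
types = classify_types(text)
lms = [i for i in range(1, len(text)) if types[i] and not types[i - 1]]
sorted_lms = sorted(lms, key=lambda i: text[i:])  # stand-in for the recursive solution
print(induce(text, sorted_lms))  # prints [6, 5, 3, 1, 0, 4, 2]
```

Recomputing bucket boundaries with text.count keeps the sketch short; an implementation aiming at
linear time would fill the bucket array BKT in a single counting pass instead.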
4. Proposed Algorithm
In SAR-IS, the additional working space consists mainly of the bucket counter array ‘BKT’ and the
type array ‘t’ (the array b of Section 2) at each recursion level. The proposed algorithm differs
from the existing one in two respects:
1. We use the most significant bit (MSB) of each suffix array entry to store the type of the
character (S-type or L-type), thereby avoiding the space needed for the type array ‘t’ of the
existing algorithm.
2. We reuse the unused space in SAR for the bucket array BKT.
We have observed that, for the standard suffix array datasets, the input STR is reduced to about
n/3 characters or fewer at the initial level (level 0). So we can use the unused space of SAR for
the array BKT at deeper levels rather than allocating memory with malloc. The existing algorithm
uses -1 as the initialization (default) value for the suffix array SAR. In the proposed algorithm
we use 0x7FFFFFFF as the initialization value instead, because the MSB is used to distinguish
S-type from L-type characters (as a 32-bit two's complement value, -1 has its MSB set and would
collide with the type flag). Here we assume a 32-bit machine on which an integer occupies 4 bytes.
The variable Buf_ptr records the start address of the unused space of SAR at the initial level
(i.e., level 0), so that this space can be reused at the following levels (i.e., from level 1
onwards) for the bucket array (see Fig 1). We can also use this space for the L/S-type arrays if
space is still available.
The space of SAR0 can be reused at level 1 because the size of the problem decreases as the level
increases.
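The MSB idea can be illustrated with a small sketch. It assumes that the type of STR[i] is kept in
the MSB of entry SAR[i], which the text does not spell out, and it simulates 32-bit entries with
masks because Python integers are unbounded. All names (EMPTY, TYPE_BIT, VALUE_MASK, and the
helper functions) are hypothetical.

```python
EMPTY = 0x7FFFFFFF       # initial value of every entry; its MSB is 0
TYPE_BIT = 0x80000000    # MSB of a 32-bit entry: 1 = S-type, 0 = L-type
VALUE_MASK = 0x7FFFFFFF  # low 31 bits hold a suffix index (or EMPTY)


def set_type(sar, i, is_s_type):
    """Record the type of character i in the MSB of sar[i], keeping the low 31 bits."""
    if is_s_type:
        sar[i] |= TYPE_BIT
    else:
        sar[i] &= VALUE_MASK


def get_type(sar, i):
    """Return True if the MSB of sar[i] marks an S-type character."""
    return bool(sar[i] & TYPE_BIT)


def set_value(sar, i, value):
    """Store a suffix index in the low 31 bits of sar[i], keeping the type bit."""
    sar[i] = (sar[i] & TYPE_BIT) | (value & VALUE_MASK)


def get_value(sar, i):
    """Return the low 31 bits of sar[i] (a suffix index, or EMPTY)."""
    return sar[i] & VALUE_MASK


def store_types(text):
    """Classify all characters right to left, writing the types into the MSBs."""
    n = len(text)
    sar = [EMPTY] * n
    set_type(sar, n - 1, True)  # the sentinel is S-type
    for i in range(n - 2, -1, -1):
        s_type = text[i] < text[i + 1] or (text[i] == text[i + 1] and get_type(sar, i + 1))
        set_type(sar, i, s_type)
    return sar


sar = store_types("banana$")
print("".join("S" if get_type(sar, i) else "L" for i in range(len(sar))))  # prints LSLSLLS
```

Under these assumptions only 31 bits remain for suffix indices, and EMPTY itself must not be a
valid index, so the input length is limited to at most 2^31 - 1 characters.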
4.1 Algorithm
SAR-IS(STR, SAR)
STR: input string;
SAR: output suffix array of STR;
P1, R1: array [0..n1] of integer; n1 = ||R1||
BKT: array [0..||∑(STR)||-1] of integer; //uses the unused space of SAR at subsequent levels
Buf_ptr: pointer to the unused space in SAR
Step 1. Scan STR once to classify all the characters as L-type or S-type into the MSBs of SAR;
Step 2. Scan the MSBs of SAR once to find all the LMS-substrings in STR into P1;
Step 3. if level is not equal to 0 then
BKT = Buf_ptr; //assign the start address of the unused buffer
Step 4. Induced sort all the LMS-substrings using P1 and BKT;
Step 5. Name each LMS-substring in STR by its bucket index to get a new shortened string R1;
Step 6. if level is equal to 0 then
Buf_ptr = start address of the unused space of SAR;
Step 7. if each character in R1 is unique then
Step 8. Directly compute SAR1 from R1;
Step 9. else SAR-IS(R1, SAR1); //Recursive call
Step 10. Scan STR once again to classify all the characters as L-type or S-type into the MSBs
of SAR;
Step 11. Induce SAR from SAR1;
Step 12. return
Fig 1. Example of the reuse of the buffer SAR
The reuse of the buffer is illustrated in Fig 1. The notations L 0, L 1, and L 2 stand for
Level-0, Level-1, and Level-2.
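The buffer reuse can be pictured with views into one array. The sketch below rests on an assumed
layout that the text does not state explicitly: SAR1 occupies the first n1 entries of SAR, R1 the
last n1 entries, and the entries in between are the unused space starting at buf_ptr. The function
name carve_level1_views and the parameter k1 (alphabet size of R1) are hypothetical.

```python
from array import array


def carve_level1_views(sar, n1, k1):
    """Return views (sar1, bkt1, r1) into sar plus buf_ptr, without allocating new storage.

    Assumed layout: [ SAR1 (n1) | unused space ... | R1 (n1) ].
    """
    n = len(sar)
    free = n - 2 * n1  # entries used neither by SAR1 nor by R1
    if k1 > free:
        raise ValueError("bucket array does not fit in the unused part of SAR")
    whole = memoryview(sar)
    buf_ptr = n1                        # start of the unused space
    sar1 = whole[:n1]                   # suffix array of the reduced problem
    bkt1 = whole[buf_ptr:buf_ptr + k1]  # level-1 bucket array placed in the unused space
    r1 = whole[n - n1:]                 # reduced string
    return sar1, bkt1, r1, buf_ptr


# n = 17, n1 = 4, k1 = 3 correspond to the example string "mmiissiissiippii$".
sar = array("i", [0] * 17)
sar1, bkt1, r1, buf_ptr = carve_level1_views(sar, 4, 3)
bkt1[0] = 7        # writing through the view changes the shared buffer
print(sar[buf_ptr])  # prints 7
```

When R1 has about n/3 characters or fewer, the unused region holds at least n/3 entries, which is
enough for a bucket array over the alphabet of R1, since that alphabet has at most n1 names.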
4.2 Experimental Results
The algorithm was implemented in VC++ using Microsoft Visual Studio on the Windows XP platform.
Table II and Fig 2 give an overview of the space consumed by the existing and the proposed
algorithms. The datasets in Table I used in our experiment were downloaded from the Canterbury
corpus [14] and the Manzini-Ferragina corpus [16].
Dataset            ||∑||    Characters
bible.txt          63       4047392
chr22.dna          4        34553758
e.coli             4        4638690
howto              197      39422105
world192.txt       94       2473400
sprot34.dat        66       109617186
etext99            146      105277340
rfc                120      116421901
rctail196          93       114711151
linux-2.4.5.tar    256      21508430
w3c2               256      104201579
alphabet           26       100000
random             26       100000
TABLE I Datasets used in the Experiment
[Fig 1 diagram labels: L 0, L 1, L 2; B0, SAR0; R1, SAR1, BKT1; R2, SAR2, BKT2]
[Fig 2 plot: logarithmic (base 2) axis with ticks from 1 to 1024; series: Existing Algorithm, Proposed Algorithm]
Dataset            Space (in megabytes)
                   Existing Algorithm    Proposed Algorithm
bible.txt          21.81                 20.10
chr22.dna          179.10                165.85
e.coli             25.25                 22.9
howto              204.47                189.11
world192.txt       13.61                 12.58
sprot34.dat        556.57                524.48
etext99            544.14                503.74
rfc                590.53                556.99
rctail196          577.29                548.81
linux-2.4.5.tar    130.82                103.53
w3c2               521.11                498.60
alphabet           1.35                  1.23
random             1.48                  1.23
TABLE II Space Consumed by the Existing and Proposed Algorithm
Fig 2. Logarithmic graph (base 2) showing the comparison between Existing and Proposed Algorithm
The datasets in Table I were downloaded from benchmark repositories for suffix array construction
algorithms (SACAs), namely the Canterbury corpus [14] and the Manzini-Ferragina corpus [16]. These
datasets have constant alphabets of size at most 256, and one byte is used for each character.
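The relative saving per dataset can be derived from the figures in Table II. The short script
below is only a worked calculation on those published numbers; the names space_mb and
saving_percent are hypothetical.

```python
# (existing, proposed) space in megabytes, copied from Table II
space_mb = {
    "bible.txt": (21.81, 20.10),
    "chr22.dna": (179.10, 165.85),
    "e.coli": (25.25, 22.9),
    "howto": (204.47, 189.11),
    "world192.txt": (13.61, 12.58),
    "sprot34.dat": (556.57, 524.48),
    "etext99": (544.14, 503.74),
    "rfc": (590.53, 556.99),
    "rctail196": (577.29, 548.81),
    "linux-2.4.5.tar": (130.82, 103.53),
    "w3c2": (521.11, 498.60),
    "alphabet": (1.35, 1.23),
    "random": (1.48, 1.23),
}


def saving_percent(existing: float, proposed: float) -> float:
    """Space saved by the proposed algorithm, as a percentage of the existing one."""
    return 100.0 * (existing - proposed) / existing


savings = {name: saving_percent(*pair) for name, pair in space_mb.items()}
for name, pct in sorted(savings.items(), key=lambda item: item[1]):
    print(f"{name:<16} {pct:5.1f}%")
```

On these numbers the saving ranges from roughly 4% (w3c2) to roughly 21% (linux-2.4.5.tar).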
4.3 Conclusions
The proposed algorithm saves space by using the MSB of each SAR entry to classify L-type and
S-type characters and by reusing the space of SAR for the bucket array at each level, thereby
reducing the space needed by roughly 4% to 21% on the datasets of Table II when compared to the
existing algorithm. The results for the various datasets are shown in Table II.
REFERENCES
[1] D.K. Kim, J.S. Sim, H. Park, and K. Park, “Linear-Time Construction of Suffix Arrays,” Proc.
Ann. Symp. Combinatorial Pattern Matching (CPM ’03), pp. 186-199, 2003.
[2] J. Karkkainen, P. Sanders, and S. Burkhardt, “Linear Work Suffix Array Construction,” J. ACM,
no. 6, pp. 918-936, Nov. 2006.
[3] U. Manber and G. Myers, “Suffix Arrays: A New Method for On-Line String Searches,” SIAM J.
Computing, vol. 22, no. 5, pp. 935-948, 1993.
[4] U. Manber and G. Myers, “Suffix Arrays: A New Method for On-Line String Searches,” Proc.
First Ann. ACM-SIAM Symp. Discrete Algorithms (SODA ’90), pp. 319-327, 1990.
[5] S.J. Puglisi, W.F. Smyth, and A.H. Turpin, “A Taxonomy of Suffix Array Construction
Algorithms,” ACM Computing Surveys, vol. 39, no. 2, pp. 1-31, 2007.
[6] R. Grossi and J.S. Vitter, “Compressed Suffix Arrays and Suffix Trees with Applications to Text
Indexing and String Matching,” Proc. Symp. Theory of Computing (STOC ’00), pp. 397-406,
2000.
[7] T.W. Lam, K. Sadakane, W.K. Sung, and S.M. Yiu, “A Space and Time Efficient Algorithm for
Constructing Compressed Suffix Arrays,” Proc. Int’l Conf. Computing and Combinatorics, pp.
401-410, 2002.
[8] G. Manzini and P. Ferragina, “Engineering a Lightweight Suffix Array Construction Algorithm,”
Algorithmica, vol. 40, no. 1, pp. 33- 50, Sept. 2004.
[9] S. Kurtz, “Reducing the Space Requirement of Suffix Trees,” Software Practice and Experience,
vol. 29, pp. 1149-1171, 1999.
[10] W.K. Hon, K. Sadakane, and W.K. Sung, “Breaking a Time-and-Space Barrier for Constructing
Full-Text Indices,” Proc. 44th Ann. IEEE Symp. Foundations of Computer Science (FOCS ’03),
pp. 251-260, 2003.
[11] J. Karkkainen and P. Sanders, “Simple Linear Work Suffix Array Construction,” Proc. 30th Int’l
Conf. Automata, Languages, and Programming (ICALP ’03), pp. 943-955, 2003.
[12] P. Ko and S. Aluru, “Space Efficient Linear Time Construction of Suffix Arrays,” Proc. Ann.
Symp. Combinatorial Pattern Matching (CPM ’03), pp. 200-210, 2003.
[13] P. Ko and S. Aluru, “Space-Efficient Linear Time Construction of Suffix Arrays,” J. Discrete
Algorithms, vol. 3, nos. 2-4, pp. 143-156, 2005.
[14] The Canterbury Corpus website. [Online]. Available: http://corpus.canterbury.ac.nz/.
[15] Ge Nong, Sen Zhang, and Wai Hong Chan, “Two Efficient Algorithms for Linear Time Suffix Array
Construction,” IEEE Transactions on Computers, vol. 60, pp. 1471-1484, Oct. 2011.
[16] Lightweight corpus datasets. [Online]. Available:
http://people.unipmn.it/manzini/lightweight/corpus

More Related Content

PDF
Flat unit 1
DOCX
Mc0082 theory of computer science
PDF
Flat unit 2
PDF
FLAT Notes
DOCX
Mc0082 theory of computer science
PPT
Theory of computing
PPTX
Theory of automata and formal language
DOC
Chapter 2 2 1 1
Flat unit 1
Mc0082 theory of computer science
Flat unit 2
FLAT Notes
Mc0082 theory of computer science
Theory of computing
Theory of automata and formal language
Chapter 2 2 1 1

What's hot (19)

PPT
Lecture 1,2
PPTX
Theory of computation Lec1
PPT
Theory of Automata
PPT
Lecture 3,4
PDF
02 representing position and orientation
PDF
Chapter1 Formal Language and Automata Theory
PPT
Lecture 7
PDF
Language
PPT
Lecture 8
PDF
Theory of Computation Lecture Notes
PPTX
Theory of automata and formal language
PDF
Ch3 4 regular expression and grammar
PPT
Theory of Automata Lesson 02
PPT
Lecture 5
PDF
Regular Expression
PPT
Lecture 6
PPT
Theory of computing
DOC
Generalized transition graphs
PPTX
Regular Expression in Compiler design
Lecture 1,2
Theory of computation Lec1
Theory of Automata
Lecture 3,4
02 representing position and orientation
Chapter1 Formal Language and Automata Theory
Lecture 7
Language
Lecture 8
Theory of Computation Lecture Notes
Theory of automata and formal language
Ch3 4 regular expression and grammar
Theory of Automata Lesson 02
Lecture 5
Regular Expression
Lecture 6
Theory of computing
Generalized transition graphs
Regular Expression in Compiler design
Ad

Similar to Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings (20)

PDF
A taxonomy of suffix array construction algorithms
PDF
Pattern Matching Part Two: Suffix Arrays
PDF
32 -longest-common-prefix
PPTX
Suffix Tree and Suffix Array
PDF
Data Representation of Strings
PDF
Suffix Array 構築方法の紹介
PDF
Sp 3828 10.1007-2_f978-3-642-31265-6_20
PDF
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
PPTX
Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
DOCX
A project on advanced C language
PDF
Cr25555560
PDF
Approximate Indexing: Gapped Suffix Array
KEY
SuffixArrayにまつわるソートアルゴリズムの話
PPTX
Boyer-Moore-algorithm-Vladimir.pptx
PPT
4888009.pptnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
PPTX
DOC
Compiler Design QA
PPT
String kmp
PDF
メモリより大きなデータの Sufix Array 構築方法の紹介
PDF
Iaetsd effective method for searching substrings in large databases
A taxonomy of suffix array construction algorithms
Pattern Matching Part Two: Suffix Arrays
32 -longest-common-prefix
Suffix Tree and Suffix Array
Data Representation of Strings
Suffix Array 構築方法の紹介
Sp 3828 10.1007-2_f978-3-642-31265-6_20
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
A project on advanced C language
Cr25555560
Approximate Indexing: Gapped Suffix Array
SuffixArrayにまつわるソートアルゴリズムの話
Boyer-Moore-algorithm-Vladimir.pptx
4888009.pptnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Compiler Design QA
String kmp
メモリより大きなデータの Sufix Array 構築方法の紹介
Iaetsd effective method for searching substrings in large databases
Ad

More from ijistjournal (20)

PDF
MATHEMATICAL EXPLANATION TO SOLUTION FOR EX-NOR PROBLEM USING MLFFN
PPTX
Call for Papers - International Journal of Information Sciences and Technique...
PDF
3rd International Conference on NLP, AI & Information Retrieval (NLAII 2025)
PDF
SURVEY ON LI-FI TECHNOLOGY AND ITS APPLICATIONS
PPTX
Research Article Submission - International Journal of Information Sciences a...
PDF
A BRIEF REVIEW OF SENTIMENT ANALYSIS METHODS
PDF
14th International Conference on Information Technology Convergence and Servi...
PPTX
Online Paper Submission - International Journal of Information Sciences and T...
PDF
New Era of Teaching Learning : 3D Marker Based Augmented Reality
PPTX
Submit Your Research Articles - International Journal of Information Sciences...
PDF
GOOGLE CLOUD MESSAGING (GCM): A LIGHT WEIGHT COMMUNICATION MECHANISM BETWEEN ...
PDF
6th International Conference on Artificial Intelligence and Machine Learning ...
PPTX
Call for Papers - International Journal of Information Sciences and Technique...
PDF
SURVEY OF ANDROID APPS FOR AGRICULTURE SECTOR
PDF
6th International Conference on Machine Learning Techniques and Data Science ...
PDF
International Journal of Information Sciences and Techniques (IJIST)
PPTX
Research Article Submission - International Journal of Information Sciences a...
PDF
SURVEY OF DATA MINING TECHNIQUES USED IN HEALTHCARE DOMAIN
PDF
International Journal of Information Sciences and Techniques (IJIST)
PPTX
Online Paper Submission - International Journal of Information Sciences and T...
MATHEMATICAL EXPLANATION TO SOLUTION FOR EX-NOR PROBLEM USING MLFFN
Call for Papers - International Journal of Information Sciences and Technique...
3rd International Conference on NLP, AI & Information Retrieval (NLAII 2025)
SURVEY ON LI-FI TECHNOLOGY AND ITS APPLICATIONS
Research Article Submission - International Journal of Information Sciences a...
A BRIEF REVIEW OF SENTIMENT ANALYSIS METHODS
14th International Conference on Information Technology Convergence and Servi...
Online Paper Submission - International Journal of Information Sciences and T...
New Era of Teaching Learning : 3D Marker Based Augmented Reality
Submit Your Research Articles - International Journal of Information Sciences...
GOOGLE CLOUD MESSAGING (GCM): A LIGHT WEIGHT COMMUNICATION MECHANISM BETWEEN ...
6th International Conference on Artificial Intelligence and Machine Learning ...
Call for Papers - International Journal of Information Sciences and Technique...
SURVEY OF ANDROID APPS FOR AGRICULTURE SECTOR
6th International Conference on Machine Learning Techniques and Data Science ...
International Journal of Information Sciences and Techniques (IJIST)
Research Article Submission - International Journal of Information Sciences a...
SURVEY OF DATA MINING TECHNIQUES USED IN HEALTHCARE DOMAIN
International Journal of Information Sciences and Techniques (IJIST)
Online Paper Submission - International Journal of Information Sciences and T...

Recently uploaded (20)

PDF
Well-logging-methods_new................
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
composite construction of structures.pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Welding lecture in detail for understanding
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
Sustainable Sites - Green Building Construction
PDF
PPT on Performance Review to get promotions
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Well-logging-methods_new................
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
CYBER-CRIMES AND SECURITY A guide to understanding
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
R24 SURVEYING LAB MANUAL for civil enggi
composite construction of structures.pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Welding lecture in detail for understanding
Embodied AI: Ushering in the Next Era of Intelligent Systems
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Foundation to blockchain - A guide to Blockchain Tech
Sustainable Sites - Green Building Construction
PPT on Performance Review to get promotions
Internet of Things (IOT) - A guide to understanding
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
OOP with Java - Java Introduction (Basics)
UNIT-1 - COAL BASED THERMAL POWER PLANTS

Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings

  • 1. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.4, July 2013 DOI : 10.5121/ijist.2013.3402 11 Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings Rajesh. Yelchuri1 , Nagamalleswara Rao.N2 Department of Computer Science and Engineering, R.V.R & J.C College of Engg. Chowdavaram, Guntur, Andhra Pradesh -522119,India 1 rajesh.yelchuri@gmail.com 2 nnmr_m@yahoo.com ABSTRACT This paper presents, an space efficient algorithm for linear time suffix array construction. The algorithm uses the techniques of divide-and-conquer, and recursion. What differentiates the proposed algorithm from the variable-length leftmost S-type (LMS) substrings is the efficient usage of the memory to construct the suffix array. The modified induced sorting algorithm for the variable-length LMS substrings uses efficient usage of the memory space than the existing variable length left most S-type(LMS) substrings algorithm KEYWORDS Divide and Conquer, Suffix Array. 1. INTRODUCTION This document describes, the concept of suffix arrays was introduced by Manber and Myers in SODA’90 [4] and SICOMP’93 [3] as a space efficient alternative to suffix trees. It has been well recognized as a fundamental data structure, useful for a broad range of applications, for e.g., string search, data indexing, searching for patterns in DNA or protein sequences, data compression and also in Burrows-Wheeler transformation. For an n-character string, denoted by STR, its suffix array, denoted by SAR(STR), is an array of indices pointing to all the suffixes of STR, sorted according to their ascending(or descending) lexicographical order. The suffix array of STR itself requires only n[log n]-bit space. However, different suffix array construction algorithms may require different space and time complexities. 
During the past decade, a many researches have been developing suffix array construction algorithms that are both time and space efficient, for which we suggest a detailed survey from Puglisi [5]. Time and space efficient suffix array construction algorithms has become popular because of their wide usage. Construction of suffix arrays are needed for large scale applications, e.g., biological genome database and web searching and, where the size of a huge data set is measured in billions of characters [6], [7], [8], [9], [10].Time and space efficient linear time algorithms are crucial for large-scale applications to have predictable worst-case performance. The three known algorithms are KSP [1],KA [12], [13],KS [11], [2] all are reported in 2003. 2. BASIC NOTATIONS In this section we bring out some basic terminology, used in the presentation of the algorithm. Let STR be a string of n characters in an array [0..n-1], and ∑(STR) be the alphabet of STR. To denote a substring in STR where i and j ranges from 0 to n-1,i<j, we denote it as STR[i..j]. For simplicity assume, STR is supposed to be terminated by a character called as sentinel and
  • 2. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.4, July 2013 12 represented by $, which is the unique lexicographically smallest character in STR. Let suffix(STR, i) be the suffix in STR starting at STR[i] and running to the end of the character array i.e. to the sentinel. A suffix suffix(STR,i) is called as S-type or L-type, if suffix(STR, i) < suffix(STR,i+1) or suffix(STR, i) > suffix(STR,i+1), respectively. The last suffix suffix(STR,n-1) consisting of only the single character $ (the sentinel) which is predefined as S-type. We can classify a character STR[i] to be S-type or L-type. To store the type of every character/ suffix, we introduce an n-bit Boolean array b, where b[i] records the type of character STR[i] as well as suffix(STR, i): 1 for S- type and 0 for L-type. From the S-type and L-type descriptions, we observed the following properties: Property 1:STR[i] is S-type, if STR[i] < STR[i+1] or STR[i]=STR[i+1] and suffix(STR,i+1) is S-type. Property 2:STR[i] is L-type, if STR[i] > STR[i+1] or STR[i]=STR[i+1] and suffix(STR,i+1) is L-type. By reading STR once from right to left, we can store the type of each character/suffix into type array ‘b’ in O(n) time. As defined earlier, SAR(STR) (the notation of SAR is used for it when there is no confusion in the context), i.e., the suffix array of STR, stores the indices of all the suffixes of STR according to their lexicographical order. We observe that the pointers for all the suffixes beginning with a same character must span successively. Let us call a sub array in SAR for all the suffixes with the same first character as a bucket, where the head and the tail of a bucket refer to the first and the last items of the bucket. 
There must be no tie between any two suffixes sharing the identical character but of different types i.e., in the same bucket, all the suffixes of the same type are grouped together and the S-type suffixes are to the right of the L-type suffixes [12], [13]. Therefore, each bucket can be divided into two sub-buckets with respect to the types of suffixes inside i.e. the L and S-type buckets, where the S-type bucket is on the right of the L-type bucket. 3. Existing Algorithm: INDUCED SORTING VARIABLE LENGTH LMS SUBSTRINGS A. Algorithm Framework The framework of existing linear time suffix array sorting algorithm SAR-IS[15] that samples and sorts the variable-length LMS-substrings, is given in section III-C. Lines 1 to 4 give the reduced problem, which is then again recursively solved by the lines 5-8, and finally from the solution of the reduced problem, Line 9 induces the final solution for the original problem. B. Basic Definitions We start by introducing the terms of leftmost S-type (LMS) character, suffix, and substring as follows: Definition 1:(LMS Character/Suffix) A character STR[i], iЄ[1,n-1] is called LMS, if STR[i] is S- type and STR[i-1] is L-type. A suffix suffix(STR,i) is called LMS, if STR[i] is an LMS character.
  • 3. International Journal of Information Sciences and Techniques (IJIST) Vol.3, No.4, July 2013 13 Definition 2: (LMS-Substring) An LMS-substring is (i) a substring STR[i..j] with both STR[i] and STR[j] being LMS characters and there exists no other LMS character in the substring, for i ≠ j; or (ii) the sentinel itself. If we treat the LMS-substrings as elementary blocks of the string, we can effortlessly sort all the LMS substrings, then by using the order index of each LMS substring as its name, and replace all of the LMS-substrings in STR by their names. Therefore, the string STR can be represented by a shortened string, denoted by R1, thus the problem size can be further minimized to fast up solving the problem in divide-and-conquer manner Definition 3: (Order of Substring) To find out the order of any two LMS-substrings, first compare their corresponding characters from left to right. For each pair of characters, compare their lexicographical values first and then their types, if the two characters are of the same lexicographical value, where the S-type is taken as highest priority than the L-type. From this definition ,we see that two LMS-substrings can be of the same order index, i.e., the same name, if they have same, in terms of the lengths, and the characters, and the types. Assigning the S-type character a higher priority is based on a property directly from the definitions of L-type and S- type suffixes in [12]: suffix(STR, i)> suffix(STR, j), if (1) STR[i] > STR[j], or (2) STR[i]=STR[j], suffix(STR, i)and suffix(STR, j) are S-type and L-type, respectively. To sort all the LMS-substrings, no excess physical space is essential for storing them. We simply maintain a pointer array, denoted by P1, which contains the pointers for all the LMS-substrings in STR and can be made by scanning STR or by reading the Boolean array b once from right to left in O(n) time. 
Definition 4: (Pointer Array P1) P1 is an array that holds the pointers to all the LMS-substrings in STR, with their original positional order preserved.

Suppose all the LMS-substrings have been sorted into buckets in lexicographical order, where all the LMS-substrings in a bucket are identical. We then name each item of the pointer array P1 by the index of its bucket, which yields a new string R1. We say that two equal-size substrings STR[i..j] and STR[i′..j′] are identical if and only if STR[i+k] = STR[i′+k] and b[i+k] = b[i′+k] for k ∈ [0, j-i].

C. Algorithm SAR-IS(STR, SAR)

STR: input string; SAR: output suffix array of STR;
b: array [0..n-1] of Boolean;
P1, R1: array [0..n1-1] of integer; n1 = ||R1||
BKT: array [0..||∑(STR)||-1] of integer;

Step 1. Scan STR once to classify all the characters as L-type or S-type into b;
Step 2. Scan b once to find all LMS-substrings in STR into P1;
Step 3. Induced sort all the LMS-substrings using P1 and BKT;
Step 4. Name each LMS-substring in STR by its bucket index to get a new shortened string R1;
Step 5. if each character in R1 is unique then
Step 6.     Directly compute SAR1 from R1;
Step 7. else
Step 8.     SAR-IS(R1, SAR1); // Recursive call
Step 9. Induce SAR from SAR1;
Step 10. return

The algorithm above is the existing one.

4. Proposed Algorithm

In SAR-IS, the additional working space at each recursion level consists mainly of the bucket counter array BKT and the type array b. Our proposed algorithm differs from the existing one in two ways:

1. We use the most significant bit (MSB) of each suffix array entry to store the type of the character (S-type or L-type), thereby avoiding the space needed for the type array in the existing algorithm.
2. We reuse the unused space in SAR for the bucket array BKT.

We have observed that, for the standard suffix array datasets, the input STR is reduced to about n/3 or less at the initial level (level 0). So, at deeper levels we can use the unused space of SAR for BKT rather than allocating memory with malloc.

The existing algorithm uses -1 as the initialization (default) value for the suffix array SAR. In the proposed algorithm we use 0x7FFFFFFF as the initialization value, because the MSB is used to mark S-type or L-type characters. Here we assume a 32-bit machine on which an integer occupies 4 bytes.

The variable Buf_ptr records the start address of the unused space of SAR at the initial level (level 0), so that this space can be reused at the following levels (from level 1 onward) for the bucket array (see Fig 1). If space is still available, it can also be used for the L-type or S-type arrays. The space of SAR0 can be reused at level 1 because the problem size decreases as the level increases.

4.1 Algorithm SAR-IS (STR, SAR)

STR: input string; SAR: output suffix array of STR;
P1, R1: array [0..n1-1] of integer; n1 = ||R1||
BKT: array [0..||∑(STR)||-1] of integer; // uses the unused space of SAR at levels after level 0
Buf_ptr: pointer to unused space in SAR

Step 1. Scan STR once to classify all the characters as L-type or S-type into the MSBs of SAR;
Step 2. Scan the MSBs of SAR once to find all LMS-substrings in STR into P1;
Step 3. if level ≠ 0 then BKT = Buf_ptr; // assign the start address of the unused buffer
Step 4. Induced sort all the LMS-substrings using P1 and BKT;
Step 5. Name each LMS-substring in STR by its bucket index to get a new shortened string R1;
Step 6. if level = 0 then assign the start address of the unused space of SAR to Buf_ptr;
Step 7. if each character in R1 is unique then
Step 8.     Directly compute SAR1 from R1;
Step 9. else SAR-IS(R1, SAR1); // Recursive call
Step 10. Scan STR once again to classify all the characters as L-type or S-type into the MSBs of SAR;
Step 11. Induce SAR from SAR1;
Step 12. return

[Figure: buffer layout at levels L 0, L 1 and L 2, with regions labelled B0, SAR0, R1, SAR1, BKT1, R2, SAR2 and BKT2]
Fig 1. Example of the reuse of the buffer SAR

The reuse of the buffer is illustrated in Fig 1. The notation L 0, L 1, L 2 stands for Level-0, Level-1, Level-2.

4.2 Experimental Results

The algorithm was implemented in VC++ using Microsoft Visual Studio on the Windows XP platform. Table II and Fig 2 give an overview of the space consumed by the existing and the proposed algorithms. The datasets in Table I used in our experiment were downloaded from Canterbury [14] and Manzini-Ferragina [16].

Dataset            ||∑||   Characters
bible.txt          63      4047392
chr22.dna          4       34553758
e.coli             4       4638690
howto              197     39422105
world192.txt       94      2473400
sprot34.dat        66      109617186
etext99            146     105277340
rfc                120     116421901
rctail196          93      114711151
linux-2.4.5.tar    256     21508430
w3c2               256     104201579
alphabet           26      100000
random             26      100000

TABLE I Datasets used in the Experiment
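The MSB convention of Section 4 (Steps 1 and 10 of the proposed algorithm) can be sketched as follows. This is a minimal Python illustration, not the authors' VC++ implementation. It assumes that the MSB of slot i holds the type of STR[i] and that the low 31 bits hold the suffix index, with 0x7FFFFFFF marking an empty slot; the paper states only the initialization value and the use of the MSB, so the slot assignment is an assumption. All function names are hypothetical, and Python integers are masked by hand to imitate 32-bit words.

```python
TYPE_BIT = 0x80000000    # MSB of a 32-bit word: 1 = S-type, 0 = L-type
VALUE_MASK = 0x7FFFFFFF  # low 31 bits carry the suffix index
EMPTY = 0x7FFFFFFF       # initialization value used instead of -1


def init_sar(n):
    """Every slot starts empty, with its type bit cleared."""
    return [EMPTY] * n


def set_type(sar, i, is_s_type):
    """Write the type bit of slot i without touching its low 31 bits."""
    if is_s_type:
        sar[i] |= TYPE_BIT
    else:
        sar[i] &= VALUE_MASK


def get_type(sar, i):
    return bool(sar[i] & TYPE_BIT)


def set_value(sar, i, value):
    """Write a suffix index into slot i while preserving its type bit."""
    sar[i] = (sar[i] & TYPE_BIT) | (value & VALUE_MASK)


def get_value(sar, i):
    return sar[i] & VALUE_MASK


def store_types(s, sar):
    """Classify s right to left, keeping the types only in the MSBs.

    Assumption: s ends with a unique smallest sentinel (S-type).
    """
    n = len(s)
    set_type(sar, n - 1, True)
    for i in range(n - 2, -1, -1):
        if s[i] < s[i + 1]:
            is_s = True
        elif s[i] > s[i + 1]:
            is_s = False
        else:
            is_s = get_type(sar, i + 1)
        set_type(sar, i, is_s)


if __name__ == "__main__":
    text = "mmiissiissiippii$"
    sar = init_sar(len(text))
    store_types(text, sar)
    print("".join("S" if get_type(sar, i) else "L" for i in range(len(text))))
```

Because the type bits live inside SAR itself, no separate Boolean array of n entries is allocated; the cost is that suffix indices are limited to 31 bits and every read or write of a slot has to mask the MSB.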
                   Space (in megabytes)
Dataset            Existing Algorithm   Proposed Algorithm
bible.txt          21.81                20.10
chr22.dna          179.10               165.85
e.coli             25.25                22.9
howto              204.47               189.11
world192.txt       13.61                12.58
sprot34.dat        556.57               524.48
etext99            544.14               503.74
rfc                590.53               556.99
rctail196          577.29               548.81
linux-2.4.5.tar    130.82               103.53
w3c2               521.11               498.60
alphabet           1.35                 1.23
random             1.48                 1.23

TABLE II Space Consumed by the Existing and Proposed Algorithm

[Figure: space in megabytes on a base-2 logarithmic axis from 1 to 1024, with the series Existing Algorithm and Proposed Algorithm]
Fig 2. Logarithmic graph (base 2) showing the comparison between Existing and Proposed Algorithm

The datasets in Table I were downloaded from the benchmark repositories for suffix array construction algorithms (SACAs), which include Canterbury [14] and Manzini-Ferragina [16]. These datasets have constant alphabets of size at most 256, and one byte is used for each character.
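To make the figures in Table II easier to compare, the relative saving per dataset can be computed as (existing - proposed) / existing. The short Python sketch below does this arithmetic on the values copied from Table II; the names `TABLE_II` and `savings_percent` are hypothetical and introduced only for this illustration.

```python
# Space in megabytes, copied from Table II: (existing, proposed).
TABLE_II = {
    "bible.txt": (21.81, 20.10),
    "chr22.dna": (179.10, 165.85),
    "e.coli": (25.25, 22.9),
    "howto": (204.47, 189.11),
    "world192.txt": (13.61, 12.58),
    "sprot34.dat": (556.57, 524.48),
    "etext99": (544.14, 503.74),
    "rfc": (590.53, 556.99),
    "rctail196": (577.29, 548.81),
    "linux-2.4.5.tar": (130.82, 103.53),
    "w3c2": (521.11, 498.60),
    "alphabet": (1.35, 1.23),
    "random": (1.48, 1.23),
}


def savings_percent(table):
    """Relative space saving of the proposed algorithm, in percent."""
    return {
        name: 100.0 * (existing - proposed) / existing
        for name, (existing, proposed) in table.items()
    }


if __name__ == "__main__":
    savings = savings_percent(TABLE_II)
    for name, pct in sorted(savings.items(), key=lambda item: item[1]):
        print(f"{name:16s} {pct:5.1f}%")
```

On these values the smallest saving is about 4.3% (w3c2) and the largest is about 20.9% (linux-2.4.5.tar).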
4.3 Conclusions

The proposed algorithm improves space efficiency by using the MSB of each SAR entry to classify L-type and S-type characters and by reusing the space of SAR for the bucket array at the deeper levels, thereby reducing the space needed by about 4% to 21% on the tested datasets when compared to the existing algorithm. The results for the various datasets are shown in Table II.

REFERENCES

[1] D.K. Kim, J.S. Sim, H. Park, and K. Park, “Linear-Time Construction of Suffix Arrays,” Proc. Ann. Symp. Combinatorial Pattern Matching (CPM ’03), pp. 186-199, 2003.
[2] J. Karkkainen, P. Sanders, and S. Burkhardt, “Linear Work Suffix Array Construction,” J. ACM, vol. 53, no. 6, pp. 918-936, Nov. 2006.
[3] U. Manber and G. Myers, “Suffix Arrays: A New Method for On-Line String Searches,” SIAM J. Computing, vol. 22, no. 5, pp. 935-948, 1993.
[4] U. Manber and G. Myers, “Suffix Arrays: A New Method for On-Line String Searches,” Proc. First Ann. ACM-SIAM Symp. Discrete Algorithms (SODA ’90), pp. 319-327, 1990.
[5] S.J. Puglisi, W.F. Smyth, and A.H. Turpin, “A Taxonomy of Suffix Array Construction Algorithms,” ACM Computing Surveys, vol. 39, no. 2, pp. 1-31, 2007.
[6] R. Grossi and J.S. Vitter, “Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching,” Proc. Symp. Theory of Computing (STOC ’00), pp. 397-406, 2000.
[7] T.W. Lam, K. Sadakane, W.K. Sung, and S.M. Yiu, “A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays,” Proc. Int’l Conf. Computing and Combinatorics, pp. 401-410, 2002.
[8] G. Manzini and P. Ferragina, “Engineering a Lightweight Suffix Array Construction Algorithm,” Algorithmica, vol. 40, no. 1, pp. 33-50, Sept. 2004.
[9] S. Kurtz, “Reducing the Space Requirement of Suffix Trees,” Software Practice and Experience, vol. 29, pp. 1149-1171, 1999.
[10] W.K. Hon, K. Sadakane, and W.K. Sung, “Breaking a Time-and-Space Barrier for Constructing Full-Text Indices,” Proc. 44th Ann. IEEE Symp. Foundations of Computer Science (FOCS ’03), pp. 251-260, 2003.
[11] J. Karkkainen and P. Sanders, “Simple Linear Work Suffix Array Construction,” Proc. 30th Int’l Conf. Automata, Languages, and Programming (ICALP ’03), pp. 943-955, 2003.
[12] P. Ko and S. Aluru, “Space Efficient Linear Time Construction of Suffix Arrays,” Proc. Ann. Symp. Combinatorial Pattern Matching (CPM ’03), pp. 200-210, 2003.
[13] P. Ko and S. Aluru, “Space-Efficient Linear Time Construction of Suffix Arrays,” J. Discrete Algorithms, vol. 3, nos. 2-4, pp. 143-156, 2005.
[14] The Canterbury Corpus website. [Online]. Available: http://corpus.canterbury.ac.nz/.
[15] G. Nong, S. Zhang, and W.H. Chan, “Two Efficient Algorithms for Linear Time Suffix Array Construction,” IEEE Trans. Computers, vol. 60, no. 10, pp. 1471-1484, Oct. 2011.
[16] Lightweight corpus datasets. [Online]. Available: http://people.unipmn.it/manzini/lightweight/corpus