International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME
42
OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM
Hemraj Kumawat
CSE,IIT Jodhpur, Jodhpur, India
Jitendra Chaudhary
CSE,IIT Jodhpur, Jodhpur, India
ABSTRACT
Data compression refers to reducing the amount of space needed to store data or the amount of
time needed to transmit it. Different compression techniques encode data with different
compression ratios. The LZ77 technique, in particular, reduces the redundancy of an input file, but
its output conveys more information than is needed in practice. Removing this extra information
from the encoded file makes the algorithm more efficient. Our task is to identify how much extra
information is conveyed and how to minimize it without causing trouble at decoding time. The
encoded output of LZ77 is a sequence of triplets (a structure of the encoded output) stored in
binary with a fixed size. Forcing every triplet to a fixed size sometimes creates unnecessary
information. We present a method of variable triplet size as a way to improve LZ77 compression and
demonstrate it through experiments on ten text files. Our optimized algorithm achieves a higher
compression ratio than the conventional LZ77 data compression algorithm.
Keywords:
Look-ahead Buffer: The look-ahead buffer contains the characters yet to be encoded. It starts
where the search buffer ends, and as the algorithm proceeds the search buffer advances into the
look-ahead buffer.
Match Length: The match length is the length of the longest block at the start of the look-ahead
buffer that matches a block beginning in the search buffer.
Triplet: Each encoded unit is a triplet consisting of the offset, the match length and the code
word of a character. If a match is found, the code word of the character following the match is
used; otherwise the code word of the current character itself is used.
Offset: The offset is the distance from the start of the look-ahead buffer back to the start of
the match in the search buffer.
Search Buffer: The search buffer holds the most recently encoded characters.
Sliding Window: The sliding window is the structure that holds the data being processed. It is
divided into two parts: the search buffer and the look-ahead buffer.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 5, September – October (2013), pp. 42-48
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
1. INTRODUCTION
The LZ77 algorithm achieves compression by replacing repeated occurrences of data with
references to a single copy of that data occurring earlier in the input (uncompressed) data stream. A
match is encoded by a pair of numbers called a length-distance pair. The conventions and
definitions of the terms used in this paper are listed under Keywords.
2. CONVENTIONAL LZ77 ALGORITHM
The LZ77 compression algorithm exploits the fact that words and phrases within a text file are
likely to be repeated. A repetition can be encoded as a pointer to an earlier
occurrence, accompanied by the number of characters to be matched. It is a very
simple adaptive scheme that requires no prior knowledge of the source and appears to require no
assumptions about the characteristics of the source.
In the LZ77 approach the dictionary is simply a portion of the previously encoded sequence.
The encoder examines the input sequence through a sliding window which consists of two parts: a
search buffer that contains a portion of the recently encoded sequence and a look-ahead buffer that
contains the next portion of the sequence to be encoded. The algorithm searches the sliding window
for the longest match with the beginning of the look-ahead buffer and outputs a reference (a pointer)
to that match. It is possible that there is no match at all, so the output cannot contain just pointers. In
LZ77 the reference is always represented as a triplet <o, l, c>, where ‘o’ is the offset to the match, ‘l’ is
the length of the match and ‘c’ is the next symbol after the match. If there is no match, the algorithm
outputs a null pointer (both the offset and the match length equal to 0) and the first symbol in the
look-ahead buffer. The offset and the match length must each be limited to some maximum
constant, so the lengths of the search buffer and the look-ahead buffer have to be defined for this
algorithm. The symbol is usually encoded in 8 bits. Moreover, the compression performance of LZ77
depends mainly on these values. Generally the search buffer is longer than the look-ahead buffer.
The total triplet size in bits is:
ST = ⌈log2(search buffer length)⌉ + ⌈log2(look-ahead buffer length)⌉ + 8
The encoding loop is:
while (look-ahead buffer not empty) {
    get a reference (position, length) to longest match;
    if (length > 0) {
        output (position, length, next symbol);
        shift the window length+1 positions along;
    }
    else {
        output (0, 0, first symbol in the look-ahead buffer);
        shift the window 1 character along;
    }
}
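The encoding loop can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `lz77_encode`, its parameters and the brute-force search are assumptions made for the example. A match is allowed to run into the look-ahead buffer, and one symbol is always held back so that every triplet carries a next symbol.

```python
def lz77_encode(data, search_size, lookahead_size):
    """Return a list of (offset, length, next_symbol) triplets (illustrative sketch)."""
    triplets = []
    pos = 0  # start of the look-ahead buffer
    while pos < len(data):
        # Hold one symbol back so the triplet always has a next symbol.
        max_len = min(lookahead_size, len(data) - pos) - 1
        best_offset, best_len = 0, 0
        # Try every offset that still lies inside the search buffer.
        for offset in range(1, min(search_size, pos) + 1):
            length = 0
            # The comparison may read past pos, so a match can overlap
            # the look-ahead buffer.
            while length < max_len and data[pos + length - offset] == data[pos + length]:
                length += 1
            if length > best_len:
                best_offset, best_len = offset, length
        triplets.append((best_offset, best_len, data[pos + best_len]))
        pos += best_len + 1  # shift the window past the match and the next symbol
    return triplets


print(lz77_encode("aacaacabcabaaac", 4, 6))
```

With a search buffer of 4 and a look-ahead buffer of 6, the string “aacaacabcabaaac” yields the five triplets <0,0,a>, <1,1,c>, <3,4,b>, <3,3,a> and <1,2,c>.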
The algorithm is easier to follow with an example. Take the input “aacaacabcabaaac”, with a
look-ahead buffer of size 6 and a search buffer of size 4.
Triplet       Binary
<0, 0, a>     00000011000001
<1, 1, c>     00100111000011
<3, 4, b>     01110011000010
<3, 3, a>     01101111000001
<1, 2, c>     00101011000011
[Figure: the sliding window (size 6) over the input at each step, marking the longest match and the next character.]
The triplet length for this example is 14 bits (3 + 3 + 8). The encoded binary string of this
example is the concatenation of the five 14-bit triplets:
0000001100000100100111000011011100110000100110111100000100101011000011
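As a rough check of the example, the triplets can be packed into fixed-width fields in Python. The helper `pack_triplets` and the `SYMBOL_CODES` table are hypothetical names introduced here; the 8-bit codes are simply copied from the example table, and 3 bits are used for both the offset and the match length as in the example.

```python
# 8-bit symbol codes as they appear in the example table (assumed mapping).
SYMBOL_CODES = {"a": "11000001", "b": "11000010", "c": "11000011"}


def pack_triplets(triplets, offset_bits, length_bits, codes=SYMBOL_CODES):
    """Concatenate fixed-width binary fields for each (offset, length, symbol)."""
    parts = []
    for offset, length, symbol in triplets:
        parts.append(format(offset, f"0{offset_bits}b"))  # zero-padded offset
        parts.append(format(length, f"0{length_bits}b"))  # zero-padded match length
        parts.append(codes[symbol])                       # 8-bit symbol code
    return "".join(parts)


triplets = [(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]
encoded = pack_triplets(triplets, offset_bits=3, length_bits=3)
print(encoded)
print(len(encoded))  # 70 = 5 triplets x 14 bits
```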
Decoding is much faster than encoding, because the decoder only has to advance its
pointer by a fixed length (the triplet length, 14 bits in this example) and performs no search for
matches; this is one of the important features of the scheme. Reading the string in this way
recovers the triplets, from which the original data is easily rebuilt by reversing the encoding process.
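Fixed-length decoding can be sketched as two small steps: cut the bit string into 14-bit triplets, then replay them. The names `unpack_triplets`, `lz77_decode` and the `SYMBOLS` table are assumptions for this illustration; the symbol codes are the ones used in the example.

```python
# Reverse lookup for the 8-bit codes used in the example (assumed mapping).
SYMBOLS = {"11000001": "a", "11000010": "b", "11000011": "c"}


def unpack_triplets(bits, offset_bits, length_bits, symbol_bits=8, symbols=SYMBOLS):
    """Split a bit string into fixed-size (offset, length, symbol) triplets."""
    step = offset_bits + length_bits + symbol_bits
    triplets = []
    for start in range(0, len(bits), step):
        chunk = bits[start:start + step]
        offset = int(chunk[:offset_bits], 2)
        length = int(chunk[offset_bits:offset_bits + length_bits], 2)
        symbol = symbols[chunk[offset_bits + length_bits:]]
        triplets.append((offset, length, symbol))
    return triplets


def lz77_decode(triplets):
    """Rebuild the text by copying each match and appending the next symbol."""
    out = []
    for offset, length, symbol in triplets:
        start = len(out) - offset
        for i in range(length):
            # Copy one symbol at a time so that overlapping matches work.
            out.append(out[start + i])
        out.append(symbol)
    return "".join(out)


bits = ("00000011000001" "00100111000011" "01110011000010"
        "01101111000001" "00101011000011")
print(lz77_decode(unpack_triplets(bits, offset_bits=3, length_bits=3)))
```

The symbol-by-symbol copy matters for a triplet such as <3, 4, b>, where the match is longer than the offset and therefore reads symbols that the same copy has just written.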
3. OPTIMIZATION OF LZ77
As described earlier, the triplets of the LZ77 algorithm have a fixed size. When the
offset is equal to the match length, we can replace the triplet with a
new structure <l, c>, where ‘l’ is the match length and ‘c’ is the next symbol after the
match. We call this new structure a “doublet” in this paper. Everything else is the same as in the
conventional LZ77 algorithm; the only change is that a doublet replaces the triplet whenever the
match length equals the offset. The doublet size in bits is:
SD = ⌈log2(look-ahead buffer length)⌉ + 8
Replacing a triplet with a doublet saves ⌈log2(search buffer length)⌉
bits per matching case. The exact algorithm is described in the block below.
while (look-ahead buffer not empty) {
    get a reference (position, length) to longest match;
    if (length > 0) {
        if (length == position) {
            output (length, next symbol);
            shift the window length+1 positions along;
        }
        else {
            output (position, length, next symbol);
            shift the window length+1 positions along;
        }
    }
    else {
        output (0, first symbol in the look-ahead buffer);
        shift the window 1 character along;
    }
}
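The same rule can be expressed as a post-processing step over conventional triplets. The sketch below assumes the triplets are already available as Python tuples; `to_doublets` is a hypothetical helper name. The no-match case <0, 0, c> needs no special handling, because its offset and length are both 0.

```python
def to_doublets(triplets):
    """Replace (offset, length, symbol) by (length, symbol) when offset == length."""
    items = []
    for offset, length, symbol in triplets:
        if offset == length:
            items.append((length, symbol))          # doublet: the offset is implied
        else:
            items.append((offset, length, symbol))  # triplet kept unchanged
    return items


triplets = [(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]
print(to_doublets(triplets))
```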
The optimized algorithm is easier to follow with the example from the previous section,
“aacaacabcabaaac”.
Triplet/doublet   Binary
<0, a>            00011000001
<1, c>            00111000011
<3, 4, b>         01110011000010
<3, a>            01111000001
<1, 2, c>         00101011000011
The triplet length for this example is 14 bits (3 + 3 + 8) and the doublet length is 11 bits (3 + 8).
The encoded binary string of this example is the concatenation of a doublet, a doublet, a triplet,
a doublet and a triplet:
0001100000100111000011011100110000100111100000100101011000011
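The saving in this example can be checked with a little bit accounting. The function `stream_bits` is a hypothetical helper that counts field widths only; it does not include any marker needed to tell doublets from triplets.

```python
def stream_bits(items, offset_bits, length_bits, symbol_bits=8):
    """Total size in bits of a mixed list of doublets and triplets."""
    total = 0
    for item in items:
        if len(item) == 2:   # doublet: length + symbol
            total += length_bits + symbol_bits
        else:                # triplet: offset + length + symbol
            total += offset_bits + length_bits + symbol_bits
    return total


conventional = [(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]
optimized = [(0, "a"), (1, "c"), (3, 4, "b"), (3, "a"), (1, 2, "c")]

before = stream_bits(conventional, offset_bits=3, length_bits=3)
after = stream_bits(optimized, offset_bits=3, length_bits=3)
print(before, after, before - after)  # 70 61 9
```

The difference of 9 bits is three doublets times the 3-bit offset field that each of them drops.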
Decoding is slower than conventional LZ77 decoding, because the
pointer now has to advance by a variable size (the doublet length or the triplet length). The
problem in decoding is how to tell a doublet from a triplet; the
procedure is described in the next section. Once the encoded file has been split into triplets and
doublets, the original data is easily recovered by reversing the encoding (data to triplet/doublet) process.
4. DECODING PROCESS
To decode a binary string that contains both doublets and triplets, we first have to identify
which substrings are doublets and which are triplets. For this we
use a delimiter whose length depends on the sliding window length. The
delimiter is placed before every doublet binary string.
The effective doublet length is: doublet size + delimiter size.
Every doublet substring now starts with the delimiter. To decode, we move a pointer
from the start of the binary string and check whether the next k bits are equal to the
delimiter, where k is the length of the delimiter. If they are, the substring of
effective doublet length starting at the pointer is a doublet and the pointer advances by that
length; otherwise the substring of triplet length is a triplet and the pointer advances by the triplet length.
The decoding algorithm is given below.
effective doublet length = doublet size + delimiter length
// delimiter is a binary string that depends on the offset length
start the moving pointer m at 0
while (m < total binary length) {
    if (substring from m to (m + delimiter length) == delimiter) {
        get the doublet binary substring from (m + delimiter length) to (m + effective doublet length);
        move the pointer m ahead by the effective doublet length;
    }
    else {
        get the triplet binary substring from m to (m + triplet length);
        move the pointer m ahead by the triplet length;
    }
}
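A minimal Python sketch of this splitting step follows. The function name `split_stream` and its parameters are assumptions for illustration; symbols are left as 8-bit code strings so that no particular character table is implied. The sample stream is the running example with the delimiter "1" placed before each doublet and 3-bit offset and length fields.

```python
def split_stream(bits, delimiter, offset_bits, length_bits, symbol_bits=8):
    """Split a bit string into doublets (length, code) and triplets (offset, length, code)."""
    doublet_bits = length_bits + symbol_bits
    triplet_bits = offset_bits + length_bits + symbol_bits
    items = []
    m = 0  # moving pointer
    while m < len(bits):
        if bits.startswith(delimiter, m):
            # Doublet: skip the delimiter, then read length and symbol code.
            body_start = m + len(delimiter)
            body = bits[body_start:body_start + doublet_bits]
            items.append((int(body[:length_bits], 2), body[length_bits:]))
            m = body_start + doublet_bits
        else:
            # Triplet: read offset, length and symbol code.
            body = bits[m:m + triplet_bits]
            offset = int(body[:offset_bits], 2)
            length = int(body[offset_bits:offset_bits + length_bits], 2)
            items.append((offset, length, body[offset_bits + length_bits:]))
            m += triplet_bits
    return items


stream = ("1" "00011000001"      # delimiter + doublet
          "1" "00111000011"      # delimiter + doublet
          "01110011000010"       # triplet
          "1" "01111000001"      # delimiter + doublet
          "00101011000011")      # triplet
items = split_stream(stream, delimiter="1", offset_bits=3, length_bits=3)
print(items)
```

This only works when no triplet can begin with the delimiter bits, which is why the choice of delimiter matters.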
The binary string of the example above, with the delimiter placed before each doublet (here the
delimiter is “1”), becomes:
1000110000011001110000110111001100001010111100000100101011000011
It consists of a doublet, a doublet, a triplet, a doublet and a triplet, in that order. The
substring of doublet length just after each delimiter is a doublet binary substring, and the
remaining substrings of triplet size are triplet binary substrings. The delimiters are easily found
by advancing the pointer by the appropriate length for each doublet and triplet.
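The delimited string can be reproduced with a small packer. `pack_items` and `SYMBOL_CODES` are hypothetical names; the 8-bit codes are copied from the example, and passing an empty delimiter gives the undelimited stream for comparison.

```python
# 8-bit symbol codes as they appear in the example (assumed mapping).
SYMBOL_CODES = {"a": "11000001", "b": "11000010", "c": "11000011"}


def pack_items(items, delimiter, offset_bits, length_bits, codes=SYMBOL_CODES):
    """Pack doublets and triplets, writing the delimiter before every doublet."""
    parts = []
    for item in items:
        if len(item) == 2:
            length, symbol = item
            parts.append(delimiter)  # marks the start of a doublet
        else:
            offset, length, symbol = item
            parts.append(format(offset, f"0{offset_bits}b"))
        parts.append(format(length, f"0{length_bits}b"))
        parts.append(codes[symbol])
    return "".join(parts)


items = [(0, "a"), (1, "c"), (3, 4, "b"), (3, "a"), (1, 2, "c")]
with_delimiter = pack_items(items, "1", offset_bits=3, length_bits=3)
without_delimiter = pack_items(items, "", offset_bits=3, length_bits=3)
print(with_delimiter, len(with_delimiter))
print(without_delimiter, len(without_delimiter))
```

For this example the delimiter costs one bit per doublet, so the stream grows from 61 to 64 bits, still below the 70 bits of the conventional encoding.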
This algorithm is not valid, however, when the first k bits of a triplet's offset are equal to the
delimiter, where k is the length of the delimiter: the binary string can then no longer be
decompressed correctly. For example, take a maximum search buffer length of 31, a look-ahead buffer
of 7 and the delimiter “111”.
Binary string:
1110001100000111100011000010111010101100110111100111000111
The actual layout is: delimiter + doublet (14 bits), delimiter + doublet (14 bits), triplet
(16 bits, with the 5-bit offset 11101), delimiter + doublet (14 bits).
Reading from the start, the first two substrings are doublets. The third substring is
actually a triplet, but the algorithm reads it as a doublet because its first
3 bits are equal to the delimiter. This corrupts the rest of the decoding process, and the
decoded output is wrong.
We therefore fix the delimiter size as:
Delimiter size = (number of leading consecutive 1’s in the binary representation of the sliding window length) + 1
All the bits in the delimiter are 1’s. With this delimiter string the decoding output is correct.
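The delimiter rule can be sketched as follows. This reads "sliding window length" as the maximum offset value written in the width of the offset field, which is an assumption; `delimiter_for` and `collides` are hypothetical helper names. The check inspects the offset field only, so it says nothing about delimiters that are longer than that field.

```python
def delimiter_for(max_offset, offset_bits):
    """All-1s delimiter, one bit longer than the run of leading 1s in max_offset."""
    binary = format(max_offset, f"0{offset_bits}b")
    leading_ones = len(binary) - len(binary.lstrip("1"))
    return "1" * (leading_ones + 1)


def collides(delimiter, max_offset, offset_bits):
    """True if some offset in 0..max_offset starts with the delimiter bits."""
    return any(
        format(offset, f"0{offset_bits}b").startswith(delimiter)
        for offset in range(max_offset + 1)
    )


# Maximum offset 3 in a 3-bit field (the running example) and the
# maximum offsets 111, 223 and 447 used in the experiments.
for max_offset, width in [(3, 3), (111, 7), (223, 8), (447, 9)]:
    d = delimiter_for(max_offset, width)
    print(max_offset, d, collides(d, max_offset, width))

# The failing case from the text: "111" with a maximum offset of 31.
print(collides("111", 31, 5))
```

For the maximum offsets 111, 223 and 447 the rule gives the delimiter "111", and no offset within range begins with those bits.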
5. ALGORITHMS ANALYSIS
C: total number of matches in which the offset equals the match length
CR: compression ratio
Op LZ77: optimized LZ77 algorithm
LZ77: conventional LZ77 algorithm
D: percentage difference between conventional LZ77 and optimized LZ77 = (Op LZ77 − LZ77) / LZ77 × 100
Max. Offset Value: maximum offset value = search buffer length
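As a worked instance of the definition of D, the sketch below applies it to the two compression ratios reported for file 5 at a maximum offset value of 111 (1.200 and 1.226). The function name `percent_difference` is an assumption. The result is about 2.17, close to the tabulated 2.180; a small gap is to be expected if the tabulated ratios are rounded to three decimals.

```python
def percent_difference(cr_lz77, cr_optimized):
    """D = (Op LZ77 - LZ77) / LZ77, expressed as a percentage."""
    return (cr_optimized - cr_lz77) / cr_lz77 * 100


d = percent_difference(1.200, 1.226)
print(round(d, 2))
```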
6. CONCLUSION
The analysis table shows that for a maximum offset value of 111 the average improvement in
compression ratio is 1.896%, for a maximum offset value of 223 it is 1.154%, and for a maximum
offset value of 447 it is 0.783%. As the maximum offset value increases, the improvement in
compression ratio decreases: a longer search buffer reduces the total number of
matches in which the offset equals the match length. The improvement is directly
proportional to the number of such matches, so it is larger when there are more of them.
In general, the optimized algorithm is most effective where the conventional LZ77 algorithm
performs poorly or where the input has little repetition.
Comparison between conventional LZ77 and Optimized LZ77

Serial   File size   Max. Offset Value = 111              Max. Offset Value = 223              Max. Offset Value = 447
Number   (in bytes)       C  CR LZ77  CR Op LZ77      D        C  CR LZ77  CR Op LZ77      D        C  CR LZ77  CR Op LZ77      D
1        61,305        2208    1.180       1.200  2.280     1099    1.293       1.313  1.547      621    1.400       1.417  1.130
2        4,21,144     11417    1.210       1.230  1.770     5514    1.318       1.333  1.153     3104    1.420       1.433  0.842
3        7,79,959     17044    1.197       1.213  1.370     6412    1.310       1.319  0.700     3710    1.386       1.393  0.521
4        73,246        2820    1.189       1.169  1.690     1398    1.265       1.275  0.800      742    1.334       1.338  0.341
5        1,09,684      3755    1.200       1.226  2.180     1857    1.317       1.336  1.469     1063    1.423       1.440  1.080
6        6,04,919     19281    1.264       1.291  2.160     8246    1.390       1.407  1.258     4053    1.507       1.519  0.800
7        1,51,610      3489    1.236       1.255  1.497     1430    1.349       1.360  0.830      779    1.453       1.462  0.591
8        1,30,725      3309    1.222       1.242  1.635     1494    1.347       1.360  1.020      750    1.465       1.475  0.660
9        4,27,180     14369    1.254       1.282  2.253     6673    1.389       1.409  1.437     3388    1.517       1.531  0.950
10       6,78,036     23537    1.157       1.182  2.130    10763    1.263       1.279  1.320     5683    1.362       1.374  0.903
Average               10123    1.211       1.229  1.896     4488    1.324       1.339  1.154     2389    1.426       1.438  0.783
REFERENCES
[1] IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-23, NO. 3, MAY 1977
337A Universal Algorithm for Sequential Data Compression
[2] International Symposium on Information Theory and its Applications, ISITA2006 Seoul,
Korea, October 29–November 1, 2006 Improving LZ77 Data Compression using Bit Recycling
[3] International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011 68 A
Comparative Study Of Text Compression Algorithms
[4] Northwestern University Department of Electrical and Computer Engineering ECE 428:
Information Theory Spring 2004
[5] http://www.stringology.org/DataCompression/lz77/index_en.html
[6] http://www.zlib.net/feldspar.html
Links to the text files used in the analysis are given below (sorted by the serial number in the table):
1. http://www.gutenberg.org/cache/epub/28466/pg28466.txt
2. http://www.gutenberg.org/cache/epub/16728/pg16728.txt
3. http://www.gutenberg.org/cache/epub/9173/pg9173.txt
4. http://www.gutenberg.org/cache/epub/32482/pg32482.txt
5. http://www.gutenberg.org/files/25731/25731-0.txt
6. http://www.gutenberg.org/cache/epub/26598/pg26598.txt
7. http://www.gutenberg.org/cache/epub/28569/pg28569.txt
8. http://www.gutenberg.org/cache/epub/32962/pg32962.txt
9. http://www.gutenberg.org/cache/epub/23319/pg23319.txt
10. http://www.gutenberg.org/cache/epub/101/pg101.txt

More Related Content

PDF
Deep Learning for Natural Language Processing
PDF
Scaling 5G to new frontiers with NR-Light (RedCap)
PDF
6G Training Course Part 8: 6G Devices
PPTX
Intel’s core i7
PDF
1. FPGA architectures.pdf
PPTX
Natural language processing: feature extraction
PDF
ARM Architecture
PPT
Core 2 processors
Deep Learning for Natural Language Processing
Scaling 5G to new frontiers with NR-Light (RedCap)
6G Training Course Part 8: 6G Devices
Intel’s core i7
1. FPGA architectures.pdf
Natural language processing: feature extraction
ARM Architecture
Core 2 processors

What's hot (20)

PDF
Mini Project on 4 BIT SERIAL MULTIPLIER
PDF
Unit 1 Introduction to Embedded computing and ARM processor
PPTX
Lte basic
PPTX
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
PPTX
Sequence to Sequence Learning with Neural Networks
PPT
L5 Adders
PPTX
Regularization in deep learning
PPTX
Digital signal processing
PDF
BERT - Part 1 Learning Notes of Senthil Kumar
PDF
GPRS : Architecture and Applications
PDF
Unit II arm 7 Instruction Set
PPT
Gsm presentation shaikot
ODP
Control Area Network
PPTX
EC8004 wireless networks unit 1
PPTX
Controller area network -ppt
PPT
Information Theory MSU-EEE.ppt
PPTX
Intel i7
PDF
Emerging Non-Volatile Memory 2020 report by Yole Développement
PDF
PR-132: SSD: Single Shot MultiBox Detector
PPTX
5 g nr numerology
Mini Project on 4 BIT SERIAL MULTIPLIER
Unit 1 Introduction to Embedded computing and ARM processor
Lte basic
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
Sequence to Sequence Learning with Neural Networks
L5 Adders
Regularization in deep learning
Digital signal processing
BERT - Part 1 Learning Notes of Senthil Kumar
GPRS : Architecture and Applications
Unit II arm 7 Instruction Set
Gsm presentation shaikot
Control Area Network
EC8004 wireless networks unit 1
Controller area network -ppt
Information Theory MSU-EEE.ppt
Intel i7
Emerging Non-Volatile Memory 2020 report by Yole Développement
PR-132: SSD: Single Shot MultiBox Detector
5 g nr numerology
Ad

Viewers also liked (20)

PPT
Lz77 (sliding window)
PDF
Lz77 / Lempel-Ziv Algorithm
PPTX
Text compression in LZW and Flate
PPTX
PPTX
Lzw compression ppt
PPTX
Lzw compression
PDF
Dictionary Based Compression
PPT
Compression
PPTX
Source coding
PPTX
Lzw algorithm
PPTX
Multiple object detection
PPTX
4 data compression
PDF
Image compression
PPTX
Compression project presentation
PPT
Lzw coding technique for image compression
PDF
Microprocessor-Based Automatic Door Opener
PPTX
Data compression
PPTX
Image compression
PPTX
Digital Communication Techniques
PPTX
Data compression
Lz77 (sliding window)
Lz77 / Lempel-Ziv Algorithm
Text compression in LZW and Flate
Lzw compression ppt
Lzw compression
Dictionary Based Compression
Compression
Source coding
Lzw algorithm
Multiple object detection
4 data compression
Image compression
Compression project presentation
Lzw coding technique for image compression
Microprocessor-Based Automatic Door Opener
Data compression
Image compression
Digital Communication Techniques
Data compression
Ad

Similar to OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM (20)

PDF
Efficient text compression using special character replacement
PDF
50120130406023
PDF
Lossless Data Compression Using Rice Algorithm Based On Curve Fitting Technique
PDF
Optimization of Latency of Temporal Key Integrity Protocol (TKIP) Using Graph...
PDF
Optimization of latency of temporal key Integrity protocol (tkip) using graph...
PDF
Lossless LZW Data Compression Algorithm on CUDA
PDF
Hardware implementation of the serpent block cipher using fpga technology
PDF
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
PDF
50120140506013 2
PDF
Synchronizing Parallel Tasks Using STM
PDF
Arm recognition encryption by using aes algorithm
PPT
Introduction to Algorithms
PDF
A comprehensive study of non blocking joining techniques
PDF
A comprehensive study of non blocking joining technique
PDF
Analysing space complexity of various encryption algorithms 2
PDF
Pipelining Architecture of AES Encryption and Key Generation with Search Base...
PDF
Pipelining Architecture of AES Encryption and Key Generation with Search Base...
PDF
FPGA and ASIC Implementation of Speech Encryption and Decryption using AES Al...
PDF
A Comparative Study of RSA and ECC and Implementation of ECC on Embedded Systems
PDF
IRJET - Multi-Key Privacy in Cloud Computing
Efficient text compression using special character replacement
50120130406023
Lossless Data Compression Using Rice Algorithm Based On Curve Fitting Technique
Optimization of Latency of Temporal Key Integrity Protocol (TKIP) Using Graph...
Optimization of latency of temporal key Integrity protocol (tkip) using graph...
Lossless LZW Data Compression Algorithm on CUDA
Hardware implementation of the serpent block cipher using fpga technology
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
50120140506013 2
Synchronizing Parallel Tasks Using STM
Arm recognition encryption by using aes algorithm
Introduction to Algorithms
A comprehensive study of non blocking joining techniques
A comprehensive study of non blocking joining technique
Analysing space complexity of various encryption algorithms 2
Pipelining Architecture of AES Encryption and Key Generation with Search Base...
Pipelining Architecture of AES Encryption and Key Generation with Search Base...
FPGA and ASIC Implementation of Speech Encryption and Decryption using AES Al...
A Comparative Study of RSA and ECC and Implementation of ECC on Embedded Systems
IRJET - Multi-Key Privacy in Cloud Computing

Recently uploaded (20)

PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
1_Introduction to advance data techniques.pptx
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
Foundation of Data Science unit number two notes
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PDF
Fluorescence-microscope_Botany_detailed content
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
Mega Projects Data Mega Projects Data
PPTX
Database Infoormation System (DBIS).pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Major-Components-ofNKJNNKNKNKNKronment.pptx
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Business Ppt On Nestle.pptx huunnnhhgfvu
.pdf is not working space design for the following data for the following dat...
1_Introduction to advance data techniques.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Clinical guidelines as a resource for EBP(1).pdf
Foundation of Data Science unit number two notes
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Fluorescence-microscope_Botany_detailed content
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Mega Projects Data Mega Projects Data
Database Infoormation System (DBIS).pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
Supervised vs unsupervised machine learning algorithms
Miokarditis (Inflamasi pada Otot Jantung)
Data_Analytics_and_PowerBI_Presentation.pptx
Major-Components-ofNKJNNKNKNKNKronment.pptx
Galatica Smart Energy Infrastructure Startup Pitch Deck
oil_refinery_comprehensive_20250804084928 (1).pptx
climate analysis of Dhaka ,Banglades.pptx

OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

  • 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 42 OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM Hemraj Kumawat CSE,IIT Jodhpur, Jodhpur, India Jitendra Chaudhary CSE,IIT Jodhpur, Jodhpur, India ABSTRACT Data compression refers to reducing the amount of space needed to store data or reducing the amount of time needed to transmit data. Many data compression techniques allow encoding the compressed form of data with different compression ratio. In particular, in the case of LZ77 technique, it reduces the data concurrency of an input file. In the output of this technique it conveys more information that is actually not needed in practical. Removing the extra information from the encoded file that makes this algorithm more optimal. Our task is to identify how much extra information it conveys and how can we minimize it so that there is no trouble at the time of decoding. Basically the encoded output of LZ77 is the sequence of triplets (a structure of encoded output) that is in binary and having fix size. For making the triplets of fix size, sometimes we are creating unnecessary information. We present the method of variable triplet size as a way to improve LZ77 compression and demonstrate it through many experiments. In our optimization algorithm we are getting more compression ratio compare to the conventional LZ77 data compression algorithm. Keywords: Look-ahead Buffer: The look-ahead buffer contains characters yet to be encoded. This buffer starts where the Search buffer ends and during the algorithm the Search buffer extends into the look-ahead buffer. Match Length: The Match Length is the length of largest matching block in the look-ahead buffer. These pairs are called triplets, consisting of offset, matching length and code word of character. 
If the character is matching then next character code word is used, otherwise same character code word is used. Offset: The actual distance between the current position of the pointer and the look-ahead buffer is known as offset. Search Buffer: The Search Buffer represents the most recently encoded characters. Sliding Window: The Structure for Data manipulation, in which the Data is held The Sliding Window, is divided into two parts as Search buffer and look-ahead buffer. INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 5, September – October (2013), pp. 42-48 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com IJCET © I A E M E
  • 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 43 1. INTRODUCTION LZ77 algorithm achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream. A match is encoded by a pair of numbers called a length-distance pair. Some common convention and definition of the words that we are using in this paper. 2. CONVENTIONAL LZ77 ALGORITHM LZ77 compression algorithm exploits the fact that words and phrases within a text file are likely to be repeated. When there is repetition, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched. It is a very simple adaptive scheme that requires no prior knowledge of the source and seems to require no assumptions about the characteristics of the source. In the LZ77 approach the dictionary is simply a portion of the previously encoded sequence. The encoder examines the input sequence through a sliding window which consists of two parts: a search buffer that contains a portion of the recently encoded sequence and a look ahead buffer that contains the next portion of the sequence to be encoded. The algorithm searches the sliding window for the longest match with the beginning of the look-ahead buffer and outputs a reference (a pointer) to that match. It is possible that there is no match at all, so the output cannot contain just pointers. In LZ77 the reference is always represented as a triplet<o,l,c>, where ‘o’ is an offset to the match, ‘l’ is length of the match and ‘c’ is the next symbol after the match. If there is no match, the algorithm outputs a null-pointer (both the offset and the match length equal to 0) and the first symbol in the look-ahead buffer. 
The values of an offset to a match and length must be limited to some maximum constant. For this algorithm we have to define the length of the look-ahead buffer, search buffer. The symbol is usually encoded in 8 bit. More over the compression performance of LZ77 mainly depends on these values. Generally the search buffer length is more than the look-ahead-buffer size. So the total triplet size: While ( look-ahead Buffer not empty) { get a reference (position, length) to longest match; if (length > 0) { output (position, length, next symbol); shift the window length+1 positions along; } else { output (0, 0, first symbol in the look-ahead buffer); shift the window 1 character along; } } ST= [⌈log2(search buffer length)⌉] +[⌈log2(look-ahead buffer length)⌉]+8
  • 3. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 44 We can have better understanding with an example- “aacaacabcabaaac” For this example the size of look-ahead buffer is 6 and search buffer is 4. Triple Binary -- <0, 0, a> 00000011000001 -- <1, 1, c> 00100111000011 -- < 3, 4, b> 01110011000010 --< 3, 3, a> 01101111000001 -- <1, 2, c> 00101011000011 Sliding window( Size: 6 ) Longest match Next Character The triplet length for this example is 14(3+3+8). So here the encoded binary string of this example. 0000001100000100100111000011011100110000100110111100000100101011000011 triplet triplet triplet triplet triplet -------------------|-------------------|-----------------------|----------------------|--------------------| The decoding is much faster than the encoding in this process because we have to move our pointer with fixed length (triple length-14 for this example) and it is one of the important features of this process. From this way we get the triplets .Now we can easily decode to original data by reversing the encoding process. 3. OPTIMIZATION OF LZ77 As we have described earlier the triplets of LZ77 algorithm have fix size. In the case, when offset is equal to the matching length we can modify the structure of triplets and represent it with the new structure that have only <l,c> where l is the matching length and ‘c’ is the next symbol after the match .We called this new structure as “doublet” in this paper .All the things are same as the conventional LZ77 algorithm except replacing the triplet with doublet in the case of matching length equal to offset length. ࡿࡰ= log2(look-ahead buffer length)⌉] + 8 By replacing the triplet with doublet we are saving [⌈log2(search buffer length)⌉] number of bits per matching case. 
The exact algorithms is described in the block a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c a a c a a c a b c a b a a a c
  • 4. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME 45 While ( look-ahead Buffer not empty) { get a reference (position, length) to longest match; if (length > 0) { If (length== position){ output (length, next symbol); shift the window length+1 positions along; } else{ output (position, length, next symbol); shift the window length+1 positions along; }} else { output (0, first symbol in the look-ahead buffer); shift the window 1 character along; } } We will have better understanding of this optimized algorithm with the previous section example- “aacaacabcabaaac” Triplet/doublet Binary --<0, a> 00011000001 --<1, c> 00111000011 --< 3, 4, b> 01110011000010 --< 3, a> 01111000001 --<1, 2, c> 00101011000011 The length of triplet for this example is 14(3+3+8) and the doublet length is 11(3+8). So here the encoded binary string of this example 0001100000100111000011011100110000100111100000100101011000011 doublet doublet triplet doublet triplet -----------------|------------------|---------------|-----------------|-----------------------| The decoding process is slower than the conventional LZ77 Decoding. In this algorithm we have to move our pointer with the variable size (doublet and triplet length) .But the problem in decoding is how we will identify which one is doublet and which one is triplet. So the decoding process is described in the next section. Once we decode encoded file to triplets and doublet then we can easily get to original data by reversing the encoding (data to triplet/doublet) process. a a c a A c A b c a b a A a c a a c a A c A b c a b a A a c a a c a A c a b c a b a A a c a a c a A c A b c a b a a a c a a c a A c A b c a b a A a c
4. DECODING PROCESS

To decode a binary string that contains both doublets and triplets, we first have to identify which substrings are doublets and which are triplets. To do this we use a delimiter that depends on the sliding window length, and we put the delimiter before each doublet's binary string. The effective doublet length is: doublet size + delimiter size. Each doublet's binary substring now starts with the delimiter. To decode, we move a pointer from the start of the binary string and check whether the next k bits are equal to the delimiter, where k is the length of the delimiter. If they are, we move the pointer ahead by the effective doublet length and treat this substring as a doublet; otherwise we move the pointer ahead by the triplet length and treat this substring as a triplet. The decoding algorithm is given below.

effective doublet length = doublet size + delimiter length
// the delimiter is a binary string that depends on the offset length
start the moving pointer m at 0
while (m < total binary length) {
    if (substring from m to (m + delimiter length) == delimiter) {
        move the pointer m ahead by the effective doublet length
        take the doublet binary substring from (m - doublet size) to m
    } else {
        move the pointer m ahead by the triplet length
        take the triplet binary substring from (m - triplet length) to m
    }
}

For the example above, the binary string after placing the delimiter before each doublet is as follows (here the delimiter is "1"):

1000110000011001110000110111001100001010111100000100101011000011

which splits as:

doublet          doublet          triplet          doublet          triplet
1 00011000001 |  1 00111000011 |  01110011000010 | 1 01111000001 |  00101011000011

The substring of doublet length immediately after each delimiter (shown in blue in the original figure) is a doublet, and the remaining substrings of triplet size are triplets. The delimiter is easily identified by moving the pointer by the appropriate length for each doublet and triplet.

But what happens when the first k bits of a triplet's offset are equal to the delimiter, where k is the length of the delimiter? In that case this algorithm is not valid: we cannot decompress the binary string with this delimiter. Consider an example with maximum search buffer length = 31, look-ahead buffer = 7, and delimiter = "111":

Binary string: 1110001100000111100011000010111010101100110111100111000111

doublet            doublet            triplet            doublet
111 00011000001 |  111 00011000010 |  1110101011001101 | 111 00111000111

Reading from the start, the first two substrings are doublets. The third substring is actually a triplet, but our algorithm reads it as a doublet because its first
3 bits are equal to the delimiter. This affects the rest of the decoding process, and eventually we get the wrong decoded output. We therefore bound the delimiter size by:

Delimiter size = (number of leading consecutive 1's in the binary representation of the sliding window length) + 1

All the bits in the delimiter are 1's. With this delimiter string we get the correct decoded output.

5. ALGORITHMS ANALYSIS

C: total number of matches with offset = match length
CR: compression ratio
Op LZ77: optimized LZ77 algorithm
LZ77: conventional LZ77 algorithm
D: percentage difference between conventional LZ77 and optimized LZ77 = (Op LZ77 - LZ77) / LZ77 × 100
Max. Offset Value: maximum offset value = search buffer length

Comparison between conventional LZ77 and Optimized LZ77

Serial   File size  | Max. Offset Value = 111       | Max. Offset Value = 223       | Max. Offset Value = 447
number   (bytes)    | C      CR     CR       D      | C      CR     CR       D      | C      CR     CR       D
                    |        LZ77   Op LZ77  (%)    |        LZ77   Op LZ77  (%)    |        LZ77   Op LZ77  (%)
1        61,305     | 2208   1.180  1.200    2.280  | 1099   1.293  1.313    1.547  | 621    1.400  1.417    1.130
2        4,21,144   | 11417  1.210  1.230    1.770  | 5514   1.318  1.333    1.153  | 3104   1.420  1.433    0.842
3        7,79,959   | 17044  1.197  1.213    1.370  | 6412   1.310  1.319    0.700  | 3710   1.386  1.393    0.521
4        73,246     | 2820   1.189  1.169    1.690  | 1398   1.265  1.275    0.800  | 742    1.334  1.338    0.341
5        1,09,684   | 3755   1.200  1.226    2.180  | 1857   1.317  1.336    1.469  | 1063   1.423  1.440    1.080
6        6,04,919   | 19281  1.264  1.291    2.160  | 8246   1.390  1.407    1.258  | 4053   1.507  1.519    0.800
7        1,51,610   | 3489   1.236  1.255    1.497  | 1430   1.349  1.360    0.830  | 779    1.453  1.462    0.591
8        1,30,725   | 3309   1.222  1.242    1.635  | 1494   1.347  1.360    1.020  | 750    1.465  1.475    0.660
9        4,27,180   | 14369  1.254  1.282    2.253  | 6673   1.389  1.409    1.437  | 3388   1.517  1.531    0.950
10       6,78,036   | 23537  1.157  1.182    2.130  | 10763  1.263  1.279    1.320  | 5683   1.362  1.374    0.903
Average             | 10123  1.211  1.229    1.896  | 4488   1.324  1.339    1.154  | 2389   1.426  1.438    0.783

6. CONCLUSION

As the analysis above shows, for a maximum offset value of 111 the average improvement in compression ratio (D) is 1.896%, and for a maximum offset value of 223 it is 1.154%. From this result we can see that as the maximum offset value increases, the improvement in compression ratio decreases. When we increase the search buffer length, the total number of matches with offset = match length decreases. The improvement in compression is directly proportional to the number of such matches: the more matches there are, the greater the improvement. So in general this optimized algorithm is most effective where either the conventional LZ77 algorithm is not very effective or the input has little repetition.
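As an illustration of the delimiter-based decoding loop of Section 4, the sketch below splits a delimited bit string into doublets and triplets and then expands them back into text. It is a hedged sketch, not the authors' code: the names `split_tokens` and `expand` are hypothetical, the symbol table and the 3/3/8-bit field widths are copied from the worked example, and the one-bit delimiter "1" is safe here only because every triplet in that example starts with a 0 bit.

```python
# Inverse of the example's symbol code table (assumption: copied from the example).
SYMBOL = {"11000001": "a", "11000010": "b", "11000011": "c"}


def split_tokens(bits, delimiter="1", offset_bits=3, length_bits=3, char_bits=8):
    """Walk the bit string and return (offset, length, symbol) tuples."""
    tokens, m = [], 0
    k = len(delimiter)
    while m < len(bits):
        if bits[m:m + k] == delimiter:
            # Doublet: skip the delimiter; the offset equals the match length.
            m += k
            length = int(bits[m:m + length_bits], 2)
            offset = length
            m += length_bits
        else:
            # Triplet: explicit offset field followed by the length field.
            offset = int(bits[m:m + offset_bits], 2)
            m += offset_bits
            length = int(bits[m:m + length_bits], 2)
            m += length_bits
        symbol = SYMBOL[bits[m:m + char_bits]]
        m += char_bits
        tokens.append((offset, length, symbol))
    return tokens


def expand(tokens):
    """Rebuild the text by copying each match and appending the next symbol."""
    out = []
    for offset, length, symbol in tokens:
        start = len(out) - offset
        for i in range(length):
            out.append(out[start + i])  # copying one symbol at a time allows overlap
        out.append(symbol)
    return "".join(out)


encoded = "1000110000011001110000110111001100001010111100000100101011000011"
decoded_tokens = split_tokens(encoded)
print(decoded_tokens)
print(expand(decoded_tokens))
```

A longer delimiter is passed through the `delimiter` parameter; the sketch does not itself check that the chosen delimiter can never coincide with the leading bits of a triplet.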
REFERENCES

[1] "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. IT-23, no. 3, May 1977, p. 337.
[2] "Improving LZ77 Data Compression using Bit Recycling," International Symposium on Information Theory and its Applications (ISITA2006), Seoul, Korea, October 29–November 1, 2006.
[3] "A Comparative Study of Text Compression Algorithms," International Journal of Wisdom Based Computing, vol. 1 (3), December 2011, p. 68.
[4] Northwestern University, Department of Electrical and Computer Engineering, ECE 428: Information Theory, Spring 2004.
[5] http://www.stringology.org/DataCompression/lz77/index_en.html
[6] http://www.zlib.net/feldspar.html

Links to the text files used in the analysis are given below (sorted by the index in the table):

1. http://www.gutenberg.org/cache/epub/28466/pg28466.txt
2. http://www.gutenberg.org/cache/epub/16728/pg16728.txt
3. http://www.gutenberg.org/cache/epub/9173/pg9173.txt
4. http://www.gutenberg.org/cache/epub/32482/pg32482.txt
5. http://www.gutenberg.org/files/25731/25731-0.txt
6. http://www.gutenberg.org/cache/epub/26598/pg26598.txt
7. http://www.gutenberg.org/cache/epub/28569/pg28569.txt
8. http://www.gutenberg.org/cache/epub/32962/pg32962.txt
9. http://www.gutenberg.org/cache/epub/23319/pg23319.txt
10. http://www.gutenberg.org/cache/epub/101/pg101.txt