INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

ISSN 0976 – 6367 (Print)
ISSN 0976 – 6375 (Online)
Volume 4, Issue 2, March – April (2013), pp. 94-101
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com




       A NEW REVISITED COMPRESSION TECHNIQUE THROUGH
      INNOVATIVE PARTITION GROUP BINARY COMPRESSION: A
                       NOVEL APPROACH

                                         V. Hari Prasad
                              Sphoorthy Engineering college, A.P



  ABSTRACT

          The number of sequenced organisms grows day by day, and storing the accumulated
  data in biological databases has become a major task. Classical compression algorithms
  fail to compress genetic sequences because of their encoded alphabet. Existing substitution
  techniques work on the repetitive and non-repetitive bases of DNA and achieve their best
  compression rates when a sequence contains tandem repeats. However, grasses such as maize
  and rice contain very few repeats (about 27) over 67789 total base pairs. On such sequences
  the existing techniques run in their worst case and give poor results. This paper introduces
  a novel technique, Innovative Partition Group Binary Compression, which yields
  state-of-the-art compression rates that are far better than those of existing techniques.
  The algorithm was developed from a comparative study of existing algorithms and is best
  suited to DNA sequences in genomes that lack tandem repeats.

  Keywords: Dnabit compress, Genbit compress, Huffbit compress, Encode, Decode

  I. INTRODUCTION

          The human genome has finally been deciphered. In other words, scientists have
  succeeded in reading the chain of more than 3 billion base pairs that constitute the human
  DNA molecule; this process is called sequencing. That daunting task required new analytical
  methods created by bioinformatics. Bioinformatics is information technology applied to
  biological data. To maintain such huge databases we require efficient bioinformatics
  computational tools to store and process data in a more efficient
way. Today more and more DNA sequences are being stored in biological databases, so the size
of these databases is growing exponentially. The need for compression therefore arises and is
becoming a vital challenge. Many universal compression algorithms have been introduced, but
they fail to compress genetic sequences because of the specificity of this kind of “text”.
Some classical algorithms have also been tried, but they give negative compression rates.
Compression comes in two flavors: lossless and lossy. Lossy compression is applicable to
multimedia such as images, audio and video, where removing some information (for example,
noise from audio or unnecessary pixels from images) does not noticeably change the result.
Text compression, however, must always be lossless: decoding the encoded text has to
reproduce the original exactly. DNA can be written as text over a four-letter alphabet
{A, C, G, T}, so each symbol (base) can be represented by two bits. General-purpose
compression algorithms do not perform well on biological sequences. Giancarlo et al. [1][2]
have provided a review of compression algorithms designed for biological sequences.
Characterizing and comparing genomes is a major task (Koonin 1999 [3]; Wooley 1999 [4]).
From a mathematical point of view, compression implies understanding and comprehension (Li
and Vitanyi 1998) [5]. Compression is a great tool for genome comparison and for studying
various properties of genomes. DNA sequences, which encode life, should be compressible.
It is well known that DNA sequences in higher eukaryotes contain many tandem repeats, and
essential genes (like rRNAs) have many copies. It has also been shown that genes sometimes
duplicate themselves for evolutionary purposes. All these facts suggest that DNA sequences
should be compressible. Nevertheless, the compression of DNA sequences is not an easy task
(Grumbach and Tahi 1994 [6]; Rivals et al. 1995 [7]; Chen et al. 2000 [8]). DNA sequences
consist of only four nucleotide bases {a, c, g, t}, so two bits are enough to store each
base. Standard compression software such as “compress”, “gzip”, “bzip2” and “winzip” expands
DNA genome files relative to this two-bit baseline instead of compressing them.
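
The behaviour of general-purpose tools can be checked with a small experiment. The sketch below is illustrative only: it uses Python's standard `zlib` module (the DEFLATE method also used by gzip) on a pseudo-random sequence, which is an assumption standing in for a genome without exploitable repeats, and reports the cost in bits per base. The function name `bits_per_base` is hypothetical.

```python
import random
import zlib

def bits_per_base(sequence: str, level: int = 9) -> float:
    """Return the zlib-compressed size in bits divided by the number of bases."""
    compressed = zlib.compress(sequence.encode("ascii"), level)
    return 8 * len(compressed) / len(sequence)

# Hypothetical test input: 10,000 bases drawn uniformly at random,
# standing in for a sequence with no exploitable repeats.
rng = random.Random(0)
synthetic = "".join(rng.choice("ACGT") for _ in range(10_000))

ratio = bits_per_base(synthetic)
print(f"zlib: {ratio:.3f} bits per base (plain two-bit packing needs 2.000)")
```

On input of this kind the figure comes out above two bits per base, which is the sense in which such tools "expand" DNA data; real genomes with repeats may behave differently.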
        Most existing software tools work well for English text compression (Bell et
al. 1990 [9]) but not for DNA genomes. Many text compression algorithms achieve quite good
compression ratios, but they have not proved effective for DNA sequences because they do not
incorporate the characteristics of DNA, even though DNA sequences can be represented in
simple text form. DNA sequences comprise just four different bases labeled A, T, C, and G
(for adenine, thymine, cytosine, and guanine respectively). T pairs with A, and G pairs with
C. Each base can be represented in computer code by a two-digit binary number, in other
words two bits: A (00), C (01), G (10), and T (11). At first glance, one might imagine that
this is the most efficient way to store DNA sequences. Like the binary alphabet {0, 1} used
in computers, the four-letter alphabet of DNA {A, T, C, G} can encode messages of arbitrary
complexity when arranged into long sequences.
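
As a concrete illustration of the two-bit representation, the following sketch packs four bases into each byte using the mapping A=00, C=01, G=10, T=11. The function names `pack_bases` and `unpack_bases` are hypothetical, and zero-padding the last byte is an assumption of this sketch rather than something the paper specifies.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i holds the base whose two-bit code is i

def pack_bases(sequence: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return bytes(out)

def unpack_bases(packed: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from bytes produced by pack_bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

packed = pack_bases("ACGTGCGC")
print(packed.hex())             # 1b99
print(unpack_bases(packed, 8))  # ACGTGCGC
```

The base count has to be stored alongside the packed bytes, because padding bits are indistinguishable from trailing A bases.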

 A. Plan of the paper
  This paper is organized as follows. Section 2 describes general compression algorithms.
 Section 3 describes related existing algorithms for compressing genome data. Section 4
 presents the proposed algorithm and analyzes how it improves on existing techniques.
 Section 5 gives a comparative study on a sample sequence. Section 6 concludes and outlines
 future work.



II. GENERAL COMPRESSION ALGORITHMS
         The compression of DNA sequences is considered one of the most challenging
tasks in the field of data compression. The first DNA compression algorithms,
BioCompress [10] and its successor BioCompress-2 [11], detect exact repeats and
complementary palindromes in the sequence and encode each such factor by the pair (l, p),
where l is the length of the factor and p is the position of its first occurrence. If this
representation is larger than the factor itself, two-bit encoding is used instead. Decoding
requires additional memory references, so performance may degrade.
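
To make the (l, p) idea concrete, the sketch below finds, by brute force, the longest factor starting at a given position that already occurred earlier in the sequence, either as an exact repeat or as a complementary palindrome (reverse complement). It is a simplified illustration and not the BioCompress implementation; `longest_earlier_match`, its `min_len` parameter and its return format are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(factor: str) -> str:
    """Return the complementary palindrome of a factor, e.g. GCTT -> AAGC."""
    return "".join(COMPLEMENT[base] for base in reversed(factor))

def longest_earlier_match(seq: str, start: int, min_len: int = 4):
    """Find the longest factor at seq[start:] that also occurs in seq[:start].

    Returns (l, p, kind), with l the length, p the position of the first
    earlier occurrence and kind either "repeat" or "palindrome", or None
    if no factor of at least min_len bases matches.
    """
    history = seq[:start]
    for length in range(len(seq) - start, min_len - 1, -1):
        factor = seq[start:start + length]
        p = history.find(factor)
        if p != -1:
            return (length, p, "repeat")
        p = history.find(reverse_complement(factor))
        if p != -1:
            return (length, p, "palindrome")
    return None

print(longest_earlier_match("ACGTTGCAAACGTTG", 9))  # (6, 0, 'repeat')
print(longest_earlier_match("AAGCCCGCTT", 6))       # (4, 0, 'palindrome')
```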
DnaPack [12] uses the Hamming distance to detect repeats and complementary palindromes,
and is implemented with a dynamic programming approach. As a result, it is not simple in
design, takes more time to execute and requires more memory. The algorithm achieves an
average compression rate of 1.6602 bits per base.
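
The Hamming distance counts the positions at which two equal-length strings differ, so a small distance between two factors marks them as approximate repeats. A minimal sketch follows; the helper name `hamming_distance` and the example strings are illustrative and not taken from DnaPack.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("ACGTACGT", "ACGAACGT"))  # 1
print(hamming_distance("ACGT", "TGCA"))          # 4
```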
DNAcompress [13] works on approximate repeats: if a sequence contains many tandem repeats,
it saves bits when encoding them; otherwise the repeats are discarded. Non-repeated
sequences are appended at the end of the encoded sequence. This algorithm achieves a
compression rate of only 1.72 bits per base, and if the sequence contains no tandem repeats
it may run in its worst case. DNASequitur is a grammar-based compression algorithm for DNA
sequences that infers a context-free grammar (CFG) to represent the input data. Designing a
CFG for the given input may lead to redundancies, and constructing the type-2 language
corresponding to the grammar may also lead to ambiguity. Lossless segment-based compression
enables part-by-part decompression by introducing a non-base character, which reduces
memory requirements, but it works well only when the sequence contains many repeats. For
sequences such as AT-rich DNA, which constitutes a distinct fraction of the cellular DNA of
the archaebacterium Methanococcus voltae and consists of non-repetitive sequences,
part-by-part decompression is somewhat tedious.
III. RELATED EXISTING ALGORITHMS
Compression methods fall into two categories.
        • Statistical methods, which compress data by replacing symbols with shorter
            codes. Huffman coding belongs to this category and is not suitable for larger
            sequences.
        • Dictionary-based methods, which replace longer strings with shorter codes. Cfact,
            for example, searches for the longest exactly matching repeat using tree data
            structures.
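
To illustrate the statistical category, the sketch below derives Huffman code lengths for the bases from their frequencies in a sequence, using Python's standard `heapq` module. It is a generic textbook construction and not the HuffBit Compress algorithm; the function names and the example inputs are assumptions.

```python
import heapq
from collections import Counter

def huffman_code_lengths(sequence: str) -> dict:
    """Return the Huffman code length, in bits, of each symbol in sequence."""
    counts = Counter(sequence)
    if len(counts) == 1:  # a single distinct symbol still needs one bit
        return {symbol: 1 for symbol in counts}
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def average_bits(sequence: str) -> float:
    """Average Huffman code length per base for the given sequence."""
    lengths = huffman_code_lengths(sequence)
    return sum(lengths[base] for base in sequence) / len(sequence)

skewed = "A" * 8 + "C" * 4 + "G" * 2 + "T" * 2
print(huffman_code_lengths(skewed))  # code lengths: A=1, C=2, G=3, T=3
print(average_bits(skewed))          # 1.75
print(average_bits("ACGT" * 4))      # 2.0
```

When the four bases are about equally frequent, the Huffman code degenerates to two bits per base, so a purely statistical coder gains nothing over fixed two-bit coding on such input.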
                     Based on the above methods, some lossless compression algorithms have
been developed on top of the two-bit encoding scheme, i.e. A (00), C (01), G (10) and
T (11). The performance of HUFFBIT COMPRESS [14] and GENBIT COMPRESS [15] has been analyzed
(best, average and worst case) according to the repetitive and non-repetitive bases of DNA.
If the given sequence contains many tandem repeats, [14] and [15] run in their best case
and achieve good compression ratios; otherwise they may run in their worst case and achieve
an average of 2.323 bits per base [14], [15]. Our proposed PGBC (Partitioned Group Binary
Compression) technique achieves its best compression ratio even when the given DNA sequence
contains no tandem repeats, or so few that they are negligible in a very large sequence, as
in maize and rice grass sequences. The proposed algorithm is better suited to non-repetitive
DNA sequences in genomes and achieves an average of 1.333 bits per base, which is far better
than all existing techniques.

IV. PROPOSED ALGORITHM
       Our proposed algorithm was developed from a comparative study of existing
techniques, and we concentrate on the non-repetitiveness of DNA sequences. Existing
techniques run in their worst case when a DNA sequence contains no tandem repeats. Here we
work with these worst-case scenarios and achieve better compression rates in terms of bits
per symbol.
  A. Idea behind the algorithm
     Every DNA sequence consists of the nucleotides {A, C, G, T}, where each literal is
called a BASE and is encoded in two bits as follows:
 A=00, C=01, G=10 and T=11.
 The compression ratio is calculated as encoded bits per base:
 Compression Ratio = Encoded Bits / Bases.
   B. Plan of work
       We take as input a sample DNA sequence of length n (which does not contain any
tandem repeats) and divide it into n/4 fragments, where each fragment contains four bases
from A, C, G and T. In the encoding process, every six fragments are grouped into a
partition (P), which contains two sub-partitions (Fh and Sh). The equivalent binary bits
are substituted for the bases before the sub-partitions are formed, and the partitions are
later grouped into a single main partition set (Gs). In the decoding process we reverse the
encoding steps to recover the DNA sequence without loss. Finally, we calculate the total
number of encoded bits by grouping all the partitions.
   For example, a sample DNA sequence of 72 bases is split by the PGBC technique into
n/4 = 18 fragments, which form 3 partitions containing 6 sub-partitions in total; these are
then grouped into a single main partition. This represents the number of encoded bits in
the given sequence. Our PGBC technique works as follows.
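
The grouping described above can be sketched as follows. This is a minimal illustration of the fragment, partition and sub-partition bookkeeping only. The helper names are hypothetical, and splitting each partition into a first half (Fh) and a second half (Sh) of three fragments each is an assumption, since the text does not say how the two sub-partitions are formed. The input is the 72-base sample sequence from Section V.

```python
SAMPLE = (
    "ACGT GCGC GATC GCCT GCTA GGCG TACG TCGC AGGC GATC GATG TGCT "
    "AGAT CAGA TGAC TCAG "
    "TGCA CGAT"
).replace(" ", "")

def fragments(sequence: str, size: int = 4) -> list:
    """Split the sequence into consecutive fragments of `size` bases."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]

def partitions(frags: list, per_partition: int = 6) -> list:
    """Group fragments into partitions of `per_partition` fragments each."""
    return [frags[i:i + per_partition] for i in range(0, len(frags), per_partition)]

def sub_partitions(partition: list) -> tuple:
    """Split one partition into two halves (assumed here to be Fh and Sh)."""
    half = len(partition) // 2
    return partition[:half], partition[half:]

frags = fragments(SAMPLE)
parts = partitions(frags)
subs = [half for part in parts for half in sub_partitions(part)]

print(len(SAMPLE), len(frags), len(parts), len(subs))  # 72 18 3 6
print(sub_partitions(parts[0]))
```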




The partition set (Ps) is calculated as follows.


The group partition set (Gs) is calculated as follows.




Here Gs represents the binary-equivalent numeric value (nearest integer) in terms of bytes
of storage. (For example, if the technique is implemented in C, an unsigned int requires
4 bytes of storage.) Here m = n, i.e. the length of the given sequence.


The total number of encoded group bits (Egb) is calculated as follows.




Finally, the compression ratio is calculated as follows.



      C. Analysis
      Length of the given sequence: N = 72 bases.
      This gives 3 partition sets containing 6 sub-partitions (3 Fh and 3 Sh), and finally
      all partitions are grouped into a group partition set.

      Ps = P1 + P2 + P3.
      We calculate the binary-equivalent numeric value (nearest integer) of each partition
      in terms of bytes of storage (for example, when storing each partition in C we need
      two bytes if the value fits in an int; otherwise we use an unsigned long).
                                           Gs = Ps (P1 + P2 + P3)

                                    Total number of encoded bits:
                                            Egb = Gs = Ps
      Every partition contains 24 bases, so it may not fit in an int, and we therefore store
      it in an unsigned long. In total, the sequence is divided into three partitions grouped
      as one set, so we require 12 bytes of storage:

                               Egb = 4 + 4 + 4 = 12 bytes (96 bits)
        Finally, we calculate the compression ratio as follows:

                               Compression Ratio = Egb / N
                                                 = 96 / 72
                                                 = 1.333 bits per base.
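
The arithmetic of this analysis can be reproduced in a few lines. The sketch below only restates the figures given above (three partitions stored in 4 bytes each, 72 bases); the constant and function names are illustrative, and the 8-bit text and two-bit figures are included purely as reference points.

```python
N_BASES = 72              # length of the sample sequence
N_PARTITIONS = 3          # 72 bases / 24 bases per partition
BYTES_PER_PARTITION = 4   # storage per partition as stated in the analysis

def compression_ratio(encoded_bits: int, n_bases: int) -> float:
    """Compression ratio as defined in the paper: encoded bits / bases."""
    return encoded_bits / n_bases

encoded_bits = N_PARTITIONS * BYTES_PER_PARTITION * 8   # 12 bytes = 96 bits
print(encoded_bits, round(compression_ratio(encoded_bits, N_BASES), 3))  # 96 1.333

# Reference points: one byte per base as text, and plain two-bit coding.
print(compression_ratio(8 * N_BASES, N_BASES))   # 8.0
print(compression_ratio(2 * N_BASES, N_BASES))   # 2.0
```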
     The encoding and decoding algorithms for DNA compression are as follows.
      D. Encoding Algorithm
      INP: Input String
      OPS: Encoded String
 PROCEDURE ENCODE
 Begin
 •     Group INP into fragments of four bases each.
 •     Generate all possible combinations of the DNA bases; the fragments are non-repetitive
       (INP is assumed to contain no tandem repeats).

 •   Group every six fragments into a partition set, which consists of two sub-partitions.
 •   Assign binary bits (0 and 1) to every base of DNA:
      A=00, C=01, G=10 and T=11
 •   Calculate Gs for every Ps in INP until the end of INP.

 •   Calculate Egb for every Gs until the end of INP.

 •   Repeat steps 4 and 5 until the whole length of INP has been processed.
 •   Transfer the sequence Egb to the output string, i.e. OPS.
     End.
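
The fragmenting, grouping and bit-assignment steps of the procedure are concrete enough to sketch in code. The example below substitutes two bits per base so that each sub-partition becomes one integer. It is a partial sketch under stated assumptions: the names are hypothetical, each sub-partition is assumed to hold three fragments (12 bases), and the Gs and Egb calculations are not reproduced because their formulas are not given in the text. Under plain two-bit substitution, each 12-base sub-partition occupies 24 bits.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_bases(bases: str) -> int:
    """Substitute two bits per base and return the result as one integer."""
    value = 0
    for base in bases:
        value = (value << 2) | CODE[base]
    return value

def encode_sub_partitions(inp: str) -> list:
    """Return (integer, bit_width) for each sub-partition of the input.

    Assumes len(inp) is a multiple of 24: four bases per fragment,
    six fragments per partition, three fragments per sub-partition.
    """
    if len(inp) % 24 != 0:
        raise ValueError("sketch only handles whole partitions of 24 bases")
    encoded = []
    for start in range(0, len(inp), 12):   # 12 bases = 3 fragments
        sub = inp[start:start + 12]
        encoded.append((encode_bases(sub), 2 * len(sub)))
    return encoded

partition = "ACGTGCGCGATCGCCTGCTAGGCG"   # first 24 bases of Sequence1
for value, width in encode_sub_partitions(partition):
    print(f"{value:0{width}b}")
# 000110111001100110001101
# 100101111001110010100110
```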

     E. Decoding Algorithm
      INP: Input String
      OPS: Decoded String
      PROCEDURE DECODE

     Begin
     • Generate all possible combinations of (A, C, G, T).
     • Read the binary data of each sub-partition from the encoded string, replace every
        two bits with the equivalent base (00=A, 01=C, 10=G and 11=T), and store the result
        in an array.
     • Repeat step 2 until the end of the encoded string is reached, and calculate Dgb and
        Ds by reversing the encoding process.
     • Transfer the decoded sequence Db to the decoded string, i.e. OPS.
       End.
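
The two-bit substitution in step 2 is straightforward to reverse. The sketch below decodes an integer holding a known number of bases back into text. The function name `decode_bases` and the convention that the first base sits in the highest-order bits are assumptions of this example, and Dgb and Ds are not modelled.

```python
BASES = "ACGT"  # 00=A, 01=C, 10=G, 11=T

def decode_bases(value: int, n_bases: int) -> str:
    """Turn an integer of 2 * n_bases bits back into a DNA string."""
    decoded = []
    for position in range(n_bases):
        shift = 2 * (n_bases - 1 - position)   # first base is in the top bits
        decoded.append(BASES[(value >> shift) & 0b11])
    return "".join(decoded)

print(decode_bases(0b00011011, 4))   # ACGT
print(decode_bases(0x1B998D, 12))    # ACGTGCGCGATC
```

The number of bases must be supplied explicitly, because leading A bases are encoded as leading zero bits and would otherwise be lost.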
V. EXAMPLE AND COMPARISON
  Let us consider the sequence.
Sequence1:
ACGT GCGC GATC GCCT GCTA GGCG TACG TCGC AGGC GATC GATG TGCT
AGAT CAGA TGAC TCAG
TGCA CGAT.
      Sequence length (number of bases)      = 72.
      Bytes required to store in a text file = 72 bytes.
The above sequence does not contain tandem repeats, so existing algorithms such as Huffbit
Compress, Genbit Compress and Dnabit Compress may run in their worst case and require more
bits to encode the sequence.
       Huffbit, GenBit and Dna compress = 162 bits (2.25 bits per base)
       Genbit Compress (tool based)     = 160 bits (2.23 bits per base)
       PGBC technique (compression)     = 96 bits (1.333 bits per base)

VI. CONCLUSION AND FUTURE WORK

        Using our algorithm, every base can be encoded in 1.33 bits. For the given sequence
this saves about 8 bytes, although the compression may vary with the size of the sequence.
Our technique is therefore far better than existing ones and can be applied to
non-repetitive DNA sequences of genomes. If the given sequence contains tandem repeats, our
technique still achieves the same compression rate on average. In addition, existing
techniques use dynamic programming to compress the sequence, which is complex to implement
and time consuming. Our technique is implemented without dynamic programming, so it is
simple and fast. This simplicity reduces processing complexity, and the technique should
prove an invaluable tool in bioinformatics. Our algorithm can be extended to any tool-based
approach.

ACKNOWLEDGEMENTS

        We would like to thank the other members of the Bioinformatics teams (Faculty of
CSE and IT) at Sphoorthy Engineering College, Nadergul (V), R.R. Dist., Hyderabad. I am very
grateful to my mother and father, V. Subba Lakshmamma and V.C. Obanna, for their support in
every step of my success. Last but not least, I thank the CSE students at Sphoorthy for
sharing their ideas with us while we refined our architecture.

REFERENCES

[1]  E Schrodinger. Cambridge University Press: Cambridge, UK, 1944. [PMID: 15985324]
[2]  R Giancarlo et al. A synopsis. Bioinformatics 25: 1575 (2009) [PMID: 19251772]
[3]  EV Koonin. Bioinformatics 15: 265 (1999)
[4]  JC Wooley. J. Comput. Biol. 6: 459 (1999) [PMID: 10582579]
[5]  CH Bennett et al. IEEE Trans. Inform. Theory 44: 4 (1998)
[6]  S Grumbach & F Tahi. Journal of Information Processing and Management 30(6): 875 (1994)
[7]  E Rivals et al. A guaranteed compression scheme for repetitive DNA sequences. LIFL,
     Lille I University, technical report IT-285 (1995)
[8]  X Chen et al. A compression algorithm for DNA sequences and its applications in genome
     comparison. In Proceedings of the Fourth Annual International Conference on
     Computational Molecular Biology, Tokyo, Japan, April 8-11, 2000. [PMID: 11072342]
[9]  TC Bell et al. New York: Prentice Hall (1990)
[10] J Ziv & A Lempel. IEEE Trans. Inf. Theory 23: 337 (1977)
[11] A Grumbach & F Tahi. In Proceedings of the IEEE Data [12]
[12] Beshad Behajadi. DNA compression challenge revisited.
[13] Allam AppaRao. DNABIT Compress: compression of DNA sequences. In Proceedings of the
     Bio medical Informatics Journal (2011).
[14] Allam AppaRao. HuffBit Compress: compression of DNA using extended binary trees. In
     Proceedings of the JATIT Journal, Computational Biology and Bio Informatics (2009).
[15] Allam AppaRao. Genbit Compress: compression of DNA sequences. In Proceedings of the
     JATIT Journal, Computational Biology and Bio Informatics (2011).

AUTHORS’ INFORMATION

                   V Hari Prasad, Assoc. Professor, received his B.Tech in CSE from JNTU
                   University, Anantapur, and his M.Tech in CSE from JNTUCEH, Hyderabad, and
                   is pursuing research in bioinformatics at JNTU Kakinada, A.P., as an
                   external research scholar in CSE. He has 10 years of teaching experience
                   in various engineering colleges. He presently heads the CSE Department at
                   Sphoorthy Engineering College, Nadergul (V), Hyderabad. He is a Life
Member of MISTE, a Member of IEEE, and UGC-NET qualified. He has presented papers at
international and national conferences in various domains. His areas of interest are
bioinformatics, databases, and artificial intelligence.




                                          101

More Related Content

DOCX
Final doc of dna
PDF
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
PDF
A comparative review on symmetric and asymmetric DNA-based cryptography
PDF
A Study on DNA based Computation and Memory Devices
PPTX
2014 anu-canberra-streaming
PDF
Comparision Of Various Lossless Image Compression Techniques
PDF
Enhanced Level of Security using DNA Computing Technique with Hyperelliptic C...
PPTX
DNA based Cryptography_Final_Review
Final doc of dna
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
A comparative review on symmetric and asymmetric DNA-based cryptography
A Study on DNA based Computation and Memory Devices
2014 anu-canberra-streaming
Comparision Of Various Lossless Image Compression Techniques
Enhanced Level of Security using DNA Computing Technique with Hyperelliptic C...
DNA based Cryptography_Final_Review

What's hot (19)

PDF
Secure data transmission using dna encryption
PDF
Develop and design hybrid genetic algorithms with multiple objectives in data...
PDF
DNA Encryption Algorithms: Scope and Challenges in Symmetric Key Cryptography
PDF
Dna cryptography
PPTX
[Chung il kim] 0829 thesis
PPTX
Genetic data storage
PDF
Survey on Text Prediction Techniques
PPTX
2013 talk at TGAC, November 4
PPTX
DNA secret writing project first review
PPTX
20131019 生物物理若手 Journal Club
PPT
A NEW APPROACH TOWARDS INFORMATION SECURITY BASED ON DNA CRYPTOGRAPHY
PDF
IRJET- DNA Cryptography
PDF
DNA as Storage Medium
PDF
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
PDF
G0210032039
PPTX
Dna computing
PDF
G017444651
PDF
37de29c2ae88c046317fcfbebd7a66784874
PDF
On Text Realization Image Steganography
Secure data transmission using dna encryption
Develop and design hybrid genetic algorithms with multiple objectives in data...
DNA Encryption Algorithms: Scope and Challenges in Symmetric Key Cryptography
Dna cryptography
[Chung il kim] 0829 thesis
Genetic data storage
Survey on Text Prediction Techniques
2013 talk at TGAC, November 4
DNA secret writing project first review
20131019 生物物理若手 Journal Club
A NEW APPROACH TOWARDS INFORMATION SECURITY BASED ON DNA CRYPTOGRAPHY
IRJET- DNA Cryptography
DNA as Storage Medium
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
G0210032039
Dna computing
G017444651
37de29c2ae88c046317fcfbebd7a66784874
On Text Realization Image Steganography
Ad

Viewers also liked (8)

PPSX
Epa San Martin Marzo
PPTX
Merlo coupon design
PDF
Miguel Mejía Temas
ODP
Guide to relational therapy
PPS
Estupidez humana
XLSX
Copia de copia de mmm
PPT
Intervenció En Lespai PúBlic
DOCX
Useful web resource links
Epa San Martin Marzo
Merlo coupon design
Miguel Mejía Temas
Guide to relational therapy
Estupidez humana
Copia de copia de mmm
Intervenció En Lespai PúBlic
Useful web resource links
Ad

Similar to A new revisited compression technique through innovative partition group binary (20)

PDF
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
PDF
Performance Efficient DNA Sequence Detectionalgo
PDF
50320130403003 2
PDF
Image Compression Through Combination Advantages From Existing Techniques
PDF
Design & Implementation of a DNA Compression Algorithm
PDF
50120130406023
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PDF
40120130405015 2
PDF
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
PDF
An analogy of algorithms for tagging of single nucleotide polymorphism and ev
PDF
A research paper_on_lossless_data_compre
PDF
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
PDF
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
PDF
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
PPTX
Implementation of DNA sequence alignment algorithms using Fpga ,ML,and CNN
PDF
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
PDF
Dna data compression algorithms based on redundancy
PDF
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
PDF
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
PDF
De novo transcriptome assembly of solid sequencing data in cucumis melo
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
Performance Efficient DNA Sequence Detectionalgo
50320130403003 2
Image Compression Through Combination Advantages From Existing Techniques
Design & Implementation of a DNA Compression Algorithm
50120130406023
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
40120130405015 2
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
An analogy of algorithms for tagging of single nucleotide polymorphism and ev
A research paper_on_lossless_data_compre
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
Implementation of DNA sequence alignment algorithms using Fpga ,ML,and CNN
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Dna data compression algorithms based on redundancy
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
DE NOVO TRANSCRIPTOME ASSEMBLY OF SOLID SEQUENCING DATA IN CUCUMIS MELO
De novo transcriptome assembly of solid sequencing data in cucumis melo

More from IAEME Publication (20)

PDF
IAEME_Publication_Call_for_Paper_September_2022.pdf
PDF
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
PDF
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
PDF
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
PDF
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
PDF
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
PDF
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
PDF
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
PDF
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
PDF
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
PDF
GANDHI ON NON-VIOLENT POLICE
PDF
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
PDF
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
PDF
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
PDF
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
PDF
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
PDF
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
PDF
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
PDF
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
PDF
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
IAEME_Publication_Call_for_Paper_September_2022.pdf
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
GANDHI ON NON-VIOLENT POLICE
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT

A new revisited compression technique through innovative partition group binary

  • 1. INTERNATIONALComputer Engineering and Technology ENGINEERING International Journal of JOURNAL OF COMPUTER (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME & TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) IJCET Volume 4, Issue 2, March – April (2013), pp. 94-101 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) ©IAEME www.jifactor.com A NEW REVISITED COMPRESSION TECHNIQUE THROUGH INNOVATIVE PARTITION GROUP BINARY COMPRESSION :A NOVEL APPROACH V. Hari Prasad Sphoorthy Engineering college, A.P ABSTRACT Day by day the ratio of living organisms are increased and its accumulation in the biological databases creating a major task. In this connection some of the classical algorithms are strived into the world and fails to compress genetic sequences due to the encrypted alphabets. Existing substitution techniques will work on repetitive and non repetitiveness of bases of DNA and they achieve Best compression rates if any sequence may contain tandem repeats. But the category of grasses like maize and rice contains very less repeats (nearer of 27) over 67789 total bps. By working existing techniques on such sequences the results are not bountiful and running on worst case comparisons. This paper introduces a novel Innovative partition group Binary compression technique yields first art compression rates which is far better than existing techniques. This algorithm is developed based on comparative study of existing algorithms and which is more applicable for non tandem repeats of DNA sequences in genomes. Keywords: Dnabit compress, Genbit compress, Huffbit compress, Encode, Decode I. INTRODUCTION The human genome was finally deciphered! In other words, scientists have succeeded in reading the chain of more than 3 billion base pairs that constitute the DNA molecule of humans; this process is called, sequencing. 
That daunting task required new analytical methods created by bioinformatics .Bio informatics is nothing but information technology is applied in terms of biological databases. To maintain such huge databases we require efficient Bio informatics computational tools to store and process data in a more efficient 94
  • 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME way. Today more and more DNA sequences are storing in biological databases so the size of the databases is growing in an exponential manner. Thus need of compression arises and it is becoming a vital challenge, In this connection many universal algorithms are came into the existence but they fails to compress genetic sequences due to the specificity of “text”. Some classical algorithms are also introduced but they are running on negative compression rates. Compression may be in two flavors one is lossless and other is loss. Loss compression can be applicable for multimedia applications like image, audio and video. In multimedia applications if we remove some unused pixels also resultant may not variant like removing noise from audio or removing unnecessary pixels from images But Text compression is always loss less even after decoding the entire encoded text we have to retain its original property. DNA can be encoded in four letter alphabets like text {A, C, G, and T}.Thus each Base of symbol (Base) can be represented by two bits. . General purpose compression algorithms do not perform well with biological sequences. Giancarlo et al. [1][2] have provided a review of compression algorithms designed for biological sequences. Finding the characteristics and comparing Genomes is a major task (Koonin 1999[3]; Wooley 1999[4]). In mathematical point of view, compression implies understanding and comprehension (Li and Vitanyi 1998) [5]. Compression is a great tool for Genome comparison and for studying various properties of Genomes. DNA sequences, which encode life should be compressible. It is well known that DNA sequences in higher eukaryotes contain many tandem repeats, and essentials genes (like rRNAs) have many copies. 
It has also been shown that genes sometimes duplicate themselves for evolutionary purposes. All these facts suggest that DNA sequences should be compressible. Nevertheless, compressing DNA sequences is not an easy task (Grumbach and Tahi 1994 [6]; Rivals et al. 1995 [7]; Chen et al. 2000 [8]). DNA sequences consist of only four nucleotide bases {a, c, g, t}, so two bits are enough to store each base. Standard compression software such as "compress", "gzip", "bzip2" and "winzip" expands a DNA genome file rather than compressing it. Most existing software tools work well for English text compression (Bell et al. 1990 [9]) but not for DNA genomes. Many text compression algorithms achieve quite good compression ratios, but they have not proved effective on DNA sequences because they do not exploit the characteristics of DNA, even though DNA sequences can be represented in simple text form.

DNA sequences comprise just four different bases, labeled A, T, C and G (for adenine, thymine, cytosine and guanine, respectively). T pairs with A, and G pairs with C. Each base can be represented in computer code by two bits: A (00), C (01), G (10) and T (11). At first glance, one might imagine that this is the most efficient way to store DNA sequences. Like the binary alphabet {0, 1} used in computers, the four-letter alphabet of DNA {A, T, C, G} can encode messages of arbitrary complexity when encoded into long sequences.

A. Plan of the paper

This paper is organized as follows. Section 2 describes general compression algorithms. Section 3 describes related existing algorithms for compressing genome data. Section 4 describes the proposed algorithm and analyzes how it improves on existing techniques. Section 5 presents a comparative study on a sample sequence. Section 6 concludes with future work.
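The two-bit representation described in the introduction (A=00, C=01, G=10, T=11) can be illustrated with a minimal Python sketch. The function names `pack` and `unpack` are hypothetical and are not taken from the paper; the sketch shows only the plain two-bits-per-base baseline, not the proposed technique.

```python
# Two-bit codes as given in the text: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"  # index i holds the base whose two-bit code is i


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        value = 0
        for base in chunk:
            value = (value << 2) | CODE[base]
        # Left-align a short final chunk so decoding always reads from the top bits.
        value <<= 2 * (4 - len(chunk))
        out.append(value)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes (hypothetical helper)."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])
```

Because a final byte may be only partly filled, the decoder needs the original base count; storing that count alongside the packed data is one simple way to keep the round trip lossless.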
II. GENERAL COMPRESSION ALGORITHMS

The compression of DNA sequences is considered one of the most challenging tasks in the field of data compression. The first DNA compression algorithm, BioCompress [10], and its successor BioCompress-2 [11] detect exact repeats and complementary palindromes in the sequence and encode each factor by the pair (l, p), where l is the length of the factor and p is the position of its first occurrence. When this representation would be larger than the factor itself, two-bit encoding is used instead. Decoding requires more memory references, so performance may degrade.

DnaPack [12] uses Hamming distance for the repeats and complementary palindromes and is implemented with a dynamic programming approach. Its design is therefore not simple, and it needs more time and more memory to execute. The algorithm achieves an average compression rate of 1.6602.

DNAcompress [13] works on approximate repeats: when tandem repeats are numerous it saves bits by encoding them, and otherwise the repeats are discarded. Non-repeated sequences are appended at the end of the sequence. This algorithm achieves a compression rate of only 1.72 bits per base, and if the sequence contains no tandem repeats it may run in the worst case.

DNASequitur is a grammar-based compression algorithm for DNA sequences that infers a context-free grammar (CFG) to represent the input data. Designing a CFG for the given input may lead to redundancies, and constructing the type-2 language corresponding to the grammar may also lead to ambiguity.

Lossless segment-based compression enables part-by-part decompression by introducing a non-base character, which saves memory, but it works well only when the sequence contains many repeats.
For sequences such as AT-rich DNA, which constitutes a distinct fraction of the cellular DNA of the archaebacterium Methanococcus voltae and consists of non-repetitive sequences, part-by-part decompression is somewhat tedious.

III. RELATED EXISTING ALGORITHMS

Compression methods fall into two categories.
• Statistical methods, which compress data by replacing symbols with shorter codes. Huffman coding comes under this category and is not suitable for larger sequences.
• Dictionary-based methods, which replace longer strings with shorter codes. Cfact, for example, searches for the longest exact matching repeat using tree data structures.

Based on these methods, several lossless compression algorithms have been developed around the two-bit encoding scheme A (00), C (01), G (10) and T (11). For HUFFBIT COMPRESS [14] and GENBIT COMPRESS [15], performance analyses (best, average and worst case) based on repetitive and non-repetitive bases of DNA have been reported with computed results. If the given sequence contains many tandem repeats, [14] and [15] run in the best case and achieve high compression ratios; otherwise they may run in the worst case and achieve an average of 2.323 bits per base [14], [15].

Our proposed PGBC (Partitioned Group Binary Compression) technique achieves its best compression ratio even when the given DNA sequence contains no tandem repeats, or so few that they are negligible in a very large sequence, as in grasses such as maize and rice. The proposed algorithm is better suited to non-repetitive DNA sequences in genomes and achieves an average of 1.333 bits per base, which is far better than all existing techniques.
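As an illustration of the statistical category mentioned above, the following Python sketch computes Huffman code lengths from the base frequencies of a sequence. It is a generic textbook construction, not the HuffBit Compress algorithm of [14], and the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(seq: str) -> dict:
    """Return the Huffman code length in bits for each symbol in seq."""
    freq = Counter(seq)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap items: (weight, tie_breaker, {symbol: depth_so_far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_bits_per_base(seq: str) -> float:
    """Average Huffman-coded bits per base for seq (code table not counted)."""
    lengths = huffman_code_lengths(seq)
    freq = Counter(seq)
    return sum(freq[s] * lengths[s] for s in freq) / len(seq)
```

For example, `huffman_bits_per_base("AAAACCGT")` gives 1.75 because A dominates, while a sequence with equal base counts such as `"ACGTACGT"` gives exactly 2.0.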
IV. PROPOSED ALGORITHM

Our proposed algorithm was developed from a comparative study of existing techniques, and we are stressing the non-repetitiveness of DNA sequences. Existing techniques run in the worst case when a DNA sequence contains no tandem repeats. Here we work with such worst-case scenarios and achieve better compression rates in terms of bits per symbol.

A. Idea behind the algorithm

Every DNA sequence consists of the nucleotides {A, C, G, T}, where each literal is called a base and is encoded in two bits as follows: A=00, C=01, G=10 and T=11. The compression ratio is calculated as encoded bits per base: Compression Ratio = Encoded Bits / Bases.

B. Plan of work

We take as input a sample DNA sequence of length n (which does not contain any tandem repeats) and divide it into n/4 fragments, where each fragment contains four bases drawn from A, C, G and T. In the encoding process, every six fragments are grouped into a partition (P), which contains two sub-partitions (Fh and Sh). We substitute the equivalent binary bits before making the sub-partitions, and later group the partitions into a single main partition set (Gs). In the decoding process we reverse the encoding to retain the lossless DNA property. Finally, we calculate the total number of encoded bits by grouping all the partitions.

For example, a sample DNA sequence of 72 bases is split by the PGBC technique into n/4 = 18 fragments, which form 3 partitions containing 6 sub-partitions, and these are then grouped into a single main partition. This represents the number of encoded bits in the given sequence. Our PGBC technique works as follows.

The partition set (Ps) is calculated as follows. [equation missing in source]
The group partition set (Gs) is calculated as follows. [equation missing in source] Here Gs represents the binary-equivalent numeric value (rounded to the nearest integer) in terms of bytes of storage. (For example, if the technique is implemented in C, an unsigned int requires 4 bytes of storage.) Here m = n, i.e. the length of the given sequence.
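The fragment and partition structure described in the plan of work can be sketched in Python as follows. The function name `partition_sequence` is hypothetical, and splitting each partition evenly into three fragments for Fh and three for Sh is an assumption: the text says only that a partition of six fragments contains two sub-partitions.

```python
def partition_sequence(seq: str, frag_len: int = 4, frags_per_partition: int = 6):
    """Split seq into fragments, then into partitions of (Fh, Sh) sub-partitions.

    Assumption (not stated in the paper): Fh holds the first half of a
    partition's fragments and Sh holds the second half.
    """
    fragments = [seq[i:i + frag_len] for i in range(0, len(seq), frag_len)]
    half = frags_per_partition // 2
    partitions = []
    for i in range(0, len(fragments), frags_per_partition):
        group = fragments[i:i + frags_per_partition]
        partitions.append((group[:half], group[half:]))  # (Fh, Sh)
    return fragments, partitions


# A 72-base placeholder input gives 18 fragments and 3 partitions.
fragments, partitions = partition_sequence("ACGT" * 18)
print(len(fragments), len(partitions))
```

The sketch covers only the grouping step; how each partition is turned into its stored numeric value depends on the Ps and Gs equations.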
The total number of encoded group bits (Egb) is calculated as follows. [equation missing in source] Finally, the compression ratio is calculated as encoded bits per base, i.e. Compression Ratio = Egb / N.

C. Analysis

Length of the given sequence: N = 72 bases. This gives 3 partition sets, which include 6 sub-partitions (3 Fh and 3 Sh), and finally we group all partitions into a group partition set:

Ps = P1 + P2 + P3

We calculate the binary-equivalent numeric value (nearest integer) for each partition in terms of bytes of storage. (If each partition is stored in C, two bytes are required when it can be accommodated in an int; if not, we can use unsigned long.)

Gs = Ps (P1 + P2 + P3)
Total number of encoded bits: Egb = Gs = Ps

Every partition contains 24 bases, so it may not fit in an int and is stored in an unsigned long. The sequence is thus divided into three partitions grouped as one set, and in total we require 12 bytes of storage:

Egb = 4 + 4 + 4 = 12 bytes (96 bits)

Finally, we calculate the compression ratio:

Compression Ratio = 96 / 72 = 1.333 bits per base

The encoding and decoding algorithms for DNA compression are as follows.

D. Encoding Algorithm

INP: input string
OPS: encoded string
PROCEDURE ENCODE
Begin
• Group INP into fragments of four bases each.
• Generate all possible combinations of DNA bases; the fragments are non-repetitive (INP is assumed to contain no tandem repeats).
• Group six fragments into a partition set, which consists of two sub-partitions.
• Assign binary bits (0 and 1) to every base of DNA: A=00, C=01, G=10 and T=11.
• Calculate Gs for every Ps in INP until the end of INP.
• Calculate Egb for every Gs until the end of INP.
• Repeat steps 4 and 5 over the whole length of INP.
• Transfer the sequence Egb to the output string, i.e. OPS.
End.

E. Decoding Algorithm

INP: input string
OPS: decoded string
PROCEDURE DECODE
Begin
• Generate all possible combinations of (A, C, G, T).
• Read the binary data of each sub-partition from OPS and replace each pair of bits with its equivalent base (00=A, 01=C, 10=G and 11=T), storing the result in an array until the end of the input.
• Repeat step 2 until the end of INP is reached, and calculate Dgb and Ds by the reverse process.
• Transfer the sequence Db to the input string, i.e. INP.
End.

V. EXAMPLE AND COMPARISON

Let us consider the following sequence.

Sequence1: ACGT GCGC GATC GCCT GCTA GGCG TACG TCGC AGGC GATC GATG TGCT AGAT CAGA TGAC TCAG TGCA CGAT

Sequence length (number of bases) = 72.
Bytes required to store it in a text file = 72 bytes.

The above sequence does not contain tandem repeats, so existing algorithms such as Huffbit Compress, Genbit Compress and Dnabit Compress may run in the worst case and require more bits to encode the sequence.

Huffbit, GenBit and Dna compress = 162 bits (2.25 bits per base)
Genbit Compress (tool based) = 160 bits (2.22 bits per base)
PGBC technique (compression) = 96 bits (1.333 bits per base)
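The bits-per-base figures in the comparison above can be rechecked with a short Python sketch. The encoded sizes are copied from the text as reported and are not produced by running any of the compressors; the names `REPORTED_BITS` and `bits_per_base` are hypothetical. For reference, a 24-base partition occupies 48 bits at two bits per base, so the figure of 4 bytes per partition is taken from the source as stated rather than derived here.

```python
SEQUENCE1 = (
    "ACGT GCGC GATC GCCT GCTA GGCG TACG TCGC AGGC "
    "GATC GATG TGCT AGAT CAGA TGAC TCAG TGCA CGAT"
).replace(" ", "")

# Encoded sizes in bits, copied from the comparison in the text (not measured).
REPORTED_BITS = {
    "Huffbit/GenBit/Dna compress": 162,
    "Genbit Compress (tool based)": 160,
    "PGBC": 96,
}


def bits_per_base(encoded_bits: int, n_bases: int) -> float:
    """Compression ratio as defined in the paper: encoded bits / bases."""
    return encoded_bits / n_bases


n_bases = len(SEQUENCE1)      # number of bases in Sequence1
plain_two_bit = 2 * n_bases   # size under plain two-bit coding, for reference
ratios = {
    name: round(bits_per_base(bits, n_bases), 3)
    for name, bits in REPORTED_BITS.items()
}
# Difference between the reported worst-case size and the reported PGBC size.
saving_bytes = (REPORTED_BITS["Huffbit/GenBit/Dna compress"] - REPORTED_BITS["PGBC"]) / 8

for name, ratio in ratios.items():
    print(f"{name}: {ratio} bits per base")
print(f"plain two-bit coding: {plain_two_bit} bits; reported saving: {saving_bytes} bytes")
```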
VI. CONCLUSION AND FUTURE WORK

With our algorithm, every base is encoded in 1.33 bits. For the given sequence this saves about 8 bytes; the saving will vary with the size of the sequence. Our technique is therefore far better than existing ones and can be applied to non-repetitive DNA sequences of genomes. If the given sequence contains tandem repeats, our technique achieves the same compression rate on average. In addition, existing techniques use dynamic programming to compress the sequence, which is complex to implement and time consuming. Our technique is implemented without dynamic programming, so it is simple and fast. This simplicity reduces processing complexity, and the technique should be an invaluable tool in the bioinformatics era. Our algorithm can be extended to any tool-based approach.

ACKNOWLEDGEMENTS

We would like to thank the other members of the Bioinformatics teams (Faculty of CSE and IT) at Sphoorthy Engineering College, Nadergul (V), R.R. Dist., Hyderabad. I am very grateful to my mother and father, V. Subba Lakshmamma and V.C. Obanna, for their support in every step of my success. Last but not least, we thank the CSE students at Sphoorthy for sharing their ideas with us in refining our architecture.

REFERENCES

[1] E Schrodinger. Cambridge University Press: Cambridge, UK, 1944. [PMID: 15985324]
[2] R Giancarlo et al. A synopsis. Bioinformatics 25: 1575 (2009) [PMID: 19251772]
[3] EV Koonin. Bioinformatics 15: 265 (1999)
[4] JC Wooley. J. Comput. Biol. 6: 459 (1999) [PMID: 10582579]
[5] CH Bennett et al. IEEE Trans. Inform. Theory 44: 4 (1998)
[6] S Grumbach & F Tahi. Journal of Information Processing and Management 30(6): 875 (1994)
[7] E Rivals et al. A guaranteed compression scheme for repetitive DNA sequences. LIFL, Lille I University, technical report IT-285 (1995)
[8] X Chen et al. A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 8-11, 2000. [PMID: 11072342]
[9] TC Bell et al. New York: Prentice Hall (1990)
[10] J Ziv & A Lempel. IEEE Trans. Inf. Theory 23: 337 (1977)
[11] A Grumbach & F Tahi. In Proceedings of the IEEE Data
[12] Behshad Behzadi. DNA compression challenge revisited.
[13] Allam AppaRao. DNABIT compress: compression of DNA sequences. In proceedings of the Bio medical Informatics Journal (2011).
[14] Allam AppaRao. HuffBit compress: compression of DNA using extended binary trees. In proceedings of the JATIT journal, Computational Biology and Bio Informatics (2009).
[15] Allam AppaRao. Genbit compress: compression of DNA sequences. In proceedings of the JATIT journal, Computational Biology and Bio Informatics (2011).
AUTHORS' INFORMATION

V. Hari Prasad is an Associate Professor. He received his B.Tech in CSE from JNTU, Anantapur, and his M.Tech in CSE from JNTUCEH, Hyderabad, and is pursuing research in bioinformatics at JNTU Kakinada, A.P., as an external research scholar in CSE. He has 10 years of teaching experience in various engineering colleges and presently heads the CSE Department at Sphoorthy Engineering College, Nadergul (V), Hyderabad. He is a Life Member of MISTE, a Member of IEEE, and UGC-NET qualified. He has presented papers at international and national conferences in various domains. His areas of interest are bioinformatics, databases and artificial intelligence.