Comparisons of Sequence Alignment Scoring Functions: On the Use of Structural Information to Improve Performance

Feb 6, 2008
What’s a scoring function?

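[Figure: dynamic programming matrix for the sequences a b b c d d d e f g (top) and a b c d e f g (side), with example alignments:

    a b b c d d d e f g
    a b - c d - - e f g

    a b b c d d d e f g
    a b - c c d - e f g
          - d -        ]

Optimal :: $\max \left( \sum S(\bullet) - \sum C(-) \right)$, with similarity $S$ summed over aligned positions and cost $C > 0$ summed over gaps.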
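A minimal sketch of the objective above: global dynamic programming that maximizes summed similarity minus summed gap cost. The match/mismatch similarity and the linear per-column gap cost are illustrative assumptions, not the scoring functions compared in this talk; the function name `align` is hypothetical.

```python
def align(q, t, sim=lambda a, b: 2 if a == b else -1, gap_cost=1):
    """Global alignment maximizing sum(S) - sum(C), with C = gap_cost > 0 per gap column."""
    n, m = len(q), len(t)
    # score[i][j] = best score for aligning q[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap_cost * i
    for j in range(1, m + 1):
        score[0][j] = -gap_cost * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(q[i - 1], t[j - 1]),  # aligned pair
                score[i - 1][j] - gap_cost,                      # gap in t
                score[i][j - 1] - gap_cost,                      # gap in q
            )
    # Trace back one optimal path from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim(q[i - 1], t[j - 1]):
            top.append(q[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap_cost:
            top.append(q[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = align("abbcdddefg", "abcdefg")
print(best)
print(top)
print(bottom)
```

Several paths reach the same optimal score for this pair of sequences; the traceback returns only one of them, which is where the sampling of alternative alignments comes in.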

Aims
Optimal alignment problem: the native alignment scores best.
Suboptimal (SO) alignment sampling problem: the native alignment scores best, poor alignments are kept to a minimum, and “unproductive” alignments are avoided.
Productive versus Unproductive Alignment Sampling


[Figure: two panels of sampled alignments.

Non-redundant (good):
    XXXAAADEFAAAYYY        XXXAAADEFAAAYYY        XXXAAADEFAAAYYY
    XXX-AKLMA---YYY        XXX--AKLMA--YYY        XXX---AKLMA-YYY

Redundant (not good):
    XXXAAAYYY        XXXAAAYYY        XXXAAAYYY
    XXX--aYYY        XXX-a-YYY        XXXa--YYY]
Classes of Methods for Sampling Suboptimal Alignments
• Top-down Enumeration
    – Classical Waterman (Near-optimal alignments): Path > Opt-δ
• Iterative Elimination (IE)
    – Waterman & Eggert
    – Saqi, Bates & Sternberg
• Parametric Sampling (PS)
    – Chivian & Baker; 2006
• Combined IE + PS
    – Jaroszewski, Li & Godzik; 2002
• Stochastic Sampling
    – John & Sali; 2003
• Fragment Set Approach (S4)

Side note: sample over lots of … P (sim1,gap1,ss1), P (sim2,gap1,ss1), …, P (simn,gapn,ssn)
Critical Questions

Am I ranking the most native alignment first? (Within the scope of the scoring function)
Am I eliminating poor/impossible alignments?
Am I sampling efficiently/with little redundancy? (Within the scope of alignment sampling)

New: GN2 vs. HMAP, sp2, sp3, sp4
Organization

Talk about software library for doing sequence alignment

Talk about the HMAP and Sparks-family of scoring functions

New method: GN2

Benchmark design & results
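HMAP2 – STL in C++ (generic programming)

[Diagram: templates (T1, T2, T3) and queries (Q1, Q2) feed an Algorithm made of two parts. The Evaluator (HMAP, gnoali, gn2, sparks?) fills the dynamic programming matrix (DPM). The Enumerator (optimal, S4, Waterman, RC, ?) reads the DPM and produces an alignment set (AS) as a [pair list]. A Format step writes Fasta or PIR formatted output.

    T, HMAP, Q  =  DPM
    ENUM, DPM   =  AS

Example alignment set:
    aabbccdef        aabbccdef
    aa---cdef        ---aacdef]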
[Diagram: inputs and data flow.

Template: primary sequence → sequence-based profile; secondary structure; structure → contact numbers, distances, HBs, solvent accessibility; residue depth → depth-dependent a.a. frequencies; hydrophilicity.

Query: primary sequence → sequence-based profile, PSIPRED prediction, SABLE prediction.

PSI-BLAST against the NR sequence database builds the profiles. Template profile + query profile → Algorithm → Alignments → Models.]
[Figure: gap models illustrated on the sequences a b c d e and a b x y e, with plots of gap penalty G against gap length l (0, 1, 2, …); curves labeled ss and coil]

Affine gaps: fast, good for DB search (HMAP)
Arbitrary gaps: nonlinear gaps, structure-derived gaps (AS Yang - 2002)
Double-sided gaps (zigzag alignment): most flexible, potentially most costly (A Sali - 2006)

    abcd--e
    ab--xye
HMAP

Template: sequence profile (e.g. .01 .02 … 0.45 ... 0.02 per position); secondary structure as H / E / C indicators; conf. = 1; gap open, extn per position (3.7, 0.3 at the coil position shown; 12.8, 0.9 at the strand position shown).

Query: sequence profile (e.g. .02 .08 … 0.25 ... 0.02 per position); PSIPRED secondary structure as H / E / C indicators; conf. continuously valued from [0..1].

$S_{Q,T} = \mathrm{dot}[aa_Q, aa_T] \cdot \exp[W \cdot ss_{QT}]$

$ss_{QT} = 1 \cdot conf_Q$ if $ss_Q = ss_T$; $ss_{QT} = -0.5 \cdot conf_Q$ if $ss_Q \neq ss_T$

$W = 0.5$ (new opt value = 0.55)

$Z_{Q,T} = (S_{Q,T} - \mu) / \sigma$

$(G_I, G_E) = (3.7, 0.3)$ if $ss_T$ = coil; $(G_I, G_E) = (12.8, 0.9)$ if $ss_T \neq$ coil
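The HMAP formulas on this slide can be sketched for a single query/template position pair as follows. The function names, the three-letter toy profiles, and the idea of passing µ and σ in as precomputed numbers are assumptions made for illustration; only the constants (W, the ±conf rule, and the gap parameters) come from the slide.

```python
import math

W = 0.5  # weight quoted on the slide (re-optimized value there: 0.55)


def ss_term(ss_q, ss_t, conf_q):
    """ssQT: +1 * confQ if predicted query SS equals template SS, else -0.5 * confQ."""
    return conf_q if ss_q == ss_t else -0.5 * conf_q


def hmap_score(aa_q, aa_t, ss_q, ss_t, conf_q, w=W):
    """S_QT = dot[aaQ, aaT] * exp[W * ssQT] for one position pair."""
    dot = sum(x * y for x, y in zip(aa_q, aa_t))
    return dot * math.exp(w * ss_term(ss_q, ss_t, conf_q))


def z_score(s, mu, sigma):
    """Z_QT = (S_QT - mu) / sigma; mu and sigma are supplied by the caller."""
    return (s - mu) / sigma


def gap_params(ss_t):
    """(GI, GE): cheaper gaps in template coil ('C'), expensive elsewhere."""
    return (3.7, 0.3) if ss_t == "C" else (12.8, 0.9)


# Toy 3-letter profiles (hypothetical; real profiles have 20 columns).
aa_q = [0.5, 0.3, 0.2]
aa_t = [0.6, 0.2, 0.2]
s_match = hmap_score(aa_q, aa_t, "H", "H", conf_q=0.8)
s_mismatch = hmap_score(aa_q, aa_t, "H", "E", conf_q=0.8)
print(s_match, s_mismatch, gap_params("C"), gap_params("H"))
```

With the same profile overlap, agreement in secondary structure raises the score and disagreement lowers it, scaled by the confidence of the query prediction.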
Sparks scoring functions
• Sparks 2
    – Sequence-based profile-to-profile
    – Secondary structure prediction using PSIPRED (Jones) [+1/-1]
• SP3
    – Sparks 2 plus…
    – Residue-depth dependent profile
• SP4
    – SP3 plus…
    – Solvent accessibility prediction using SABLE (Adamczak, Porollo, Meller)


• Trained (parameterized)
    – using ProSup (Sippl; 2000) alignments
• Tests performed
    –   Fold recognition (FR) + Model building: Lindahl FR set
    –   FR + Model building: LiveBench 8 (MaxSub)
    –   FR + Model building: CASP7 (GDT Z-score)
    –   Alignment: Sali’s test set (200 pairs, 65% overlap, 3.5 Å) (TM overlap)
HMAP
• Sequence-based profile
• Secondary structure
• Affine gap penalty

GN2
• Sequence-based profile (AA)
• Contact number (CN)
• Secondary structure (SS)
• Hydrophilicity index (HI)
• Structure-derived gap penalty
    – Geometric distance (GP = exp (D – 8Å))
    – Hydrogen bonding
    – Insertions more likely with small CN
    – Deletions beg./end in same SS = impossible (very high GP)
Log-likelihood ratios from structural alignments

SKA → make training alignments → count frequencies → convert to log-likelihood ratios

$LLR = \log \left( \frac{f_{\text{structure}}}{f_{\text{random}}} \right)$

$S(i,j) = LLR_0 + w_{AA} \cdot LLR_{AA}(i,j) + w_{SS} \cdot LLR_{SS}(i,j) + w_{CN} \cdot LLR_{CN}(i,j) + w_{HI} \cdot LLR_{HI}(i,j)$
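The counting step and the weighted sum can be sketched as below. This assumes that the "random" frequency is the product of the marginal frequencies of the two states, which the slide does not spell out; the weights, the toy secondary-structure pairs, and the function names are hypothetical.

```python
import math
from collections import Counter


def llr_table(aligned_pairs):
    """LLR[(a, b)] = log(f_structure(a, b) / f_random(a, b)).

    f_structure is the observed frequency of state a aligned to state b;
    f_random is assumed here to be the product of the marginal frequencies.
    """
    pair_counts = Counter(aligned_pairs)
    total = sum(pair_counts.values())
    q_counts = Counter(a for a, _ in aligned_pairs)
    t_counts = Counter(b for _, b in aligned_pairs)
    table = {}
    for (a, b), n in pair_counts.items():
        f_structure = n / total
        f_random = (q_counts[a] / total) * (t_counts[b] / total)
        table[(a, b)] = math.log(f_structure / f_random)
    return table


def combined_score(llr0, weights, llrs):
    """S(i,j) = LLR0 + sum over terms k of w_k * LLR_k(i,j)."""
    return llr0 + sum(weights[k] * llrs[k] for k in weights)


# Toy counts of (predicted SS, template SS) taken from aligned positions.
pairs = [("H", "H")] * 6 + [("H", "E")] + [("E", "E")] * 2 + [("E", "H")]
table = llr_table(pairs)

weights = {"AA": 1.0, "SS": 0.5, "CN": 0.5, "HI": 0.25}  # hypothetical weights
terms = {"AA": 1.2, "SS": 0.4, "CN": -0.2, "HI": 0.8}     # hypothetical LLR values
score = combined_score(0.0, weights, terms)
print(table[("H", "H")], table[("H", "E")], score)
```

State pairs seen more often in structural alignments than expected by chance get a positive LLR, and the others get a negative one.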
Log-odds substitution matrix for aligning SS to predicted SS (PSIPRED), based on structural alignment (SKA)
Should we use dot [ aaQ , aaT ] ?
Construction of a log-odds score based on the cos-angle function between profiles
$CN = 0.72 \sum_{\text{CA atoms}} \frac{1}{r^2}$
Construction of a log-odds score based on contact number counts of structural alignments




$N_{\text{weighted\_CN}} = \frac{1}{20} \sum_{\text{CA atoms}} \frac{1}{(r / 3.8\,\text{Å})^2} = 0.72 \sum_{\text{CA atoms}} \frac{1}{r^2}$
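A small sketch of the weighted contact number, assuming the sum runs over all other CA atoms in the structure (the residue itself is skipped to avoid r = 0) and that coordinates are in Å. The function name and the three-atom toy chain are hypothetical. The prefactor follows from 3.8² / 20 ≈ 0.72.

```python
import math

SCALE = 3.8  # Å, distance scale used on the slide


def weighted_contact_number(ca_coords, index):
    """(1/20) * sum over other CA atoms of 1 / (r / 3.8 Å)^2."""
    xi = ca_coords[index]
    total = 0.0
    for k, xk in enumerate(ca_coords):
        if k == index:
            continue  # skip the residue itself
        r2 = sum((a - b) ** 2 for a, b in zip(xi, xk))
        total += 1.0 / (r2 / SCALE ** 2)
    return total / 20.0


# Toy chain: three CA atoms on a line, 3.8 Å apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
cn = weighted_contact_number(coords, 1)
approx = 0.72 * (2 / 3.8 ** 2)  # same quantity with the rounded prefactor
print(cn, approx)
```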
[Figure: hydrophilicity index by residue, ordered K, RE, QD, N, P, H, ST, GY, W, AMFLVIC]

$HI = \sum_{i}^{\text{profile}} HI_i$
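Construction of a log-odds score based on observed levels of HI agreement between the query (Q) and template (T)

[Figure: two panels, Observed and Fitted; axes ordered by residue: K, RE, QD, N, P, H, ST, GY, W, AMFLVIC]

Fitted: $\exp\left( \exp\left( -\left| H_Q - H_T \right| \right) \cdot \left( 0.75 + 0.3 \cdot \left| H_T - 0.22 \right| \right) \right) - 1.8$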
Training and Benchmarking Sets
SCOP 1.71 all vs. all ( skan psd < 0.6, rmsd < 3.5 ) → 1M pairs
→ sort pairs by % sid ( from 0%, “devilish set” )
→ re-order, 7.5% sid on top ( “difficult set” )
→ filter ( ali len > 80, % sid < 40, ska psd < 0.6 ) → 326k pairs
→ loop: Any more pairs?
    no: Done! → test set (difficult: 995 pairs; devilish: 913 pairs)
    yes: take next top pair ( lowest % sid in list )
        Scop family already in benchmark?
            yes: back to “Any more pairs?”
            no: add pair to benchmark, then back to “Any more pairs?”
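The selection loop in this flowchart can be sketched as a greedy pass over the sorted pair list. The dictionary field names are hypothetical, and "family already in benchmark" is interpreted here as either member of the pair belonging to a family that was already used; the re-ordering step for the “difficult set” is not reproduced, so the sketch follows the lowest-identity-first ordering.

```python
def build_benchmark(pairs, min_len=80, max_sid=40.0, max_psd=0.6):
    """Greedy selection: lowest % sid first, at most one pair per SCOP family.

    Each pair is a dict with hypothetical keys:
    'q_family', 't_family', 'sid' (% identity), 'ali_len', 'psd'.
    """
    candidates = [
        p for p in pairs
        if p["ali_len"] > min_len and p["sid"] < max_sid and p["psd"] < max_psd
    ]
    candidates.sort(key=lambda p: p["sid"])  # lowest % sid on top
    seen_families = set()
    benchmark = []
    for p in candidates:
        families = {p["q_family"], p["t_family"]}
        if families & seen_families:
            continue  # family already in benchmark: skip this pair
        benchmark.append(p)
        seen_families |= families
    return benchmark


pairs = [
    {"q_family": "a.1.1.1", "t_family": "a.1.1.2", "sid": 3.0, "ali_len": 100, "psd": 0.3},
    {"q_family": "a.1.1.1", "t_family": "b.1.1.1", "sid": 5.0, "ali_len": 120, "psd": 0.4},
    {"q_family": "c.1.1.1", "t_family": "c.1.1.1", "sid": 10.0, "ali_len": 90, "psd": 0.5},
    {"q_family": "d.1.1.1", "t_family": "d.2.1.1", "sid": 2.0, "ali_len": 60, "psd": 0.2},
]
selected = build_benchmark(pairs)
print([p["sid"] for p in selected])
```

In the toy input, the 2% pair is dropped by the length filter and the 5% pair is skipped because one of its families is already represented.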
SCOP 1.71: pairs from all vs. all comparison
→ loop: Any more pairs?
    no: Done! → list of protein pairs w/o sequence similarity to test set → make training set… (difficult: 238 pairs)
    yes: Blast against test set sequences
        No e-value < 1?
            yes: keep pair, back to “Any more pairs?”
            no: remove sequence pair, back to “Any more pairs?”
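The filtering step can be sketched as follows, assuming the BLAST searches were already run and summarized as the lowest e-value each sequence obtained against the test set. The dictionary `best_evalue`, the sequence identifiers, and the function name are hypothetical; running BLAST itself is outside this sketch.

```python
def filter_training_pairs(pairs, best_evalue, threshold=1.0):
    """Keep a pair only if neither sequence hits the test set with e-value < threshold.

    best_evalue maps a sequence id to its lowest BLAST e-value against the
    test set; sequences without any hit are treated as having no similarity.
    """
    kept = []
    for q, t in pairs:
        lowest = min(best_evalue.get(q, float("inf")), best_evalue.get(t, float("inf")))
        if lowest < threshold:
            continue  # similar to a test-set sequence: remove the pair
        kept.append((q, t))
    return kept


candidate_pairs = [("seqA", "seqB"), ("seqC", "seqD"), ("seqE", "seqF")]
best_evalue = {"seqA": 5.0, "seqB": 12.0, "seqC": 1e-10, "seqD": 3.0}
training = filter_training_pairs(candidate_pairs, best_evalue)
print(training)
```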
Scop 1.71 Training set results
Summary of counts:
Class:       5
Fold:        102
Superfamily: 120
                     +148 folds represented once
Sequence Identity:
0 - 5%       30
5 - 10%      110
10 - 15%     48
15 - 20%     18
20 - 25%     11
25 - 30%     5
30 - 35%     7
35 - 40%     9
40 - 45%
45 - 100%
all: 238

Classes:
c    49
d     44
b     32
a     23
e     2
Shift performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4




[Figure: four panels comparing gn2 with nalign, sparks2, sp3, and sp4; panel counts: 39/31, 54/18, 65/19, 55/18]

3 gn2 alignments with shift > 50
Qmod performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4




[Figure: four panels comparing gn2 with nalign, sparks2, sp3, and sp4; panel counts: 129/94, 136/85, 119/105, 110/114]
Overall performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4




            Scoring function       Total shift           Residues aligned correctly
            gn2                    1124                  21,500
            nalign                 1179*                 21,197
            sparks2                1522                  20,669
            sp3                    1607                  21,020
            sp4                    1672                  21,299*
Scop 1.71 Test set results
Summary of counts:
Class:       7
Fold:        341
Superfamily: 460
                     +230 folds represented once
Sequence Identity:
0 - 5%       72
5 - 10%      423
10 - 15%     148
15 - 20%     103
20 - 25%     90
25 - 30%     67
30 - 35%     42
35 - 40%     37
40 - 45%
45 - 100%
all:         995

Classes:
d            182
c            141
b            137
a            115
e            18
f            18
g            3
Shift performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4




[Figure: four panels comparing gn2 with nalign, sparks2, sp3, and sp4; panel counts: 136/111, 159/112, 174/102, 161/111]

18 gn2 alignments with shift > 50
Qmod performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4
[Figure: four panels comparing gn2 with each method; panel counts: nalign 524/342, sparks2 544/344, sp3 514/379, sp4 489/408]
Overall performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

            Scoring function       Total shift           Residues aligned correctly
            gn2                    4718                  110,289*
            sparks2                4935*                 109,328
            sp4                    5038                  111,351
            nalign                 5071                  109,349
            sp3                    5377                  110,172

            Scoring function       Total shift relative to best     Correctly aligned relative to best
                                   (Wilcoxon test)                  (Wilcoxon test)
            gn2                    0                                -1,062 (p < 0.22)
            sparks2                +217 (p < 5*10^-4)               -2,023 (p < 5*10^-4)
            sp4                    +320 (p < 5*10^-4)               0
            nalign                 +353 (p < 5*10^-4)               -2,002 (p < 1.36*10^-2)
            sp3                    +659 (p < 5*10^-4)               -1,179 (p < 5*10^-4)
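The "relative to best" columns follow directly from the totals in the first table: the best total shift is the lowest one and the best count of correctly aligned residues is the highest one. A short check of that arithmetic, using only the numbers from the table (the variable names are illustrative, and the Wilcoxon p-values are not recomputed because they need per-alignment data):

```python
# (total shift, residues aligned correctly) on the 995-pair test set
results = {
    "gn2": (4718, 110289),
    "sparks2": (4935, 109328),
    "sp4": (5038, 111351),
    "nalign": (5071, 109349),
    "sp3": (5377, 110172),
}

best_shift = min(shift for shift, _ in results.values())        # lower is better
best_correct = max(correct for _, correct in results.values())  # higher is better

relative = {
    name: (shift - best_shift, correct - best_correct)
    for name, (shift, correct) in results.items()
}

for name, (d_shift, d_correct) in relative.items():
    print(f"{name:8s} {d_shift:+5d} {d_correct:+7d}")
```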
Spo0 set results
(Q = 1F51, T = Spo0 family)
141 ali’s; panel count: 74/29

Scoring function   Total shift    Residues aligned correctly
gn2                1035 (-22%)    6283 (+13%)
nalign             1323           5547
Remarks

Apparent success of the LLR method, but some mysteries

Sali test set
          (next slide)

Performance is underestimated in alignments with structural repeats
        (next slide +1)

Need for looking at alternative structural alignments

Room for improvement
       E.g. adding FUGUE-like (Blundell) sequence-structure LLR
       -or- SABLE/SA prediction
       -or- IBR potential (Zhu)
Summary of counts:
Class: 7
Fold: 74
Superfamily: 86
NA: 2
                     +38 represented once

Sequence Identity:
0 - 5%             13
5 - 10%            11
10 - 15%           11
15 - 20%           16
20 - 25%           79
25 - 30%           60
30 - 35%           24
35 - 40%           12
40 - 45%           4
45 - 100%          2
all:               239
(note: psid calculated by ska)

Classes:
b 99
c 95
d 63
a    34
e 4
g 2
f 1




                                 Madhusudhan MS, Marti-Renom MA, Sanchez R, Sali A.
                                 Variable gap penalty for protein sequence-structure alignment.
                                 Protein Eng Des Sel. 2006 Mar;19(3):129-33
Caveat (example #1) from Training Set
Structural information in protein sequence alignment accuracy
[Figure: panels labeled “Sparks score” and “SP3 score”]
What’s a scoring function?


[Figure: dynamic programming matrices for the sequences a b b c d d d e f g (top) and a b c d e f g (side), with two example alignments:

    a b b c d d d e f g
    a b - c d - - e f g

    a b b c d d d e f g
    a b - - c d - e f g]

$\max \sum S(\bullet) + \min \sum C(-)$, with similarity $S$ and cost $C < 0$


Aims
Optimal alignment problem:                          Native alignment scores best
Sampling suboptimal alignments:           Native alignment scores best &
                                          Poor alignments kept at minimum
