Structural information in protein sequence alignment accuracy

Comparisons of Sequence Alignment Scoring Functions:

On the Use of Structural Information to Improve Performance

Feb 6, 2008

What’s a scoring function?

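[Figure: dynamic programming matrix aligning a b b c d d d e f g (columns) against a b c d e f g (rows); alignments are read off as paths through the matrix, with a second path showing an alternative gap placement.]

a b b c d d d e f g
a b - c d - - e f g

Optimal :: $\max \left( \sum S(\bullet) - \sum C(-) \right)$
similarity S, cost C > 0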
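As an illustration of the objective above, here is a minimal sketch of a global alignment by dynamic programming that maximizes summed similarity minus summed gap cost. The match/mismatch values and the linear per-position gap cost are assumptions chosen for the example, not the scoring functions compared in this talk.

```python
def global_align(q, t, match=1.0, mismatch=-1.0, gap=0.5):
    """Maximize sum of similarities S minus sum of gap costs C (C > 0).

    Illustrative parameters only: identity-based similarity and a
    linear gap cost per gapped position.
    """
    n, m = len(q), len(t)
    # score[i][j] = best score for aligning q[:i] with t[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if q[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align q[i-1] with t[j-1]
                              score[i - 1][j] - gap,    # gap in t
                              score[i][j - 1] - gap)    # gap in q
    # Trace one optimal path back from the bottom-right corner.
    aq, at = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if q[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            aq.append(q[i - 1]); at.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] - gap):
            aq.append(q[i - 1]); at.append("-"); i -= 1
        else:
            aq.append("-"); at.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(aq)), "".join(reversed(at))


best, row_q, row_t = global_align("abbcdddefg", "abcdefg")
print(best)
print(row_q)
print(row_t)
```

With these illustrative parameters the toy pair from the slide reaches its best score with seven matches and three gapped positions; which of the equally scoring gap placements is reported depends on the tie-breaking order in the traceback.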

Aims
Optimal alignment problem: Native alignment scores best
Suboptimal (SO) alignment sampling problem: Native alignment scores best &
Poor alignments kept at a minimum &
Avoid “unproductive” alignments

Productive versus Unproductive Alignment Sampling

[Figure: alternative alignments among the example sequences XXXAAAYYY, XXXAAADEFAAAYYY, XXXaYYY and XXXAKLMAYYY, for example XXX--AKLMA--YYY, XXX-AKLMA---YYY and XXX---AKLMA-YYY. Panel labels: Non-redundant (good) | Redundant (not good).]

Classes of Methods for Sampling Suboptimal Alignments
• Top-down Enumeration
– Classical Waterman (Near-optimal alignments): Path > Opt-δ

• Iterative Elimination (IE)
– Waterman & Eggert
– Saqi, Bates & Sternberg

• Parametric Sampling (PS)
– Chivian & Baker; 2006

• Combined IE + PS
– Jaroszewski, Li & Godzik; 2002

• Stochastic Sampling
– John & Sali; 2003

• Fragment Set Approach (S4)

[Side note on the slide: sample over lots of … P (sim1,gap1,ss1), P (sim2,gap1,ss1), …, P (simn,gapn,ssn)]
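The "Path > Opt-δ" criterion of top-down enumeration can be sketched as a bounded walk back through the dynamic programming matrix. This is a toy illustration under assumed parameters (identity similarity, linear gap cost); the function names are hypothetical and it is not the implementation of any of the methods listed above.

```python
def forward_matrix(q, t, match, mismatch, gap):
    """F[i][j] = best score for aligning q[:i] with t[:j]."""
    n, m = len(q), len(t)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if q[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
    return F


def near_optimal(q, t, delta, match=1.0, mismatch=-1.0, gap=0.5):
    """Enumerate every alignment path whose score is > Opt - delta."""
    F = forward_matrix(q, t, match, mismatch, gap)
    opt = F[len(q)][len(t)]
    found = []

    def walk(i, j, suffix, aq, at):
        # F[i][j] + suffix is the best total any path through (i, j)
        # with this suffix can still reach; prune if it cannot qualify.
        if F[i][j] + suffix <= opt - delta:
            return
        if i == 0 and j == 0:
            found.append((suffix, aq, at))
            return
        if i > 0 and j > 0:
            s = match if q[i - 1] == t[j - 1] else mismatch
            walk(i - 1, j - 1, suffix + s, q[i - 1] + aq, t[j - 1] + at)
        if i > 0:
            walk(i - 1, j, suffix - gap, q[i - 1] + aq, "-" + at)
        if j > 0:
            walk(i, j - 1, suffix - gap, "-" + aq, t[j - 1] + at)

    walk(len(q), len(t), 0.0, "", "")
    return found


for score, row_q, row_t in near_optimal("abbc", "abc", delta=0.25):
    print(score, row_q, row_t)
```

Raising delta quickly returns many paths that differ only by shuffled gaps, which is the redundancy problem the other sampling classes try to avoid.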

Critical Questions

Within the scope of the scoring function:
Am I ranking the most native alignment first?

Within the scope of alignment sampling:
Am I eliminating poor/impossible alignments?
Am I sampling efficiently/with little redundancy?

New GN2 v. HMAP – sp2 – sp3 – sp4

Organization

Talk about software library for doing sequence alignment

Talk about the HMAP and Sparks-family of scoring functions

New method: GN2

Benchmark design & results

HMAP2 – STL in C++
(generic programming)

[Diagram: templates T1 T2 T3 and queries Q1 Q2 feed an Algorithm built from three parts:
Evaluator (sparks?, HMAP, gnoali, gn2): T HMAP Q = DPM (dynamic programming matrix)
Enumerator (optimal, S4, Waterman, RC ?): DPM ENUM = AS (alignment set [pair list])
Format (Fasta, PIR): AS → formatted output, e.g.
aabbccdef
aa---cdef

aabbccdef
---aacdef]

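[Diagram: profile construction and alignment pipeline.
Template profile, built from primary sequence, secondary structure, structure and residue depth: sequence-based prof.; contact numbers, distances, HBs; solvent accessibility; depth-dep. a.a. freq; hydrophilicity.
Query profile, built from primary sequence: sequence-based prof.; PSIPRED prediction; SABLE prediction.
Sequence-based profiles come from PSI BLAST against the NR sequence database.
Template profile + Query profile → Algorithm → Alignments → Models.]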

• Affine gaps: fast, good for DB search (HMAP)
• Arbitrary gaps: nonlinear gaps, structure-derived gaps (AS Yang - 2002)
• Double-sided gaps (zigzag alignment): most flexible, potentially most costly (A Sali - 2006)

Zigzag alignment of a b c d e with a b x y e:
abcd--e
ab--xye

[Figure: one panel per gap model, gap cost G plotted against gap length l = 0, 1, 2…, with curves labeled ss and coil.]

HMAP

TEMPLATE profile, one row per position:
sequence profile | secondary structure (H E C) | conf. | gap open, extn
.01 .02 … 0.45 ... 0.02 | 0 0 1 | 1 | 3.7 0.3
…
… .04 .025 0.02 | 0 1 0 | 1 | 12.8 0.9

QUERY profile (PSIPRED), one row per position:
sequence profile | secondary structure (H E C) | conf. = continuously valued from [0..1]
.02 .08 … 0.25 ... 0.02 | 0 0 1
…
… .03 .015 0.05 | 0 0 1

$S_{Q,T} = \mathrm{dot}[aa_Q, aa_T] \cdot \exp[W \cdot ss_{QT}]$

$ss_{QT} = 1 \cdot conf_Q$ if $ss_Q = ss_T$
$ss_{QT} = -0.5 \cdot conf_Q$ if $ss_Q \neq ss_T$

W = 0.5 (new opt value = 0.55)

$Z_{Q,T} = (S_{Q,T} - \mu) / \sigma$

$G_I, G_E = 3.7, 0.3$ if $ss_T$ = coil
$G_I, G_E = 12.8, 0.9$ if $ss_T \neq$ coil
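The HMAP position score and gap parameters on this slide can be sketched as follows. This is a minimal illustration, not the HMAP code: it assumes the secondary structure states are reduced to single labels ("H", "E", "C"), and the choice to compute µ and σ over whatever list of scores is passed in is an assumption, since the slide does not say which scores they are taken over. All function names are hypothetical.

```python
import math


def ss_term(ss_q, ss_t, conf_q):
    """ss_QT: +1 * conf_Q when the states agree, -0.5 * conf_Q otherwise."""
    return conf_q if ss_q == ss_t else -0.5 * conf_q


def hmap_score(aa_q, aa_t, ss_q, ss_t, conf_q, w=0.5):
    """S_QT = dot[aa_Q, aa_T] * exp[W * ss_QT] for one query/template position pair."""
    dot = sum(a * b for a, b in zip(aa_q, aa_t))
    return dot * math.exp(w * ss_term(ss_q, ss_t, conf_q))


def gap_params(ss_t):
    """(gap open, gap extension) chosen by the template secondary structure."""
    return (3.7, 0.3) if ss_t == "C" else (12.8, 0.9)


def z_scores(scores):
    """Z = (S - mu) / sigma over the given scores (assumed scope, population sigma)."""
    mu = sum(scores) / len(scores)
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
    return [(s - mu) / sigma for s in scores]


# Two toy two-letter profiles; real profiles have 20 columns.
agree = hmap_score([0.5, 0.5], [0.5, 0.5], "H", "H", conf_q=1.0)
differ = hmap_score([0.5, 0.5], [0.5, 0.5], "H", "E", conf_q=1.0)
print(agree > differ, gap_params("C"), gap_params("H"))
```

Because the secondary structure term sits inside the exponential, agreement multiplies the profile dot product by exp(W · conf_Q) and disagreement shrinks it by exp(-0.5 · W · conf_Q), so a low-confidence prediction changes the score very little either way.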

Sparks scoring functions
• Sparks 2
– Sequence-based profile-to-profile
– Secondary structure prediction using PSIPRED (Jones) [+1/-1]
• SP3
– Sparks 2 plus…
– Residue-depth dependent profile
• SP4
– SP3 plus…
– Solvent accessibility prediction using SABLE (Adamczak, Porollo, Meller)

• Trained (parameterized)
– using ProSup (Sippl; 2000) alignments
• Tests performed
– Fold recognition (FR) + Model building: Lindahl FR set
– FR + Model building: LiveBench 8 (MaxSub)
– FR + Model building: CASP7 (GDT Z-score)
– Alignment: Sali’s test set (200 pairs, 65% overlap, 3.5 Å) (TM overlap)

HMAP
• Sequence-based profile
• Secondary structure
• Affine gap penalty

GN2
• Sequence-based profile (AA)
• Contact number (CN)
• Secondary structure (SS)
• Hydrophilicity index (HI)
• Structure-derived gap penalty
– Geometric distance (GP = exp (D – 8Å))
– Hydrogen bonding
– Insertions more likely with small CN
– Deletions beg./end in same SS = impossible (very high GP)

Log-likelihood ratios from structural alignments

SKA: Make training alignments → Count frequencies → Convert to log-likelihood ratios

$LLR = \log \left( \frac{f_{structure}}{f_{random}} \right)$

$S(i,j) = LLR_0 + w_{AA} \cdot LLR_{AA}(i,j) + w_{SS} \cdot LLR_{SS}(i,j) + w_{CN} \cdot LLR_{CN}(i,j) + w_{HI} \cdot LLR_{HI}(i,j)$
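The count-then-convert step and the weighted sum can be sketched like this. It is an illustration only: taking f_random as the product of the two marginal frequencies is the usual log-odds convention and is an assumption here, as are the function names, the toy data and the weights.

```python
import math
from collections import Counter


def llr_table(aligned_pairs):
    """Log-likelihood ratios from (query_state, template_state) pairs.

    aligned_pairs holds one tuple per structurally aligned position.
    f_structure is the observed pair frequency; f_random is assumed to be
    the product of the marginal frequencies of the two states.
    """
    total = len(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    q_counts = Counter(a for a, _ in aligned_pairs)
    t_counts = Counter(b for _, b in aligned_pairs)
    table = {}
    for (a, b), n in pair_counts.items():
        f_structure = n / total
        f_random = (q_counts[a] / total) * (t_counts[b] / total)
        table[(a, b)] = math.log(f_structure / f_random)
    return table


def combined_score(llr0, weights, llrs):
    """S(i,j) = LLR_0 + sum over terms k of w_k * LLR_k(i,j)."""
    return llr0 + sum(weights[k] * llrs[k] for k in weights)


# Toy secondary-structure counts: agreement is three times as common as disagreement.
pairs = [("H", "H")] * 3 + [("H", "E")] + [("E", "E")] * 3 + [("E", "H")]
table = llr_table(pairs)
print(table[("H", "H")] > 0, table[("H", "E")] < 0)
```

Pairs seen more often in structural alignments than expected by chance get a positive LLR and the rest a negative one, so each term rewards or penalizes a candidate match on the same scale before the weights are applied.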

Log-odd substitution matrix for aligning SS-to-predicted SS (PSI-PRED)
based on structural alignment (SKA)

Should we use dot [ aaQ , aaT ] ?

Construction of a log-odds score based on the cos-angle function between profiles

$CN = 0.72 \sum_{CA\ atoms} \frac{1}{r^2}$

Construction of a log-odds score based on contact number counts of structural alignments

$weighted\_CN = \frac{1}{20} \sum_{CA\ atoms} \frac{1}{(r / 3.8\,\text{Å})^2} = 0.72 \sum_{CA\ atoms} \frac{1}{r^2}$
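A minimal sketch of the weighted contact number, assuming the sum for each residue runs over the CA atoms of all other residues (the slide does not say whether any neighbors are excluded). The function name and the toy coordinates are hypothetical.

```python
import math


def weighted_contact_numbers(ca_coords, scale=3.8, norm=20.0):
    """Per-residue weighted CN = (1/norm) * sum over other CA atoms of 1 / (r / scale)^2.

    ca_coords: list of (x, y, z) CA positions in Angstrom.
    With scale = 3.8 and norm = 20 this equals about 0.72 * sum(1 / r^2).
    """
    numbers = []
    for i, a in enumerate(ca_coords):
        total = 0.0
        for j, b in enumerate(ca_coords):
            if i == j:
                continue  # a residue does not contact itself
            r = math.dist(a, b)
            total += 1.0 / (r / scale) ** 2
        numbers.append(total / norm)
    return numbers


# Three CA atoms on a line, spaced by 3.8 Angstrom.
print(weighted_contact_numbers([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]))
```

The prefactor follows from the two constants: 3.8² / 20 = 14.44 / 20 ≈ 0.72, which is why both forms of the formula appear on the slide.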

[Figure: hydrophilicity index scale, amino acids ordered K RE QD N P H ST GY W AMFLVIC]

$HI = \sum_{i}^{profile} HI_i$

Construction of a log-odds score based on observed levels of HI agreement btwn the Q&T

[Figure: Observed and Fitted panels, amino acids ordered K RE QD N P H ST GY W AMFLVIC]

Fitted: $\exp\left( \exp\left( -|H_Q - H_T| \right) \right) \cdot \left( 0.75 + 0.3 \cdot |H_T - 0.22| \right) - 1.8$

Training and Benchmarking Sets

Test set (flowchart):
SCOP 1.71 all vs. all ( skan psd < 0.6, rmsd < 3.5 ) → 1M pairs
→ sort pairs by % sid ( from 0%, “devilish set” )
→ re-order, 7.5% sid on top ( “difficult set” )
→ filter ( ali len > 80, % sid < 40, ska psd < 0.6 ) → 326k pairs
→ take next top pair ( lowest % sid in list )
→ Scop family already in benchmark? no: add pair to benchmark; yes: skip
→ Any more pairs? yes: take next top pair; no: Done! → test set
difficult: 995 pairs
devilish: 913 pairs

Training set (flowchart):
SCOP 1.71 pairs from all vs. all comparison
→ Blast against test set sequences
→ No e-value < 1? yes: keep pair; no: remove sequence pair
→ Any more pairs? yes: next pair; no: Done! → List of protein pairs w/o sequence similarity to test set
→ make training set…
difficult: 238 pairs
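The filter and greedy selection in the test-set flowchart can be sketched as below. The record fields (`id`, `sid`, `ali_len`, `psd`, `families`) are hypothetical, and skipping a pair when either of its two SCOP families is already represented is an assumption about how the "family already in benchmark?" check is applied.

```python
def passes_filter(pair):
    """Flowchart filter: ali len > 80, % sid < 40, psd < 0.6."""
    return pair["ali_len"] > 80 and pair["sid"] < 40 and pair["psd"] < 0.6


def select_benchmark(pairs):
    """Greedy pick: lowest % sequence identity first, each SCOP family used once."""
    seen_families = set()
    benchmark = []
    for pair in sorted(filter(passes_filter, pairs), key=lambda p: p["sid"]):
        if any(fam in seen_families for fam in pair["families"]):
            continue  # assumed rule: skip if either family is already represented
        benchmark.append(pair)
        seen_families.update(pair["families"])
    return benchmark


# Toy records; identifiers and values are made up for illustration.
candidates = [
    {"id": "p1", "sid": 12.0, "ali_len": 120, "psd": 0.3, "families": ("a.1.1.1", "a.1.1.2")},
    {"id": "p2", "sid": 3.0, "ali_len": 95, "psd": 0.5, "families": ("a.1.1.1", "b.2.1.1")},
    {"id": "p3", "sid": 8.0, "ali_len": 150, "psd": 0.2, "families": ("c.1.1.1", "c.1.1.1")},
    {"id": "p4", "sid": 20.0, "ali_len": 200, "psd": 0.4, "families": ("b.2.1.1", "d.1.1.1")},
    {"id": "p5", "sid": 1.0, "ali_len": 60, "psd": 0.1, "families": ("e.1.1.1", "e.1.1.1")},
]
print([p["id"] for p in select_benchmark(candidates)])
```

Sorting from the lowest identity upward biases the set toward the hardest pair available for each family, which matches the "devilish" ordering; the "difficult" ordering would only change the sort key.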

Scop 1.71 Training set results

Summary of counts:
Class: 5
Fold: 102
Superfamily: 120
+148 folds represented once
Sequence Identity:
0 - 5% 30
5 - 10% 110
10 - 15% 48
15 - 20% 18
20 - 25% 11
25 - 30% 5
30 - 35% 7
35 - 40% 9
40 - 45%
45 - 100%
all: 238

Classes:
c 49
d 44
b 32
a 23
e 2

Shift performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

39/31

54/18

65/19 55/18

3 gn2 alignments with shift > 50

Qmod performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

129/94 136/85

119/105 110/114

Overall performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

Scoring function    Total shift    Residues aligned correctly
gn2                 1124           21,500
nalign              1179*          21,197
sparks2             1522           20,669
sp3                 1607           21,020
sp4                 1672           21,299*

Summary of counts:
Class: 7
Fold: 341
Superfamily: 460
+230 folds represented once
Sequence Identity:
0 - 5% 72
5 - 10% 423
10 - 15% 148
15 - 20% 103
20 - 25% 90
25 - 30% 67
30 - 35% 42
35 - 40% 37
40 - 45%
45 - 100%
all: 995

Classes:
d 182
c 141
b 137
a 115
e 18
f 18
g 3

Shift performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

136/111 159/112

174/102 161/111

18 gn2 alignments with shift > 50

Qmod performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

nalign: 524/342
sparks2: 544/344
sp3: 514/379
sp4: 489/408

Overall performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

Scoring function    Total shift    Residues aligned correctly
gn2                 4718           110,289*
sparks2             4935*          109,328
sp4                 5038           111,351
nalign              5071           109,349
sp3                 5377           110,172

Scoring function    Total shift relative to best (Wilcoxon test)    Correctly aligned relative to best (Wilcoxon test)
gn2                 0                                               -1,062 (p < 0.22)
sparks2             +217 (p < 5*10^-4)                              -2,023 (p < 5*10^-4)
sp4                 +320 (p < 5*10^-4)                              0
nalign              +353 (p < 5*10^-4)                              -2,002 (p < 1.36*10^-2)
sp3                 +659 (p < 5*10^-4)                              -1,179 (p < 5*10^-4)
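The "relative to best" columns are plain differences against the best method in each column (lowest total shift, highest number of correctly aligned residues). A short check of that arithmetic, using the test-data totals reported on the previous slide (the p-values come from the Wilcoxon test and are not recomputed here):

```python
# (total shift, residues aligned correctly) on the 995-pair test set
totals = {
    "gn2": (4718, 110289),
    "sparks2": (4935, 109328),
    "sp4": (5038, 111351),
    "nalign": (5071, 109349),
    "sp3": (5377, 110172),
}

best_shift = min(shift for shift, _ in totals.values())        # lower is better
best_correct = max(correct for _, correct in totals.values())  # higher is better

relative = {
    name: (shift - best_shift, correct - best_correct)
    for name, (shift, correct) in totals.items()
}

for name, (d_shift, d_correct) in relative.items():
    print(f"{name:8s} {d_shift:+5d} {d_correct:+7d}")
```

gn2 is the reference for total shift and sp4 for correctly aligned residues, so each of the two shows a zero in its own column.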

Spo0 set results
(Q = 1F51, T = Spo0 family)

141 ali’s; 74/29

Scoring function    Total shift    Residues aligned correctly
gn2                 1035 (-22%)    6283 (+13%)
nalign              1323           5547

Remarks

Apparent success of the LLR method, but some mysteries

Sali test set
(next slide)

Performance is underestimated in alignments with structural repeats
(next slide +1)

Need for looking at alternative structural alignments

Room for improvement
E.g. adding FUGUE-like (Blundell) sequence-structure LLR
-or- SABLE/SA prediction
-or- IBR potential (Zhu)

Summary of counts:
Class: 7
Fold: 74
Superfamily: 86
NA: 2
+38 represented once
Sequence Identity:
0 - 5% 13
5 - 10% 11
10 - 15% 11
15 - 20% 16
20 - 25% 79
25 - 30% 60
30 - 35% 24
35 - 40% 12
40 - 45% 4
45 - 100% 2
all: 239
(note: psid calculated by ska)

Classes:
b 99
c 95
d 63
a 34
e 4
g 2
f 1

Madhusudhan MS, Marti-Renom MA, Sanchez R, Sali A.
Variable gap penalty for protein sequence-structure alignment.
Protein Eng Des Sel. 2006 Mar;19(3):129-33

Caveat (example #1) from Training Set


What’s a scoring function?

[Figure: dynamic programming matrices aligning a b b c d d d e f g (columns) against a b c d e f g (rows), with two example alignments.]

a b b c d d d e f g
a b - c d - - e f g

a b b c d d d e f g
a b - - c d - e f g

$\max \sum S(\bullet) + \min \sum C(-)$
similarity S, cost C < 0

Aims
Optimal alignment problem: Native alignment scores best
Sampling suboptimal alignments: Native alignment scores best &
Poor alignments kept at minimum
