INTRODUCTION TO
PROTEIN STRUCTURE
PREDICTION
WILEY SERIES ON BIOINFORMATICS:
COMPUTATIONAL TECHNIQUES AND ENGINEERING
Series Editors, Yi Pan & Albert Zomaya
Knowledge Discovery in Bioinformatics: Techniques, Methods and
Applications / Xiaohua Hu & Yi Pan
Grid Computing for Bioinformatics and Computational Biology /
Albert Zomaya & El-Ghazali Talbi
Analysis of Biological Networks / Björn H. Junker & Falk Schreiber
Bioinformatics Algorithms: Techniques and Applications / Ion Mandoiu
& Alexander Zelikovsky
Machine Learning in Bioinformatics / Yanqing Zhang &
Jagath C. Rajapakse
Biomolecular Networks / Luonan Chen, Rui-Sheng Wang, &
Xiang-Sun Zhang
Computational Systems Biology / Huma Lodhi
Computational Intelligence and Pattern Analysis in Biology Informatics /
Ujjwal Maulik, Sanghamitra, & Jason T. Wang
Mathematics of Bioinformatics: Theory, Practice, and Applications /
Matthew He
Introduction to Protein Structure Prediction: Methods and Algorithms /
Huzefa Rangwala & George Karypis
INTRODUCTION TO
PROTEIN STRUCTURE
PREDICTION
Methods and Algorithms
Edited by
HUZEFA RANGWALA
GEORGE KARYPIS
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means, electronic, mechanical, photocopying, recording, scanning, or
otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright
Act, without either the prior written permission of the Publisher, or authorization through
payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at
www.copyright.com. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030,
(201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created
or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional
where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any
other commercial damages, including but not limited to special, incidental, consequential, or
other damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Rangwala, Huzefa.
Introduction to protein structure prediction : methods and algorithms / Huzefa Rangwala,
George Karypis.
p. cm.—(Wiley series in bioinformatics; 14)
Includes bibliographical references and index.
ISBN 978-0-470-47059-6 (hardback)
1. Proteins—Structure—Mathematical models. 2. Proteins—Structure—Computer
simulation. I. Karypis, G. (George) II. Title.
QP551.R225 2010
572′.633—dc22
2010028352
Printed in Singapore
10 9 8 7 6 5 4 3 2 1
CONTENTS
PREFACE
CONTRIBUTORS
1 INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
Huzefa Rangwala and George Karypis
2 CASP: A DRIVING FORCE IN PROTEIN STRUCTURE MODELING
Andriy Kryshtafovych, Krzysztof Fidelis, and John Moult
3 THE PROTEIN STRUCTURE INITIATIVE
Andras Fiser, Adam Godzik, Christine Orengo, and Burkhard Rost
4 PREDICTION OF ONE-DIMENSIONAL STRUCTURAL PROPERTIES OF PROTEINS BY INTEGRATED NEURAL NETWORKS
Yaoqi Zhou and Eshel Faraggi
5 LOCAL STRUCTURE ALPHABETS
Agnel Praveen Joseph, Aurélie Bornot, and Alexandre G. de Brevern
6 SHEDDING LIGHT ON TRANSMEMBRANE TOPOLOGY
Gábor E. Tusnády and István Simon
7 CONTACT MAP PREDICTION BY MACHINE LEARNING
Alberto J.M. Martin, Catherine Mooney, Ian Walsh, and Gianluca Pollastri
8 A SURVEY OF REMOTE HOMOLOGY DETECTION AND FOLD RECOGNITION METHODS
Huzefa Rangwala
9 INTEGRATIVE PROTEIN FOLD RECOGNITION BY ALIGNMENTS AND MACHINE LEARNING
Allison N. Tegge, Zheng Wang, and Jianlin Cheng
10 TASSER-BASED PROTEIN STRUCTURE PREDICTION
Shashi Bhushan Pandit, Hongyi Zhou, and Jeffrey Skolnick
11 COMPOSITE APPROACHES TO PROTEIN TERTIARY STRUCTURE PREDICTION: A CASE-STUDY BY I-TASSER
Ambrish Roy, Sitao Wu, and Yang Zhang
12 HYBRID METHODS FOR PROTEIN STRUCTURE PREDICTION
Dmitri Mouradov, Bostjan Kobe, Nicholas E. Dixon, and Thomas Huber
13 MODELING LOOPS IN PROTEIN STRUCTURES
Narcis Fernandez-Fuentes and Andras Fiser
14 MODEL QUALITY ASSESSMENT USING A STATISTICAL PROGRAM THAT ADOPTS A SIDE CHAIN ENVIRONMENT VIEWPOINT
Genki Terashi, Mayuko Takeda-Shitaka, Kazuhiko Kanou, and Hideaki Umeyama
15 MODEL QUALITY PREDICTION
Liam J. McGuffin
16 LIGAND-BINDING RESIDUE PREDICTION
Chris Kauffman and George Karypis
17 MODELING AND VALIDATION OF TRANSMEMBRANE PROTEIN STRUCTURES
Maya Schushan and Nir Ben-Tal
18 STRUCTURE-BASED MACHINE LEARNING MODELS FOR COMPUTATIONAL MUTAGENESIS
Majid Masso and Iosif I. Vaisman
19 CONFORMATIONAL SEARCH FOR THE PROTEIN NATIVE STATE
Amarda Shehu
20 MODELING MUTATIONS IN PROTEINS USING MEDUSA AND DISCRETE MOLECULE DYNAMICS
Shuangye Yin, Feng Ding, and Nikolay V. Dokholyan
INDEX
PREFACE
PROTEIN STRUCTURE PREDICTION
Proteins play a crucial role in governing several life processes. Stunningly
complex networks of proteins perform innumerable functions in every living
cell. Knowing the function and structure of proteins is crucial for the develop-
ment of better drugs, higher yield crops, and even synthetic biofuels. As such,
knowledge of protein structure and function leads to crucial advances in life
sciences and biology. The motivation behind the structural determination of
proteins is based on the belief that structural information provides insights as
to their function, which will ultimately result in a better understanding of
intricate biological processes.
Breakthroughs in large-scale sequencing have led to a surge in the available
protein sequence information that has far outstripped our ability to
characterize the structure and function of these proteins. Several
research groups have been working on determining the three-dimensional
structure of proteins using a wide variety of computational methods. The
problem of unraveling the relationship between the amino acid sequence of a
protein and its three-dimensional structure has been one of the grand
challenges in molecular biology. The importance and far-reaching implications
of being able to predict the structure of a protein from its amino acid sequence
are manifested by the ongoing biennial competition on “Critical Assessment of
Protein Structure Prediction” (CASP) that started more than 16 years ago.
CASP is designed to assess the performance of current structure prediction
methods, and over the years the number of participating groups has continued
to increase.
This book presents a series of chapters by authors who are involved in the
task of structure determination and using modeled structures for applications
involving drug discovery and protein design. The book is divided into the fol-
lowing themes.
BACKGROUND ON STRUCTURE PREDICTION
Chapter 1 provides an introduction to the protein structure prediction problem
along with information about databases and resources that are widely used.
Chapters 2 and 3 provide information regarding two very important initiatives
in the field: (i) the structure prediction flagship competition (CASP), and (ii)
the Protein Structure Initiative (PSI), respectively. Since many of the approaches
developed have been tested in the CASP competition, Chapter 2 explains the
need for such an evaluation and covers the problem definitions, significant
innovations, competition format, and future outlook. Chapter 3
describes the Protein Structure Initiative, which is designed to determine
representative three-dimensional structures within the human genome.
PREDICTION OF STRUCTURAL ELEMENTS
Within each structural entity called a protein there lies a set of recurring
substructures, and within these substructures are smaller substructures. Beyond
the goal of predicting the three-dimensional structure of a protein from
sequence, several other problems have been defined, and methods have been
developed for solving them. Chapters 4–6 provide the definitions of these
recurring substructures, called local alphabets or secondary structures, and the
computational approaches used to predict them. Chapter 6 specifically
focuses on transmembrane proteins, a class known to be harder to
crystallize. Knowing the pairs of residues within a protein that are in
contact or in close proximity provides useful distance constraints that can be
used while modeling the three-dimensional structure of the protein. Chapter
7 focuses on the problem of contact map prediction and also shows the use of
sophisticated machine learning methods to solve the problem. A successful
solution for each of these subproblems assists in solving the overarching
protein structure prediction problem.
TERTIARY STRUCTURE PREDICTION
Chapters 8–11 discuss the widely used structure prediction methods that rely
on homology modeling, threading, and fragment assembly. Chapters 8 and 9
discuss the problems of fold recognition and remote homology detection, which
attempt to model the three-dimensional structure of a protein using known
structures. Chapters 10 and 11 discuss threading-based approaches combined
with modeling the protein in parts or fragments, a combination that usually
helps in modeling the structure of proteins known not to have a close homolog
within the structure databases. Chapter 12 is a survey of the hybrid methods
that use a combination of computational and experimental methods to
achieve high-resolution protein structures in a high-throughput manner.
Chapter 17 provides information about the challenges in modeling transmem-
brane proteins along with a discussion of some of the widely used methods
for these sets of proteins.
Chapter 13 describes the loop prediction problem and how the technique
can be used for refinement of the modeled structures. Chapters 14 and 15
assess the modeled structures and provide a notion of the quality of structures.
This is extremely important from the perspective of a biologist, who would like
to have a metric that describes the goodness of the structure before use. Chapter
19 provides insights into the different conformations that a protein may take
and the approaches used to sample the different conformations.
FUNCTIONAL INSIGHTS
Certain parts of the protein structure may be conserved and interact with
other biomolecules (e.g., proteins, DNA, RNA, and small molecules) and
perform a particular function due to such interactions. Chapter 16 discusses
the problem of ligand-binding site prediction and its role in determining the
function of the proteins. The approach uses some of the homology modeling
principles used for modeling the entire structure. Chapter 18 introduces a
computational model that detects the differences between a protein structure
(modeled or experimentally determined) and its modeled mutant. Chapter 20
describes the use of molecular dynamics-based approaches for modeling
mutants.
ACKNOWLEDGEMENTS
We wish to acknowledge the many people who have helped us with this
project. First, we thank all the coauthors who spent time and energy to edit
their chapters and also served as reviewers by providing critical feedback for
improving other chapters. Kevin Deronne, Christopher Kauffman, and
Rezwan Ahmed also assisted in reviewing several of the chapters and helped
the book take a form that is complete on the topic of protein structure
prediction and exciting to read. Finally, we wish to thank our families and friends.
We hope that you as a reader benefit from this book and feel as excited
about this field as we are.
Huzefa Rangwala
George Karypis
CONTRIBUTORS
Nir Ben-Tal, Department of Biochemistry and Molecular Biology, Tel Aviv
University, Tel Aviv, Israel
Aurélie Bornot, Institut National de la Santé et de la Recherche Médicale,
UMR-S 665, Dynamique des Structures et Interactions des Macromolécules
Biologiques (DSIMB), Université Paris Diderot, Paris, France
Alexandre G. de Brevern, Institut National de la Santé et de la Recherche
Médicale, Université Paris Diderot, Institut National de la Transfusion
Sanguine, 75015, Paris, France
Jianlin Cheng, Computer Science Department and Informatics Institute,
University of Missouri, Columbia, MO 65211
Feng Ding, Department of Biochemistry and Biophysics, University of North
Carolina, Chapel Hill, NC 27599
Nicholas E. Dixon, School of Chemistry, University of Wollongong, NSW
2522, Australia
Nikolay V. Dokholyan, Department of Biochemistry and Biophysics,
University of North Carolina, Chapel Hill, NC 27599
Eshel Faraggi, Indiana University School of Informatics, Indiana University-
Purdue University Indianapolis, and Center for Computational Biology
and Bioinformatics, Indiana University School of Medicine, Indianapolis,
IN 46202
Krzysztof Fidelis, Protein Structure Prediction Center, Genome Center,
University of California, Davis, Davis, CA
Andras Fiser, Department of Systems and Computational Biology and
Department of Biochemistry, Albert Einstein College of Medicine, Bronx,
NY 10461
Narcis Fernandez-Fuentes, Leeds Institute of Molecular Medicine,
University of Leeds, Leeds, UK
Adam Godzik, Program in Bioinformatics and Systems Biology, Sanford-
Burnham Medical Research Institute, La Jolla, CA 92037
Thomas Huber, The University of Queensland, School of Chemistry and
Molecular Biosciences, QLD, Australia
Agnel Praveen Joseph, Institut National de la Santé et de la Recherche
Médicale, UMR-S 665, Dynamique des Structures et Interactions des
Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris,
France
Kazuhiko Kanou, School of Pharmacy, Kitasato University, Tokyo 108-8641,
Japan
George Karypis, Department of Computer Science, University of Minnesota,
Minneapolis, MN 55455
Chris Kauffman, Department of Computer Science, University of Minnesota,
Minneapolis, MN 55455
Bostjan Kobe, The University of Queensland, School of Chemistry and
Molecular Biosciences, Brisbane, Australia
Andriy Kryshtafovych, Protein Structure Prediction Center, Genome
Center, University of California, Davis, Davis, CA
Alberto J.M. Martin, Complex and Adaptive Systems Lab, School of
Computer Science and Informatics, UCD Dublin, Ireland
Majid Masso, Department of Bioinformatics and Computational Biology,
George Mason University, Manassas, VA 20110
Liam J. McGuffin, School of Biological Sciences, The University of Reading,
Reading, UK
Catherine Mooney, Shields Lab, School of Medicine and Medical Science,
University College Dublin, Ireland
John Moult, Institute for Bioscience and Biotechnology Research, University
of Maryland, Rockville, MD 20850
Dmitri Mouradov, The University of Queensland, School of Chemistry and
Molecular Biosciences, QLD, Australia
Christine Orengo, Department of Structural and Molecular Biology,
University College London, London UK
Shashi Bhushan Pandit, Center for the Study of Systems Biology, School of
Biology, Georgia Institute of Technology, Atlanta, GA 30318
Gianluca Pollastri, Complex and Adaptive Systems Lab, School of
Computer Science and Informatics, UCD Dublin, Ireland
Huzefa Rangwala, Department of Computer Science, George Mason
University, Fairfax, VA 22030
Burkhard Rost, Department of Biochemistry and Molecular Biophysics,
Columbia University, New York, NY 10032
Ambrish Roy, Center for Computational Medicine and Bioinformatics,
University of Michigan, Ann Arbor, MI 48109
Maya Schushan, Department of Biochemistry and Molecular Biology, Tel
Aviv University, Tel Aviv, Israel
Amarda Shehu, Department of Computer Science, George Mason University,
Fairfax, VA 22030
Mayuko Takeda-Shitaka, School of Pharmacy, Kitasato University, Tokyo
108-8641, Japan
István Simon, Institute of Enzymology, BRC, Hungarian Academy of
Sciences, Budapest, Hungary
Jeffrey Skolnick, Center for the Study of Systems Biology, School of Biology,
Georgia Institute of Technology, Atlanta, GA 30318
Allison N. Tegge, Computer Science Department and Informatics Institute,
University of Missouri, Columbia, MO 65211
Genki Terashi, School of Pharmacy, Kitasato University, Tokyo 108-8641,
Japan
Gábor E. Tusnády, Institute of Enzymology, BRC, Hungarian Academy of
Sciences, Budapest, Hungary
Hideaki Umeyama, School of Pharmacy, Kitasato University, Tokyo 108-
8641, Japan
Iosif I. Vaisman, Department of Bioinformatics and Computational Biology,
George Mason University, Manassas, VA 20110
Ian Walsh, Complex and Adaptive Systems Lab, School of Computer Science
and Informatics, UCD Dublin, Ireland
Zheng Wang, Computer Science Department, University of Missouri,
Columbia, MO 65211
Sitao Wu, Center for Computational Medicine and Bioinformatics, University
of Michigan, Ann Arbor, MI 48109
Shuangye Yin, Department of Biochemistry and Biophysics, University of
North Carolina, Chapel Hill, NC 27599
Yang Zhang, Center for Computational Medicine and Bioinformatics,
University of Michigan, Ann Arbor, MI 48109
Hongyi Zhou, Center for the Study of Systems Biology, School of Biology,
Georgia Institute of Technology, Atlanta, GA 30318
Yaoqi Zhou, Indiana University School of Informatics, Indiana University-
Purdue University Indianapolis, and Center for Computational Biology
and Bioinformatics, Indiana University School of Medicine, Indianapolis,
IN 46202
CHAPTER 1
INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
HUZEFA RANGWALA
Department of Computer Science
George Mason University
Fairfax, VA
GEORGE KARYPIS
Department of Computer Science
University of Minnesota
Minneapolis, MN
Introduction to Protein Structure Prediction: Methods and Algorithms,
Edited by Huzefa Rangwala and George Karypis
Copyright © 2010 John Wiley & Sons, Inc.
Proteins have a vast influence on the molecular machinery of life. Stunningly
complex networks of proteins perform innumerable functions in every living
cell. Knowing the function and structure of proteins is crucial for the
development of improved drugs, better crops, and even synthetic biofuels. As such,
knowledge of protein structure and function leads to crucial advances in life
sciences and biology.
With recent advances in large-scale sequencing technologies, we have seen
an exponential growth in protein sequence information. Protein structures are
primarily determined using X-ray crystallography or nuclear magnetic
resonance (NMR) spectroscopy, but these methods are time consuming,
expensive, and not feasible for all proteins. The experimental approaches
to determine protein function (e.g., gene knockout, targeted mutation, and
inhibition of gene expression) are low-throughput in nature [1,2]. As
such, our ability to produce sequence information far outpaces the rate at
which we can produce structural and functional information.
Consequently, researchers are increasingly reliant on computational
approaches to extract useful information from experimentally determined
three-dimensional (3D) structures and functions of proteins. Unraveling the
relationship between pure sequence information and 3D structure and/or
function remains one of the fundamental challenges in molecular biology.
Function prediction is generally approached by using inheritance through
homology [2], that is, proteins with similar sequences (common evolutionary
ancestry) frequently carry out similar functions. However, several studies [2–4]
have shown that a stronger correlation exists between structure conservation
and function, that is, structure implies function, and a higher correlation exists
between sequence conservation and structure, that is, sequence implies struc-
ture (sequence → structure → function).
1.1. INTRODUCTION TO PROTEIN STRUCTURES
In this section we introduce the basic definitions and facts about protein
structure, describe the four different levels of protein structure, and provide
details about protein structure databases.
1.1.1. Protein Structure Levels
Within each structural entity called a protein lies a set of recurring
substructures, and within these substructures are smaller substructures still. As an
example, consider hemoglobin, the oxygen-carrying molecule in human blood.
Hemoglobin has four subunits (polypeptide chains) that come together to form
its quaternary structure. Each subunit assembles (i.e., folds) itself independently
to form a tertiary structure. These tertiary structures are composed of multiple
secondary structure elements, in hemoglobin’s case α-helices. α-Helices (and their
counterpart β-sheets) have elegant repeating patterns dependent upon
sequences of amino acids.
1.1.1.1. Primary Structure. Amino acids form the basic building blocks of
proteins. An amino acid consists of a central carbon atom (Cα) bonded to an
amino (NH2) group, a carboxyl (COOH) group, and a side chain (R) group. The side
chain group differentiates the various amino acids. In the case of proteins, there
are primarily 20 different amino acids that form the building blocks. A protein
is a chain of amino acids linked by peptide bonds. Adjacent amino acids form
a peptide bond between the carboxyl group of one and the amino group of
the next. This polypeptide chain of amino acids is known as the primary
structure or the protein sequence.
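As an illustration, a primary structure can be handled as a plain string over the 20-letter amino acid alphabet. The following is a minimal Python sketch that checks such a string and counts its peptide bonds (one fewer than the number of residues in a linear chain). The one-letter codes are the standard ones; the function names and the example sequence are hypothetical and are not taken from the text.

```python
# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard codes."""
    return len(seq) > 0 and all(residue in AMINO_ACIDS for residue in seq)


def peptide_bond_count(seq: str) -> int:
    """A linear chain of n residues is joined by n - 1 peptide bonds."""
    if not is_valid_sequence(seq):
        raise ValueError("not a valid protein sequence")
    return len(seq) - 1


example = "MKTAYIAKQR"  # hypothetical 10-residue sequence
print(is_valid_sequence(example), peptide_bond_count(example))
```

Real sequence files may also contain ambiguity codes (for example, X for an unknown residue), which this sketch deliberately rejects.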
1.1.1.2. Secondary Structure. A sequence of characters representing the
secondary structure of a protein describes the general 3D form of local regions.
These regions organize themselves independently from the rest of the protein
into patterns of repeatedly occurring structural fragments. The most dominant
local conformations of polypeptide chains are α-helices and β-sheets. These
local structures have a certain regularity in their form, attributed to the
hydrogen bond interactions between various residues. An α-helix has a coil-like
structure, whereas a β-sheet consists of strands of residues lying side by side,
in parallel or antiparallel orientation. In addition
to regular secondary structure elements, irregular shapes form an important
part of the structure and function of proteins. These elements are typically
termed coil regions.
Secondary structure can be divided into several types, although usually at
least three classes (α-helix, coil, and β-sheet) are used. No unique method of
assigning residues to a particular secondary structure state from atomic
coordinates exists, although the most widely accepted protocol is based on the
Dictionary of Protein Secondary Structure (DSSP) algorithm [5]. DSSP uses
the following structural classes: H (α-helix), G (3₁₀-helix), I (π-helix), E
(β-strand), B (isolated β-bridge), T (turn), S (bend), and – (other). Several other
secondary structure assignment algorithms use a reduction scheme that
converts this eight-state assignment down to three states by assigning H and G
to the helix state (H), E and B to the strand state (E), and the rest (I, T, S,
and –) to a coil state (C). This is the format generally used in structure
databases.
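The eight-to-three state reduction described above amounts to a lookup table. The Python sketch below applies it to a string of DSSP state letters; the function name and the example string are hypothetical, and the "other" state is written here as an ASCII hyphen, which is an assumption about how the input is encoded.

```python
# Eight DSSP states -> three states, following the reduction in the text:
# H, G -> H (helix); E, B -> E (strand); the rest (I, T, S, -) -> C (coil).
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H",
    "E": "E", "B": "E",
    "I": "C", "T": "C", "S": "C", "-": "C",
}


def reduce_dssp(eight_state: str) -> str:
    """Convert a DSSP eight-state string to the three-state H/E/C alphabet."""
    try:
        return "".join(DSSP_TO_THREE_STATE[state] for state in eight_state)
    except KeyError as err:
        raise ValueError(f"unknown DSSP state: {err.args[0]!r}") from None


# Hypothetical eight-state assignment for a 21-residue stretch.
print(reduce_dssp("-HHHHGGGTT-EEEEBSS-II"))
```

Other reduction schemes exist (for instance, some treat B as coil), so the table should be read as one convention among several rather than a fixed standard.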
1.1.1.3. Tertiary Structure. The tertiary structure of the protein is defined
as the global 3D structure, represented by 3D coordinates for each atom.
These tertiary structures are composed of multiple secondary structure
elements, and the 3D structure is a function of the interacting side chains between
the different amino acids. Hence, the linear ordering of amino acids forms
secondary structure; arranging secondary structures yields tertiary structure.
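Because a tertiary structure is, in the end, a set of 3D coordinates, a minimal in-memory representation is a list of atoms with x, y, and z values. The Python sketch below shows one such representation and computes the Euclidean distance between two atoms. The class layout and the coordinates are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    name: str      # atom name, e.g. "CA" for the alpha carbon
    residue: int   # index of the residue the atom belongs to
    x: float
    y: float
    z: float


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# Hypothetical coordinates (in angstroms) for two alpha carbons.
ca1 = Atom("CA", 1, 0.0, 0.0, 0.0)
ca2 = Atom("CA", 2, 3.0, 4.0, 0.0)
print(distance(ca1, ca2))
```

Pairwise distances of this kind are the raw material for residue contact information, which constrains how a chain can be arranged in space.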
1.1.1.4. Quaternary Structure. Quaternary structures represent the
interaction between multiple polypeptide chains. The interaction between the various
chains is due mainly to the non-covalent interactions between the atoms of the
different chains. Examples of these interactions include hydrogen bonding, van
der Waals interactions, and ionic bonding; covalent disulfide bonds can also
link chains.
Research in computational structure prediction concerns itself mainly with
predicting secondary and tertiary structures from known experimentally
determined primary structure or sequence. This is due to the relative ease of
determining primary structure and the complexity involved in quaternary
structure.
1.1.2. Protein Sequence and Structure Databases
The large amount of protein sequence information, experimentally deter-
mined structure information, and structural classification information is
stored in publicly available databases. In this section we review some of the
databases that are used in this field, and provide their availability information
in Table 1.1.
1.1.2.1. Sequence Databases. The Universal Protein Resource (UniProt)
[6] is the most comprehensive warehouse containing information about protein
sequences and their annotation. It is a database of protein sequences and their
function that is formed by aggregating the information present in the
Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases.
Release 13.2 of the UniProtKB database (April 8, 2008) consists of
5,939,836 protein sequence entries (Swiss-Prot providing 362,782 entries and
TrEMBL providing 5,577,054 entries).
However, several proteins have high pairwise sequence identity, and as
such lead to redundant information. The UniProt database [6] creates subsets
of sequences such that the sequence identity between all pairs of sequences
within a subset is less than a predetermined threshold. In essence, UniProt
contains the UniRef100, UniRef90, and UniRef50 subsets, where within each
subset the sequence identity between a pair of sequences is less than 100%,
90%, and 50%, respectively.
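To make the idea of an identity threshold concrete, the Python sketch below filters a list of sequences greedily, keeping a sequence only if its identity to every sequence kept so far is below the threshold. This is an illustrative simplification and not the procedure UniProt actually uses: it assumes the sequences are already aligned to equal length, and the function names and example sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions; assumes a and b are already aligned
    to the same length (a simplification; real tools align first)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def filter_redundant(sequences: list[str], threshold: float) -> list[str]:
    """Greedily keep a sequence only if its identity to every sequence
    kept so far is below threshold (in percent)."""
    kept: list[str] = []
    for seq in sequences:
        if all(percent_identity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


# Hypothetical, pre-aligned 10-residue sequences.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MLSGWVCDEF"]
print(filter_redundant(seqs, 100.0))  # drops only the exact duplicate
print(filter_redundant(seqs, 90.0))   # also drops the one-mismatch variant
```

The result of a greedy filter depends on the order in which sequences are visited, so a production tool would also need a rule for choosing representatives.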
The National Center for Biotechnology Information (NCBI) also provides
a nonredundant (NCBI nr) database of protein sequences using sequences
from a wide variety of sources. This database may still contain pairs of proteins
with high sequence identity, because only exact duplicates are removed. The NCBI
nr version 2.2.18 (released on March 2, 2008) contains 6,441,864 protein sequences.
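Removing only exact duplicates is a much weaker filter than an identity threshold, as the following Python sketch shows. It is not NCBI's actual pipeline; the function name, record layout, and identifiers are hypothetical.

```python
def remove_exact_duplicates(
    records: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Keep the first (identifier, sequence) record for each distinct sequence."""
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for identifier, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((identifier, sequence))
    return unique


# Hypothetical records: seq2 duplicates seq1; seq3 differs by one residue.
records = [
    ("seq1", "MKTAYIAKQR"),
    ("seq2", "MKTAYIAKQR"),
    ("seq3", "MKTAYIAKQK"),
]
print(remove_exact_duplicates(records))
```

The near-identical record seq3 survives, which is why such a database can still contain highly similar pairs.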
1.1.2.2. Protein Data Bank (PDB). The Research Collaboratory for
Structural Bioinformatics (RCSB) PDB [7] stores experimentally determined
3D structures of biological macromolecules, including nucleic acids and proteins.
As of April 20, 2008, this database consists of 46,287 protein structures that
were determined using X-ray crystallography (90%), NMR (9%), and other
methods such as cryo-electron microscopy (cryo-EM). These experimental
methods are time-consuming and expensive, and X-ray crystallography
additionally requires the protein to crystallize.
TABLE 1.1 Protein Sequence and Structure Databases
Database  Information               Availability Link
UniProt   Sequence                  http://www.pir.uniprot.org/
UniRef    Cluster sequences         http://www.pir.uniprot.org/
NCBI nr   Nonredundant sequences    ftp://ftp.ncbi.nlm.nih.gov/blast/db/
PDB       Structure                 http://www.rcsb.org/
SCOP      Structure classification  http://scop.mrc-lmb.cam.ac.uk/scop/
CATH      Structure classification  http://www.cathdb.info/
FSSP      Structure classification  http://www.ebi.ac.uk/dali/fssp/
ASTRAL    Compendium                http://astral.berkeley.edu/
The databases referred to in this table are the most popular for protein
structure-related information.
1.1.2.3. Structure Classification Databases. Various methods have been
proposed to categorize protein structures. These methods are based on the
pairwise structural similarity between the protein structures, as well as the
topological and geometric arrangement of atoms and predominant secondary
structure-like subunits. Structural Classification of Proteins (SCOP) [8], Class,
Architecture, Topology, and Homologous superfamily (CATH) [9], and
Families of Structurally Similar Proteins (FSSP) [10] are three widely used
structure classification databases. The classification methodology involves
breaking a protein chain or complex into independent folding units called
domains, and then classifying these domains into a set of hierarchical classes
sharing similar structural characteristics.
SCOP Database. SCOP [8] is a manually curated database that provides a
detailed and comprehensive description of the evolutionary and structural
relationships between proteins whose structure is known (present in the PDB).
SCOP classifies protein structures using visual inspection as well as structural
comparison using a suite of automated tools. The basic unit of classification is
generally a domain. SCOP classification is based on four hierarchical levels
that encompass evolutionary and structural relationships [8]. In particular,
proteins with a clear evolutionary relationship are classified to be within the
same family. Generally, protein pairs within the same family have pairwise
residue identities greater than 30%. Protein pairs with low sequence identity,
but whose structural and functional features imply a probable common
evolutionary origin, are classified to be within the same superfamily. Protein
pairs with similar major secondary structure elements and topological
arrangement of substructures (as well as favoring certain packing geometries) are
classified to be within the same fold. Finally, protein pairs having a
predominant set of secondary structures (e.g., all-α-helix proteins) lie within the same
class. The four hierarchical levels, that is, family, superfamily, fold, and class,
define the structure of the SCOP database.
The SCOP 1.73 version database (released on September 26, 2007) classifies
34,494 PDB entries (97,178 domains) into 1086 unique folds, 1777 unique
superfamilies, and 3464 unique families.
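The four SCOP levels can be illustrated with a small sketch. It assumes that
each domain carries a dotted identifier of the form class.fold.superfamily.family
(for example a.1.1.2); the function name and the example identifiers are
illustrative assumptions, not taken from the text.

```python
from typing import Optional

# SCOP levels from most general to most specific.
LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(id_a: str, id_b: str) -> Optional[str]:
    """Return the most specific level two domains share, or None.

    Identifiers are assumed to look like "a.1.1.2"
    (class.fold.superfamily.family); this format is an assumption.
    """
    shared = None
    for level, part_a, part_b in zip(LEVELS, id_a.split("."), id_b.split(".")):
        if part_a != part_b:
            break  # the hierarchy diverges here; deeper levels cannot match
        shared = level
    return shared


if __name__ == "__main__":
    # Hypothetical identifiers, used only to exercise the function.
    print(deepest_shared_level("a.1.1.2", "a.1.1.3"))  # same superfamily
    print(deepest_shared_level("a.1.1.2", "a.1.2.1"))  # same fold
    print(deepest_shared_level("a.1.1.2", "b.1.1.1"))  # nothing shared
```

Two domains that agree down to the superfamily level but not the family level
correspond to the low-identity, probably related pairs described above.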
CATH Database. The CATH [9] database is a semi-automated protein structure
classification database. CATH uses a consensus of
three automated classification techniques to break a chain into domains and
classify them in the various structural categories [11]. Domains for proteins
that are not resolved by the consensus approach are determined manually.
These domains are then classified into the following hierarchical categories
using both manual and automated methods in conjunction.
The first level, class, is determined based on the secondary
structure composition and packing within the structure. The second level,
architecture, clusters proteins sharing the same orientation of the secondary
structure elements but ignoring the connectivity between these substructural
units. The third level, topology, groups protein pairs with a high structure
alignment score as determined by the SSAP [12] algorithm; in essence, these
proteins share both the overall shape and the connectivity of secondary
structures. The fourth level, homologous superfamily, groups proteins that
share a common ancestor, identified by
sequence alignment as well as the SSAP structure alignment method. Structures
are further classified to be within the same sequence families if they share a
high sequence identity.
The CATH 3.1.0 version database (released on January 19, 2007) classifies
30,028 (93,885 domains) proteins from the PDB into 40 architecture-level
classes, 1084 topology-level classes, and 2091 homologous-level classes.
FSSP Database. The FSSP [10] is a structure classification database. FSSP
uses an automatic classification scheme that employs exhaustive structure-
to-structure alignment of proteins using the DALI [13] alignment method. FSSP
does not provide a hierarchical classification like the SCOP and CATH databases,
but instead employs a hierarchical clustering algorithm using the pairwise
structure similarity scores. These clusters can be used to define fold classes,
although the resulting definitions are not very accurate.
There have been several studies [14,15] analyzing the relationship between
the SCOP, CATH, and FSSP databases for representing the fold space for
proteins. The major disagreement between the three databases lies in the
domain identification step, rather than the domain classification step. A high
percentage of agreement exists between the SCOP, CATH, and FSSP data-
bases especially at the fold level with sequence identity greater than 25%.
ASTRAL Compendium. The A Structural Alignment Library (ASTRAL)
[16–18] compendium is a set of databases and tools used for the analysis of protein
structures and sequences. This database is partially derived from, and
augments, the SCOP [8] database. ASTRAL provides accurate linkage between
the biological sequence and the reported structure in PDB, and identifies the
domains within the sequence using SCOP. Since the majority of domain
sequences in PDB are very similar to others, ASTRAL tools reduce the
redundancy by selecting high-quality representatives. Using the reduced
nonredundant set of representative proteins allows for sampling of all the different
structures in the PDB. This also removes bias due to overrepresented
structures. Subsets provided by ASTRAL are based on SCOP domains and use
high-quality structure files only. Independent subsets of representative
proteins are identified using a greedy algorithm with a filtering criterion based on
pairwise sequence identity determined using the Basic Local Alignment Search
Tool (BLAST) [19], an e-value-based threshold, or a SCOP level-based filter.
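The greedy selection of representatives can be sketched as follows. This is a
minimal illustration, not the ASTRAL implementation: the identity function is a
toy stand-in for a BLAST-derived identity, and the input is assumed to be
already ordered from highest to lowest structure quality. All names are
hypothetical.

```python
from typing import Callable, List


def toy_identity(a: str, b: str) -> float:
    """Toy stand-in for alignment-based identity (assumption):
    fraction of matching positions over the shorter sequence, no gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_representatives(
    sequences: List[str],
    threshold: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> List[str]:
    """Keep a sequence only if its identity to every representative kept
    so far is below `threshold`. Input order decides priority, so callers
    are assumed to sort by quality beforehand."""
    kept: List[str] = []
    for seq in sequences:
        if all(identity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
    # The second sequence is 90% identical to the first, so a 90% cutoff drops it.
    print(greedy_representatives(seqs, threshold=0.9))
```

Because the procedure is greedy, the result depends on the input order; a
different ordering can yield a different, equally valid nonredundant subset.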
1.2. PROTEIN STRUCTURE PREDICTION METHODS
One of the biggest goals in structural bioinformatics is the prediction of the
3D structure of a protein from its one-dimensional (1D) protein sequence.
The goal is to be able to determine the shape (known as a fold) that a given
amino acid sequence will adopt. The problem is further divided based on
whether the sequence will adopt a new fold or bear resemblance to an existing
fold (template) in some protein structure database. Fold recognition is easy
when the sequence in question has a high degree of sequence similarity to a
sequence with known structure [20]. If the two sequences share evolutionary
ancestry they are said to be homologous. For such sequence pairs we can build
the structure for the query protein by choosing the structure of the known
homologous sequence as template. This is known as comparative modeling.
In the case where no good template structure exists for the query, one must
attempt to build the protein tertiary structure from scratch. These methods
are usually called ab initio methods. In a third scenario, fold prediction, there
may be no strong sequence similarity to a known structure, yet
a structural template may still exist for the given sequence. To clarify this case,
if one knew the target structure, one could extract the template
using structure–structure alignments of the target against the entire structural
database. It is important to note that the target and template need not be
homologous. Depending on whether they are, this scenario gives rise to the fold
prediction (homologous) and fold prediction (analogous) problems of the Critical
Assessment of Protein Structure Prediction (CASP) competition.
1.2.1. Comparative Modeling
Comparative Modeling or homology modeling is used when there exists a
clear relationship between the sequence of a query protein (unknown struc-
ture) and a sequence of a known structure. The most basic approach to struc-
ture prediction for such (query) proteins is to perform a pairwise sequence
alignment against each sequence in protein sequence databases. This can be
accomplished using sequence alignment algorithms such as Smith-Waterman
[21] or sequence search algorithms (e.g., BLAST [19]). With a good sequence
alignment in hand, the challenge in comparative modeling becomes how to
best build a 3D protein structure for a query protein using the template
structure.
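The template search described above can be sketched with a small local
alignment routine in the spirit of Smith-Waterman [21]. The scoring values
(match, mismatch, and a linear gap penalty) and the function names are
illustrative assumptions; practical tools score residue pairs with a
substitution matrix and usually use affine gap penalties.

```python
from typing import Dict, List


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def rank_templates(query: str, templates: Dict[str, str]) -> List[str]:
    """Order template names by decreasing local alignment score to the query."""
    return sorted(templates,
                  key=lambda name: smith_waterman_score(query, templates[name]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical template sequences keyed by made-up names.
    print(rank_templates("ACGT", {"t1": "TTTT", "t2": "AACGTA"}))
```

The top-ranked entry would then serve as the candidate template whose
structure is used to build the model for the query.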
The heart of the above process is the selection of a suitable structural tem-
plate based on sequence pair similarity. This is followed by the alignment of
query sequence to the template structure selected to build the backbone of
the query protein. Finally the entire modeled structure is refined by loop
construction and side chain modeling. Several comparative modeling methods,
more commonly known as modeler programs, have been developed over the
past several years [22,23] focusing on various parts of the problem.
As seen in the various years of CASP [24,25], the span of comparative
modeling approaches [22,23] follows five basic steps: (i) selecting one or more
suitable templates, (ii) utilizing sensitive sequence–template alignment algorithms,
(iii) building a protein model using the sequence–structure alignment as
reference, (iv) evaluating the quality of the model, and (v) refining the model. These
typical steps for the comparative modeling process are shown in Figure 1.1.
FIGURE 1.1 Flowchart for the comparative modeling process. Boxes in the
flowchart: Start; Template Identification (Structure Databases); Choose
Template; Align Target Sequence to Template Structure; Build Model for Target
Using Template Structure; Raw Model; Side Chain Placement; Loop Modeling;
Refinement; Evaluate the Model; Model Good?; Stop.
1.2.2. Fold Prediction (Homologous)
While satisfactory methods exist to detect homologs (proteins that share
similar evolutionary ancestry) with high levels of similarity, accurately detect-
ing homologs at low levels of sequence similarity (remote homology detec-
tion) remains a challenging problem. Some of the most popular approaches
for remote homology prediction compare a protein with a collection of related
proteins using methods such as Position-Specific Iterative-BLAST (PSI-
BLAST) [26], protein family profiles [27], hidden Markov models (HMMs)
[28,29], and Sequence Alignment and Modeling System (SAM) [30]. These
schemes produce models that are generative in the sense that they build a
model for a set of related proteins and then check to see how well this model
explains a candidate protein.
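The generative idea can be sketched with a deliberately simplified sequence
profile: column-wise residue probabilities are estimated from a set of related
sequences, and a candidate is scored by how much better the profile explains it
than a background model. This is not PSI-BLAST, a profile HMM, or SAM; the
ungapped equal-length input, the uniform background, the pseudocount value, and
all names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned: List[str],
                  pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column residue probabilities from equal-length, ungapped sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in ALPHABET})
    return profile


def log_odds(profile: List[Dict[str, float]], candidate: str) -> float:
    """Sum over columns of log(p_profile / p_background), uniform background.
    Positive values mean the family model explains the candidate better."""
    if len(candidate) != len(profile):
        raise ValueError("candidate length must match the profile length")
    background = 1.0 / len(ALPHABET)
    return sum(math.log(column[aa] / background)
               for column, aa in zip(profile, candidate))


if __name__ == "__main__":
    family = ["ACDE", "ACDE", "ACDF"]  # hypothetical family members
    profile = build_profile(family)
    print(log_odds(profile, "ACDE") > log_odds(profile, "WWWW"))
```

A candidate would be assigned to the family when its score exceeds some
threshold; choosing that threshold is outside the scope of this sketch.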
In recent years, the performance of remote homology detection has been
further improved through the use of methods that explicitly model the differ-
ences between the various protein families (classes) by building discriminative
models. In particular, a number of different methods that use Support Vector
Machines (SVM) [31] have been developed to produce results that are gener-
ally superior to those produced by either pairwise sequence comparisons or
approaches based on generative models—provided there are sufficient train-
ing data [32–39].
1.2.3. Fold Prediction (Analogous)
Occasionally a query sequence will have a native fold similar to another
known fold in a database, but the two sequences will have no detectable
similarity. In many cases the two proteins will lack an evolutionary relationship as
well. As the definition of this problem relies on the inability of current methods
to detect sequence similarity, the set of proteins falling into this category
remains in flux. As methods continue to improve at finding sequence
similarities, owing to increasing database size and better techniques, the
number of proteins in question decreases. Techniques to find structures for
such query sequences revolve around mounting the query sequence on a series
of template structures in a process known as threading [40–42]. An objective
energy function provides a score for each alignment, and the highest scoring
template is chosen.
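The threading loop can be sketched as follows. Each template is reduced to a
string of per-position environments, and a toy scoring function stands in for
the energy function; both are assumptions made for illustration and do not
reproduce the potentials used in [40–42]. Only ungapped placements are tried.

```python
from typing import Dict, Tuple

HYDROPHOBIC = set("AVLIMFWC")  # toy residue grouping (assumption)


def position_score(residue: str, environment: str) -> int:
    """Toy potential: +1 when a hydrophobic residue sits in a buried ('B')
    position or a polar residue in an exposed ('E') one, otherwise -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread_score(query: str, environments: str) -> float:
    """Best score over all ungapped placements of the query on one template.
    Returns -inf when the template is shorter than the query."""
    offsets = range(len(environments) - len(query) + 1)
    return max(
        (sum(position_score(r, e)
             for r, e in zip(query, environments[o:o + len(query)]))
         for o in offsets),
        default=float("-inf"),
    )


def best_template(query: str, templates: Dict[str, str]) -> Tuple[str, float]:
    """Score the query against every template and return the top scorer."""
    scores = {name: thread_score(query, env) for name, env in templates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Hypothetical templates described by buried/exposed patterns.
    print(best_template("LLKK", {"t1": "BBEE", "t2": "EEBB"}))
```

The sketch makes the limitation discussed next visible: the function always
returns some template, even when none of them is a correct match.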
Obviously, if the correct template does not exist in the series then the
method will not produce an accurate prediction. As a result of this limitation,
predicting the structure of proteins in this category is as challenging as predict-
ing protein targets that are part of the new or rare folds.
1.2.4. Ab Initio
Techniques to predict novel protein structure have come a long way in recent
years, although a definitive solution to the problem remains elusive. Research
in this area can be roughly divided into fragment assembly [43–45] and first
principle-based approaches, although occasionally the two are combined [46].
The former attempt to assign a fragment with known structure to a section of
the unknown query sequence. The latter start with an unfolded conformation,
usually surrounded by solvent, and allow simulated physical forces to fold the
protein as would normally happen in vivo. Usually, algorithms from either
class will use reduced representations of query proteins during initial stages
to reduce the overall complexity of the problem.
Even among ab initio prediction methods, the state-of-the-art
approaches [46–48] first identify several template structures (using the template
selection methods of comparative modeling). The final protein
is modeled using an assembly of fragments or substructures fitted together
using a highly optimized approximate energy and statistics-based potential
function.
This book presents methods developed for protein structure prediction. In
particular, methods and problems that are prevalent in a biennial structure
prediction competition (CASP) are discussed in the first half of the book. The
second half of the book discusses methods that combine experimental and
computational approaches for structure prediction, as well as new techniques
for predicting structures of transmembrane proteins. Finally, the book
discusses the applications of protein structure within the context of function
prediction and drug discovery.
REFERENCES
1. G. Pandey, V. Kumar, and M. Steinbach. Computational approaches for protein
function prediction: A survey. Technical Report 06-23, Department of Computer
Science and Engineering, University of Minnesota, 2006.
2. D. Lee, O. Redfern, and C. Orengo. Predicting protein function from sequence
and structure. Nature Reviews. Molecular Cell Biology, 8(12):995–1005, 2007.
3. J.C. Whisstock and A.M. Lesk. Prediction of protein function from protein
sequence and structure. Quarterly Reviews of Biophysics, 36(3):307–340, 2003.
4. D. Devos and A. Valencia. Practical limits of function prediction. Proteins,
41(1):98–107, 2000.
5. W. Kabsch and C. Sander. Dictionary of protein secondary structure: Pattern
recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–
2637, 1983.
6. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids
Research, 36(Database issue):D190–D195, 2008.
7. H.M. Berman, T.N. Bhat, P.E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and
J. Westbrook. The Protein Data Bank and the challenge of structural genomics.
Nature Structural Biology, 7:957–959, 2000.
8. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural
classification of proteins database for the investigation of sequences and
structures. Journal of Molecular Biology, 247:536–540, 1995.
9. C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, and J.M. Thornton.
CATH: A hierarchic classification of protein domain structures. Structure,
5(8):1093–1108, 1997.
10. L. Holm and C. Sander. The FSSP database: Fold classification based on
structure-structure alignment of proteins. Nucleic Acids Research, 24(1):206–209, 1996.
11. S. Jones, M. Stewart, A. Michie, M.B. Swindells, C. Orengo, and J.M. Thornton.
Domain assignment for protein structures using a consensus approach:
Characterization and analysis. Protein Science, 7(2):233–242, 1998.
12. W.R. Taylor and C.A. Orengo. Protein structure alignment. Journal of Molecular
Biology, 208(1):1–22, 1989.
13. L. Holm and C. Sander. Protein structure comparison by alignment of distance
matrices. Journal of Molecular Biology, 233(1):123–138, 1993.
14. C. Hadley and D. Jones. A systematic comparison of protein structure
classifications: SCOP, CATH and FSSP. Structure, 7(9):1099–1112, 1999.
15. R. Day, D.A.C. Beck, R.S. Armen, and V. Daggett. A consensus view of fold space:
Combining SCOP, CATH, and the Dali Domain Dictionary. Protein Science,
12(10):2150–2160, 2003.
16. S.E. Brenner, P. Koehl, and M. Levitt. The astral compendium for sequence and
structure analysis. Nucleic Acids Research, 28:254–256, 2000.
17. J.-M. Chandonia, N.S. Walker, L.L. Conte, P. Koehl, M. Levitt, and S.E. Brenner.
ASTRAL compendium enhancements. Nucleic Acids Research, 30(1):260–263,
2002.
18. J.M. Chandonia, G. Hon, N.S. Walker, L.L. Conte, P. Koehl, M. Levitt, and S.E.
Brenner. The astral compendium in 2004. Nucleic Acids Research, 32:D189–D192,
2004.
19. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local
alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
20. P. Bourne and H. Weissig. Structural Bioinformatics. Hoboken, NJ: John Wiley &
Sons, 2003.
21. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences.
Journal of Molecular Biology, 147:195–197, 1981.
22. P.A. Bates and M.J.E Sternberg. Model building by comparison at casp3: Using
expert knowledge and computer automation. Proteins: Structure, Functions, and
Genetics, 3:47–54, 1999.
23. A. Fiser, R.K. Do, and A. Sali. Modeling of loops in protein structures. Protein
Science, 9:1753–1773, 2000.
24. C. Venclovas. Comparative modeling in casp5: Progress is evident, but alignment
errors remain a significant hindrance. Proteins: Structure, Function, and Genetics,
53:380–388, 2003.
25. C. Venclovas and M. Margelevicius. Comparative modeling in casp6 using consen-
sus approach to template selection, sequence-structure alignment, and structure
assessment. Proteins: Structure, Function, and Bioinformatics, 7:99–105, 2005.
26. S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and
D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database
search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
27. M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: Detection of
distantly related proteins. PNAS, 84:4355–4358, 1987.
28. A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov
models in computational biology: Applications to protein modeling. Journal of
Molecular Biology, 235:1501–1531, 1994.
29. P. Baldi, Y. Chauvin, T. Hunkapiller, and M. McClure. Hidden Markov models of
biological primary sequence information. PNAS, 91:1053–1063, 1994.
30. K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting
remote protein homologies. Bioinformatics, 14(10):846–856, 1998.
31. V. Vapnik. Statistical Learning Theory. New York: John Wiley, 1998.
32. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for
detecting remote protein homologies. Journal of Computational Biology, 7(1/2):95–114,
2000.
33. L. Liao and W.S. Noble. Combining pairwise sequence similarity and support
vector machines for detecting remote protein evolutionary and structural relation-
ships. Proceedings of the International Conference on Research in Computational
Molecular Biology, 225–232, 2002.
34. C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel for svm
protein classification. Proceedings of the Pacific Symposium on Biocomputing,
564–575, 2002.
35. C. Leslie, E. Eskin, W.S. Noble, and J. Weston. Mismatch string kernels for svm
protein classification. Advances in Neural Information Processing Systems,
20(4):467–476, 2003.
36. Y. Hou, W. Hsu, M.L. Lee, and C. Bystroff. Efficient remote homology detection
using local structure. Bioinformatics, 19(17):2294–2301, 2003.
37. Y. Hou, W. Hsu, M.L. Lee, and C. Bystroff. Remote homology detection using
local sequence-structure correlations. Proteins: Structure, Function, and
Bioinformatics, 57:518–530, 2004.
38. H. Saigo, J.P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using
string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
39. R. Kuang, E. Ie, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-
based string kernels for remote homology detection and motif extraction. Journal
of Bioinformatics and Computational Biology, 3:152–160, 2004.
40. D.T. Jones, W.R. Taylor, and J.M. Thornton. A new approach to protein fold
recognition. Nature, 358:86–89, 1992.
41. D.T. Jones. GenTHREADER: An efficient and reliable protein fold recognition method
for genomic sequences. Journal of Molecular Biology, 287(4):797–815, 1999.
42. J.U. Bowie, R. Luethy, and D. Eisenberg. A method to identify protein sequences
that fold into a known three-dimensional structure. Science, 253:164–170, 1991.
43. K.T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein ter-
tiary structures from fragments with similar local sequences using simulated
annealing and bayesian scoring functions. Journal of Molecular Biology, 268:209–
225, 1997.
44. K. Karplus, R. Karchin, J. Draper, J. Casper, Y. Mandel-Gutfreund, M. Diekhans,
and R. Hughey. Combining local-structure, fold-recognition, and new fold methods
for protein structure prediction. Proteins: Structure, Function, and Genetics,
53:491–496, 2003.
45. J. Lee, S.-Y. Kim, K. Joo, I. Kim, and J. Lee. Prediction of protein tertiary structure
using profesy, a novel method based on fragment assembly and conformational
space annealing. Proteins: Structure, Function, and Bioinformatics, 56:704–714,
2004.
46. C.A. Rohl, C.E.M. Strauss, K.M.S. Misura, and D. Baker. Protein structure predic-
tion using rosetta. Methods in Enzymology, 383:66–93, 2004.
47. Y. Zhang. I-tasser server for protein 3d structure prediction. BMC Bioinformatics,
9:40, 2008.
48. Y. Zhang, A.J. Arakaki, and J. Skolnick. Tasser: An automated method for the
prediction of protein tertiary structures in casp6. Proteins: Structure, Function, and
Bioinformatics, 7:91–98, 2005.
CHAPTER 2
CASP: A DRIVING FORCE IN PROTEIN
STRUCTURE MODELING
ANDRIY KRYSHTAFOVYCH and KRZYSZTOF FIDELIS
Protein Structure Prediction Center
Genome Center
University of California, Davis
Davis, CA
JOHN MOULT
Center for Advanced Research in Biotechnology
University of Maryland, College Park
College Park, MD
Introduction to Protein Structure Prediction: Methods and Algorithms,
Edited by Huzefa Rangwala and George Karypis
Copyright © 2010 John Wiley & Sons, Inc.
2.1. WHY CRITICAL ASSESSMENT OF PROTEIN STRUCTURE
PREDICTION (CASP) WAS NEEDED?
More than half a century has elapsed since it was shown that amino acid
sequence determines the three-dimensional structure of a protein [1], but a
general procedure to translate sequence into structure is still to be established.
Several dozen methods for generating protein structure from sequence have
been developed, providing different levels of model accuracy in different
modeling circumstances. With such a variety of modeling approaches and
success levels, it was important to establish an objective procedure to compare
the performances of the methods and learn their advantages and weaknesses.
Also, with only sparse reports on the performance of most methods it was
difficult to arrive at a clear understanding of current capabilities and
bottlenecks in the field. Specifically, it was not possible to address many key
questions about modeling methods, in particular:
1. What are the most effective strategies for protein structure modeling?
2. What are the main factors influencing the outcome of a protein structure
modeling experiment and how close can a model get to the correspond-
ing experimental structure?
3. How can related structures on which a model can be based be identified
reliably (the template identification problem)? How accurately can coor-
dinates from the template structure be mapped to the correct positions
on the target sequence (the alignment problem)? Are models produced
by altering/refining templates more accurate than the models built by
simply copying coordinates of the template (the refinement problem)?
4. How well can the reliability of the model in general and specific regions
in particular be estimated (the quality assessment problem)?
5. How well can fully automatic modeling servers perform, compared with
a combination of computing methods and human knowledge?
6. Has there been progress in the field?
7. What are the bottlenecks to further progress?
8. Where can future efforts be most productively focused?
In order to rigorously address these issues, John Moult and colleagues
pioneered the CASP experiment in 1994 [2]. The initiative was well accepted by
the community of computational biologists, and the experiment, after eight
completed rounds, continues to attract considerable attention from protein
structure modelers around the world. Two hundred thirty-four predictor
groups from 25 countries participated in the last completed round, CASP8,
submitting over 80,000 predictions (see Fig. 2.1 for historical CASP participation
statistics), and approximately the same number of predictor groups are
participating in CASP9, which is currently (July 2010) under way.
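The participation figures quoted above can be cross-checked with a few lines
of arithmetic. The per-round values below are read from the bar labels of
Figure 2.1; pairing each label with a round and a category (tertiary structure
versus other) is an assumption about how the extracted labels line up.

```python
# (tertiary structure predictions, predictions in other categories) per round,
# as read from the Figure 2.1 bar labels (pairing is an assumption).
casp_predictions = {
    "CASP1 (1994)": (129, 0),
    "CASP2 (1996)": (891, 56),
    "CASP3 (1998)": (2569, 1238),
    "CASP4 (2000)": (9698, 1438),
    "CASP5 (2002)": (25105, 3623),
    "CASP6 (2004)": (34831, 6452),
    "CASP7 (2006)": (52235, 11482),
    "CASP8 (2008)": (55130, 25430),
}

# Total number of submitted predictions per round.
totals = {name: ts + other for name, (ts, other) in casp_predictions.items()}

if __name__ == "__main__":
    for name, total in totals.items():
        print(f"{name}: {total}")
```

Under this reading, the CASP8 total is 55,130 + 25,430 = 80,560, consistent
with the "over 80,000 predictions" stated in the text.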
Even though we, CASP co-organizers, continue to emphasize that CASP
is primarily a scientific endeavor aimed at establishing the current state of the
art in protein structure prediction, many view it more as a “world
championship” in this field of science. Thus, to a large extent, CASP owes its
popularity to the twin human drives of competitiveness and curiosity. Whatever the
case, a large community of structure modelers devotes very considerable effort
to the process, and it has now been emulated in other areas of computational
biology [3–6].
2.2. CASP PRINCIPLES AND ORGANIZATION
In the pre-CASP times, protein structure modeling methods were tested using
the procedure schematically shown in Figure 2.2a. Method developers selected
sequences to test their own methods (usually with different research groups
selecting different sets of proteins), and assessed the results by comparing
models to the experimental structures already known to them at the time of
“prediction.” Many apparently successful modeling results were reported in
the literature, but the inability of others to reproduce the results and the lack
of resulting useful applications strongly suggested that this testing approach
was not strict enough to ensure objective assessment of the results. In
particular, many felt that the reported results were too easily influenced by the
known answers. CASP was established to address the deficiencies in these
traditional testing procedures.
FIGURE 2.1 Statistics on (a) the number of participating groups and (b) number of
submitted predictions in CASP experiments held so far. In panel (b), bars representing
the number of tertiary structure predictions are shown in dark gray, while bars
representing the cumulative number of predictions in other categories (secondary structure,
residue-residue contacts, disorder regions, domain boundaries, function, quality
assessment) are shown in light gray.
(a) Participating Groups
CASP    Year   Groups
CASP1   1994   35
CASP2   1996   70
CASP3   1998   98
CASP4   2000   163
CASP5   2002   215
CASP6   2004   208
CASP7   2006   253
CASP8   2008   234
(b) CASP Predictions
CASP    Year   3D       Other
CASP1   1994   129      0
CASP2   1996   891      56
CASP3   1998   2569     1238
CASP4   2000   9698     1438
CASP5   2002   25105    3623
CASP6   2004   34831    6452
CASP7   2006   52235    11482
CASP8   2008   55130    25430
The main principles of CASP, summarized in Figure 2.2b, are:
• “Blind” prediction regime. Predictors are required to submit their models
before the answers (experimental structures) are publicly available. This
is the primary CASP principle for ensuring rigorous conclusions.
• Independent assessment of the results. Experts in the field are invited to
perform an independent assessment of all submitted models. The asses-
sors may not participate in the experiment in the role of predictors.
• Same targets for everyone. Proteins for modeling (“targets” in CASP
jargon) are selected not by the predictors but by the organizers who are
not permitted to participate in the experiment and so have no interest in
introducing any selection bias. The same set of targets is used to test all
the methods, thus facilitating direct comparison of performance.
Organizers strive to provide a reasonably large set of targets with a bal-
anced range of difficulty, so that the assessment is statistically sound and
shows the range of success and failure across the spectrum of structure
modeling problems.
• Anonymity of assessment. All information that could be directly or
indirectly used to identify submitting research groups is stripped from the
predictions. This information is not made available to the assessors until
after their analysis of the results is completed.
• Same evaluation criteria for everyone. All predictions are evaluated using
the same set of numerical criteria.
• Data availability for post-experiment comparisons. All predictions and
automatic evaluation results are released to the public upon completion
of each CASP experiment, so as to allow others to reproduce the results,
and to facilitate methods development.
• Control of the experiment by the participants. Those participating in
CASP are involved in shaping the rules and scope of the experiment
through a variety of mechanisms, particularly a discussion forum
(FORCASP) and a predictors’ meeting at each conference, where motions
for change are considered and voted upon.
FIGURE 2.2 Schematics of (a) pre-CASP and (b) CASP testing procedures for
protein structure prediction methods.
Together, these principles ensure a more objective determination of capa-
bilities in the field of protein structure modeling than the conventional peer-
review publication system. They make unjustified claims more difficult to
publish, and provide a powerful mechanism for predictors to establish the
strength of their methods.
The principles remain untouched from one experiment to another, but a
number of changes and additions to the details have been introduced, and
these are summarized in Table 2.1.
2.3. CASP PROCESS
CASP is a complicated process, requiring careful planning, data management,
and security. The Protein Structure Prediction Center, established to support
the experiment at the Lawrence Livermore Laboratory in 1996 and located since
2005 at the University of California, Davis, provides the infrastructure for methods
testing, develops method evaluation and visualization tools, and handles all
data management issues [7].
Experiments are held every 2 years. The timetable of a typical CASP round
is schematically shown in Figure 2.3. The experiment is open to all. The
Prediction Center releases targets for prediction and collects models from
registered participants for approximately 3 months. Targets for structure pre-
diction are either structures soon-to-be solved by X-ray crystallography or
nuclear magnetic resonance (NMR) spectroscopy, or structures already solved
but not yet publicly accessible. Prediction methods are divided into two
categories—those using a combination of computational methods and human
experience, and those relying solely on computational methods. The integrity
of the latter category is ensured by requiring that servers process target infor-
mation and return models automatically. A window of 3 weeks is usually
provided for prediction of a target by human-expert groups and 3 days by
servers. Following closing of the server prediction window, the server models
are posted at the Prediction Center web site. These models can then be used
by human-expert predictors as starting points for further, more detailed mod-
eling. They are also used for testing model quality assessment methods in
CASP. Once all models of a target have been collected and the experimental

TABLE 2.1 Changes in the Consecutive CASPs

CASP1 (1994)
  Prediction categories: TS, AL, SS. Protein tertiary structure (TS:
    coordinates format; AL: alignment format). Secondary structure (SS).
  Main evaluation measures/packages: RMSD.
  General: Main CASP principles were established.

CASP2 (1996)
  Prediction categories: TS, AL, SS. Prediction of protein-ligand complexes
    introduced.
  General: Prediction Center established to support CASP.

CASP3 (1998)
  Prediction categories: TS, AL, SS. Prediction of complexes dropped.
  Main evaluation measures/packages: ProSup and DALI packages were used for
    structural superpositions. New evaluation software tested at the
    Prediction Center to replace RMSD with a measure more suitable for
    model-target comparison.
  General: CAFASP experiment to evaluate fold recognition servers run as a
    satellite to CASP.

CASP4 (2000)
  Prediction categories: TS, AL, SS, RR. Residue–residue (RR) contact
    prediction introduced.
  Main evaluation measures/packages: New evaluation software further
    developed, resulting in the LGA package [9]. The GDT_TS measure of
    structural similarity, and AL0 score for correctness of the model-target
    alignment used as basic CASP measures.

CASP5 (2002)
  Prediction categories: TS, AL, RR, DR. SS dropped. Disordered regions (DR)
    prediction introduced.

CASP6 (2004)
  Prediction categories: TS, AL, RR, DR, DP, FN. Domain boundary (DP)
    prediction introduced. Function prediction (FN) introduced.
  Main evaluation measures/packages: DAL, nonrigid body structure
    superposition software, used for scoring models in addition to LGA.
  General: CASP moved to the independent of CAFASP server testing procedure.
    Time for server response was set to 48 hours plus 24 hours for potential
    format corrections. Release of server predictions to human-expert groups
    72 hours after target release.

CASP7 (2006)
  Prediction categories: TS, AL, RR, DR, DP, FN, QA, TR. Model quality
    assessment (QA) category introduced. Model refinement (TR) category
    introduced. Prediction of multimers introduced.
  Main evaluation measures/packages: MAMMOTH structure superposition program
    additionally used for analysis of the results.
  General: Structural assessment categories changed from classic division on
    comparative modeling/fold recognition/ab initio to
    template-based/template-free. High-accuracy modeling category separately
    assessed.

CASP8 (2008)
  Prediction categories: TS, AL, RR, DR, DP, FN, QA, TR. Prediction of
    multimers dropped. FN category was narrowed to binding site prediction.
  Main evaluation measures/packages: DALI structure superposition program was
    additionally used for analysis of the results. Prediction Center
    automatically calculated group rankings for comparative modeling targets
    according to different measures.
  General: Limit on number of targets for human-expert groups. Division of
    targets into human-server and server only categories. Time for server
    response was set to 72 hours. Separate assessor for contacts, domains,
    and function predictions.

CASP9 (2010, under way)
  Prediction categories: TS, AL, RR, DR, FN, QA, TR. DP prediction is
    dropped. Prediction of multimers is reinstated.

structure is available, the Prediction Center performs a standard numerical
evaluation of the models, taking the experimental structure as the gold stan-
dard. A battery of tools is used for the numerical evaluation of predictions—
LGA [8], ACE [9], DAL [10], MAMMOTH [11], DALI [12]. If the target
consists of more than one well-defined structural domain, the evaluation is
performed on each of these as well as on the complete target (the official
domain boundaries are defined by the assessors). The results of automatic
evaluation are made available to the independent assessors, who typically add
their own analysis methods and make more subjective assessments of the
merits and faults of the models. The identity of the predictors is concealed
from the assessors while they conduct their analysis. Assessment outcomes are
presented to the community at the predictors’ meeting usually held in
December of a CASP year. At that time, results of the evaluations are also
made publicly available through the Prediction Center web site (http://
predictioncenter.org) allowing predictors to compare their own models with
those submitted by other groups. Details of all the experiments completed so
far and their results are available through this web site. The web site also hosts
a discussion forum, FORCASP, allowing exchange of thoughts by the predic-
tors. The articles by the assessors, the organizers, and the most successful
prediction groups are published in special issues of the journal Proteins:
Structure, Function, and Bioinformatics. There are currently eight such issues
available, one for each of the eight CASP experiments [2,13–19]. The articles
in the special issues discuss in detail the methods tested in CASP, the evalu-
ation results, and the analysis of the progress made. Below we briefly sum-
marize the state of the art in different CASP modeling categories.
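
The prediction windows described in this section amount to simple date arithmetic. The sketch below assumes the typical windows quoted in the text (3 days for servers, 3 weeks for human-expert groups); the helper name and the release date are hypothetical, and actual deadlines are set per target by the Prediction Center.

```python
from datetime import datetime, timedelta

# Typical windows quoted in the text; real deadlines vary by target and round.
SERVER_WINDOW = timedelta(days=3)
HUMAN_WINDOW = timedelta(weeks=3)


def prediction_deadlines(release: datetime) -> dict:
    """Return illustrative deadlines for one target (hypothetical helper)."""
    server_deadline = release + SERVER_WINDOW
    return {
        "server_deadline": server_deadline,
        # Server models are posted only after the server window closes,
        # so human-expert groups can build on them from this point on.
        "server_models_posted": server_deadline,
        "human_deadline": release + HUMAN_WINDOW,
    }


deadlines = prediction_deadlines(datetime(2010, 5, 3, 12, 0))  # made-up release date
for name, when in deadlines.items():
    print(f"{name}: {when:%Y-%m-%d %H:%M}")
```

Keeping the two windows as named constants makes it easy to reproduce the different settings used in individual CASP rounds, such as the 72-hour server response time listed for CASP8 in Table 2.1.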
2.4. METHOD CLASSES AND PREDICTION
DIFFICULTY CATEGORIES
In evaluating the ability of prediction methods, it is important to realize that
difficulty of a modeling problem is determined by many factors. In theory, it
FIGURE 2.3 Timetable of the CASP experiment.
is possible to calculate the structure of any protein from knowledge of its
amino acid sequence and environmental conditions alone, since it has long
been established that these factors determine the functional conformation
[1]. In practice, it is not yet possible to follow the detailed folding behavior of
a system with as many atoms and degrees of freedom as a protein, nor to
thoroughly search for the global free energy minimum of such a system [20–
22]. Two types of methods for combating these limitations have been devel-
oped. One, by far the most effective at present, utilizes experimental
structures of evolutionarily related proteins, providing templates on which to
base a model. For cases where no such relationship exists, or none can be
discovered, partially effective structure prediction techniques have been
developed using simplified energy functions and employing approximate
energy landscape search strategies. These two approaches define the main
two classes of prediction methods—template-based modeling (TBM), some-
times referred to as comparative or homology modeling, and template-free
modeling. Historically, template-free methods were often termed ab initio (or
first principles), but members of the CASP community objected on the
grounds that these methods often make use of knowledge-based potentials to
evaluate interactions and assemblies of observed peptide fragment confor-
mations to generate trial structures. Template-free methods are currently
effective only for modeling small proteins (100 residues or fewer). Template-
based methods can be applied wherever it is possible to identify a structurally
similar protein that can be used as a template for building the model, irre-
spective of size. When the two approaches have been applied to the same
modeling problem, template-based methods have usually proven more accu-
rate than template-free methods. Thus, the most significant division in mod-
eling difficulty is between cases where a model can be built based on templates
derived from known experimental structures, and those where it cannot. At
one extreme, high-resolution models competitive with experiments can be
produced for proteins with sequences very similar to that of a known struc-
ture. At the other extreme, low resolution, very approximate models can be
generated by template-free methods for proteins with no detectable sequence
or structure relationship to known structures. To properly assess method suc-
cesses and failures, CASP subdivides modeling into these two separate cate-
gories, each with its own challenges, and hence requiring its own evaluation
procedures.
2.5. TBM
Whenever there is a detectable sequence relationship between two proteins,
the corresponding structures have been found to be similar. Thus, if at least
a single structure within a family of homologous proteins is determined experi-
mentally, then template-based methods can be used to model practically all
proteins in that family. The potential of this modeling is huge—by some esti-
mates, structures are already known for a quarter of the protein single-domain
families of significant size and half of all known sequences can be partially
modeled due to their membership in these families (M. Levitt in [23]).
A typical template-based method consists of several consecutive steps:
identifying probable templates; selecting/combining suitable templates; align-
ing target-template(s) sequence; copying structurally conserved regions from
the selected template(s); modeling structurally variable regions; packing side
chains; refining the model; and evaluating its quality. Each modeling step is
prone to errors, but, as a rule, the earlier in the process the error is introduced,
the costlier it is. As the template-based category covers a wide range of struc-
ture similarity, different kinds of errors are typical for different modeling
difficulty subcategories.
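
The step of copying structurally conserved regions from a template can be illustrated with a minimal sketch. It assumes a pairwise target-template alignment given as two gapped strings and one C-alpha coordinate per template residue; the function name and data layout are hypothetical and do not correspond to any particular modeling package. Target residues without an aligned template residue are left unset, which is where loop modeling would take over.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]


def copy_aligned_coordinates(
    target_aln: str,
    template_aln: str,
    template_coords: List[Coord],
) -> Dict[int, Optional[Coord]]:
    """Map each target residue index to a copied template coordinate, or None."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model: Dict[int, Optional[Coord]] = {}
    t_idx = 0  # position in the ungapped target sequence
    p_idx = 0  # position in the ungapped template sequence
    for t_char, p_char in zip(target_aln, template_aln):
        if t_char != "-" and p_char != "-":
            model[t_idx] = template_coords[p_idx]  # aligned pair: copy coordinates
        elif t_char != "-":
            model[t_idx] = None  # target residue with no template counterpart
        if t_char != "-":
            t_idx += 1
        if p_char != "-":
            p_idx += 1
    return model


# Toy alignment: the template has an extra residue (A); the target ends with V.
template_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
naive_model = copy_aligned_coordinates("MK-LV", "MKAL-", template_ca)
print(naive_model)
```

An error in the alignment strings propagates directly into the copied coordinates, which illustrates why mistakes made in the early steps are the costliest.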
2.5.1. High-Resolution TBM
The most reliable models can be built in cases where there is a strong sequence
relationship between the target protein and a template (i.e., higher than ∼40%
sequence identity between target and template). In these situations target and
template are expected to have very similar structures. Template selection and
alignment errors are rare here, and simply copying the backbone of a suitable
template may be sufficient to produce a model that may rival NMR or low-
resolution X-ray structures in accuracy (∼1 Å C-alpha atom root-mean-square
deviation [RMSD] from the experimental structure). The main effort in this
class of prediction shifts to modeling of regions of structure not present in a
template (loops), proper placement of side chains, and fine adjustment of the
structure (refinement).
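
For N paired C-alpha atoms separated by distances d_i, the RMSD quoted above is sqrt((d_1^2 + ... + d_N^2) / N). The sketch below computes it for two coordinate lists that are assumed to be already optimally superimposed; it does not perform the superposition itself, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_rmsd(model: Sequence[Coord], target: Sequence[Coord]) -> float:
    """RMSD over paired C-alpha atoms; assumes the structures are superimposed."""
    if len(model) != len(target) or len(model) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (mx - tx) ** 2 + (my - ty) ** 2 + (mz - tz) ** 2
        for (mx, my, mz), (tx, ty, tz) in zip(model, target)
    )
    return math.sqrt(squared / len(model))


target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]  # every atom shifted by 1 Å
print(round(ca_rmsd(model, target), 2))  # 1.0
```

Because the deviations are squared before averaging, a few badly placed residues can dominate the value even when most of the structure is accurate.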
Such high-resolution models often present a level of detail that is sufficient
for detecting sites of protein–protein interactions, understanding enzyme reac-
tion mechanisms, interpreting disease-causing mutations, molecular replace-
ment in solving crystal structures, and occasionally even drug design.
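
The sequence identity threshold of about 40% used in this subsection can be checked from a pairwise alignment. The sketch below counts identical residues over aligned (non-gap) columns; this choice of denominator is an assumption, since several conventions are in use, and the function name is hypothetical.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


identity = percent_identity("MKTAYIAK", "MKSAY-AR")  # toy alignment
print(f"{identity:.1f}% identity, above 40%: {identity > 40.0}")
```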
2.5.2. Medium Difficulty Range TBM
New, more sensitive methods of detecting remote sequence relationships,
especially Position-Specific Iterative-Basic Local Alignment Search Tool (PSI-
BLAST) and profile–profile methods, have greatly extended our ability to
utilize structure templates based on more remote sequence relationships. The
quality of models in this category has steadily improved over the course of the
CASP experiments. Models with quite accurate core (typically 2–3Å C-alpha
atom RMSD from the native structure) can now often be generated. Factors
still limiting progress include difficulty in recognizing best templates, com-
bining information from several templates, aligning target sequences with
template structures, adjusting for considerable shifts in conserved regions of
structure, and modeling regions not represented in any of the available tem-
plates. As in high-resolution homology modeling, refinement methods play a
role in improving the accuracy of final models.
Even though less accurate than high-resolution models, these models can
also be used in many biological applications such as detecting probable sites
of protein–protein interactions, identifying the approximate role of disease-
associated substitutions, or assessing the likely role of alternative splicing in
protein function.
2.5.3. Difficult TBM
In cases where no evolutionary relationship can be detected based on sequence,
it is still likely that the fold of a target protein is nevertheless similar to that
of a known structure (implying a very remote evolutionary relationship or
convergence of folds). Methods that check the compatibility of a target protein
with the experimental structures use more sophisticated analyses (e.g., second-
ary structure comparison, knowledge-based structural potentials of various
types) and can sometimes assist in identifying templates for modeling. As in
such cases the templates have no explicit sequence relationship with the target,
alignment is often not reliable and not surprisingly, the accuracy of the result-
ing model is often low. Nevertheless impressive models are sometimes
obtained, and there has been substantial progress over the course of CASP
experiments. We attribute this progress to both methodological improvements
and the increased size of sequence and structure databases.
Although models for hard TBM targets may not provide accurate structural
detail, they are useful for providing an overall idea of what a structure is like,
recognizing approximate domain boundaries, helping choose residues for
mutagenesis experiments, and providing approximate information about
molecular function.
2.5.4. Progress and Challenges in TBM
Assessment of template-based predictions over the several rounds of CASP
showed indisputable progress in the area, and the accuracy of the
models has grown substantially [24–28]. One measure of this is that for the
majority of targets the best models for each target are now closer to the native
structure than any of the available template structures. Despite this very
evident progress, there are many challenges still remaining. After years of
development, finding a good template and producing an accurate alignment remain
the two issues with a major impact on the quality of models. The coverage of the
target by the template imposes the upper limit on the fraction of residues that can
be aligned between the template and the target. Figure 2.4 shows the maximum
alignability together with the alignment accuracy for the best models in the
latest four CASPs (see our article [28], pp. 196, 198, for the definitions). It can
be observed that the trend in all CASPs is the same—both maximum align-
ability and alignment accuracy fall steadily and approximately linearly with
increasing target difficulty. The slope of the fall off for these two measures,
however, is different. For the easiest targets, predictors can routinely achieve
alignment accuracy close to the maximum possible from a single template or
even better; in the mid range of difficulty best alignments are typically within
20% of the optimum, but up to 40% of the structure cannot be aligned at all;
for the difficult targets the gap between the maximum alignability and align-
ment accuracy grows to 30% with the percentage of nonaligned residues
increasing to 70% [28]. Predictors often manage to achieve alignment accuracy
higher than a single template maximum by using additional templates or by
employing free modeling methods for the structurally nonconserved regions
such as loops, insertions or deletions. It is encouraging to see an increase in
the number of such cases: there are 22 targets in all CASPs where predictors
exceeded maximum alignability by at least 2%; of these, nine were
from CASP8 (squares above the 0% level in Fig. 2.4), eight from CASP7, four
from CASP6, and one from CASP5.
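
As an illustration of the alignment accuracy measure, the sketch below follows the definition of AL0 as it is commonly given for CASP: a model residue counts as correctly aligned if its C-alpha atom lies within 3.8 Å of the corresponding target atom and no other target C-alpha atom is nearer. The authoritative definition is in [28]; the function name and data layout here are hypothetical, and the structures are assumed to be already superimposed (CASP uses LGA for this).

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def al0_percent(model: Dict[int, Coord], target: List[Coord], cutoff: float = 3.8) -> float:
    """Percent of target residues that the model aligns correctly (simplified AL0)."""
    correct = 0
    for i, atom in model.items():
        own = math.dist(atom, target[i])
        if own > cutoff:
            continue  # too far from its own target position
        nearest = min(math.dist(atom, t) for t in target)
        if own <= nearest:  # no other target atom is strictly nearer
            correct += 1
    return 100.0 * correct / len(target)


target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Residue 2 sits closer to target residue 3 (an alignment shift); residue 3 is unmodeled.
model = {0: (0.5, 0.0, 0.0), 1: (3.8, 1.0, 0.0), 2: (11.0, 0.0, 0.0)}
print(al0_percent(model, target))  # 50.0
```

Dividing by the full target length means that unmodeled residues lower the score, in line with the way the text relates alignment accuracy to the fraction of residues that cannot be aligned.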
Improvement in alignment over the best template shows only one side of
the effectiveness of TBM methods. Analysis of the overall quality of the
models (measured in terms of Global Distance Test_Total Score [GDT_TS])
shows that typically the best models are superior to the corresponding naïve
models built by simply copying coordinates of aligned residues from the best
possible template. This additional gain in quality can be attributed to
modeling of regions not present in the best template, and also to improving
the quality of the model by refinement. Figure 2.5 provides a comparison of
FIGURE 2.4 Maximum template-imposed alignability (SWALI, solid lines) and
alignment accuracy of the best template-based models (AL0, dashed lines) from
CASP5–8 as a function of target difficulty. Maximum alignability is defined as the frac-
tion of equivalent residues in superposition of the target and best template structure;
target difficulty combines coverage of the target structure by the best template and
target-template sequence identity. CASP8—black lines; CASP7—blue; CASP6—
brown; CASP5—red. Squares represent the difference between alignment quality and
maximum alignability for CASP8 targets. Points over the 0% level represent targets
where alignment accuracy was better than maximum alignability. (See color insert.)
[Figure 2.4 axes: x, target difficulty; y, models (AL0) and templates (SWALI), in %.]
quality of the best submitted models versus naïve models built on the best
single template. Data trend lines indicate that in general the best submitted
models are better than the corresponding naïve models, except for the targets
representing the hardest one-fourth of the difficulty scale. In CASP6–8, over
70% of the best models in the template-based category have registered added
value over the naïve model. The inset histogram shows that the majority of
best predictions (153 out of 242) are up to eight GDT_TS units above the
corresponding best naïve model. The median difference between the best
model and naïve model equals 2.74 GDT_TS units (mean—2.07 GDT_TS).
2.6. FREE MODELING OF NEW FOLD PROTEINS
A quarter of the protein sequences in the contemporary databases do not
appear to match any sequence pattern corresponding to an already known
FIGURE 2.5 GDT_TS score of the best submitted model and the best naïve model
built on a single template for each TBM target in CASP5–8. The darker trendline
corresponds to the predicted models; the lighter one, to the naïve ones. Naïve models are
built on the top 20 templates according to the target coverage for each target, and the
score for the naïve model with the highest GDT_TS is shown. For the easier three-
fourths of the difficulty scale, best models in general outperform naïve models. The
inset histogram shows number of models registering differences in GDT_TS scores
between the best model and naïve model (bins stretch 4 GDT_TS units). The most
representative bin is 0–4 GDT_TS difference (86 targets), followed by the 4–8 GDT_
TS bin (67 targets).
[Figure 2.5 plot: best predictors' models versus template models (CASP5–8 TBM
targets sorted by difficulty). Series: GDT_TS of the best predicted model;
highest GDT_TS among the 20 best template models.]
structure [23]. In such cases, template-free modeling methods must be used.
Free modeling methods can be divided into two categories: structure-based
de novo modeling methods and ab initio (modeling from first principles)
methods. Currently, the more successful approaches are the de novo methods,
which rely on the fact that although not all naturally occurring protein folds
have yet been observed, on some length scale, all possible structures of
fragments are known. Fragment assignment, fragment assembly, and finally
selection of correct models from among many candidate structures all remain
formidable challenges.
The quality of free modeling predictions has increased dramatically over
the course of the CASP experiments, with most small proteins (100 residues
or less) usually assigned at least the correct overall fold by a few groups. For
these shorter proteins, models are typically 4–10 Å C-alpha atom RMSD from
the native structure; for larger proteins, models are usually more than 10 Å away
from the native structure. This level of detail is insufficient for many biomedi-
cal applications. But it is encouraging that in the last three CASPs there were
examples of high-resolution accuracy (<2Å) models for a few small proteins
[29,30]. Recently, several notable cases of high-resolution structure prediction
in the absence of a suitable structural template were reported in the literature
[31,32].
2.7. OTHER MODELING CATEGORIES
Besides the three-dimensional protein structure prediction, CASP evaluates
several other structure-related modeling categories.
The Secondary structure prediction category was included in early CASP
experiments. Initial substantial progress gave way to incremental improve-
ments too small to evaluate with the amount of data collected, and so the
category was dropped in 2002.
The Disorder region prediction category was introduced in CASP in 2002
to address growing recognition that some regions of proteins do not adopt a
single three-dimensional structure but nevertheless are involved in the signal-
ing, regulating, or controlling functions of the protein [33]. The three most
recent CASPs have shown that the field has converged and that new ideas to
improve the predictions are needed [34].
Prediction of intramolecular residue–residue contacts could in principle be
helpful for predicting protein structure per se as well as for inferring mutations
in the proteins or distinguishing between correct and incorrect protein docking
models [35]. This type of prediction is still an area of active research, and
continues to be assessed in CASP (starting in CASP4). However, there has
been no detectable progress in that period, and current methods do not appear
sufficiently accurate to be of any significant use.
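
A residue–residue contact list can be derived from a structure and compared with a prediction as sketched below. The 8 Å distance cutoff, the minimum sequence separation of 6 residues, and the use of C-alpha atoms are assumptions made for illustration (contact definitions often use C-beta atoms instead); the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]
Pair = Tuple[int, int]


def contacts(coords: Sequence[Coord], cutoff: float = 8.0, min_separation: int = 6) -> List[Pair]:
    """Residue pairs (i, j), j - i >= min_separation, closer than the cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs


def contact_precision(predicted: Sequence[Pair], native: Sequence[Pair]) -> float:
    """Fraction of predicted contacts that are present in the native structure."""
    if not predicted:
        return 0.0
    native_set = set(native)
    return sum(1 for pair in predicted if pair in native_set) / len(predicted)


# Toy hairpin: residues 0-3 run along x, residues 4-7 run back 5 Å away.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
           (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
native = contacts(hairpin)
print(native)  # [(0, 6), (0, 7), (1, 7)]
print(contact_precision([(0, 7), (1, 7), (2, 5)], native))
```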
Many proteins contain multiple domains, and identifying domain boundar-
ies is important not only in modeling but also in selecting constructs for protein
expression. Assessment in this area started in CASP6. In general, approximate
identification of domain boundaries is straightforward when these lie between
TBM regions. There are too few challenging multi-domain template-free
targets in CASP to evaluate those cases. As a result, this category will be
dropped from future CASPs.
One of the primary uses of a three-dimensional model is to deduce more
about the protein’s function. Testing of methods for function prediction began
in CASP6 [36]. Assessment in this category faced difficulties connected with
unavailability of experimental data to verify the predictions. To make the
analysis more stringent, in CASP8 the category was narrowed to ligand binding
site prediction.
As structure modeling has assumed a more prominent role in biology, the
need to have reliable estimates of overall and detailed structure accuracy
has become apparent. The necessity for an unbiased evaluation of model
quality assessment methods led to the introduction of a separate category in
CASP, starting in 2006. CASP quality assessment evaluation has demon-
strated that at the moment the most accurate methods rely on the availability
of multiple models for the same protein (called consensus-based or clustering
methods) [37]. These methods are based on the observation that the more
different modeling methods agree on structure, either overall or in particular
regions, the more likely that structure is correct. The best quality assessment
methods can provide ranking of overall models significantly correlated
with accuracy, but are not able to consistently select the best model from
the entire collection of models. It has been encouraging to see some quality
assessment methods showing promising results in assessing accuracy of
specific regions in a model, at times reproducing almost the exact
C-alpha–C-alpha deviation along the sequence. The main challenges in the quality
assessment category are developing methods that can be competitive with
consensus-based methods but rely on the structural and sequence features
of a single model, and improving the performance of methods for determining
local model accuracy.
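
The idea behind consensus-based (clustering) quality assessment can be sketched as follows: each model is scored by its average similarity to all other models submitted for the same target. The similarity used here, the fraction of C-alpha atoms within 4 Å in a shared coordinate frame, is a deliberate simplification chosen for illustration; the function names and toy models are hypothetical and do not reproduce any method assessed in CASP.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def similarity(a: List[Coord], b: List[Coord], cutoff: float = 4.0) -> float:
    """Fraction of paired C-alpha atoms closer than the cutoff (same frame assumed)."""
    close = sum(1 for p, q in zip(a, b) if math.dist(p, q) <= cutoff)
    return close / len(a)


def consensus_scores(models: Dict[str, List[Coord]]) -> Dict[str, float]:
    """Score each model by its mean similarity to every other model (needs >= 2 models)."""
    scores = {}
    for name, coords in models.items():
        others = [similarity(coords, other) for key, other in models.items() if key != name]
        scores[name] = sum(others) / len(others)
    return scores


toy_models = {
    "A": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)],
    "B": [(0.0, 1.0, 0.0), (4.0, 1.0, 0.0), (8.0, 1.0, 0.0)],
    "C": [(0.0, 0.0, 0.0), (4.0, 10.0, 0.0), (8.0, 20.0, 0.0)],  # outlier
}
scores = consensus_scores(toy_models)
print(sorted(scores, key=scores.get, reverse=True))  # ['A', 'B', 'C']
```

The sketch also shows the limitation noted above: the score of a model depends entirely on the other models in the set, so a single model cannot be assessed on its own.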
Starting in CASP7, special attention has been given to model tertiary struc-
ture refinement. All-atom structure refinement is one of the challenges in
protein structure prediction and the development of reliable refinement pro-
cedures would help in bringing the models up to the high-resolution standards.
The assessment of predictions in this category suggests that while there are
no methods that can consistently improve over the initial model, refinement
can sometimes result in structures that are much closer to the target than the
template (not a trivial task, especially for high accuracy targets).
2.8. SERVERS IN CASP
There are now many millions of proteins for which reasonable models could
be produced, and meeting such a large-scale demand requires automatic
generation of models. Even though it is apparent that not all human expertise
can be encoded in automatic servers, CASP shows that the best servers are
not much worse than the best human predictors. Moreover, sometimes the
difference in human-server performance is just due to the fact that human
experts have more time for the modeling (and quite often, base their models
on initial structures obtained from servers). Under these circumstances, the
importance of automatic servers for the biomedical community cannot be
over-estimated.
Server performance has been continuously checked by CASP starting in
CASP3 (originally with the help of the Critical Assessment of Fully Automated
Structure Prediction Methods [CAFASP] [38] and since CASP6, indepen-
dently). Analysis of server performance in successive CASPs [28,39] shows
that the best human-expert groups in CASP still outperform the best server
groups, but the gap between the best servers and the best human-expert
groups is narrowing. Especially in the case of easy TBM, progress of auto-
mated servers is impressive, with the fraction of targets where at least one
server model is among the best six submitted models increasing from 35% in
CASP5 to 65% in CASP6, to over 90% in CASP7, and slightly decreasing to
83% in CASP8. These statistics confirm the notion that the impact of human
expertise on modeling of easy comparative targets is now marginal. In general,
in both CASP7 and CASP8 servers were at least on par with humans (three
or more models in the best six) for about 20% of targets, and significantly
worse than the best human model for only very few targets.
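
The server statistics quoted above can be computed as sketched below: for each target, models are ranked by score and the number of server models among the best six is counted. The data layout, the function name, and the toy scores are hypothetical; GDT_TS is assumed as the ranking score.

```python
from typing import Dict, List, Tuple

Model = Tuple[str, float]  # (group kind: "server" or "human", GDT_TS score)


def fraction_with_servers_in_top(
    targets: Dict[str, List[Model]], top_n: int = 6, min_servers: int = 1
) -> float:
    """Fraction of targets with at least min_servers server models among the top_n."""
    hits = 0
    for models in targets.values():
        best = sorted(models, key=lambda m: m[1], reverse=True)[:top_n]
        n_servers = sum(1 for kind, _ in best if kind == "server")
        if n_servers >= min_servers:
            hits += 1
    return hits / len(targets)


toy_targets = {
    "T1": [("human", 90.0), ("human", 88.0), ("server", 87.5), ("human", 86.0),
           ("human", 85.0), ("human", 84.0), ("server", 70.0)],
    "T2": [("human", 60.0), ("human", 59.0), ("human", 58.0), ("human", 57.0),
           ("human", 56.0), ("human", 55.0), ("server", 40.0)],
}
print(fraction_with_servers_in_top(toy_targets))                 # 0.5
print(fraction_with_servers_in_top(toy_targets, min_servers=3))  # 0.0
```

Setting `min_servers=3` corresponds to the "three or more models in the best six" criterion used in the text for servers being on par with human-expert groups.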
2.9. MODELING CHALLENGES AND CASP INITIATIVES
Despite evident progress in protein structure modeling, many challenges still
remain to be addressed. In TBM, refinement of high-accuracy models and
improvement of template-to-target alignments in nontrivial cases are the
major limiting factors. In free modeling, the challenge is to predict larger
proteins (over 100 residues) more reliably and to routinely generate models
within 2–3Å RMSD from the native structure for smaller proteins. The
methods tested in CASP in this category have run out of steam and new mod-
eling techniques seem necessary. Besides the traditional CASP categories,
we have already conducted two in-between-CASP experiments to test the
prediction of mutation sites and refinement of models. Assessment of model
quality is currently an area of active research, and more than a dozen papers
on the subject have been published since this category was introduced in
CASP7 (2006). It is also planned to conduct additional in-between-CASP
experiments in modeling of membrane proteins and in selecting models from
decoy sets. Within the main CASP track, in CASP9 we are reviving the prediction
of quaternary structure. An initiative to continuously test free modeling
methods is currently underway.
REFERENCES
1. M. Sela, F.H. White Jr., and C.B. Anfinsen. Reductive cleavage of disulfide bridges
in ribonuclease. Science, 125(3250):691–692, 1957.
2. J. Moult et al. A large-scale experiment to assess protein structure prediction
methods. Proteins, 23(3):ii–v, 1995.
3. J.M. Bujnicki et al. LiveBench-2: Large-scale automated evaluation of protein
structure prediction servers. Proteins, (S5):184–191, 2001.
4. J. Janin et al. CAPRI: A critical assessment of predicted interactions. Proteins,
52(1):2–9, 2003.
5. V.A. Eyrich et al. EVA: Continuous automatic evaluation of protein structure
prediction servers. Bioinformatics, 17(12):1242–1243, 2001.
6. M.G. Reese et al. Genome annotation assessment in Drosophila melanogaster.
Genome Research, 10(4):483–501, 2000.
7. A. Kryshtafovych et al. New tools and expanded data analysis capabilities at the
Protein Structure Prediction Center. Proteins, 69(S8):19–26, 2007.
8. A. Zemla. LGA: A method for finding 3D similarities in protein structures. Nucleic
Acids Research, 31(13):3370–3374, 2003.
9. A. Zemla et al. Processing and evaluation of predictions in CASP4. Proteins,
(S5):13–21, 2001.
10. A. Kryshtafovych et al. CASP6 data processing and automatic evaluation at the
Protein Structure Prediction Center. Proteins, 61(S7):19–23, 2005.
11. A.R. Ortiz, C.E. Strauss, and O. Olmea. MAMMOTH (matching molecular models
obtained from theory): An automated method for model comparison. Protein
Science, 11(11):2606–2621, 2002.
12. L. Holm et al. Searching protein structure databases with DaliLite v.3.
Bioinformatics, 24(23):2780–2781, 2008.
13. J. Moult et al. Critical assessment of methods of protein structure prediction-
Round VIII. Proteins, 77(S9):1–4, 2009.
14. J. Moult et al. Critical assessment of methods of protein structure prediction-
Round VII. Proteins, 69(S8):3–9, 2007.
15. J. Moult et al. Critical assessment of methods of protein structure prediction-
Round VI. Proteins, 61(S7):3–7, 2005.
16. J. Moult et al. Critical assessment of methods of protein structure prediction
(CASP)-round V. Proteins, 53(6):334–339, 2003.
17. J. Moult et al. Critical assessment of methods of protein structure prediction
(CASP): round IV. Proteins, (S5):2–7, 2001.
18. J. Moult et al. Critical assessment of methods of protein structure prediction
(CASP): Round III. Proteins, (S3):2–6, 1999.
19. J. Moult et al. Critical assessment of methods of protein structure prediction
(CASP): Round II. Proteins, (S1):2–6, 1997.
20. K.M. Misura and D. Baker. Progress and challenges in high-resolution refinement
of protein structure models. Proteins, 59(1):15–29, 2005.
21. H. Lei and Y. Duan. Protein folding and unfolding by all-atom molecular dynamics
simulations. Methods in Molecular Biology, 443:277–295, 2008.
22. Y. He et al. Exploring the parameter space of the coarse-grained UNRES force
field by random search: Selecting a transferable medium-resolution force field.
Journal of Computational Chemistry, 30(13):2127–2135, 2009.
23. T. Schwede et al. Outcome of a workshop on applications of protein models in
biomedical research. Structure, 17(2):151–159, 2009.
24. J. Kopp et al. Assessment of CASP7 predictions for template-based modeling
targets. Proteins, 69(S8):38–56, 2007.
25. R.J. Read and G. Chavali. Assessment of CASP7 predictions in the high accuracy
template-based modeling category. Proteins, 69(S8):27–37, 2007.
26. D. Cozzetto et al. Evaluation of template-based models in CASP8 with standard
measures. Proteins, 77(S9):18–28, 2009.
27. D. Keedy et al. The other 90% of the protein: Assessment beyond the C-alphas
for CASP8 template-based and high-accuracy models. Proteins, 77(S9):29–49,
2009.
28. A. Kryshtafovych, K. Fidelis, and J. Moult. Progress from CASP6 to CASP7.
Proteins, 69(S8):194–207, 2007.
29. P. Bradley et al. Free modeling with Rosetta in CASP6. Proteins, 61(S7):128–134,
2005.
30. R. Das et al. Structure prediction for CASP7 targets using extensive all-atom
refinement with Rosetta@home. Proteins, 69(S8):118–128, 2007.
31. P. Bradley, K.M. Misura, and D. Baker. Toward high-resolution de novo structure
prediction for small proteins. Science, 309(5742):1868–1871, 2005.
32. B. Qian et al. High-resolution structure prediction and the crystallographic phase
problem. Nature, 450(7167):259–264, 2007.
33. P. Radivojac et al. Intrinsic disorder and functional proteomics. Biophysical
Journal, 92(5):1439–1456, 2007.
34. L. Bordoli, F. Kiefer, and T. Schwede. Assessment of disorder predictions in
CASP7. Proteins, 69(S8):129–136, 2007.
35. J.M. Izarzugaza et al. Assessment of intramolecular contact predictions for
CASP7. Proteins, 69(S8):152–158, 2007.
36. S. Soro and A. Tramontano. The prediction of protein function at CASP6. Proteins,
61(S7):201–213, 2005.
37. D. Cozzetto et al. Assessment of predictions in the model quality assessment
category. Proteins, 69(S8):175–183, 2007.
38. D. Fischer et al. CAFASP-1: Critical assessment of fully automated structure
prediction methods. Proteins, (S3):209–217, 1999.
39. J.N. Battey et al. Automated server predictions in CASP7. Proteins, 69(S8):68–82,
2007.
CHAPTER 3
THE PROTEIN STRUCTURE INITIATIVE
ANDRAS FISER
Department of Systems and Computational Biology
Department of Biochemistry
Albert Einstein College of Medicine
Bronx, NY
ADAM GODZIK
Program in Bioinformatics and Systems Biology
Sanford-Burnham Medical Research Institute
La Jolla, CA
CHRISTINE ORENGO
Department of Structural and Molecular Biology
University College London
London, UK
BURKHARD ROST
Department of Biochemistry and Molecular Biophysics
Center for Computational Biology
Columbia University
New York, NY
Introduction to Protein Structure Prediction: Methods and Algorithms,
Edited by Huzefa Rangwala and George Karypis
Copyright © 2010 John Wiley & Sons, Inc.
3.1. BACKGROUND, RATIONALE, AND HISTORY
In the mid-1990s, high-throughput sequencing projects began to pour out an
unprecedented amount of genomic information. A strong interest subsequently
emerged in even more ambitious high-throughput experiments that would
assign 3D shapes to all known proteins in all genomes. Three-dimensional
structures of proteins are often more informative than their sequences alone
because interactions take place in 3D space and because residues that are
far apart in sequence within the same protein often form a recognizable motif
in space. The large-scale efforts to target and solve protein
structures were dubbed structural genomics projects and were launched
worldwide, with a variety of different focuses, in Europe, Japan, and the United
States. The U.S. efforts were spearheaded by the National Institutes of Health-
National Institute of General Medical Sciences (NIH-NIGMS) and named
the Protein Structure Initiative (PSI).
The scientific rationale for structural genomics is the recognition that the
many millions of known protein sequences seem to cluster into far fewer
structural families or folds. Estimates of the anticipated number of protein
folds range between 1000 and 20,000 [1–4]. In addition, the size distribution
of protein folds is very uneven. Domain superfamilies from the 12 most highly
populated folds (superfolds; [5]) cover approximately half of a typical genome,
with such representatives as the Rossmann, TIM, OB, or Ig fold [6]. On the
other hand, many thousands of much less populated domain superfamilies
compose the rest of the genomes.
The rationale behind structural genomics is that a few thousand carefully
selected and solved protein structures will provide a means to structurally
characterize, at least in part, up to 80% of all existing sequences by using the
solved structures as templates and employing comparative protein structure
modeling techniques to model the rest of the proteins in each superfamily
[4,7]. This rationale sheds light on the two most critical computational aspects
of structural genomics efforts: target selection and structure modeling. While
target selection is chiefly responsible for efficiently mapping the fold universe
and properly directing efforts, structure modeling is the actual tool to provide
the 3D characterization for more than 99% of all proteins.
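The argument above rests on the uneven size distribution of families: because
a few families are very large, one solved structure in each of the largest
families covers a disproportionate share of all sequences. A minimal Python
sketch of that arithmetic follows. The function name and the toy family sizes
are assumptions made for illustration, not PSI data, and the sketch assumes
that one solved structure allows every member of its family to be modeled.

```python
def structures_needed(family_sizes, target_coverage):
    """Return how many families must be solved, largest first, so that the
    solved families contain at least `target_coverage` (0..1) of all sequences.

    Assumption (illustrative): one experimental structure per family is enough
    to model every other sequence in that family.
    """
    total = sum(family_sizes)
    if total == 0:
        raise ValueError("family_sizes must contain at least one sequence")
    covered = 0
    # Visit families from most to least populated.
    for n_solved, size in enumerate(sorted(family_sizes, reverse=True), start=1):
        covered += size
        if covered / total >= target_coverage:
            return n_solved
    return len(family_sizes)


# Hypothetical toy universe: two large families and three small ones.
toy_sizes = [50, 30, 10, 5, 5]
print(structures_needed(toy_sizes, 0.75))  # families needed for 75% coverage
```

With a skewed distribution like the toy one, most of the coverage comes from
the first few structures, and the remaining small families each add little.
This is why target selection matters as much as throughput.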
The underlying hypothesis of PSI is that high-throughput pipelines could
be developed to produce high-quality protein structure representatives of
large protein families with little or no prior structural representation. This
goal sets the PSI apart from traditional structural biology, which normally
takes a much more focused and pragmatic approach by studying one or a
limited number of macromolecules through “hypothesis-driven” research.
After 1 year of preparation, a pilot phase of PSI started in 2000, establishing
nine PSI centers around the United States. These centers were charged with
setting up automated pipelines for structural genomics projects and with
starting to produce 3D structures with increasing efficiency, measured both
by the number of experimental structures solved and by a reduced cost per
structure. While the overall goal of PSI was to
explore the fold universe and to target structurally uncharacterized families,
various centers focused this global aim within more specific biologically
defined frameworks, for example, targeting new folds in specific genomes of
biomedical interest, such as Thermotoga maritima or Mycobacterium tubercu-
losis, or targeting human and other eukaryotic proteins, or targeting proteins
involved in metabolic pathways and cancer. The pilot phase of PSI was a
success: an unprecedented 1300 structures were solved, already making an
impact on the composition of the Protein Data Bank (PDB). This was far
more structures than all conventional structural biology labs solved with a
comparable amount of funding. PSI-1 thereby achieved its goals, namely to
demonstrate the feasibility of efficient pipelines and to significantly reduce the
cost of solving protein structures. The second, so-called production phase
of PSI started in 2005, funding four production centers and six specialized
research centers. The four large-scale centers (Joint Center for Structural
Genomics [JCSG; http://www.jcsg.org]; Midwest Center for Structural
Genomics [MCSG; http://www.mcsg.anl.gov]; New York SGX Research
Center for Structural Genomics [NYSGXRC; http://www.nysgrc.org]; and
Northeast Structural Genomics Consortium [NESG; http://www.nesg.org])
were all selected for their proven high-throughput capability and were charged
with synchronizing target selection to provide a concerted effort to uncover
fold space. Meanwhile, the specialized centers focused on technological
problems and known bottlenecks, including membrane proteins, proteins from
higher eukaryotes (especially human), and small protein complexes, and/or on
developing technology for portability, applicability, and scalability in
general.
In this chapter we review the current state of PSI efforts, with a focus on
target selection and the impact on coverage of the fold universe.
3.2. OVERVIEW, PIPELINE, AND RESOURCES
Structural genomics centers established pipelines in which all steps of
production are highly automated. The pipelines include both experimental and
computational modules and typically contain the following steps: target
selection (and target tracking); protein production (including cloning,
expression, and purification); protein characterization (e.g., solubility
tests); Heteronuclear Single Quantum Coherence (HSQC) or nuclear magnetic
resonance (NMR) assignment, or crystallization (when the structure
determination technology is NMR or X-ray crystallography, respectively);
structure determination (by X-ray crystallography or NMR spectroscopy); and
modeling of related protein sequences. While the pipelines of the various
centers differ in their details and in the specific experimental technologies
employed, the major steps described above are common to all. These common
steps (e.g., target selection, cloning, expression, solubility experiments,
purification, biophysical characterization, crystallization, diffraction or
NMR assignments, solved structure) are also reflected in the databases
(Structural Genomics Target Search
[TARGETDB; http://targetdb.pdb.org/] and Protein Expression Purification
and Crystallization DataBase [PEPCDB; http://pepcdb.pdb.org/]) that track
experimental progress at all PSI centers. Additional resources of PSI efforts
include the PSI Materials Repository (http://psimr.asu.edu) that collects all
of the PSI center clones and processes them for general distribution. The PSI
Knowledgebase (http://kb.psi-structuralgenomics.org/) serves as the public
face of PSI efforts, providing all available technologies, experimental struc-
tures and homology models to the community. A partnership with the Nature
Gateway provides exposure through highlight publication for the PSI and
its products.
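The common steps listed above form an ordered sequence of milestones, which is
what makes progress tracking across centers possible. The Python sketch below
shows one way such tracking could be represented. It is an illustration only:
the stage names are taken from the list in the text, but the class and
function names are hypothetical and do not reflect the actual TARGETDB or
PEPCDB schemas.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """Pipeline milestones, in the order given in the text."""
    SELECTED = 0
    CLONED = 1
    EXPRESSED = 2
    SOLUBLE = 3
    PURIFIED = 4
    CHARACTERIZED = 5
    CRYSTALLIZED = 6
    DIFFRACTION_OR_NMR_ASSIGNED = 7
    STRUCTURE_SOLVED = 8


@dataclass
class Target:
    """One protein target moving through the pipeline (hypothetical record)."""
    target_id: str
    stage: Stage = Stage.SELECTED
    abandoned: bool = False
    history: list = field(default_factory=list)

    def advance(self) -> Stage:
        """Move the target to the next milestone and return the new stage."""
        if self.abandoned:
            raise ValueError(f"{self.target_id} was abandoned")
        if self.stage == Stage.STRUCTURE_SOLVED:
            raise ValueError(f"{self.target_id} is already solved")
        self.history.append(self.stage)
        self.stage = Stage(self.stage + 1)
        return self.stage


def stage_counts(targets):
    """Count how many targets currently sit at each milestone."""
    counts = {stage: 0 for stage in Stage}
    for target in targets:
        counts[target.stage] += 1
    return counts


# Usage: two hypothetical targets, one of which has been cloned.
a, b = Target("T0001"), Target("T0002")
a.advance()
for stage, n in stage_counts([a, b]).items():
    if n:
        print(stage.name, n)
```

Keeping the stages in a single ordered enumeration means that per-stage counts,
and hence attrition between consecutive steps, can be compared across centers
even when the underlying experimental technologies differ.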
3.3. TARGET SELECTION AND TARGET CATEGORIES
3.3.1. Centralized Target Selection
PSI applies the structural genomics paradigm at several levels of biological
investigation: (i) structural coverage of the protein universe, (ii) structural
coverage of proteins targeted in collections of organisms (i.e., metagenomics),
(iii) structural coverage of proteins targeted in specific organisms or organelles
(e.g., T. maritima), (iv) structural coverage of systems of cofunctioning pro-
teins and protein networks (e.g., protein phosphatases), (v) analysis of
structure/function diversity across a large domain family, and (vi) structural
analysis of membrane proteins.
Target selection for the large centers in PSI is overseen by the BioInformatics
Groups (BIG4). Each center has one representative in this group; the four
representatives are the authors of this chapter. Targets in PSI are assigned
to different categories. About 70% of all targets are centrally compiled and
selected among the four centers by BIG4. Another 15% of targets are solved
through collaborations with the research community, while the remaining 15%
of targets are picked by each center individually because of their biomedical
relevance. Target selection is centralized among the four production centers
to increase efficiency, which is achieved by avoiding overlaps among the
centers and by balancing efforts across the available target lists.
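The two stated aims of centralization, avoiding overlaps and balancing effort,
can be illustrated with a short Python sketch. The text does not describe the
procedure BIG4 actually uses, so the greedy least-loaded assignment below is an
assumption chosen for simplicity, and the function name and target identifiers
are hypothetical.

```python
import heapq


def assign_targets(targets, centers):
    """Assign each distinct target to exactly one center, always choosing the
    center that currently holds the fewest targets (greedy balancing).

    Illustrative assumption: all targets cost the same effort.
    """
    # Remove duplicate target IDs while keeping first-seen order, so that no
    # target can be handed to two centers.
    unique_targets = list(dict.fromkeys(targets))

    # Heap entries are (current load, tie-break index, center name).
    heap = [(0, index, center) for index, center in enumerate(centers)]
    heapq.heapify(heap)

    assignment = {center: [] for center in centers}
    for target in unique_targets:
        load, index, center = heapq.heappop(heap)
        assignment[center].append(target)
        heapq.heappush(heap, (load + 1, index, center))
    return assignment


centers = ["JCSG", "MCSG", "NYSGXRC", "NESG"]
# "t1" is listed twice, as if it had been proposed from two target lists.
plan = assign_targets(["t1", "t2", "t1", "t3", "t4", "t5"], centers)
for center, assigned in plan.items():
    print(center, assigned)
```

A real scheme would also have to weigh target difficulty and each center's
preferred technology, which this sketch deliberately ignores.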
3.3.2. Modeling Families
PSI targets protein “families” because of their predicted structural “novelty.”
These subjective qualifiers require a more precise, operational definition.
The definition of a protein family is subjective because it can range from a
low-resolution definition, such as the fold of a protein, to a high-resolution
definition that counts any structural novelty within the same fold family,
such as a novel sub-domain or a longer loop insertion. In PSI the operational
concept is to map the protein universe into sequence clusters, so-called
Modeling Families, at a resolution at which the experimentally solved proteins
can serve as suitable templates for comparative protein structure modeling of
similar proteins. A general guideline relies on a large-scale study that
concluded that if a template-target pair shares more than 30% sequence
identity, a comparative model can be built whose accuracy is expected to be
within 2 Å root-mean-square difference (RMSD) of the native structure [8].
Consequently, within the PSI a practical definition of Modeling Families
refers to groups of sequences in which any two sequences share more than
30% sequence identity with each other (Fig. 3.1). Therefore, if the structure
of at least one member in a Modeling Family is solved experimentally, all
other family
members can be accurately modeled. As comparative modeling technologies
continue to improve, this threshold, or more generally the definition of a
Modeling Family, can be revised, which will affect the number of Modeling
Families. In the PSI the category of biomedical targets does not necessarily
follow this rule, as these proteins have shown strong potential for immediate
biomedical application that necessitates a higher resolution approach. For
these targets very detailed structural information may be
FIGURE 3.1 Flowchart of a typical PSI pipeline.
[Flowchart not reproduced; only its text labels survive. Inputs to target
selection: physical properties, functional information, expression
information, human gene information, sequence families, databases, other
centers, interactions and cofactors, biochemical pathways. Processes: target
selection and coordination among centres; disseminate target list and status
reports; choose targets; clone coding sequences; disseminate clones;
expression; solubilize (refolding, detergents, metals, cofactors, etc.);
receive and disseminate soluble proteins; purify; quality assurance/biophysical
analyses; crystallization trials; receive crystals for Fed-Ex crystallography;
obtain SeMet crystals; MAD data collection; MIR search; MIR data collection;
NMR; phasing, model building, refinement; identify and correct problem (further
purification, subclone, add metal or cofactor, other); deposit structure in
PDB; deposit homology models in MoDBase; annotate structure in SPD; abandon.
Decision points (Y/N): soluble; other expression systems?; choose another
family member?; likely to crystallize?; microcrystals; diffraction-quality
crystals; contains methionines?]
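The Modeling Family definition in Section 3.3.2 (any two members share more
than 30% sequence identity) can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not the clustering procedure
used by PSI: the sequences are assumed to be already aligned to a common
length with "-" for gaps, identity is computed over columns where neither
sequence has a gap (other conventions exist), and families are built greedily,
so the result can depend on input order. All names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Columns in which either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def modeling_families(aligned, threshold=30.0):
    """Greedily group sequences so that every pair inside a family shares
    more than `threshold` percent identity.

    `aligned` maps a sequence name to its aligned sequence string.
    """
    families = []
    for name, seq in aligned.items():
        for family in families:
            # Join a family only if the criterion holds against ALL members,
            # which is what "any two sequences" in the definition requires.
            if all(percent_identity(seq, aligned[m]) > threshold for m in family):
                family.append(name)
                break
        else:
            families.append([name])
    return families


# Hypothetical toy alignment: "a" and "b" differ at one position, "c" is unrelated.
toy = {
    "a": "ACDEFGHIKL",
    "b": "ACDEFGHIKV",
    "c": "WWWWWWWWWW",
}
print(modeling_families(toy))
```

Requiring the threshold against every existing member, rather than against a
single representative, is what guarantees that any solved member can serve as
a template for every other member of the same family.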
the legend, 164;
its pictorial brilliance, 165;
its influence on later English poetry, 165.
Examiner, The (Leigh Hunt’s), 25.
Faerie Queene (Spenser’s), 12, 13, 35.
Faithful Shepherdess (Fletcher’s), 95.
Fanny, Lines to, 134.
Feast of the Poets (Leigh Hunt’s), 32.
Fletcher, 95.
Foliage (Leigh Hunt’s), 73.
Genius, births of, 1.
Gisborne, Letter to Maria (Shelley’s), 30.
Goethe, 154.
Grasshopper and Cricket, 35.
Gray, 113.
Greece, Keats’ love of, 58, 77, 154.
Guy Mannering (Scott’s), 115.
Hammond, Mr, 11, 14.
Hampstead, 72, 77.
Haslam, William, 45, 212 (note).
Haydon, 3, 40, 65, 68, 78, 137, 138, 191, 214.
Hazlitt, William, 83, 84.
History of his own Time (Burnet’s), 10.
Holmes, Edward, 8.
Holy Living and Dying (Jeremy Taylor’s), 206.
Homer, On first looking into Chapman’s (Sonnet), 23-24.
Hood, 219.
Hope, address to, 21.
Horne, R. H., 11.
Houghton, Lord, 75, 211-213.
Hunt, John, 25.
Hunt, Leigh, 22, 24, 25, 32, 35, 39, 49, 51, 68, 72, 78, 196.
Hyperion, 129, 133, 144;
its purpose, 152;
one of the grandest poems of our language, 157;
the influences of Paradise Lost on it, 158;
its blank verse compared with Milton’s, 158;
its elemental grandeur, 160;
remodelling of it, 185 seq.;
description of the changes, 186-187;
special interest of the poem, 187.
Imitation of Spenser (Keats’ first lines), 14, 20.
Indolence, Ode on, 174-175.
Isabella, or the Pot of Basil, 86;
source of its inspiration, 148;
minor blemishes, 149;
its Italian metre, 149;
its conspicuous power and charm, 149;
description of its beauties, 151.
Isle of Wight, 67.
Jennings, Mrs, 5, 11.
Jennings, Capt. M. J., 7.
Joseph and his Brethren (Wells’), 45.
Kean, 81.
Keats, John, various descriptions of, 7, 8, 9, 46, 47, 76, 136, 224;
birth, 2;
education at Enfield, 4;
death of his father, 5;
school-life, 5-9;
his studious inclinations, 10;
death of his mother, 10;
leaves school at the age of fifteen, 11;
is apprenticed to a surgeon, 11;
finishes his school-translation of the Æneid, 12;
reads Spenser’s Epithalamium and Faerie Queene, 12;
his first attempts at composition, 13;
goes to London and walks the hospitals, 14;
his growing passion for poetry, 15;
appointed dresser at Guy’s Hospital, 16;
his last operation, 16;
his early life in London, 18;
his early poems, 20 seq.;
his introduction to Leigh Hunt, 24;
Hunt’s great influence over him, 26 seq.;
his acquaintance with Shelley, 38;
his other friends, 40-45;
personal characteristics, 47-48;
goes to live with his brothers in the Poultry, 48;
publication of his first volume of poems, 65;
retires to the Isle of Wight, 66;
lives at Carisbrooke, 67;
changes to Margate, 68;
money troubles, 70;
spends some time at Canterbury, 71;
receives first payment in advance for Endymion, 71;
lives with his two brothers at Hampstead, 71;
works steadily at Endymion, 71-72;
makes more friends, 73;
writes part of Endymion at Oxford, 76;
his love for his sister Fanny, 77;
stays at Burford Bridge, 80;
goes to the ‘immortal dinner,’ 82;
he visits Devonshire, 87;
goes on a walking tour in Scotland with Charles Brown, 113;
crosses over to Ireland, 116;
returns to Scotland and visits Burns’ country, 118;
sows there the seeds of consumption, 120;
returns to London, 120;
is attacked in Blackwood’s Magazine and the Quarterly Review,
121;
Lockhart’s conduct towards him, 122;
death of his young brother Tom, 128;
goes to live with Charles Brown, 128;
falls in love, 130-131;
visits friends in Chichester, 133;
suffers with his throat, 133;
his correspondence with his brother George, 139;
goes to Shanklin, 143;
collaborates with Brown in writing Otho, 143;
goes to Winchester, 144;
returns again to London, 146;
more money troubles, 146;
determines to make a living by journalism, 146;
lives by himself, 146;
goes back to Mr Brown, 181;
Otho is returned unopened after having been accepted, 182;
want of means prevents his marriage, 190;
his increasing illness, 191 seq.;
temporary improvement in his health, 194;
publishes another volume of poems, 196;
stays with Leigh Hunt’s family, 197;
favourable notice in the Edinburgh Review, 197;
lives with the family of Miss Brawne, 198;
goes with Severn to spend the winter in Italy, 199;
the journey improves his health, 200;
writes his last lines, 201;
stays for a time at Naples, 203;
goes on to Rome, 203-204;
further improvement in his health, 205;
sudden and last relapse, 205;
he is tenderly nursed by his friend Severn, 206;
speaks of himself as already living a ‘posthumous life,’ 207;
grows worse and dies, 208;
various tributes to his memory, 214.
His genius awakened by the Faerie Queene, 13;
influence of other poets on him, 21;
experiments in language, 21, 64, 147, 169;
employment of the ‘Heroic’ couplet, 27, 30;
element and spirit of his own poetry, 50;
experiments in metre, 52;
studied musical effect of his verse, 55;
his Grecian spirit, 58, 77, 95, 114, 154;
view of the aims and principles of poetry, 61;
imaginary dependence on Shakspere, 69;
thoughts on the mystery of Evil, 88;
puns, 72, 202;
his poems Greek in idea, English in manner, 96;
his poetry a true spontaneous expression of his mind, 110;
power of vivifying, 161;
verbal licenses, 169;
influence on subsequent poets, 218;
felicity of phrase, 219.
Personal characteristics:
Celtic temperament, 3, 58, 70;
affectionate nature, 6, 7, 9, 10, 77;
morbid temperament, 6, 70, 211;
lovable disposition, 6, 8, 19, 212, 213;
temper, 7, 9, 233;
personal beauty, 8;
penchant for fighting, 8, 9, 72;
studious nature, 9, 112;
humanity, 39, 89, 114-115;
sympathy and tenderness, 47, 213;
eyes, description of, 46, 207, 224;
love of nature, 47, 55-56;
voice, 47;
desire of fame, 60, 125, 141, 207;
natural sensibility to physical and spiritual spell of moonlight,
95;
highmindedness, 125-126;
love romances, 127, 130-134, 180-181, 197, 200, 203, 212;
pride and sensitiveness, 211;
unselfishness, 213, 214;
instability, 215.
Various descriptions of, 7, 8, 9, 46, 47, 76, 136, 224.
Keats, Admiral Sir Richard, 7.
Keats, Fanny (Mrs Llanos), 77.
Keats, Mrs (Keats’ mother), 5, 10.
Keats, George, 90, 113, 192, 193, 210.
Keats, Thomas (Keats’ father), 2, 5.
Keats, Tom, 6, 127.
King Stephen, 179.
‘Kirk-men,’ 116-117.
La Belle Dame sans Merci, 165, 166, 218;
origin of the title, 165;
a story of the wasting power of love, 166;
description of its beauties, 166.
Lamb, Charles, 26, 82, 83.
Lamia, 143;
its source, 167;
versification, 167;
the picture of the serpent woman, 168;
Keats’ opinion of the Poem, 168.
Landor, 75.
Laon and Cythna, 76.
Letters, extracts, etc., from Keats’, 66, 67, 68, 69, 77, 78, 79, 81, 85,
87, 88, 89, 90, 91, 114, 116-117, 118, 126, 127, 129, 130, 134, 137,
139, 141, 145, 146, 157, 181, 182, 190, 194-195, 200, 203, 226.
‘Little Keats,’ 19.
Lockhart, 33, 122, 123.
London Magazine, 71.
Mackereth, George Wilson, 18.
Madeline, 162 seq.
‘Maiden-Thought,’ 88, 114.
Man about Town (Webb’s), 38.
Man in the Moon (Drayton’s), 93.
Margate, 68.
Mathew, George Felton, 19.
Meg Merrilies, 115-116.
Melancholy, Ode on, 175.
Milton, 51, 52, 54, 88.
Monckton, Milnes, 211.
Moore, 65.
Morning Chronicle, The, 124.
Mother Hubbard’s Tale (Spenser’s), 31.
Mythology, Greek, 10, 58, 152, 153.
Naples, 203.
Narensky (Brown’s), 74.
Newmarch, 19.
Nightingale, Ode to a, 136, 175, 218.
Nymphs, 73.
Odes, 21, 137, 145, 170-171, 172, 174, 175, 177, 218.
Orion, 11.
Otho, 143, 144, 180, 181.
Oxford, 75, 77.
Oxford Herald, The, 122.
Pan, Hymn to, 83.
Pantheon (Tooke’s), 10.
Paradise Lost, 88, 152, 154, 158.
Patriotism, 115.
Peter Corcoran (Reynolds’), 36.
Plays, 178, 179, 181, 182.
Poems (Keats’ first volume), faint echoes of other poets in them, 51;
their form, 52;
their experiments in metre, 52;
merely poetic preludes, 53;
their rambling tendency, 53;
immaturity, 60;
attractiveness, 61;
characteristic extracts, 63;
their moderate success, 65-66.
Poetic Art, Theory and Practice, 61, 64.
Poetry, joys of, 55;
principle and aims of, 61;
genius of, 110.
Polymetis (Spence’s), 10.
Pope, 19, 29, 30.
‘Posthumous Life,’ 207.
Prince Regent, 25.
Proctor, Mrs, 47.
Psyche, Ode to, 136, 171, 172.
Psyche (Mrs Tighe’s), 21.
Quarterly Review, 121, 124.
Rainbow (Campbell’s), 170.
Rawlings, William, 5.
Reynolds, John Hamilton, 36, 211, 214.
Rice, James, 37, 142.
Rimini, Story of, 27, 30, 31, 35.
Ritchie, 82.
Rome, 204.
Rossetti, 220.
Safie (Reynolds’), 36.
Scott, Sir Walter, 1, 33, 65, 115, 123, 124.
Scott, John, 124.
Sculpture, ancient, 136.
Sea-Sonnet, 67.
Severn, Joseph, 45, 72, 135, 191, 199 seq.
Shakspere, 67, 69.
Shanklin, 67, 143.
Shelley, 16, 32, 38, 56, 85, 110, 199, 203, 209.
Shenstone, 21.
Sleep and Poetry, 52, 60, 61, 109.
Smith, Horace, 33, 81.
Sonnets, 22, 23, 43, 48, 49, 57, 201.
Specimen of an Induction to a Poem, 52.
Spenser, 19, 20, 21, 31, 35, 54, 55.
Stephens, Henry, 18-20.
Surrey Institution, 84.
Taylor, Mr, 71, 81, 126, 144, 146, 206, 211.
Teignmouth, 87.
Tennyson, 218.
Thomson, 21.
Urn, Ode on a Grecian, 136, 172-174.
Vision, The, 187, 193 (see Hyperion).
Webb, Cornelius, 38.
Wells, Charles, 45.
Wilson, 33.
Winchester, 143-145.
Windermere, 113, 114.
Wordsworth, 1, 44, 46, 56, 64, 82, 83, 158, 219.
CAMBRIDGE PRINTED BY JOHN CLAY, M.A. AT THE UNIVERSITY
PRESS.
English Men of Letters.
Edited by JOHN MORLEY.
Popular Edition. Crown 8vo. Paper Covers, 1s.; Cloth, 1s. 6d.
each.
Pocket Edition. Fcap. 8vo. Cloth. 1s. net each.
Library Edition. Crown 8vo. Gilt tops. Flat backs. 2s. net each.
ADDISON.
By W. J. COURTHOPE.
BACON.
By DEAN CHURCH.
BENTLEY.
By Sir RICHARD JEBB.
BUNYAN.
By J. A. FROUDE.
BURKE.
By JOHN MORLEY.
BURNS.
By Principal SHAIRP.
BYRON.
By Professor NICHOL.
CARLYLE.
By Professor NICHOL.
CHAUCER.
By Dr. A. W. WARD.
COLERIDGE.
By H. D. TRAILL.
COWPER.
By GOLDWIN SMITH.
DEFOE.
By W. MINTO.
HUME.
By Professor HUXLEY,
F.R.S.
JOHNSON.
By Sir LESLIE STEPHEN,
K.C.B.
KEATS.
By Sir SIDNEY COLVIN.
LAMB, CHARLES.
By Canon AINGER.
LANDOR.
By Sir SIDNEY COLVIN.
LOCKE.
By THOMAS FOWLER.
MACAULAY.
By J. C. MORISON.
MILTON.
By MARK PATTISON.
POPE.
By Sir LESLIE STEPHEN,
K.C.B.
SCOTT.
By R. H. HUTTON.
SHELLEY.
By J. A. SYMONDS.
DE QUINCEY.
By Professor MASSON.
DICKENS.
By Dr. A. W. WARD.
DRYDEN.
By Professor
SAINTSBURY.
FIELDING.
By AUSTIN DOBSON.
GIBBON.
By J. C. MORISON.
GOLDSMITH.
By W. BLACK.
GRAY.
By EDMUND GOSSE.
HAWTHORNE.
By HENRY JAMES.
SHERIDAN.
By Mrs. OLIPHANT.
SIDNEY.
By J. A. SYMONDS.
SOUTHEY.
By Professor DOWDEN.
SPENSER.
By Dean CHURCH.
STERNE.
By H. D. TRAILL.
SWIFT.
By Sir LESLIE STEPHEN,
K.C.B.
THACKERAY.
By ANTHONY TROLLOPE.
WORDSWORTH.
By F. W. H. MYERS.
English Men of Letters.
NEW SERIES.
Crown 8vo. Gilt tops. Flat backs. 2s. net each.
MATTHEW ARNOLD. By Herbert W. Paul.
JANE AUSTEN. By F. Warre Cornish.
SIR THOMAS BROWNE. By Edmund Gosse.
BROWNING. By G. K. Chesterton.
FANNY BURNEY. By Austin Dobson.
CRABBE. By Alfred Ainger.
MARIA EDGEWORTH. By the Hon. Emily Lawless.
GEORGE ELIOT. By Sir Leslie Stephen, K.C.B.
EDWARD FITZGERALD. By A. C. Benson.
HAZLITT. By Augustine Birrell, K.C.
HOBBES. By Sir Leslie Stephen, K.C.B.
ANDREW MARVELL. By Augustine Birrell, K.C.
THOMAS MOORE. By Stephen Gwynn.
WILLIAM MORRIS. By Alfred Noyes.
WALTER PATER. By A. C. Benson.
RICHARDSON. By Austin Dobson.
ROSSETTI. By A. C. Benson.
RUSKIN. By Frederic Harrison.
SHAKESPEARE. By Walter Raleigh.
ADAM SMITH. By Francis W. Hirst.
SYDNEY SMITH. By George W. E. Russell.
JEREMY TAYLOR. By Edmund Gosse.
TENNYSON. By Sir Alfred Lyall.
JAMES THOMSON. By G. C. Macaulay.
In preparation.
MRS. GASKELL. By Clement Shorter.
BEN JONSON. By Prof. Gregory Smith.
MACMILLAN AND CO., LTD., LONDON.
Footnotes:
[1] See Appendix, p. 221.
[2] Ibid.
[3] John Jennings died March 8, 1805.
[4] Rawlings v. Jennings. See below, p. 138, and Appendix, p. 221.
[5] Captain Jennings died October 8, 1808.
[6] Houghton MSS.
[7] Rawlings v. Jennings. See Appendix, p. 221.
[8] Mrs Alice Jennings was buried at St Stephen’s, Coleman Street,
December 19, 1814, aged 78. (Communication from the Rev. J. W.
Pratt, M.A.)
[9] I owe this anecdote to Mr Gosse, who had it direct from Horne.
[10] Houghton MSS.
[11] A specimen of such scribble, in the shape of a fragment of
romance narrative, composed in the sham Old-English of Rowley,
and in prose, not verse, will be found in The Philosophy of Mystery,
by W. C. Dendy (London, 1841), p. 99, and another, preserved by Mr
H. Stephens, in the Poetical Works, ed. Forman (1 vol. 1884), p.
558.
[12] See Appendix.
[13] See C. L. Feltoe, Memorials of J. F. South (London, 1884), p. 81.
[14] Houghton MSS. See also Dr B. W. Richardson in the Asclepiad,
vol. i. p. 134.
[15] Houghton MSS.
[16] What, for instance, can be less Spenserian and at the same
time less Byronic than—
“For sure so fair a place was never seen
Of all that ever charm’d romantic eye”?
[17] See Appendix, p. 222.
[18] See Appendix, p. 223.
[19] See particularly the Invocation to Sleep in the little volume of
Webb’s poems published by the Olliers in 1821.
[20] See Appendix, p. 223.
[21] See Praeterita, vol. ii. chap. 2.
[22] See Appendix, p. 224.
[23] Compare Chapman, Hymn to Pan:—
“the bright-hair’d god of pastoral,
Who yet is lean and loveless, and doth owe,
By lot, all loftiest mountains crown’d with snow,
All tops of hills, and cliffy highnesses,
All sylvan copses, and the fortresses
Of thorniest queaches here and there doth rove,
And sometimes, by allurement of his love,
Will wade the wat’ry softnesses.”
[24] Compare Wordsworth:—
“Bees that soar for bloom,
High as the highest peak of Furness Fells,
Will murmur by the hour in foxglove bells.”
Is the line of Keats an echo or merely a coincidence?
[25] Mr W. T. Arnold in his Introduction (p. xxvii) quotes a parallel
passage from Leigh Hunt’s Gentle Armour as an example of the
degree to which Keats was at this time indebted to Hunt: forgetting
that the Gentle Armour was not written till 1831, and that the debt
in this instance is therefore the other way.
[26] See Appendix, p. 220.
[27] The facts and dates relating to Brown in the above paragraph
were furnished by his son, still living in New Zealand, to Mr Leslie
Stephen, from whom I have them. The point about the Adventures
of a Younger Son is confirmed by the fact that the mottoes in that
work are mostly taken from the Keats MSS. then in Brown’s hands,
especially Otho.
[28] Houghton MSS.
[29] See Appendix, p. 224.
[30] See Appendix, p. 225.
[31] See Appendix, p. 225.
[32] In the extract I have modernized Drayton’s spelling and
endeavoured to mend his punctuation: his grammatical constructions
are past mending.
[33] Mrs Owen was, I think, certainly right in her main conception of
an allegoric purpose vaguely underlying Keats’s narrative.
[34] Lempriere (after Pausanias) mentions Pæon as one of the fifty
sons of Endymion (in the Elean version of the myth): and in
Spenser’s Faerie Queene there is a Pæana—the daughter of the
giant Corflambo in the fourth book. Keats probably had both of these
in mind when he gave Endymion a sister and called her Peona.
[35] Book 1, Song 4. The point about Browne has been made by Mr
W. T. Arnold.
[36] The following is a fair and characteristic enough specimen of
Chamberlayne:—
“Upon the throne, in such a glorious state
As earth’s adored favorites, there sat
The image of a monarch, vested in
The spoils of nature’s robes, whose price had been
A diadem’s redemption; his large size,
Beyond this pigmy age, did equalize
The admired proportions of those mighty men
Whose cast-up bones, grown modern wonders, when
Found out, are carefully preserved to tell
Posterity how much these times are fell
From nature’s youthful strength.”
[37] See Appendix, p. 226.
[38] Houghton MSS.
[39] See Appendix, p. 227.
[40] Severn in Houghton MSS.
[41] Houghton MSS.
[42] Dilke (in a MS. note to his copy of Lord Houghton’s Life and
Letters, ed. 1848) states positively that Lockhart afterwards owned
as much; and there are tricks of style, e.g. the use of the Spanish
Sangrado for doctor, which seem distinctly to betray his hand.
[43] Leigh Hunt at first believed that Scott himself was the writer,
and Haydon to the last fancied it was Scott’s faithful satellite, the
actor Terry.
[44] Severn in the Atlantic Monthly, Vol. xi., p. 401.
[45] See Preface, p. viii.
[46] See Appendix, p. 227.
[47] Houghton MSS.
[48] The house is now known as Lawn Bank, the two blocks having
been thrown into one, with certain alterations and additions which in
the summer of 1885 were pointed out to me in detail by Mr William
Dilke, the then surviving brother of Keats’s friend.
[49] See Appendix, p. 227.
[50] See Appendix, p. 228.
[51] Decamerone, Giorn., iv. nov. 5. A very different metrical
treatment of the same subject was attempted and published, almost
simultaneously with that of Keats, by Barry Cornwall in his Sicilian
Story (1820). Of the metrical tales from Boccaccio which Reynolds
had agreed to write concurrently with Keats (see above, p. 86), two
were finished and published by him after Keats’s death in the volume
called A Garden of Florence (1821).
[52] As to the date when Hyperion was written, see Appendix, p.
228: and as to the error by which Keats’s later recast of his work has
been taken for an earlier draft, ibid., p. 230.
[53] If we want to see Greek themes treated in a Greek manner by
predecessors or contemporaries of Keats, we can do so—though
only on a cameo scale—in the best idyls of Chénier in France, as
L’Aveugle or Le Jeune Malade, or of Landor in England, as the
Hamadryad or Enallos and Cymodamia; poems which would hardly
have been written otherwise at Alexandria in the days of Theocritus.
[54] We are not surprised to hear of Keats, with his instinct for the
best, that what he most liked in Chatterton’s work was the minstrel’s
song in Ælla, that fantasia, so to speak, executed really with genius
on the theme of one of Ophelia’s songs in Hamlet.
[55] A critic, not often so in error, has contended that the deaths of
the beadsman and Angela in the concluding stanza are due to the
exigencies of rhyme. On the contrary, they are foreseen from the
first: that of the beadsman in the lines,
“But no—already had his death-bell rung;
The joys of all his life were said and sung;”
that of Angela where she calls herself
“A poor, weak, palsy-stricken, churchyard thing,
Whose passing bell may ere the midnight toll.”
[56] See Appendix, p. 229.
[57] Chartier was born at Bayeux. His Belle Dame sans Merci is a
poem of over eighty stanzas, the introduction in narrative and the
rest in dialogue, setting forth the obduracy shown by a lady to her
wooer, and his consequent despair and death.—For the date of
composition of Keats’s poem, see Appendix, p. 230.
[58] This has been pointed out by my colleague Mr A. S. Murray: see
Forman, Works, vol. iii. p. 115, note; and W. T. Arnold, Poetical
Works, &c., p. xxii, note.
[59] Houghton MSS.
[60] “He never spoke of any one,” says Severn, (Houghton MSS.,)
“but by saying something in their favour, and this always so
agreeably and cleverly, imitating the manner to increase your
favourable impression of the person he was speaking of.”
[61] See Appendix, p. 230.
[62] Auctores Mythographi Latini, ed. Van Staveren, Leyden, 1742.
Keats’s copy of the book was bought by him in 1819, and passed
after his death into the hands first of Brown, and afterwards of
Archdeacon Bailey (Houghton MSS.). The passage about Moneta
which had wrought in Keats’s mind occurs at p. 4, in the notes to
Hyginus.
[63] Mrs Owen was the first of Keats’s critics to call attention to this
passage, without, however, understanding the special significance it
derives from the date of its composition.
[64] Houghton MSS.
[65] See below, p. 193, note 2.
[66] “Interrupted,” says Brown oracularly in Houghton MSS., “by a
circumstance which it is needless to mention.”
[67] This passing phrase of Brown, who lived with Keats in the
closest daily companionship, by itself sufficiently refutes certain
statements of Haydon. But see Appendix, p. 232.
[68] A week or two later Leigh Hunt printed in the Indicator a few
stanzas from the Cap and Bells, and about the same time dedicated
to Keats his translation of Tasso’s Amyntas, speaking of the original
as “an early work of a celebrated poet whose fate it was to be
equally pestered by the critical and admired by the poetical.”
[69] See Crabb Robinson. Diaries, Vol. II. p. 197, etc.
[70] See Appendix, p. 233.
[71] Houghton MSS. In both the Autobiography and the
Correspondence the passage is amplified with painful and probably
not trustworthy additions.
[72] I have the date of sailing from Lloyd’s, through the kindness of
the secretary, Col. Hozier. For the particulars of the voyage and the
time following it, I have drawn in almost equal degrees from the
materials published by Lord Houghton, by Mr Forman, by Severn
himself in Atlantic Monthly, Vol. xi. p. 401, and from the unpublished
Houghton and Severn MSS.
[73] Severn, as most readers will remember, died at Rome in 1879,
and his remains were in 1882 removed from their original burying-
place to a grave beside those of Keats in the Protestant cemetery
near the pyramid of Gaius Cestius.
[74] Haslam, in Severn MSS.
[75] Severn MSS.
[76] Houghton MSS.
[77] Ibid.
[78] Houghton MSS.
[79] Ibid.
Introduction To Protein Structure Prediction Methods And Algorithms Wiley Series In Bioinformatics Huzefa Rangwala


  • 6. INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
  • 7. WILEY SERIES ON BIOINFORMATICS: COMPUTATIONAL TECHNIQUES AND ENGINEERING
    Series Editors, Yi Pan & Albert Zomaya
    Knowledge Discovery in Bioinformatics: Techniques, Methods and Applications / Xiaohua Hu & Yi Pan
    Grid Computing for Bioinformatics and Computational Biology / Albert Zomaya & El-Ghazali Talbi
    Analysis of Biological Networks / Björn H. Junker & Falk Schreiber
    Bioinformatics Algorithms: Techniques and Applications / Ion Mandoiu & Alexander Zelikovsky
    Machine Learning in Bioinformatics / Yanqing Zhang & Jagath C. Rajapakse
    Biomolecular Networks / Luonan Chen, Rui-Sheng Wang, & Xiang-Sun Zhang
    Computational Systems Biology / Huma Lodhi
    Computational Intelligence and Pattern Analysis in Biology Informatics / Ujjwal Maulik, Sanghamitra, & Jason T. Wang
    Mathematics of Bioinformatics: Theory, Practice, and Applications / Matthew He
    Introduction to Protein Structure Prediction: Methods and Algorithms / Huzefa Rangwala & George Karypis
  • 8. INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
    Methods and Algorithms
    Edited by
    HUZEFA RANGWALA
    GEORGE KARYPIS
    A JOHN WILEY & SONS, INC., PUBLICATION
  • 9. Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. 
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
    Library of Congress Cataloging-in-Publication Data:
    Rangwala, Huzefa.
    Introduction to protein structure prediction : methods and algorithms / Huzefa Rangwala, George Karypis.
    p. cm. -- (Wiley series in bioinformatics; 14)
    Includes bibliographical references and index.
    ISBN 978-0-470-47059-6 (hardback)
    1. Proteins--Structure--Mathematical models. 2. Proteins--Structure--Computer simulation. I. Karypis, G. (George) II. Title.
    QP551.R225 2010
    572′.633--dc22
    2010028352
    Printed in Singapore
    10 9 8 7 6 5 4 3 2 1
  • 10. CONTENTS
    PREFACE
    CONTRIBUTORS
    1 INTRODUCTION TO PROTEIN STRUCTURE PREDICTION / Huzefa Rangwala and George Karypis
    2 CASP: A DRIVING FORCE IN PROTEIN STRUCTURE MODELING / Andriy Kryshtafovych, Krzysztof Fidelis, and John Moult
    3 THE PROTEIN STRUCTURE INITIATIVE / Andras Fiser, Adam Godzik, Christine Orengo, and Burkhard Rost
    4 PREDICTION OF ONE-DIMENSIONAL STRUCTURAL PROPERTIES OF PROTEINS BY INTEGRATED NEURAL NETWORKS / Yaoqi Zhou and Eshel Faraggi
    5 LOCAL STRUCTURE ALPHABETS / Agnel Praveen Joseph, Aurélie Bornot, and Alexandre G. de Brevern
    6 SHEDDING LIGHT ON TRANSMEMBRANE TOPOLOGY / Gábor E. Tusnády and István Simon
    7 CONTACT MAP PREDICTION BY MACHINE LEARNING / Alberto J.M. Martin, Catherine Mooney, Ian Walsh, and Gianluca Pollastri
    8 A SURVEY OF REMOTE HOMOLOGY DETECTION AND FOLD RECOGNITION METHODS / Huzefa Rangwala
    9 INTEGRATIVE PROTEIN FOLD RECOGNITION BY ALIGNMENTS AND MACHINE LEARNING / Allison N. Tegge, Zheng Wang, and Jianlin Cheng
  • 11. CONTENTS
    10 TASSER-BASED PROTEIN STRUCTURE PREDICTION / Shashi Bhushan Pandit, Hongyi Zhou, and Jeffrey Skolnick
    11 COMPOSITE APPROACHES TO PROTEIN TERTIARY STRUCTURE PREDICTION: A CASE-STUDY BY I-TASSER / Ambrish Roy, Sitao Wu, and Yang Zhang
    12 HYBRID METHODS FOR PROTEIN STRUCTURE PREDICTION / Dmitri Mouradov, Bostjan Kobe, Nicholas E. Dixon, and Thomas Huber
    13 MODELING LOOPS IN PROTEIN STRUCTURES / Narcis Fernandez-Fuentes and Andras Fiser
    14 MODEL QUALITY ASSESSMENT USING A STATISTICAL PROGRAM THAT ADOPTS A SIDE CHAIN ENVIRONMENT VIEWPOINT / Genki Terashi, Mayuko Takeda-Shitaka, Kazuhiko Kanou, and Hideaki Umeyama
    15 MODEL QUALITY PREDICTION / Liam J. McGuffin
    16 LIGAND-BINDING RESIDUE PREDICTION / Chris Kauffman and George Karypis
    17 MODELING AND VALIDATION OF TRANSMEMBRANE PROTEIN STRUCTURES / Maya Schushan and Nir Ben-Tal
    18 STRUCTURE-BASED MACHINE LEARNING MODELS FOR COMPUTATIONAL MUTAGENESIS / Majid Masso and Iosif I. Vaisman
    19 CONFORMATIONAL SEARCH FOR THE PROTEIN NATIVE STATE / Amarda Shehu
    20 MODELING MUTATIONS IN PROTEINS USING MEDUSA AND DISCRETE MOLECULE DYNAMICS / Shuangye Yin, Feng Ding, and Nikolay V. Dokholyan
    INDEX
  • 12. PREFACE
    PROTEIN STRUCTURE PREDICTION
    Proteins play a crucial role in governing several life processes. Stunningly complex networks of proteins perform innumerable functions in every living cell. Knowing the function and structure of proteins is crucial for the development of better drugs, higher yield crops, and even synthetic biofuels. As such, knowledge of protein structure and function leads to crucial advances in life sciences and biology. The motivation behind the structural determination of proteins is the belief that structural information provides insights into their function, which will ultimately result in a better understanding of intricate biological processes.
    Breakthroughs in large-scale sequencing have led to a surge in the available protein sequence information that has far outstripped our ability to characterize the structural and functional characteristics of these proteins. Several research groups have been working on determining the three-dimensional structure of proteins using a wide variety of computational methods.
    The problem of unraveling the relationship between the amino acid sequence of a protein and its three-dimensional structure has been one of the grand challenges in molecular biology. The importance and the far-reaching implications of being able to predict the structure of a protein from its amino acid sequence are manifested by the ongoing biennial competition on "Critical Assessment of Protein Structure Prediction" (CASP) that started more than 16 years ago. CASP is designed to assess the performance of current structure prediction methods, and the number of participating groups has continued to increase over the years.
    This book presents a series of chapters by authors who are involved in the task of structure determination and in using modeled structures for applications involving drug discovery and protein design. The book is divided into the following themes.
  • 13. PREFACE
    BACKGROUND ON STRUCTURE PREDICTION
    Chapter 1 provides an introduction to the protein structure prediction problem along with information about databases and resources that are widely used. Chapters 2 and 3 provide information regarding two very important initiatives in the field: (i) the structure prediction flagship competition (CASP), and (ii) the protein structure initiative (PSI), respectively. Since many of the approaches developed have been tested in the CASP competition, Chapter 2 lays the foundation for the need for such an evaluation, the problem definitions, significant innovations, competition format, as well as future outlook. Chapter 3 describes the protein structure initiative, which is designed to determine representative three-dimensional structures within the human genome.
    PREDICTION OF STRUCTURAL ELEMENTS
    Within each structural entity called a protein there lies a set of recurring substructures, and within these substructures are smaller substructures. Beyond the goal of predicting the three-dimensional structure of a protein from sequence, several other problems have been defined and methods have been developed for solving them. Chapters 4–6 provide the definitions of these recurring substructures, called local alphabets or secondary structures, and the computational approaches used for predicting them. Chapter 6 specifically focuses on transmembrane proteins, a class known to be harder to crystallize. Knowing the pairs of residues within a protein that are in contact or close to each other provides useful distance constraints that can be used while modeling the three-dimensional structure of the protein. Chapter 7 focuses on the problem of contact map prediction and also shows the use of sophisticated machine learning methods to solve the problem. A successful solution for each of these subproblems assists in solving the overarching protein structure prediction problem.
    TERTIARY STRUCTURE PREDICTION
    Chapters 8–11 discuss the widely used structure prediction methods that rely on homology modeling, threading, and fragment assembly. Chapters 8 and 9 discuss the problems of fold recognition and remote homology detection, which attempt to model the three-dimensional structure of a protein using known structures. Chapters 10 and 11 discuss a combination of threading-based approaches with modeling the protein in parts or fragments, which usually helps in modeling the structure of proteins known not to have a close homolog within the structure databases. Chapter 12 is a survey of the hybrid methods that use a combination of computational and experimental methods to achieve high-resolution protein structures in a high-throughput manner.
  • 14. PREFACE
    Chapter 17 provides information about the challenges in modeling transmembrane proteins along with a discussion of some of the widely used methods for these sets of proteins. Chapter 13 describes the loop prediction problem and how the technique can be used for refinement of the modeled structures. Chapters 14 and 15 assess the modeled structures and provide a notion of the quality of structures. This is extremely important for a biologist, who would like to have a metric that describes the goodness of the structure before use. Chapter 19 provides insights into the different conformations that a protein may take and the approaches used to sample the different conformations.
    FUNCTIONAL INSIGHTS
    Certain parts of the protein structure may be conserved and interact with other biomolecules (e.g., proteins, DNA, RNA, and small molecules) and perform a particular function due to such interactions. Chapter 16 discusses the problem of ligand-binding site prediction and its role in determining the function of the proteins. The approach uses some of the homology modeling principles used for modeling the entire structure. Chapter 18 introduces a computational model that detects the differences between a protein structure (modeled or experimentally determined) and its modeled mutant. Chapter 20 describes the use of molecular dynamics-based approaches for modeling mutants.
    ACKNOWLEDGEMENTS
    We wish to acknowledge the many people who have helped us with this project. We firstly thank all the coauthors who spent time and energy to edit their chapters and also served as reviewers by providing critical feedback for improving other chapters. Kevin Deronne, Christopher Kauffman, and Rezwan Ahmed also assisted in reviewing several of the chapters and helped the book take a form that is complete on the topic of protein structure prediction and exciting to read. Finally, we wish to thank our families and friends.
    We hope that you as a reader benefit from this book and feel as excited about this field as we are.
    Huzefa Rangwala
    George Karypis
  • 15. CONTRIBUTORS
    Nir Ben-Tal, Department of Biochemistry and Molecular Biology, Tel Aviv University, Tel Aviv, Israel
    Aurélie Bornot, Institut National de la Santé et de la Recherche Médicale, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris, France
    Alexandre G. de Brevern, Institut National de la Santé et de la Recherche Médicale, Université Paris Diderot, Institut National de la Transfusion Sanguine, 75015, Paris, France
    Jianlin Cheng, Computer Science Department and Informatics Institute, University of Missouri, Columbia, MO 65211
    Feng Ding, Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599
    Nicholas E. Dixon, School of Chemistry, University of Wollongong, NSW 2522, Australia
    Nikolay V. Dokholyan, Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599
    Eshel Faraggi, Indiana University School of Informatics, Indiana University-Purdue University Indianapolis, and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202
    Krzysztof Fidelis, Protein Structure Prediction Center, Genome Center, University of California, Davis, Davis, CA
    Andras Fiser, Department of Systems and Computational Biology and Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY 10461
    Narcis Fernandez-Fuentes, Leeds Institute of Molecular Medicine, University of Leeds, Leeds, UK
  • 16. CONTRIBUTORS
    Adam Godzik, Program in Bioinformatics and Systems Biology, Sanford-Burnham Medical Research Institute, La Jolla, CA 92037
    Thomas Huber, The University of Queensland, School of Chemistry and Molecular Biosciences, QLD, Australia
    Agnel Praveen Joseph, Institut National de la Santé et de la Recherche Médicale, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris, France
    Kazuhiko Kanou, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
    George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
    Chris Kauffman, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
    Bostjan Kobe, The University of Queensland, School of Chemistry and Molecular Biosciences, Brisbane, Australia
    Andriy Kryshtafovych, Protein Structure Prediction Center, Genome Center, University of California, Davis, Davis, CA
    Alberto J.M. Martin, Complex and Adaptive Systems Lab, School of Computer Science and Informatics, UCD Dublin, Ireland
    Majid Masso, Department of Bioinformatics and Computational Biology, George Mason University, Manassas, VA 20110
    Liam J. McGuffin, School of Biological Sciences, The University of Reading, Reading, UK
    Catherine Mooney, Shields Lab, School of Medicine and Medical Science, University College Dublin, Ireland
    John Moult, Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850
    Dmitri Mouradov, The University of Queensland, School of Chemistry and Molecular Biosciences, QLD, Australia
    Christine Orengo, Department of Structural and Molecular Biology, University College London, London, UK
    Shashi Bhushan Pandit, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318
    Gianluca Pollastri, Complex and Adaptive Systems Lab, School of Computer Science and Informatics, UCD Dublin, Ireland
    Huzefa Rangwala, Department of Computer Science, George Mason University, Fairfax, VA 22030
  • 17. CONTRIBUTORS
    Burkhard Rost, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032
    Ambrish Roy, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
    Maya Schushan, Department of Biochemistry and Molecular Biology, Tel Aviv University, Tel Aviv, Israel
    Amarda Shehu, Department of Computer Science, George Mason University, Fairfax, VA 22030
    Mayuko Takeda-Shitaka, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
    István Simon, Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary
    Jeffrey Skolnick, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318
    Allison N. Tegge, Computer Science Department and Informatics Institute, University of Missouri, Columbia, MO 65211
    Genki Terashi, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
    Gábor E. Tusnády, Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary
    Hideaki Umeyama, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
    Iosif I. Vaisman, Department of Bioinformatics and Computational Biology, George Mason University, Manassas, VA 20110
    Ian Walsh, Complex and Adaptive Systems Lab, School of Computer Science and Informatics, UCD Dublin, Ireland
    Zheng Wang, Computer Science Department, University of Missouri, Columbia, MO 65211
    Sitao Wu, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
    Shuangye Yin, Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599
    Yang Zhang, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
    Hongyi Zhou, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318
  • 18. xiv CONTRIBUTORS Yaoqi Zhou, Indiana University School of Informatics, Indiana University- Purdue University Indianapolis, and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202 flast.indd xiv flast.indd xiv 8/20/2010 3:37:40 PM 8/20/2010 3:37:40 PM
CHAPTER 1

INTRODUCTION TO PROTEIN STRUCTURE PREDICTION

HUZEFA RANGWALA
Department of Computer Science, George Mason University, Fairfax, VA

GEORGE KARYPIS
Department of Computer Science, University of Minnesota, Minneapolis, MN

Introduction to Protein Structure Prediction: Methods and Algorithms, Edited by Huzefa Rangwala and George Karypis. Copyright © 2010 John Wiley & Sons, Inc.

Proteins have a vast influence on the molecular machinery of life. Stunningly complex networks of proteins perform innumerable functions in every living cell. Knowing the function and structure of proteins is crucial for the development of improved drugs, better crops, and even synthetic biofuels. As such, knowledge of protein structure and function leads to crucial advances in life sciences and biology.

With recent advances in large-scale sequencing technologies, we have seen an exponential growth in protein sequence information. Protein structures are primarily determined using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, but these methods are time consuming, expensive, and not feasible for all proteins. The experimental approaches to determine protein function (e.g., gene knockout, targeted mutation, and inhibition of gene expression) are low-throughput in nature [1,2]. As such, our ability to produce sequence information far outpaces the rate at which we can produce structural and functional information.

Consequently, researchers are increasingly reliant on computational approaches to extract useful information from experimentally determined three-dimensional (3D) structures and functions of proteins. Unraveling the relationship between pure sequence information and 3D structure and/or function remains one of the fundamental challenges in molecular biology.

Function prediction is generally approached by using inheritance through homology [2], that is, proteins with similar sequences (common evolutionary ancestry) frequently carry out similar functions. However, several studies [2–4] have shown that a stronger correlation exists between structure conservation and function (structure implies function), and that a higher correlation exists between sequence conservation and structure (sequence implies structure). Together these give the chain sequence → structure → function.

1.1. INTRODUCTION TO PROTEIN STRUCTURES

In this section we introduce the basic definitions and facts about protein structure, describe the four levels of protein structure, and provide details about protein structure databases.

1.1.1. Protein Structure Levels

Within each structural entity called a protein lies a set of recurring substructures, and within these substructures are smaller substructures still. As an example, consider hemoglobin, the oxygen-carrying molecule in human blood. Hemoglobin consists of four polypeptide chains (subunits) that come together to form its quaternary structure. Each chain assembles (i.e., folds) itself independently to form a tertiary structure. These tertiary structures are composed of multiple secondary structure elements, which in hemoglobin's case are α-helices. α-Helices (and their counterpart β-sheets) have elegant repeating patterns dependent upon sequences of amino acids.

1.1.1.1. Primary Structure. Amino acids form the basic building blocks of proteins. Each amino acid consists of a central carbon atom (Cα) bonded to an amino (NH2) group, a carboxyl (COOH) group, and a side chain (R) group. The side chain group differentiates the various amino acids.
Proteins are built primarily from 20 different amino acids. A protein is a chain of amino acids linked with peptide bonds. A pair of amino acids forms a peptide bond between the amino group of one and the carboxyl group of the other. This polypeptide chain of amino acids is known as the primary structure or the protein sequence.

1.1.1.2. Secondary Structure. A sequence of characters representing the secondary structure of a protein describes the general 3D form of local regions. These regions organize themselves independently from the rest of the protein into patterns of repeatedly occurring structural fragments. The most dominant local conformations of polypeptide chains are α-helices and β-sheets. These local structures have a certain regularity in their form, attributed to the hydrogen bond interactions between various residues. An α-helix has a coil-like structure, whereas a β-sheet consists of strands of residues lying side by side in parallel or antiparallel orientation. In addition to regular secondary structure elements, irregular shapes form an important part of the structure and function of proteins. These elements are typically termed coil regions.

Secondary structure can be divided into several types, although usually at least three classes (α-helix, coil, and β-sheet) are used. No unique method of assigning residues to a particular secondary structure state from atomic coordinates exists, although the most widely accepted protocol is based on the Dictionary of Protein Secondary Structure (DSSP) algorithm [5]. DSSP uses the following structural classes: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and – (other). Several other secondary structure assignment algorithms use a reduction scheme that converts this eight-state assignment down to three states by assigning H and G to the helix state (H), E and B to the strand state (E), and the rest (I, T, S, and –) to a coil state (C). This is the format generally used in structure databases.

1.1.1.3. Tertiary Structure. The tertiary structure of the protein is defined as the global 3D structure, represented by 3D coordinates for each atom. These tertiary structures are composed of multiple secondary structure elements, and the 3D structure is a function of the interacting side chains between the different amino acids. Hence, the linear ordering of amino acids forms secondary structure; arranging secondary structures yields tertiary structure.

1.1.1.4. Quaternary Structure. Quaternary structures represent the interaction between multiple polypeptide chains. The interaction between the various chains is due mainly to non-covalent interactions between the atoms of the different chains, such as hydrogen bonding, van der Waals interactions, and ionic bonding. Covalent disulfide bonds can also link chains.
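The eight-to-three-state reduction described in Section 1.1.1.2 can be written as a simple lookup. The following is a minimal sketch; the function name `reduce_dssp` and the example string are illustrative assumptions, and only the mapping itself (H, G to H; E, B to E; I, T, S, – to C) comes from the text.

```python
# Eight-state DSSP codes mapped to three states, following the reduction
# given in the text: H, G -> H; E, B -> E; I, T, S, - -> C.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H",
    "E": "E", "B": "E",
    "I": "C", "T": "C", "S": "C", "-": "C",
}


def reduce_dssp(eight_state: str) -> str:
    """Convert an eight-state DSSP string into a three-state (H/E/C) string.

    The name and interface are hypothetical; they are not part of DSSP.
    """
    reduced = []
    for code in eight_state:
        if code not in DSSP_TO_THREE_STATE:
            raise ValueError(f"unknown DSSP code: {code!r}")
        reduced.append(DSSP_TO_THREE_STATE[code])
    return "".join(reduced)


if __name__ == "__main__":
    # Made-up assignment string, one character per residue.
    print(reduce_dssp("-HHHHGGGTT-EEEEB-SS-"))
```

Other reduction schemes exist, so the dictionary is the single place to change if a different convention is needed.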
Research in computational structure prediction concerns itself mainly with predicting secondary and tertiary structures from known, experimentally determined primary structure (sequence). This is due to the relative ease of determining primary structure and the complexity involved in quaternary structure.

1.1.2. Protein Sequence and Structure Databases

The large amount of protein sequence information, experimentally determined structure information, and structural classification information is stored in publicly available databases. In this section we review some of the databases that are used in this field, and provide their availability information in Table 1.1.

1.1.2.1. Sequence Databases. The Universal Protein Resource (UniProt) [6] is the most comprehensive warehouse containing information about protein sequences and their annotation. It is a database of protein sequences and their function that is formed by aggregating the information present in the Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases. The UniProtKB 13.2 version of the database (released on April 8, 2008) consists of 5,939,836 protein sequence entries (Swiss-Prot providing 362,782 entries and TrEMBL providing 5,577,054 entries).

However, several proteins have high pairwise sequence identity, and as such lead to redundant information. The UniProt database [6] therefore provides subsets of sequences such that the sequence identity between all pairs of sequences within a subset is less than a predetermined threshold. In essence, UniProt contains the UniRef100, UniRef90, and UniRef50 subsets, where within each group the sequence identity between a pair of sequences is less than 100%, 90%, and 50%, respectively.

The National Center for Biotechnology Information (NCBI) also provides a nonredundant (NCBI nr) database of protein sequences using sequences from a wide variety of sources. This database may still contain pairs of proteins with high sequence identity, but all exact duplicates are removed. The NCBI nr version 2.2.18 (released on March 2, 2008) contains 6,441,864 protein sequences.

1.1.2.2. Protein Data Bank (PDB). The Research Collaboratory for Structural Bioinformatics (RCSB) PDB [7] stores experimentally determined 3D structures of biological macromolecules, including nucleic acids and proteins. As of April 20, 2008, this database consists of 46,287 protein structures that are determined using X-ray crystallography (90%), NMR (9%), and other methods such as cryo-electron microscopy (cryo-EM). These experimental methods are time-consuming and expensive, and X-ray crystallography additionally requires the protein to crystallize.

1.1.2.3. Structure Classification Databases. Various methods have been proposed to categorize protein structures.
These methods are based on the pairwise structural similarity between the protein structures, as well as the topological and geometric arrangement of atoms and predominant secondary-structure-like subunits. Structural Classification of Proteins (SCOP) [8], Class, Architecture, Topology, and Homologous superfamily (CATH) [9], and Families of Structurally Similar Proteins (FSSP) [10] are three widely used structure classification databases. The classification methodology involves breaking a protein chain or complex into independent folding units called domains, and then classifying these domains into a set of hierarchical classes sharing similar structural characteristics.

TABLE 1.1 Protein Sequence and Structure Databases

Database   Information                Availability Link
UniProt    Sequence                   http://www.pir.uniprot.org/
UniRef     Cluster sequences          http://www.pir.uniprot.org/
NCBI nr    Nonredundant sequences     ftp://ftp.ncbi.nlm.nih.gov/blast/db/
PDB        Structure                  http://www.rcsb.org/
SCOP       Structure classification   http://scop.mrc-lmb.cam.ac.uk/scop/
CATH       Structure classification   http://www.cathdb.info/
FSSP       Structure classification   http://www.ebi.ac.uk/dali/fssp/
ASTRAL     Compendium                 http://astral.berkeley.edu/

The databases referred to in this table are the most popular for protein structure-related information.

SCOP Database. SCOP [8] is a manually curated database that provides a detailed and comprehensive description of the evolutionary and structural relationships between proteins whose structure is known (present in the PDB). SCOP classifies protein structures using visual inspection as well as structural comparison using a suite of automated tools. The basic unit of classification is generally a domain. SCOP classification is based on four hierarchical levels that encompass evolutionary and structural relationships [8]. In particular, proteins with a clear evolutionary relationship are classified to be within the same family. Generally, protein pairs within the same family have pairwise residue identities greater than 30%. Protein pairs with low sequence identity, but whose structural and functional features imply a probable common evolutionary origin, are classified to be within the same superfamily. Protein pairs with similar major secondary structure elements and topological arrangement of substructures (as well as favoring certain packing geometries) are classified to be within the same fold. Finally, protein pairs having a predominant set of secondary structures (e.g., all-α-helix proteins) lie within the same class. The four hierarchical levels, that is, family, superfamily, fold, and class, define the structure of the SCOP database.
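The four-level hierarchy can be made concrete with a small sketch. SCOP labels each domain with a dotted identifier of the form class.fold.superfamily.family (for example, a.1.1.2). The helper names below and the example identifiers are assumptions for illustration and are not part of any SCOP tool.

```python
from typing import Dict, Optional

# SCOP levels from most general to most specific, in the order in which
# they appear in a dotted identifier such as "a.1.1.2".
SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def parse_scop_id(scop_id: str) -> Dict[str, str]:
    """Split a dotted SCOP identifier into its four levels (hypothetical helper)."""
    parts = scop_id.strip().split(".")
    if len(parts) != len(SCOP_LEVELS) or not all(parts):
        raise ValueError(f"expected class.fold.superfamily.family, got {scop_id!r}")
    return dict(zip(SCOP_LEVELS, parts))


def deepest_shared_level(id_a: str, id_b: str) -> Optional[str]:
    """Return the most specific level two domains share, or None if they
    differ already at the class level."""
    a, b = parse_scop_id(id_a), parse_scop_id(id_b)
    shared = None
    for level in SCOP_LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


if __name__ == "__main__":
    # Illustrative identifiers only.
    print(deepest_shared_level("a.1.1.1", "a.1.1.2"))  # same superfamily
    print(deepest_shared_level("a.1.1.2", "a.4.5.1"))  # same class only
```

Comparing identifiers level by level mirrors how benchmark sets for remote homology detection (same superfamily, different family) and fold recognition (same fold, different superfamily) are commonly derived.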
The SCOP 1.73 release (September 26, 2007) classifies 34,494 PDB entries (97,178 domains) into 1086 unique folds, 1777 unique superfamilies, and 3464 unique families.

CATH Database. The CATH [9] database is a semi-automated protein structure classification database similar to SCOP. CATH uses a consensus of three automated classification techniques to break a chain into domains and classify them in the various structural categories [11]. Domains for proteins that are not resolved by the consensus approach are determined manually. These domains are then classified into the following hierarchical categories using both manual and automated methods in conjunction. The first level, class, is determined based on the secondary structure composition and packing within the structure. The second level, architecture, clusters proteins sharing the same orientation of the secondary structure elements but ignoring the connectivity between these substructural units. The third level, topology, groups protein pairs with a high structure alignment score as determined by the SSAP [12] algorithm, which in essence share both overall shape and connectivity of secondary structures. The fourth level, homologous superfamily, groups proteins that share a common ancestor, as identified by sequence alignment as well as the SSAP structure alignment method. Structures are further classified to be within the same sequence family if they share a high sequence identity. The CATH 3.1.0 release (January 19, 2007) classifies 30,028 proteins (93,885 domains) from the PDB into 40 architecture-level classes, 1084 topology-level classes, and 2091 homologous-level classes.

FSSP Database. The FSSP [10] is a structure classification database. FSSP uses an automatic classification scheme that employs exhaustive structure-to-structure alignment of proteins using the DALI [13] alignment method. FSSP does not provide a hierarchical classification like the SCOP and CATH databases. Instead, it employs a hierarchical clustering algorithm using the pairwise structure similarity scores, which can be used to define fold classes, although these are not very accurate.

There have been several studies [14,15] analyzing the relationship between the SCOP, CATH, and FSSP databases for representing the fold space for proteins. The major disagreement between the three databases lies in the domain identification step, rather than the domain classification step. A high percentage of agreement exists between the SCOP, CATH, and FSSP databases, especially at the fold level with sequence identity greater than 25%.

ASTRAL Compendium. The A Structural Alignment Library (ASTRAL) [16–18] compendium is a set of databases and tools used for analysis of protein structures and sequences. This database is partially derived from, and augments, the SCOP [8] database. ASTRAL provides accurate linkage between the biological sequence and the reported structure in PDB, and identifies the domains within the sequence using SCOP. Since the majority of domain sequences in PDB are very similar to others, ASTRAL tools reduce the redundancy by selecting high-quality representatives.
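Redundancy reduction of this kind, whether for ASTRAL subsets or for identity-thresholded sequence sets such as UniRef, can be sketched as a greedy filter. The sketch below rests on several assumptions: the function names are hypothetical, the identity measure is a toy that only works on pre-aligned sequences of equal length, and the input order stands in for a quality ranking (best first). It is not the procedure used by either resource.

```python
from typing import Callable, List, Sequence


def fraction_identical(a: str, b: str) -> float:
    """Toy identity measure: fraction of matching positions in two
    pre-aligned sequences of equal length (an assumption of this sketch)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_representatives(
    sequences: Sequence[str],
    max_identity: float,
    identity: Callable[[str, str], float] = fraction_identical,
) -> List[str]:
    """Greedily keep a sequence only if its identity to every sequence
    already kept is below max_identity. Earlier entries win ties, so the
    caller should order the input from highest to lowest quality."""
    representatives: List[str] = []
    for candidate in sequences:
        if all(identity(candidate, kept) < max_identity for kept in representatives):
            representatives.append(candidate)
    return representatives


if __name__ == "__main__":
    toy = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "ACDEFGHAAA"]
    print(select_representatives(toy, max_identity=0.9))
    print(select_representatives(toy, max_identity=0.5))
```

Because the filter is greedy, the result depends on the input order. Swapping in an alignment-based identity function changes only the `identity` argument.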
Using the reduced nonredundant set of representative proteins allows for sampling of all the different structures in the PDB. This also removes bias due to overrepresented structures. Subsets provided by ASTRAL are based on SCOP domains and use high-quality structure files only. Independent subsets of representative proteins are identified using a greedy algorithm with a filtering criterion based on pairwise sequence identity determined using the Basic Local Alignment Search Tool (BLAST) [19], an e-value-based threshold, or a SCOP level-based filter.

1.2. PROTEIN STRUCTURE PREDICTION METHODS

One of the biggest goals in structural bioinformatics is the prediction of the 3D structure of a protein from its one-dimensional (1D) protein sequence. The goal is to be able to determine the shape (known as a fold) that a given amino acid sequence will adopt. The problem is further divided based on whether the sequence will adopt a new fold or bear resemblance to an existing fold (template) in some protein structure database. Fold recognition is easy when the sequence in question has a high degree of sequence similarity to a sequence with known structure [20]. If the two sequences share evolutionary ancestry they are said to be homologous. For such sequence pairs we can build the structure for the query protein by choosing the structure of the known homologous sequence as template. This is known as comparative modeling.

In the case where no good template structure exists for the query, one must attempt to build the protein tertiary structure from scratch. These methods are usually called ab initio methods. In a third scenario, fold prediction, there may not necessarily be a good sequence similarity with a known structure, but a structural template may still exist for the given sequence. To clarify this case, if one were aware of the target structure, one could extract the template using structure–structure alignments of the target against the entire structural database. It is important to note that the target and template need not be homologous. These two cases define the fold prediction (homologous) and fold prediction (analogous) problems during the Critical Assessment of Protein Structure Prediction (CASP) competition.

1.2.1. Comparative Modeling

Comparative modeling, or homology modeling, is used when there exists a clear relationship between the sequence of a query protein (unknown structure) and a sequence of a known structure. The most basic approach to structure prediction for such (query) proteins is to perform a pairwise sequence alignment against each sequence in protein sequence databases. This can be accomplished using sequence alignment algorithms such as Smith-Waterman [21] or sequence search algorithms (e.g., BLAST [19]).
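The pairwise search described above can be sketched with a minimal Smith-Waterman local alignment score. The scoring values (match = 2, mismatch = -1, linear gap = -2), the function names, and the toy sequences are assumptions for illustration; practical searches typically use amino acid substitution matrices and affine gap penalties.

```python
from typing import Dict, Tuple


def smith_waterman_score(query: str, subject: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between two sequences using
    the Smith-Waterman recurrence with a linear gap penalty."""
    cols = len(subject) + 1
    prev = [0] * cols          # scores for the previous row of the DP matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if query[i - 1] == subject[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_template(query: str, templates: Dict[str, str]) -> Tuple[str, int]:
    """Score the query against every template sequence and return the
    name and score of the highest-scoring one (hypothetical helper)."""
    if not templates:
        raise ValueError("no template sequences supplied")
    scores = {name: smith_waterman_score(query, seq) for name, seq in templates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    toy_templates = {"template_1": "XXCDEFYY", "template_2": "QQQQ"}
    print(best_template("ACDEFG", toy_templates))
```

Only two rows of the dynamic programming matrix are kept because the sketch returns the score alone. Recovering the alignment itself would require storing the full matrix and tracing back from the best cell.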
With a good sequence alignment in hand, the challenge in comparative modeling becomes how to best build a 3D protein structure for a query protein using the template structure.

The heart of the above process is the selection of a suitable structural template based on sequence pair similarity. This is followed by the alignment of the query sequence to the selected template structure to build the backbone of the query protein. Finally, the entire modeled structure is refined by loop construction and side chain modeling. Several comparative modeling methods, more commonly known as modeler programs, have been developed over the past several years [22,23], focusing on various parts of the problem.

As seen in the various years of CASP [24,25], the span of comparative modeling approaches [22,23] follows five basic steps: (i) selecting one or more suitable templates, (ii) utilizing sensitive sequence-template alignment algorithms, (iii) building a protein model using the sequence-structure alignment as reference, (iv) evaluating the quality of the model, and (v) refining the model. These typical steps for the comparative modeling process are shown in Figure 1.1.
FIGURE 1.1 Flowchart for the comparative modeling process. [Figure not reproduced. Box labels: Start; Template Identification (Structure Databases); Choose Template; Align Target Sequence to Template Structure; Build Model for Target Using Template Structure; Raw Model; Evaluate the Model; Model Good?; Stop; Side Chain Placement; Loop Modeling; Refinement.]
1.2.2. Fold Prediction (Homologous)

While satisfactory methods exist to detect homologs (proteins that share evolutionary ancestry) with high levels of similarity, accurately detecting homologs at low levels of sequence similarity (remote homology detection) remains a challenging problem. Some of the most popular approaches for remote homology prediction compare a protein with a collection of related proteins using methods such as Position-Specific Iterated BLAST (PSI-BLAST) [26], protein family profiles [27], hidden Markov models (HMMs) [28,29], and the Sequence Alignment and Modeling System (SAM) [30]. These schemes produce models that are generative in the sense that they build a model for a set of related proteins and then check to see how well this model explains a candidate protein.

In recent years, the performance of remote homology detection has been further improved through the use of methods that explicitly model the differences between the various protein families (classes) by building discriminative models. In particular, a number of different methods that use Support Vector Machines (SVM) [31] have been developed to produce results that are generally superior to those produced by either pairwise sequence comparisons or approaches based on generative models, provided there are sufficient training data [32–39].

1.2.3. Fold Prediction (Analogous)

Occasionally a query sequence will have a native fold similar to another known fold in a database, but the two sequences will have no detectable similarity. In many cases the two proteins will lack an evolutionary relationship as well. As the definition of this problem relies on the inability of current methods to detect sequence similarity, the set of proteins falling into this category remains in flux.
As new methods continue to improve at finding sequence similarities as a result of increasing database size and better techniques, the number of proteins in question decreases. Techniques to find structures for such query sequences revolve around mounting the query sequence on a series of template structures in a process known as threading [40–42]. An objective energy function provides a score for each alignment, and the highest scoring template is chosen. Obviously, if the correct template does not exist in the series, then the method will not produce an accurate prediction. As a result of this limitation, predicting the structure of proteins in this category is as challenging as predicting protein targets that are part of the new or rare folds.

1.2.4. Ab Initio

Techniques to predict novel protein structure have come a long way in recent years, although a definitive solution to the problem remains elusive. Research in this area can be roughly divided into fragment assembly [43–45] and first principle-based approaches, although occasionally the two are combined [46]. The former attempt to assign a fragment with known structure to a section of the unknown query sequence. The latter start with an unfolded conformation, usually surrounded by solvent, and allow simulated physical forces to fold the protein as would normally happen in vivo. Usually, algorithms from either class will use reduced representations of query proteins during initial stages to reduce the overall complexity of the problem.

Even in the case of these ab initio prediction methods, the state-of-the-art methods [46–48] determine several template structures (using the template selection methods used in comparative modeling). The final protein is modeled using an assembly of fragments or substructures fitted together using a highly optimized approximate energy and statistics-based potential function.

This book presents methods developed for protein structure prediction. In particular, methods and problems that are prevalent in a biennial structure prediction competition (CASP) are discussed in the first half of the book. The second half of the book discusses approaches that combine experimental and computational approaches for structure prediction and also new techniques for predicting structures of transmembrane proteins. Finally, the book discusses the applications of protein structure within the context of function prediction and drug discovery.

REFERENCES

1. G. Pandey, V. Kumar, and M. Steinbach. Computational approaches for protein function prediction: A survey. Technical Report 06-23, Department of Computer Science and Engineering, University of Minnesota, 2006.
2. D. Lee, O. Redfern, and C. Orengo. Predicting protein function from sequence and structure. Nature Reviews. Molecular Cell Biology, 8(12):995–1005, 2007.
3. J.C. Whisstock and A.M. Lesk. Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics, 36(3):307–340, 2003.
4. D. Devos and A. Valencia. Practical limits of function prediction. Proteins, 41(1):98–107, 2000.
5. W. Kabsch and C. Sander. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637, 1983.
6. UniProt Consortium. The universal protein resource (uniprot). Nucleic Acids Research, 36(Database issue):D190–D195, 2008.
7. H.M. Berman, T.N. Bhat, P.E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and J. Westbrook. The Protein Data Bank and the challenge of structural genomics. Nature Structural Biology, 7:957–959, 2000.
8. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. Scop: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.
9. C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, and J.M. Thornton. Cath: A hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
10. L. Holm and C. Sander. The fssp database: Fold classification based on structure-structure alignment of proteins. Nucleic Acids Research, 24(1):206–209, 1996.
11. S. Jones, M. Stewart, A. Michie, M.B. Swindells, C. Orengo, and J.M. Thornton. Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Science, 7(2):233–242, 1998.
12. W.R. Taylor and C.A. Orengo. Protein structure alignment. Journal of Molecular Biology, 208(1):1–22, 1989.
13. L. Holm and C. Sander. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 233(1):123–138, 1993.
14. C. Hadley and D. Jones. A systematic comparison of protein structure classifications: Scop, cath and fssp. Structure, 7(9):1099–1112, 1999.
15. R. Day, D.A.C. Beck, R.S. Armen, and V. Daggett. A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary. Protein Science, 12(10):2150–2160, 2003.
16. S.E. Brenner, P. Koehl, and M. Levitt. The astral compendium for sequence and structure analysis. Nucleic Acids Research, 28:254–256, 2000.
17. J.-M. Chandonia, N.S. Walker, L.L. Conte, P. Koehl, M. Levitt, and S.E. Brenner. ASTRAL compendium enhancements. Nucleic Acids Research, 30(1):260–263, 2002.
18. J.M. Chandonia, G. Hon, N.S. Walker, L.L. Conte, P. Koehl, M. Levitt, and S.E. Brenner. The astral compendium in 2004. Nucleic Acids Research, 32:D189–D192, 2004.
19. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
20. P. Bourne and H. Weissig. Structural Bioinformatics. Hoboken, NJ: John Wiley & Sons, 2003.
21. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
22. P.A. Bates and M.J.E. Sternberg. Model building by comparison at casp3: Using expert knowledge and computer automation. Proteins: Structure, Function, and Genetics, 3:47–54, 1999.
23. A. Fiser, R.K. Do, and A. Sali. Modeling of loops in protein structures. Protein Science, 9:1753–1773, 2000.
24. C. Venclovas. Comparative modeling in casp5: Progress is evident, but alignment errors remain a significant hindrance. Proteins: Structure, Function, and Genetics, 53:380–388, 2003.
25. C. Venclovas and M. Margelevicius. Comparative modeling in casp6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins: Structure, Function, and Bioinformatics, 7:99–105, 2005.
26. S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped blast and psi-blast: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
27. M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: Detection of distantly related proteins. PNAS, 84:4355–4358, 1987.
28. A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.
29. P. Baldi, Y. Chauvin, T. Hunkapiller, and M. McClure. Hidden Markov models of biological primary sequence information. PNAS, 91:1053–1063, 1994.
30. K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846–856, 1998.
31. V. Vapnik. Statistical Learning Theory. New York: John Wiley, 1998.
32. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1/2):95–114, 2000.
33. L. Liao and W.S. Noble. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Proceedings of the International Conference on Research in Computational Molecular Biology, 225–232, 2002.
34. C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel for svm protein classification. Proceedings of the Pacific Symposium on Biocomputing, 564–575, 2002.
35. C. Leslie, E. Eskin, W.S. Noble, and J. Weston. Mismatch string kernels for svm protein classification. Advances in Neural Information Processing Systems, 20(4):467–476, 2003.
36. Y. Hou, W. Hsu, M.L. Lee, and C. Bystroff. Efficient remote homology detection using local structure. Bioinformatics, 19(17):2294–2301, 2003.
37. Y. Hou, W. Hsu, M.L. Lee, and C. Bystroff. Remote homology detection using local sequence-structure correlations. Proteins: Structure, Function, and Bioinformatics, 57:518–530, 2004.
38. H. Saigo, J.P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
39. R. Kuang, E. Ie, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology, 3:152–160, 2004.
40. D.T. Jones, W.R. Taylor, and J.M. Thornton. A new approach to protein fold recognition. Nature, 358:86–89, 1992.
41. D.T. Jones. Genthreader: An efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287(4):797–815, 1999.
42. J.U. Bowie, R. Luethy, and D. Eisenberg. A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253:164–170, 1991.
43. K.T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. Journal of Molecular Biology, 268:209–225, 1997.
44. K. Karplus, R. Karchin, J. Draper, J. Casper, Y. Mandel-Gutfreund, M. Diekhans, and R. Hughey. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins: Structure, Function, and Genetics, 53:491–496, 2003.
  • 31.
45. J. Lee, S.-Y. Kim, K. Joo, I. Kim, and J. Lee. Prediction of protein tertiary structure using PROFESY, a novel method based on fragment assembly and conformational space annealing. Proteins: Structure, Function, and Bioinformatics, 56:704–714, 2004.
46. C.A. Rohl, C.E.M. Strauss, K.M.S. Misura, and D. Baker. Protein structure prediction using Rosetta. Methods in Enzymology, 383:66–93, 2004.
47. Y. Zhang. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics, 9:40, 2008.
48. Y. Zhang, A.J. Arakaki, and J. Skolnick. TASSER: An automated method for the prediction of protein tertiary structures in CASP6. Proteins: Structure, Function, and Bioinformatics, 7:91–98, 2005.
  • 32. CHAPTER 2
CASP: A DRIVING FORCE IN PROTEIN STRUCTURE MODELING
ANDRIY KRYSHTAFOVYCH and KRZYSZTOF FIDELIS
Protein Structure Prediction Center, Genome Center, University of California, Davis, Davis, CA
JOHN MOULT
Center for Advanced Research in Biotechnology, University of Maryland, College Park, College Park, MD
Introduction to Protein Structure Prediction: Methods and Algorithms, Edited by Huzefa Rangwala and George Karypis. Copyright © 2010 John Wiley & Sons, Inc.
2.1. WHY CRITICAL ASSESSMENT OF PROTEIN STRUCTURE PREDICTION (CASP) WAS NEEDED
More than half a century has elapsed since it was shown that amino acid sequence determines the three-dimensional structure of a protein [1], but a general procedure to translate sequence into structure is still to be established. Several dozen methods for generating protein structure from sequence have been developed, providing different levels of model accuracy in different modeling circumstances. With such a variety of modeling approaches and success levels, it was important to establish an objective procedure to compare the performances of the methods and learn their advantages and weaknesses. Also, with only sparse reports on the performance of most methods, it was difficult to arrive at a clear understanding of current capabilities and bottlenecks in the field. Specifically, it was not possible to address many key questions about modeling methods, in particular:
  • 33.
    1. What are the most effective strategies for protein structure modeling?
    2. What are the main factors influencing the outcome of a protein structure modeling experiment and how close can a model get to the corresponding experimental structure?
    3. How can related structures on which a model can be based be identified reliably (the template identification problem)? How accurately can coordinates from the template structure be mapped to the correct positions on the target sequence (the alignment problem)? Are models produced by altering/refining templates more accurate than the models built by simply copying coordinates of the template (the refinement problem)?
    4. How well can the reliability of the model in general and specific regions in particular be estimated (the quality assessment problem)?
    5. How well can fully automatic modeling servers perform, compared with a combination of computing methods and human knowledge?
    6. Has there been progress in the field?
    7. What are the bottlenecks to further progress?
    8. Where can future efforts be most productively focused?
In order to rigorously address these issues, John Moult and colleagues pioneered the CASP experiment in 1994 [2]. The initiative was well accepted by the community of computational biologists, and the experiment, after eight completed rounds, continues to attract considerable attention from protein structure modelers around the world. Two hundred thirty-four predictor groups from 25 countries participated in the last completed round, CASP8, submitting over 80,000 predictions (see Fig. 2.1 for historical CASP participation statistics), and approximately the same number of predictor groups are participating in CASP9, which is currently (July 2010) under way.
Even though we, the CASP co-organizers, continue to emphasize that CASP is primarily a scientific endeavor aimed at establishing the current state of the art in protein structure prediction, many view it more as a “world championship” in this field of science. Thus, to a large extent, CASP owes its popularity to the twin human drives of competitiveness and curiosity. Whatever the case, a large community of structure modelers devotes considerable effort to the process, and it has now been emulated in other areas of computational biology [3–6].
2.2. CASP PRINCIPLES AND ORGANIZATION
In the pre-CASP times, protein structure modeling methods were tested using the procedure schematically shown in Figure 2.2a. Method developers selected sequences to test their own methods (usually with different research groups selecting different sets of proteins), and assessed the results by comparing models to the experimental structures already known to them at the time of
  • 34. “prediction.” Many apparently successful modeling results were reported in the literature, but the inability of others to reproduce the results and the lack of resulting useful applications strongly suggested that this testing approach was not strict enough to ensure objective assessment of the results. In particular, many felt that the reported results were too easily influenced by the known answers. CASP was established to address the deficiencies in these
FIGURE 2.1 Statistics on (a) the number of participating groups and (b) the number of submitted predictions in CASP experiments held so far. In panel (b), bars representing the number of tertiary structure predictions are shown in dark gray, while bars representing the cumulative number of predictions in other categories (secondary structure, residue-residue contacts, disorder regions, domain boundaries, function, quality assessment) are shown in light gray.
[Figure 2.1 bar labels, CASP1 (1994) through CASP8 (2008). (a) Participating groups: 35, 70, 98, 163, 215, 208, 253, 234. (b) CASP predictions, 3D/other: 129/0, 891/56, 2569/1238, 9698/1438, 25105/3623, 34831/6452, 52235/11482, 55130/25430.]
  • 35. traditional testing procedures. The main principles of CASP, summarized in Figure 2.2b, are:
    • “Blind” prediction regime. Predictors are required to submit their models before the answers (experimental structures) are publicly available. This is the primary CASP principle for ensuring rigorous conclusions.
    • Independent assessment of the results. Experts in the field are invited to perform an independent assessment of all submitted models. The assessors may not participate in the experiment in the role of predictors.
    • Same targets for everyone. Proteins for modeling (“targets” in CASP jargon) are selected not by the predictors but by the organizers, who are not permitted to participate in the experiment and so have no interest in introducing any selection bias. The same set of targets is used to test all the methods, thus facilitating direct comparison of performance. Organizers strive to provide a reasonably large set of targets with a balanced range of difficulty, so that the assessment is statistically sound and shows the range of success and failure across the spectrum of structure modeling problems.
    • Anonymity of assessment. All information that could be directly or indirectly used to identify submitting research groups is stripped from the predictions. This information is not made available to the assessors until after their analysis of the results is completed.
    • Same evaluation criteria for everyone. All predictions are evaluated using the same set of numerical criteria.
FIGURE 2.2 Schematics of (a) pre-CASP and (b) CASP testing procedures for protein structure prediction methods.
  • 36.
    • Data availability for post-experiment comparisons. All predictions and automatic evaluation results are released to the public upon completion of each CASP experiment, so as to allow others to reproduce the results, and to facilitate methods development.
    • Control of the experiment by the participants. Those participating in CASP are involved in shaping the rules and scope of the experiment through a variety of mechanisms, particularly a discussion forum (FORCASP) and a predictors’ meeting at each conference, where motions for change are considered and voted upon.
Together, these principles ensure a more objective determination of capabilities in the field of protein structure modeling than the conventional peer-review publication system. They make unjustified claims more difficult to publish, and provide a powerful mechanism for predictors to establish the strength of their methods.
The principles remain untouched from one experiment to another, but a number of changes and additions to the details have been introduced, and these are summarized in Table 2.1.
2.3. CASP PROCESS
CASP is a complicated process, requiring careful planning, data management, and security. The Protein Structure Prediction Center, established to support the experiment at the Lawrence Livermore Laboratory in 1996 and based at the University of California, Davis, since 2005, provides the infrastructure for methods testing, develops method evaluation and visualization tools, and handles all data management issues [7]. Experiments are held every 2 years. The timetable of a typical CASP round is schematically shown in Figure 2.3. The experiment is open to all. The Prediction Center releases targets for prediction and collects models from registered participants for approximately 3 months.
Targets for structure prediction are either structures soon to be solved by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, or structures already solved but not yet publicly accessible. Prediction methods are divided into two categories: those using a combination of computational methods and human experience, and those relying solely on computational methods. The integrity of the latter category is ensured by requiring that servers process target information and return models automatically. A window of 3 weeks is usually provided for prediction of a target by human-expert groups and 3 days by servers. Following closing of the server prediction window, the server models are posted at the Prediction Center web site. These models can then be used by human-expert predictors as starting points for further, more detailed modeling. They are also used for testing model quality assessment methods in CASP. Once all models of a target have been collected and the experimental
  • 37. TABLE 2.1 Changes in the Consecutive CASPs
Columns: CASP | Prediction Categories | Main Evaluation Measures/Packages | General

CASP1 (1994)
    Prediction categories: TS, AL, SS. Protein tertiary structure (TS: coordinates format; AL: alignment format). Secondary structure (SS).
    Main evaluation measures/packages: RMSD.
    General: Main CASP principles were established.
CASP2 (1996)
    Prediction categories: TS, AL, SS. Prediction of protein-ligand complexes introduced.
    General: Prediction Center established to support CASP.
CASP3 (1998)
    Prediction categories: TS, AL, SS. Prediction of complexes dropped.
    Main evaluation measures/packages: ProSup and DALI packages were used for structural superpositions. New evaluation software tested at the Prediction Center to replace RMSD with a measure more suitable for model-target comparison.
    General: CAFASP experiment to evaluate fold recognition servers run as a satellite to CASP.
CASP4 (2000)
    Prediction categories: TS, AL, SS, RR. Residue–residue (RR) contact prediction introduced.
    Main evaluation measures/packages: New evaluation software further developed, resulting in the LGA package [9]. The GDT_TS measure of structural similarity and the AL0 score for correctness of the model-target alignment used as basic CASP measures.
CASP5 (2002)
    Prediction categories: TS, AL, RR, DR. SS dropped. Disordered regions (DR) prediction introduced.
  • 38. TABLE 2.1 (Continued)
Columns: CASP | Prediction Categories | Main Evaluation Measures/Packages | General

CASP6 (2004)
    Prediction categories: TS, AL, RR, DR, DP, FN. Domain boundary (DP) prediction introduced. Function prediction (FN) introduced.
    Main evaluation measures/packages: DAL, nonrigid body structure superposition software, used for scoring models in addition to LGA.
    General: CASP moved to a server testing procedure independent of CAFASP. Time for server response was set to 48 hours plus 24 hours for potential format corrections. Release of server predictions to human-expert groups 72 hours after target release.
CASP7 (2006)
    Prediction categories: TS, AL, RR, DR, DP, FN, QA, TR. Model quality assessment (QA) category introduced. Model refinement (TR) category introduced. Prediction of multimers introduced.
    Main evaluation measures/packages: MAMMOTH structure superposition program additionally used for analysis of the results.
    General: Structural assessment categories changed from the classic division into comparative modeling/fold recognition/ab initio to template-based/template-free. High-accuracy modeling category separately assessed.
CASP8 (2008)
    Prediction categories: TS, AL, RR, DR, DP, FN, QA, TR. Prediction of multimers dropped. FN category was narrowed to binding site prediction.
    Main evaluation measures/packages: DALI structure superposition program was additionally used for analysis of the results. Prediction Center automatically calculated group rankings for comparative modeling targets according to different measures.
    General: Limit on number of targets for human-expert groups. Division of targets into human-server and server-only categories. Time for server response was set to 72 hours. Separate assessor for contacts, domains, and function predictions.
CASP9 (2010, under way)
    Prediction categories: TS, AL, RR, DR, FN, QA, TR. DP prediction is dropped. Prediction of multimers is reinstated.
  • 39. structure is available, the Prediction Center performs a standard numerical evaluation of the models, taking the experimental structure as the gold standard. A battery of tools is used for the numerical evaluation of predictions: LGA [8], ACE [9], DAL [10], MAMMOTH [11], and DALI [12]. If the target consists of more than one well-defined structural domain, the evaluation is performed on each of these as well as on the complete target (the official domain boundaries are defined by the assessors). The results of automatic evaluation are made available to the independent assessors, who typically add their own analysis methods and make more subjective assessments of the merits and faults of the models. The identity of the predictors is concealed from the assessors while they conduct their analysis. Assessment outcomes are presented to the community at the predictors’ meeting usually held in December of a CASP year. At that time, results of the evaluations are also made publicly available through the Prediction Center web site (http://predictioncenter.org), allowing predictors to compare their own models with those submitted by other groups. Details of all the experiments completed so far and their results are available through this web site. The web site also hosts a discussion forum, FORCASP, allowing exchange of thoughts by the predictors. The articles by the assessors, the organizers, and the most successful prediction groups are published in special issues of the journal Proteins: Structure, Function, and Bioinformatics. There are currently eight such issues available, one for each of the eight CASP experiments [2,13–19]. The articles in the special issues discuss in detail the methods tested in CASP, the evaluation results, and the analysis of the progress made. Below we briefly summarize the state of the art in different CASP modeling categories.
FIGURE 2.3 Timetable of the CASP experiment.
2.4. METHOD CLASSES AND PREDICTION DIFFICULTY CATEGORIES
In evaluating the ability of prediction methods, it is important to realize that the difficulty of a modeling problem is determined by many factors. In theory, it
  • 40. is possible to calculate the structure of any protein from knowledge of its amino acid sequence and environmental conditions alone, since it has long been established that these factors determine the functional conformation [1]. In practice, it is not yet possible to follow the detailed folding behavior of a system with as many atoms and degrees of freedom as a protein, nor to thoroughly search for the global free energy minimum of such a system [20–22]. Two types of methods for combating these limitations have been developed. One, by far the most effective at present, utilizes experimental structures of evolutionarily related proteins, providing templates on which to base a model. For cases where no such relationship exists, or none can be discovered, partially effective structure prediction techniques have been developed using simplified energy functions and employing approximate energy landscape search strategies. These two approaches define the two main classes of prediction methods: template-based modeling (TBM), sometimes referred to as comparative or homology modeling, and template-free modeling. Historically, template-free methods were often termed ab initio (or first principles), but members of the CASP community objected on the grounds that these methods often make use of knowledge-based potentials to evaluate interactions and assemblies of observed peptide fragment conformations to generate trial structures. Template-free methods are currently effective only for modeling small proteins (100 residues or less). Template-based methods can be applied wherever it is possible to identify a structurally similar protein that can be used as a template for building the model, irrespective of size. When the two approaches have been applied to the same modeling problem, template-based methods have usually proven more accurate than template-free methods.
Thus, the most significant division in modeling difficulty is between cases where a model can be built based on templates derived from known experimental structures, and those where it cannot. At one extreme, high-resolution models competitive with experiments can be produced for proteins with sequences very similar to that of a known structure. At the other extreme, low-resolution, very approximate models can be generated by template-free methods for proteins with no detectable sequence or structure relationship to known structures. To properly assess method successes and failures, CASP subdivides modeling into these two separate categories, each with its own challenges, and hence requiring its own evaluation procedures.
2.5. TBM
Whenever there is a detectable sequence relationship between two proteins, the corresponding structures have been found to be similar. Thus, if at least a single structure within a family of homologous proteins is determined experimentally, then template-based methods can be used to model practically all proteins in that family. The potential of this modeling is huge: by some estimates, structures are already known for a quarter of the protein single-domain
  • 41. families of significant size and half of all known sequences can be partially modeled due to their membership in these families (M. Levitt in [23]).
A typical template-based method consists of several consecutive steps: identifying probable templates; selecting/combining suitable templates; aligning the target sequence to the template(s); copying structurally conserved regions from the selected template(s); modeling structurally variable regions; packing side chains; refining the model; and evaluating its quality. Each modeling step is prone to errors, but, as a rule, the earlier in the process the error is introduced, the costlier it is. As the template-based category covers a wide range of structure similarity, different kinds of errors are typical for different modeling difficulty subcategories.
2.5.1. High-Resolution TBM
The most reliable models can be built in cases where there is a strong sequence relationship between the target protein and a template (i.e., higher than ∼40% sequence identity between target and template). In these situations target and template are expected to have very similar structures. Template selection and alignment errors are rare here, and simply copying the backbone of a suitable template may be sufficient to produce a model that rivals NMR or low-resolution X-ray structures in accuracy (∼1 Å C-alpha atom root-mean-square deviation [RMSD] from the experimental structure). The main effort in this class of prediction shifts to modeling of regions of structure not present in a template (loops), proper placement of side chains, and fine adjustment of the structure (refinement).
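The ∼40% sequence identity figure quoted for high-resolution TBM presupposes a pairwise target-template alignment. As a minimal sketch, the snippet below computes percent identity over the aligned (non-gap) columns of such an alignment. The function names and the choice of denominator are assumptions made for illustration; identity is defined in several ways in practice, and the 40% constant is only the approximate value given in the text, not a CASP rule.

```python
# Hypothetical helper names; not part of any CASP or modeling package.
HIGH_RESOLUTION_IDENTITY = 40.0  # approximate threshold quoted in the text (percent)


def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # gap columns do not count as aligned positions
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def is_high_resolution_candidate(aligned_target: str, aligned_template: str) -> bool:
    """True when identity exceeds the (assumed) ~40% high-resolution threshold."""
    return percent_identity(aligned_target, aligned_template) > HIGH_RESOLUTION_IDENTITY


if __name__ == "__main__":
    # Toy alignment: 6 aligned columns, 5 of them identical.
    print(percent_identity("ACDE-FG", "ACDQYFG"))
```

Counting only gap-free columns keeps the measure independent of how much of the target the template covers; coverage would need to be reported separately.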
Such high-resolution models often present a level of detail that is sufficient for detecting sites of protein–protein interactions, understanding enzyme reaction mechanisms, interpreting disease-causing mutations, molecular replacement in solving crystal structures, and occasionally even drug design.
2.5.2. Medium Difficulty Range TBM
New, more sensitive methods of detecting remote sequence relationships, especially Position-Specific Iterative-Basic Local Alignment Search Tool (PSI-BLAST) and profile–profile methods, have greatly extended our ability to utilize structure templates based on more remote sequence relationships. The quality of models in this category has steadily improved over the course of the CASP experiments. Models with a quite accurate core (typically 2–3 Å C-alpha atom RMSD from the native structure) can now often be generated. Factors still limiting progress include difficulty in recognizing the best templates, combining information from several templates, aligning target sequences with template structures, adjusting for considerable shifts in conserved regions of structure, and modeling regions not represented in any of the available templates. As in high-resolution homology modeling, refinement methods play a role in improving the accuracy of final models.
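The C-alpha RMSD values quoted above (∼1 Å for high-resolution models, 2–3 Å for the core of medium-difficulty models) can be illustrated with a short sketch. It assumes the model and the experimental structure have already been superimposed and that the two coordinate lists contain the same residues in the same order; finding the optimal superposition, which packages such as LGA handle, is deliberately left out. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]  # x, y, z of one C-alpha atom, in angstroms


def ca_rmsd(model: Sequence[Coord], native: Sequence[Coord]) -> float:
    """Root-mean-square deviation between equivalent C-alpha atoms.

    Assumes both structures are already in the same frame (pre-superimposed)
    and list the same residues in the same order.
    """
    if len(model) != len(native) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, native):
        squared += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared / len(model))


if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    model = [(x, y + 1.0, z) for x, y, z in native]  # every atom displaced by 1 A
    print(ca_rmsd(model, native))
```

Because the deviations are squared before averaging, a few badly placed residues can dominate the value.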
  • 42. Even though less accurate than high-resolution models, these models can also be used in many biological applications such as detecting probable sites of protein–protein interactions, identifying the approximate role of disease-associated substitutions, or assessing the likely role of alternative splicing in protein function.
2.5.3. Difficult TBM
In cases where no evolutionary relationship can be detected based on sequence, it is still likely that the fold of a target protein is similar to that of a known structure (implying a very remote evolutionary relationship or convergence of folds). Methods that check the compatibility of a target protein with the experimental structures use more sophisticated analyses (e.g., secondary structure comparison, knowledge-based structural potentials of various types) and can sometimes assist in identifying templates for modeling. Because in such cases the templates have no explicit sequence relationship with the target, alignment is often not reliable and, not surprisingly, the accuracy of the resulting model is often low. Nevertheless, impressive models are sometimes obtained, and there has been substantial progress over the course of CASP experiments. We attribute this progress to both methodological improvements and the increased size of sequence and structure databases.
Although models for hard TBM targets may not provide accurate structural detail, they are useful for providing an overall idea of what a structure is like, recognizing approximate domain boundaries, helping choose residues for mutagenesis experiments, and providing approximate information about molecular function.
2.5.4. Progress and Challenges in TBM
Assessment of template-based predictions over the several rounds of CASP has shown indisputable progress in the area, and the accuracy of the models has grown substantially [24–28].
One measure of this is that for the majority of targets the best models are now closer to the native structure than any of the available template structures. Despite this very evident progress, there are many challenges still remaining. After years of development, finding a good template and producing an accurate alignment remain the two issues with a major impact on the quality of models. The coverage of the target by the template imposes the upper limit on the fraction of residues that can be aligned between the template and the target. Figure 2.4 shows the maximum alignability together with the alignment accuracy for the best models in the latest four CASPs (see our article [28], pp. 196, 198, for the definitions). It can be observed that the trend in all CASPs is the same: both maximum alignability and alignment accuracy fall steadily and approximately linearly with increasing target difficulty. The slope of the falloff for these two measures, however, is different. For the easiest targets, predictors can routinely achieve
  • 43. alignment accuracy close to the maximum possible from a single template or even better; in the mid range of difficulty, best alignments are typically within 20% of the optimum, but up to 40% of the structure cannot be aligned at all; for the difficult targets, the gap between the maximum alignability and alignment accuracy grows to 30%, with the percentage of nonaligned residues increasing to 70% [28]. Predictors often manage to achieve alignment accuracy higher than a single-template maximum by using additional templates or by employing free modeling methods for the structurally nonconserved regions such as loops, insertions, or deletions. It is encouraging to see an increase in the number of such cases: there are 22 targets in all CASPs where predictors exceeded maximum alignability by at least 2%; of these, nine were from CASP8 (squares above the 0% level in Fig. 2.4), eight from CASP7, four from CASP6, and one from CASP5.
FIGURE 2.4 Maximum template-imposed alignability (SWALI, solid lines) and alignment accuracy of the best template-based models (AL0, dashed lines) from CASP5–8 as a function of target difficulty. Maximum alignability is defined as the fraction of equivalent residues in superposition of the target and best template structure; target difficulty combines coverage of the target structure by the best template and target-template sequence identity. CASP8: black lines; CASP7: blue; CASP6: brown; CASP5: red. Squares represent the difference between alignment quality and maximum alignability for CASP8 targets. Points over the 0% level represent targets where alignment accuracy was better than maximum alignability. (See color insert.)
[Figure 2.4 axes: x, target difficulty; y, models AL0 and templates SWALI (%).]
Improvement in alignment over the best template shows only one side of the effectiveness of TBM methods. Analysis of the overall quality of the models (measured in terms of Global Distance Test Total Score [GDT_TS]) shows that typically the best models are superior to the corresponding naïve models built by simply copying coordinates of aligned residues from the best possible template. This additional gain in quality can be associated with modeling of regions not present in the best template, and also with improving the quality of the model by refinement. Figure 2.5 provides comparison of
  • 44. quality of the best submitted models versus naïve models built on the best single template. Data trend lines indicate that in general the best submitted models are better than the corresponding naïve models, except for the targets representing the hardest one-fourth of the difficulty scale. In CASP6–8, over 70% of the best models in the template-based category have registered added value over the naïve model. The inset histogram shows that the majority of best predictions (153 out of 242) are up to eight GDT_TS units above the corresponding best naïve model. The median difference between the best model and naïve model equals 2.74 GDT_TS units (mean: 2.07 GDT_TS).
FIGURE 2.5 GDT_TS score of the best submitted model and the best naïve model built on a single template for each TBM target in CASP5–8. The darker trendline corresponds to the predicted models; the lighter one, to the naïve ones. Naïve models are built on the top 20 templates according to the target coverage for each target, and the score for the naïve model with the highest GDT_TS is shown. For the easier three-fourths of the difficulty scale, best models in general outperform naïve models. The inset histogram shows the number of models registering differences in GDT_TS scores between the best model and naïve model (bins stretch 4 GDT_TS units). The most representative bin is the 0–4 GDT_TS difference (86 targets), followed by the 4–8 GDT_TS bin (67 targets).
[Figure 2.5 plot title: Best predictors' models versus template models (CASP5–8 TBM targets sorted by difficulty). Series: GDT_TS of the best predicted model; highest GDT_TS among the 20 best template models.]
2.6. FREE MODELING OF NEW FOLD PROTEINS
A quarter of the protein sequences in the contemporary databases do not appear to match any sequence pattern corresponding to an already known
  • 45. structure [23]. In such cases, template-free modeling methods must be used. Free modeling methods can be divided into two categories: structure-based de novo modeling methods and ab initio (modeling from first principles) methods. Currently, the more successful approaches are the de novo methods, which rely on the fact that although not all naturally occurring protein folds have yet been observed, on some length scale, all possible structures of fragments are known. Fragment assignment, fragment assembly, and finally selection of correct models from among many candidate structures all remain formidable challenges.
The quality of free modeling predictions has increased dramatically over the course of the CASP experiments, with most small proteins (100 residues or less) usually assigned at least the correct overall fold by a few groups. For these shorter proteins, models are typically 4–10 Å C-alpha atom RMSD from the native structure; for larger proteins, models are usually more than 10 Å away from the native structure. This level of detail is insufficient for many biomedical applications. But it is encouraging that in the last three CASPs there were examples of high-resolution accuracy (<2 Å) models for a few small proteins [29,30]. Recently, several notable cases of high-resolution structure prediction in the absence of a suitable structural template were reported in the literature [31,32].
2.7. OTHER MODELING CATEGORIES
Besides the three-dimensional protein structure prediction, CASP evaluates several other structure-related modeling categories.
The secondary structure prediction category was included in early CASP experiments. Initial substantial progress gave way to incremental improvements too small to evaluate with the amount of data collected, and so the category was dropped in 2002.
The disorder region prediction category was introduced in CASP in 2002 to address growing recognition that some regions of proteins do not adopt a single three-dimensional structure but nevertheless are involved in the signaling, regulating, or controlling functions of the protein [33]. The three most recent CASPs have shown that the field has converged and that new ideas to improve the predictions are needed [34].
Prediction of intramolecular residue–residue contacts could in principle be helpful for predicting protein structure per se as well as for inferring mutations in the proteins or distinguishing between correct and incorrect protein docking models [35]. This type of prediction is still an area of active research and has been assessed in CASP since CASP4. However, there has been no detectable progress in that period, and current methods do not appear sufficiently accurate to be of any significant use.
Many proteins contain multiple domains, and identifying domain boundaries is important not only in modeling but also in selecting constructs for protein expression.
  • 46. Assessment in this area started in CASP6. In general, approximate identification of domain boundaries is straightforward when these lie between TBM regions. There are too few challenging multi-domain template-free targets in CASP to evaluate those cases. As a result, this category will be dropped from future CASPs.
One of the primary uses of a three-dimensional model is to deduce more about the protein's function. Testing of methods for function prediction began in CASP6 [36]. Assessment in this category faced difficulties connected with the unavailability of experimental data to verify the predictions. To make the analysis more stringent, in CASP8 the category was narrowed to ligand binding site prediction.
As structure modeling has assumed a more prominent role in biology, the need to have reliable estimates of overall and detailed structure accuracy has become apparent. The necessity for an unbiased evaluation of model quality assessment methods led to the introduction of a separate category in CASP, starting in 2006. CASP quality assessment evaluation has demonstrated that at the moment the most accurate methods rely on the availability of multiple models for the same protein (called consensus-based or clustering methods) [37]. These methods are based on the observation that the more different modeling methods agree on a structure, either overall or in particular regions, the more likely that structure is correct. The best quality assessment methods can provide a ranking of overall models that is significantly correlated with accuracy, but they are not able to consistently select the best model from the entire collection of models. It has been encouraging to see some quality assessment methods showing promising results in assessing the accuracy of specific regions in a model, at times reproducing almost the exact C-alpha–C-alpha deviation along the sequence.
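The consensus idea described above can be illustrated with a small sketch. It is not any CASP group's actual method: the function names, the 2 Å cutoff, the toy coordinates, and the assumption that all models are already superimposed in a common frame are hypothetical choices made only for illustration. Each model is scored by how well, on average, its C-alpha positions agree with those of every other model.

```python
import math


def fraction_close(model_a, model_b, cutoff=2.0):
    """Fraction of equivalent C-alpha positions within `cutoff` angstroms.

    Assumes both models list the same residues in the same order and are
    already superimposed in a common frame (a simplifying assumption).
    """
    close = sum(1 for p, q in zip(model_a, model_b) if math.dist(p, q) <= cutoff)
    return close / len(model_a)


def consensus_scores(models, cutoff=2.0):
    """Score each model by its mean agreement with all other models."""
    scores = {}
    for name, coords in models.items():
        others = [c for other, c in models.items() if other != name]
        total = sum(fraction_close(coords, o, cutoff) for o in others)
        scores[name] = total / len(others)
    return scores


# Toy three-residue "models" (hypothetical coordinates, in angstroms).
models = {
    "server_A": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "server_B": [(0.2, 0.1, 0.0), (3.9, 0.3, 0.0), (7.5, 0.4, 0.0)],
    "server_C": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.0, 6.0, 0.0)],
}

scores = consensus_scores(models)
# Models that agree with the majority rank first; the outlier ranks last.
ranking = sorted(scores, key=scores.get, reverse=True)
```

The sketch also shows the limitation noted in the text: a consensus score rewards agreement, so a single model that is correct where most others are wrong would be ranked low.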
The main challenges in the quality assessment category are developing methods that can be competitive with consensus-based methods but rely on the structural and sequential features of a single model, and improving the performance of methods for determining local model accuracy.
Starting in CASP7, special attention has been given to model tertiary structure refinement. All-atom structure refinement is one of the challenges in protein structure prediction, and the development of reliable refinement procedures would help in bringing the models up to high-resolution standards. The assessment of predictions in this category suggests that while there are no methods that can consistently improve over the initial model, refinement can sometimes result in structures that are much closer to the target than the template (not a trivial task, especially for high-accuracy targets).
2.8. SERVERS IN CASP
There are now many millions of proteins for which reasonable models could be produced, and meeting such a large-scale demand requires automatic generation of models.
  • 47. Even though it is apparent that not all human expertise can be encoded in automatic servers, CASP shows that the best servers are not much worse than the best human predictors. Moreover, sometimes the difference in human-server performance is just due to the fact that human experts have more time for the modeling (and quite often base their models on initial structures obtained from servers). Under these circumstances, the importance of automatic servers for the biomedical community cannot be overestimated.
Server performance has been continuously checked by CASP starting in CASP3 (originally with the help of the Critical Assessment of Fully Automated Structure Prediction Methods [CAFASP] [38] and, since CASP6, independently). Analysis of server performance in successive CASPs [28,39] shows that the best human-expert groups in CASP still outperform the best server groups, but the gap between the best servers and the best human-expert groups is narrowing. Especially in the case of easy TBM, the progress of automated servers is impressive, with the fraction of targets where at least one server model is among the best six submitted models increasing from 35% in CASP5 to 65% in CASP6, to over 90% in CASP7, and slightly decreasing to 83% in CASP8. These statistics confirm the notion that the impact of human expertise on modeling of easy comparative targets is now marginal. In general, in both CASP7 and CASP8 servers were at least on par with humans (three or more models in the best six) for about 20% of targets, and significantly worse than the best human model for only very few targets.
2.9. MODELING CHALLENGES AND CASP INITIATIVES
Despite evident progress in protein structure modeling, many challenges still remain to be addressed. In TBM, refinement of high-accuracy models and improvement of template-to-target alignments in nontrivial cases are the major limiting factors.
In free modeling, the challenge is to predict larger proteins (over 100 residues) more reliably and to routinely generate models within 2–3 Å RMSD from the native structure for smaller proteins. The methods tested in CASP in this category have run out of steam, and new modeling techniques seem necessary.
Besides the traditional CASP categories, we have already conducted two in-between-CASP experiments to test the prediction of mutation sites and refinement of models. Assessment of model quality is currently an area of active research, and more than a dozen papers on the subject have been published since this category was introduced in CASP7 (2006). It is also planned to conduct additional in-between-CASP experiments in modeling of membrane proteins and in selecting models from decoy sets. Within the main CASP track, in CASP9 we are reviving the prediction of quaternary structure. An initiative to continuously test free modeling methods is currently underway.
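The C-alpha RMSD figures quoted in this chapter (for example, the 2–3 Å target for small proteins) can be made concrete with a minimal sketch. The function name and the toy coordinates are hypothetical, and the code assumes the model has already been optimally superimposed on the native structure; in practice a superposition step (for instance, the Kabsch algorithm) precedes the calculation.

```python
import math


def ca_rmsd(model, native):
    """Root-mean-square deviation over equivalent C-alpha atoms.

    Assumes `model` and `native` hold (x, y, z) tuples for the same residues
    in the same order, already superimposed in a common frame.
    """
    if not model or len(model) != len(native):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, native)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))


# Toy four-residue example (hypothetical coordinates, in angstroms).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0), (11.4, 4.0, 0.0)]

rmsd = ca_rmsd(model, native)  # sqrt((9 + 16) / 4) = 2.5 angstroms
```

Because squared deviations are averaged over all residues, a few badly placed residues can dominate the value, which is one reason the chapter also reports GDT_TS.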
  • 48. REFERENCES
1. M. Sela, F.H. White Jr., and C.B. Anfinsen. Reductive cleavage of disulfide bridges in ribonuclease. Science, 125(3250):691–692, 1957.
2. J. Moult et al. A large-scale experiment to assess protein structure prediction methods. Proteins, 23(3):ii–v, 1995.
3. J.M. Bujnicki et al. LiveBench-2: Large-scale automated evaluation of protein structure prediction servers. Proteins, (S5):184–191, 2001.
4. J. Janin et al. CAPRI: A critical assessment of predicted interactions. Proteins, 52(1):2–9, 2003.
5. V.A. Eyrich et al. EVA: Continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17(12):1242–1243, 2001.
6. M.G. Reese et al. Genome annotation assessment in Drosophila melanogaster. Genome Research, 10(4):483–501, 2000.
7. A. Kryshtafovych et al. New tools and expanded data analysis capabilities at the Protein Structure Prediction Center. Proteins, 69(8):19–26, 2007.
8. A. Zemla. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Research, 31(13):3370–3374, 2003.
9. A. Zemla et al. Processing and evaluation of predictions in CASP4. Proteins, (S5):13–21, 2001.
10. A. Kryshtafovych et al. CASP6 data processing and automatic evaluation at the Protein Structure Prediction Center. Proteins, 61(S7):19–23, 2005.
11. A.R. Ortiz, C.E. Strauss, and O. Olmea. MAMMOTH (matching molecular models obtained from theory): An automated method for model comparison. Protein Science, 11(11):2606–2621, 2002.
12. L. Holm et al. Searching protein structure databases with DaliLite v.3. Bioinformatics, 24(23):2780–2781, 2008.
13. J. Moult et al. Critical assessment of methods of protein structure prediction-Round VIII. Proteins, 77(S9):1–4, 2009.
14. J. Moult et al. Critical assessment of methods of protein structure prediction-Round VII. Proteins, 69(S8):3–9, 2007.
15. J. Moult et al. Critical assessment of methods of protein structure prediction-Round VI. Proteins, 61(S7):3–7, 2005.
16. J. Moult et al. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins, 53(6):334–339, 2003.
17. J. Moult et al. Critical assessment of methods of protein structure prediction (CASP): round IV. Proteins, (S5):2–7, 2001.
18. J. Moult et al. Critical assessment of methods of protein structure prediction (CASP): Round III. Proteins, (S3):2–6, 1999.
19. J. Moult et al. Critical assessment of methods of protein structure prediction (CASP): Round II. Proteins, (S1):2–6, 1997.
20. K.M. Misura and D. Baker. Progress and challenges in high-resolution refinement of protein structure models. Proteins, 59(1):15–29, 2005.
21. H. Lei and Y. Duan. Protein folding and unfolding by all-atom molecular dynamics simulations. Methods in Molecular Biology, 443:277–295, 2008.
  • 49.
22. Y. He et al. Exploring the parameter space of the coarse-grained UNRES force field by random search: Selecting a transferable medium-resolution force field. Journal of Computational Chemistry, 30(13):2127–2135, 2009.
23. T. Schwede et al. Outcome of a workshop on applications of protein models in biomedical research. Structure, 17(2):151–159, 2009.
24. J. Kopp et al. Assessment of CASP7 predictions for template-based modeling targets. Proteins, 69(S8):38–56, 2007.
25. R.J. Read and G. Chavali. Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins, 69(S8):27–37, 2007.
26. D. Cozzetto et al. Evaluation of template-based models in CASP8 with standard measures. Proteins, 77(S9):18–28, 2009.
27. D. Keedy et al. The other 90% of the protein: Assessment beyond the C-alphas for CASP8 template-based and high-accuracy models. Proteins, 77(S9):29–49, 2009.
28. A. Kryshtafovych, K. Fidelis, and J. Moult. Progress from CASP6 to CASP7. Proteins, 69(S8):194–207, 2007.
29. P. Bradley et al. Free modeling with Rosetta in CASP6. Proteins, 61(S7):128–134, 2005.
30. R. Das et al. Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins, 69(S8):118–128, 2007.
31. P. Bradley, K.M. Misura, and D. Baker. Toward high-resolution de novo structure prediction for small proteins. Science, 309(5742):1868–1871, 2005.
32. B. Qian et al. High-resolution structure prediction and the crystallographic phase problem. Nature, 450(7167):259–264, 2007.
33. P. Radivojac et al. Intrinsic disorder and functional proteomics. Biophysical Journal, 92(5):1439–1456, 2007.
34. L. Bordoli, F. Kiefer, and T. Schwede. Assessment of disorder predictions in CASP7. Proteins, 69(S8):129–136, 2007.
35. J.M. Izarzugaza et al. Assessment of intramolecular contact predictions for CASP7. Proteins, 69(S8):152–158, 2007.
36. S. Soro and A. Tramontano. The prediction of protein function at CASP6. Proteins, 61(S7):201–213, 2005.
37. D. Cozzetto et al. Assessment of predictions in the model quality assessment category. Proteins, 69(S8):175–183, 2007.
38. D. Fischer et al. CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins, (S3):209–217, 1999.
39. J.N. Battey et al. Automated server predictions in CASP7. Proteins, 69(S8):68–82, 2007.
  • 50. CHAPTER 3
THE PROTEIN STRUCTURE INITIATIVE
ANDRAS FISER
Department of Systems and Computational Biology, Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY
ADAM GODZIK
Program in Bioinformatics and Systems Biology, Sanford-Burnham Medical Research Institute, La Jolla, CA
CHRISTINE ORENGO
Department of Structural and Molecular Biology, University College London, London, UK
BURKHARD ROST
Department of Biochemistry and Molecular Biophysics, Center for Computational Biology, Columbia University, New York, NY
Introduction to Protein Structure Prediction: Methods and Algorithms, Edited by Huzefa Rangwala and George Karypis. Copyright © 2010 John Wiley & Sons, Inc.
3.1. BACKGROUND, RATIONALE, AND HISTORY
High-throughput sequencing projects began to produce an unprecedented amount of genomic information in the mid-1990s. Subsequently a strong interest emerged in even more ambitious high-throughput experiments that would assign 3D shapes to all known proteins in all genomes. Three-dimensional structures of proteins are often more informative than their sequences alone because interactions take place in 3D space and because patterns formed by residues within the same protein that are far apart in sequence often form a recognizable motif in space. The large-scale efforts to target and solve protein structures were dubbed structural genomics projects and were launched worldwide with a variety of different focuses in Europe, Japan, and the United States.
  • 51. The U.S. efforts were spearheaded by the National Institutes of Health-National Institute of General Medical Sciences (NIH-NIGMS) and named the Protein Structure Initiative (PSI).
The scientific rationale for structural genomics is the recognition that the many millions of known protein sequences seem to cluster into far fewer structural families or folds. Estimates of the anticipated number of protein folds range between 1000 and 20,000 [1–4]. In addition, the size distribution of protein folds is very uneven. Domain superfamilies from the 12 most highly populated folds (superfolds; [5]) cover approximately half of a typical genome, with such representatives as the Rossmann, TIM, OB, or Ig fold [6]. On the other hand, many thousands of much less populated domain superfamilies compose the rest of the genomes. The rationale behind structural genomics is that a few thousand carefully selected and solved protein structures will provide a means to structurally characterize, at least in part, up to 80% of all existing sequences by using the solved structures as templates and employing comparative protein structure modeling techniques to model the rest of the proteins in each superfamily [4,7]. This rationale sheds light on the two most critical computational aspects of structural genomics efforts: target selection and structure modeling. While target selection is chiefly responsible for efficiently mapping the fold universe and properly directing efforts, structure modeling is the actual tool that provides the 3D characterization for more than 99% of all proteins.
The underlying hypothesis of PSI is that high-throughput pipelines could be developed to produce high-quality protein structure representatives of large protein families with little or no prior structural representation. This goal sets the PSI apart from traditional structural biology approaches, which normally take a much more highly focused and pragmatic approach through study of one or a limited number of macromolecules via "hypothesis-driven" research.
After 1 year of preparation, a pilot phase of PSI started in 2000, establishing nine PSI centers around the United States. These centers were charged with the initial goals of setting up automated pipelines for structural genomics projects and starting to produce 3D structures with increasing efficiency, which includes both an increased number of solved experimental structures and a reduction of the cost per structure. While the overall goal of PSI was to explore the fold universe and to target structurally uncharacterized families, various centers focused this global aim within more specific biologically defined frameworks, for example, targeting new folds in specific genomes of biomedical interest, such as Thermotoga maritima or Mycobacterium tuberculosis, or targeting human and other eukaryotic proteins, or targeting proteins involved in metabolic pathways and cancer. The pilot phase of PSI was a success, with an unprecedented number of 1300 structures solved, already making an impact on the composition of the Protein Data Bank (PDB), solving far more structures than all conventional structural biology labs solved with a comparable amount of funding.
  • 52. Thereby, PSI-1 achieved its goals, namely to demonstrate the feasibility of efficient pipelines and to significantly reduce the cost of solving protein structures.
The second, so-called production phase of PSI started in 2005, funding four production centers and six specialized research centers. The four large-scale centers (Joint Center for Structural Genomics [JCSG; http://www.jcsg.org]; Midwest Center for Structural Genomics [MCSG; http://www.mcsg.anl.gov]; New York SGX Research Center for Structural Genomics [NYSGXRC; http://www.nysgrc.org]; and Northeast Structural Genomics Consortium [NESG; http://www.nesg.org]) were all selected for their proven high-throughput capability and were charged to synchronize target selection efforts, providing a concerted effort to uncover fold space. Meanwhile, specialized centers focused on addressing technological problems and known bottlenecks that include solving membrane proteins, proteins in higher eukaryotes (especially in humans), and small protein complexes, and/or developing technology for portability, applicability, and scalability in general.
In this chapter we will review the current state of PSI efforts with a focus on target selection and the impact on coverage of the fold universe.
3.2. OVERVIEW, PIPELINE, AND RESOURCES
Structural genomics centers established pipelines where all steps of the production are highly automated. The pipelines include both experimental and computational modules and typically contain the following steps: target selection (and target tracking); protein production (including cloning, expression, and purification); protein characterization (e.g., solubility tests); Heteronuclear Single Quantum Coherence (HSQC) or nuclear magnetic resonance (NMR) assignment, or crystallization (if the structure determination technology used is NMR or X-ray crystallography, respectively); structure determination (by X-ray crystallography or NMR spectroscopy); and modeling of related protein sequences. While the pipelines among various centers differ in their details and in the specific experimental technologies employed, the previously described major steps are common. These common steps (e.g., target selection, cloning, expression, solubility experiments, purification, biophysical characterization, crystallization, diffraction or NMR assignments, solved structure) are also reflected in the databases (Structural Genomics Target Search [TARGETDB; http://targetdb.pdb.org/] and Protein Expression Purification and Crystallization DataBase [PEPCDB; http://pepcdb.pdb.org/]) that track experimental progress at all PSI centers.
Additional resources of PSI efforts include the PSI Materials Repository (http://psimr.asu.edu), which collects all of the PSI center clones and processes them for general distribution. The PSI Knowledgebase (http://kb.psi-structuralgenomics.org/) serves as the public face of PSI efforts, providing all available technologies, experimental structures, and homology models to the community. A partnership with the Nature Gateway provides exposure through highlight publication for the PSI and its products.
  • 53.
3.3. TARGET SELECTION AND TARGET CATEGORIES
3.3.1. Centralized Target Selection
PSI applies the structural genomics paradigm at several levels of biological investigation: (i) structural coverage of the protein universe, (ii) structural coverage of proteins targeted in collections of organisms (i.e., metagenomics), (iii) structural coverage of proteins targeted in specific organisms or organelles (e.g., T. maritima), (iv) structural coverage of systems of cofunctioning proteins and protein networks (e.g., protein phosphatases), (v) analysis of structure/function diversity across a large domain family, and (vi) structural analysis of membrane proteins. Target selection for the large centers in PSI is overseen by the BioInformatics Groups (BIG4). Each center has one representative in this group; these representatives are the authors of this chapter. Targets in PSI can be assigned to different categories. About 70% of all targets are centrally compiled and selected among the four centers by BIG. Another 15% of targets are solved through collaborations with the research community, while the remaining 15% of targets are picked by each center individually because of their biomedical relevance. Target selection is centralized among the four production centers to increase efficiency, which is achieved by avoiding overlaps among the centers and by balancing efforts across the various target lists available.
3.3.2. Modeling Families
PSI targets protein "families" because of their predicted structural "novelty." These subjective qualifiers require a more precise, operational definition. The definition of a protein family is subjective because it can range from a low-resolution definition, such as the fold of a protein, to a high-resolution definition, such as any structural novelty within the same fold family (for example, a novel sub-domain or a longer loop insertion).
In PSI the operational concept is to map the protein universe into sequence clusters, so-called Modeling Families, at a resolution where the experimentally solved proteins can serve as suitable templates for comparative protein structure modeling of similar proteins. A general guideline relies on the large-scale study that concluded that if a template-target pair shares more than 30% sequence identity, a comparative model can be built whose accuracy is expected to be within 2 Å root-mean-square deviation (RMSD) of the native structure [8]. Consequently, within the PSI a practical definition of Modeling Families refers to groups of sequences where any two sequences in a Modeling Family share more than 30% sequence identity with one another (Fig. 3.1). Therefore, if the structure of at least one member in a Modeling Family is solved experimentally, all other family members can be accurately modeled.
  • 54. With the continuously improving technologies in comparative modeling, this threshold, or in general the definition of the concept of Modeling Family, can be revised, which will have an impact on the number of Modeling Families. In the PSI the category of biomedical targets does not necessarily follow this rule, as these proteins have a proven strong potential for immediate biomedical application that necessitates a higher-resolution approach. For these targets very detailed structural information may be
FIGURE 3.1 Flowchart of a typical PSI pipeline.
[Flowchart: processes and decision points running from target selection (inputs: sequence families, databases, physical properties, functional and expression information, biochemical pathways; coordination among centres) through cloning of coding sequences, expression, solubility, purification, quality assurance and biophysical analyses, crystallization trials or NMR, data collection (MAD on SeMet crystals, or MIR), phasing, model building and refinement, to deposition of structures in PDB, deposition of homology models in MoDBase, and annotation of structures in SPD. Failed steps lead to remediation (other expression systems, refolding, further purification, another family member) or to abandoning the target.]
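The operational definition of a Modeling Family given in Section 3.3.2 (every pair of members shares more than 30% sequence identity) can be sketched as a simple grouping procedure. This is an illustration only, not the PSI target selection software: the function names and toy sequences are hypothetical, identity is computed naively over pre-aligned sequences of equal length, and the greedy assignment depends on input order.

```python
def percent_identity(seq_a, seq_b):
    """Percent of identical positions in two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y and x != "-")
    return 100.0 * matches / len(seq_a)


def modeling_families(sequences, threshold=30.0):
    """Greedily group sequences so that every pair within a family exceeds `threshold`.

    A sequence joins the first family in which it shares more than `threshold`
    percent identity with every existing member; otherwise it founds a new family.
    """
    families = []
    for name, seq in sequences.items():
        for family in families:
            if all(percent_identity(seq, sequences[m]) > threshold for m in family):
                family.append(name)
                break
        else:
            families.append([name])
    return families


# Toy pre-aligned sequences (hypothetical, 10 residues each).
sequences = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYLAKQR",  # 90% identical to p1
    "p3": "MKTGGGGGGG",  # exactly 30% identical to p1 and p2, so not above threshold
    "p4": "WWWGGGGGGG",  # 70% identical to p3
}

families = modeling_families(sequences)
# One solved structure per family would serve as a template for its other members.
structures_needed = len(families)
```

Under this definition, raising or lowering the threshold changes the number of families, which is the effect the text attributes to improving comparative modeling technology.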
  • 56. the legend, 164; its pictorial brilliance, 165; its influence on later English poetry, 165. Examiner, The (Leigh Hunt’s), 25. Faerie Queene (Spenser’s), 12, 13, 35. Faithful Shepherdess (Fletcher’s), 95. Fanny, Lines to, 134. Feast of the Poets (Leigh Hunt’s), 32. Fletcher, 95. Foliage (Leigh Hunt’s), 73. Genius, births of, 1. Gisborne, Letter to Maria (Shelley’s), 30. Goethe, 154. Grasshopper and Cricket, 35. Gray, 113. Greece, Keats’ love of, 58, 77, 154. Guy Mannering (Scott’s), 115. Hammond, Mr, 11, 14.
  • 57. Hampstead, 72, 77. Haslam, William, 45, 212 (note). Haydon, 3, 40, 65, 68, 78, 137, 138, 191, 214. Hazlitt, William, 83, 84. History of his own Time (Burnet’s), 10. Holmes, Edward, 8. Holy Living and Dying (Jeremy Taylor’s), 206. Homer, On first looking into Chapman’s (Sonnet), 23-24. Hood, 219. Hope, address to, 21. Horne, R. H., 11. Houghton, Lord, 75, 211-213. Hunt, John, 25. Hunt, Leigh, 22, 24, 25, 32, 35, 39, 49, 51, 68, 72, 78, 196. Hyperion, 129, 133, 144; its purpose, 152; one of the grandest poems of our language, 157; the influences of Paradise Lost on it, 158; its blank verse compared with Milton’s, 158; its elemental grandeur, 160; remodelling of it, 185 seq.;
  • 58. description of the changes, 186-187; special interest of the poem, 187. Imitation of Spenser (Keats’ first lines), 14, 20. Indolence, Ode on, 174-175. Isabella, or the Pot of Basil, 86; source of its inspiration, 148; minor blemishes, 149; its Italian metre, 149; its conspicuous power and charm, 149; description of its beauties, 151. Isle of Wight, 67. Jennings, Mrs, 5, 11. Jennings, Capt. M. J., 7. Joseph and his Brethren (Wells’), 45. Kean, 81. Keats, John, various descriptions of, 7, 8, 9, 46, 47, 76, 136, 224; birth, 2; education at Enfield, 4; death of his father, 5; school-life, 5-9; his studious inclinations, 10; death of his mother, 10; leaves school at the age of fifteen, 11; is apprenticed to a surgeon, 11;
  • 59. finishes his school-translation of the Æneid, 12; reads Spenser’s Epithalamium and Faerie Queene, 12; his first attempts at composition, 13; goes to London and walks the hospitals, 14; his growing passion for poetry, 15; appointed dresser at Guy’s Hospital, 16; his last operation, 16; his early life in London, 18; his early poems, 20 seq.; his introduction to Leigh Hunt, 24; Hunt’s great influence over him, 26 seq.; his acquaintance with Shelley, 38; his other friends, 40-45; personal characteristics, 47-48; goes to live with his brothers in the Poultry, 48; publication of his first volume of poems, 65; retires to the Isle of Wight, 66; lives at Carisbrooke, 67; changes to Margate, 68; money troubles, 70; spends some time at Canterbury, 71; receives first payment in advance for Endymion, 71; lives with his two brothers at Hampstead, 71; works steadily at Endymion, 71-72; makes more friends, 73; writes part of Endymion at Oxford, 76; his love for his sister Fanny, 77; stays at Burford Bridge, 80; goes to the ‘immortal dinner,’ 82; he visits Devonshire, 87; goes on a walking tour in Scotland with Charles Brown, 113; crosses over to Ireland, 116; returns to Scotland and visits Burns’ country, 118; sows there the seeds of consumption, 120; returns to London, 120; is attacked in Blackwood’s Magazine and the Quarterly Review,
  • 60. 121; Lockhart’s conduct towards him, 122; death of his young brother Tom, 128; goes to live with Charles Brown, 128; falls in love, 130-131; visits friends in Chichester, 133; suffers with his throat, 133; his correspondence with his brother George, 139; goes to Shanklin, 143; collaborates with Brown in writing Otho, 143; goes to Winchester, 144; returns again to London, 146; more money troubles, 146; determines to make a living by journalism, 146; lives by himself, 146; goes back to Mr Brown, 181; Otho is returned unopened after having been accepted, 182; want of means prevents his marriage, 190; his increasing illness, 191 seq.; temporary improvement in his health, 194; publishes another volume of poems, 196; stays with Leigh Hunt’s family, 197; favourable notice in the Edinburgh Review, 197; lives with the family of Miss Brawne, 198; goes with Severn to spend the winter in Italy, 199; the journey improves his health, 200; writes his last lines, 201; stays for a time at Naples, 203; goes on to Rome, 203-204; further improvement in his health, 205; sudden and last relapse, 205; he is tenderly nursed by his friend Severn, 206; speaks of himself as already living a ‘posthumous life,’ 207; grows worse and dies, 208; various tributes to his memory, 214.
  • 61. His genius awakened by the Faerie Queene, 13; influence of other poets on him, 21; experiments in language, 21, 64, 147, 169; employment of the ‘Heroic’ couplet, 27, 30; element and spirit of his own poetry, 50; experiments in metre, 52; studied musical effect of his verse, 55; his Grecian spirit, 58, 77, 95, 114, 154; view of the aims and principles of poetry, 61; imaginary dependence on Shakspere, 69; thoughts on the mystery of Evil, 88; puns, 72, 202; his poems Greek in idea, English in manner, 96; his poetry a true spontaneous expression of his mind, 110; power of vivifying, 161; verbal licenses, 169; influence on subsequent poets, 218; felicity of phrase, 219. Personal characteristics: Celtic temperament, 3, 58, 70; affectionate nature, 6, 7, 9, 10, 77; morbid temperament, 6, 70, 211; lovable disposition, 6, 8, 19, 212, 213; temper, 7, 9, 233; personal beauty, 8; penchant for fighting, 8, 9, 72; studious nature, 9, 112; humanity, 39, 89, 114-115; sympathy and tenderness, 47, 213; eyes, description of, 46, 207, 224; love of nature, 47, 55-56;; voice, 47; desire of fame, 60, 125, 141, 207; natural sensibility to physical and spiritual spell of moonlight, 95;
  • 62. highmindedness, 125-126; love romances, 127, 130-134, 180-181, 197, 200, 203, 212; pride and sensitiveness, 211; unselfishness, 213, 214; instability, 215. Various descriptions of, 7, 8, 9, 46, 47, 76, 136, 224. Keats, Admiral Sir Richard, 7. Keats, Fanny (Mrs Llanos), 77. Keats, Mrs (Keats’ mother), 5, 10. Keats, George, 90, 113, 192, 193, 210. Keats, Thomas (Keats’ father), 2, 5. Keats, Tom, 6, 127. King Stephen, 179. ‘Kirk-men,’ 116-117. La Belle Dame sans Merci, 165, 166, 218; origin of the title, 165; a story of the wasting power of love, 166; description of its beauties, 166. Lamb, Charles, 26, 82, 83. Lamia, 143; its source, 167; versification, 167; the picture of the serpent woman, 168;
  • 63. Keats’ opinion of the Poem, 168. Landor, 75. Laon and Cythna, 76. Letters, extracts, etc., from Keats’, 66, 67, 68, 69, 77, 78, 79, 81, 85, 87, 88, 89, 90, 91, 114, 116-117, 118, 126, 127, 129, 130, 134, 137, 139, 141, 145, 146, 157, 181, 182, 190, 194-195, 200, 203, 226. ‘Little Keats,’ 19. Lockhart, 33, 122, 123. London Magazine, 71. Mackereth, George Wilson, 18. Madeline, 162 seq. ‘Maiden-Thought,’ 88, 114. Man about Town (Webb’s), 38. Man in the Moon (Drayton’s), 93. Margate, 68. Mathew, George Felton, 19. Meg Merrilies, 115-116. Melancholy, Ode on, 175. Milton, 51, 52, 54, 88.
  • 64. Monckton, Milnes, 211. Moore, 65. Morning Chronicle, The, 124. Mother Hubbard’s Tale (Spenser’s), 31. Mythology, Greek, 10, 58, 152, 153. Naples, 203. Narensky (Brown’s), 74. Newmarch, 19. Nightingale, Ode to a, 136, 175, 218. Nymphs, 73. Odes, 21, 137, 145, 170-171, 172, 174, 175, 177, 218. Orion, 11. Otho, 143, 144, 180, 181. Oxford, 75, 77. Oxford Herald, The, 122. Pan, Hymn to, 83.
  • 65. Pantheon (Tooke’s), 10. Paradise Lost, 88, 152, 154, 158. Patriotism, 115. Peter Corcoran (Reynolds’), 36. Plays, 178, 179, 181, 182. Poems (Keats’ first volume), faint echoes of other poets in them, 51; their form, 52; their experiments in metre, 52; merely poetic preludes, 53; their rambling tendency, 53; immaturity, 60; attractiveness, 61; characteristic extracts, 63; their moderate success, 65-66. Poetic Art, Theory and Practice, 61, 64. Poetry, joys of, 55; principle and aims of, 61; genius of, 110. Polymetis (Spence’s), 10. Pope, 19, 29, 30. ‘Posthumous Life,’ 207. Prince Regent, 25. Proctor, Mrs, 47.
  • 66. Psyche, Ode to, 136, 171, 172. Psyche (Mrs Tighe’s), 21. Quarterly Review, 121, 124. Rainbow (Campbell’s), 170. Rawlings, William, 5. Reynolds, John Hamilton, 36, 211, 214. Rice, James, 37, 142. Rimini, Story of, 27, 30, 31, 35. Ritchie, 82. Rome, 204. Rossetti, 220. Safie (Reynolds’), 36. Scott, Sir Walter, 1, 33, 65, 115, 123, 124. Scott, John, 124. Sculpture, ancient, 136. Sea-Sonnet, 67. Severn, Joseph, 45, 72, 135, 191, 199 seq.
  • 67. Shakspere, 67, 69. Shanklin, 67, 143. Shelley, 16, 32, 38, 56, 85, 110, 199, 203, 209. Shenstone, 21. Sleep and Poetry, 52, 60, 61, 109. Smith, Horace, 33, 81. Sonnets, 22, 23, 43, 48, 49, 57, 201. Specimen of an Induction to a Poem, 52. Spenser, 19, 20, 21, 31, 35, 54, 55. Stephens, Henry, 18-20. Surrey Institution, 84. Taylor, Mr, 71, 81, 126, 144, 146, 206, 211. Teignmouth, 87. Tennyson, 218. Thomson, 21. Urn, Ode on a Grecian, 136, 172-174.
  • 68. Vision, The, 187, 193 (see Hyperion). Webb, Cornelius, 38. Wells, Charles, 45. Wilson, 33. Winchester, 143-145. Windermere, 113, 114. Wordsworth, 1, 44, 46, 56, 64, 82, 83, 158, 219. CAMBRIDGE PRINTED BY JOHN CLAY, M.A. AT THE UNIVERSITY PRESS. English Men of Letters. Edited by JOHN MORLEY. Popular Edition. Crown 8vo. Paper Covers, 1s.; Cloth, 1s. 6d. each. Pocket Edition. Fcap. 8vo. Cloth. 1s. net each. Library Edition. Crown 8vo. Gilt tops. Flat backs. 2s. net each.
  • 69. ADDISON. By W. J. COURTHOPE. BACON. By DEAN CHURCH. BENTLEY. By Sir RICHARD JEBB. BUNYAN. By J. A. FROUDE. BURKE. By JOHN MORLEY. BURNS. By Principal SHAIRP. BYRON. By Professor NICHOL. CARLYLE. By Professor NICHOL. CHAUCER. By Dr. A. W. WARD. COLERIDGE. By H. D. TRAILL. COWPER. By GOLDWIN SMITH. DEFOE. By W. MINTO. HUME. By Professor HUXLEY, F.R.S. JOHNSON. By Sir LESLIE STEPHEN, K.C.B. KEATS. By Sir SIDNEY COLVIN. LAMB, CHARLES. By Canon AINGER. LANDOR. By Sir SIDNEY COLVIN. LOCKE. By THOMAS FOWLER. MACAULAY. By J. C. MORISON. MILTON. By MARK PATTISON. POPE. By Sir LESLIE STEPHEN, K.C.B. SCOTT. By R. H. HUTTON. SHELLEY. By J. A. SYMONDS.
  • 70. DE QUINCEY. By Professor MASSON. DICKENS. By Dr. A. W. WARD. DRYDEN. By Professor SAINTSBURY. FIELDING. By AUSTIN DOBSON. GIBBON. By J. C. MORISON. GOLDSMITH. By W. BLACK. GRAY. By EDMUND GOSSE. HAWTHORNE. By HENRY JAMES. SHERIDAN. By Mrs. OLIPHANT. SIDNEY. By J. A. SYMONDS. SOUTHEY. By Professor DOWDEN. SPENSER. By Dean CHURCH. STERNE. By H. D. TRAILL. SWIFT. By Sir LESLIE STEPHEN, K.C.B. THACKERAY. By ANTHONY TROLLOPE. WORDSWORTH. By F. W. H. MYERS. English Men of Letters. NEW SERIES. Crown 8vo. Gilt tops. Flat backs. 2s. net each. MATTHEW ARNOLD. By Herbert W. Paul.
  • 71. JANE AUSTEN. By F. Warre Cornish. SIR THOMAS BROWNE. By Edmund Gosse. BROWNING. By G. K. Chesterton. FANNY BURNEY. By Austin Dobson. CRABBE. By Alfred Ainger. MARIA EDGEWORTH. By the Hon. Emily Lawless. GEORGE ELIOT. By Sir Leslie Stephen, K.C.B. EDWARD FITZGERALD. By A. C. Benson. HAZLITT. By Augustine Birrell, K.C. HOBBES. By Sir Leslie Stephen, K.C.B. ANDREW MARVELL. By Augustine Birrell, K.C. THOMAS MOORE. By Stephen Gwynn. WILLIAM MORRIS. By Alfred Noyes. WALTER PATER. By A. C. Benson. RICHARDSON. By Austin Dobson. ROSSETTI. By A. C. Benson. RUSKIN. By Frederic Harrison. SHAKESPEARE. By Walter Raleigh. ADAM SMITH. By Francis W. Hirst. SYDNEY SMITH. By George W. E. Russell. JEREMY TAYLOR. By Edmund Gosse. TENNYSON. By Sir Alfred Lyall.
  • 72. JAMES THOMSON. By G. C. Macaulay. In preparation. MRS. GASKELL. By Clement Shorter. BEN JONSON. By Prof. Gregory Smith. MACMILLAN AND CO., LTD., LONDON. Footnotes: [1] See Appendix, p. 221. [2] Ibid. [3] John Jennings died March 8, 1805. [4] Rawlings v. Jennings. See below, p. 138, and Appendix, p. 221. [5] Captain Jennings died October 8, 1808. [6] Houghton MSS. [7] Rawlings v. Jennings. See Appendix, p. 221. [8] Mrs Alice Jennings was buried at St Stephen’s, Coleman Street, December 19, 1814, aged 78. (Communication from the Rev. J. W. Pratt, M.A.) [9] I owe this anecdote to Mr Gosse, who had it direct from Horne.
  • 73. [10] Houghton MSS. [11] A specimen of such scribble, in the shape of a fragment of romance narrative, composed in the sham Old-English of Rowley, and in prose, not verse, will be found in The Philosophy of Mystery, by W. C. Dendy (London, 1841), p. 99, and another, preserved by Mr H. Stephens, in the Poetical Works, ed. Forman (1 vol. 1884), p. 558. [12] See Appendix. [13] See C. L. Feltoe, Memorials of J. F. South (London, 1884), p. 81. [14] Houghton MSS. See also Dr B. W. Richardson in the Asclepiad, vol. i. p. 134. [15] Houghton MSS. [16] What, for instance, can be less Spenserian and at the same time less Byronic than— “For sure so fair a place was never seen Of all that ever charm’d romantic eye”? [17] See Appendix, p. 222. [18] See Appendix, p. 223. [19] See particularly the Invocation to Sleep in the little volume of Webb’s poems published by the Olliers in 1821. [20] See Appendix, p. 223. [21] See Praeterita, vol. ii. chap. 2. [22] See Appendix, p. 224. [23] Compare Chapman, Hymn to Pan:— “the bright-hair’d god of pastoral, Who yet is lean and loveless, and doth owe,
  • 74. By lot, all loftiest mountains crown’d with snow, All tops of hills, and cliffy highnesses, All sylvan copses, and the fortresses Of thorniest queaches here and there doth rove, And sometimes, by allurement of his love, Will wade the wat’ry softnesses.” [24] Compare Wordsworth:— “Bees that soar for bloom, High as the highest peak of Furness Fells, Will murmur by the hour in foxglove bells.” Is the line of Keats an echo or merely a coincidence? [25] Mr W. T. Arnold in his Introduction (p. xxvii) quotes a parallel passage from Leigh Hunt’s Gentle Armour as an example of the degree to which Keats was at this time indebted to Hunt: forgetting that the Gentle Armour was not written till 1831, and that the debt in this instance is therefore the other way. [26] See Appendix, p. 220. [27] The facts and dates relating to Brown in the above paragraph were furnished by his son, still living in New Zealand, to Mr Leslie Stephen, from whom I have them. The point about the Adventures of a Younger Son is confirmed by the fact that the mottoes in that work are mostly taken from the Keats MSS. then in Brown’s hands, especially Otho. [28] Houghton MSS. [29] See Appendix, p. 224. [30] See Appendix, p. 225. [31] See Appendix, p. 225. [32] In the extract I have modernized Drayton’s spelling and endeavoured to mend his punctuation: his grammatical constructions
  • 75. are past mending. [33] Mrs Owen was, I think, certainly right in her main conception of an allegoric purpose vaguely underlying Keats’s narrative. [34] Lempriere (after Pausanias) mentions Pæon as one of the fifty sons of Endymion (in the Elean version of the myth): and in Spenser’s Faerie Queene there is a Pæana—the daughter of the giant Corflambo in the fourth book. Keats probably had both of these in mind when he gave Endymion a sister and called her Peona. [35] Book 1, Song 4. The point about Browne has been made by Mr W. T. Arnold. [36] The following is a fair and characteristic enough specimen of Chamberlayne:— “Upon the throne, in such a glorious state As earth’s adored favorites, there sat The image of a monarch, vested in The spoils of nature’s robes, whose price had been A diadem’s redemption; his large size, Beyond this pigmy age, did equalize The admired proportions of those mighty men Whose cast-up bones, grown modern wonders, when Found out, are carefully preserved to tell Posterity how much these times are fell From nature’s youthful strength.” [37] See Appendix, p. 226. [38] Houghton MSS. [39] See Appendix, p. 227. [40] Severn in Houghton MSS. [41] Houghton MSS.
  • 76. [42] Dilke (in a MS. note to his copy of Lord Houghton’s Life and Letters, ed. 1848) states positively that Lockhart afterwards owned as much; and there are tricks of style, e.g. the use of the Spanish Sangrado for doctor, which seem distinctly to betray his hand. [43] Leigh Hunt at first believed that Scott himself was the writer, and Haydon to the last fancied it was Scott’s faithful satellite, the actor Terry. [44] Severn in the Atlantic Monthly, Vol. xi., p. 401. [45] See Preface, p. viii. [46] See Appendix, p. 227. [47] Houghton MSS. [48] The house is now known as Lawn Bank, the two blocks having been thrown into one, with certain alterations and additions which in the summer of 1885 were pointed out to me in detail by Mr William Dilke, the then surviving brother of Keats’s friend. [49] See Appendix, p. 227. [50] See Appendix, p. 228. [51] Decamerone, Giorn., iv. nov. 5. A very different metrical treatment of the same subject was attempted and published, almost simultaneously with that of Keats, by Barry Cornwall in his Sicilian Story (1820). Of the metrical tales from Boccaccio which Reynolds had agreed to write concurrently with Keats (see above, p. 86), two were finished and published by him after Keats’s death in the volume called A Garden of Florence (1821). [52] As to the date when Hyperion was written, see Appendix, p. 228: and as to the error by which Keats’s later recast of his work has been taken for an earlier draft, ibid., p. 230. [53] If we want to see Greek themes treated in a Greek manner by predecessors or contemporaries of Keats, we can do so—though
  • 77. only on a cameo scale—in the best idyls of Chénier in France, as L’Aveugle or Le Jeune Malade, or of Landor in England, as the Hamadryad or Enallos and Cymodamia; poems which would hardly have been written otherwise at Alexandria in the days of Theocritus. [54] We are not surprised to hear of Keats, with his instinct for the best, that what he most liked in Chatterton’s work was the minstrel’s song in Ælla, that fantasia, so to speak, executed really with genius on the theme of one of Ophelia’s songs in Hamlet. [55] A critic, not often so in error, has contended that the deaths of the beadsman and Angela in the concluding stanza are due to the exigencies of rhyme. On the contrary, they are foreseen from the first: that of the beadsman in the lines, “But no—already had his death-bell rung; The joys of all his life were said and sung;” that of Angela where she calls herself “A poor, weak, palsy-stricken, churchyard thing, Whose passing bell may ere the midnight toll.” [56] See Appendix, p. 229. [57] Chartier was born at Bayeux. His Belle Dame sans Merci is a poem of over eighty stanzas, the introduction in narrative and the rest in dialogue, setting forth the obduracy shown by a lady to her wooer, and his consequent despair and death.—For the date of composition of Keats’s poem, see Appendix, p. 230. [58] This has been pointed out by my colleague Mr A. S. Murray: see Forman, Works, vol. iii. p. 115, note; and W. T. Arnold, Poetical Works, &c., p. xxii, note. [59] Houghton MSS. [60] “He never spoke of any one,” says Severn, (Houghton MSS.,) “but by saying something in their favour, and this always so
  • 78. agreeably and cleverly, imitating the manner to increase your favourable impression of the person he was speaking of.” [61] See Appendix, p. 230. [62] Auctores Mythographi Latini, ed. Van Staveren, Leyden, 1742. Keats’s copy of the book was bought by him in 1819, and passed after his death into the hands first of Brown, and afterwards of Archdeacon Bailey (Houghton MSS.). The passage about Moneta which had wrought in Keats’s mind occurs at p. 4, in the notes to Hyginus. [63] Mrs Owen was the first of Keats’s critics to call attention to this passage, without, however, understanding the special significance it derives from the date of its composition. [64] Houghton MSS. [65] See below, p. 193, note 2. [66] “Interrupted,” says Brown oracularly in Houghton MSS., “by a circumstance which it is needless to mention.” [67] This passing phrase of Brown, who lived with Keats in the closest daily companionship, by itself sufficiently refutes certain statements of Haydon. But see Appendix, p. 232. [68] A week or two later Leigh Hunt printed in the Indicator a few stanzas from the Cap and Bells, and about the same time dedicated to Keats his translation of Tasso’s Amyntas, speaking of the original as “an early work of a celebrated poet whose fate it was to be equally pestered by the critical and admired by the poetical.” [69] See Crabb Robinson. Diaries, Vol. II. p. 197, etc. [70] See Appendix, p. 233. [71] Houghton MSS. In both the Autobiography and the Correspondence the passage is amplified with painful and probably not trustworthy additions.
  • 79. [72] I have the date of sailing from Lloyd’s, through the kindness of the secretary, Col. Hozier. For the particulars of the voyage and the time following it, I have drawn in almost equal degrees from the materials published by Lord Houghton, by Mr Forman, by Severn himself in Atlantic Monthly, Vol. xi. p. 401, and from the unpublished Houghton and Severn MSS. [73] Severn, as most readers will remember, died at Rome in 1879, and his remains were in 1882 removed from their original burying-place to a grave beside those of Keats in the Protestant cemetery near the pyramid of Gaius Cestius. [74] Haslam, in Severn MSS. [75] Severn MSS. [76] Houghton MSS. [77] Ibid. [78] Houghton MSS. [79] Ibid.