Bioinformatics for Comparative Proteomics 1st Edition Chuming Chen
Methods in Molecular Biology™
Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
For other titles published in this series, go to
www.springer.com/series/7651
Bioinformatics for Comparative
Proteomics
Edited by
Cathy H. Wu
Department of Computer and Information Sciences,
Center for Bioinformatics and Computational Biology,
University of Delaware, Newark, DE, USA
Chuming Chen
Department of Computer and Information Sciences,
Center for Bioinformatics and Computational Biology,
University of Delaware, Newark, DE, USA
Editors
Cathy H. Wu, Ph.D.
Center for Bioinformatics
and Computational Biology
University of Delaware
15 Innovation Way, Suite 205
Newark, DE 19711
USA
wuc@dbi.udel.edu
Chuming Chen, Ph.D.
Center for Bioinformatics
and Computational Biology
University of Delaware
15 Innovation Way, Suite 205
Newark, DE 19711
USA
chenc@dbi.udel.edu
ISSN 1064-3745 e-ISSN 1940-6029
ISBN 978-1-60761-976-5 e-ISBN 978-1-60761-977-2
DOI 10.1007/978-1-60761-977-2
Springer New York Dordrecht Heidelberg London
© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of
the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified
as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of going to press, neither
the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may
be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)
Preface
With the rapid development of proteomic technologies in life sciences and in clinical
applications, many bioinformatics methodologies, databases, and software tools have been
developed to support comparative proteomics studies. This volume aims to highlight the
current status, challenges, open problems, and future trends in developing bioinformatics
tools and resources for comparative proteomics research and to serve as a definitive source
of reference providing both the breadth and depth needed on the subject of Bioinformatics
for Comparative Proteomics.
The volume is structured to introduce three major areas of research methods: (1)
basic bioinformatics frameworks related to comparative proteomics, (2) bioinformatics
databases and tools for proteomics data analysis, and (3) integrated bioinformatics systems
and approaches for studying comparative proteomics in the systems biology context.
Part I (Bioinformatics Framework for Comparative Proteomics) consists of seven
chapters:
Chapter 1 presents a comprehensive review (with categorization and description) of
major protein bioinformatics databases and resources that are relevant to comparative
proteomics research.
Chapter 2 provides the comparative proteomics community with a practical guide to
exploiting the knowledge captured in, and the services provided by, the UniProt databases.
Chapter 3 introduces the InterPro protein classification system for automatic protein
annotation and reviews the signature methods used in the InterPro database.
Chapter 4 introduces the Reactome Knowledgebase, which provides an integrated view
of the molecular details of human biological processes.
Chapter 5 introduces eFIP (extraction of Functional Impact of Phosphorylation), a
Web-based text mining system that can help scientists quickly find abstracts in the
literature related to the phosphorylation (including site and kinase), interactions, and
functional aspects of a given protein.
Chapter 6 presents a tutorial on the Protein Ontology (PRO) Web resources, which help
researchers in their proteomic studies by providing key information about protein
diversity in terms of evolutionarily related protein classes based on full-length sequence
conservation and the various protein forms that arise from a gene, along with the specific
functional annotation.
Chapter 7 describes a method for the annotation of functional residues within
experimentally uncharacterized proteins using position-specific site annotation rules
derived from structural and experimental information.
Part II (Proteomic Bioinformatics) consists of eleven chapters:
Chapter 8 describes how simulations using synthetic data can elucidate the information
value of mass spectrometry-based proteomics data.
Chapter 9 describes the concepts, prerequisites, and methods required to analyze a
shotgun proteomics data set using a tandem mass spectrometry search engine.
Chapter 10 presents computational methods for quantification and comparison of
peptides by label-free LC–MS analysis, including data preprocessing, multivariate
statistical methods, and detection of differential protein expression.
Chapter 11 proposes an alternative to MS/MS spectrum identification by combining
the uninterpreted MS/MS spectra from overlapping peptides and then determining the
consensus identifications for sets of aligned MS/MS spectra.
Chapter 12 describes the Trans-Proteomic Pipeline, a freely available open-source
software suite that provides uniform analysis of LC–MS/MS data from raw data to
quantified sample proteins.
Chapter 13 provides an overview of a set of open-source software tools and steps
involved in ELISA microarray data analysis.
Chapter 14 presents the state of the art in proteomics databases and repositories.
Chapter 15 is a brief guide to preparing both large- and small-scale protein interaction
data for publication.
Chapter 16 demonstrates a new graphical user interface tool called PRIDE Converter,
which greatly simplifies the submission of MS data to the PRIDE database for submitted
proteomics manuscripts.
Chapter 17 presents a method for describing a protein’s posttranslational modifications
by integrating the top–down and bottom–up MS data using the Protein Inference Engine.
Chapter 18 describes an integrated top–down and bottom–up approach facilitated by
concurrent liquid chromatography–mass spectrometry analysis and fraction collection for
comprehensive high-throughput intact protein profiling.
Part III (Comparative Proteomics in Systems Biology) consists of four chapters:
Chapter 19 gives an overview of the content and usage of the PhosphoPep database,
which supports systems biology signaling research by providing interactive interrogation
of MS-derived phosphorylation data from four different organisms.
Chapter 20 describes “omics” data integration to map a list of identified proteins to a
common representation of the protein and uses the related structural, functional, genetic,
and disease information for functional categorization and pathway mapping.
Chapter 21 describes a knowledge-based approach relying on existing metabolic
pathway information and a direct data-driven approach for a metabolic pathway-centric
integration of proteomics and metabolomics data.
Chapter 22 provides a detailed description of a method used to study temporal changes
in the endoplasmic reticulum (ER) proteome of fibroblast cells exposed to ER stress agents
(tunicamycin and thapsigargin).
This volume targets readers who wish to learn about state-of-the-art bioinformatics
databases and tools, novel computational methods and future trends in proteomics
data analysis, and comparative proteomics in systems biology. The audience may range
from graduate students embarking upon a research project, to practicing biologists
working on proteomics and systems biology research, to bioinformaticians developing
advanced databases, analysis tools, and integrative systems. With its interdisciplinary
nature, this volume is expected to find a broad audience in biotechnology and
pharmaceutical companies and in various academic departments in biological and medical
sciences (such as biochemistry, molecular biology, protein chemistry, and genomics) and
computational sciences and engineering (such as bioinformatics and computational biology,
computer science, and biomedical engineering).
We thank all the authors and coauthors who contributed to this volume. We
thank our series editor, Dr. John M. Walker, for reviewing all the chapter manuscripts and
providing constructive comments. We also thank Dr. Winona C. Barker from Georgetown
University for reviewing the manuscripts. We thank Dr. Qinghua Wu for proofreading the
book draft. Finally, we would like to extend our thanks to David C. Casey and Anne
Meagher of Springer US and to Jeya Ruby and Ravi Amina of SPi for their help in the
compilation of this book.
Newark, DE, USA Cathy H. Wu and Chuming Chen
Contents
Preface
Contributors
Part I: Bioinformatics Framework for Comparative Proteomics
1 Protein Bioinformatics Databases and Resources
Chuming Chen, Hongzhan Huang, and Cathy H. Wu
2 A Guide to UniProt for Protein Scientists
Claire O’Donovan and Rolf Apweiler
3 InterPro Protein Classification
Jennifer McDowall and Sarah Hunter
4 Reactome Knowledgebase of Human Biological Pathways and Processes
Peter D’Eustachio
5 eFIP: A Tool for Mining Functional Impact of Phosphorylation from Literature
Cecilia N. Arighi, Amy Y. Siu, Catalina O. Tudor,
Jules A. Nchoutmboube, Cathy H. Wu, and Vijay K. Shanker
6 A Tutorial on Protein Ontology Resources for Proteomic Studies
Cecilia N. Arighi
7 Structure-Guided Rule-Based Annotation of Protein Functional Sites in UniProt Knowledgebase
Sona Vasudevan, C.R. Vinayaka, Darren A. Natale,
Hongzhan Huang, Robel Y. Kahsay, and Cathy H. Wu
Part II: Proteomic Bioinformatics
8 Modeling Mass Spectrometry-Based Protein Analysis
Jan Eriksson and David Fenyö
9 Protein Identification from Tandem Mass Spectra by Database Searching
Nathan J. Edwards
10 LC-MS Data Analysis for Differential Protein Expression Detection
Rency S. Varghese and Habtom W. Ressom
11 Protein Identification by Spectral Networks Analysis
Nuno Bandeira
12 Software Pipeline and Data Analysis for MS/MS Proteomics: The Trans-Proteomic Pipeline
Andrew Keller and David Shteynberg
13 Analysis of High-Throughput ELISA Microarray Data
Amanda M. White, Don S. Daly, and Richard C. Zangar
14 Proteomics Databases and Repositories
Lennart Martens
15 Preparing Molecular Interaction Data for Publication
Sandra Orchard and Henning Hermjakob
16 Submitting Proteomics Data to PRIDE Using PRIDE Converter
Harald Barsnes, Juan Antonio Vizcaíno, Florian Reisinger,
Ingvar Eidhammer, and Lennart Martens
17 Automated Data Integration and Determination of Posttranslational Modifications with the Protein Inference Engine
Stuart R. Jefferys and Morgan C. Giddings
18 An Integrated Top-Down and Bottom-Up Strategy for Characterization of Protein Isoforms and Modifications
Si Wu, Nikola Tolić, Zhixin Tian, Errol W. Robinson,
and Ljiljana Paša-Tolić
Part III: Comparative Proteomics in Systems Biology
19 Phosphoproteome Resource for Systems Biology Research
Bernd Bodenmiller and Ruedi Aebersold
20 Protein-Centric Data Integration for Functional Analysis of Comparative Proteomics Data
Peter B. McGarvey, Jian Zhang, Darren A. Natale,
Cathy H. Wu, and Hongzhan Huang
21 Integration of Proteomic and Metabolomic Profiling as well as Metabolic Modeling for the Functional Analysis of Metabolic Networks
Patrick May, Nils Christian, Oliver Ebenhöh,
Wolfram Weckwerth, and Dirk Walther
22 Time Series Proteome Profiling
Catherine A. Formolo, Michelle Mintz, Asako Takanohashi,
Kristy J. Brown, Adeline Vanderver, Brian Halligan,
and Yetrib Hathout
Index
Contributors
Ruedi Aebersold • Institute of Molecular Systems Biology, ETH Zurich, Zurich,
Switzerland
Rolf Apweiler • The European Bioinformatics Institute, Cambridge, UK
Cecilia N. Arighi • Department of Computer and Information Sciences, University
of Delaware, Newark, DE, USA
Nuno Bandeira • Center for Computational Mass Spectrometry, University of
California, San Diego, La Jolla, CA, USA
Harald Barsnes • Department of Informatics, University of Bergen, Bergen, Norway
Bernd Bodenmiller • Institute of Molecular Systems Biology, ETH Zurich, Zurich,
Switzerland
Kristy J. Brown • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Chuming Chen • Department of Computer and Information Sciences, University of
Delaware, Newark, DE, USA
Nils Christian • Max-Planck-Institute for Molecular Plant Physiology,
Potsdam-Golm, Germany
Don S. Daly • Pacific Northwest National Laboratory, Richland, WA, USA
Peter D’Eustachio • Department of Biochemistry, New York University School of
Medicine, New York, NY, USA
Oliver Ebenhöh • Max-Planck-Institute for Molecular Plant Physiology,
Potsdam-Golm, Germany
Nathan J. Edwards • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Ingvar Eidhammer • Department of Informatics, University of Bergen, Bergen, Norway
Jan Eriksson • Swedish University of Agricultural Sciences, Uppsala, Sweden
David Fenyö • The Rockefeller University, New York, NY, USA
Catherine A. Formolo • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Morgan C. Giddings • Departments of Microbiology & Immunology and Biomedical
Engineering, The University of North Carolina at Chapel Hill, Chapel Hill,
NC, USA
Brian Halligan • Bioinformatics, Human and Molecular Genetics Center, Medical
College of Wisconsin, Milwaukee, WI, USA
Yetrib Hathout • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Henning Hermjakob • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Hongzhan Huang • Department of Computer and Information Sciences, University
of Delaware, Newark, DE, USA
Sarah Hunter • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Stuart R. Jefferys • Department of Bioinformatics & Computational Biology,
The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Robel Y. Kahsay • DuPont Central Research & Development,
Wilmington, DE, USA
Andrew Keller • Rosetta Biosoftware, Seattle, WA, USA
Lennart Martens • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Patrick May • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm,
Germany
Jennifer McDowall • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Peter B. McGarvey • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Michelle Mintz • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Darren A. Natale • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Jules A. Nchoutmboube • Department of Computer and Information Sciences,
University of Delaware, Newark, DE, USA
Claire O’Donovan • The European Bioinformatics Institute, Cambridge, UK
Sandra Orchard • EMBL Outstation, European Bioinformatics
Institute (EBI), Cambridge, UK
Ljiljana Paša-Tolić • Pacific Northwest National Laboratory, Richland, WA, USA
Florian Reisinger • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Habtom W. Ressom • Department of Oncology, Georgetown University Medical
Center, Washington, DC, USA
Errol W. Robinson • Pacific Northwest National Laboratory, Richland, WA, USA
Vijay K. Shanker • Department of Computer and Information Sciences, University of
Delaware, Newark, DE, USA
David Shteynberg • Institute for Systems Biology, Seattle, WA, USA
Amy Y. Siu • Department of Computer and Information Sciences, University of
Delaware, Newark, DE, USA
Asako Takanohashi • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Zhixin Tian • Pacific Northwest National Laboratory, Richland, WA, USA
Nikola Tolić • Pacific Northwest National Laboratory, Richland, WA, USA
Catalina O. Tudor • Department of Computer and Information Sciences,
University of Delaware, Newark, DE, USA
Adeline Vanderver • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Rency S. Varghese • Department of Oncology, Georgetown University Medical
Center, Washington, DC, USA
Sona Vasudevan • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
C.R. Vinayaka • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Juan Antonio Vizcaíno • EMBL Outstation, European Bioinformatics Institute
(EBI), Cambridge, UK
Dirk Walther • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-
Golm, Germany
Wolfram Weckwerth • Molecular Systems Biology, University of Vienna, Vienna,
Austria
Amanda M. White • Pacific Northwest National Laboratory, Richland, WA, USA
Cathy H. Wu • Department of Computer and Information Sciences,
University of Delaware, Newark, DE, USA
Si Wu • Pacific Northwest National Laboratory, Richland, WA, USA
Richard C. Zangar • Pacific Northwest National Laboratory, Richland, WA, USA
Jian Zhang • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Part I
Bioinformatics Framework for Comparative Proteomics
Cathy H. Wu and Chuming Chen (eds.), Bioinformatics for Comparative Proteomics, Methods in Molecular Biology, vol. 694,
DOI 10.1007/978-1-60761-977-2_1, © Springer Science+Business Media, LLC 2011
Chapter 1
Protein Bioinformatics Databases and Resources
Chuming Chen, Hongzhan Huang, and Cathy H. Wu
Abstract
In the past decades, a variety of publicly available data repositories and resources have been developed to
support protein-related information management, data-driven hypothesis generation and biological
knowledge discovery. However, researchers increasingly struggle to quickly find the resources that are
appropriate for solving their problems. In this chapter, we present a
comprehensive review (with categorization and description) of major protein bioinformatics databases
and resources that are relevant to comparative proteomics research. We conclude the chapter by
discussing the challenges and opportunities for developing new protein bioinformatics databases.
Key words: Bioinformatics, Database, Protein sequence, Protein family, Protein structure, Protein
function, Proteomics, Data integration, Comparative analysis
1. Introduction
Advances in high-throughput technologies for the study of
molecular biology systems in the past decades have marked the
beginning of a new era of research, in which biological researchers
systematically study organisms at the levels of genomes (complete
genetic sequences) (1), transcriptomes (gene expression) (2) and
proteomes (protein expression) (3). Because proteins occupy a
middle ground molecularly between gene and transcript
information and higher levels of molecular and cellular structure and
organization, and because most physiological and pathological
processes are manifested at the protein level, biological scientists are
increasingly interested in applying proteomics techniques to foster a
better understanding of basic molecular biology and disease
processes and to discover new diagnostic, prognostic and therapeutic
targets for numerous diseases (4, 5).
Recently, proteomics data analysis has moved toward
information integration of multiple studies, including cross-species
analyses (6–9). The richness of proteomics data allows researchers
to ask complex biological questions and gain new scientific
insights. To support comparative proteomics, data-driven
hypothesis generation, and biological knowledge discovery, many
protein-related bioinformatics databases, query facilities, and
data analysis software tools have been developed. These organize
and provide biological annotations for individual proteins to
support sequence, structural, functional and evolutionary
analyses in the context of pathway, network and systems biology.
However, it is not always easy for researchers to quickly find the
relevant pieces of information. In this chapter, we present a
comprehensive review (with categorization and description) of major
protein bioinformatics databases and resources that are relevant
to comparative proteomics research. We highlight some of these
databases and focus on the types of data stored and the related
data access and data analysis support. We also discuss the challenges
and opportunities for developing new protein bioinformatics
databases in terms of supporting data integration and
comparative analysis, maintaining data provenance and managing
biological knowledge.
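As a rough illustration of what a catalogue with "categorization and description" could look like in practice, the following Python sketch stores a handful of databases with a primary category, a secondary category, and a short content description, and then groups and searches them. The category labels and entries are taken from Table 1 of this chapter; the class name `DatabaseEntry` and the helper functions `by_primary` and `search` are hypothetical names introduced only for this example, and the content strings are abbreviated paraphrases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseEntry:
    """One catalogue row: where a database is classified and what it holds."""
    primary: str    # primary category, e.g. "Sequence" or "Family"
    secondary: str  # finer-grained grouping within the primary category
    name: str
    content: str    # abbreviated description (paraphrased from Table 1)


# A few rows only; this is not the complete table.
CATALOGUE = [
    DatabaseEntry("Sequence", "UniProt", "UniProtKB",
                  "Functional information on proteins"),
    DatabaseEntry("Family", "Protein domain", "Pfam",
                  "Domain families as alignments and HMMs"),
    DatabaseEntry("Family", "Integrative", "InterPro",
                  "Integrated families, domains and functional sites"),
    DatabaseEntry("Structure", "3D structure", "wwPDB",
                  "3D coordinates of macromolecular structures"),
    DatabaseEntry("Function", "Inter-molecular interactions", "Reactome",
                  "Curated biological pathways"),
]


def by_primary(entries):
    """Group database names under their primary category, keeping input order."""
    index = {}
    for entry in entries:
        index.setdefault(entry.primary, []).append(entry.name)
    return index


def search(entries, keyword):
    """Return names of databases whose description mentions the keyword."""
    needle = keyword.lower()
    return [e.name for e in entries if needle in e.content.lower()]


if __name__ == "__main__":
    print(by_primary(CATALOGUE))
    print(search(CATALOGUE, "domain"))
```

A flat list of records is used here instead of a nested dictionary because, as the chapter notes, some databases fit more than one category; with flat records such a database can simply appear in more than one row.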
2. Overview
Our coverage of protein bioinformatics databases in this chapter
is by no means exhaustive. We refer readers to ref. 10 for a
more complete list. Our intention is to cover those that are recent,
of high quality, publicly available, and expected to be of interest
to many users in the comparative proteomics community. Based
on the topics and data stored, protein bioinformatics databases
can be primarily classified as sequence databases, family databases,
structure databases, function databases and proteomics databases,
as shown in Table 1. It is worth noting that certain databases can
be classified into more than one category. Please visit
http://www.proteininformationresource.org/staff/chenc/MiMB/dbSummary.html
to access the databases reviewed in this chapter
through their corresponding web addresses (URLs).
3. Databases and Resources Highlights
3.1. Protein Sequence Databases
Protein sequence databases serve as the archival repositories for
collections of protein sequences as well as their associated annotations.
These databases are also the primary sources for developing other
Table 1 Overview of protein bioinformatics databases
Primary category | Secondary category | Database name | Database content | URL | References
Sequence | NCBI | Reference Sequence (RefSeq) | Biologically non-redundant collection of DNA, RNA, and protein sequences | http://www.ncbi.nlm.nih.gov/RefSeq/ | (11)
Sequence | NCBI | Entrez Protein Database | Collection of protein sequences from a variety of sources, and translations from annotated coding regions in GenBank and RefSeq | http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein | (20)
Sequence | UniProt | UniProt Knowledgebase (UniProtKB) | Collection of functional information on proteins with accurate, consistent and rich annotation | http://www.uniprot.org/help/uniprotkb | (13)
Sequence | UniProt | UniProt Archive (UniParc) | Comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world | http://www.uniprot.org/help/uniparc | (14)
Sequence | UniProt | UniProt Reference Clusters (UniRef) | Clustered sets of sequences from UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records | http://www.uniprot.org/help/uniref | (15)
Family | Whole protein | PIRSF | Comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships based on whole protein sequences | http://www.pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml | (18)
Family | Whole protein | Clusters of Orthologous Groups of proteins (COGs) | Phylogenetic classification of proteins encoded in complete genomes | http://www.ncbi.nlm.nih.gov/COG/ | (64)
Family | Whole protein | Protein ANalysis THrough Evolutionary Relationships Classification System (PANTHER) | Proteins are classified by expert biologists into families and subfamilies of shared function and further categorized by GO terms | http://www.pantherdb.org/ | (29)
Family | Whole protein | ProtoNet | Automatic hierarchical classification of protein sequences | http://www.protonet.cs.huji.ac.il/index.php | (65)
Family | Protein domain | Pfam | Protein families of domains each represented by multiple sequence alignments and Hidden Markov Models (HMMs) | http://www.pfam.sanger.ac.uk/ | (19)
Family | Protein domain | ProDom | Comprehensive set of protein domain families automatically generated from the UniProtKB | http://www.prodom.prabi.fr/prodom/current/html/home.php | (21)
Family | Protein domain | Conserved Domains Database (CDD) | Collections of multiple sequence alignments representing conserved domains | http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd | (66)
Family | Protein domain | Simple Modular Architecture Research Tool (SMART) | Resource for identification and annotation of protein domains and the analysis of domain architectures | http://www.smart.embl.de/ | (31)
Family | Protein motif | PRINTS | Group of conserved motifs used to characterize a protein family | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php | (30)
Family | Protein motif | PROSITE | Protein domains, families and functional sites as well as associated patterns and profiles to identify them | http://www.ca.expasy.org/prosite/ | (24)
Family | Integrative | InterPro | Integrated resource of protein families, domains and functional sites from Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF etc. | http://www.ebi.ac.uk/interpro/ | (27)
Structure | 3D structure | Worldwide Protein Data Bank (wwPDB) | Repository for the 3D coordinates and related information on more than 38,000 macromolecular structures, including proteins, nucleic acids and large macromolecular complexes that have been determined using X-ray crystallography, NMR and electron microscopy techniques | http://www.wwpdb.org/ | (23)
Structure | 3D structure | Molecular Modeling Database (MMDB) | 3D macromolecular structures, including proteins and polynucleotides | http://www.ncbi.nlm.nih.gov/sites/entrez?db=structure | (67)
Structure | 3D structure | ModBase | 3D protein models calculated by comparative modeling | http://www.modbase.compbio.ucsf.edu/modbase-cgi/index.cgi | (68)
Structure | 3D structure | SWISS-MODEL Repository | Annotated protein 3D models | http://www.swissmodel.expasy.org/repository/ | (69)
Structure | Structural classification | CATH | Hierarchical classification of protein domain structures in the Protein Data Bank | http://www.cathdb.info/ | (37)
Structure | Structural classification | Structural Classification Of Proteins (SCOP) | Description of the evolutionary and structural relationships of the proteins of known structures | http://www.scop.mrc-lmb.cam.ac.uk/scop/ | (22)
Structure | Structural classification | SUPERFAMILY | Structural and functional annotation for all proteins and genomes based on a collection of Hidden Markov Models, which represents structural protein domains at the SCOP superfamily level | http://www.supfam.org/SUPERFAMILY/ | (32)
Structure | Protein folding | Protein Folding Database (PFD) | Repository of available experimental protein folding data | http://www.pfd.med.monash.edu.au/public_html/index.php | (38)
Structure | Protein folding | KineticDB | Experimental data on protein folding kinetics | http://www.kineticdb.protres.ru/db/index.pl | (70)
Structure | Protein modification | RESID | Collection of annotations and structures for protein pre-, co- and post-translational modifications | http://www.ebi.ac.uk/RESID/ | (71)
Structure | Protein modification | Phospho3D | 3D structures of phosphorylation sites that stores information retrieved from the phospho.ELM database | http://www.cbm.bio.uniroma2.it/phospho3d/ | (40)
Function | Inter-molecular interactions | IntAct | Protein interaction data from literature and user submission | http://www.ebi.ac.uk/intact/main.xhtml | (42)
Function | Inter-molecular interactions | Database of Interacting Proteins (DIP) | Experimentally determined protein–protein interactions | http://www.dip.doe-mbi.ucla.edu/dip/Main.cgi | (72)
Function | Inter-molecular interactions | Reactome | A curated knowledgebase of biological pathways | http://www.reactome.org/ | (47)
Function | Inter-molecular interactions | Biological General Repository for Interaction Datasets (BioGRID) | Collections of protein and genetic interactions from major model organism species | http://www.thebiogrid.org | (73)
Function | Metabolic pathways | Kyoto Encyclopedia of Genes and Genomes (KEGG) | Pathway maps on the molecular interaction and reaction networks for metabolism | http://www.genome.jp/kegg/pathway.html | (74)
Function | Metabolic pathways | BioCyc | Pathway/Genome Databases (PGDBs) on the pathways and genomes of different organisms | http://www.biocyc.org/
(51)
MetaCyc
Nonredundant,
experimentally
elucidated
metabolic
pathways
http://www.metacyc.org/
(51)
Integrative
Michigan
molecular
interactions
(MiMI)
Merged
view
of
several
popular
interaction
databases
including:
BIND,
HPRD,
IntAct,
GRID,
and
others
http://www.mimitest.ncibi.
org/MimiWeb/main-
page.jsp
(75)
Proteomics
Gel
electro­
phoresis
WORLD-2DPAGE
Constellation
List
of
World-2DPAGE
database
servers,
World-
2DPAGE
Portal
that
queries
simultaneously
world-
wide
proteomics
databases,
and
World-2DPAGE
Repository
http://www.world-2dpage.
expasy.org/
(52)
Mass
spectrometry
Global
Proteome
Machine
Database
(GPMDB)
Mass
spectral
library
for
data
from
a
variety
of
organisms,
the
identified
peptides
are
matched
to
the
Ensembl
genome
database
http://www.thegpm.org/
GPMDB/index.html
(76)
PRoteomics
IDEntifications
database
(PRIDE)
Protein
and
peptide
identifications
that
have
been
described
in
the
scientific
literature
together
with
the
evidence
supporting
these
identifications
http://www.ebi.ac.uk/
pride/
(54)
PeptideAtlas
Peptides
identified
in
a
large
set
of
LC–MS/MS
proteomics
experiments
http://www.peptideatlas.
org/
(77)
Peptidome
Tandem
mass
spectrometry
peptide
and
protein
identification
data
generated
by
the
scientific
community
http://www.ncbi.nlm.nih.
gov/peptidome/
(78)
10 Chen, Huang, and Wu
resources such as protein family databases, and the foundation for
medical and functional studies.
3.1.1. RefSeq
The National Center for Biotechnology Information Reference
Sequence (NCBI RefSeq) database provides curated non-redundant
sequences for genomic regions, transcripts and proteins (11).
The RefSeq collection is derived from the sequence data available
in the redundant archival database GenBank (12). RefSeq records
include coding regions, conserved domains, variations, references,
names, and database cross-references. The sequences are annotated
using a combined approach of collaboration, automated prediction,
and manual curation (11). RefSeq release 37 of September 11, 2009
includes 8,835,796 proteins from 9,005 organisms. The RefSeq data
can be accessed from the NCBI web sites by Entrez query, BLAST,
FTP download, etc.
3.1.2. UniProt
The UniProt Consortium consists of groups from the European
Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics
(SIB) and the Protein Information Resource (PIR). The UniProt
Consortium provides a central resource for protein sequences and
functional annotations with four database components to support
protein bioinformatics research:
● The UniProt Knowledgebase (UniProtKB) is the predominant data
store for functional information on proteins (13). The UniProtKB
consists of two sections: UniProtKB/Swiss-Prot, which contains
manually annotated records with information extracted from
literature and curator-evaluated computational analysis, and
UniProtKB/TrEMBL, which contains computationally analyzed records
with rule-based automatic annotation. Comparative analysis and
query across databases are supported by the extensive
cross-references, functional and feature annotations,
classification, and literature-based evidence attribution in
UniProtKB. The UniProtKB release 15.9 of October 13, 2009 includes
510,076 UniProtKB/Swiss-Prot sequence entries, comprising
179,409,349 amino acids abstracted from 183,725 references, and
9,501,907 UniProtKB/TrEMBL sequence entries comprising
3,068,281,486 amino acids.
● The UniProt archive (UniParc) (14) is an archival protein
sequence database drawn from all major publicly accessible
resources. UniParc contains protein sequences and database
cross-references to the provenance of the sequences. Text- and
sequence-based searches are available from the UniParc database
web site.
● The UniProt Reference Clusters (UniRef) (15) merge sequences and
sub-sequences that are 100% (UniRef100), ≥90% (UniRef90), or ≥50%
(UniRef50) identical, regardless of source organism, to speed up
searches.
● The UniProt Metagenomic and Environmental Sequences (UniMES)
database is a repository specifically developed for metagenomic
and environmental data. UniMES currently contains data from the
Global Ocean Sampling Expedition (GOS) (16), which predicts nearly
six million proteins, primarily from oceanic microbes (13).
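The UniRef100 idea of merging identical sequences and sub-sequences can be illustrated with a toy sketch. This is not the UniRef production procedure: the function name `cluster_identical` and the accessions are hypothetical, and exact substring containment stands in for 100% identity. UniRef90 and UniRef50 would additionally need an alignment-based identity measure, which is omitted here.

```python
from typing import Dict, List


def cluster_identical(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sequences that are identical to, or fragments of, a representative.

    `seqs` maps a (hypothetical) accession to its amino acid sequence.
    Returns a mapping from representative accession to member accessions.
    """
    # Visit longer sequences first so that every fragment meets its
    # full-length representative before it could start its own cluster.
    order = sorted(seqs, key=lambda acc: (-len(seqs[acc]), acc))
    clusters: Dict[str, List[str]] = {}
    for acc in order:
        for rep in clusters:
            if seqs[acc] in seqs[rep]:  # identical or exact sub-sequence
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # no existing representative contains it
    return clusters


# Made-up sequences, for illustration only.
example = {
    "A1": "MKTAYIAKQRQISFVKSHFSRQ",
    "A2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to A1
    "A3": "AKQRQISFVK",              # fragment of A1
    "B1": "MSDNELLQWERTY",           # unrelated
}
clusters = cluster_identical(example)
```

Searching only the representatives (here `A1` and `B1`) instead of all four entries is the kind of speed-up the reference clusters are meant to provide.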
The UniProt web site (http://www.uniprot.org) is the primary
access point to the data and documentation. The site also
provides batch retrieval using UniProt identifiers, BLAST-based
sequence similarity search, ClustalW-based sequence alignment,
and database identifier mapping. The UniProt FTP download
site provides batch download of protein sequence data in various
formats, including flat file text, XML, RDF, FASTA, and GFF.
Programmatic access to the data and search results is supported
via simple HTTP RESTful web services or UniProtJAPI (17) for
Java-based applications.
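As an illustration of working with a FASTA download, the following minimal sketch parses FASTA-formatted text and reports the number of entries and the total number of amino acids, the kind of statistic quoted for the UniProtKB release above. The helper name `parse_fasta` and the two records are made up for the example; no UniProt service is contacted.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may wrap over several lines
    return records


# Hypothetical records, not real database entries.
fasta_text = """>sp|EXAMPLE1|hypothetical entry one
MKTAYIAKQR
QISFVKSHFS
>sp|EXAMPLE2|hypothetical entry two
MSDNELLQ
"""
records = parse_fasta(fasta_text.splitlines())
n_entries = len(records)
n_residues = sum(len(seq) for seq in records.values())
print(n_entries, n_residues)
```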
3.2. Protein Family Databases
The primary protein sequence databases can be used to develop
new resources with value-added information by either classifying
protein sequences into families or assigning certain properties to
the sequences by detecting specific sequence features such as
domains, motifs, and functional sites.
3.2.1. PIRSF
The PIRSF classification system provides comprehensive and
non-overlapping clustering of UniProtKB (13) sequences into a
hierarchical order to reflect their evolutionary relationships based
on whole proteins rather than on the component domains. The
PIRSF system classifies the protein sequences into families whose
members are both homologous (evolved from a common ancestor)
and homeomorphic (sharing full-length sequence similarity
and a common domain architecture) (18). The PIRSF family
classification results are expert-curated based on literature review
and integrative sequence and functional analysis. The classification
report shows information on PIRSF members and general
statistics, family and function/structure relationships, database
cross-references, and a graphical display of the domain and motif
architecture of seed members or all members. The web-based PIRSF
system has been shown to be a useful tool for studying the function
and evolution of protein families (18). It provides batch retrieval
of entries from the PIRSF database. The PIRSF scan allows
searching a query sequence against the set of fully curated PIRSF
families with benchmarked Hidden Markov Models. The PIRSF
membership hierarchy data is also available for FTP download.
3.2.2. Pfam
Pfam is a database of protein domains and families represented as
multiple sequence alignments and Hidden Markov Models
(HMMs) (19). Pfam is built from the protein sequence data
in UniProtKB (13), NCBI GenPept (20) and selected
metagenomics projects. The Pfam database contains two components:
Pfam-A and Pfam-B. Each Pfam-A entry consists of a manually curated,
high-quality representative seed alignment, a profile HMM built
from the seed alignment, and an automatically generated full
alignment of all detectable family member protein sequences.
Pfam-B entries are automatically generated from the ProDom
database (21). The Pfam release 24.0 of October 2009 contains
11,912 families. The Pfam database is further organized into
higher-level hierarchical groupings of related families called clans
(19), which are collections of related Pfam-A entries built manually
based on the similarity of their sequences, known structures,
profile HMMs, and other databases such as SCOP (22). The
Pfam database web site provides a set of query and browsing
interfaces for analyzing protein sequences for Pfam matches, for
viewing Pfam family annotations, alignments, groups of related
families, and the domains of a protein sequence, as well as for
finding the domains on a PDB (23) structure. The Pfam data can
be downloaded from its FTP site or programmatically accessed
through RESTful and SOAP-based web services.
3.2.3. PROSITE
PROSITE (24) is a database of annotated motif descriptors (patterns
or profiles), which can be used for the identification of protein
domains and families. The motif descriptors are derived from
multiple alignments of homologous sequences and have the advantage
of identifying distant relationships among sequences (25). A
set of ProRules providing additional information about the
functionally and/or structurally critical amino acids is used to
increase the discriminatory power of the motif descriptors (24). The
PROSITE web site provides keyword-based search and allows
browsing of motif entries, ProRule descriptions, taxonomic scope,
and number of positive hits. The ScanProsite (26) tool allows one
either to scan protein sequences for the occurrence of PROSITE
motifs by entering UniProtKB AC and/or ID, PDB identifier(s)
or protein sequence(s), or to scan the UniProtKB or PDB databases
for the occurrence of a pattern by entering the PROSITE
AC and/or ID or the user’s own pattern(s). The ScanProsite tool
can also be accessed programmatically through a simple HTTP
web service. The PROSITE documentation entries and related
tools can be downloaded from its FTP site.
3.2.4. InterPro
InterPro (27) is an integrated resource of predictive models or
“signatures” representing protein domains, families, regions,
repeats and sites from major protein signature databases including
Gene3D (28), PANTHER (29), Pfam (19), PIRSF (18),
PRINTS (30), ProDom (21), PROSITE (24), SMART (31),
SUPERFAMILY (32) and TIGRFAMs (33). Each entry in the
InterPro database is annotated with a descriptive abstract name
and cross-references to the original data sources, as well as to
specialized functional databases. The InterPro release 23.0 of
September 23, 2009 includes 19,150 entries containing 434 new
signatures. The database is available via a web interface and anon-
ymous FTP download. The software tool InterProScan (34) is
provided as a protein sequence classification and comparison
package that can be used via a web interface and SOAP-based
Web Services or can be installed locally for bulk operations. The
InterPro BioMart (35) allows users to retrieve InterPro data from
a query-optimized data warehouse that is synchronized with the
main InterPro database, and to build simple or complex queries
and control the query results through a unified interface.
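Among the signature types that InterPro integrates, the PROSITE pattern (Section 3.2.3) is the simplest to illustrate. The sketch below translates a PROSITE-style pattern into a Python regular expression and reports every hit in a sequence. It is only an approximation of what a tool such as ScanProsite does: the function names and the test sequence are hypothetical, ProRules and profiles are not covered, and rarer syntax such as a terminus symbol inside square brackets is not handled. The pattern shown is the published N-glycosylation site pattern (PS00001).

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = (
            element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
            .replace("(", "{").replace(")", "}")          # x(2,4) -> x{2,4}
            .replace("x", ".")                            # x      -> any residue
            .replace("<", "^").replace(">", "$")          # terminal anchors
        )
        parts.append(element)
    return "".join(parts)


def scan(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # A lookahead makes each match zero-width, so overlapping hits are found.
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# PS00001 (N-glycosylation site); the sequence is made up for illustration.
hits = scan("MKNGSAANPTNVTW", "N-{P}-[ST]-{P}")
print(hits)
```

In the example, the asparagine at position 8 is skipped because it is followed by proline, which the `{P}` element excludes.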
3.3. Protein Structure Databases
Many bioinformatics studies are based on the premise that proteins
with similar sequences carry out similar functions, whereas
those with different sequences carry out different functions. A
growing body of experimental data supports the notion that the
structure of a protein reflects the role it plays and therefore
determines its function in a biological process. The protein
structure databases organize and annotate experimentally
determined protein structures, giving the biological community
access to the experimental data in a useful form.
3.3.1. worldwide PDB
The worldwide PDB (wwPDB) was established in 2003 as an
international collaboration to maintain a single, publicly available
Protein Data Bank Archive (PDB Archive) of macromolecular
structural data (23). The wwPDB members include RCSB PDB
(USA), the Macromolecular Structure Database at the European
Bioinformatics Institute (MSD-EBI) (UK), the Protein Data Bank
Japan (PDBj) at Osaka University (Japan) and the BioMagResBank
(BMRB) at the University of Wisconsin-Madison (USA).
The “PDB Archive” is a collection of flat files in three different
formats: the legacy PDB file format; the PDB exchange format that
follows the mmCIF syntax (http://www.deposit.pdb.org/mmcif/); and
the PDBML/XML format (36). Each member site
serves as a deposition, data processing and distribution site for the
PDB Archive, and each provides its own view of the primary data
and a variety of tools and resources. As of October 27, 2009, there
are 61,086 structures in the wwPDB database.
3.3.2. CATH
CATH (Class, Architecture, Topology, Homology) is a database
of protein domain structures in the Protein Data Bank, where
domains are hierarchically classified by curators guided by
prediction algorithms (such as structure comparison). CATH
clusters proteins at four major levels (37):
● Class (C): secondary structure composition and packing
within the structure.
● Architecture (A): orientations of the secondary structures,
ignoring the connectivity among the secondary structures.
● Topology (T): whether they share the same topology in the
core of the domain.
● Homologous superfamily (H): sequence and structural
similarities.
The CATH release 3.2.0 of July 14, 2008 contains 114,215
assigned domains. CATH provides the SSAP server, which allows
users to compare the structures of two proteins and view the sub-
sequent structural alignment.
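The four-level hierarchy can be made concrete with a short sketch. It assumes the dotted numeric node identifiers that CATH uses (one number per level, written C.A.T.H); the function names are hypothetical and the codes below serve only as illustrative strings, not as a statement about any particular domain.

```python
from typing import Dict

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_code(code: str) -> Dict[str, str]:
    """Map each CATH level to the dotted prefix that identifies it."""
    parts = code.split(".")
    if not 1 <= len(parts) <= len(LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a C.A.T.H style code: {code!r}")
    return {LEVELS[i]: ".".join(parts[: i + 1]) for i in range(len(parts))}


def share_level(code_a: str, code_b: str) -> str:
    """Return the deepest CATH level at which two domains are grouped together."""
    a, b = parse_cath_code(code_a), parse_cath_code(code_b)
    deepest = "none"
    for level in LEVELS:
        if level in a and a.get(level) == b.get(level):
            deepest = level
        else:
            break  # once the codes diverge, deeper levels cannot match
    return deepest


levels = parse_cath_code("3.40.50.300")
common = share_level("3.40.50.300", "3.40.50.720")
print(levels)
print(common)
```

Because each level's identifier is a prefix of the next, two domains that share a Topology node necessarily share the Architecture and Class above it.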
3.3.3. SCOP
The SCOP (Structural Classification of Proteins) database provides
a comprehensive and detailed description of the evolutionary and
structural relationships of the proteins of known structure. The
SCOP classification hierarchy is constructed based on the domains in
experimentally determined protein structures and includes the
following levels (22):
● Species: distinct protein sequence and its naturally occurring
or artificially created variants.
● Protein: similar sequences of essentially the same functions.
● Family: proteins with related sequences but typically distinct
functions.
● Superfamily: protein families with common evolutionary
ancestor.
● Fold: superfamilies with structural similarity (same major
secondary structures in the same arrangement and with the same
topological connections, not necessarily with common
evolutionary origin).
● Class: based on the secondary structure content and
organization of folds.
The SCOP release 1.75 of June 2009 includes 38,221 PDB
entries, 1,195 folds, 1,962 superfamilies and 3,902 families.
3.3.4. PFD
The Protein Folding Database (PFD) is a publicly searchable
repository that collects experimental thermodynamic and kinetic
data for the folding of proteins. Experimenters deposit data
including Constructor, Mutations, Equilibrium Method, Kinetic
Method, Equilibrium Data, Kinetic Data, and Publications (38).
The PFD database uses the International Foldeomics Consortium
standards (39) for data deposition, analysis and reporting to
facilitate the comparison of folding rates, energies and structure
across diverse sets of proteins (38). The PFD release 2.2 of June 8,
2009 contains 296 entries, 70 proteins, 53 families, 30 species and
230 (five proteins) φ values. The web site provides advanced text
searches of protein names, literature references, and experimental
details, with search results displayed in a tabular view. Graphical
visualization tools have been built for raw equilibrium data,
chevron data, contact order and folding rates, with hyperlinks
on the graphs that link directly to the data in text format.
3.3.5. Phospho3D
Phospho3D (40) is a database of 3D structures of phosphorylation
sites. Phospho3D is constructed from the data collected
in the phospho.ELM (41) database of experimentally verified
phosphorylation sites in eukaryotic proteins, and is enriched with
structural information and annotations at the residue level. The
basic information unit in the Phospho3D database consists of the
instance, its flanking sequence (ten residues) and its “zone,” a 3D
neighborhood including any residue whose distance does not
exceed 12 Å (40). For each zone, structural similarity and
biochemical similarity are used to collect the results of a
large-scale local structural comparison versus a representative
dataset of PDB (23) protein chains, which provide clues for the
identification of new putative phosphorylation sites. Users can
browse the data in the Phospho3D database or search the database
using kinase name, PDB identification code or keywords.
3.4. Protein Function Databases
The unique feature of proteins that allows their diverse functions
is the ability to bind to other molecules specifically. For example,
proteins can act as enzymes that catalyze chemical reactions in the
cell or manipulate the replication and transcription of DNA.
Many proteins are also involved in cell signaling
and signal transduction. Protein function databases maintain
information about metabolic pathways, enzymes, compounds,
and the inter-molecular interactions and regulatory pathway
mechanisms underlying many biological processes.
3.4.1. IntAct
IntAct is an open source database and toolkit for the storage,
presentation and analysis of protein interaction data (42). IntAct
provides all relevant experimental details of protein interactions
described in the originating publication. All the entries in the
database are fully compliant with the IMEx (43) guidelines and
MIMIx (44) standard. The technical details of the experiment, binding
sites, protein tags and mutations are annotated with the Molecular
Interaction ontology of the Proteomics Standard Initiative
(PSI-MI) (45). The latest database contains 202,419 binary
interactions, 60,310 proteins, 11,119 experiments and 1,509
controlled vocabulary terms. The IntAct web site provides both textual
and graphical views of protein interactions, and allows exploring
interaction networks in the context of the Gene Ontology (46)
controlled vocabulary and InterPro (27) domains of the
interacting proteins. IntAct data and source code are available for
downloading from its web site. In addition, a set of tools has
been developed by the IntAct project:
● ProViz: visualization of protein–protein interaction graphs.
● MiNe: computation of the minimal connecting networks for a given
set of proteins.
● PSI-MI Semantic Validator: validation of files in PSI-MI XML 2.5
and PSI-PAR format.
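To give a feel for how binary interactions of the kind stored in IntAct form a network, the sketch below builds an undirected graph from interaction pairs and finds a shortest chain of interactions between two proteins by breadth-first search. This is only the simplest case of a connecting network and is not the algorithm used by MiNe; the function names and protein identifiers are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_graph(interactions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from binary interaction pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(
    graph: Dict[str, Set[str]], source: str, target: str
) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of interactions."""
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path: List[str] = []
            step: Optional[str] = node
            while step is not None:  # walk back to the source
                path.append(step)
                step = previous[step]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two proteins are not connected


# Made-up binary interactions.
pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5"), ("P5", "P4")]
graph = build_graph(pairs)
path = shortest_path(graph, "P1", "P4")
print(path)
```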
3.4.2. Reactome
Reactome is an open source, expert-curated and peer-reviewed
database of biological reactions and pathways with cross-references
to major molecular databases (47). The basic information in the
Reactome database is provided by either publications or sequence
similarity-based inference. The Reactome release 30 of September
30, 2009 contains 3,916 proteins, 2,955 complexes, 3,541 reactions,
and 1,045 pathways for Homo sapiens. Reactome data can be
exported in SBML (48), Protégé (49), Cytoscape (50) and BioPAX
(http://www.biopax.org) formats. Software tools such as PathFinder,
SkyPainter and Reactome BioMart (35) have been developed to
support data mining and analysis of large-scale data sets.
3.4.3. MetaCyc and BioCyc
MetaCyc is a database of non-redundant, experimentally elucidated
metabolic pathways and enzymes curated from the scientific
literature (51). MetaCyc stores pathways involved in primary and
secondary metabolism. It also stores compounds, proteins, protein
complexes and genes associated with these pathways, with extensive
links to other biological databases of protein sequences, nucleic
acid sequences, protein structures and literature. BioCyc is a
collection of Pathway/Genome Databases (PGDBs) (51). Each BioCyc
PGDB contains the metabolic network of one organism predicted
by the Pathway Tools software using MetaCyc as a reference
database. Web-based query, browsing, visualization and comparative
analysis tools are also provided on the MetaCyc and BioCyc web
sites. A collection of data files is also available for downloading.
3.5. Proteomics Databases
The advent of high-throughput 2D-gel and mass spectrometry
based analytical techniques and the available protein sequence
databases have created a massive amount of proteomics data. To
facilitate the sharing and further computational analysis of
published proteomics data, several repositories have been created.
3.5.1. World-2DPAGE
The World-2DPAGE Constellation (52) is an effort of the Swiss
Institute of Bioinformatics (SIB) to promote and publish
two-dimensional gel electrophoresis proteomics data online through
the ExPASy proteomics server. The World-2DPAGE Constellation
consists of three components:
● WORLD-2DPAGE List (http://www.world-2dpage.expasy.org/list/)
contains references to known federated 2D PAGE
databases, as well as to 2D PAGE-related servers and
services.
● World-2DPAGE Portal (http://www.world-2dpage.expasy.org/portal/)
is a dynamic portal that serves as a single interface
to query simultaneously world-wide gel-based proteomics
databases that are built using the Make2D-DB package (53).
● World-2DPAGE Repository (http://www.world-2dpage.expasy.org/repository/)
is a public repository for gel-based
proteomics data with protein identifications published in the
literature. Mass-spectrometry based proteomics data from
related studies can also be submitted to the PRIDE database
(54) so that interested readers can explore the data in the
views of 2D-gel and/or MS.
3.5.2. PRIDE
The PRoteomics IDEntifications database (PRIDE) is a repository
for mass-spectrometry based proteomics data including
identifications of proteins, peptides and post-translational
modifications that have been described in the scientific literature,
together with supporting mass spectra (54). The PRIDE team
has built an infrastructure and a set of software tools to facilitate
data submissions in PRIDE XML or mzData XML format
from labs using different MS-based proteomics technologies. The
PRIDE database can be queried by experiment accession number,
protein accession number, literature reference, and sample
parameters including species, tissue, sub-cellular location and disease
state. The query results can be retrieved as PRIDE XML, mzData
XML, or HTML. The PRIDE database includes a BioMart (35)
interface that provides access to public PRIDE data from a
query-optimized data warehouse as well as programmatic web service
access. The PRIDE project also provides the Protein Identifier
Cross-Reference Service (PICR) (55), which maps protein
sequence identifiers from over 60 different databases via the
UniParc (14) database. The Database on Demand (DoD,
http://www.ebi.ac.uk/pride/dod) service provides custom
FASTA-formatted sequence databases according to a set of user-selectable
criteria to optimize the search engine results. As of November 19,
2009, the PRIDE database contains 10,329 experiments,
2,827,384 identified proteins, 12,542,472 identified peptides,
1,891,670 unique peptides and 56,703,344 spectra.
4. Discussion
Although a variety of protein bioinformatics databases and
resources have been developed to catalog and store different
information about proteins, there are still opportunities to develop
new solutions to facilitate comparative analysis, data-driven
hypothesis generation, and biological knowledge discovery.
4.1. Data Integration and Comparative Analysis
As the volume and diversity of data and the desire to share those
data increase, we inevitably encounter the problem of combining
heterogeneous data generated from many different but related
sources and providing the users with a unified view of this
combined data set. This problem emerges in the biological and
biomedical research community, where research data from different
bioinformatics data repositories and laboratories need to be
combined and analyzed. There is an urgent need to develop
computational methods that integrate data from multiple studies and
answer more complex biological questions than traditional methods
can tackle. Comparing experimental results across multiple
laboratories and data types can also help form new hypotheses
for further experimentation (56–58). Different laboratories use
different experimental protocols, instruments and analysis
techniques, which makes direct comparison of their experimental
results difficult. However, having related data in one place can
make queries and comparisons of combined protein and gene
data sets and further analysis possible.
In general, there are two types of data integration approaches.
The data warehouse approach puts data sources into a centralized
location with a global data schema and an indexing system for fast
data retrieval. An example of this approach is the NIAID (National
Institute of Allergy and Infectious Diseases) Biodefense Resource
Center (http://www.proteomicsresource.org), which uses a
protein-centric data warehouse (Master Protein Directory) to integrate
and support mining and comparative analysis of large and
heterogeneous “omics” data across different experiments and organisms
(59). The other approach to data integration involves the federation
of data across multiple sources. An example of this approach is
BioMart (35), an open source database management system that
uses integrated query interfaces to query different BioMarts and
allows users to group and refine their query results. BioMart
can also be accessed programmatically through web services or
software libraries written in Java or Perl.
4.2. Data Provenance and Biological Knowledge
In many cases, the most difficult tasks in protein bioinformatics
data management and analysis are not mapping biological entities
from different sources or managing and processing large sets of
experimental data, such as gel images and mass spectra. Rather, the
difficulty lies in recording the detailed provenance of data, i.e.,
what was done, why it was done, where it was done, which instrument
was used, what settings were used, and how it was done. The provenance
of experimental data is an important aspect of scientific best
practice and is central to scientific discovery (60).
In proteomics studies, although great efforts have been made
to develop and maintain data format standards, such as mzXML
(61) and HUPO PSI (HUPO Proteomics Standards Initiative)
(62), and minimal information standards for describing such data,
for example, MIAPE (Minimum Information About a Proteomics
Experiment) (63), the ontologies and related tools that provide
formal representation of a set of concepts and their relationships
within the domain of “omics” experiments still lag behind the
current development of experimental protocols and methods.
The standardization of data provenance remains a somewhat
manual process, which depends on the efforts of database main-
tainers and data submitters.
General biological and biomedical scientists are more interested
in finding and viewing the “knowledge” contained in an
already analyzed data set, and much of the protein data generated
in high-throughput research turns out to be insignificant to the
conclusions of an analysis. Unfortunately, the analyzed results
seldom come with the standard data files and formats and are usually
not easily found in omics repositories unless a reanalysis is
performed or the data is annotated by a curator. For example, tables
of proteins present in a
given proteomics experiment are routinely found as supplemental
data in scientific publications, but are not available in a searchable or
easily computable format. This is unfortunate as this supplemental
information is the result of considerable analysis by the original
authors of a study to minimize false positive and false negative
results, thus often representing the “knowledge” that underlies
additional analysis and conclusions reached in a publication.
The NIAID Biodefense Resource Center developed a simple
set of defined fields called a “structured assertion” that could be used
across proteomics, microarray and possibly other data types (59).
A “structured assertion” can represent the results in a simple form
like “protein V (presented) in experimental condition W,” where V
represents any valid identifier and W represents a value in a simple
experimental ontology. A simple two-field assertion was implemented
for the analyzed results of proteomics and microarray data: an
“experimental condition” field containing simple keywords that
describe the key experimental variables (growth conditions, sample
fractionation, time, temperature, infection status and others), and an
“Expression Status” field, which has three values: increase, decrease
or present. Although seemingly simple, the approach provides unique
analytical power by enabling simple queries across results
from different data types and laboratories.
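A minimal sketch of how such assertions could be represented and queried is given below. It is not the Biodefense Resource Center's implementation: the class and function names, the identifiers and the condition keywords are hypothetical, and the `data_type` field is an extra assumption added so that results from different data types can be compared.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

EXPRESSION_STATUS = ("increase", "decrease", "present")


@dataclass(frozen=True)
class StructuredAssertion:
    identifier: str  # "V": any valid protein or gene identifier
    condition: str   # "W": keyword from a simple experimental ontology
    status: str      # Expression Status: increase, decrease or present
    data_type: str   # assumed extra field, e.g. "proteomics" or "microarray"

    def __post_init__(self) -> None:
        if self.status not in EXPRESSION_STATUS:
            raise ValueError(f"unknown expression status: {self.status!r}")


def identifiers_in(
    assertions: Iterable[StructuredAssertion],
    condition: str,
    data_type: Optional[str] = None,
) -> Set[str]:
    """Identifiers asserted under a condition, optionally for one data type."""
    return {
        a.identifier
        for a in assertions
        if a.condition == condition
        and (data_type is None or a.data_type == data_type)
    }


# Made-up assertions from two data types.
assertions = [
    StructuredAssertion("PROT_A", "infected", "increase", "proteomics"),
    StructuredAssertion("PROT_A", "infected", "increase", "microarray"),
    StructuredAssertion("PROT_B", "infected", "present", "proteomics"),
    StructuredAssertion("PROT_C", "uninfected", "present", "proteomics"),
]
in_both = identifiers_in(assertions, "infected", "proteomics") & identifiers_in(
    assertions, "infected", "microarray"
)
print(in_both)
```

Because every result is reduced to the same few fields, a cross-data-type question such as "which identifiers are reported under the same condition by both proteomics and microarray experiments" becomes a set intersection.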
Acknowledgment
We would like to thank Dr. Winona C. Barker for reviewing the
manuscript and providing constructive comments.
References
1. Ridley, M. (2006) Genome. Harper Perennial,
New York.
2. Velculescu, V. E., Zhang, L., Zhou, W.,
Vogelstein, J., Basrai, M. A., Bassett, D. E. Jr,
Hieter, P., Vogelstein, B., Kinzler, K. W.
(1997) Characterization of the yeast tran-
scriptome. Cell 2, 243–251.
3. Anderson, N. L., Anderson, N. G. (1998)
Proteome and proteomics: new technologies,
new concepts, and new words. Electrophoresis
11, 1853–1861.
4. Hye, A., Lynham, S., Thambisetty, M.,
Causevic, M., Campbell, J., Byers, H. L.,
Hooper, C., Rijsdijk, F., Tabrizi, S. J., Banner,
S., Shaw, C. E., Foy, C., Poppe, M., Archer,
N., Hamilton, G., Powell, J., Brown, R. G.,
Sham, P., Ward, M., Lovestone, S. (2006)
Proteome-based plasma biomarkers for
Alzheimer’s disease. Brain 11, 3042–3050.
5. Decramer, S., Wittke, S., Mischak, H., Zürbig,
P., Walden, M., Bouissou, F., Bascands, J. L.,
Schanstra, J. P. (2006) Predicting the clinical
outcome of congenital unilateral ureteropel-
vic junction obstruction in newborn by uri-
nary proteome analysis. Nat. Med. 4,
398–400.
6. Savidor, A., Donahoo, R. S., Hurtado-
Gonzales, O., Land, M. L., Shah, M. B.,
Lamour, K. H., McDonald, W. H. (2008)
Cross-species global proteomics reveals con-
served and unique processes in Phytophthora
sojae and Phytophthora ramorum. Mol. Cell
Proteomics 8, 1501–1516.
7. Huang, M., Chen, T., Chan, Z. (2006) An
evaluation for cross-species proteomics research
by publicly available expressed sequence tag
database search using tandem mass spectral
data. Rapid Commun. Mass Spectrom. 18,
2635–2640.
8. Ishii, A., Dutta, R., Wark, G. M., Hwang, S. I.,
Han, D. K., Trapp, B. D., Pfeiffer, S. E., Bansal,
R. (2009) Human myelin proteome and com-
parative analysis with mouse myelin. Proc. Natl.
Acad. Sci. U. S. A. 34, 14605–14610.
9. Irmler, M., Hartl, D., Schmidt, T.,
Schuchhardt, J., Lach, C., Meyer, H. E.,
Hrabé de Angelis, M., Klose, J., Beckers, J.
(2008) An approach to handling and inter-
pretation of ambiguous data in transcriptome
and proteome comparisons. Proteomics 6,
1165–1169.
10. Galperin, M. Y., Cochrane, G. R. (2009)
Nucleic acids research annual database issue
and the NAR online molecular biology data-
base collection in 2009. Nucleic Acids Res.
37, D1–D4.
11. Pruitt, K. D., Tatusova, T., Maglott, D. R.
(2007) NCBI reference sequences (RefSeq): a
curated non-redundant sequence database of
genomes, transcripts and proteins. Nucleic
Acids Res. 35, D61–D65.
12. Benson, D. A., Karsch-Mizrachi, I., Lipman,
D. J., Ostell, J., Wheeler, D. L. (2008)
GenBank. Nucleic Acids Res. 36, D25–D30.
13. The UniProt Consortium. (2010) The uni-
versal protein resource (UniProt) in 2010.
Nucleic Acids Res. 38, D142–D148.
14. Leinonen, R., Diez, F. G., Binns, D.,
Fleischmann, W., Lopez, R., Apweiler, R.
(2004) UniProt archive. Bioinformatics 20,
3236–3237.
15. Suzek, B. E., Huang, H., McGarvey, P.,
Mazumder, R., Wu, C. H. (2007) UniRef:
comprehensive and non-redundant UniProt
reference clusters. Bioinformatics 23,
1282–1288.
16. Yooseph, S., Sutton, G., Rusch, D. B., Halpern,
A. L., Williamson, S. J., Remington, K., Eisen,
J. A., Heidelberg, K. B., Manning, G., Li, W.,
Jaroszewski, L., Cieplak, P., Miller, C. S., Li,
H., Mashiyama, S. T., Joachimiak, M. P., van
Belle, C., Chandonia, J. M., Soergel, D. A.,
Zhai, Y., Natarajan, K., Lee, S., Raphael, B. J.,
Bafna, V., Friedman, R., Brenner, S. E., Godzik,
A., Eisenberg, D., Dixon, J. E., Taylor, S. S.,
Strausberg, R. L., Frazier, M., Venter, J. C.
(2007) The Sorcerer II global ocean sampling
expedition: expanding the universe of protein
families. PLoS Biol. 5, e16.
17. Patient, S., Wieser, D., Kleen, M.,
Kretschmann, E., Martin, M. J., Apweiler, R.
(2008) UniProtJAPI: a remote API for access-
ing UniProt data. Bioinformatics 24,
1321–1322.
18. Nikolskaya, A. N., Arighi, C. N., Huang, H.,
Barker, W. C., Wu, C. H. (2006) PIRSF fam-
ily classification system for protein functional
and evolutionary analysis. Evol. Bioinform.
Online 2, 197–209.
19. Finn, R. D., Tate, J., Mistry, J., Coggill, P. C.,
Sammut, S. J., Hotz, H. R., Ceric, G.,
Forslund, K., Eddy, S. R., Sonnhammer, E.
L., Bateman, A. (2008) The Pfam protein
families database. Nucleic Acids Res. 36,
D281–D288.
20. Wheeler, D. L., Barrett, T., Benson, D. A.,
Bryant, S. H., Canese, K., Chetvernin, V.,
Church, D. M., DiCuccio, M., Edgar, R.,
Federhen, S., Geer, L. Y., Kapustin, Y.,
Khovayko, O., Landsman, D., Lipman, D. J.,
Madden, T. L., Maglott, D. R., Ostell, J.,
Miller, V., Pruitt, K. D., Schuler, G. D.,
Sequeira, E., Sherry, S. T., Sirotkin, K.,
Souvorov, A., Starchenko, G., Tatusov, R. L.,
Tatusova, T. A., Wagner, L., Yaschenko, E.
(2007) Database resources of the National
Center for Biotechnology Information.
Nucleic Acids Res. 35, D5–D12.
21. Bru, C., Courcelle, E., Carrère, S., Beausse,
Y., Dalmar, S., Kahn, D. (2005) The ProDom
database of protein domain families: more
emphasis on 3D. Nucleic Acids Res. 33,
D212–D215.
22. Andreeva, A., Howorth, D., Chandonia, J.
M., Brenner, S. E., Hubbard, T. J., Chothia,
C., Murzin, A. G. (2008) Data growth and its
impact on the SCOP database: new develop-
ments. Nucleic Acids Res. 36, D419–D425.
23. Berman, H., Henrick, K., Nakamura, H.,
Markley, J. L. (2007) The worldwide Protein
Data Bank (wwPDB): ensuring a single, uni-
form archive of PDB data. Nucleic Acids Res.
35, D301–D303.
24. Hulo, N., Bairoch, A., Bulliard, V., Cerutti,
L., Cuche, B. A., de Castro, E, Lachaize, C.,
Langendijk-Genevaux, P. S., Sigrist, C. J.
(2008) The 20 years of PROSITE. Nucleic
Acids Res. 36, D245–D249.
25. Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker,
A., Falquet, L., Pagni, M., Bairoch, A., Bucher,
P. (2002) PROSITE: a documented database
using patterns and profiles as motif descrip-
tors. Brief. Bioinform. 3, 265–274.
26. De Castro, E., Sigrist, C. J. A., Gattiker, A.,
Bulliard, V., Langendijk-Genevaux, P. S.,
Gasteiger, E., Bairoch, A., Hulo, N. (2006)
ScanProsite: detection of PROSITE signature
matches and ProRule-associated functional
and structural residues in proteins. Nucleic
Acids Res. 34, W362–W365.
27. Hunter, S., Apweiler, R., Attwood, T. K.,
Bairoch, A., Bateman, A., Binns, D., Bork, P.,
Das, U., Daugherty, L., Duquenne, L., Finn,
R. D., Gough, J., Haft, D., Hulo, N., Kahn,
D., Kelly, E., Laugraud, A., Letunic, I.,
Lonsdale, D., Lopez, R., Madera, M., Maslen,
J., McAnulla, C., McDowall, J., Mistry, J.,
Mitchell, A., Mulder, N., Natale, D., Orengo,
C., Quinn, A. F., Selengut, J. D., Sigrist, C. J.,
Thimma, M., Thomas, P. D., Valentin, F.,
Wilson, D., Wu, C. H., Yeats, C. (2009)
InterPro: the integrative protein signature data-
base. Nucleic Acids Res. 37, D211–D215.
28. Yeats, C., Lees, J., Reid, A., Kellam, P., Martin,
N., Liu, X., Orengo, C. (2008) Gene3D:
comprehensive structural and functional
annotation of genomes. Nucleic Acids Res.
36, D414–D418.
29. Mi, H., Guo, N., Kejariwal, A., Thomas, P. D.
(2007) PANTHER version 6: protein sequence
and function evolution data with expanded
representation of biological pathways. Nucleic
Acids Res. 35, D247–D252.
30. Attwood, T. K. (2002) The PRINTS data-
base: a resource for identification of protein
families. Brief. Bioinform. 3, 252–263.
31. Letunic, I., Doerks, T., Bork, P. (2009)
SMART 6: recent updates and new develop-
ments. Nucleic Acids Res. 37, D229–D232.
32. Wilson, D., Pethica, R., Zhou, Y., Talbot, C.,
Vogel, C., Madera, M., Chothia, C., Gough,
J. (2009) SUPERFAMILY – sophisticated
comparative genomics, data mining, visualiza-
tion and phylogeny. Nucleic Acids Res. 37,
D380–D386.
33. Haft, D. H., Selengut, J. D., White, O. (2003)
The TIGRFAMs database of protein families.
Nucleic Acids Res. 31, D371–D373.
34. Mulder, N., Apweiler, R. (2007) InterPro and
InterProScan: tools for protein sequence clas-
sification and comparison. Methods Mol Biol.
396, 59–70.
35. Smedley, D., Haider, S., Ballester, B., Holland,
R., London, D., Thorisson, G., Kasprzyk, A.
(2009) BioMart – biological queries made
easy. BMC Genomics 10, 22.
36. Westbrook, J., Ito, N., Nakamura, H.,
Henrick, K., Berman, H. M. (2005) PDBML:
the representation of archival macromolecular
structure data in XML. Bioinformatics 21,
988–992.
37. Cuff, A. L., Sillitoe, I., Lewis, T., Redfern, O.
C., Garratt, R., Thornton, J., Orengo, C. A.
(2009) The CATH classification revisited –
architectures reviewed and new ways to char-
acterize structural divergence in superfamilies.
Nucleic Acids Res. 37, D310–D314.
38. Fulton, K. F., Bate, M. A., Faux, N. G.,
Mahmood, K., Betts, C., Buckle, A. M.
(2007) Protein Folding Database (PFD 2.0):
an online environment for the International
Foldeomics Consortium. Nucleic Acids Res.
35, D304–D307.
39. Maxwell, K. L., Wildes, D., Zarrine-Afsar, A.,
De Los Rios, M. A., Brown, A. G., Friel, C.
T., Hedberg, L., Horng, J. C., Bona, D.,
Miller, E. J., Vallée-Bélisle, A., Main, E. R.,
Bemporad, F., Qiu, L., Teilum, K., Vu, N. D.,
Edwards, A. M., Ruczinski, I., Poulsen, F. M.,
Kragelund, B. B., Michnick, S. W., Chiti, F.,
Bai, Y., Hagen, S. J., Serrano, L., Oliveberg,
M., Raleigh, D. P., Wittung-Stafshede, P.,
Radford, S. E., Jackson, S. E., Sosnick, T. R.,
Marqusee, S., Davidson, A. R., Plaxco, K. W.
(2005) Protein folding: defining a “standard”
set of experimental conditions and a prelimi-
nary kinetic data set of two-state proteins.
Protein Sci. 14, 602–616.
40. Zanzoni, A., Ausiello, G., Via, A., Gherardini,
P. F., Helmer-Citterich, M. (2007) Phospho3D:
a database of three-dimensional structures of
protein phosphorylation sites. Nucleic Acids
Res. 35, D229–D231.
41. Diella, F., Cameron, S., Gemünd, C., Linding,
R., Via, A., Kuster, B., Sicheritz-Pontén, T.,
Blom, N., Gibson, T. J. (2004) Phospho.ELM:
a database of experimentally verified
phosphorylation sites in eukaryotic proteins.
BMC Bioinformatics 5, 79.
42. Aranda, B., Achuthan, P., Alam-Faruque, Y.,
Armean, I., Bridge, A., Derow, C., Feuermann,
M., Ghanbarian, A. T., Kerrien, S., Khadake,
J., Kerssemakers, J., Leroy, C., Menden, M.,
Michaut, M., Montecchi-Palazzi, L.,
Neuhauser, S. N., Orchard, S., Perreau, V.,
Roechert, B., van Eijk, K., Hermjakob, H.
(2010) The IntAct molecular interaction
database in 2010. Nucleic Acids Res. 38,
D525–D531.
43. Orchard, S., Kerrien, S., Jones, P., Ceol, A.,
Chatr-Aryamontri, A., Salwinski, L., Nerothin,
J., Hermjakob, H. (2007) Submit your inter-
action data the IMEx way: a step by step guide
to trouble-free deposition. Proteomics 7 Suppl
1, 28–34.
44. Orchard, S., Salwinski, L., Kerrien, S.,
Montecchi-Palazzi, L., Oesterheld, M.,
Stümpflen, V., Ceol, A., Chatr-aryamontri,
A., Armstrong, J., Woollard, P., Salama, J. J.,
Moore, S., Wojcik, J., Bader, G. D., Vidal, M.,
Cusick, M. E., Gerstein, M., Gavin, A. C.,
Superti-Furga, G., Greenblatt, J., Bader, J.,
Uetz, P., Tyers, M., Legrain, P., Fields, S.,
Mulder, N., Gilson, M., Niepmann, M.,
Burgoon, L., De Las Rivas, J., Prieto, C.,
Perreau, V. M., Hogue, C., Mewes, H. W.,
Apweiler, R., Xenarios, I., Eisenberg, D.,
Cesareni, G., Hermjakob, H. (2007) The
minimum information required for reporting
a molecular interaction experiment (MIMIx).
Nat. Biotechnol. 25, 894–898.
45. Kerrien, S., Orchard, S., Montecchi-Palazzi,
L., Aranda, B., Quinn, A. F., Vinod, N.,
Bader, G. D., Xenarios, I., Wojcik, J., Sherman,
D., Tyers, M., Salama, J. J., Moore, S., Ceol,
A., Chatr-Aryamontri, A., Oesterheld, M.,
Stümpflen, V., Salwinski, L., Nerothin, J.,
Cerami, E., Cusick, M. E., Vidal, M., Gilson,
M., Armstrong, J., Woollard, P., Hogue, C.,
Eisenberg, D., Cesareni, G., Apweiler, R.,
Hermjakob, H. (2007) Broadening the hori-
zon – level 2.5 of the HUPO-PSI format for
molecular interactions. BMC Biol. 5, 44.
46. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D.,
Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K.,
Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P.,
Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese,
J. C., Richardson, J. E., Ringwald, M., Rubin, G. M.,
Sherlock, G. (2000) Gene ontology: tool for
the unification of biology. The Gene Ontology
Consortium. Nat. Genet. 25, 25–29.
47. Matthews, L., Gopinath, G., Gillespie, M.,
Caudy, M., Croft, D., de Bono, B., Garapati, P.,
Hemish, J., Hermjakob, H., Jassal, B., Kanapin,
A., Lewis, S., Mahajan, S., May, B., Schmidt,
E., Vastrik, I., Wu, G., Birney, E., Stein, L.,
D’Eustachio, P. (2009) Reactome knowledge-
base of human biological pathways and pro-
cesses. Nucleic Acids Res. 37, D619–D622.
48. Hucka, M., Finney, A., Sauro, H. M., Bolouri,
H., Doyle, J. C., Kitano, H., Arkin, A. P.,
Bornstein, B. J., Bray, D., Cornish-Bowden,
A., Cuellar, A. A., Dronov, S., Gilles, E. D.,
Ginkel, M., Gor, V., Goryanin, II., Hedley, W.
J., Hodgman, T. C., Hofmeyr, J. H., Hunter,
P. J., Juty, N. S., Kasberger, J. L., Kremling,
A., Kummer, U., Le Novère, N., Loew, L. M.,
Lucio, D., Mendes, P., Minch, E., Mjolsness,
E. D., Nakayama, Y., Nelson, M. R., Nielsen,
P. F., Sakurada, T., Schaff, J. C., Shapiro, B.
E., Shimizu, T. S., Spence, H. D., Stelling, J.,
Takahashi, K., Tomita, M., Wagner, J., Wang,
J., SBML Forum. (2003) The systems biology
markup language (SBML): a medium for rep-
resentation and exchange of biochemical net-
work models. Bioinformatics 19, 524–531.
49. Noy, N. F., Crubezy, M., Fergerson, R. W.,
Knublauch, H., Tu, S. W., Vendetti, J.,
Musen, M. A. (2003) Protégé-2000: an
open-source ontology-development and
knowledge-acquisition environment. AMIA.
Annu Symp Proc. 953.
50. Cline, M. S., Smoot, M., Cerami, E.,
Kuchinsky, A., Landys, N., Workman, C.,
Christmas, R., Avila-Campilo, I., Creech,
M., Gross, B., Hanspers, K., Isserlin, R.,
Kelley, R., Killcoyne, S., Lotia, S., Maere, S.,
Morris, J., Ono, K., Pavlovic, V., Pico, A. R.,
Vailaya, A., Wang, P. L., Adler, A., Conklin,
B. R., Hood, L., Kuiper, M., Sander, C.,
Schmulevich, I., Schwikowski, B., Warner,
G. J., Ideker, T., Bader, G. D. (2007)
Integration of biological networks and gene
expression data using Cytoscape. Nat. Protoc.
2, 2366–2382.
51. Caspi, R., Foerster, H., Fulcher, C. A., Kaipa,
P., Krummenacker, M., Latendresse, M.,
Paley, S., Rhee, S. Y., Shearer, A., Tissier, C.,
Walk, T. C., Zhang, P., Karp, P. D. (2008)
The MetaCyc Database of metabolic pathways
and enzymes and the BioCyc collection of
Pathway/Genome Databases. Nucleic Acids
Res. 36, D623–D631.
52. Hoogland, C., Mostaguir, K., Appel, R. D.,
Lisacek, F. (2008) The World-2DPAGE
Constellation to promote and publish gel-based
proteomics data through the ExPASy server.
J. Proteomics 71, 245–248.
53. Mostaguir, K., Hoogland, C., Binz, P. A.,
Appel, R. D. (2003) The Make 2D-DB II
package: conversion of federated two-dimensional
gel electrophoresis databases into a relational
format and interconnection of distributed
databases. Proteomics 3, 1441–1444.
54. Vizcaíno, J. A., Côté, R., Reisinger, F.,
Barsnes, H., Foster, J. M., Rameseder, J.,
Hermjakob, H., Martens, L. (2009) The pro-
teomics identifications database: 2010 update.
Nucleic Acids Res. 38, D736–D742.
55. Côté, R. G., Jones, P., Martens, L., Kerrien,
S., Reisinger, F., Lin, Q., Leinonen, R.,
Apweiler, R., Hermjakob, H. (2007) The
protein identifier cross-referencing (PICR)
service: reconciling protein identifiers across
multiple source databases. BMC Bioinformatics
8, 401–414.
56. Burgun, A., Bodenreider, O. (2008) Accessing
and integrating data and knowledge for bio-
medical research. Yearb Med Inform.
91–101.
57. Hwang, D., Rust, A. G., Ramsey, S., Smith, J.
J., Leslie, D. M., Weston, A. D., de Atauri, P.,
Aitchison, J. D., Hood, L., Siegel, A. F.,
Bolouri, H. (2005) A data integration meth-
odology for systems biology. Proc. Natl Acad.
Sci. U. S. A. 102, 17296–17301.
58. Mathew, J. P., Taylor, B. S., Bader, G. D.,
Pyarajan, S., Antoniotti, M., Chinnaiyan, A.
M., Sander, C., Burakoff, S. J., Mishra, B.
(2007) From bytes to bedside: data integra-
tion and computational biology for transla-
tional cancer research. PLoS Comput. Biol.
3, e12.
59. McGarvey, P. B., Huang, H., Mazumder, R.,
Zhang, J., Chen, Y., Zhang, C., Cammer, S.,
Will, R., Odle, M., Sobral, B., Moore, M.,
Wu, C. H. (2009) Systems integration of bio-
defense omics data for analysis of pathogen–
host interactions and identification of potential
targets. PLoS One 4, e7162.
60. Stevens, R., Zhao, J., Goble, C. (2007) Using
provenance to manage knowledge of in silico
experiments. Brief. Bioinform. 8, 183–194.
61. Pedrioli, P. G., Eng, J. K., Hubley, R.,
Vogelzang, M., Deutsch, E. W., Raught, B.,
Pratt, B., Nilsson, E., Angeletti, R. H.,
Apweiler, R., Cheung, K., Costello, C. E.,
Hermjakob, H., Huang, S., Julian, R. K.,
Kapp, E., McComb, M. E., Oliver, S. G.,
Omenn, G., Paton, N. W., Simpson, R.,
Smith, R., Taylor, C. F., Zhu, W., Aebersold,
R. (2004) A common open representation
of mass spectrometry data and its applica-
tion to proteomics research. Nat. Biotechnol.
22, 1459–1466.
62. Orchard, S., Montecchi-Palazzi, L., Deutsch,
E. W., Binz, P. A., Jones, A. R., Paton, N.,
Pizarro, A., Creasy, D. M., Wojcik, J.,
Hermjakob, H. (2007) Five years of prog-
ress in the standardization of proteomics
data 4(th) annual spring workshop of the
HUPO-proteomics standards initiative April
23–25, 2007 Ecole Nationale Supérieure
(ENS), Lyon, France. Proteomics 7,
3436–3440.
63. Taylor, C. F., Paton, N. W., Lilley, K. S.,
Binz, P. A., Julian, R. K. Jr, Jones, A. R.,
Zhu, W., Apweiler, R., Aebersold, R.,
Deutsch, E. W., Dunn, M. J., Heck, A. J.,
Leitner, A., Macht, M., Mann, M., Martens,
L., Neubert, T. A., Patterson, S. D., Ping,
P., Seymour, S. L., Souda, P., Tsugita, A.,
Vandekerckhove, J., Vondriska, T. M.,
Whitelegge, J. P., Wilkins, M. R., Xenarios,
I., Yates, J. R. 3rd, Hermjakob, H. (2007)
The minimum information about a proteom-
ics experiment (MIAPE). Nat. Biotechnol. 25,
887–893.
64. Tatusov, R. L., Fedorova, N. D., Jackson, J.
D., Jacobs, A. R., Kiryutin, B., Koonin, E. V.,
Krylov, D. M., Mazumder, R., Mekhedov, S.
L., Nikolskaya, A. N., Rao, B. S., Smirnov, S.,
Sverdlov, A. V., Vasudevan, S., Wolf, Y. I., Yin,
J. J., Natale, D. A. (2003) The COG data-
base: an updated version includes eukaryotes.
BMC Bioinformatics 4, 41–54.
65. Kaplan, N., Sasson, O., Inbar, U., Friedlich, M.,
Fromer, M., Fleischer, H., Portugaly, E., Linial,
N., Linial, M. (2005) ProtoNet 4.0: a hierarchical
classification of one million protein sequences.
Nucleic Acids Res. 33, D216–D218.
66. Marchler-Bauer, A., Anderson, J. B., Chitsaz,
F., Derbyshire, M. K., DeWeese-Scott, C.,
Fong, J. H., Geer, L. Y., Geer, R. C., Gonzales,
N. R., Gwadz, M., He, S., Hurwitz, D. I.,
Jackson, J. D., Ke, Z., Lanczycki, C. J.,
Liebert, C. A., Liu, C., Lu, F., Lu, S.,
Marchler, G. H., Mullokandov, M., Song, J.
S., Tasneem, A., Thanki, N., Yamashita, R. A.,
Zhang, D., Zhang, N., Bryant, S. H. (2009)
CDD: specific functional annotation with the
conserved domain database. Nucleic Acids
Res. 37, D205–D210.
67. Wang, Y., Addess, K. J., Chen, J., Geer, L.
Y., He, J., He, S., Lu, S., Madej, T.,
Marchler-Bauer, A., Thiessen, P. A., Zhang,
N., Bryant, S. H. (2007) MMDB: annotat-
ing protein sequences with Entrez’s
3D-structure database. Nucleic Acids Res.
35, D298–D300.
68. Pieper, U., Eswar, N., Webb, B. M., Eramian,
D., Kelly, L., Barkan, D. T., Carter, H.,
Mankoo, P., Karchin, R., Marti-Renom, M.
A., Davis, F. P., Sali, A. (2009) MODBASE, a
database of annotated comparative protein
structure models and associated resources.
Nucleic Acids Res. 37, D347–D354.
69. Kiefer, F., Arnold, K., Künzli, M., Bordoli, L.,
Schwede, T. (2009) The SWISS-MODEL
repository and associated resources. Nucleic
Acids Res. 37, D387–D392.
70. Bogatyreva, N. S., Osypov, A. A., Ivankov, D.
N. (2009) KineticDB: a database of protein
folding kinetics. Nucleic Acids Res. 37,
D342–D346.
71. Garavelli, J. S. (2004) The RESID database of
protein modifications as a resource and anno-
tation tool. Proteomics 4, 1527–1533.
72. Salwinski, L., Miller, C. S., Smith, A. J., Pettit,
F. K., Bowie, J. U., Eisenberg, D. (2004) The
database of interacting proteins: 2004 update.
Nucleic Acids Res. 32, D449–D451.
73. Breitkreutz, B. J., Stark, C., Reguly, T.,
Boucher, L., Breitkreutz, A., Livstone, M.,
Oughtred, R., Lackner, D. H., Bähler, J.,
Wood, V., Dolinski, K., Tyers, M. (2008) The
BioGRID interaction database: 2008 update.
Nucleic Acids Res. 36, D637–D640.
74. Kanehisa, M., Goto, S. (2000) KEGG: Kyoto
encyclopedia of genes and genomes. Nucleic
Acids Res. 28, 27–30.
75. Tarcea, V. G., Weymouth, T., Ade, A.,
Bookvich, A., Gao, J., Mahavisno, V.,
Wright, Z., Chapman, A., Jayapandian, M.,
Ozgür, A., Tian, Y., Cavalcoli, J., Mirel, B.,
Patel, J., Radev, D., Athey, B., States, D.,
Jagadish, H. V. (2009) Michigan molecular
interactions r2: from interacting proteins
to pathways. Nucleic Acids Res. 37,
D642–D646.
76. Craig, R., Cortens, J. C., Fenyo, D., Beavis,
R. C. (2006) Using annotated peptide mass
spectrum libraries for protein identification.
J. Proteome Res. 5, 1843–1849.
77. Deutsch, E. W., Lam, H., Aebersold, R.
(2008) PeptideAtlas: a resource for target
selection for emerging targeted proteomics
workflows. EMBO Rep. 9, 429–434.
78. Slotta, D. J., Barrett, T., Edgar, R. (2009)
NCBI peptidome: a new public repository for
mass spectrometry peptide identifications.
Nat. Biotechnol. 27, 600–601.
Chapter 2
A Guide to UniProt for Protein Scientists
Claire O’Donovan and Rolf Apweiler
Abstract
One of the essential requirements of the proteomics community is a high quality annotated nonredundant
protein sequence database with stable identifiers and an archival service to enable protein identification
and characterization. The scope of this chapter is to illustrate how the Universal Protein Resource (UniProt)
(The UniProt Consortium, Nucleic Acids Res. 38:D142–D148, 2010) can be best utilized for proteomics
purposes with a particular focus on exploiting the knowledge captured in the UniProt databases, the
services provided and the availability of complete proteomes.
Key words: Protein sequence database, Annotation, Stable identifiers, Complete proteome, Archive,
Nonredundant
1. Introduction
The proteomics community has evolved intensively over the last decade,
but one constant is the need to identify the resulting proteins and
their potential functions. This requires a nonredundant protein
sequence database with maximal coverage, including splice isoforms,
disease variants, and posttranslational modifications. Sequence
archiving is essential for interpreting and maintaining proteomic
results. Stable identifiers, consistent nomenclature, and controlled
vocabularies are highly beneficial for protein identification. The
last, but by no means least, requirement is the provision of detailed
information on protein function, biological processes, and molecular
interactions and pathways, cross-referenced to appropriate external
sources. In this chapter, we show how the Universal Protein Resource
fulfils these criteria.
Cathy H. Wu and Chuming Chen (eds.), Bioinformatics for Comparative Proteomics, Methods in Molecular Biology, vol. 694,
DOI 10.1007/978-1-60761-977-2_2, © Springer Science+Business Media, LLC 2011
2. Materials
The mission of the Universal Protein Resource (UniProt) is to provide
the scientific community with a comprehensive, high-quality, and
freely accessible resource of protein sequence and functional
information, which is essential for modern biological research.
UniProt is produced by the UniProt Consortium, which consists of
groups from the European Bioinformatics Institute (EBI), the Protein
Information Resource (PIR), and the Swiss Institute of Bioinformatics
(SIB). Its activities are mainly supported by the National Institutes
of Health (NIH), with additional funding from the European Commission
and the Swiss Federal Government.
UniProt has five components optimized for different uses. The UniProt
Knowledgebase (UniProtKB) (1) is an expertly curated database, a
central access point for integrated protein information with
cross-references to multiple sources. The UniProt Archive (UniParc)
(2) is a comprehensive sequence repository, reflecting the history of
all protein sequences. UniProt Reference Clusters (UniRef) (3) merge
closely related sequences based on sequence identity to speed up
searches, whereas the UniProt Metagenomic and Environmental Sequences
database (UniMES) was created to respond to the expanding area of
metagenomic data. The UniProtKB Sequence/Annotation Version Archive
(UniSave) is the UniProtKB protein entry archive, which contains all
versions of each protein entry (Fig. 1).
Fig. 1. UniProt databases.
2.1. The UniProt Archive
UniParc is the main sequence storehouse: a comprehensive repository
that reflects the history of all protein sequences. UniParc contains
all new and revised protein sequences from all publicly available
sources (http://www.uniprot.org/help/uniparc) to ensure that complete
coverage is available at a single site. To avoid redundancy, all
sequences that are 100% identical over their entire length are merged,
regardless of source organism. New and updated sequences are loaded
daily, cross-referenced to the source database accession number, and
given a sequence version that increments whenever the underlying
sequence changes. The basic information stored within each UniParc
entry is the identifier, the sequence, a cyclic redundancy check
number, the source database(s) with accession and version numbers, and
a timestamp. If a UniParc entry lacks a cross-reference to a UniProtKB
entry, the reason for its exclusion from UniProtKB is provided (e.g.,
pseudogene). In addition, each source database accession number is
tagged with its status in that database, indicating whether the
sequence still exists or has been deleted in the source database, and
with cross-references to NCBI GI and TaxId if appropriate.
2.2. The UniProt Knowledgebase
UniProtKB consists of two sections, UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL. The former contains manually annotated records with
information extracted from the literature and curator-evaluated
computational analysis. Annotation is done by biologists with specific
expertise to achieve accuracy. In UniProtKB/Swiss-Prot, annotation
consists of the description of the following: function(s),
enzyme-specific information, biologically relevant domains and sites,
posttranslational modifications, subcellular location(s), tissue
specificity, developmentally specific expression, structure,
interactions, splice isoform(s), and associated diseases,
deficiencies, or abnormalities. The UniProt Knowledgebase aims to
describe, in a single record, all protein products derived from a
given gene in a given species. After inspecting the sequences, the
curator selects the reference sequence, performs the corresponding
merging, and lists the splice and genetic variants along with disease
information when available. As a result, not only does the whole
record have an accession number, but each protein form derived by
alternative splicing, proteolytic cleavage, or posttranslational
modification also has a unique identifier. The freely available tool
VARSPLIC (4) enables the recreation of all annotated splice variants
from the feature table of a UniProt Knowledgebase entry, or for the
complete database. A FASTA-formatted file containing all splice
variants annotated in the UniProt Knowledgebase can be downloaded for
use with similarity search programs.
UniProtKB/TrEMBL contains high-quality computationally analyzed
records enriched with automatic annotation and classification. The
computer-assisted annotation is created using both automatically
generated rules and manually curated rules (UniRule) based on protein
families (5–8). UniProtKB/TrEMBL contains the translations of all
coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide
Sequence Databases and, with some defined exclusions, Arabidopsis
thaliana sequences from The Arabidopsis Information Resource (TAIR)
(9), yeast sequences from the Saccharomyces Genome Database (SGD)
(10), and Homo sapiens sequences from the Ensembl database (11). It
will soon be extended to include other Ensembl organism sets and
RefSeq records. Records are selected for full manual annotation and
integration into UniProtKB/Swiss-Prot according to defined annotation
priorities.
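The recreation of splice variants from a feature table, mentioned earlier in this section, can be pictured with a short Python sketch. It is not the VARSPLIC implementation: the edit representation (1-based inclusive start and end on the reference sequence plus a replacement string, where an empty string means the region is missing), the function names, and the example sequence and isoform names are all assumptions made for illustration.

```python
def apply_var_seq(canonical, edits):
    """Build one isoform sequence from the reference (canonical) sequence.

    edits: iterable of (start, end, replacement) tuples using 1-based,
    inclusive coordinates on the canonical sequence. An empty replacement
    means the region is absent from the isoform. (Hypothetical layout.)
    """
    seq = canonical
    previous_start = len(canonical) + 1
    # Apply edits from the C-terminal end so earlier coordinates stay valid.
    for start, end, replacement in sorted(edits, reverse=True):
        if not (1 <= start <= end < previous_start):
            raise ValueError("edits must be in range and non-overlapping")
        seq = seq[:start - 1] + replacement + seq[end:]
        previous_start = start
    return seq


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


canonical = "MKTAYIAKQRQISFVKSHFSRQ"   # made-up reference sequence
isoforms = {
    "EXAMPLE-1": [],                            # canonical form, no edits
    "EXAMPLE-2": [(5, 8, "")],                  # residues 5-8 missing
    "EXAMPLE-3": [(1, 3, "MA"), (20, 22, "")],  # N-terminal swap, C-terminal loss
}

for name, edits in isoforms.items():
    print(to_fasta(name, apply_var_seq(canonical, edits)), end="")
```

Applying the edits from the end of the sequence backwards avoids having to shift coordinates after each substitution, which keeps the sketch short.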
Integration between the three types of sequence-related databases
(nucleic acid sequences, protein sequences, and protein tertiary
structures), as well as with specialized data collections, is
important for UniProt users. UniProtKB is currently cross-referenced
with more than ten million links to 114 different databases with
regular update cycles. This extensive network of cross-references
allows UniProt to act as a focal point of biomolecular database
interconnectivity. All cross-referenced databases are documented at
http://www.uniprot.org/docs/dbxref and, if appropriate, are included
in the UniProt ID mapping tool at http://www.uniprot.org/help/mapping,
with the file for download at
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping.
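A downloaded ID mapping file can be used to translate external identifiers into UniProtKB accessions. The sketch below assumes a tab-separated layout with three columns per line (UniProtKB accession, identifier type, external identifier); check this against the README shipped with the download before relying on it. The function name and the sample rows are hypothetical, and the sample is read from memory so that the example runs without any download.

```python
import io
from collections import defaultdict


def load_id_mapping(handle, wanted_types=None):
    """Map (id_type, external_id) to the set of UniProtKB accessions.

    Assumes three tab-separated columns per line: accession, identifier
    type, external identifier. This layout is an assumption of the sketch.
    """
    external_to_uniprot = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        accession, id_type, external_id = line.split("\t")
        if wanted_types is None or id_type in wanted_types:
            external_to_uniprot[(id_type, external_id)].add(accession)
    return external_to_uniprot


# Made-up rows standing in for a real mapping file.
sample = io.StringIO(
    "ACC0001\tRefSeq\tNP_000001.1\n"
    "ACC0001\tGeneID\t1111\n"
    "ACC0002\tGeneID\t1111\n"
)

mapping = load_id_mapping(sample, wanted_types={"GeneID"})
print(sorted(mapping[("GeneID", "1111")]))
```

Keeping a set of accessions per external identifier reflects that one external record, such as a gene, can correspond to several UniProtKB entries.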
2.3. The UniProt Reference Clusters
UniRef provides clustered sets of all sequences from the UniProt
Knowledgebase (including splice forms as separate entries) and
selected records from the UniProt Archive to achieve complete coverage
of sequence space at identity levels of 100, 90, and 50% while hiding
redundant sequences (3). The UniRef clusters are generated
hierarchically: the UniRef100 database combines identical sequences
and subfragments into a single UniRef entry, UniRef90 is built from
UniRef100 clusters, and UniRef50 is built from UniRef90 clusters. Each
member sequence can exist in only one UniRef cluster at each identity
level and has only one parent or child cluster at another identity
level. UniRef100, UniRef90, and UniRef50 yield database size
reductions of ~10, 40, and 70%, respectively. Each cluster record
contains source database, protein name, and taxonomy information for
each member sequence but is represented by a single selected
representative protein sequence and name; the number of members and
the lowest common taxonomy node for the membership are also included.
The cluster representative is selected automatically by an algorithm
that accounts for (1) Quality of entry annotation: the order of
preference is a member from UniProtKB/Swiss-Prot, then
UniProtKB/TrEMBL, then UniParc; (2) Meaningful name: members with
protein names that do not contain words such as “hypothetical” or
“probable” are preferred; (3) Organism: members from model organisms
are preferred; (4) Sequence length: the longest sequence is preferred.
UniRef100 is one of the most comprehensive and nonredundant protein
sequence datasets available. The reduced size of the UniRef90 and
UniRef50 datasets provides faster sequence similarity searches and
reduces the research bias in similarity searches by providing a more
even sampling of sequence space.
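The four selection criteria for a cluster representative can be expressed as a sort key. The following Python sketch treats them as a strict priority order, which is an assumption: the text lists the criteria but does not say how they are weighted. The `Member` class, the list of model organisms, and the example cluster are likewise hypothetical.

```python
from dataclasses import dataclass

# Preference order for criterion (1), as listed in the text.
SOURCE_RANK = {"UniProtKB/Swiss-Prot": 0, "UniProtKB/TrEMBL": 1, "UniParc": 2}
# Words named in criterion (2).
UNINFORMATIVE_WORDS = ("hypothetical", "probable")
# Illustrative set only; the real list of model organisms is not given here.
MODEL_ORGANISMS = {"Homo sapiens", "Mus musculus", "Saccharomyces cerevisiae"}


@dataclass
class Member:
    member_id: str
    source: str
    name: str
    organism: str
    sequence: str


def preference_key(member):
    """Smaller tuples are preferred; criteria are compared in order (1)-(4)."""
    uninformative = any(w in member.name.lower() for w in UNINFORMATIVE_WORDS)
    return (
        SOURCE_RANK[member.source],               # (1) annotation quality
        uninformative,                            # (2) meaningful name
        member.organism not in MODEL_ORGANISMS,   # (3) model organism
        -len(member.sequence),                    # (4) longest sequence
    )


def pick_representative(members):
    return min(members, key=preference_key)


cluster = [
    Member("m1", "UniProtKB/TrEMBL", "Kinase", "Homo sapiens", "M" * 300),
    Member("m2", "UniProtKB/Swiss-Prot", "Probable kinase", "Danio rerio", "M" * 200),
    Member("m3", "UniProtKB/Swiss-Prot", "Protein kinase X", "Mus musculus", "M" * 250),
    Member("m4", "UniProtKB/Swiss-Prot", "Protein kinase X", "Mus musculus", "M" * 280),
]

print(pick_representative(cluster).member_id)
```

With a tuple key, a later criterion only matters when all earlier ones tie, so the longest TrEMBL member (m1) still loses to a shorter Swiss-Prot member.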
2.4. The UniProt Metagenomic and Environmental Sequences
The UniProt Knowledgebase contains entries with a known taxonomic
source. However, the expanding area of metagenomic data has
necessitated the creation of a separate database, the UniProt
Metagenomic and Environmental Sequences database (UniMES). UniMES
currently contains data from the Global Ocean Sampling Expedition
(GOS), which predicts nearly six million proteins, primarily from
oceanic microbes. By combining the predicted protein sequences with
automatic classification by InterPro, the integrated resource for
protein families, domains, and functional sites, UniMES uniquely
provides free access to the array of genomic information gathered.
2.5. The UniProtKB Sequence/Annotation Version Archive
UniSave is a repository of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL
entry versions. It provides the backend to the UniProtKB entry history
service (Fig. 2) and is also available as a standalone service at
http://www.ebi.ac.uk/uniprot/unisave.
Fig. 2. UniSave link.
These descriptions of our databases should illustrate that UniProt
provides a high-quality, annotated, nonredundant database with maximal
coverage and sequence archiving.
3. Methods
This section describes particular features of the UniProt activities
that fulfill the proteomics community’s requirements for detailed
information on protein function, biological processes, molecular
interactions, and pathways, cross-referenced to appropriate external
sources, and for stable identifiers, consistent nomenclature, and
controlled vocabularies.
3.1. Protein Annotation
UniProtKB consists of two sections, Swiss-Prot and TrEMBL.
UniProtKB/Swiss-Prot contains manually annotated records
with information extracted from literature and curator-evaluated
computational analysis. Manual annotation consists of a critical
review of experimentally proven or computer-predicted data
about each protein. An essential aspect of the annotation protocol
is the use of official nomenclatures and controlled vocabularies
that facilitate consistent and accurate identification (Fig. 3).
Annotation consists of the description of the following:
function(s), enzyme-specific information, biologically relevant
domains and sites, posttranslational modifications, subcellular
location(s), tissue specificity, developmentally specific expression,
structure, interactions, splice isoform(s), and associated diseases,
deficiencies, or abnormalities (Fig. 4).
Another important part of the annotation process involves
merging different reports for a single protein. After inspecting
the sequences, the curator selects the reference sequence,
performs the corresponding merging, and lists the splice and genetic
variants along with disease information when available (Fig. 5).
Data are continuously updated by an expert team of biologists.
Fig. 3. UniProt nomenclature.
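The annotation categories listed in Section 3.1 appear in the UniProtKB flat-file format as comment (CC) lines, each topic introduced by `-!- TOPIC:` and continued on indented CC lines. The following is a minimal sketch of collecting those topics into a dictionary; the entry fragment is hypothetical toy data, and `parse_comments` is a name introduced here for illustration, not part of any UniProt software.

```python
import re
from collections import defaultdict

# Hypothetical flat-file fragment (toy data) using the CC comment-line layout.
ENTRY = """\
ID   TOY1_HUMAN              Reviewed;         120 AA.
CC   -!- FUNCTION: Binds ATP and transfers a phosphate group to
CC       target proteins.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm.
CC   -!- TISSUE SPECIFICITY: Expressed in liver.
"""

# A topic starts with 'CC', three spaces, and '-!- TOPIC NAME:'.
TOPIC_START = re.compile(r"^CC   -!- ([A-Z][A-Z ]*[A-Z]): ?(.*)$")
# A continuation line is 'CC' followed by at least four spaces of indentation.
CONTINUATION = re.compile(r"^CC\s{4,}(\S.*)$")


def parse_comments(entry: str) -> dict[str, list[str]]:
    """Group CC comment text by topic, joining wrapped continuation lines."""
    topics: dict[str, list[str]] = defaultdict(list)
    current = None
    for line in entry.splitlines():
        start = TOPIC_START.match(line)
        if start:
            current = start.group(1)
            topics[current].append(start.group(2).strip())
            continue
        cont = CONTINUATION.match(line)
        if cont and current is not None:
            topics[current][-1] += " " + cont.group(1).strip()
        else:
            current = None  # any other line ends the current topic
    return dict(topics)


comments = parse_comments(ENTRY)
for topic, texts in comments.items():
    print(topic, "->", texts)
```

Storing a list per topic allows for entries in which the same topic occurs more than once.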
3.2. The Gene Ontology Consortium and UniProt
To promote database interoperability and provide consistent
annotation, the UniProt Consortium is a key member of the
Gene Ontology Consortium (12) and benefits from the presence
of the GO editorial office at the EBI. UniProt curators will
continue to assign Gene Ontology (GO) terms to the gene products
in UniProtKB during the UniProt manual curation process.
UniProtKB also profits from GO annotation carried out by other
GO Consortium members. Currently we include manual GO
annotations from 19 GO Consortium annotation groups, and we
further supplement this with high-quality annotations from other
manual annotation sources (including the Human Protein Atlas
and LIFEdb). In addition to this manually curated GO annotation,
automatic GO annotation pipelines exist and will be further
developed to ensure that the functional knowledge supplied by
various UniProtKB ontologies, Ensembl orthology data, and
InterPro matches is fully exploited to provide a high-quality,
comprehensive set of GO annotation predictions for all UniProtKB
entries.
3.3. Cross-references to External Sources
One challenge in life sciences research is the ability to integrate
and exchange data coming from multiple research groups. The
UniProt Consortium is committed to fostering interaction and
exchange with the scientific community, ensuring wide access to
UniProt resources, and promoting interoperability between
resources. An essential component of this interoperability is the
provision of cross-references to these resources in UniProt entries
(Fig. 6).
Fig. 4. Protein annotation.
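In the UniProtKB flat-file format, the cross-references of Section 3.3 are carried on DR lines, and the GO annotations of Section 3.2 appear among them with an evidence code. The sketch below, offered under stated assumptions, groups DR lines by database and separates GO terms whose evidence code is IEA (Inferred from Electronic Annotation) from the rest. The entry fragment is illustrative toy data, and both function names are hypothetical.

```python
from collections import defaultdict

# Illustrative DR lines (toy fragment, not a real entry).
ENTRY = """\
DR   InterPro; IPR000719; Prot_kinase_dom.
DR   GO; GO:0005737; C:cytoplasm; IDA:UniProtKB.
DR   GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
"""


def parse_cross_references(entry: str) -> dict[str, list[tuple[str, list[str]]]]:
    """Map each external database name to its (identifier, extra fields) pairs."""
    xrefs = defaultdict(list)
    for line in entry.splitlines():
        if not line.startswith("DR   "):
            continue
        # Fields are separated by ';' and the line ends with a period.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        database, identifier, *rest = fields
        xrefs[database].append((identifier, rest))
    return dict(xrefs)


def split_go_by_evidence(xrefs):
    """Split GO cross-references into non-IEA and IEA (electronic) annotations."""
    manual, electronic = [], []
    for go_id, rest in xrefs.get("GO", []):
        term, evidence = rest[0], rest[1]
        code = evidence.split(":")[0]  # e.g. 'IDA' from 'IDA:UniProtKB'
        target = electronic if code == "IEA" else manual
        target.append((go_id, term, code))
    return manual, electronic


xrefs = parse_cross_references(ENTRY)
manual, electronic = split_go_by_evidence(xrefs)
print(sorted(xrefs))
print("manual:", manual)
print("electronic:", electronic)
```

Treating every non-IEA code as manual is a simplification made for this example; a real analysis would inspect the individual evidence codes.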
3.4. Nonredundant Complete UniProt Proteome Sets
UniProt constructs complete nonredundant proteome sets. Each
set and its analysis are made available shortly after the appearance
of a new complete genome sequence in the nucleotide sequence
databases. A standard procedure is used to create, from the
UniProtKB, proteome sets for bacterial, archaeal and some
eukaryotic genomes. Proteome sets for certain metazoan genomes are
Fig. 5. Sequence annotation.
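The text does not spell out the standard procedure used to build these sets. As a simplified picture of what "nonredundant" means, the sketch below collapses records with identical sequences into a single entry that keeps all contributing identifiers. Exact sequence identity as the merging criterion is an assumption made for illustration, not UniProt's documented procedure, and the FASTA records and function names are hypothetical.

```python
# Hypothetical FASTA input (toy sequences); toyA and toyB carry the same sequence.
FASTA = """\
>toyA hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>toyB same protein reported separately
MKTAYIAKQRQISFVKSHFS
>toyC distinct protein
MSTNPKPQRK
"""


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (identifier, sequence) pairs, joining wrapped sequence lines."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def collapse_identical(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each distinct sequence to the identifiers that reported it."""
    merged: dict[str, list[str]] = {}
    for identifier, sequence in records:
        merged.setdefault(sequence, []).append(identifier)
    return merged


proteome = collapse_identical(parse_fasta(FASTA))
for sequence, identifiers in proteome.items():
    print(identifiers, len(sequence))
```

Keeping the list of merged identifiers preserves the link back to each original report, which matters when the source records carry different annotation.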
  • 5. Methods in Molecular Biology™ Series Editor John M. Walker, School of Life Sciences, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB, UK. For other titles published in this series, go to www.springer.com/series/7651
  • 7. Bioinformatics for Comparative Proteomics. Edited by Cathy H. Wu, Department of Computer and Information Sciences, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA; Chuming Chen, Department of Computer and Information Sciences, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
  • 8. Editors Cathy H. Wu, Ph.D. Center for Bioinformatics and Computational Biology University of Delaware 15 Innovation Way, Suite 205 Newark, DE 19711 USA wuc@dbi.udel.edu Chuming Chen, Ph.D. Center for Bioinformatics and Computational Biology University of Delaware 15 Innovation Way, Suite 205 Newark, DE 19711 USA chenc@dbi.udel.edu ISSN 1064-3745 e-ISSN 1940-6029 ISBN 978-1-60761-976-5 e-ISBN 978-1-60761-977-2 DOI 10.1007/978-1-60761-977-2 Springer New York Dordrecht Heidelberg London © Springer Science+Business Media, LLC 2011 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper Humana Press is part of Springer Science+Business Media (www.springer.com)
  • 9. Preface
With the rapid development of proteomic technologies in life sciences and in clinical applications, many bioinformatics methodologies, databases, and software tools have been developed to support comparative proteomics study. This volume aims to highlight the current status, challenges, open problems, and future trends in developing bioinformatics tools and resources for comparative proteomics research and to serve as a definitive source of reference providing both the breadth and depth needed on the subject of Bioinformatics for Comparative Proteomics. The volume is structured to introduce three major areas of research methods: (1) basic bioinformatics frameworks related to comparative proteomics, (2) bioinformatics databases and tools for proteomics data analysis, and (3) integrated bioinformatics systems and approaches for studying comparative proteomics in the systems biology context. Part I (Bioinformatics Framework for Comparative Proteomics) consists of seven chapters: Chapter 1 presents a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research. Chapter 2 provides a practical guide to the comparative proteomics community for exploiting the knowledge captured from and the services provided in UniProt databases. Chapter 3 introduces the InterPro protein classification system for automatic protein annotation and reviews the signature methods used in the InterPro database. Chapter 4 introduces the Reactome Knowledgebase that provides an integrated view of the molecular details of human biological processes. Chapter 5 introduces eFIP (extraction of Functional Impact of Phosphorylation), a Web-based text mining system that can aid scientists in quickly finding abstracts from literature related to the phosphorylation (including site and kinase), interactions, and functional aspects of a given protein.
Chapter 6 presents a tutorial for the Protein Ontology (PRO) Web resources to help researchers in their proteomic studies by providing key information about protein diversity in terms of evolutionary-related protein classes based on full-length sequence conservation and the various protein forms that arise from a gene along with the specific functional annotation. Chapter 7 describes a method for the annotation of functional residues within experimentally uncharacterized proteins using position-specific site annotation rules derived from structural and experimental information. Part II (Proteomic Bioinformatics) consists of ten chapters: Chapter 8 describes how the detailed understanding of information value of mass spectrometry-based proteomics data can be elucidated by performing simulations using synthetic data. Chapter 9 describes the concepts, prerequisites, and methods required to analyze a shotgun proteomics data set using a tandem mass spectrometry search engine.
  • 10. Chapter 10 presents computational methods for quantification and comparison of peptides by label-free LC–MS analysis, including data preprocessing, multivariate statistical methods, and detection of differential protein expression. Chapter 11 proposes an alternative to MS/MS spectrum identification by combining the uninterpreted MS/MS spectra from overlapping peptides and then determining the consensus identifications for sets of aligned MS/MS spectra. Chapter 12 describes the Trans-Proteomic Pipeline, a freely available open-source software suite that provides uniform analysis of LC–MS/MS data from raw data to quantified sample proteins. Chapter 13 provides an overview of a set of open-source software tools and steps involved in ELISA microarray data analysis. Chapter 14 presents the state of the art on the Proteomics Databases and Repositories. Chapter 15 is a brief guide to preparing both large- and small-scale protein interaction data for publication. Chapter 16 demonstrates a new graphical user interface tool called PRIDE Converter, which greatly simplifies the submission of MS data to the PRIDE database for submitted proteomics manuscripts. Chapter 17 presents a method for describing a protein’s posttranslational modifications by integrating the top–down and bottom–up MS data using the Protein Inference Engine. Chapter 18 describes an integrated top–down and bottom–up approach facilitated by concurrent liquid chromatography–mass spectrometry analysis and fraction collection for comprehensive high-throughput intact protein profiling. Part III (Comparative Proteomics in Systems Biology) consists of four chapters: Chapter 19 gives an overview of the content and usage of the PhosphoPep database, which supports systems biology signaling research by providing interactive interrogation of MS-derived phosphorylation data from four different organisms.
Chapter 20 describes “omics” data integration to map a list of identified proteins to a common representation of the protein and uses the related structural, functional, genetic, and disease information for functional categorization and pathway mapping. Chapter 21 describes a knowledge-based approach relying on existing metabolic pathway information and a direct data-driven approach for a metabolic pathway-centric integration of proteomics and metabolomics data. Chapter 22 provides a detailed description of a method used to study temporal changes in the endoplasmic reticulum (ER) proteome of fibroblast cells exposed to ER stress agents (tunicamycin and thapsigargin). This volume targets the readers who wish to learn about state-of-the-art bioinformatics databases and tools, novel computational methods and future trends in proteomics data analysis, and comparative proteomics in systems biology. The audience may range from graduate students embarking upon a research project, to practicing biologists working on proteomics and systems biology research, and to bioinformaticians developing advanced databases, analysis tools, and integrative systems. With its interdisciplinary nature, this volume is expected to find a broad audience in biotechnology and pharmaceutical companies and in various academic departments in biological and medical sciences (such as biochemistry, molecular biology, protein chemistry, and genomics) and computational sciences and engineering (such as bioinformatics and computational biology, computer science, and biomedical engineering).
  • 11. We thank all the authors and coauthors who contributed to this volume. We thank our series editor, Dr. John M. Walker, for reviewing all the chapter manuscripts and providing constructive comments. We also thank Dr. Winona C. Barker from Georgetown University for reviewing the manuscripts. We thank Dr. Qinghua Wu for proofreading the book draft. Finally, we would like to extend our thanks to David C. Casey and Anne Meagher of Springer US, and Jeya Ruby and Ravi Amina of SPi, for their help in the compilation of this book. Newark, DE, USA Cathy H. Wu and Chuming Chen
  • 13. Contents
Preface
Contributors
Part I: Bioinformatics Framework for Comparative Proteomics
1 Protein Bioinformatics Databases and Resources
   Chuming Chen, Hongzhan Huang, and Cathy H. Wu
2 A Guide to UniProt for Protein Scientists
   Claire O’Donovan and Rolf Apweiler
3 InterPro Protein Classification
   Jennifer McDowall and Sarah Hunter
4 Reactome Knowledgebase of Human Biological Pathways and Processes
   Peter D’Eustachio
5 eFIP: A Tool for Mining Functional Impact of Phosphorylation from Literature
   Cecilia N. Arighi, Amy Y. Siu, Catalina O. Tudor, Jules A. Nchoutmboube, Cathy H. Wu, and Vijay K. Shanker
6 A Tutorial on Protein Ontology Resources for Proteomic Studies
   Cecilia N. Arighi
7 Structure-Guided Rule-Based Annotation of Protein Functional Sites in UniProt Knowledgebase
   Sona Vasudevan, C.R. Vinayaka, Darren A. Natale, Hongzhan Huang, Robel Y. Kahsay, and Cathy H. Wu
Part II: Proteomic Bioinformatics
8 Modeling Mass Spectrometry-Based Protein Analysis
   Jan Eriksson and David Fenyö
9 Protein Identification from Tandem Mass Spectra by Database Searching
   Nathan J. Edwards
10 LC-MS Data Analysis for Differential Protein Expression Detection
   Rency S. Varghese and Habtom W. Ressom
11 Protein Identification by Spectral Networks Analysis
   Nuno Bandeira
12 Software Pipeline and Data Analysis for MS/MS Proteomics: The Trans-Proteomic Pipeline
   Andrew Keller and David Shteynberg
13 Analysis of High-Throughput ELISA Microarray Data
   Amanda M. White, Don S. Daly, and Richard C. Zangar
14 Proteomics Databases and Repositories
   Lennart Martens
15 Preparing Molecular Interaction Data for Publication
   Sandra Orchard and Henning Hermjakob
16 Submitting Proteomics Data to PRIDE Using PRIDE Converter
   Harald Barsnes, Juan Antonio Vizcaíno, Florian Reisinger, Ingvar Eidhammer, and Lennart Martens
17 Automated Data Integration and Determination of Posttranslational Modifications with the Protein Inference Engine
   Stuart R. Jefferys and Morgan C. Giddings
18 An Integrated Top-Down and Bottom-Up Strategy for Characterization of Protein Isoforms and Modifications
   Si Wu, Nikola Tolić, Zhixin Tian, Errol W. Robinson, and Ljiljana Paša-Tolić
Part III: Comparative Proteomics in Systems Biology
19 Phosphoproteome Resource for Systems Biology Research
   Bernd Bodenmiller and Ruedi Aebersold
20 Protein-Centric Data Integration for Functional Analysis of Comparative Proteomics Data
   Peter B. McGarvey, Jian Zhang, Darren A. Natale, Cathy H. Wu, and Hongzhan Huang
21 Integration of Proteomic and Metabolomic Profiling as well as Metabolic Modeling for the Functional Analysis of Metabolic Networks
   Patrick May, Nils Christian, Oliver Ebenhöh, Wolfram Weckwerth, and Dirk Walther
22 Time Series Proteome Profiling
   Catherine A. Formolo, Michelle Mintz, Asako Takanohashi, Kristy J. Brown, Adeline Vanderver, Brian Halligan, and Yetrib Hathout
Index
  • 15. Contributors
Ruedi Aebersold • Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
Rolf Apweiler • The European Bioinformatics Institute, Cambridge, UK
Cecilia N. Arighi • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Nuno Bandeira • Center for Computational Mass Spectrometry, University of California, San Diego, La Jolla, CA, USA
Harald Barsnes • Department of Informatics, University of Bergen, Bergen, Norway
Bernd Bodenmiller • Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
Kristy J. Brown • Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Chuming Chen • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Nils Christian • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
Don S. Daly • Pacific Northwest National Laboratory, Richland, WA, USA
Peter D’Eustachio • Department of Biochemistry, New York University School of Medicine, New York, NY, USA
Oliver Ebenhöh • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
Nathan J. Edwards • Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, USA
Ingvar Eidhammer • Department of Informatics, University of Bergen, Bergen, Norway
Jan Eriksson • Swedish University of Agricultural Sciences, Uppsala, Sweden
David Fenyö • The Rockefeller University, New York, NY, USA
Catherine A. Formolo • Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Morgan C. Giddings • Departments of Microbiology & Immunology and Biomedical Engineering, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Brian Halligan • Bioinformatics, Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, WI, USA
Yetrib Hathout • Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Henning Hermjakob • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Hongzhan Huang • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Sarah Hunter • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Stuart R. Jefferys • Department of Bioinformatics & Computational Biology, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Robel Y. Kahsay • DuPont Central Research & Development, Wilmington, DE, USA
Andrew Keller • Rosetta Biosoftware, Seattle, WA, USA
Lennart Martens • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Patrick May • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
Jennifer McDowall • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Peter B. McGarvey • Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, USA
Michelle Mintz • Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Darren A. Natale • Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, USA
Jules A. Nchoutmboube • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Claire O’Donovan • The European Bioinformatics Institute, Cambridge, UK
Sandra Orchard • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Ljiljana Paša-Tolić • Pacific Northwest National Laboratory, Richland, WA, USA
Florian Reisinger • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Habtom W. Ressom • Department of Oncology, Georgetown University Medical Center, Washington, DC, USA
Errol W. Robinson • Pacific Northwest National Laboratory, Richland, WA, USA
Vijay K. Shanker • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
David Shteynberg • Institute for Systems Biology, Seattle, WA, USA
Amy Y. Siu • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Asako Takanohashi • Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Zhixin Tian • Pacific Northwest National Laboratory, Richland, WA, USA
Nikola Tolić • Pacific Northwest National Laboratory, Richland, WA, USA
Catalina O. Tudor • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Adeline Vanderver • Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
Rency S. Varghese • Department of Oncology, Georgetown University Medical Center, Washington, DC, USA
Sona Vasudevan • Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, USA
C.R. Vinayaka • Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, USA
Juan Antonio Vizcaíno • EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
Dirk Walther • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
Wolfram Weckwerth • Molecular Systems Biology, University of Vienna, Vienna, Austria
Amanda M. White • Pacific Northwest National Laboratory, Richland, WA, USA
Cathy H. Wu • Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Si Wu • Pacific Northwest National Laboratory, Richland, WA, USA
Richard C. Zangar • Pacific Northwest National Laboratory, Richland, WA, USA
Jian Zhang • Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, USA
  • 19. Part I Bioinformatics Framework for Comparative Proteomics
  • 20. Cathy H. Wu and Chuming Chen (eds.), Bioinformatics for Comparative Proteomics, Methods in Molecular Biology, vol. 694, DOI 10.1007/978-1-60761-977-2_1, © Springer Science+Business Media, LLC 2011
Chapter 1
Protein Bioinformatics Databases and Resources
Chuming Chen, Hongzhan Huang, and Cathy H. Wu
Abstract
In the past decades, a variety of publicly available data repositories and resources have been developed to support protein-related information management, data-driven hypothesis generation, and biological knowledge discovery. However, researchers trying to quickly find the appropriate resources for their problems face growing confusion. In this chapter, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research. We conclude the chapter by discussing the challenges and opportunities for developing new protein bioinformatics databases.
Key words: Bioinformatics, Database, Protein sequence, Protein family, Protein structure, Protein function, Proteomics, Data integration, Comparative analysis
1. Introduction
Advances in high-throughput technologies for the study of molecular biology systems in the past decades have marked the beginning of a new era of research, in which biological researchers systematically study organisms at the levels of genomes (complete genetic sequences) (1), transcriptomes (gene expressions) (2) and proteomes (protein expressions) (3). Because proteins occupy a middle ground molecularly between gene and transcript information and higher levels of molecular and cellular structure and organization, and most physiological and pathological processes are manifested at the protein level, biological scientists are increasingly interested in applying proteomics techniques to foster a better understanding of basic molecular biology and disease processes and to discover new diagnostic, prognostic and therapeutic targets for numerous diseases (4, 5).
  • 21. Recently, proteomics data analysis has moved toward information integration of multiple studies, including cross-species analyses (6–9). The richness of proteomics data allows researchers to ask complex biological questions and gain new scientific insights. To support comparative proteomics, data-driven hypothesis generation, and biological knowledge discovery, many protein-related bioinformatics databases, query facilities, and data analysis software tools have been developed. These organize and provide biological annotations for individual proteins to support sequence, structural, functional and evolutionary analyses in the context of pathway, network and systems biology. However, it is not always easy for researchers to quickly find the pieces of related information. In this chapter, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research. We highlight some of these databases, and focus on the types of data stored and the related data access and data analysis support. We also discuss the challenges and opportunities for developing new protein bioinformatics databases in terms of supporting data integration and comparative analysis, maintaining data provenance and managing biological knowledge.
Our coverage of protein bioinformatics databases in this chapter is by no means exhaustive. We refer the readers to ref. 10 for a more complete list. Our intention is to cover those that are recent, high quality, publicly available, and expected to be of interest to more users in the comparative proteomics community.
2. Overview
Based on the topics and data stored, protein bioinformatics databases can be primarily classified as sequence databases, family databases, structure databases, function databases and proteomics databases as shown in Table 1. It is worth noting that certain databases can be classified into more than one category. Please visit http://www.proteininformationresource.org/staff/chenc/MiMB/dbSummary.html to access the databases reviewed in this chapter through their corresponding web addresses (URLs).
3. Databases and Resources Highlights
3.1. Protein Sequence Databases
Protein sequence databases serve as the archival repositories for collections of protein sequences as well as their associated annotations. These databases are also the primary sources for developing other
  • 22. Table 1. Overview of protein bioinformatics databases

| Primary category | Secondary category | Database name | Database content | URL | References |
|---|---|---|---|---|---|
| Sequence | | NCBI Reference Sequence (RefSeq) | Biologically non-redundant collection of DNA, RNA, and protein sequences | http://www.ncbi.nlm.nih.gov/RefSeq/ | (11) |
| Sequence | | Entrez Protein Database | Collection of protein sequences from a variety of sources, and translations from annotated coding regions in GenBank and RefSeq | http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein | (20) |
| Sequence | UniProt | UniProt Knowledgebase (UniProtKB) | Collection of functional information on proteins with accurate, consistent and rich annotation | http://www.uniprot.org/help/uniprotkb | (13) |
| Sequence | UniProt | UniProt Archive (UniParc) | Comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world | http://www.uniprot.org/help/uniparc | (14) |
| Sequence | UniProt | UniProt Reference Clusters (UniRef) | Clustered sets of sequences from UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records | http://www.uniprot.org/help/uniref | (15) |
| Family | Whole protein | PIRSF | Comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships based on whole protein sequences | http://www.pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml | (18) |
| Family | Whole protein | Clusters of Orthologous Groups of proteins (COGs) | Phylogenetic classification of proteins encoded in complete genomes | http://www.ncbi.nlm.nih.gov/COG/ | (64) |
| Family | Whole protein | Protein ANalysis THrough Evolutionary Relationships Classification System (PANTHER) | Proteins are classified by expert biologists into families and subfamilies of shared function and further categorized by GO terms | http://www.pantherdb.org/ | (29) |
| Family | Whole protein | ProtoNet | Automatic hierarchical classification of protein sequences | http://www.protonet.cs.huji.ac.il/index.php | (65) |
| Family | Protein domain | Pfam | Protein families of domains each represented by multiple sequence alignments and Hidden Markov Models (HMMs) | http://www.pfam.sanger.ac.uk/ | (19) |
| Family | Protein domain | ProDom | Comprehensive set of protein domain families automatically generated from the UniProtKB | http://www.prodom.prabi.fr/prodom/current/html/home.php | (21) |
| Family | Protein domain | Conserved Domains Database (CDD) | Collections of multiple sequence alignments representing conserved domains | http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd | (66) |
| Family | Protein domain | Simple Modular Architecture Research Tool (SMART) | Resource for identification and annotation of protein domains and the analysis of domain architectures | http://www.smart.embl.de/ | (31) |
| Family | Protein motif | PRINTS | Group of conserved motifs used to characterize a protein family | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php | (30) |
| Family | Protein motif | PROSITE | Protein domains, families and functional sites as well as associated patterns and profiles to identify them | http://www.ca.expasy.org/prosite/ | (24) |
| Family | Integrative | InterPro | Integrated resource of protein families, domains and functional sites from Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF etc. | http://www.ebi.ac.uk/interpro/ | (27) |
| Structure | 3D structure | Worldwide Protein Data Bank (wwPDB) | Repository for the 3D coordinates and related information on more than 38,000 macromolecular structures, including proteins, nucleic acids and large macromolecular complexes that have been determined using X-ray crystallography, NMR and electron microscopy techniques | http://www.wwpdb.org/ | (23) |
| Structure | 3D structure | Molecular Modeling Database (MMDB) | 3D macromolecular structures, including proteins and polynucleotides | http://www.ncbi.nlm.nih.gov/sites/entrez?db=structure | (67) |
| Structure | 3D structure | ModBase | 3D protein models calculated by comparative modeling | http://www.modbase.compbio.ucsf.edu/modbase-cgi/index.cgi | (68) |
| Structure | 3D structure | SWISS-MODEL Repository | Annotated protein 3D models | http://www.swissmodel.expasy.org/repository/ | (69) |
| Structure | Structural classification | CATH | Hierarchical classification of protein domain structures in the Protein Data Bank | http://www.cathdb.info/ | (37) |
| Structure | Structural classification | Structural Classification Of Proteins (SCOP) | Description of the evolutionary and structural relationships of the proteins of known structures | http://www.scop.mrc-lmb.cam.ac.uk/scop/ | (22) |
| Structure | Structural classification | SUPERFAMILY | Structural and functional annotation for all proteins and genomes based on a collection of Hidden Markov Models, which represents structural protein domains at the SCOP superfamily level | http://www.supfam.org/SUPERFAMILY/ | (32) |
| Structure | Protein folding | Protein Folding Database (PFD) | Repository of available experimental protein folding data | http://www.pfd.med.monash.edu.au/public_html/index.php | (38) |
| Structure | Protein folding | KineticDB | Experimental data on protein folding kinetics | http://www.kineticdb.protres.ru/db/index.pl | (70) |
| Structure | Protein modification | RESID | Collection of annotations and structures for protein pre-, co- and post-translational modifications | http://www.ebi.ac.uk/RESID/ | (71) |
| Structure | Protein modification | Phospho3D | 3D structures of phosphorylation sites that stores information retrieved from the phospho.ELM database | http://www.cbm.bio.uniroma2.it/phospho3d/ | (40) |
| Function | Intermolecular interactions | IntAct | Protein interaction data from literature and user submission | http://www.ebi.ac.uk/intact/main.xhtml | (42) |
| Function | Intermolecular interactions | Database of Interacting Proteins (DIP) | Experimentally determined protein–protein interactions | http://www.dip.doe-mbi.ucla.edu/dip/Main.cgi | (72) |
| Function | Intermolecular interactions | Reactome | A curated knowledgebase of biological pathways | http://www.reactome.org/ | (47) |
| Function | Intermolecular interactions | Biological General Repository for Interaction Datasets (BioGRID) | Collections of protein and genetic interactions from major model organism species | http://www.thebiogrid.org | (73) |
| Function | Metabolic pathways | Kyoto Encyclopedia of Genes and Genomes (KEGG) | Pathway maps on the molecular interaction and reaction networks for metabolism | http://www.genome.jp/kegg/pathway.html | (74) |
| Function | Metabolic pathways | BioCyc | Pathway/Genome Databases (PGDBs) on the pathways and genomes of different organisms | http://www.biocyc.org/ | (51) |
| Function | Metabolic pathways | MetaCyc | Nonredundant, experimentally elucidated metabolic pathways | http://www.metacyc.org/ | (51) |
| Function | Integrative | Michigan molecular interactions (MiMI) | Merged view of several popular interaction databases including: BIND, HPRD, IntAct, GRID, and others | http://www.mimitest.ncibi.org/MimiWeb/main-page.jsp | (75) |
| Proteomics | Gel electrophoresis | WORLD-2DPAGE Constellation | List of World-2DPAGE database servers, World-2DPAGE Portal that queries simultaneously worldwide proteomics databases, and World-2DPAGE Repository | http://www.world-2dpage.expasy.org/ | (52) |
| Proteomics | Mass spectrometry | Global Proteome Machine Database (GPMDB) | Mass spectral library for data from a variety of organisms, the identified peptides are matched to the Ensembl genome database | http://www.thegpm.org/GPMDB/index.html | (76) |
| Proteomics | Mass spectrometry | PRoteomics IDEntifications database (PRIDE) | Protein and peptide identifications that have been described in the scientific literature together with the evidence supporting these identifications | http://www.ebi.ac.uk/pride/ | (54) |
| Proteomics | Mass spectrometry | PeptideAtlas | Peptides identified in a large set of LC–MS/MS proteomics experiments | http://www.peptideatlas.org/ | (77) |
| Proteomics | Mass spectrometry | Peptidome | Tandem mass spectrometry peptide and protein identification data generated by the scientific community | http://www.ncbi.nlm.nih.gov/peptidome/ | (78) |
  • 27. 10 Chen, Huang, and Wu
resources such as protein family databases, and the foundation for medical and functional studies.
3.1.1. RefSeq
The National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database provides curated non-redundant sequences for genomic regions, transcripts and proteins (11). The RefSeq collection is derived from the sequence data available in the redundant archival database GenBank (12). RefSeq sequences include coding regions, conserved domains, variations, references, names, and database cross-references. The sequences are annotated using a combined approach of collaboration, automated prediction, and manual curation (11). RefSeq release 37 of September 11, 2009 includes 8,835,796 proteins and 9,005 organisms. RefSeq data can be accessed from NCBI web sites by Entrez query, BLAST, FTP download, etc.
3.1.2. UniProt
The UniProt Consortium consists of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). The UniProt Consortium provides a central resource for protein sequences and functional annotations with four database components to support protein bioinformatics research:
● The UniProt Knowledgebase (UniProtKB) is the predominant data store for functional information on proteins (13). UniProtKB consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated records with information extracted from literature and curator-evaluated computational analysis, and UniProtKB/TrEMBL, which contains computationally analyzed records with rule-based automatic annotation. Comparative analysis and query across databases are supported by UniProtKB's extensive cross-references, functional and feature annotations, classification, and literature-based evidence attribution. UniProtKB release 15.9 of October 13, 2009 includes 510,076 UniProtKB/Swiss-Prot sequence entries, comprising 179,409,349 amino acids abstracted from 183,725 references, and 9,501,907 UniProtKB/TrEMBL sequence entries comprising 3,068,281,486 amino acids.
● The UniProt archive (UniParc) (14) is an archival protein sequence database from all major publicly accessible resources. UniParc contains protein sequences and database cross-references to the provenance of the sequences. Text- and sequence-based searches are available from the UniParc database web site.
● The UniProt Reference Clusters (UniRef) (15) merge sequences and sub-sequences that are 100% (UniRef100), ≥90%
  • 28.
(UniRef90), or ≥50% (UniRef50) identical, regardless of source organism, to speed up searches.
● The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data. UniMES currently contains data from the Global Ocean Sampling Expedition (GOS) (16), which predicts nearly six million proteins, primarily from oceanic microbes (13).
The UniProt web site (http://www.uniprot.org) is the primary access point to the data and documentation. The site also provides batch retrieval using UniProt identifiers, BLAST-based sequence similarity search, ClustalW-based sequence alignment, and database identifier mapping. The UniProt FTP download site provides batch download of protein sequence data in various formats, including flat file text, XML, RDF, FASTA, and GFF. Programmatic access to the data and search results is supported via simple HTTP RESTful web services or UniProtJAPI (17) for Java-based applications.
3.2. Protein Family Databases
The primary protein sequence databases can be used to develop new resources with value-added information by either classifying protein sequences into families or assigning certain properties to the sequences by detecting specific sequence features such as domains, motifs, and functional sites.
3.2.1. PIRSF
The PIRSF classification system provides comprehensive and non-overlapping clustering of UniProtKB (13) sequences into a hierarchical order to reflect their evolutionary relationships based on whole proteins rather than on the component domains. The PIRSF system classifies the protein sequences into families, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture) (18). The PIRSF family classification results are expert-curated based on literature review and integrative sequence and functional analysis. The classification report shows information on PIRSF members and general statistics, family and function/structure relationships, database cross-references, and a graphical display of the domain and motif architecture of seed members or all members. The web-based PIRSF system has been shown to be a useful tool for studying the function and evolution of protein families (18). It provides batch retrieval of entries from the PIRSF database. The PIRSF scan allows searching a query sequence against the set of fully curated PIRSF families with benchmarked Hidden Markov Models. The PIRSF membership hierarchy data is also available for FTP download.
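The FASTA files offered by the UniProt FTP site (Section 3.1.2) carry the UniProtKB section in each header line: the prefix sp marks a UniProtKB/Swiss-Prot entry and tr marks a UniProtKB/TrEMBL entry, followed by the accession and entry name. The following is a minimal sketch of reading such a file in Python. The function name parse_uniprot_fasta, the accessions X00001 and X00002, and the sequences are made up for illustration and do not come from UniProt.

```python
import re
from typing import Dict, Iterator

# UniProtKB FASTA header layout: >db|accession|entry_name description
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")
SECTION_BY_PREFIX = {"sp": "UniProtKB/Swiss-Prot", "tr": "UniProtKB/TrEMBL"}


def parse_uniprot_fasta(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per FASTA record (hypothetical helper, not a UniProt API)."""
    record = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record is not None:
                yield record  # previous record is complete
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError(f"Unrecognized UniProtKB FASTA header: {line}")
            prefix, accession, entry_name, description = match.groups()
            record = {
                "section": SECTION_BY_PREFIX[prefix],
                "accession": accession,
                "entry_name": entry_name,
                "description": description,
                "sequence": "",
            }
        elif record is not None:
            record["sequence"] += line  # sequence lines may wrap
    if record is not None:
        yield record


# Toy input: accessions, names and sequences are invented for this example.
TOY_FASTA = """>sp|X00001|TOY1_EXAMPLE Toy reviewed protein OS=Example organism
MKTAYIAK
QRQISFVK
>tr|X00002|TOY2_EXAMPLE Toy unreviewed protein OS=Example organism
MSHFSRQD
"""

records = list(parse_uniprot_fasta(TOY_FASTA))
for rec in records:
    print(rec["accession"], rec["section"], len(rec["sequence"]))
```

Keeping the section label on each record makes it easy to separate manually annotated entries from automatically annotated ones before any downstream comparison.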
  • 29.
3.2.2. Pfam
Pfam is a database of protein domains and families represented as multiple sequence alignments and Hidden Markov Models (HMMs) (19). Pfam is built from the protein sequence data in UniProtKB (13), NCBI GenPept (20) and selected metagenomics projects. The Pfam database contains two components: Pfam-A and Pfam-B. Each Pfam-A entry consists of a manually curated, high-quality representative seed alignment, profile HMMs built from the seed alignment, and an automatically generated full alignment for all detectable family member protein sequences. Pfam-B entries are automatically generated from the ProDom database (21). The Pfam release 24.0 of October 2009 contains 11,912 families. The Pfam database is further organized into higher-level hierarchical groupings of related families called clans (19), which are collections of related Pfam-A entries built manually based on the similarity of their sequences, known structures, profile-HMMs, and other databases such as SCOP (22). The Pfam database web site provides a set of query and browsing interfaces for analyzing protein sequences for Pfam matches, for viewing Pfam family annotations, alignments, groups of related families, and the domains of a protein sequence, as well as for finding the domains on a PDB (23) structure. The Pfam data can be downloaded from its FTP site or programmatically accessed through RESTful and SOAP-based web services.
3.2.3. PROSITE
PROSITE (24) is a database of annotated motif descriptors (patterns or profiles), which can be used for the identification of protein domains and families. The motif descriptors are derived from multiple alignments of homologous sequences and have the advantage of identifying distant relationships among sequences (25). A set of ProRules providing additional information about the functionally and/or structurally critical amino acids is used to increase the discriminatory power of the motif descriptors (24). The PROSITE web site provides keyword-based search and allows browsing of motif entries, ProRule description, taxonomic scope, and number of positive hits. The ScanProsite (26) tool allows one either to scan protein sequences for the occurrence of PROSITE motifs by entering UniProtKB AC and/or ID, PDB identifier(s) or protein sequence(s), or to scan the UniProtKB or PDB databases for the occurrence of a pattern by entering the PROSITE AC and/or ID or the user's own pattern(s). The ScanProsite (26) tool can also be accessed programmatically through a simple HTTP web service. The PROSITE documentation entries and related tools can be downloaded from its FTP site.
3.2.4. InterPro
InterPro (27) is an integrated resource of predictive models or “signatures” representing protein domains, families, regions, repeats and sites from major protein signature databases including Gene3D (28), PANTHER (29), Pfam (19), PIRSF (18),
  • 30.
PRINTS (30), ProDom (21), PROSITE (24), SMART (31), SUPERFAMILY (32) and TIGRFAMs (33). Each entry in the InterPro database is annotated with a descriptive abstract name and cross-references to the original data sources, as well as to specialized functional databases. The InterPro release 23.0 of September 23, 2009 includes 19,150 entries containing 434 new signatures. The database is available via a web interface and anonymous FTP download. The software tool InterProScan (34) is provided as a protein sequence classification and comparison package that can be used via a web interface and SOAP-based web services or can be installed locally for bulk operations. The InterPro BioMart (35) allows users to retrieve InterPro data from a query-optimized data warehouse that is synchronized with the main InterPro database, and to build simple or complex queries and control the query results through a unified interface.
3.3. Protein Structure Databases
Many bioinformatics studies are based on the premise that proteins of similar sequences carry out similar functions whereas those with different sequences carry out different functions. Increasing experimental evidence supports the notion that the structure of a protein reflects the role it plays and therefore determines its function in the biological process. The protein structure databases organize and annotate various experimentally determined protein structures, providing the biological community access to the experimental data in a useful way.
3.3.1. worldwide PDB
The worldwide PDB (wwPDB) was established in 2003 as an international collaboration to maintain a single and publicly available Protein Data Bank Archive (PDB Archive) of macromolecular structural data (23). The wwPDB members include RCSB PDB (USA), the Macromolecular Structure Database at the European Bioinformatics Institute (MSD-EBI) (UK), the Protein Data Bank Japan (PDBj) at Osaka University (Japan) and the BioMagResBank (BMRB) at the University of Wisconsin – Madison (USA). The “PDB Archive” is a collection of flat files in three different formats: the legacy PDB file format; the PDB exchange format that follows the mmCIF syntax (http://www.deposit.pdb.org/mmcif/); and the PDBML/XML format (36). Each member site serves as a deposition, data processing and distribution site for the PDB Archive, and each provides its own view of the primary data and a variety of tools and resources. As of October 27, 2009, there are 61,086 structures in the wwPDB database.
3.3.2. CATH
CATH (Class, Architecture, Topology, Homology) is a database of protein domain structures in the Protein Data Bank, where domains are hierarchically classified by the curators guided by prediction algorithms (such as structure comparison). CATH clusters proteins at four major levels (37):
  • 31.
● Class (C): secondary structure composition and packing within the structure.
● Architecture (A): orientations of the secondary structures, ignoring the connectivity among the secondary structures.
● Topology (T): whether they share the same topology in the core of the domain.
● Homologous superfamily (H): sequence and structural similarities.
The CATH release 3.2.0 of July 14, 2008 contains 114,215 assigned domains. CATH provides the SSAP server, which allows users to compare the structures of two proteins and view the subsequent structural alignment.
3.3.3. SCOP
The SCOP (Structural Classification of Proteins) database provides a comprehensive and detailed description of the evolutionary and structural relationships of the proteins of known structures. The SCOP classification hierarchy is built on domains in experimentally determined protein structures and includes the following levels (22):
● Species: distinct protein sequence and its naturally occurring or artificially created variants.
● Protein: similar sequences of essentially the same functions.
● Family: proteins with related sequences but typically distinct functions.
● Superfamily: protein families with common evolutionary ancestor.
● Fold: superfamilies with structural similarity (same major secondary structures in the same arrangement and with the same topological connections, not necessarily with common evolutionary origin).
● Class: based on the secondary structure content and organization of folds.
The SCOP release 1.75 of June 2009 includes 38,221 PDB entries, 1,195 folds, 1,962 superfamilies and 3,902 families.
3.3.4. PFD
The Protein Folding Database (PFD) is a publicly searchable repository that collects experimental thermodynamic and kinetic data for the folding of proteins. Experimenters deposit data including Constructor, Mutations, Equilibrium Method, Kinetic Method, Equilibrium Data, Kinetic Data, and Publications (38). The PFD database uses the International Foldeomics Consortium standards (39) for data deposition, analysis and reporting to facilitate the comparison of folding rates, energies and structure across diverse sets of proteins (38). The PFD release 2.2 of June 8, 2009 contains 296 entries, 70 proteins, 53 families, 30 species and 230
  • 32.
(five proteins) Φ values. The web site provides advanced text searches of protein names, literature references, and experimental details, with search results displayed in a tabular view. Graphical visualization tools have been built for raw equilibrium data, chevron data, contact order and folding rates, with hyperlinks on the graphs linking directly to the data in text format.
3.3.5. Phospho3D
Phospho3D (40) is a database of 3D structures of phosphorylation sites. Phospho3D is constructed by using the data collected from the phospho.ELM (41) database of experimentally verified phosphorylation sites in eukaryotic proteins, and is enriched with structural information and annotations at the residue level. The basic information unit in the Phospho3D database consists of the instance, its flanking sequence (ten residues) and its “zone,” a 3D neighborhood including any residue whose distance does not exceed 12 Å (40). For each zone, structural similarity and biochemical similarity are used to collect the results of a large-scale local structural comparison versus a representative dataset of PDB (23) protein chains, which provide clues for the identification of new putative phosphorylation sites. Users can browse the data in the Phospho3D database or search the database using kinase name, PDB identification code or keywords.
3.4. Protein Function Databases
The unique feature of proteins that allows their diverse functions is the ability to bind to other molecules specifically. For example, proteins can be enzymes that catalyze the chemical reactions in the cell or that manipulate the replication and transcription of DNA. Many proteins are also involved in the process of cell signaling and signal transduction. Protein function databases maintain information about metabolic pathways, enzymes, compounds, and the inter-molecular interactions and regulatory pathway mechanisms underlying many biological processes.
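As an illustration of the Phospho3D “zone” described in Section 3.3.5, the sketch below collects every residue that has at least one atom within 12 Å of any atom of the phosphorylated residue. This is only one plausible reading of the definition: the text gives the 12 Å cutoff but does not say between which atoms the distance is measured, so the atom-to-atom minimum distance used here is an assumption. The function name residue_zone, the residue labels and the coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
ZONE_RADIUS = 12.0  # cutoff in angstroms, taken from the text


def residue_zone(
    site_atoms: List[Coord],
    residues: Dict[str, List[Coord]],
    radius: float = ZONE_RADIUS,
) -> List[str]:
    """Return residue labels with any atom within `radius` of the site.

    Assumption: distance is the minimum atom-to-atom distance between the
    candidate residue and the phosphorylation-site residue.
    """
    zone = []
    for label, atoms in residues.items():
        if any(math.dist(a, s) <= radius for a in atoms for s in site_atoms):
            zone.append(label)
    return sorted(zone)


# Toy coordinates (invented): one site atom at the origin.
site = [(0.0, 0.0, 0.0)]
neighbours = {
    "A10": [(3.0, 4.0, 0.0)],                     # 5 A away: inside
    "A11": [(12.0, 0.0, 0.0)],                    # exactly 12 A: inside
    "A50": [(12.1, 0.0, 0.0)],                    # just outside
    "A60": [(20.0, 0.0, 0.0), (0.0, 0.0, 11.5)],  # second atom is inside
}

zone = residue_zone(site, neighbours)
print(zone)
```

A residue is kept as soon as one of its atoms falls inside the sphere, which is why A60 is included even though its first atom lies 20 Å away.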
3.4.1. IntAct
IntAct is an open source database and toolkit for the storage, presentation and analysis of protein interaction data (42). IntAct provides all relevant experimental details of protein interactions described in the originating publication. All the entries in the database are fully compliant with the IMEx (43) guidelines and MIMIx (44) standard. The technical details of the experiment, binding sites, protein tags and mutations are annotated with the Molecular Interaction ontology of the Proteomics Standards Initiative (PSI-MI) (45). At the time of writing, the database contains 202,419 binary interactions, 60,310 proteins, 11,119 experiments and 1,509 controlled vocabulary terms. The IntAct web site provides both textual and graphical views of protein interactions, and allows exploring interaction networks in the context of the Gene Ontology (46) controlled vocabulary and InterPro (27) domains of the interacting proteins. IntAct data and source code are available for
  • 33.
downloading from its web site. In addition, a set of tools has been developed by the IntAct project:
● ProViz: visualization of protein–protein interaction graphs.
● MiNe: computation of the minimal connecting networks for a given set of proteins.
● PSI-MI Semantic Validator: validation of files in PSI-MI XML 2.5 and PSI-PAR format.
3.4.2. Reactome
Reactome is an open source, expert-curated and peer-reviewed database of biological reactions and pathways with cross-references to major molecular databases (47). The basic information in the Reactome database is provided by either publications or sequence similarity-based inference. The Reactome release 30 of September 30, 2009 contains 3,916 proteins, 2,955 complexes, 3,541 reactions, and 1,045 pathways for Homo sapiens. Reactome data can be exported in SBML (48), Protégé (49), Cytoscape (50) and BioPAX (http://www.biopax.org) formats. Software tools like PathFinder, SkyPainter and Reactome BioMart (35) have been developed to support data mining and analysis of large-scale data sets.
3.4.3. MetaCyc and BioCyc
MetaCyc is a database of non-redundant, experimentally elucidated metabolic pathways and enzymes curated from the scientific literature (51). MetaCyc stores pathways involved in primary and secondary metabolism. It also stores compounds, proteins, protein complexes and genes associated with these pathways, with extensive links to other biological databases of protein sequences, nucleic acid sequences, protein structures and literature. BioCyc is a collection of Pathway/Genome Databases (PGDBs) (51). Each BioCyc PGDB contains the metabolic network of one organism predicted by the Pathway Tools software using MetaCyc as a reference database. Web-based query, browsing, visualization and comparative analysis tools are also provided on the MetaCyc and BioCyc web sites. A collection of data files is also available for downloading.
3.5. Proteomics Databases
The advent of high-throughput 2D-gel and mass spectrometry based analytical techniques and the available protein sequence databases have created massive amounts of proteomics data. To facilitate the sharing and further computational analysis of published proteomics data, several repositories have been created.
3.5.1. World-2DPAGE
The World-2DPAGE Constellation (52) is an effort of the Swiss Institute of Bioinformatics (SIB) to promote and publish two-dimensional gel electrophoresis proteomics data online through the ExPASy proteomics server. The World-2DPAGE Constellation consists of three components:
● WORLD-2DPAGE List (http://www.world-2dpage.expasy.org/list/) contains references to known federated 2D PAGE
  • 34.
databases, as well as to 2D PAGE-related servers and services.
● World-2DPAGE Portal (http://www.world-2dpage.expasy.org/portal/) is a dynamic portal that serves as a single interface to query simultaneously worldwide gel-based proteomics databases that are built using the Make2D-DB package (53).
● World-2DPAGE Repository (http://www.world-2dpage.expasy.org/repository/) is a public repository for gel-based proteomics data with protein identifications published in the literature. Mass-spectrometry based proteomics data from related studies can also be submitted to the PRIDE database (54) so that interested readers can explore the data in the views of 2D-gel and/or MS.
3.5.2. PRIDE
The PRoteomics IDEntifications database (PRIDE) is a repository for mass-spectrometry based proteomics data including identifications of proteins, peptides and post-translational modifications that have been described in the scientific literature, together with supporting mass spectra (54). The PRIDE team has built an infrastructure and a set of software tools to facilitate data submissions in PRIDE XML or mzData XML format from labs using different MS-based proteomics technologies. The PRIDE database can be queried by experiment accession number, protein accession number, literature reference, and sample parameters including species, tissue, sub-cellular location and disease state. The query results can be retrieved as PRIDE XML, mzData XML, or HTML. The PRIDE database includes a BioMart (35) interface that provides access to public PRIDE data from a query-optimized data warehouse, as well as programmatic web service access. The PRIDE project also provides the Protein Identifier Cross-Reference Service (PICR) (55), which maps protein sequence identifiers from over 60 different databases via the UniParc (14) database. The Database on Demand (DoD, http://www.ebi.ac.uk/pride/dod) service provides custom FASTA-formatted sequence databases according to a set of user-selectable criteria to optimize the search engine results. As of November 19, 2009, the PRIDE database contains 10,329 experiments, 2,827,384 identified proteins, 12,542,472 identified peptides, 1,891,670 unique peptides and 56,703,344 spectra.
4. Discussion
Although a variety of protein bioinformatics databases and resources have been developed to catalog and store different information about proteins, there are still opportunities to develop
  • 35.
new solutions to facilitate comparative analysis, data-driven hypothesis generation, and biological knowledge discovery.
4.1. Data Integration and Comparative Analysis
As the volume and diversity of data and the desire to share those data increase, we inevitably encounter the problem of combining heterogeneous data generated from many different but related sources and providing the users with a unified view of this combined data set. This problem emerges in the biological and biomedical research community, where research data from different bioinformatics data repositories and laboratories need to be combined and analyzed. There is an urgent need for computational methods that integrate data from multiple studies and answer more complex biological questions than traditional methods can tackle. Comparing experimental results across multiple laboratories and data types can also help form new hypotheses for further experimentation (56–58). Different laboratories use different experimental protocols, instruments and analysis techniques, which makes direct comparisons of their experimental results difficult. However, having related data in one place can make queries and comparisons of combined protein and gene data sets, and further analysis, possible. In general, there are two types of data integration approaches. The data warehouse approach puts data sources into a centralized location with a global data schema and an indexing system for fast data retrieval. An example of this approach is the NIAID (National Institute of Allergy and Infectious Diseases) Biodefense Resource Center (http://www.proteomicsresource.org), which uses a protein-centric data warehouse (Master Protein Directory) to integrate and support mining and comparative analysis of large and heterogeneous “omics” data across different experiments and organisms (59). Another approach to data integration involves the federation of data across multiple sources. An example of this approach is BioMart (35), an open source database management system that uses integrated query interfaces to query different BioMarts and allows users to group and refine their query results. BioMart can also be accessed programmatically through web services or software libraries written in Java or Perl.
4.2. Data Provenance and Biological Knowledge
In many cases, the most difficult tasks in protein bioinformatics data management and analysis are not mapping biological entities from different sources or managing and processing large sets of experimental data, such as gel images and mass spectra. Rather, they lie in recording the detailed provenance of data, i.e., what was done, why it was done, where it was done, which instrument was used, what settings were used, and how it was done. The provenance of experimental data is an important aspect of scientific best practice and is central to scientific discovery (60).
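The provenance questions listed in Section 4.2 (what, why, where, which instrument, what settings, how) can be captured as a small fixed record. The following is a minimal sketch and not a published standard: the class name ProvenanceRecord, its field names and the example values are all hypothetical, chosen only to mirror the six questions in the text.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ProvenanceRecord:
    """One provenance entry per experiment (illustrative field names)."""
    what: str          # what was done
    why: str           # why it was done
    where: str         # where it was done
    instrument: str    # which instrument was used
    how: str           # how it was done (protocol summary)
    settings: Dict[str, str] = field(default_factory=dict)  # instrument settings

    def missing_fields(self) -> List[str]:
        """Names of fields left empty, in declaration order."""
        return [name for name, value in asdict(self).items() if not value]


# Example values are invented for illustration.
record = ProvenanceRecord(
    what="LC-MS/MS analysis of a membrane fraction",
    why="",  # purpose not recorded by the submitter
    where="Example Proteomics Core Facility",
    instrument="example-mass-spectrometer-01",
    how="tryptic digest followed by reversed-phase separation",
)

incomplete = record.missing_fields()
print(incomplete)
```

Checking for empty fields at submission time is one simple way to make the otherwise manual standardization of provenance slightly less dependent on curator effort.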
  • 36.
In proteomics studies, great efforts have been made to develop and maintain data format standards, such as mzXML (61) and HUPO PSI (HUPO Proteomics Standards Initiative) (62), and minimal information standards for describing such data, for example, MIAPE (Minimum Information About a Proteomics Experiment) (63). Nevertheless, the ontologies and related tools that provide a formal representation of a set of concepts and their relationships within the domain of “omics” experiments still lag behind the current development of experimental protocols and methods. The standardization of data provenance remains a somewhat manual process, which depends on the efforts of database maintainers and data submitters.
General biological and biomedical scientists are more interested in finding and viewing the “knowledge” contained in an already analyzed data set. However, much of the protein data generated in high-throughput research plays no significant part in the conclusions of an analysis. Unfortunately, this knowledge seldom comes with the standard data files and formats and is usually not easily found in omics repositories unless a reanalysis is performed or the data is annotated by a curator. For example, tables of proteins present in a given proteomics experiment are routinely found as supplemental data in scientific publications, but are not available in a searchable or easily computable format. This is unfortunate, as this supplemental information is the result of considerable analysis by the original authors of a study to minimize false positive and false negative results, and thus often represents the “knowledge” that underlies additional analysis and conclusions reached in a publication. The NIAID Biodefense Resource Center developed a simple set of defined fields called “structured assertion” that could be used across proteomics, microarray and possibly other data types (59). A “structured assertion” can represent the results in a simple form like “protein V (presented) in experimental condition W,” where V represents any valid identifier and W represents a value in a simple experimental ontology. For the analyzed results of proteomics and microarray data, a simple two-field assertion was implemented: an “experimental condition” field containing simple keywords that describe the key experimental variables (growth conditions, sample fractionation, time, temperature, infection status and others), and an “Expression Status” field, which has three values: increase, decrease or present. Although seemingly simple, the approach provides unique analytical power by enabling simple queries across results from different data types and laboratories.
Acknowledgment
We would like to thank Dr. Winona C. Barker for reviewing the manuscript and providing constructive comments.
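The “structured assertion” described in Section 4.2 pairs an identifier V with an experimental condition W and one of three Expression Status values. The sketch below shows how such assertions might be stored and queried across data types. It is an illustration only and not the Biodefense Resource Center implementation: the class and function names, the data_type field, and the identifiers and keywords are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class ExpressionStatus(Enum):
    """The three status values named in the text."""
    INCREASE = "increase"
    DECREASE = "decrease"
    PRESENT = "present"


@dataclass(frozen=True)
class StructuredAssertion:
    identifier: str            # V: any valid protein or gene identifier
    condition: str             # W: keyword from a simple experimental ontology
    status: ExpressionStatus
    data_type: str             # assumption: tracks proteomics vs. microarray origin


def find_assertions(
    assertions: Iterable[StructuredAssertion],
    condition: str,
    status: Optional[ExpressionStatus] = None,
) -> List[StructuredAssertion]:
    """Select assertions for one condition, optionally filtered by status."""
    return [
        a for a in assertions
        if a.condition == condition and (status is None or a.status is status)
    ]


# Toy assertions: identifiers and condition keywords are invented.
ASSERTIONS = [
    StructuredAssertion("PROT_A", "infected", ExpressionStatus.INCREASE, "proteomics"),
    StructuredAssertion("PROT_A", "infected", ExpressionStatus.INCREASE, "microarray"),
    StructuredAssertion("PROT_B", "infected", ExpressionStatus.PRESENT, "proteomics"),
    StructuredAssertion("PROT_C", "uninfected", ExpressionStatus.DECREASE, "microarray"),
]

increased = find_assertions(ASSERTIONS, "infected", ExpressionStatus.INCREASE)
# Identifiers supported by more than one data type under the same condition
supported_by = {a.identifier: set() for a in increased}
for a in increased:
    supported_by[a.identifier].add(a.data_type)
print(supported_by)
```

Because every result is reduced to the same few fields, the same query works regardless of whether an assertion came from a proteomics or a microarray experiment.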
  • 37.
References
1. Ridley, M. (2006) Genome. Harper Perennial, New York.
2. Velculescu, V. E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M. A., Bassett, D. E. Jr, Hieter, P., Vogelstein, B., Kinzler, K. W. (1997) Characterization of the yeast transcriptome. Cell 2, 243–251.
3. Anderson, N. L., Anderson, N. G. (1998) Proteome and proteomics: new technologies, new concepts, and new words. Electrophoresis 11, 1853–1861.
4. Hye, A., Lynham, S., Thambisetty, M., Causevic, M., Campbell, J., Byers, H. L., Hooper, C., Rijsdijk, F., Tabrizi, S. J., Banner, S., Shaw, C. E., Foy, C., Poppe, M., Archer, N., Hamilton, G., Powell, J., Brown, R. G., Sham, P., Ward, M., Lovestone, S. (2006) Proteome-based plasma biomarkers for Alzheimer’s disease. Brain 11, 3042–3050.
5. Decramer, S., Wittke, S., Mischak, H., Zürbig, P., Walden, M., Bouissou, F., Bascands, J. L., Schanstra, J. P. (2006) Predicting the clinical outcome of congenital unilateral ureteropelvic junction obstruction in newborn by urinary proteome analysis. Nat. Med. 4, 398–400.
6. Savidor, A., Donahoo, R. S., Hurtado-Gonzales, O., Land, M. L., Shah, M. B., Lamour, K. H., McDonald, W. H. (2008) Cross-species global proteomics reveals conserved and unique processes in Phytophthora sojae and Phytophthora ramorum. Mol. Cell Proteomics 8, 1501–1516.
7. Huang, M., Chen, T., Chan, Z. (2006) An evaluation for cross-species proteomics research by publicly available expressed sequence tag database search using tandem mass spectral data. Rapid Commun. Mass Spectrom. 18, 2635–2640.
8. Ishii, A., Dutta, R., Wark, G. M., Hwang, S. I., Han, D. K., Trapp, B. D., Pfeiffer, S. E., Bansal, R. (2009) Human myelin proteome and comparative analysis with mouse myelin. Proc. Natl. Acad. Sci. U. S. A. 34, 14605–14610.
9. Irmler, M., Hartl, D., Schmidt, T., Schuchhardt, J., Lach, C., Meyer, H. E., Hrabé de Angelis, M., Klose, J., Beckers, J. (2008) An approach to handling and interpretation of ambiguous data in transcriptome and proteome comparisons. Proteomics 6, 1165–1169.
10. Galperin, M. Y., Cochrane, G. R. (2009) Nucleic acids research annual database issue and the NAR online molecular biology database collection in 2009. Nucleic Acids Res. 37, D1–D4.
11. Pruitt, K. D., Tatusova, T., Maglott, D. R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65.
12. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Wheeler, D. L. (2008) GenBank. Nucleic Acids Res. 36, D25–D30.
13. The UniProt Consortium. (2010) The universal protein resource (UniProt) in 2010. Nucleic Acids Res. 38, D142–D148.
14. Leinonen, R., Diez, F. G., Binns, D., Fleischmann, W., Lopez, R., Apweiler, R. (2004) UniProt archive. Bioinformatics 20, 3236–3237.
15. Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R., Wu, C. H. (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288.
16. Yooseph, S., Sutton, G., Rusch, D. B., Halpern, A. L., Williamson, S. J., Remington, K., Eisen, J. A., Heidelberg, K. B., Manning, G., Li, W., Jaroszewski, L., Cieplak, P., Miller, C. S., Li, H., Mashiyama, S. T., Joachimiak, M. P., van Belle, C., Chandonia, J. M., Soergel, D. A., Zhai, Y., Natarajan, K., Lee, S., Raphael, B. J., Bafna, V., Friedman, R., Brenner, S. E., Godzik, A., Eisenberg, D., Dixon, J. E., Taylor, S. S., Strausberg, R. L., Frazier, M., Venter, J. C. (2007) The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biol. 5, e16.
17. Patient, S., Wieser, D., Kleen, M., Kretschmann, E., Martin, M. J., Apweiler, R. (2008) UniProtJAPI: a remote API for accessing UniProt data. Bioinformatics 24, 1321–1322.
18. Nikolskaya, A. N., Arighi, C. N., Huang, H., Barker, W. C., Wu, C. H. (2006) PIRSF family classification system for protein functional and evolutionary analysis. Evol. Bioinform. Online 2, 197–209.
19. Finn, R. D., Tate, J., Mistry, J., Coggill, P. C., Sammut, S. J., Hotz, H. R., Ceric, G., Forslund, K., Eddy, S. R., Sonnhammer, E. L., Bateman, A. (2008) The Pfam protein families database. Nucleic Acids Res. 36, D281–D288.
20. Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Miller, V., Pruitt, K. D., Schuler, G. D.,
  • 38.
Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R. L., Tatusova, T. A., Wagner, L., Yaschenko, E. (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 35, D5–D12.
21. Bru, C., Courcelle, E., Carrère, S., Beausse, Y., Dalmar, S., Kahn, D. (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33, D212–D215.
22. Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C., Murzin, A. G. (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D419–D425.
23. Berman, H., Henrick, K., Nakamura, H., Markley, J. L. (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 35, D301–D303.
24. Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Cuche, B. A., de Castro, E., Lachaize, C., Langendijk-Genevaux, P. S., Sigrist, C. J. (2008) The 20 years of PROSITE. Nucleic Acids Res. 36, D245–D249.
25. Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., Bucher, P. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3, 265–274.
26. De Castro, E., Sigrist, C. J. A., Gattiker, A., Bulliard, V., Langendijk-Genevaux, P. S., Gasteiger, E., Bairoch, A., Hulo, N. (2006) ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res. 34, W362–W365.
27. Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R. D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A. F., Selengut, J. D., Sigrist, C. J., Thimma, M., Thomas, P. D., Valentin, F., Wilson, D., Wu, C. H., Yeats, C. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215.
28. Yeats, C., Lees, J., Reid, A., Kellam, P., Martin, N., Liu, X., Orengo, C. (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36, D414–D418.
29. Mi, H., Guo, N., Kejariwal, A., Thomas, P. D. (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res. 35, D247–D252.
30. Attwood, T. K. (2002) The PRINTS database: a resource for identification of protein families. Brief. Bioinform. 3, 252–263.
31. Letunic, I., Doerks, T., Bork, P. (2009) SMART 6: recent updates and new developments. Nucleic Acids Res. 37, D229–D232.
32. Wilson, D., Pethica, R., Zhou, Y., Talbot, C., Vogel, C., Madera, M., Chothia, C., Gough, J. (2009) SUPERFAMILY – sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386.
33. Haft, D. H., Selengut, J. D., White, O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res. 31, D371–D373.
34. Mulder, N., Apweiler, R. (2007) InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol. 396, 59–70.
35. Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., Kasprzyk, A. (2009) BioMart – biological queries made easy. BMC Genomics 10, 22.
36. Westbrook, J., Ito, N., Nakamura, H., Henrick, K., Berman, H. M. (2005) PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics 21, 988–992.
37. Cuff, A. L., Sillitoe, I., Lewis, T., Redfern, O. C., Garratt, R., Thornton, J., Orengo, C. A. (2009) The CATH classification revisited – architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 37, D310–D314.
38. Fulton, K. F., Bate, M. A., Faux, N. G., Mahmood, K., Betts, C., Buckle, A. M. (2007) Protein Folding Database (PFD 2.0): an online environment for the International Foldeomics Consortium. Nucleic Acids Res. 35, D304–D307.
39. Maxwell, K. L., Wildes, D., Zarrine-Afsar, A., De Los Rios, M. A., Brown, A. G., Friel, C. T., Hedberg, L., Horng, J. C., Bona, D., Miller, E. J., Vallée-Bélisle, A., Main, E. R., Bemporad, F., Qiu, L., Teilum, K., Vu, N. D., Edwards, A. M., Ruczinski, I., Poulsen, F. M., Kragelund, B. B., Michnick, S. W., Chiti, F., Bai, Y., Hagen, S. J., Serrano, L., Oliveberg, M., Raleigh, D. P., Wittung-Stafshede, P., Radford, S. E., Jackson, S. E., Sosnick, T. R., Marqusee, S., Davidson, A. R., Plaxco, K. W. (2005) Protein folding: defining a “standard” set of experimental conditions and a preliminary kinetic data set of two-state proteins. Protein Sci. 14, 602–616.
  • 39. 22 Chen, Huang, and Wu 40. Zanzoni, A., Ausiello, G., Via, A., Gherardini, P.F.,Helmer-Citterich,M.(2007)Phospho3D: a database of three-dimensional structures of protein phosphorylation sites. Nucleic Acids Res. 35, D229–D231. 41. Diella, F., Cameron, S., Gemünd, C., Linding, R., Via, A., Kuster, B., Sicheritz-Pontén, T., Blom, N., Gibson, T. J. (2004) Phospho. ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics 5, 79. 42. Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean,I.,Bridge,A.,Derow,C.,Feuermann, M., Ghanbarian, A. T., Kerrien, S., Khadake, J., Kerssemakers, J., Leroy, C., Menden, M., Michaut, M., Montecchi-Palazzi, L., Neuhauser, S. N., Orchard, S., Perreau, V., Roechert, B., van Eijk, K., Hermjakob, H. (2010) The IntAct molecular interaction database in 2010. Nucleic Acids Res. 38, D525–D531. 43. Orchard, S., Kerrien, S., Jones, P., Ceol, A., Chatr-Aryamontri, A., Salwinski, L., Nerothin, J., Hermjakob, H. (2007) Submit your inter- action data the IMEx way: a step by step guide to trouble-free deposition. Proteomics 7 Suppl 1, 28–34. 44. Orchard, S., Salwinski, L., Kerrien, S., Montecchi-Palazzi, L., Oesterheld, M., Stümpflen, V., Ceol, A., Chatr-aryamontri, A., Armstrong, J., Woollard, P., Salama, J. J., Moore, S., Wojcik, J., Bader, G. D., Vidal, M., Cusick, M. E., Gerstein, M., Gavin, A. C., Superti-Furga, G., Greenblatt, J., Bader, J., Uetz, P., Tyers, M., Legrain, P., Fields, S,, Mulder, N., Gilson, M., Niepmann, M., Burgoon, L., De Las Rivas, J., Prieto, C., Perreau, V. M., Hogue, C., Mewes, H. W., Apweiler, R., Xenarios, I., Eisenberg, D., Cesareni, G., Hermjakob, H. (2007) The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat. Biotechnol. 25, 894–898. 45. Kerrien, S., Orchard, S., Montecchi-Palazzi, L., Aranda, B., Quinn, A. F., Vinod, N., Bader, G. D., Xenarios, I., Wojcik, J., Sherman, D., Tyers, M., Salama, J. 
J., Moore, S., Ceol, A., Chatr-Aryamontri, A., Oesterheld, M., Stümpflen, V., Salwinski, L., Nerothin, J., Cerami, E., Cusick, M. E., Vidal, M., Gilson, M., Armstrong, J., Woollard, P., Hogue, C., Eisenberg, D., Cesareni, G., Apweiler, R., Hermjakob, H. (2007) Broadening the hori- zon – level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 5, 44. 46. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29. 47. Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., de Bono, B., Garapati, P., Hemish, J., Hermjakob, H., Jassal, B., Kanapin, A., Lewis, S., Mahajan, S., May, B., Schmidt, E., Vastrik, I., Wu, G., Birney, E., Stein, L., D’Eustachio, P. (2009) Reactome knowledge- base of human biological pathways and pro- cesses. Nucleic Acids Res. 37, D619–D622. 48. Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., Cuellar, A. A., Dronov, S., Gilles, E. D., Ginkel, M., Gor, V., Goryanin, II., Hedley, W. J., Hodgman, T. C., Hofmeyr, J. H., Hunter, P. J., Juty, N. S., Kasberger, J. L., Kremling, A., Kummer, U., Le Novère, N., Loew, L. M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E. D., Nakayama, Y., Nelson, M. R., Nielsen, P. F., Sakurada, T., Schaff, J. C., Shapiro, B. E., Shimizu, T. S., Spence, H. D., Stelling, J., Takahashi, K., Tomita, M., Wagner, J., Wang, J., SBML Forum. (2003) The systems biology markup language (SBML): a medium for rep- resentation and exchange of biochemical net- work models. Bioinformatics 19, 524–531. 49. Noy, N. F., Crubezy, M., Fergerson, R. W., Knublauch, H., Tu, S. W., Vendetti, J., Musen, M. A. 
(2003) Protégé-2000: an open-source ontology-development and knowledge-acquisition environment. AMIA. Annu Symp Proc. 953. 50. Cline, M. S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., Hanspers, K., Isserlin, R., Kelley, R., Killcoyne, S., Lotia, S., Maere, S., Morris, J., Ono, K., Pavlovic, V., Pico, A. R., Vailaya, A., Wang, P. L., Adler, A., Conklin, B. R., Hood, L., Kuiper, M., Sander, C., Schmulevich, I., Schwikowski, B., Warner, G. J., Ideker, T., Bader, G. D. (2007) Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2, 2366–2382. 51. Caspi, R., Foerster, H., Fulcher, C. A., Kaipa, P., Krummenacker, M., Latendresse, M., Paley, S., Rhee, S. Y., Shearer, A., Tissier, C., Walk, T. C., Zhang, P. and Karp, P. D. (2008) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 36, D623–D631. 52. Hoogland, C., Mostaguir, K., Appel, R. D., Lisacek, F. (2008) The World-2DPAGE Constellation to promote and publish gel-based
  • 40. 23 Protein Bioinformatics Databases and Resources proteomics data through the ExPASy server. J. Proteomics 71, 245–248. 53. Mostaguir, K., Hoogland, C., Binz, P. A., Appel, R. D. (2003) The Make 2D-DB II package: conversion of federated two-dimen- sional gel electrophoresis databases into a rela- tionalformatandinterconnectionofdistributed databases. Proteomics 3, 1441–1444. 54. Vizcaíno, J. A., Côté, R., Reisinger, F., Barsnes, H., Foster, J. M., Rameseder, J., Hermjakob, H., Martens, L. (2009) The pro- teomics identifications database: 2010 update. Nucleic Acids Res. 38, D736–D742. 55. Côté, R. G., Jones, P., Martens, L., Kerrien, S., Reisinger, F., Lin, Q., Leinonen, R., Apweiler, R., Hermjakob, H. (2007) The protein identifier cross-referencing (PICR) service: reconciling protein identifiers across multiplesourcedatabases.BMCBioinformatics 8, 401–414. 56. Burgun,A.,Bodenreider,O.(2008)Accessing and integrating data and knowledge for bio- medical research. Yearb Med Inform. 91–101. 57. Hwang, D., Rust, A. G., Ramsey, S., Smith, J. J., Leslie, D. M., Weston, A. D., de Atauri, P., Aitchison, J. D., Hood, L., Siegel, A. F., Bolouri, H. (2005) A data integration meth- odology for systems biology. Proc. Natl Acad. Sci. U. S. A. 102, 17296–17301. 58. Mathew, J. P., Taylor, B. S., Bader, G. D., Pyarajan, S., Antoniotti, M., Chinnaiyan, A. M., Sander, C., Burakoff, S. J., Mishra, B. (2007) From bytes to bedside: data integra- tion and computational biology for transla- tional cancer research. PLoS Comput. Biol. 3, e12. 59. McGarvey, P. B., Huang, H., Mazumder, R., Zhang, J., Chen, Y., Zhang, C., Cammer, S., Will, R., Odle, M., Sobral, B., Moore, M., Wu, C. H. (2009) Systems integration of bio- defense omics data for analysis of pathogen– host interactions and identification of potential targets. PLoS One 4, e7162. 60. Stevens, R., Zhao, J., Goble, C. (2007) Using provenance to manage knowledge of in silico experiments. Brief. Bioinform. 8, 183–194. 61. 
Pedrioli, P. G., Eng, J. K., Hubley, R., Vogelzang, M., Deutsch, E. W., Raught, B., Pratt, B., Nilsson, E., Angeletti, R. H., Apweiler, R., Cheung, K., Costello, C. E., Hermjakob, H., Huang, S., Julian, R. K., Kapp, E., McComb, M. E., Oliver, S. G., Omenn, G., Paton, N. W., Simpson, R., Smith, R., Taylor, C. F., Zhu, W., Aebersold, R. (2004) A common open representation of mass spectrometry data and its applica- tion to proteomics research. Nat. Biotechnol. 22, 1459–1466. 62. Orchard, S., Montechi-Palazzi, L., Deutsch, E. W., Binz, P. A., Jones, A. R., Paton, N., Pizarro, A., Creasy, D. M., Wojcik, J., Hermjakob, H. (2007) Five years of prog- ress in the standardization of proteomics data 4(th) annual spring workshop of the HUPO-proteomics standards initiative April 23–25, 2007 Ecole Nationale Supérieure (ENS), Lyon, France. Proteomics 7, 3436–3440. 63. Taylor, C. F., Paton, N. W., Lilley, K. S., Binz, P. A., Julian, R. K. Jr, Jones, A. R., Zhu, W., Apweiler, R., Aebersold, R., Deutsch, E. W., Dunn, M. J., Heck, A. J., Leitner, A., Macht, M., Mann, M., Martens, L., Neubert, T. A., Patterson, S. D., Ping, P., Seymour, S. L., Souda, P., Tsugita, A., Vandekerckhove, J., Vondriska, T. M., Whitelegge, J. P., Wilkins, M. R., Xenarios, I., Yates, J. R. 3rd, Hermjakob, H. (2007) The minimum information about a proteom- ics experiment (MIAPE). Nat. Biotechnol. 25, 887–893. 64. Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N., Rao, B. S., Smirnov, S., Sverdlov, A. V., Vasudevan, S., Wolf, Y. I., Yin, J. J., Natale, D. A. (2003) The COG data- base: an updated version includes eukaryotes. BMC Bioinformatics 4, 41–54. 65. Kaplan, N., Sasson, O., Inbar, U., Friedlich, M., Fromer, M., Fleischer, H., Portugaly, E., Linial, N., Linial, M. (2005) ProtoNet 4.0: a hierarchi- calclassificationofonemillionproteinsequences. Nucleic Acids Res. 33, D216–D218. 66. 
Marchler-Bauer, A., Anderson, J. B., Chitsaz, F., Derbyshire, M. K., DeWeese-Scott, C., Fong, J. H., Geer, L. Y., Geer, R. C., Gonzales, N. R., Gwadz, M., He, S., Hurwitz, D. I., Jackson, J. D., Ke, Z., Lanczycki, C. J., Liebert, C. A., Liu, C., Lu, F., Lu, S., Marchler, G. H., Mullokandov, M., Song, J. S., Tasneem, A., Thanki, N., Yamashita, R. A., Zhang, D., Zhang, N., Bryant, S. H. (2009) CDD: specific functional annotation with the conserved domain database. Nucleic Acids Res. 37, D205–D210. 67. Wang, Y., Addess, K. J., Chen, J., Geer, L. Y., He, J., He, S., Lu, S., Madej, T., Marchler-Bauer, A., Thiessen, P. A., Zhang, N., Bryant, S. H. (2007) MMDB: annotat- ing protein sequences with Entrez’s 3D-structure database. Nucleic Acids Res. 35, D298–D300. 68. Pieper, U., Eswar, N., Webb, B. M., Eramian, D., Kelly, L., Barkan, D. T., Carter, H., Mankoo, P., Karchin, R., Marti-Renom, M. A., Davis, F. P., Sali, A. (2009) MODBASE, a
  • 41. 24 Chen, Huang, and Wu database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. 37, D347–D354. 69. Kiefer, F., Arnold, K., Künzli, M., Bordoli, L., Schwede, T. (2009) The SWISS-MODEL repository and associated resources. Nucleic Acids Res. 37, D387–D392. 70. Bogatyreva, N. S., Osypov, A. A., Ivankov, D. N. (2009) KineticDB: a database of protein folding kinetics. Nucleic Acids Res. 37, D342–D346. 71. Garavelli, J. S. (2004) The RESID database of protein modifications as a resource and anno- tation tool. Proteomics 4, 1527–1533. 72. Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U., Eisenberg, D. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res. 32, D449–D451. 73. Breitkreutz, B. J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M., Oughtred, R., Lackner, D. H., Bähler, J., Wood, V., Dolinski, K., Tyers, M. (2008) The BioGRID interaction database: 2008 update. Nucleic Acids Res. 36, D637–D640. 74. Kanehisa, M., Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30. 75. Tarcea, V. G., Weymouth, T., Ade, A., Bookvich, A., Gao, J., Mahavisno, V., Wright, Z., Chapman, A., Jayapandian, M., Ozgür, A., Tian, Y., Cavalcoli, J., Mirel, B., Patel, J., Radev, D., Athey, B., States, D., Jagadish, H. V. (2009) Michigan molecular interactions r2: from interacting proteins to pathways. Nucleic Acids Res. 37, D642–D646. 76. Craig, R., Cortens, J. C., Fenyo, D., Beavis, R. C. (2006) Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 5, 1843–1849. 77. Deutsch, E. W., Lam, H., Aebersold, R. (2008) PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 9, 429–434. 78. Slotta, D. J., Barrett, T., Edgar, R. (2009) NCBI peptidome: a new public repository for mass spectrometry peptide identifications. Nat. Biotechnol. 27, 600–601.
Chapter 2

A Guide to UniProt for Protein Scientists

Claire O’Donovan and Rolf Apweiler

Abstract

One of the essential requirements of the proteomics community is a high-quality, annotated, nonredundant protein sequence database with stable identifiers and an archival service to enable protein identification and characterization. This chapter illustrates how the Universal Protein Resource (UniProt) (The UniProt Consortium, Nucleic Acids Res. 38:D142–D148, 2010) can best be used for proteomics purposes, with a particular focus on exploiting the knowledge captured in the UniProt databases, the services provided, and the availability of complete proteomes.

Key words: Protein sequence database, Annotation, Stable identifiers, Complete proteome, Archive, Nonredundant

1. Introduction

The proteomics community has evolved intensively over the last decade, but one constant is the need to identify the resulting proteins and their potential functions. This requires a nonredundant protein sequence database with maximal coverage, including splice isoforms, disease variant(s), and posttranslational modifications. Sequence archiving is essential for interpreting and maintaining proteomic result sets. Stable identifiers, consistent nomenclature, and controlled vocabularies are highly beneficial for protein identification. The last, but by no means least, requirement is the provision of detailed information on protein function, biological processes, and molecular interactions and pathways, cross-referenced to appropriate external sources. In this chapter, we show how the Universal Protein Resource fulfils these criteria.

Cathy H. Wu and Chuming Chen (eds.), Bioinformatics for Comparative Proteomics, Methods in Molecular Biology, vol. 694, DOI 10.1007/978-1-60761-977-2_2, © Springer Science+Business Media, LLC 2011
2. Materials

The mission of the Universal Protein Resource (UniProt) is to provide the scientific community with a comprehensive, high-quality, and freely accessible resource of protein sequence and functional information, which is essential for modern biological research. UniProt is produced by the UniProt Consortium, which consists of groups from the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR), and the Swiss Institute of Bioinformatics (SIB). Its activities are mainly supported by the National Institutes of Health (NIH), with additional funding from the European Commission and the Swiss Federal Government. UniProt has five components optimized for different uses. The UniProt Knowledgebase (UniProtKB) (1) is an expertly curated database, a central access point for integrated protein information with cross-references to multiple sources. The UniProt Archive (UniParc) (2) is a comprehensive sequence repository, reflecting the history of all protein sequences. UniProt Reference Clusters (UniRef) (3) merge closely related sequences based on sequence identity to speed up searches, whereas the UniProt Metagenomic and Environmental Sequences database (UniMES) was created to respond to the expanding area of metagenomic data. The UniProtKB Sequence/Annotation Version Archive (UniSave) is the UniProtKB protein entry archive, which contains all versions of each protein entry (Fig. 1).

Fig. 1. UniProt databases.
2.1. The UniProt Archive

UniParc is the main sequence storehouse and is a comprehensive repository that reflects the history of all protein sequences. UniParc contains all new and revised protein sequences from all publicly available sources (http://www.uniprot.org/help/uniparc) to ensure that complete coverage is available at a single site. To avoid redundancy, all sequences 100% identical over the entire length are merged, regardless of source organism. New and updated sequences are loaded on a daily basis, cross-referenced to the source database accession number, and provided with a sequence version that increments on changes to the underlying sequence. The basic information stored within each UniParc entry is the identifier, the sequence, the cyclic redundancy check number, the source database(s) with accession and version numbers, and a timestamp. If a UniParc entry lacks a cross-reference to a UniProtKB entry, the reason for its exclusion from UniProtKB is provided (e.g., pseudogene). In addition, each source database accession number is tagged with its status in that database, indicating whether the sequence still exists or has been deleted in the source database, and with cross-references to NCBI GI and TaxId if appropriate.

2.2. The UniProt Knowledgebase

UniProtKB consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The former contains manually annotated records with information extracted from literature and curator-evaluated computational analysis. Annotation is done by biologists with specific expertise to achieve accuracy. In UniProtKB/Swiss-Prot, annotation consists of the description of the following: function(s), enzyme-specific information, biologically relevant domains and sites, posttranslational modifications, subcellular location(s), tissue specificity, developmental specific expression, structure, interactions, splice isoform(s), and associated diseases, deficiencies, or abnormalities.

The UniProt Knowledgebase aims to describe, in a single record, all protein products derived from a certain gene from a certain species. After an inspection of the sequences, the curator selects the reference sequence, does the corresponding merging, and lists the splice and genetic variants along with disease information when available. As a result, not only does the whole record have an accession number, but each protein form derived by alternative splicing, proteolytic cleavage, and posttranslational modification also has a unique identifier. The freely available tool VARSPLIC (4) enables the recreation of all annotated splice variants from the feature table of a UniProt Knowledgebase entry, or for the complete database. A FASTA-formatted file containing all splice variants annotated in the UniProt Knowledgebase can be downloaded for use with similarity search programs.

UniProtKB/TrEMBL contains high-quality computationally analyzed records enriched with automatic annotation and classification. The computer-assisted annotation is created using both automatically generated rules and manually curated rules (UniRule) based on protein families (5–8). UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and, with some defined exclusions, Arabidopsis thaliana sequences from The Arabidopsis Information Resource (TAIR) (9), yeast sequences from the Saccharomyces Genome Database (SGD) (10), and Homo sapiens sequences from the Ensembl database (11). It will soon be extended to include other Ensembl organism sets and RefSeq records. Records are selected for full manual annotation and integration into UniProtKB/Swiss-Prot according to defined annotation priorities.

Integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences, and protein tertiary structures), as well as with specialized data collections, is important for UniProt users. UniProtKB is currently cross-referenced with more than ten million links to 114 different databases with regular update cycles. This extensive network of cross-references allows UniProt to act as a focal point of biomolecular database interconnectivity. All cross-referenced databases are documented at http://www.uniprot.org/docs/dbxref and, if appropriate, are included in the UniProt ID mapping tool at http://www.uniprot.org/help/mapping, with the file for download at ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping.

2.3. The UniProt Reference Clusters

UniRef provides clustered sets of all sequences from the UniProt Knowledgebase (including splice forms as separate entries) and selected records from the UniProt Archive to achieve complete coverage of sequence space at identity levels of 100, 90, and 50% while hiding redundant sequences (3). The UniRef clusters are generated in a hierarchical manner: the UniRef100 database combines identical sequences and sub-fragments into a single UniRef entry, UniRef90 is built from UniRef100 clusters, and UniRef50 is built from UniRef90 clusters. Each individual member sequence can exist in only one UniRef cluster at each identity level and have only one parent or child cluster at another identity level. UniRef100, UniRef90, and UniRef50 yield a database size reduction of ~10, 40, and 70%, respectively. Each cluster record contains source database, protein name, and taxonomy information on each member sequence but is represented by a single selected representative protein sequence and name; the number of members and the lowest common taxonomy node for the membership are also included. The representative protein sequence, or cluster representative, is automatically selected using an algorithm that accounts for:

(1) Quality of entry annotation: order of preference is a member from UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, then UniParc.
(2) Meaningful name: members with protein names that do not contain words such as “hypothetical” or “probable” are preferred.
(3) Organism: members from model organisms are preferred.
(4) Sequence length: the longest sequence is preferred.

UniRef100 is one of the most comprehensive nonredundant protein sequence datasets available. The reduced size of the UniRef90 and UniRef50 datasets provides faster sequence similarity searches and reduces the research bias in similarity searches by providing a more even sampling of sequence space.

2.4. The UniProt Metagenomic and Environmental Sequences

The UniProt Knowledgebase contains entries with a known taxonomic source. However, the expanding area of metagenomic data has necessitated the creation of a separate database, the UniProt Metagenomic and Environmental Sequences database (UniMES). UniMES currently contains data from the Global Ocean Sampling Expedition (GOS), which predicts nearly six million proteins, primarily from oceanic microbes. By combining the predicted protein sequences with automatic classification by InterPro, the integrated resource for protein families, domains, and functional sites, UniMES uniquely provides free access to the array of genomic information gathered.

2.5. The UniProtKB Sequence/Annotation Version Archive

UniSave is a repository of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions. It provides the backend to the UniProtKB entry history service (Fig. 2) and is also provided as a standalone service at http://www.ebi.ac.uk/uniprot/unisave.

Fig. 2. UniSave link.

3. Methods

These descriptions of our databases should illustrate that UniProt does provide a high-quality, annotated, nonredundant database with maximal coverage and sequence archiving. This section describes particular features of the UniProt activities that fulfill the remaining requirements of the proteomics community: detailed information on protein function, biological processes, and molecular interactions and pathways, cross-referenced to appropriate external sources; and stable identifiers, consistent nomenclature, and controlled vocabularies.

3.1. Protein Annotation

UniProtKB consists of two sections, Swiss-Prot and TrEMBL. UniProtKB/Swiss-Prot contains manually annotated records with information extracted from literature and curator-evaluated computational analysis. Manual annotation consists of a critical review of experimentally proven or computer-predicted data about each protein. An essential aspect of the annotation protocol is the use of official nomenclatures and controlled vocabularies that facilitate consistent and accurate identification (Fig. 3).

Fig. 3. UniProt nomenclature.

Annotation consists of the description of the following: function(s), enzyme-specific information, biologically relevant domains and sites, posttranslational modifications, subcellular location(s), tissue specificity, developmental specific expression, structure, interactions, splice isoform(s), and associated diseases, deficiencies, or abnormalities (Fig. 4).

Another important part of the annotation process involves merging of different reports for a single protein. After an inspection of the sequences, the curator selects the reference sequence, does the corresponding merging, and lists the splice and genetic variants along with disease information when available (Fig. 5). Data are continuously updated by an expert team of biologists.
Fig. 4. Protein annotation.

3.2. The Gene Ontology Consortium and UniProt

To promote database interoperability and provide consistent annotation, the UniProt Consortium is a key member of the Gene Ontology Consortium (12) and benefits from the presence of the GO editorial office at the EBI. UniProt curators will continue to assign Gene Ontology (GO) terms to the gene products in UniProtKB during the UniProt manual curation process. UniProtKB also profits from GO annotation carried out by other GO Consortium members. Currently we include manual GO annotations from 19 GO Consortium annotation groups, and we further supplement this with high-quality annotations from other manual annotation sources (including the Human Protein Atlas and LIFEdb). In addition to this manually curated GO annotation, automatic GO annotation pipelines exist and will be further developed to ensure that the functional knowledge supplied by various UniProtKB ontologies, Ensembl orthology data, and InterPro matches is fully exploited to provide a high-quality, comprehensive set of GO annotation predictions for all UniProtKB entries.

3.3. Cross-references to External Sources

One challenge in life sciences research is the ability to integrate and exchange data coming from multiple research groups. The UniProt Consortium is committed to fostering interaction and exchange with the scientific community, ensuring wide access to UniProt resources, and promoting interoperability between resources. An essential component of this interoperability is the provision of cross-references to these resources in UniProt entries (Fig. 6).
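As a rough illustration of how these cross-references (including GO terms) can be collected from an entry, the sketch below groups the database cross-reference lines of a UniProtKB flat-file record by database name. It assumes the flat-file convention in which such lines begin with the code DR followed by three spaces, separate fields with semicolons, and end with a period. The function name parse_cross_references and the sample entry (accession P00000) are hypothetical and are not taken from the chapter.

```python
from collections import defaultdict


def parse_cross_references(flat_file_text):
    """Group the DR (database cross-reference) lines of one entry by database.

    Assumes each DR line looks like 'DR   <database>; <field>; <field>.'
    Returns a dict mapping a database name to a list of field lists.
    """
    xrefs = defaultdict(list)
    for line in flat_file_text.splitlines():
        if not line.startswith("DR   "):
            continue  # skip every other line type (ID, AC, DE, ...)
        # Drop the 5-character line code and the trailing period, then split.
        body = line[5:].rstrip().rstrip(".")
        fields = [field.strip() for field in body.split(";")]
        database, identifiers = fields[0], fields[1:]
        xrefs[database].append(identifiers)
    return dict(xrefs)


# Hypothetical entry fragment, written only to exercise the parser.
SAMPLE = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
AC   P00000;
DR   EMBL; X00000; CAA00000.1; -; mRNA.
DR   GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
DR   GO; GO:0006468; P:protein phosphorylation; IDA:UniProtKB.
DR   InterPro; IPR000719; Prot_kinase_dom.
"""

if __name__ == "__main__":
    for database, records in parse_cross_references(SAMPLE).items():
        print(database, records)
```

Splitting on semicolons keeps the sketch short, but it would mis-handle any field that itself contains a semicolon; for production use, an established parser is likely the safer choice.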
Fig. 5. Sequence annotation.

3.4. Nonredundant Complete UniProt Proteome Sets

UniProt constructs complete nonredundant proteome sets. Each set and its analysis is made available shortly after the appearance of a new complete genome sequence in the nucleotide sequence databases. A standard procedure is used to create, from the UniProtKB, proteome sets for bacterial, archaeal and some eukaryotic genomes. Proteome sets for certain metazoan genomes are
  • 56. containing a part of this work or any other work associated with Project Gutenberg™. 1.E.5. Do not copy, display, perform, distribute or redistribute this electronic work, or any part of this electronic work, without prominently displaying the sentence set forth in paragraph 1.E.1 with active links or immediate access to the full terms of the Project Gutenberg™ License. 1.E.6. You may convert to and distribute this work in any binary, compressed, marked up, nonproprietary or proprietary form, including any word processing or hypertext form. However, if you provide access to or distribute copies of a Project Gutenberg™ work in a format other than “Plain Vanilla ASCII” or other format used in the official version posted on the official Project Gutenberg™ website (www.gutenberg.org), you must, at no additional cost, fee or expense to the user, provide a copy, a means of exporting a copy, or a means of obtaining a copy upon request, of the work in its original “Plain Vanilla ASCII” or other form. Any alternate format must include the full Project Gutenberg™ License as specified in paragraph 1.E.1. 1.E.7. Do not charge a fee for access to, viewing, displaying, performing, copying or distributing any Project Gutenberg™ works unless you comply with paragraph 1.E.8 or 1.E.9. 1.E.8. You may charge a reasonable fee for copies of or providing access to or distributing Project Gutenberg™ electronic works provided that: • You pay a royalty fee of 20% of the gross profits you derive from the use of Project Gutenberg™ works calculated using the method you already use to calculate your applicable taxes. The fee is owed to the owner of the Project Gutenberg™ trademark, but he has agreed to donate royalties under this paragraph to the Project Gutenberg Literary Archive Foundation. Royalty
  • 57. payments must be paid within 60 days following each date on which you prepare (or are legally required to prepare) your periodic tax returns. Royalty payments should be clearly marked as such and sent to the Project Gutenberg Literary Archive Foundation at the address specified in Section 4, “Information about donations to the Project Gutenberg Literary Archive Foundation.” • You provide a full refund of any money paid by a user who notifies you in writing (or by e-mail) within 30 days of receipt that s/he does not agree to the terms of the full Project Gutenberg™ License. You must require such a user to return or destroy all copies of the works possessed in a physical medium and discontinue all use of and all access to other copies of Project Gutenberg™ works. • You provide, in accordance with paragraph 1.F.3, a full refund of any money paid for a work or a replacement copy, if a defect in the electronic work is discovered and reported to you within 90 days of receipt of the work. • You comply with all other terms of this agreement for free distribution of Project Gutenberg™ works. 1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™ electronic work or group of works on different terms than are set forth in this agreement, you must obtain permission in writing from the Project Gutenberg Literary Archive Foundation, the manager of the Project Gutenberg™ trademark. Contact the Foundation as set forth in Section 3 below. 1.F. 1.F.1. Project Gutenberg volunteers and employees expend considerable effort to identify, do copyright research on, transcribe and proofread works not protected by U.S. copyright
  • 58. law in creating the Project Gutenberg™ collection. Despite these efforts, Project Gutenberg™ electronic works, and the medium on which they may be stored, may contain “Defects,” such as, but not limited to, incomplete, inaccurate or corrupt data, transcription errors, a copyright or other intellectual property infringement, a defective or damaged disk or other medium, a computer virus, or computer codes that damage or cannot be read by your equipment. 1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the “Right of Replacement or Refund” described in paragraph 1.F.3, the Project Gutenberg Literary Archive Foundation, the owner of the Project Gutenberg™ trademark, and any other party distributing a Project Gutenberg™ electronic work under this agreement, disclaim all liability to you for damages, costs and expenses, including legal fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE. 1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a defect in this electronic work within 90 days of receiving it, you can receive a refund of the money (if any) you paid for it by sending a written explanation to the person you received the work from. If you received the work on a physical medium, you must return the medium with your written explanation. The person or entity that provided you with the defective work may elect to provide a replacement copy in lieu of a refund. If you received the work electronically, the person or entity providing it to you may choose to give you a second opportunity to receive the work electronically in lieu of a refund.
  • 59. If the second copy is also defective, you may demand a refund in writing without further opportunities to fix the problem. 1.F.4. Except for the limited right of replacement or refund set forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PURPOSE. 1.F.5. Some states do not allow disclaimers of certain implied warranties or the exclusion or limitation of certain types of damages. If any disclaimer or limitation set forth in this agreement violates the law of the state applicable to this agreement, the agreement shall be interpreted to make the maximum disclaimer or limitation permitted by the applicable state law. The invalidity or unenforceability of any provision of this agreement shall not void the remaining provisions. 1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation, the trademark owner, any agent or employee of the Foundation, anyone providing copies of Project Gutenberg™ electronic works in accordance with this agreement, and any volunteers associated with the production, promotion and distribution of Project Gutenberg™ electronic works, harmless from all liability, costs and expenses, including legal fees, that arise directly or indirectly from any of the following which you do or cause to occur: (a) distribution of this or any Project Gutenberg™ work, (b) alteration, modification, or additions or deletions to any Project Gutenberg™ work, and (c) any Defect you cause. Section 2. Information about the Mission of Project Gutenberg™
  • 60. Project Gutenberg™ is synonymous with the free distribution of electronic works in formats readable by the widest variety of computers including obsolete, old, middle-aged and new computers. It exists because of the efforts of hundreds of volunteers and donations from people in all walks of life. Volunteers and financial support to provide volunteers with the assistance they need are critical to reaching Project Gutenberg™’s goals and ensuring that the Project Gutenberg™ collection will remain freely available for generations to come. In 2001, the Project Gutenberg Literary Archive Foundation was created to provide a secure and permanent future for Project Gutenberg™ and future generations. To learn more about the Project Gutenberg Literary Archive Foundation and how your efforts and donations can help, see Sections 3 and 4 and the Foundation information page at www.gutenberg.org. Section 3. Information about the Project Gutenberg Literary Archive Foundation The Project Gutenberg Literary Archive Foundation is a non-profit 501(c)(3) educational corporation organized under the laws of the state of Mississippi and granted tax exempt status by the Internal Revenue Service. The Foundation’s EIN or federal tax identification number is 64-6221541. Contributions to the Project Gutenberg Literary Archive Foundation are tax deductible to the full extent permitted by U.S. federal laws and your state’s laws. The Foundation’s business office is located at 809 North 1500 West, Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up to date contact information can be found at the Foundation’s website and official page at www.gutenberg.org/contact
  • 61. Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation Project Gutenberg™ depends upon and cannot survive without widespread public support and donations to carry out its mission of increasing the number of public domain and licensed works that can be freely distributed in machine-readable form accessible by the widest array of equipment including outdated equipment. Many small donations ($1 to $5,000) are particularly important to maintaining tax exempt status with the IRS. The Foundation is committed to complying with the laws regulating charities and charitable donations in all 50 states of the United States. Compliance requirements are not uniform and it takes a considerable effort, much paperwork and many fees to meet and keep up with these requirements. We do not solicit donations in locations where we have not received written confirmation of compliance. To SEND DONATIONS or determine the status of compliance for any particular state visit www.gutenberg.org/donate. While we cannot and do not solicit contributions from states where we have not met the solicitation requirements, we know of no prohibition against accepting unsolicited donations from donors in such states who approach us with offers to donate. International donations are gratefully accepted, but we cannot make any statements concerning tax treatment of donations received from outside the United States. U.S. laws alone swamp our small staff. Please check the Project Gutenberg web pages for current donation methods and addresses. Donations are accepted in a number of other ways including checks, online payments and
  • 62. credit card donations. To donate, please visit: www.gutenberg.org/donate. Section 5. General Information About Project Gutenberg™ electronic works Professor Michael S. Hart was the originator of the Project Gutenberg™ concept of a library of electronic works that could be freely shared with anyone. For forty years, he produced and distributed Project Gutenberg™ eBooks with only a loose network of volunteer support. Project Gutenberg™ eBooks are often created from several printed editions, all of which are confirmed as not protected by copyright in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition. Most people start at our website which has the main PG search facility: www.gutenberg.org. This website includes information about Project Gutenberg™, including how to make donations to the Project Gutenberg Literary Archive Foundation, how to help produce our new eBooks, and how to subscribe to our email newsletter to hear about new eBooks.