Bioinformatics for Comparative Proteomics 1st Edition Chuming Chen
Methods in Molecular Biology™
Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
For other titles published in this series, go to
www.springer.com/series/7651
Bioinformatics for Comparative Proteomics
Edited by
Cathy H. Wu
Department of Computer and Information Sciences,
Center for Bioinformatics and Computational Biology,
University of Delaware, Newark, DE, USA
Chuming Chen
Department of Computer and Information Sciences,
Center for Bioinformatics and Computational Biology,
University of Delaware, Newark, DE, USA
Preface
With the rapid development of proteomic technologies in life sciences and in clinical applications, many bioinformatics methodologies, databases, and software tools have been developed to support comparative proteomics study. This volume aims to highlight the current status, challenges, open problems, and future trends in developing bioinformatics tools and resources for comparative proteomics research and to serve as a definitive source of reference providing both the breadth and depth needed on the subject of Bioinformatics for Comparative Proteomics.
The volume is structured to introduce three major areas of research methods: (1) basic bioinformatics frameworks related to comparative proteomics, (2) bioinformatics databases and tools for proteomics data analysis, and (3) integrated bioinformatics systems and approaches for studying comparative proteomics in the systems biology context.
Part I (Bioinformatics Framework for Comparative Proteomics) consists of seven chapters:
Chapter 1 presents a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research.
Chapter 2 provides a practical guide to the comparative proteomics community for exploiting the knowledge captured from and the services provided in UniProt databases.
Chapter 3 introduces the InterPro protein classification system for automatic protein annotation and reviews the signature methods used in the InterPro database.
Chapter 4 introduces the Reactome Knowledgebase that provides an integrated view of the molecular details of human biological processes.
Chapter 5 introduces eFIP (extraction of Functional Impact of Phosphorylation), a Web-based text mining system that can aid scientists in quickly finding abstracts from literature related to the phosphorylation (including site and kinase), interactions, and functional aspects of a given protein.
Chapter 6 presents a tutorial for the Protein Ontology (PRO) Web resources to help researchers in their proteomic studies by providing key information about protein diversity in terms of evolutionary-related protein classes based on full-length sequence conservation and the various protein forms that arise from a gene along with the specific functional annotation.
Chapter 7 describes a method for the annotation of functional residues within experimentally uncharacterized proteins using position-specific site annotation rules derived from structural and experimental information.
Part II (Proteomic Bioinformatics) consists of ten chapters:
Chapter 8 describes how the detailed understanding of information value of mass spectrometry-based proteomics data can be elucidated by performing simulations using synthetic data.
Chapter 9 describes the concepts, prerequisites, and methods required to analyze a shotgun proteomics data set using a tandem mass spectrometry search engine.
Chapter 10 presents computational methods for quantification and comparison of peptides by label-free LC–MS analysis, including data preprocessing, multivariate statistical methods, and detection of differential protein expression.
Chapter 11 proposes an alternative to MS/MS spectrum identification by combining the uninterpreted MS/MS spectra from overlapping peptides and then determining the consensus identifications for sets of aligned MS/MS spectra.
Chapter 12 describes the Trans-Proteomic Pipeline, a freely available open-source software suite that provides uniform analysis of LC–MS/MS data from raw data to quantified sample proteins.
Chapter 13 provides an overview of a set of open-source software tools and steps involved in ELISA microarray data analysis.
Chapter 14 presents the state of the art on the Proteomics Databases and Repositories.
Chapter 15 is a brief guide to preparing both large- and small-scale protein interaction data for publication.
Chapter 16 demonstrates a new graphical user interface tool called PRIDE Converter, which greatly simplifies the submission of MS data to the PRIDE database for submitted proteomics manuscripts.
Chapter 17 presents a method for describing a protein’s posttranslational modifications by integrating the top–down and bottom–up MS data using the Protein Inference Engine.
Chapter 18 describes an integrated top–down and bottom–up approach facilitated by concurrent liquid chromatography–mass spectrometry analysis and fraction collection for comprehensive high-throughput intact protein profiling.
Part III (Comparative Proteomics in Systems Biology) consists of four chapters:
Chapter 19 gives an overview of the content and usage of the PhosphoPep database, which supports systems biology signaling research by providing interactive interrogation of MS-derived phosphorylation data from four different organisms.
Chapter 20 describes “omics” data integration to map a list of identified proteins to a common representation of the protein and uses the related structural, functional, genetic, and disease information for functional categorization and pathway mapping.
Chapter 21 describes a knowledge-based approach relying on existing metabolic pathway information and a direct data-driven approach for a metabolic pathway-centric integration of proteomics and metabolomics data.
Chapter 22 provides a detailed description of a method used to study temporal changes in the endoplasmic reticulum (ER) proteome of fibroblast cells exposed to ER stress agents (tunicamycin and thapsigargin).
This volume targets readers who wish to learn about state-of-the-art bioinformatics databases and tools, novel computational methods and future trends in proteomics data analysis, and comparative proteomics in systems biology. The audience may range from graduate students embarking upon a research project, to practicing biologists working on proteomics and systems biology research, and to bioinformaticians developing advanced databases, analysis tools, and integrative systems. With its interdisciplinary nature, this volume is expected to find a broad audience in biotechnology and pharmaceutical companies and in various academic departments in biological and medical sciences (such as biochemistry, molecular biology, protein chemistry, and genomics) and computational sciences and engineering (such as bioinformatics and computational biology, computer science, and biomedical engineering).
We thank all the authors and coauthors who have contributed to this volume. We thank our series editor, Dr. John M. Walker, for reviewing all the chapter manuscripts and providing constructive comments. We also thank Dr. Winona C. Barker from Georgetown University for reviewing the manuscripts. We thank Dr. Qinghua Wu for proofreading the book draft. Finally, we would like to extend our thanks to David C. Casey and Anne Meagher of Springer US, and Jeya Ruby and Ravi Amina of SPi, for their help in the compilation of this book.
Newark, DE, USA Cathy H. Wu and Chuming Chen
Contents
Preface
Contributors
Part I: Bioinformatics Framework for Comparative Proteomics
1 Protein Bioinformatics Databases and Resources
Chuming Chen, Hongzhan Huang, and Cathy H. Wu
2 A Guide to UniProt for Protein Scientists
Claire O’Donovan and Rolf Apweiler
3 InterPro Protein Classification
Jennifer McDowall and Sarah Hunter
4 Reactome Knowledgebase of Human Biological Pathways and Processes
Peter D’Eustachio
5 eFIP: A Tool for Mining Functional Impact of Phosphorylation from Literature
Cecilia N. Arighi, Amy Y. Siu, Catalina O. Tudor, Jules A. Nchoutmboube, Cathy H. Wu, and Vijay K. Shanker
6 A Tutorial on Protein Ontology Resources for Proteomic Studies
Cecilia N. Arighi
7 Structure-Guided Rule-Based Annotation of Protein Functional Sites in UniProt Knowledgebase
Sona Vasudevan, C.R. Vinayaka, Darren A. Natale, Hongzhan Huang, Robel Y. Kahsay, and Cathy H. Wu
Part II: Proteomic Bioinformatics
8 Modeling Mass Spectrometry-Based Protein Analysis
Jan Eriksson and David Fenyö
9 Protein Identification from Tandem Mass Spectra by Database Searching
Nathan J. Edwards
10 LC-MS Data Analysis for Differential Protein Expression Detection
Rency S. Varghese and Habtom W. Ressom
11 Protein Identification by Spectral Networks Analysis
Nuno Bandeira
12 Software Pipeline and Data Analysis for MS/MS Proteomics: The Trans-Proteomic Pipeline
Andrew Keller and David Shteynberg
13 Analysis of High-Throughput ELISA Microarray Data
Amanda M. White, Don S. Daly, and Richard C. Zangar
14 Proteomics Databases and Repositories
Lennart Martens
15 Preparing Molecular Interaction Data for Publication
Sandra Orchard and Henning Hermjakob
16 Submitting Proteomics Data to PRIDE Using PRIDE Converter
Harald Barsnes, Juan Antonio Vizcaíno, Florian Reisinger, Ingvar Eidhammer, and Lennart Martens
17 Automated Data Integration and Determination of Posttranslational Modifications with the Protein Inference Engine
Stuart R. Jefferys and Morgan C. Giddings
18 An Integrated Top-Down and Bottom-Up Strategy for Characterization of Protein Isoforms and Modifications
Si Wu, Nikola Tolić, Zhixin Tian, Errol W. Robinson, and Ljiljana Paša-Tolić
Part III: Comparative Proteomics in Systems Biology
19 Phosphoproteome Resource for Systems Biology Research
Bernd Bodenmiller and Ruedi Aebersold
20 Protein-Centric Data Integration for Functional Analysis of Comparative Proteomics Data
Peter B. McGarvey, Jian Zhang, Darren A. Natale, Cathy H. Wu, and Hongzhan Huang
21 Integration of Proteomic and Metabolomic Profiling as well as Metabolic Modeling for the Functional Analysis of Metabolic Networks
Patrick May, Nils Christian, Oliver Ebenhöh, Wolfram Weckwerth, and Dirk Walther
22 Time Series Proteome Profiling
Catherine A. Formolo, Michelle Mintz, Asako Takanohashi, Kristy J. Brown, Adeline Vanderver, Brian Halligan, and Yetrib Hathout
Index
Contributors
Ruedi Aebersold • Institute of Molecular Systems Biology, ETH Zurich, Zurich,
Switzerland
Rolf Apweiler • The European Bioinformatics Institute, Cambridge, UK
Cecilia N. Arighi • Department of Computer and Information Sciences, University
of Delaware, Newark, DE, USA
Nuno Bandeira • Center for Computational Mass Spectrometry, University of
California, San Diego, La Jolla, CA, USA
Harald Barsnes • Department of Informatics, University of Bergen, Bergen, Norway
Bernd Bodenmiller • Institute of Molecular Systems Biology, ETH Zurich, Zurich,
Switzerland
Kristy J. Brown • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Chuming Chen • Department of Computer and Information Sciences, University of
Delaware, Newark, DE, USA
Nils Christian • Max-Planck-Institute for Molecular Plant Physiology,
Potsdam-Golm, Germany
Don S. Daly • Pacific Northwest National Laboratory, Richland, WA, USA
Peter D’Eustachio • Department of Biochemistry, New York University School of
Medicine, New York, NY, USA
Oliver Ebenhöh • Max-Planck-Institute for Molecular Plant Physiology,
Potsdam-Golm, Germany
Nathan J. Edwards • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Ingvar Eidhammer • Department of Informatics, University of Bergen, Bergen, Norway
Jan Eriksson • Swedish University of Agricultural Sciences, Uppsala, Sweden
David Fenyö • The Rockefeller University, New York, NY, USA
Catherine A. Formolo • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Morgan C. Giddings • Departments of Microbiology & Immunology and Biomedical
Engineering, The University of North Carolina at Chapel Hill, Chapel Hill,
NC, USA
Brian Halligan • Bioinformatics, Human and Molecular Genetics Center, Medical
College of Wisconsin, Milwaukee, WI, USA
Yetrib Hathout • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Henning Hermjakob • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Hongzhan Huang • Department of Computer and Information Sciences, University
of Delaware, Newark, DE, USA
Sarah Hunter • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Stuart R. Jefferys • Department of Bioinformatics & Computational Biology,
The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Robel Y. Kahsay • DuPont Central Research & Development,
Wilmington, DE, USA
Andrew Keller • Rosetta Biosoftware, Seattle, WA, USA
Lennart Martens • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Patrick May • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm,
Germany
Jennifer McDowall • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Peter B. McGarvey • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Michelle Mintz • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Darren A. Natale • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Jules A. Nchoutmboube • Department of Computer and Information Sciences,
University of Delaware, Newark, DE, USA
Claire O’Donovan • The European Bioinformatics Institute, Cambridge, UK
Sandra Orchard • EMBL Outstation, European Bioinformatics
Institute (EBI), Cambridge, UK
Ljiljana Paša-Tolić • Pacific Northwest National Laboratory, Richland, WA, USA
Florian Reisinger • EMBL Outstation, European Bioinformatics Institute (EBI),
Cambridge, UK
Habtom W. Ressom • Department of Oncology, Georgetown University Medical
Center, Washington, DC, USA
Errol W. Robinson • Pacific Northwest National Laboratory, Richland, WA, USA
Vijay K. Shanker • Department of Computer and Information Sciences, University of
Delaware, Newark, DE, USA
David Shteynberg • Institute for Systems Biology, Seattle, WA, USA
Amy Y. Siu • Department of Computer and Information Sciences, University of
Delaware, Newark, DE, USA
Asako Takanohashi • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Zhixin Tian • Pacific Northwest National Laboratory, Richland, WA, USA
Nikola Tolić • Pacific Northwest National Laboratory, Richland, WA, USA
Catalina O. Tudor • Department of Computer and Information Sciences,
University of Delaware, Newark, DE, USA
Adeline Vanderver • Center for Genetic Medicine Research, Children’s National
Medical Center, Washington, DC, USA
Rency S. Varghese • Department of Oncology, Georgetown University Medical
Center, Washington, DC, USA
Sona Vasudevan • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
C.R. Vinayaka • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Juan Antonio Vizcaíno • EMBL Outstation, European Bioinformatics Institute
(EBI), Cambridge, UK
Dirk Walther • Max-Planck-Institute for Molecular Plant Physiology, Potsdam-
Golm, Germany
Wolfram Weckwerth • Molecular Systems Biology, University of Vienna, Vienna,
Austria
Amanda M. White • Pacific Northwest National Laboratory, Richland, WA, USA
Cathy H. Wu • Department of Computer and Information Sciences,
University of Delaware, Newark, DE, USA
Si Wu • Pacific Northwest National Laboratory, Richland, WA, USA
Richard C. Zangar • Pacific Northwest National Laboratory, Richland, WA, USA
Jian Zhang • Department of Biochemistry and Molecular & Cellular Biology,
Georgetown University Medical Center, Washington, DC, USA
Protein Bioinformatics Databases and Resources (Chen, Huang, and Wu)
Recently, proteomics data analysis has moved toward information integration of multiple studies including cross-species analyses (6–9). The richness of proteomics data allows researchers to ask complex biological questions and gain new scientific insights. To support comparative proteomics, data-driven hypothesis generation, and biological knowledge discovery, many protein-related bioinformatics databases, query facilities, and data analysis software tools have been developed. These organize and provide biological annotations for individual proteins to support sequence, structural, functional and evolutionary analyses in the context of pathway, network and systems biology. However, it is not always easy for researchers to quickly find the pieces of related information. In this chapter, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases and resources that are relevant to comparative proteomics research. We highlight some of these databases, and focus on the types of data stored and the related data access and data analysis support. We also discuss the challenges and opportunities for developing new protein bioinformatics databases in terms of supporting data integration and comparative analysis, maintaining data provenance and managing biological knowledge.
2. Overview
Our coverage of protein bioinformatics databases in this chapter is by no means exhaustive. We refer the readers to ref. 10 for a more complete list. Our intention is to cover those that are recent, high quality, publicly available, and are expected to be of interest to more users in the comparative proteomics community. Based on the topics and data stored, protein bioinformatics databases can be primarily classified as sequence databases, family databases, structure databases, function databases and proteomics databases as shown in Table 1. It is worth noting that certain databases can be classified into more than one category. Please visit http://www.proteininformationresource.org/staff/chenc/MiMB/dbSummary.html to access the databases reviewed in this chapter through their corresponding web addresses (URLs).

Table 1. Overview of protein bioinformatics databases (a blank category cell means the category of the row above applies)

| Primary category | Secondary category | Database name | Database content | URL | References |
|---|---|---|---|---|---|
| Sequence | NCBI | Reference Sequence (RefSeq) | Biologically non-redundant collection of DNA, RNA, and protein sequences | http://www.ncbi.nlm.nih.gov/RefSeq/ | (11) |
| | | Entrez Protein Database | Collection of protein sequences from a variety of sources, and translations from annotated coding regions in GenBank and RefSeq | http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein | (20) |
| | UniProt | UniProt Knowledgebase (UniProtKB) | Collection of functional information on proteins with accurate, consistent and rich annotation | http://www.uniprot.org/help/uniprotkb | (13) |
| | | UniProt Archive (UniParc) | Comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world | http://www.uniprot.org/help/uniparc | (14) |
| | | UniProt Reference Clusters (UniRef) | Clustered sets of sequences from UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records | http://www.uniprot.org/help/uniref | (15) |
| Family | Whole protein | PIRSF | Comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships based on whole protein sequences | http://www.pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml | (18) |
| | | Clusters of Orthologous Groups of proteins (COGs) | Phylogenetic classification of proteins encoded in complete genomes | http://www.ncbi.nlm.nih.gov/COG/ | (64) |
| | | Protein ANalysis THrough Evolutionary Relationships Classification System (PANTHER) | Proteins are classified by expert biologists into families and subfamilies of shared function and further categorized by GO terms | http://www.pantherdb.org/ | (29) |
| | | ProtoNet | Automatic hierarchical classification of protein sequences | http://www.protonet.cs.huji.ac.il/index.php | (65) |
| | Protein domain | Pfam | Protein families of domains each represented by multiple sequence alignments and Hidden Markov Models (HMMs) | http://www.pfam.sanger.ac.uk/ | (19) |
| | | ProDom | Comprehensive set of protein domain families automatically generated from the UniProtKB | http://www.prodom.prabi.fr/prodom/current/html/home.php | (21) |
| | | Conserved Domains Database (CDD) | Collections of multiple sequence alignments representing conserved domains | http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd | (66) |
| | | Simple Modular Architecture Research Tool (SMART) | Resource for identification and annotation of protein domains and the analysis of domain architectures | http://www.smart.embl.de/ | (31) |
| | Protein motif | PRINTS | Group of conserved motifs used to characterize a protein family | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php | (30) |
| | | PROSITE | Protein domains, families and functional sites as well as associated patterns and profiles to identify them | http://www.ca.expasy.org/prosite/ | (24) |
| | Integrative | InterPro | Integrated resource of protein families, domains and functional sites from Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF etc. | http://www.ebi.ac.uk/interpro/ | (27) |
| Structure | 3D structure | Worldwide Protein Data Bank (wwPDB) | Repository for the 3D coordinates and related information on more than 38,000 macromolecular structures, including proteins, nucleic acids and large macromolecular complexes that have been determined using X-ray crystallography, NMR and electron microscopy techniques | http://www.wwpdb.org/ | (23) |
| | | Molecular Modeling Database (MMDB) | 3D macromolecular structures, including proteins and polynucleotides | http://www.ncbi.nlm.nih.gov/sites/entrez?db=structure | (67) |
| | | ModBase | 3D protein models calculated by comparative modeling | http://www.modbase.compbio.ucsf.edu/modbase-cgi/index.cgi | (68) |
| | | SWISS-MODEL Repository | Annotated protein 3D models | http://www.swissmodel.expasy.org/repository/ | (69) |
| | Structural classification | CATH | Hierarchical classification of protein domain structures in the Protein Data Bank | http://www.cathdb.info/ | (37) |
| | | Structural Classification Of Proteins (SCOP) | Description of the evolutionary and structural relationships of the proteins of known structures | http://www.scop.mrc-lmb.cam.ac.uk/scop/ | (22) |
| | | SUPERFAMILY | Structural and functional annotation for all proteins and genomes based on a collection of Hidden Markov Models, which represents structural protein domains at the SCOP superfamily level | http://www.supfam.org/SUPERFAMILY/ | (32) |
| | Protein folding | Protein Folding Database (PFD) | Repository of available experimental protein folding data | http://www.pfd.med.monash.edu.au/public_html/index.php | (38) |
| | | KineticDB | Experimental data on protein folding kinetics | http://www.kineticdb.protres.ru/db/index.pl | (70) |
| | Protein modification | RESID | Collection of annotations and structures for protein pre-, co- and post-translational modifications | http://www.ebi.ac.uk/RESID/ | (71) |
| | | Phospho3D | 3D structures of phosphorylation sites that stores information retrieved from the phospho.ELM database | http://www.cbm.bio.uniroma2.it/phospho3d/ | (40) |
| Function | Intermolecular interactions | IntAct | Protein interaction data from literature and user submission | http://www.ebi.ac.uk/intact/main.xhtml | (42) |
| | | Database of Interacting Proteins (DIP) | Experimentally determined protein–protein interactions | http://www.dip.doe-mbi.ucla.edu/dip/Main.cgi | (72) |
| | | Reactome | A curated knowledgebase of biological pathways | http://www.reactome.org/ | (47) |
| | | Biological General Repository for Interaction Datasets (BioGRID) | Collections of protein and genetic interactions from major model organism species | http://www.thebiogrid.org | (73) |
| | Metabolic pathways | Kyoto Encyclopedia of Genes and Genomes (KEGG) | Pathway maps on the molecular interaction and reaction networks for metabolism | http://www.genome.jp/kegg/pathway.html | (74) |
| | | BioCyc | Pathway/Genome Databases (PGDBs) on the pathways and genomes of different organisms | http://www.biocyc.org/ | (51) |
| | | MetaCyc | Nonredundant, experimentally elucidated metabolic pathways | http://www.metacyc.org/ | (51) |
| | Integrative | Michigan molecular interactions (MiMI) | Merged view of several popular interaction databases including: BIND, HPRD, IntAct, GRID, and others | http://www.mimitest.ncibi.org/MimiWeb/main-page.jsp | (75) |
| Proteomics | Gel electrophoresis | WORLD-2DPAGE Constellation | List of World-2DPAGE database servers, World-2DPAGE Portal that queries simultaneously worldwide proteomics databases, and World-2DPAGE Repository | http://www.world-2dpage.expasy.org/ | (52) |
| | Mass spectrometry | Global Proteome Machine Database (GPMDB) | Mass spectral library for data from a variety of organisms, the identified peptides are matched to the Ensembl genome database | http://www.thegpm.org/GPMDB/index.html | (76) |
| | | PRoteomics IDEntifications database (PRIDE) | Protein and peptide identifications that have been described in the scientific literature together with the evidence supporting these identifications | http://www.ebi.ac.uk/pride/ | (54) |
| | | PeptideAtlas | Peptides identified in a large set of LC–MS/MS proteomics experiments | http://www.peptideatlas.org/ | (77) |
| | | Peptidome | Tandem mass spectrometry peptide and protein identification data generated by the scientific community | http://www.ncbi.nlm.nih.gov/peptidome/ | (78) |

3. Databases and Resources Highlights
3.1. Protein Sequence Databases
Protein sequence databases serve as the archival repositories for collections of protein sequences as well as their associated annotations. These databases are also the primary sources for developing other resources such as protein family databases, and the foundation for medical and functional studies.
3.1.1. RefSeq
The National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database provides curated non-redundant sequences for genomic regions, transcripts and proteins (11). The RefSeq collection is derived from the sequence data available in the redundant archival database GenBank (12). RefSeq sequences include coding regions, conserved domains, variations, references, names, and database cross-references. The sequences are annotated using a combined approach of collaboration, automated prediction, and manual curation (11). The RefSeq release 37 of September 11, 2009 includes 8,835,796 proteins and 9,005 organisms. The RefSeq data can be accessed from NCBI web sites by Entrez query, BLAST, FTP download etc.
The UniProt Consortium consists of groups from the European
BioinformaticsInstitute(EBI),theSwissInstituteofBioinformatics
(SIB) and the Protein Information Resource (PIR). The UniProt
Consortium provides a central resource for protein sequences and
functional annotations with four database components to support
protein bioinformatics research:
The UniProt Knowledgebase (UniProtKB) is the predomi-
●
●
nant data store for functional information on proteins (13).
The UniProtKB consists of two sections: UniProtKB/Swiss-
Prot, which contains manually annotated records with infor-
mation extracted from literature and curator-evaluated
computational analysis, and UniProtKB/TrEMBL, which
contains computationally analyzed records with rule-based
automatic annotation. Comparative analysis and query across
databases are supported by the UniProtKB extensive cross-
references, functional and feature annotations, classification,
and literature-based evidence attribution. The UniProtKB
release 15.9 of October 13, 2009 includes 510,076
UniProtKB/Swiss-Prot sequence entries, comprising
179,409,349 amino acids abstracted from 183,725 references,
and 9,501,907 UniProtKB/TrEMBL sequence entries com-
prising 3,068,281,486 amino acids.
The UniProt archive (UniParc) (
●
● 14) is an archival protein
sequence database from all major publicly accessible resources.
UniParc contains protein sequences and database cross-refer-
ences to the provenance of the sequences. Text- and sequence-
based searches are available from UniParc database web site.
The UniProt Reference Clusters (UniRef) (
●
● 15) merge
sequences and sub-sequences that are 100% (UniRef100), ³90%
3.1.1. RefSeq
3.1.2. UniProt
28. 11
Protein Bioinformatics Databases and Resources
(UniRef90), or ³50% (UniRef50) identical, regardless of
source organism to speed up searches.
The UniProt Metagenomic and Environmental Sequences
●
●
(UniMES) database is a repository specifically developed for
Metagenomic and environmental data. UniMES currently
contains data from the Global Ocean Sampling Expedition
(GOS) (16), which predicts nearly six million proteins, pri-
marily from oceanic microbes (13).
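
The idea behind the UniRef identity thresholds can be illustrated with a small sketch. This is a toy example under stated assumptions, not the procedure used to build UniRef: it assumes the sequences are already aligned and of equal length, and the function names (`percent_identity`, `greedy_cluster`) and the sequences are hypothetical.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Group each sequence under the first representative it matches
    at or above the identity threshold; otherwise start a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if percent_identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences, all of length 10.
TOY = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQR",  # identical to seqA
    "seqC": "MKTAYIAKQK",  # 9 of 10 positions match seqA
    "seqD": "MATAYLSKQK",  # 6 of 10 positions match seqA
}

for threshold in (100.0, 90.0, 50.0):
    print(threshold, greedy_cluster(TOY, threshold))
```

Lowering the threshold merges more sequences under one representative, which is why a search against a 50% cluster set touches far fewer entries than a search against the full sequence collection.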
The UniProt web site (http://www.uniprot.org) is the primary
access point to the data and documentation. The site also provides
batch retrieval using UniProt identifiers, BLAST-based sequence
similarity search, ClustalW-based sequence alignment, and database
identifier mapping. The UniProt FTP download site provides batch
download of protein sequence data in various formats, including
flat file text, XML, RDF, FASTA, and GFF. Programmatic access to
the data and search results is supported via simple HTTP RESTful
web services or UniProtJAPI (17) for Java-based applications.
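
As an illustration of working with one of the download formats listed above, the following minimal sketch reads FASTA text and splits a UniProt-style header of the form `db|accession|entry_name description`. It assumes that header layout; the accessions, entry names, and sequences in the example are made up for illustration, and no network access or UniProt client library is involved.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_uniprot_header(header: str) -> Dict[str, str]:
    """Split a 'db|accession|entry_name description' header into fields."""
    ident, _, description = header.partition(" ")
    db, accession, entry_name = ident.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Hypothetical records; identifiers and sequences are invented.
EXAMPLE = """\
>sp|X00001|EXMP1_HUMAN Example protein 1 OS=Homo sapiens
MKTAYIAKQR
QISFVKSHFS
>tr|X00002|X00002_MOUSE Example protein 2 OS=Mus musculus
MSDNELLK
"""

for header, sequence in read_fasta(EXAMPLE):
    fields = parse_uniprot_header(header)
    print(fields["accession"], fields["db"], len(sequence))
```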
3.2. Protein Family Databases
The primary protein sequence databases can be used to develop new
resources with value-added information, either by classifying
protein sequences into families or by assigning properties to the
sequences through the detection of specific sequence features such
as domains, motifs, and functional sites.
3.2.1. PIRSF
The PIRSF classification system provides comprehensive and
non-overlapping clustering of UniProtKB (13) sequences into a
hierarchical order to reflect their evolutionary relationships
based on whole proteins rather than on the component domains. The
PIRSF system classifies protein sequences into families whose
members are both homologous (evolved from a common ancestor) and
homeomorphic (sharing full-length sequence similarity and a common
domain architecture) (18). The PIRSF family classification results
are expert-curated based on literature review and integrative
sequence and functional analysis. The classification report shows
information on PIRSF members and general statistics, family and
function/structure relationships, database cross-references, and a
graphical display of the domain and motif architecture of seed
members or all members. The web-based PIRSF system has been shown
to be a useful tool for studying the function and evolution of
protein families (18). It provides batch retrieval of entries from
the PIRSF database. The PIRSF scan allows searching a query
sequence against the set of fully curated PIRSF families with
benchmarked Hidden Markov Models. The PIRSF membership hierarchy
data is also available for FTP download.
3.2.2. Pfam
Pfam is a database of protein domains and families represented as
multiple sequence alignments and Hidden Markov Models (HMMs) (19).
Pfam is built from the protein sequence data in UniProtKB (13),
NCBI GenPept (20) and selected metagenomics projects. The Pfam
database contains two components: Pfam-A and Pfam-B. Each Pfam-A
entry consists of a manually curated, high-quality representative
seed alignment, a profile HMM built from the seed alignment, and
an automatically generated full alignment of all detectable family
member protein sequences. Pfam-B entries are automatically
generated from the ProDom database (21). Pfam release 24.0 of
October 2009 contains 11,912 families. The Pfam database is
further organized into higher-level hierarchical groupings of
related families called clans (19), which are collections of
related Pfam-A entries built manually based on the similarity of
their sequences, known structures, profile HMMs, and other
databases such as SCOP (22). The Pfam web site provides a set of
query and browsing interfaces for analyzing protein sequences for
Pfam matches, for viewing Pfam family annotations, alignments,
groups of related families, and the domains of a protein sequence,
as well as for finding the domains on a PDB (23) structure. Pfam
data can be downloaded from the FTP site or accessed
programmatically through RESTful and SOAP-based web services.
3.2.3. PROSITE
PROSITE (24) is a database of annotated motif descriptors
(patterns or profiles), which can be used for the identification
of protein domains and families. The motif descriptors are derived
from multiple alignments of homologous sequences and have the
advantage of identifying distant relationships among sequences
(25). A set of ProRules, which provide additional information
about the functionally and/or structurally critical amino acids,
is used to increase the discriminatory power of the motif
descriptors (24). The PROSITE web site provides keyword-based
search and allows browsing of motif entries, ProRule descriptions,
taxonomic scope, and number of positive hits. The ScanProsite (26)
tool allows one either to scan protein sequences for the
occurrence of PROSITE motifs by entering UniProtKB AC and/or ID,
PDB identifier(s) or protein sequence(s), or to scan the UniProtKB
or PDB databases for the occurrence of a pattern by entering the
PROSITE AC and/or ID or the user's own pattern(s). ScanProsite can
also be accessed programmatically through a simple HTTP web
service. The PROSITE documentation entries and related tools can
be downloaded from the FTP site.
3.2.4. InterPro
InterPro (27) is an integrated resource of predictive models or
“signatures” representing protein domains, families, regions,
repeats and sites from major protein signature databases including
Gene3D (28), PANTHER (29), Pfam (19), PIRSF (18), PRINTS (30),
ProDom (21), PROSITE (24), SMART (31), SUPERFAMILY (32) and
TIGRFAMs (33). Each entry in the InterPro database is annotated
with a descriptive abstract, a name, and cross-references to the
original data sources, as well as to specialized functional
databases. The InterPro release 23.0 of September 23, 2009
includes 19,150 entries containing 434 new signatures. The
database is available via a web interface and anonymous FTP
download. The software tool InterProScan (34) is provided as a
protein sequence classification and comparison package that can be
used via a web interface and SOAP-based web services, or can be
installed locally for bulk operations. The InterPro BioMart (35)
allows users to retrieve InterPro data from a query-optimized data
warehouse that is synchronized with the main InterPro database,
and to build simple or complex queries and control the query
results through a unified interface.
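
Of the signature types that InterPro integrates, PROSITE patterns are the simplest to illustrate. The sketch below converts a pattern written in PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It is a simplified illustration, not ScanProsite: the function name `prosite_to_regex` is hypothetical, the test sequences are invented, and less common syntax (such as an anchor inside square brackets) is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a regular expression.

    Handled syntax: '-' separators, 'x' (any residue), [ABC] (allowed
    residues), {ABC} (excluded residues), (n) and (n,m) repeat counts,
    '<' (N-terminal anchor), '>' (C-terminal anchor), trailing '.'.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        element = element.replace("x", ".")                    # wildcard
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repeats
        parts.append(prefix + element + suffix)
    return "".join(parts)


# A zinc-finger-style pattern and a glycosylation-site-style pattern.
zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
glyco_site = prosite_to_regex("N-{P}-[ST]-{P}")

print(zinc_finger)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
print(glyco_site)   # N[^P][ST][^P]

# Invented sequence containing one candidate site.
match = re.search(glyco_site, "MKNGSAQ")
print(match.start(), match.group())
```

Short patterns like these match by chance fairly often, which is one reason the text above mentions ProRules as a way to increase discriminatory power.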
3.3. Protein Structure Databases
Many bioinformatics studies are based on the premise that proteins
of similar sequences carry out similar functions, whereas those
with different sequences carry out different functions. More and
more experimental data support the notion that the structure of a
protein reflects the role it plays and therefore determines its
function in biological processes. The protein structure databases
organize and annotate various experimentally determined protein
structures, providing the biological community with access to the
experimental data in a useful way.
3.3.1. Worldwide PDB
The worldwide PDB (wwPDB) was established in 2003 as an
international collaboration to maintain a single and publicly
available Protein Data Bank Archive (PDB Archive) of
macromolecular structural data (23). The wwPDB members are the
RCSB PDB (USA), the Macromolecular Structure Database at the
European Bioinformatics Institute (MSD-EBI) (UK), the Protein Data
Bank Japan (PDBj) at Osaka University (Japan) and the
BioMagResBank (BMRB) at the University of Wisconsin-Madison (USA).
The “PDB Archive” is a collection of flat files in three different
formats: the legacy PDB file format; the PDB exchange format that
follows the mmCIF syntax (http://www.deposit.pdb.org/mmcif/); and
the PDBML/XML format (36). Each member site serves as a
deposition, data processing and distribution site for the PDB
Archive, and each provides its own view of the primary data and a
variety of tools and resources. As of October 27, 2009, there are
61,086 structures in the wwPDB database.
3.3.2. CATH
CATH (Class, Architecture, Topology, Homology) is a database of
protein domain structures in the Protein Data Bank, in which
domains are hierarchically classified by curators guided by
prediction algorithms (such as structure comparison). CATH
clusters proteins at four major levels (37):
● Class (C): secondary structure composition and packing within
the structure.
● Architecture (A): orientations of the secondary structures,
ignoring the connectivity among the secondary structures.
● Topology (T): whether they share the same topology in the core
of the domain.
● Homologous superfamily (H): sequence and structural
similarities.
The CATH release 3.2.0 of July 14, 2008 contains 114,215 assigned
domains. CATH provides the SSAP server, which allows users to
compare the structures of two proteins and view the resulting
structural alignment.
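
The four CATH levels are commonly written as a dotted numeric code, one number per level. Assuming that representation, the following sketch parses two codes and reports the deepest level they share. The function names are hypothetical and the codes are used only as examples of the dotted form.

```python
from typing import Optional, Tuple

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath(code: str) -> Tuple[int, int, int, int]:
    """Split a dotted C.A.T.H code into its four integer components."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: " + code)
    c, a, t, h = (int(p) for p in parts)
    return (c, a, t, h)


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Return the most specific CATH level at which two codes agree,
    or None if they already differ at the Class level."""
    shared = None
    for level, x, y in zip(LEVELS, parse_cath(code_a), parse_cath(code_b)):
        if x != y:
            break
        shared = level
    return shared


print(deepest_shared_level("3.40.50.300", "3.40.50.720"))  # Topology
print(deepest_shared_level("3.40.50.300", "3.30.70.100"))  # Class
print(deepest_shared_level("1.10.8.10", "3.40.50.300"))    # None
```

Because the hierarchy is strictly nested, agreement at a deeper level implies agreement at every level above it, so a single left-to-right comparison is sufficient.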
3.3.3. SCOP
The SCOP (Structural Classification of Proteins) database provides
a comprehensive and detailed description of the evolutionary and
structural relationships of proteins of known structure. The SCOP
classification hierarchy is constructed based on a domain in the
experimentally determined protein structure and includes the
following levels (22):
● Species: distinct protein sequence and its naturally occurring
or artificially created variants.
● Protein: similar sequences with essentially the same function.
● Family: proteins with related sequences but typically distinct
functions.
● Superfamily: protein families with a common evolutionary
ancestor.
● Fold: superfamilies with structural similarity (same major
secondary structures in the same arrangement and with the same
topological connections, not necessarily with a common
evolutionary origin).
● Class: based on the secondary structure content and organization
of folds.
The SCOP release 1.75 of June 2009 includes 38,221 PDB entries,
1,195 folds, 1,962 superfamilies and 3,902 families.
3.3.4. PFD
The Protein Folding Database (PFD) is a publicly searchable
repository that collects experimental thermodynamic and kinetic
data for the folding of proteins. Experimenters deposit data
including Constructor, Mutations, Equilibrium Method, Kinetic
Method, Equilibrium Data, Kinetic Data, and Publications (38). The
PFD database uses the International Foldeomics Consortium
standards (39) for data deposition, analysis and reporting to
facilitate the comparison of folding rates, energies and structure
across diverse sets of proteins (38). The PFD release 2.2 of June
8, 2009 contains 296 entries, 70 proteins, 53 families, 30 species
and 230 (five proteins) φ values. The web site provides advanced
text searches of protein names, literature references, and
experimental details, with search results displayed in a tabular
view. Graphical visualization tools have been built for raw
equilibrium data, chevron data, contact order and folding rates,
with hyperlinks on the graphs that link directly to the data in
text format.
3.3.5. Phospho3D
Phospho3D (40) is a database of 3D structures of phosphorylation
sites. Phospho3D is constructed using data collected from the
phospho.ELM (41) database of experimentally verified
phosphorylation sites in eukaryotic proteins, and is enriched with
structural information and annotations at the residue level. The
basic information unit in the Phospho3D database consists of the
instance, its flanking sequence (ten residues) and its “zone,” a
3D neighborhood including any residue whose distance does not
exceed 12 Å (40). For each zone, structural similarity and
biochemical similarity are used to collect the results of a
large-scale local structural comparison against a representative
dataset of PDB (23) protein chains, which provides clues for the
identification of new putative phosphorylation sites. Users can
browse the data in the Phospho3D database or search the database
using kinase name, PDB identification code or keywords.
3.4. Protein Function Databases
The unique feature of proteins that allows their diverse functions
is the ability to bind to other molecules specifically. For
example, proteins can act as enzymes that catalyze chemical
reactions in the cell or that manipulate the replication and
transcription of DNA. Many proteins are also involved in cell
signaling and signal transduction. Protein function databases
maintain information about metabolic pathways, enzymes, compounds,
and the inter-molecular interactions and regulatory pathway
mechanisms underlying many biological processes.
3.4.1. IntAct
IntAct is an open source database and toolkit for the storage,
presentation and analysis of protein interaction data (42). IntAct
provides all relevant experimental details of protein interactions
described in the originating publication. All the entries in the
database are fully compliant with the IMEx (43) guidelines and the
MIMIx (44) standard. The technical details of the experiment,
binding sites, protein tags and mutations are annotated with the
Molecular Interaction ontology of the Proteomics Standards
Initiative (PSI-MI) (45). The latest database contains 202,419
binary interactions, 60,310 proteins, 11,119 experiments and 1,509
controlled vocabulary terms. The IntAct web site provides both
textual and graphical views of protein interactions, and allows
exploring interaction networks in the context of the Gene Ontology
(46) controlled vocabulary and the InterPro (27) domains of the
interacting proteins. IntAct data and source code are available
for downloading from its web site. In addition, a set of tools has
been developed by the IntAct project:
● ProViz: visualization of protein–protein interaction graphs.
● MiNe: compute the minimal connecting networks for a given set of
proteins.
● PSI-MI Semantic Validator: validate files in PSI-MI XML 2.5 and
PSI-PAR format.
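
Binary interaction data of the kind IntAct stores map naturally onto an undirected graph. The sketch below builds such a graph and finds a shortest chain of interactions between two proteins by breadth-first search. It is only a simplified stand-in for network tools such as MiNe, whose algorithm is not described here; the function names and protein identifiers are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_graph(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from binary interactions."""
    graph: Dict[str, Set[str]] = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest interaction chain, or None."""
    if start not in graph or goal not in graph:
        return None
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in sorted(graph[node]):  # sorted for reproducible output
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical binary interactions.
INTERACTIONS = [
    ("P1", "P2"), ("P2", "P3"), ("P3", "P4"),
    ("P1", "P5"), ("P5", "P4"), ("P6", "P7"),
]
graph = build_graph(INTERACTIONS)
print(shortest_path(graph, "P1", "P4"))  # ['P1', 'P5', 'P4']
print(shortest_path(graph, "P1", "P7"))  # None
```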
3.4.2. Reactome
Reactome is an open source, expert-curated and peer-reviewed
database of biological reactions and pathways with
cross-references to major molecular databases (47). The basic
information in the Reactome database is provided by either
publications or sequence similarity-based inference. The Reactome
release 30 of September 30, 2009 contains 3,916 proteins, 2,955
complexes, 3,541 reactions, and 1,045 pathways for Homo sapiens.
Reactome data can be exported in SBML (48), Protégé (49),
Cytoscape (50) and BioPax (http://www.biopax.org) formats.
Software tools such as PathFinder, SkyPainter and Reactome BioMart
(35) have been developed to support data mining and analysis of
large-scale data sets.
3.4.3. MetaCyc and BioCyc
MetaCyc is a database of non-redundant, experimentally elucidated
metabolic pathways and enzymes curated from the scientific
literature (51). MetaCyc stores pathways involved in primary and
secondary metabolism. It also stores compounds, proteins, protein
complexes and genes associated with these pathways, with extensive
links to other biological databases of protein sequences, nucleic
acid sequences, protein structures and literature. BioCyc is a
collection of Pathway/Genome Databases (PGDBs) (51). Each BioCyc
PGDB contains the metabolic network of one organism predicted by
the Pathway Tools software using MetaCyc as a reference database.
Web-based query, browsing, visualization and comparative analysis
tools are also provided on the MetaCyc and BioCyc web sites. A
collection of data files is also available for downloading.
3.5. Proteomics Databases
The advent of high-throughput 2D-gel and mass spectrometry based
analytical techniques, together with the available protein
sequence databases, has created massive amounts of proteomics
data. To facilitate the sharing and further computational analysis
of published proteomics data, several repositories have been
created.
3.5.1. World-2DPAGE
The World-2DPAGE Constellation (52) is an effort of the Swiss
Institute of Bioinformatics (SIB) to promote and publish
two-dimensional gel electrophoresis proteomics data online through
the ExPASy proteomics server. The World-2DPAGE Constellation
consists of three components:
● WORLD-2DPAGE List (http://www.world-2dpage.expasy.org/list/)
contains references to known federated 2D PAGE databases, as well
as to 2D PAGE-related servers and services.
● World-2DPAGE Portal (http://www.world-2dpage.expasy.org/portal/)
is a dynamic portal that serves as a single interface to query
simultaneously world-wide gel-based proteomics databases that are
built using the Make2D-DB package (53).
● World-2DPAGE Repository
(http://www.world-2dpage.expasy.org/repository/) is a public
repository for gel-based proteomics data with protein
identifications published in the literature. Mass-spectrometry
based proteomics data from related studies can also be submitted
to the PRIDE database (54) so that interested readers can explore
the data in the views of 2D-gel and/or MS.
3.5.2. PRIDE
The PRoteomics IDEntifications database (PRIDE) is a repository
for mass-spectrometry based proteomics data, including
identifications of proteins, peptides and post-translational
modifications that have been described in the scientific
literature, together with supporting mass spectra (54). The PRIDE
team has built an infrastructure and a set of software tools to
facilitate data submissions in PRIDE XML or mzData XML format from
labs using different MS-based proteomics technologies. The PRIDE
database can be queried by experiment accession number, protein
accession number, literature reference, and sample parameters
including species, tissue, sub-cellular location and disease
state. The query results can be retrieved as PRIDE XML, mzData
XML, or HTML. The PRIDE database includes a BioMart (35) interface
that provides access to public PRIDE data from a query-optimized
data warehouse, as well as programmatic web service access. The
PRIDE project also provides the Protein Identifier Cross-Reference
Service (PICR) (55), which maps protein sequence identifiers from
over 60 different databases via the UniParc (14) database. The
Database on Demand (DoD, http://www.ebi.ac.uk/pride/dod) service
provides custom FASTA formatted sequence databases according to a
set of user-selectable criteria to optimize the search engine
results. As of November 19, 2009, the PRIDE database contains
10,329 experiments, 2,827,384 identified proteins, 12,542,472
identified peptides, 1,891,670 unique peptides and 56,703,344
spectra.
4. Discussion
Although a variety of protein bioinformatics databases and
resources have been developed to catalog and store different
information about proteins, there are still opportunities to
develop new solutions to facilitate comparative analysis,
data-driven hypothesis generation, and biological knowledge
discovery.
4.1. Data Integration and Comparative Analysis
As the volume and diversity of data and the desire to share those
data increase, we inevitably encounter the problem of combining
heterogeneous data generated from many different but related
sources and providing users with a unified view of the combined
data set. This problem arises in the biological and biomedical
research community, where research data from different
bioinformatics data repositories and laboratories need to be
combined and analyzed. There is an urgent need for computational
methods that integrate data from multiple studies and answer more
complex biological questions than traditional methods can tackle.
Comparing experimental results across multiple laboratories and
data types can also help form new hypotheses for further
experimentation (56–58). Different laboratories use different
experimental protocols, instruments and analysis techniques, which
makes direct comparison of their experimental results difficult.
However, having related data in one place makes it possible to
query, compare and further analyze combined protein and gene data
sets.
In general, there are two types of data integration approaches.
The data warehouse approach puts data sources into a centralized
location with a global data schema and an indexing system for fast
data retrieval. An example of this approach is the NIAID (National
Institute of Allergy and Infectious Diseases) Biodefense Resource
Center (http://www.proteomicsresource.org), which uses a
protein-centric data warehouse (Master Protein Directory) to
integrate and support mining and comparative analysis of large and
heterogeneous “omics” data across different experiments and
organisms (59). The other approach to data integration involves
the federation of data across multiple sources. An example of this
approach is BioMart (35), an open source database management
system that uses integrated query interfaces to query different
BioMarts and allows users to group and refine their query results.
BioMart can also be accessed programmatically through web services
or through software libraries written in Java or Perl.
4.2. Data Provenance and Biological Knowledge
In many cases, the most difficult tasks in protein bioinformatics
data management and analysis are not mapping biological entities
from different sources or managing and processing large sets of
experimental data, such as gel images and mass spectra. Rather,
the difficulty lies in recording the detailed provenance of data,
i.e., what was done, why it was done, where it was done, which
instrument was used, what settings were used, and how it was done.
The provenance of experimental data is an important aspect of
scientific best practice and is central to scientific discovery
(60).
In proteomics studies, great efforts have been made to develop and
maintain data format standards, such as mzXML (61) and HUPO PSI
(HUPO Proteomics Standards Initiative) (62), and minimal
information standards for describing such data, for example, MIAPE
(Minimum Information About a Proteomics Experiment) (63). Even so,
the ontologies and related tools that provide formal
representation of a set of concepts and their relationships within
the domain of “omics” experiments still lag behind the current
development of experimental protocols and methods. The
standardization of data provenance remains a somewhat manual
process, which depends on the efforts of database maintainers and
data submitters.
Most biological and biomedical scientists are more interested in
finding and viewing the “knowledge” contained in an already
analyzed data set. However, much of the protein data generated in
high-throughput research is insignificant to the conclusions of an
analysis. Unfortunately, information about which results are
significant seldom comes with the standard data files and formats
and is usually not easily found in omics repositories unless a
reanalysis is performed or the data is annotated by a curator. For
example, tables of proteins present in a given proteomics
experiment are routinely found as supplemental data in scientific
publications, but are not available in a searchable or easily
computable format. This is unfortunate, as this supplemental
information is the result of considerable analysis by the original
authors of a study to minimize false positive and false negative
results, and thus often represents the “knowledge” that underlies
additional analysis and conclusions reached in a publication.
The NIAID Biodefense Resource Center developed a simple set of
defined fields called a “structured assertion” that could be used
across proteomics, microarray and possibly other data types (59).
A “structured assertion” can represent the results in a simple
form like “protein V (presented) in experimental condition W,”
where V represents any valid identifier and W represents a value
in a simple experimental ontology. A simple two-field assertion
was implemented for the analyzed results of proteomics and
microarray data: an “experimental condition” field containing
simple keywords that describe the key experimental variables
(growth conditions, sample fractionation, time, temperature,
infection status and others), and an “Expression Status” field,
which has three values: increase, decrease or present. Although
seemingly simple, the approach provides unique analytical power by
enabling simple queries across results from different data types
and laboratories.
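
A minimal sketch of how such structured assertions could be represented and queried is shown below. The class and function names are hypothetical, the `data_type` field is an added assumption used here only to distinguish sources, and the identifiers and condition keywords are invented; this is not the Biodefense Resource Center's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Set


class ExpressionStatus(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"
    PRESENT = "present"


@dataclass(frozen=True)
class StructuredAssertion:
    protein_id: str           # "V": any valid identifier
    condition: str            # "W": keyword from a simple experimental ontology
    status: ExpressionStatus  # one of the three Expression Status values
    data_type: str            # assumption: e.g. "proteomics" or "microarray"


def proteins_matching(assertions: Iterable[StructuredAssertion],
                      condition: str, status: ExpressionStatus) -> Set[str]:
    """Proteins asserted with the given status under the given condition."""
    return {a.protein_id for a in assertions
            if a.condition == condition and a.status is status}


def seen_in_all_data_types(assertions: Iterable[StructuredAssertion],
                           condition: str) -> Set[str]:
    """Proteins reported under a condition by every contributing data type."""
    by_type: Dict[str, Set[str]] = {}
    for a in assertions:
        if a.condition == condition:
            by_type.setdefault(a.data_type, set()).add(a.protein_id)
    return set.intersection(*by_type.values()) if by_type else set()


# Invented assertions for illustration.
ASSERTIONS = [
    StructuredAssertion("PROT_A", "infected", ExpressionStatus.INCREASE, "proteomics"),
    StructuredAssertion("PROT_A", "infected", ExpressionStatus.INCREASE, "microarray"),
    StructuredAssertion("PROT_B", "infected", ExpressionStatus.PRESENT, "proteomics"),
    StructuredAssertion("PROT_C", "uninfected", ExpressionStatus.DECREASE, "microarray"),
]

print(proteins_matching(ASSERTIONS, "infected", ExpressionStatus.INCREASE))
print(seen_in_all_data_types(ASSERTIONS, "infected"))
```

Because every result is reduced to the same few fields, the same query works regardless of which laboratory or technology produced the underlying data.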
Acknowledgment
We would like to thank Dr. Winona C. Barker for reviewing the
manuscript and providing constructive comments.
References
1. Ridley, M. (2006) Genome. Harper Perennial,
New York.
2. Velculescu, V. E., Zhang, L., Zhou, W.,
Vogelstein, J., Basrai, M. A., Bassett, D. E. Jr,
Hieter, P., Vogelstein, B., Kinzler, K. W.
(1997) Characterization of the yeast tran-
scriptome. Cell 2, 243–251.
3. Anderson, N. L., Anderson, N. G. (1998)
Proteome and proteomics: new technologies,
new concepts, and new words. Electrophoresis
11, 1853–1861.
4. Hye, A., Lynham, S., Thambisetty, M.,
Causevic, M., Campbell, J., Byers, H. L.,
Hooper, C., Rijsdijk, F., Tabrizi, S. J., Banner,
S., Shaw, C. E., Foy, C., Poppe, M., Archer,
N., Hamilton, G., Powell, J., Brown, R. G.,
Sham, P., Ward, M., Lovestone, S. (2006)
Proteome-based plasma biomarkers for
Alzheimer’s disease. Brain 11, 3042–3050.
5. Decramer, S., Wittke, S., Mischak, H., Zürbig,
P., Walden, M., Bouissou, F., Bascands, J. L.,
Schanstra, J. P. (2006) Predicting the clinical
outcome of congenital unilateral ureteropel-
vic junction obstruction in newborn by uri-
nary proteome analysis. Nat. Med. 4,
398–400.
6. Savidor, A., Donahoo, R. S., Hurtado-
Gonzales, O., Land, M. L., Shah, M. B.,
Lamour, K. H., McDonald, W. H. (2008)
Cross-species global proteomics reveals con-
served and unique processes in Phytophthora
sojae and Phytophthora ramorum. Mol. Cell
Proteomics 8, 1501–1516.
7. Huang, M., Chen, T., Chan, Z. (2006) An
evaluation for cross-species proteomics research
by publicly available expressed sequence tag
database search using tandem mass spectral
data. Rapid Commun. Mass Spectrom. 18,
2635–2640.
8. Ishii, A., Dutta, R., Wark, G. M., Hwang, S. I.,
Han, D. K., Trapp, B. D., Pfeiffer, S. E., Bansal,
R. (2009) Human myelin proteome and com-
parative analysis with mouse myelin. Proc. Natl.
Acad. Sci. U. S. A. 34, 14605–14610.
9. Irmler, M., Hartl, D., Schmidt, T.,
Schuchhardt, J., Lach, C., Meyer, H. E.,
Hrabé de Angelis, M., Klose, J., Beckers, J.
(2008) An approach to handling and inter-
pretation of ambiguous data in transcriptome
and proteome comparisons. Proteomics 6,
1165–1169.
10. Galperin, M. Y., Cochrane, G. R. (2009)
Nucleic acids research annual database issue
and the NAR online molecular biology data-
base collection in 2009. Nucleic Acids Res.
37, D1–D4.
11. Pruitt, K. D., Tatusova, T., Maglott, D. R.
(2007) NCBI reference sequences (RefSeq): a
curated non-redundant sequence database of
genomes, transcripts and proteins. Nucleic
Acids Res. 35, D61–D65.
12. Benson, D. A., Karsch-Mizrachi, I., Lipman,
D. J., Ostell, J., Wheeler, D. L. (2008)
GenBank. Nucleic Acids Res. 36, D25–D30.
13. The UniProt Consortium. (2010) The uni-
versal protein resource (UniProt) in 2010.
Nucleic Acids Res. 38, D142–D148.
14. Leinonen, R., Diez, F. G., Binns, D.,
Fleischmann, W., Lopez, R., Apweiler, R.
(2004) UniProt archive. Bioinformatics 20,
3236–3237.
15. Suzek, B. E., Huang, H., McGarvey, P.,
Mazumder, R., Wu, C. H. (2007) UniRef:
comprehensive and non-redundant UniProt
reference clusters. Bioinformatics 23,
1282–1288.
16. Yooseph, S., Sutton, G., Rusch, D. B., Halpern,
A. L., Williamson, S. J., Remington, K., Eisen,
J. A., Heidelberg, K. B., Manning, G., Li, W.,
Jaroszewski, L., Cieplak, P., Miller, C. S., Li,
H., Mashiyama, S. T., Joachimiak, M. P., van
Belle, C., Chandonia, J. M., Soergel, D. A.,
Zhai, Y., Natarajan, K., Lee, S., Raphael, B. J.,
Bafna, V., Friedman, R., Brenner, S. E., Godzik,
A., Eisenberg, D., Dixon, J. E., Taylor, S. S.,
Strausberg, R. L., Frazier, M., Venter, J. C.
(2007) The Sorcerer II global ocean sampling
expedition: expanding the universe of protein
families. PLoS Biol. 5, e16.
17. Patient, S., Wieser, D., Kleen, M.,
Kretschmann, E., Martin, M. J., Apweiler, R.
(2008) UniProtJAPI: a remote API for access-
ing UniProt data. Bioinformatics 24,
1321–1322.
18. Nikolskaya, A. N., Arighi, C. N., Huang, H.,
Barker, W. C., Wu, C. H. (2006) PIRSF fam-
ily classification system for protein functional
and evolutionary analysis. Evol. Bioinform.
Online 2, 197–209.
19. Finn, R. D., Tate, J., Mistry, J., Coggill, P. C.,
Sammut, S. J., Hotz, H. R., Ceric, G.,
Forslund, K., Eddy, S. R., Sonnhammer, E.
L., Bateman, A. (2008) The Pfam protein
families database. Nucleic Acids Res. 36,
D281–D288.
20. Wheeler, D. L., Barrett, T., Benson, D. A.,
Bryant, S. H., Canese, K., Chetvernin, V.,
Church, D. M., DiCuccio, M., Edgar, R.,
Federhen, S., Geer, L. Y., Kapustin, Y.,
Khovayko, O., Landsman, D., Lipman, D. J.,
Madden, T. L., Maglott, D. R., Ostell, J.,
Miller, V., Pruitt, K. D., Schuler, G. D.,
Sequeira, E., Sherry, S. T., Sirotkin, K.,
Souvorov, A., Starchenko, G., Tatusov, R. L.,
Tatusova, T. A., Wagner, L., Yaschenko, E.
(2007) Database resources of the National
Center for Biotechnology Information.
Nucleic Acids Res. 35, D5–D12.
21. Bru, C., Courcelle, E., Carrère, S., Beausse,
Y., Dalmar, S., Kahn, D. (2005) The ProDom
database of protein domain families: more
emphasis on 3D. Nucleic Acids Res. 33,
D212–D215.
22. Andreeva, A., Howorth, D., Chandonia, J.
M., Brenner, S. E., Hubbard, T. J., Chothia,
C., Murzin, A. G. (2008) Data growth and its
impact on the SCOP database: new develop-
ments. Nucleic Acids Res. 36, D419–D425.
23. Berman, H., Henrick, K., Nakamura, H.,
Markley, J. L. (2007) The worldwide Protein
Data Bank (wwPDB): ensuring a single, uni-
form archive of PDB data. Nucleic Acids Res.
35, D301–D303.
24. Hulo, N., Bairoch, A., Bulliard, V., Cerutti,
L., Cuche, B. A., de Castro, E., Lachaize, C.,
Langendijk-Genevaux, P. S., Sigrist, C. J.
(2008) The 20 years of PROSITE. Nucleic
Acids Res. 36, D245–D249.
25. Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker,
A., Falquet, L., Pagni, M., Bairoch, A., Bucher,
P. (2002) PROSITE: a documented database
using patterns and profiles as motif descrip-
tors. Brief. Bioinform. 3, 265–274.
26. De Castro, E., Sigrist, C. J. A., Gattiker, A.,
Bulliard, V., Langendijk-Genevaux, P. S.,
Gasteiger, E., Bairoch, A., Hulo, N. (2006)
ScanProsite: detection of PROSITE signature
matches and ProRule-associated functional
and structural residues in proteins. Nucleic
Acids Res. 34, W362–W365.
27. Hunter, S., Apweiler, R., Attwood, T. K.,
Bairoch, A., Bateman, A., Binns, D., Bork, P.,
Das, U., Daugherty, L., Duquenne, L., Finn,
R. D., Gough, J., Haft, D., Hulo, N., Kahn,
D., Kelly, E., Laugraud, A., Letunic, I.,
Lonsdale, D., Lopez, R., Madera, M., Maslen,
J., McAnulla, C., McDowall, J., Mistry, J.,
Mitchell, A., Mulder, N., Natale, D., Orengo,
C., Quinn, A. F., Selengut, J. D., Sigrist, C. J.,
Thimma, M., Thomas, P. D., Valentin, F.,
Wilson, D., Wu, C. H., Yeats, C. (2009)
InterPro: the integrative protein signature data-
base. Nucleic Acids Res. 37, D211–D215.
28. Yeats, C., Lees, J., Reid, A., Kellam, P., Martin,
N., Liu, X., Orengo, C. (2008) Gene3D:
comprehensive structural and functional
annotation of genomes. Nucleic Acids Res.
36, D414–D418.
29. Mi, H., Guo, N., Kejariwal, A., Thomas, P. D.
(2007) PANTHER version 6: protein sequence
and function evolution data with expanded
representation of biological pathways. Nucleic
Acids Res. 35, D247–D252.
30. Attwood, T. K. (2002) The PRINTS data-
base: a resource for identification of protein
families. Brief. Bioinform. 3, 252–263.
31. Letunic, I., Doerks, T., Bork, P. (2009)
SMART 6: recent updates and new develop-
ments. Nucleic Acids Res. 37, D229–D232.
32. Wilson, D., Pethica, R., Zhou, Y., Talbot, C.,
Vogel, C., Madera, M., Chothia, C., Gough,
J. (2009) SUPERFAMILY – sophisticated
comparative genomics, data mining, visualiza-
tion and phylogeny. Nucleic Acids Res. 37,
D380–D386.
33. Haft, D. H., Selengut, J. D., White, O. (2003)
The TIGRFAMs database of protein families.
Nucleic Acids Res. 31, D371–D373.
34. Mulder, N., Apweiler, R. (2007) InterPro and
InterProScan: tools for protein sequence clas-
sification and comparison. Methods Mol Biol.
396, 59–70.
35. Smedley, D., Haider, S., Ballester, B., Holland,
R., London, D., Thorisson, G., Kasprzyk, A.
(2009) BioMart – biological queries made
easy. BMC Genomics 10, 22.
36. Westbrook, J., Ito, N., Nakamura, H.,
Henrick, K., Berman, H. M. (2005) PDBML:
the representation of archival macromolecular
structure data in XML. Bioinformatics 21,
988–992.
37. Cuff, A. L., Sillitoe, I., Lewis, T., Redfern, O.
C., Garratt, R., Thornton, J., Orengo, C. A.
(2009) The CATH classification revisited –
architectures reviewed and new ways to char-
acterize structural divergence in superfamilies.
Nucleic Acids Res. 37, D310–D314.
38. Fulton, K. F., Bate, M. A., Faux, N. G.,
Mahmood, K., Betts, C., Buckle, A. M.
(2007) Protein Folding Database (PFD 2.0):
an online environment for the International
Foldeomics Consortium. Nucleic Acids Res.
35, D304–D307.
39. Maxwell, K. L., Wildes, D., Zarrine-Afsar, A.,
De Los Rios, M. A., Brown, A. G., Friel, C.
T., Hedberg, L., Horng, J. C., Bona, D.,
Miller, E. J., Vallée-Bélisle, A., Main, E. R.,
Bemporad, F., Qiu, L., Teilum, K., Vu, N. D.,
Edwards, A. M., Ruczinski, I., Poulsen, F. M.,
Kragelund, B. B., Michnick, S. W., Chiti, F.,
Bai, Y., Hagen, S. J., Serrano, L., Oliveberg,
M., Raleigh, D. P., Wittung-Stafshede, P.,
Radford, S. E., Jackson, S. E., Sosnick, T. R.,
Marqusee, S., Davidson, A. R., Plaxco, K. W.
(2005) Protein folding: defining a “standard”
set of experimental conditions and a prelimi-
nary kinetic data set of two-state proteins.
Protein Sci. 14, 602–616.
40. Zanzoni, A., Ausiello, G., Via, A., Gherardini,
P. F., Helmer-Citterich, M. (2007) Phospho3D:
a database of three-dimensional structures of
protein phosphorylation sites. Nucleic Acids
Res. 35, D229–D231.
41. Diella, F., Cameron, S., Gemünd, C., Linding,
R., Via, A., Kuster, B., Sicheritz-Pontén, T.,
Blom, N., Gibson, T. J. (2004)
Phospho.ELM: a database of experimentally verified
phosphorylation sites in eukaryotic proteins.
BMC Bioinformatics 5, 79.
42. Aranda, B., Achuthan, P., Alam-Faruque, Y.,
Armean, I., Bridge, A., Derow, C., Feuermann,
M., Ghanbarian, A. T., Kerrien, S., Khadake,
J., Kerssemakers, J., Leroy, C., Menden, M.,
Michaut, M., Montecchi-Palazzi, L.,
Neuhauser, S. N., Orchard, S., Perreau, V.,
Roechert, B., van Eijk, K., Hermjakob, H.
(2010) The IntAct molecular interaction
database in 2010. Nucleic Acids Res. 38,
D525–D531.
43. Orchard, S., Kerrien, S., Jones, P., Ceol, A.,
Chatr-Aryamontri, A., Salwinski, L., Nerothin,
J., Hermjakob, H. (2007) Submit your inter-
action data the IMEx way: a step by step guide
to trouble-free deposition. Proteomics 7 Suppl
1, 28–34.
44. Orchard, S., Salwinski, L., Kerrien, S.,
Montecchi-Palazzi, L., Oesterheld, M.,
Stümpflen, V., Ceol, A., Chatr-aryamontri,
A., Armstrong, J., Woollard, P., Salama, J. J.,
Moore, S., Wojcik, J., Bader, G. D., Vidal, M.,
Cusick, M. E., Gerstein, M., Gavin, A. C.,
Superti-Furga, G., Greenblatt, J., Bader, J.,
Uetz, P., Tyers, M., Legrain, P., Fields, S.,
Mulder, N., Gilson, M., Niepmann, M.,
Burgoon, L., De Las Rivas, J., Prieto, C.,
Perreau, V. M., Hogue, C., Mewes, H. W.,
Apweiler, R., Xenarios, I., Eisenberg, D.,
Cesareni, G., Hermjakob, H. (2007) The
minimum information required for reporting
a molecular interaction experiment (MIMIx).
Nat. Biotechnol. 25, 894–898.
45. Kerrien, S., Orchard, S., Montecchi-Palazzi,
L., Aranda, B., Quinn, A. F., Vinod, N.,
Bader, G. D., Xenarios, I., Wojcik, J., Sherman,
D., Tyers, M., Salama, J. J., Moore, S., Ceol,
A., Chatr-Aryamontri, A., Oesterheld, M.,
Stümpflen, V., Salwinski, L., Nerothin, J.,
Cerami, E., Cusick, M. E., Vidal, M., Gilson,
M., Armstrong, J., Woollard, P., Hogue, C.,
Eisenberg, D., Cesareni, G., Apweiler, R.,
Hermjakob, H. (2007) Broadening the hori-
zon – level 2.5 of the HUPO-PSI format for
molecular interactions. BMC Biol. 5, 44.
46. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D.,
Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K.,
Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P.,
Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese,
J. C., Richardson, J. E., Ringwald, M., Rubin, G. M.,
Sherlock, G. (2000) Gene ontology: tool for
the unification of biology. The Gene Ontology
Consortium. Nat. Genet. 25, 25–29.
47. Matthews, L., Gopinath, G., Gillespie, M.,
Caudy, M., Croft, D., de Bono, B., Garapati, P.,
Hemish, J., Hermjakob, H., Jassal, B., Kanapin,
A., Lewis, S., Mahajan, S., May, B., Schmidt,
E., Vastrik, I., Wu, G., Birney, E., Stein, L.,
D’Eustachio, P. (2009) Reactome knowledge-
base of human biological pathways and pro-
cesses. Nucleic Acids Res. 37, D619–D622.
48. Hucka, M., Finney, A., Sauro, H. M., Bolouri,
H., Doyle, J. C., Kitano, H., Arkin, A. P.,
Bornstein, B. J., Bray, D., Cornish-Bowden,
A., Cuellar, A. A., Dronov, S., Gilles, E. D.,
Ginkel, M., Gor, V., Goryanin, I. I., Hedley, W.
J., Hodgman, T. C., Hofmeyr, J. H., Hunter,
P. J., Juty, N. S., Kasberger, J. L., Kremling,
A., Kummer, U., Le Novère, N., Loew, L. M.,
Lucio, D., Mendes, P., Minch, E., Mjolsness,
E. D., Nakayama, Y., Nelson, M. R., Nielsen,
P. F., Sakurada, T., Schaff, J. C., Shapiro, B.
E., Shimizu, T. S., Spence, H. D., Stelling, J.,
Takahashi, K., Tomita, M., Wagner, J., Wang,
J., SBML Forum. (2003) The systems biology
markup language (SBML): a medium for rep-
resentation and exchange of biochemical net-
work models. Bioinformatics 19, 524–531.
49. Noy, N. F., Crubezy, M., Fergerson, R. W.,
Knublauch, H., Tu, S. W., Vendetti, J.,
Musen, M. A. (2003) Protégé-2000: an
open-source ontology-development and
knowledge-acquisition environment. AMIA.
Annu Symp Proc. 953.
50. Cline, M. S., Smoot, M., Cerami, E.,
Kuchinsky, A., Landys, N., Workman, C.,
Christmas, R., Avila-Campilo, I., Creech,
M., Gross, B., Hanspers, K., Isserlin, R.,
Kelley, R., Killcoyne, S., Lotia, S., Maere, S.,
Morris, J., Ono, K., Pavlovic, V., Pico, A. R.,
Vailaya, A., Wang, P. L., Adler, A., Conklin,
B. R., Hood, L., Kuiper, M., Sander, C.,
Schmulevich, I., Schwikowski, B., Warner,
G. J., Ideker, T., Bader, G. D. (2007)
Integration of biological networks and gene
expression data using Cytoscape. Nat. Protoc.
2, 2366–2382.
51. Caspi, R., Foerster, H., Fulcher, C. A., Kaipa,
P., Krummenacker, M., Latendresse, M.,
Paley, S., Rhee, S. Y., Shearer, A., Tissier, C.,
Walk, T. C., Zhang, P., Karp, P. D. (2008)
The MetaCyc Database of metabolic pathways
and enzymes and the BioCyc collection of
Pathway/Genome Databases. Nucleic Acids
Res. 36, D623–D631.
52. Hoogland, C., Mostaguir, K., Appel, R. D.,
Lisacek, F. (2008) The World-2DPAGE
Constellation to promote and publish gel-based
proteomics data through the ExPASy server.
J. Proteomics 71, 245–248.
53. Mostaguir, K., Hoogland, C., Binz, P. A.,
Appel, R. D. (2003) The Make 2D-DB II
package: conversion of federated two-dimensional
gel electrophoresis databases into a relational
format and interconnection of distributed
databases. Proteomics 3, 1441–1444.
54. Vizcaíno, J. A., Côté, R., Reisinger, F.,
Barsnes, H., Foster, J. M., Rameseder, J.,
Hermjakob, H., Martens, L. (2009) The pro-
teomics identifications database: 2010 update.
Nucleic Acids Res. 38, D736–D742.
55. Côté, R. G., Jones, P., Martens, L., Kerrien,
S., Reisinger, F., Lin, Q., Leinonen, R.,
Apweiler, R., Hermjakob, H. (2007) The
protein identifier cross-referencing (PICR)
service: reconciling protein identifiers across
multiple source databases. BMC Bioinformatics
8, 401–414.
56. Burgun, A., Bodenreider, O. (2008) Accessing
and integrating data and knowledge for bio-
medical research. Yearb Med Inform.
91–101.
57. Hwang, D., Rust, A. G., Ramsey, S., Smith, J.
J., Leslie, D. M., Weston, A. D., de Atauri, P.,
Aitchison, J. D., Hood, L., Siegel, A. F.,
Bolouri, H. (2005) A data integration meth-
odology for systems biology. Proc. Natl Acad.
Sci. U. S. A. 102, 17296–17301.
58. Mathew, J. P., Taylor, B. S., Bader, G. D.,
Pyarajan, S., Antoniotti, M., Chinnaiyan, A.
M., Sander, C., Burakoff, S. J., Mishra, B.
(2007) From bytes to bedside: data integra-
tion and computational biology for transla-
tional cancer research. PLoS Comput. Biol.
3, e12.
59. McGarvey, P. B., Huang, H., Mazumder, R.,
Zhang, J., Chen, Y., Zhang, C., Cammer, S.,
Will, R., Odle, M., Sobral, B., Moore, M.,
Wu, C. H. (2009) Systems integration of bio-
defense omics data for analysis of pathogen–
host interactions and identification of potential
targets. PLoS One 4, e7162.
60. Stevens, R., Zhao, J., Goble, C. (2007) Using
provenance to manage knowledge of in silico
experiments. Brief. Bioinform. 8, 183–194.
61. Pedrioli, P. G., Eng, J. K., Hubley, R.,
Vogelzang, M., Deutsch, E. W., Raught, B.,
Pratt, B., Nilsson, E., Angeletti, R. H.,
Apweiler, R., Cheung, K., Costello, C. E.,
Hermjakob, H., Huang, S., Julian, R. K.,
Kapp, E., McComb, M. E., Oliver, S. G.,
Omenn, G., Paton, N. W., Simpson, R.,
Smith, R., Taylor, C. F., Zhu, W., Aebersold,
R. (2004) A common open representation
of mass spectrometry data and its applica-
tion to proteomics research. Nat. Biotechnol.
22, 1459–1466.
62. Orchard, S., Montechi-Palazzi, L., Deutsch,
E. W., Binz, P. A., Jones, A. R., Paton, N.,
Pizarro, A., Creasy, D. M., Wojcik, J.,
Hermjakob, H. (2007) Five years of prog-
ress in the standardization of proteomics
data 4(th) annual spring workshop of the
HUPO-proteomics standards initiative April
23–25, 2007 Ecole Nationale Supérieure
(ENS), Lyon, France. Proteomics 7,
3436–3440.
63. Taylor, C. F., Paton, N. W., Lilley, K. S.,
Binz, P. A., Julian, R. K. Jr, Jones, A. R.,
Zhu, W., Apweiler, R., Aebersold, R.,
Deutsch, E. W., Dunn, M. J., Heck, A. J.,
Leitner, A., Macht, M., Mann, M., Martens,
L., Neubert, T. A., Patterson, S. D., Ping,
P., Seymour, S. L., Souda, P., Tsugita, A.,
Vandekerckhove, J., Vondriska, T. M.,
Whitelegge, J. P., Wilkins, M. R., Xenarios,
I., Yates, J. R. 3rd, Hermjakob, H. (2007)
The minimum information about a proteom-
ics experiment (MIAPE). Nat. Biotechnol. 25,
887–893.
64. Tatusov, R. L., Fedorova, N. D., Jackson, J.
D., Jacobs, A. R., Kiryutin, B., Koonin, E. V.,
Krylov, D. M., Mazumder, R., Mekhedov, S.
L., Nikolskaya, A. N., Rao, B. S., Smirnov, S.,
Sverdlov, A. V., Vasudevan, S., Wolf, Y. I., Yin,
J. J., Natale, D. A. (2003) The COG data-
base: an updated version includes eukaryotes.
BMC Bioinformatics 4, 41–54.
65. Kaplan, N., Sasson, O., Inbar, U., Friedlich, M.,
Fromer, M., Fleischer, H., Portugaly, E., Linial,
N., Linial, M. (2005) ProtoNet 4.0: a hierarchical
classification of one million protein sequences.
Nucleic Acids Res. 33, D216–D218.
66. Marchler-Bauer, A., Anderson, J. B., Chitsaz,
F., Derbyshire, M. K., DeWeese-Scott, C.,
Fong, J. H., Geer, L. Y., Geer, R. C., Gonzales,
N. R., Gwadz, M., He, S., Hurwitz, D. I.,
Jackson, J. D., Ke, Z., Lanczycki, C. J.,
Liebert, C. A., Liu, C., Lu, F., Lu, S.,
Marchler, G. H., Mullokandov, M., Song, J.
S., Tasneem, A., Thanki, N., Yamashita, R. A.,
Zhang, D., Zhang, N., Bryant, S. H. (2009)
CDD: specific functional annotation with the
conserved domain database. Nucleic Acids
Res. 37, D205–D210.
67. Wang, Y., Addess, K. J., Chen, J., Geer, L.
Y., He, J., He, S., Lu, S., Madej, T.,
Marchler-Bauer, A., Thiessen, P. A., Zhang,
N., Bryant, S. H. (2007) MMDB: annotat-
ing protein sequences with Entrez’s
3D-structure database. Nucleic Acids Res.
35, D298–D300.
68. Pieper, U., Eswar, N., Webb, B. M., Eramian,
D., Kelly, L., Barkan, D. T., Carter, H.,
Mankoo, P., Karchin, R., Marti-Renom, M.
A., Davis, F. P., Sali, A. (2009) MODBASE, a
database of annotated comparative protein
structure models and associated resources.
Nucleic Acids Res. 37, D347–D354.
69. Kiefer, F., Arnold, K., Künzli, M., Bordoli, L.,
Schwede, T. (2009) The SWISS-MODEL
repository and associated resources. Nucleic
Acids Res. 37, D387–D392.
70. Bogatyreva, N. S., Osypov, A. A., Ivankov, D.
N. (2009) KineticDB: a database of protein
folding kinetics. Nucleic Acids Res. 37,
D342–D346.
71. Garavelli, J. S. (2004) The RESID database of
protein modifications as a resource and anno-
tation tool. Proteomics 4, 1527–1533.
72. Salwinski, L., Miller, C. S., Smith, A. J., Pettit,
F. K., Bowie, J. U., Eisenberg, D. (2004) The
database of interacting proteins: 2004 update.
Nucleic Acids Res. 32, D449–D451.
73. Breitkreutz, B. J., Stark, C., Reguly, T.,
Boucher, L., Breitkreutz, A., Livstone, M.,
Oughtred, R., Lackner, D. H., Bähler, J.,
Wood, V., Dolinski, K., Tyers, M. (2008) The
BioGRID interaction database: 2008 update.
Nucleic Acids Res. 36, D637–D640.
74. Kanehisa, M., Goto, S. (2000) KEGG: Kyoto
encyclopedia of genes and genomes. Nucleic
Acids Res. 28, 27–30.
75. Tarcea, V. G., Weymouth, T., Ade, A.,
Bookvich, A., Gao, J., Mahavisno, V.,
Wright, Z., Chapman, A., Jayapandian, M.,
Ozgür, A., Tian, Y., Cavalcoli, J., Mirel, B.,
Patel, J., Radev, D., Athey, B., States, D.,
Jagadish, H. V. (2009) Michigan molecular
interactions r2: from interacting proteins
to pathways. Nucleic Acids Res. 37,
D642–D646.
76. Craig, R., Cortens, J. C., Fenyo, D., Beavis,
R. C. (2006) Using annotated peptide mass
spectrum libraries for protein identification.
J. Proteome Res. 5, 1843–1849.
77. Deutsch, E. W., Lam, H., Aebersold, R.
(2008) PeptideAtlas: a resource for target
selection for emerging targeted proteomics
workflows. EMBO Rep. 9, 429–434.
78. Slotta, D. J., Barrett, T., Edgar, R. (2009)
NCBI peptidome: a new public repository for
mass spectrometry peptide identifications.
Nat. Biotechnol. 27, 600–601.
O’Donovan and Apweiler
The mission of the Universal Protein Resource (UniProt) is to
provide the scientific community with a comprehensive, high-quality
and freely accessible resource of protein sequence and functional
information, which is essential for modern biological research.
UniProt is produced by the UniProt Consortium, which consists
of groups from the European Bioinformatics Institute (EBI), the
Protein Information Resource (PIR), and the Swiss Institute of
Bioinformatics (SIB). Its activities are mainly supported by the
National Institutes of Health (NIH) with additional funding from
the European Commission and the Swiss Federal Government.
UniProt has five components optimized for different uses. The
UniProt Knowledgebase (UniProtKB) (1) is an expertly curated
database, a central access point for integrated protein information
with cross-references to multiple sources. The UniProt Archive
(UniParc) (2) is a comprehensive sequence repository, reflecting
the history of all protein sequences. UniProt Reference Clusters
(UniRef) (3) merge closely related sequences based on sequence
identity to speed up searches. The UniProt Metagenomic and
Environmental Sequences database (UniMES) was created to
respond to the expanding area of metagenomic data. The UniProtKB
Sequence/Annotation Version Archive (UniSave) is the UniProtKB
protein entry archive, which contains all versions of each protein
entry (Fig. 1).
2. Materials
Fig. 1. UniProt databases.
A Guide to UniProt for Protein Scientists
2.1. The UniProt Archive
UniParc is the main sequence storehouse: a comprehensive
repository that reflects the history of all protein sequences.
UniParc contains all new and revised protein sequences from all
publicly available sources (http://www.uniprot.org/help/uniparc)
to ensure that complete coverage is available at a single site. To
avoid redundancy, all sequences 100% identical over the entire
length are merged, regardless of source organism. New and
updated sequences are loaded on a daily basis, cross-referenced to
the source database accession number, and provided with a
sequence version that increments on changes to the underlying
sequence. The basic information stored within each UniParc
entry is the identifier, the sequence, the cyclic redundancy check
number, the source database(s) with accession and version numbers,
and a timestamp. If a UniParc entry lacks a cross-reference to a
UniProtKB entry, the reason for its exclusion from UniProtKB is
provided (e.g., pseudogene). In addition, each source database
accession number is tagged with its status in that database,
indicating whether the sequence still exists or has been deleted in
the source database, and is cross-referenced to NCBI GI and TaxId
if appropriate.
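The merging rule described for UniParc can be illustrated with a minimal Python sketch. It is not UniParc's implementation: the class names, the "ARC" identifiers, the upper-casing of input, the made-up accessions, and the use of zlib.crc32 as the checksum are all assumptions made for illustration (the text says only that a cyclic redundancy check number is stored). The sketch shows identical sequences from different sources sharing one entry, and a per-accession sequence version that increments when the underlying sequence changes.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """Hypothetical archive record: one entry per distinct sequence."""
    identifier: str
    sequence: str
    checksum: int
    # (source database, accession) -> sequence version in that source
    cross_refs: dict = field(default_factory=dict)


class SequenceArchive:
    """Toy archive that merges sequences identical over their full length."""

    def __init__(self):
        self._by_sequence = {}  # sequence -> ArchiveEntry
        self._latest = {}       # (database, accession) -> (version, sequence)

    def load(self, database, accession, sequence):
        sequence = sequence.upper()  # assumption: case is not significant
        entry = self._by_sequence.get(sequence)
        if entry is None:
            # First time this exact sequence is seen: create a new entry.
            entry = ArchiveEntry(
                identifier=f"ARC{len(self._by_sequence) + 1:06d}",
                sequence=sequence,
                checksum=zlib.crc32(sequence.encode("ascii")),  # stand-in CRC
            )
            self._by_sequence[sequence] = entry
        key = (database, accession)
        version, previous = self._latest.get(key, (0, None))
        if previous != sequence:
            # The sequence behind this accession is new or has changed.
            version += 1
            self._latest[key] = (version, sequence)
        entry.cross_refs[key] = version
        return entry


# Made-up accessions and sequences, for illustration only.
archive = SequenceArchive()
first = archive.load("EMBL", "X00001", "MKTAYIAK")
second = archive.load("RefSeq", "NP_000001", "MKTAYIAK")  # same sequence
revised = archive.load("EMBL", "X00001", "MKTAYIAKQR")    # sequence changed
print(first is second)                         # True
print(revised.cross_refs[("EMBL", "X00001")])  # 2
```

The old entry keeps its version 1 cross-reference after the revision, which mirrors the idea that the archive reflects the history of a sequence rather than only its current state.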
2.2. The UniProt Knowledgebase
UniProtKB consists of two sections, UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL. The former contains manually annotated
records with information extracted from literature and
curator-evaluated computational analysis. Annotation is done by
biologists with specific expertise to achieve accuracy. In
UniProtKB/Swiss-Prot, annotation consists of the description of the
following: function(s), enzyme-specific information, biologically
relevant domains and sites, posttranslational modifications,
subcellular location(s), tissue specificity, developmental specific
expression, structure, interactions, splice isoform(s), and
associated diseases, deficiencies, or abnormalities.
The UniProt Knowledgebase aims to describe, in a single record, all
protein products derived from a certain gene from a certain species.
After an inspection of the sequences, the curator selects the
reference sequence, does the corresponding merging, and lists the
splice and genetic variants along with disease information when
available. As a result, the whole record has an accession number,
and each protein form derived by alternative splicing, proteolytic
cleavage, or posttranslational modification also has a unique
identifier. The freely available tool VARSPLIC (4) enables the
recreation of all annotated splice variants from the feature table
of a UniProt Knowledgebase entry, or for the complete database.
A FASTA-formatted file containing all splice variants annotated in
the UniProt Knowledgebase can be downloaded for use with
similarity search programs.
UniProtKB/TrEMBL contains high quality computationally
analyzed records enriched with automatic annotation and
classification. The computer-assisted annotation is created using
both automatically generated rules and manually curated rules
(UniRule) based on protein families (5–8). UniProtKB/TrEMBL
contains the translations of all coding sequences (CDS) present in
the EMBL/GenBank/DDBJ Nucleotide Sequence Databases
and, with some defined exclusions, Arabidopsis thaliana sequences
from The Arabidopsis Information Resource (TAIR) (9), yeast
sequences from the Saccharomyces Genome Database (SGD)
(10) and Homo sapiens sequences from the Ensembl database
(11). It will soon be extended to include other Ensembl organism
sets and RefSeq records. Records are selected for full manual
annotation and integration into UniProtKB/Swiss-Prot according
to defined annotation priorities.
Integration between the three types of sequence-related databases
(nucleic acid sequences, protein sequences, and protein
tertiary structures) as well as with specialized data collections is
important for the UniProt users. UniProtKB is currently
cross-referenced with more than ten million links to 114 different
databases with regular update cycles. This extensive network of
cross-references allows UniProt to act as a focal point of
biomolecular database interconnectivity. All cross-referenced
databases are documented at http://www.uniprot.org/docs/dbxref and,
if appropriate, are included in the UniProt ID mapping tool at
http://www.uniprot.org/help/mapping with the file for download at
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping.
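As a rough illustration of how a downloaded mapping file might be used, the Python sketch below groups external identifiers by UniProtKB accession. The chapter does not describe the file layout, so the three tab-separated columns (accession, identifier type, external identifier) are an assumption to be checked against the README distributed with the file. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_id_mapping(handle, wanted_types=None):
    """Group external identifiers by accession.

    Assumes tab-separated rows of (accession, id_type, external_id).
    If wanted_types is given, only those identifier types are kept.
    """
    mapping = defaultdict(lambda: defaultdict(list))
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        accession, id_type, external_id = row
        if wanted_types is None or id_type in wanted_types:
            mapping[accession][id_type].append(external_id)
    return mapping


# Made-up rows for illustration only; not real database content.
sample = io.StringIO(
    "ACC001\tGeneID\t1111\n"
    "ACC001\tPDB\t1ABC\n"
    "ACC001\tPDB\t2XYZ\n"
    "ACC002\tGeneID\t2222\n"
)
links = load_id_mapping(sample, wanted_types={"PDB", "GeneID"})
print(links["ACC001"]["PDB"])  # ['1ABC', '2XYZ']
```

For a real download, the same function could be given an open file handle instead of the in-memory sample. Reading row by row keeps memory use proportional to the identifier types actually requested.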
2.3. The UniProt Reference Clusters
UniRef provides clustered sets of all sequences from the UniProt
Knowledgebase (including splice forms as separate entries) and
selected records from the UniProt Archive to achieve complete
coverage of sequence space at identity levels of 100, 90, and 50%
while hiding redundant sequences (3). The UniRef clusters are
generated in a hierarchical manner: the UniRef100 database
combines identical sequences and sub-fragments into a single UniRef
entry, UniRef90 is built from UniRef100 clusters, and UniRef50
is built from UniRef90 clusters. Each individual member sequence
can exist in only one UniRef cluster at each identity level and has
only one parent or child cluster at another identity level.
UniRef100, UniRef90, and UniRef50 yield a database size
reduction of ~10, 40, and 70%, respectively. Each cluster record
contains source database, protein name, and taxonomy information
on each member sequence but is represented by a single selected
representative protein sequence and name; the number of members
and the lowest common taxonomy node for the membership are
also included. The representative protein sequence, or cluster
representative, is automatically selected using an algorithm that
accounts for (1) quality of entry annotation: the order of preference
is a member from UniProtKB/Swiss-Prot, UniProtKB/TrEMBL,
then UniParc; (2) meaningful name: members with protein names
that do not contain words such as “hypothetical” or “probable”
are preferred; (3) organism: members from model organisms are
preferred; (4) sequence length: the longest sequence is preferred.
UniRef100 is one of the most comprehensive nonredundant
protein sequence datasets available. The reduced size of the
UniRef90 and UniRef50 datasets provides faster sequence
similarity searches and reduces the research bias in similarity
searches by providing a more even sampling of sequence space.
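The four selection criteria for a cluster representative can be sketched as a sort key in Python. This is an illustration under stated assumptions, not the UniRef algorithm itself: the text says only that the algorithm "accounts for" these criteria, so applying them strictly in the listed order is an assumption, as are the names Member, preference_key, and pick_representative, the short list of model organisms, and the made-up cluster members.

```python
from dataclasses import dataclass

# Order of preference for the source of a member, best first.
SOURCE_RANK = {"Swiss-Prot": 0, "TrEMBL": 1, "UniParc": 2}
# Words that mark a protein name as uninformative.
VAGUE_WORDS = ("hypothetical", "probable")
# Illustrative list only; the real set of model organisms is not given.
MODEL_ORGANISMS = {
    "Homo sapiens",
    "Saccharomyces cerevisiae",
    "Arabidopsis thaliana",
}


@dataclass(frozen=True)
class Member:
    accession: str
    source: str
    name: str
    organism: str
    sequence: str


def preference_key(member):
    """Smaller tuples are preferred; criteria are compared left to right."""
    name = member.name.lower()
    return (
        SOURCE_RANK[member.source],                  # (1) annotation quality
        any(word in name for word in VAGUE_WORDS),   # (2) meaningful name
        member.organism not in MODEL_ORGANISMS,      # (3) model organism
        -len(member.sequence),                       # (4) longest sequence
    )


def pick_representative(members):
    return min(members, key=preference_key)


# Made-up cluster members, for illustration only.
cluster = [
    Member("M1", "TrEMBL", "Hypothetical protein", "Homo sapiens", "M" * 300),
    Member("M2", "TrEMBL", "Kinase A", "Escherichia coli", "M" * 120),
    Member("M3", "TrEMBL", "Kinase A", "Homo sapiens", "M" * 100),
    Member("M4", "UniParc", "Kinase A", "Homo sapiens", "M" * 500),
]
print(pick_representative(cluster).accession)  # M3
```

M3 wins here even though it is the shortest sequence, because length is consulted only after source, name, and organism have failed to separate the candidates.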
2.4. The UniProt Metagenomic and Environmental Sequences
The UniProt Knowledgebase contains entries with a known
taxonomic source. However, the expanding area of metagenomic data
has necessitated the creation of a separate database, the UniProt
Metagenomic and Environmental Sequences database (UniMES).
UniMES currently contains data from the Global Ocean Sampling
Expedition (GOS), which predicts nearly six million proteins,
primarily from oceanic microbes. By combining the predicted
protein sequences with automatic classification by InterPro, the
integrated resource for protein families, domains and functional
sites, UniMES uniquely provides free access to the array of
genomic information gathered.
2.5. The UniProtKB Sequence/Annotation Version Archive
UniSave is a repository of UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL entry versions. It provides the backend to the
UniProtKB entry history service (Fig. 2) and is also available as a
standalone service at http://www.ebi.ac.uk/uniprot/unisave.
Fig. 2. UniSave link.
These descriptions of our databases should illustrate that
UniProt provides a high quality annotated nonredundant
database with maximal coverage and sequence archiving.
3. Methods
This section describes particular features of the UniProt activities
that fulfill the proteomics community's requirements for detailed
information on protein function, biological processes, molecular
interactions and pathways, cross-referenced to appropriate external
sources, with stable identifiers, consistent nomenclature and
controlled vocabularies.
3.1. Protein Annotation
UniProtKB consists of two sections, Swiss-Prot and TrEMBL.
UniProtKB/Swiss-Prot contains manually annotated records
with information extracted from literature and curator-evaluated
computational analysis. Manual annotation consists of a critical
review of experimentally proven or computer-predicted data
about each protein. An essential aspect of the annotation protocol
is the use of official nomenclatures and controlled vocabularies
that facilitate consistent and accurate identification (Fig. 3).
Annotation consists of the description of the following:
function(s), enzyme-specific information, biologically relevant
domains and sites, posttranslational modifications, subcellular
location(s), tissue specificity, developmental specific expression,
structure, interactions, splice isoform(s), and associated diseases,
deficiencies, or abnormalities (Fig. 4).
Another important part of the annotation process involves
merging of different reports for a single protein. After an
inspection of the sequences, the curator selects the reference
sequence, does the corresponding merging, and lists the splice and
genetic variants along with disease information when available
(Fig. 5). Data are continuously updated by an expert team of
biologists.
Fig. 3. UniProt nomenclature.
3.2. The Gene Ontology Consortium and UniProt
To promote database interoperability and provide consistent
annotation, the UniProt Consortium is a key member of the
Gene Ontology Consortium (12) and benefits from the presence
of the GO editorial office at the EBI. UniProt curators will
continue to assign Gene Ontology (GO) terms to the gene products
in UniProtKB during the UniProt manual curation process.
UniProtKB also profits from GO annotation carried out by other
GO Consortium members. Currently we include manual GO
annotations from 19 GO Consortium annotation groups, and we
further supplement this with high-quality annotations from other
manual annotation sources (including the Human Protein Atlas
and LIFEdb). In addition to this manually curated GO
annotation, automatic GO annotation pipelines exist and will be
further developed to ensure that the functional knowledge supplied
by various UniProtKB ontologies, Ensembl orthology data, and
InterPro matches is fully exploited to provide a high-quality,
comprehensive set of GO annotation predictions for all UniProtKB
entries.
3.3. Cross-references to External Sources
One challenge in life sciences research is the ability to integrate
and exchange data coming from multiple research groups. The
UniProt Consortium is committed to fostering interaction and
exchange with the scientific community, ensuring wide access to
UniProt resources, and promoting interoperability between
resources. An essential component of this interoperability is the
provision of cross-references to these resources in UniProt entries
(Fig. 6).
Fig. 4. Protein annotation.
3.4. Nonredundant Complete UniProt Proteome Sets
UniProt constructs complete nonredundant proteome sets. Each
set and its analysis are made available shortly after the appearance
of a new complete genome sequence in the nucleotide sequence
databases. A standard procedure is used to create, from the
UniProtKB, proteome sets for bacterial, archaeal and some
eukaryotic genomes. Proteome sets for certain metazoan genomes are
Fig. 5. Sequence annotation.
51. *** END OF THE PROJECT GUTENBERG EBOOK HARPER'S ROUND
TABLE, MARCH 23, 1897 ***
Updated editions will replace the previous one—the old editions
will be renamed.
Creating the works from print editions not protected by U.S.
copyright law means that no one owns a United States
copyright in these works, so the Foundation (and you!) can copy
and distribute it in the United States without permission and
without paying copyright royalties. Special rules, set forth in the
General Terms of Use part of this license, apply to copying and
distributing Project Gutenberg™ electronic works to protect the
PROJECT GUTENBERG™ concept and trademark. Project
Gutenberg is a registered trademark, and may not be used if
you charge for an eBook, except by following the terms of the
trademark license, including paying royalties for use of the
Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such
as creation of derivative works, reports, performances and
research. Project Gutenberg eBooks may be modified and
printed and given away—you may do practically ANYTHING in
the United States with eBooks not protected by U.S. copyright
law. Redistribution is subject to the trademark license, especially
commercial redistribution.
START: FULL LICENSE
53. PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK
To protect the Project Gutenberg™ mission of promoting the
free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.
Section 1. General Terms of Use and
Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree
to abide by all the terms of this agreement, you must cease
using and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.
1.B. “Project Gutenberg” is a registered trademark. It may only
be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project Gutenberg™
works in compliance with the terms of this agreement for
keeping the Project Gutenberg™ name associated with the
work. You can easily comply with the terms of this agreement
by keeping this work in the same format with its attached full
Project Gutenberg™ License when you share it without charge
with others.
1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.
1.E. Unless you have removed all references to Project
Gutenberg:
1.E.1. The following sentence, with active links to, or other
immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:
This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and
with almost no restrictions whatsoever. You may copy it,
give it away or re-use it under the terms of the Project
Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country
where you are located before using this eBook.
1.E.2. If an individual Project Gutenberg™ electronic work is
derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of
the copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.
1.E.3. If an individual Project Gutenberg™ electronic work is
posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.
1.E.4. Do not unlink or detach or remove the full Project
Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.
1.E.5. Do not copy, display, perform, distribute or redistribute
this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.
1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must,
at no additional cost, fee or expense to the user, provide a copy,
a means of exporting a copy, or a means of obtaining a copy
upon request, of the work in its original “Plain Vanilla ASCII” or
other form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.
1.E.7. Do not charge a fee for access to, viewing, displaying,
performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.
1.E.8. You may charge a reasonable fee for copies of or
providing access to or distributing Project Gutenberg™
electronic works provided that:
• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”
• You provide a full refund of any money paid by a user who
notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.
• You provide, in accordance with paragraph 1.F.3, a full refund of
any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.
• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.
1.E.9. If you wish to charge a fee or distribute a Project
Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.
1.F.
1.F.1. Project Gutenberg volunteers and employees expend
considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite these
efforts, Project Gutenberg™ electronic works, and the medium
on which they may be stored, may contain “Defects,” such as,
but not limited to, incomplete, inaccurate or corrupt data,
transcription errors, a copyright or other intellectual property
infringement, a defective or damaged disk or other medium, a
computer virus, or computer codes that damage or cannot be
read by your equipment.
1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except
for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU AGREE
THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT
LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT
EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE
THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY
DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE
TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL,
PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE
NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.
1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you
discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person
or entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.
1.F.4. Except for the limited right of replacement or refund set
forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.
1.F.5. Some states do not allow disclaimers of certain implied
warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.
1.F.6. INDEMNITY - You agree to indemnify and hold the
Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you
do or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.
Section 2. Information about the Mission
of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.
Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.
Section 3. Information about the Project
Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status
by the Internal Revenue Service. The Foundation’s EIN or
federal tax identification number is 64-6221541. Contributions
to the Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.
The Foundation’s business office is located at 809 North 1500
West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact
Section 4. Information about Donations to
the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.
The Foundation is committed to complying with the laws
regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or determine
the status of compliance for any particular state visit
www.gutenberg.org/donate.
While we cannot and do not solicit contributions from states
where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.
International donations are gratefully accepted, but we cannot
make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.
Please check the Project Gutenberg web pages for current
donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.
Section 5. General Information About
Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.
Project Gutenberg™ eBooks are often created from several
printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.