Bioinformatics Sequence And Genome Analysis 1st Edition David W Mount


CHAPTER 1
Historical Introduction and Overview
The first sequences to be collected were those of proteins
DNA sequence databases
Sequence retrieval from public databases
Sequence analysis programs
The dot matrix or diagram method for comparing sequences
Alignment of sequences by dynamic programming
Finding local alignments between sequences
Multiple sequence alignment
Prediction of RNA secondary structure
Discovery of evolutionary relationships using sequences
Importance of database searches for similar sequences
The FASTA and BLAST methods for database searches
Predicting the sequence of a protein by translation of DNA sequences
Predicting protein secondary structure
The first complete genome sequence
ACEDB, the first genome database
References

THE DEVELOPMENT OF SEQUENCE ANALYSIS METHODS has depended on the contributions of
many individuals from varied scientific backgrounds. This chapter provides a brief histor-
ical account of the more significant advances that have taken place, as well as an overview
of the chapters of this book. Because many contributors cannot be mentioned due to space
constraints, additional references to earlier and current reference books, articles, reviews,
and journals provide a broader view of the field and are included in the reference lists to
this chapter.
THE FIRST SEQUENCES TO BE COLLECTED WERE THOSE OF PROTEINS
The development of protein-sequencing methods (Sanger and Tuppy 1951) led to the
sequencing of representatives of several of the more common protein families such as
cytochromes from a variety of organisms. Margaret Dayhoff (1972, 1978) and her collabo-
rators at the National Biomedical Research Foundation (NBRF), Washington, DC, were the
first to assemble databases of these sequences into a protein sequence atlas in the 1960s, and
their collection center eventually became known as the Protein Information Resource (PIR,
formerly Protein Identification Resource; http://watson.gmu.edu:8080/pirwww/index.html).
The NBRF maintained the database from 1984, and in 1988, the PIR-International
Protein Sequence Database (http://www-nbrf.georgetown.edu/pir) was established as a
collaboration of NBRF, the Munich Center for Protein Sequences (MIPS), and the Japan
International Protein Information Database (JIPID).
Dayhoff and her coworkers organized the proteins into families and superfamilies based
on the degree of sequence similarity. Tables that reflected the frequency of changes observed
in the sequences of a group of closely related proteins were then derived. Proteins that were
less than 15% different were chosen to avoid the chance that the observed amino acid
changes reflected two sequential amino acid changes instead of only one. From aligned
sequences, a phylogenetic tree was derived showing graphically which sequences were most
related and therefore shared a common branch on the tree. Once these trees were made,
they were used to score the amino acid changes that occurred during evolution of the genes
for these proteins in the various organisms from which they originated (Fig. 1.1).
Figure 1.1. Method of predicting phylogenetic relationships and probable amino acid changes dur-
ing the evolution of related protein sequences. Shown are three highly conserved sequences (A, B, and
C) of the same protein from three different organisms. The sequences are so similar that each posi-
tion should only have changed once during evolution. The proteins differ by one or two substitu-
tions, allowing the construction of the tree shown. Once this tree is obtained, the indicated amino
acid changes can be determined. The particular changes shown are examples of two that occur much
more often than expected by a random replacement process.
[Figure 1.1 image: aligned sequences from organisms A, B, and C and the tree derived from them, with the amino acid changes W to Y and L to R indicated.]
Subsequently, a set of matrices (tables)—the percent amino acid mutations accepted by
evolutionary selection or PAM tables—which showed the probability that one amino acid
changed into any other in these trees was constructed, thus showing which amino acids are
most conserved at the corresponding position in two sequences. These tables are still used
to measure similarity between protein sequences and in database searches to find
sequences that match a query sequence. The rule used is that the more identical and con-
served amino acids that there are in two sequences, the more likely they are to have been
derived from a common ancestor gene during evolution. If the sequences are very much
alike, the proteins probably have the same biochemical function and three-dimensional
structural folds. Thus, Dayhoff and her colleagues contributed in several ways to modern
biological sequence analysis by providing the first protein sequence database as well as
PAM tables for performing protein sequence comparisons. Amino acid substitution tables
are routinely used in performing sequence alignments and database similarity searches,
and their use for this purpose is discussed in Chapters 3 and 7.
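The counting step behind such tables can be illustrated with a short Python sketch. It keeps only pairs of aligned sequences that are less than 15% different and tallies the substitutions seen between them. The three 12-residue sequences and the function names are hypothetical, chosen only to echo the W/Y and L/R changes mentioned for Figure 1.1. The actual PAM derivation worked from phylogenetic trees and inferred ancestral sequences, which this sketch does not attempt.

```python
from collections import Counter
from itertools import combinations


def percent_difference(a, b):
    """Percentage of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def tally_substitutions(sequences, max_percent_diff=15.0):
    """Count unordered substitution pairs among sequence pairs under the cutoff."""
    counts = Counter()
    for a, b in combinations(sequences, 2):
        if percent_difference(a, b) >= max_percent_diff:
            continue  # too divergent: a site may have changed more than once
        for x, y in zip(a, b):
            if x != y:
                counts[tuple(sorted((x, y)))] += 1
    return counts


# Hypothetical toy sequences (not the ones drawn in Figure 1.1).
toy = ["AWTVAAAVRTSL", "AYTVAAAVRTSL", "AWTVAAAVRTSR"]
substitution_counts = tally_substitutions(toy)
print(substitution_counts)
```

In this toy set the second and third sequences differ at 2 of 12 positions (about 17%), so that pair is skipped and only one W/Y and one L/R change are counted.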
DNA SEQUENCE DATABASES
DNA sequence databases were first assembled at Los Alamos National Laboratory (LANL),
New Mexico, by Walter Goad and colleagues in the GenBank database and at the European
Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. Translated DNA
sequences were also included in the Protein Information Resource (PIR) database at the
National Biomedical Research Foundation in Washington, DC. Goad had conceived of the
GenBank prototype in 1979; LANL collected GenBank data from 1982 to 1992. GenBank
is now under the auspices of the National Center for Biotechnology Information (NCBI)
(http://www.ncbi.nlm.nih.gov). The EMBL Data Library was founded in 1980
(http://www.ebi.ac.uk). In 1984 the DNA DataBank of Japan (DDBJ), Mishima, Japan,
came into existence (http://www.ddbj.nig.ac.jp). GenBank, EMBL, and DDBJ have now
formed the International Nucleotide Sequence Database Collaboration
(http://www.ncbi.nlm.nih.gov/collab), which acts to facilitate exchange of data on a daily basis. PIR has
made similar arrangements.
Initially, a sequence entry included a computer filename and DNA or protein sequence
files. These were eventually expanded to include much more information about the
sequence, such as function, mutations, encoded proteins, regulatory sites, and references.
This information was then placed along with the sequence into a database format that
could be readily searched for many types of information. There are many such databases
and formats, which are discussed in Chapter 2.
The number of entries in the nucleic acid sequence databases GenBank and EMBL has
continued to increase enormously from the daily updates. Annotating all of these new
sequences is a time-consuming, painstaking, and sometimes error-prone process. As time
passes, the process is becoming more automated, creating additional problems of accuracy
and reliability. In December 1997, there were 1.26 × 10^9 bases in GenBank; this number
increased to 2.57 × 10^9 bases as of April 1999, and 1.0 × 10^10 as of September 2000.
Despite the exponentially increasing numbers of sequences stored, the implementation of
efficient search methods has provided ready public access to these sequences.
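To give a feel for the growth figures quoted above, the following Python sketch estimates the doubling time implied by each pair of consecutive counts. It assumes steady exponential growth between data points, which is a simplification, and the function name is illustrative only.

```python
import math

# (year, month, bases in GenBank) as quoted in the text.
genbank_sizes = [(1997, 12, 1.26e9), (1999, 4, 2.57e9), (2000, 9, 1.0e10)]


def doubling_time_months(start, end):
    """Doubling time in months, assuming exponential growth between two points."""
    (y0, m0, n0), (y1, m1, n1) = start, end
    months = (y1 - y0) * 12 + (m1 - m0)
    return months * math.log(2) / math.log(n1 / n0)


first_interval = doubling_time_months(genbank_sizes[0], genbank_sizes[1])
second_interval = doubling_time_months(genbank_sizes[1], genbank_sizes[2])
print(round(first_interval, 1), round(second_interval, 1))
```

Under this assumption the doubling time is roughly 16 months for the first interval and under 9 months for the second, that is, the growth itself was accelerating.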
To decrease the number of matches to a database search, non-redundant databases that
list only a single representative of identical sequences have been prepared. However, many
sequence databases still include a large number of entries of the same gene or protein
sequences originating from sequence fragments, patents, replica entries from different
databases, and other such sequences.
Many types of sequence databases are described in the first annual issue of the journal
Nucleic Acids Research.
The growth of the number of sequences in GenBank can be tracked at
http://www.ncbi.nlm.nih.gov/GenBank/genebankstats.html.
SEQUENCE RETRIEVAL FROM PUBLIC DATABASES
An important step in providing sequence database access was the development of Web
pages that allow queries to be made of the major sequence databases (GenBank, EMBL,
etc.). An early example of this technology at NCBI was a menu-driven program called
GENINFO developed by D. Benson, D. Lipman, and colleagues. This program searched rapidly
through previously indexed sequence databases for entries that matched a biologist’s query.
Subsequently, a derivative program called ENTREZ (http://www.ncbi.nlm.nih.gov/Entrez)
with a simple window-based interface, and eventually a Web-based interface, was developed
at NCBI. The idea behind these programs was to provide an easy-to-use interface with a
flexible search procedure to the sequence databases.
Sequence entries in the major databases have additional information about the
sequence included with the sequence entry, such as accession or index number, name and
alternative names for the sequence, names of relevant genes, types of regulatory
sequences, the source organism, references, and known mutations. ENTREZ accesses this
information, thus allowing rapid searches of entire sequence databases for matches to one
or more specified search terms. These programs also can locate similar sequences (called
“neighbors” by ENTREZ) on the basis of previous similarity comparisons. When asked to
perform a search for one or more terms in a database, simple pattern search programs will
only find exact matches to a query. In contrast, ENTREZ searches for similar or related
terms, or complex searches composed of several choices, with great ease and lists the
found items in the order of likelihood that they matched the original query. ENTREZ
originally allowed straightforward access to databases of both DNA and protein sequences
and their supporting references, and even to an index of related entries or similar
sequences in separate or the same databases. More recently, ENTREZ has provided access
to all of Medline, the full bibliographic database of the National Library of Medicine
(NLM), Washington, DC. Access to a number of other databases, such as a phylogenetic
database of organisms and a protein structure database, is also provided. This access is
provided without cost to any user (private, government, industry, or research), a decision
by the staff of NCBI that has provided a stimulus to biomedical research that cannot be
overestimated. NCBI presently handles several million independent accesses to their
system each day.
A note of caution is in order. Database query programs such as ENTREZ greatly facili-
tate keeping up with the increasing number of sequences and biomedical journals.
However, as with any automated method, one should be wary that a requested database
search may not retrieve all of the relevant material, and important entries may be
missed. Bear in mind that each database entry has required manual editing at some
stage, giving rise to a low frequency of inescapable spelling errors and other problems.
On occasion, a particular reference that should be in the database is not found because
the search terms may be misspelled in the relevant database entry, the entry may not be
present in the database, or there may be some more complicated problem. If exhaustive
and careful attempts fail, reporting such problems to the program manager or system
administrator should correct the problem.

SEQUENCE ANALYSIS PROGRAMS
Because DNA sequencing involves ordering a set of peaks (A, G, C, or T) on a sequencing
gel, the process can be quite error-prone, depending on the quality of the data.
As more DNA sequences became available in the late 1970s, interest also increased in
developing computer programs to analyze these sequences in various ways. In 1982 and
1984, Nucleic Acids Research published two special issues devoted to the application of com-
puters for sequence analysis, including programs for large mainframe computers down to
the then-new microcomputers. Shortly after, the Genetics Computer Group (GCG) was
started at the University of Wisconsin by J. Devereux, offering a set of programs for analysis
that ran on a VAX computer. Eventually GCG became commercial (http://www.gcg.com/).
Other companies offering microcomputer programs for sequence analysis, including Intelli-
genetics, DNAStar, and others, also appeared at approximately the same time. Laboratories
also developed and shared computer programs on a no-cost or low-cost basis. For example,
to facilitate the collection of data, the programs PHRED (Ewing and Green 1998; Ewing et
al. 1998) and PHRAP were developed by Phil Green and colleagues at the University of
Washington to assist with reading and processing sequencing data. PHRED and PHRAP are
now distributed by CodonCode Corporation (http://www.codoncode.com).
These commercial and noncommercial programs are still widely used. In addition, Web
sites are available to perform many types of sequence analyses; they are free to academic
institutions or are available at moderate cost to commercial users. Following is a brief
review of the development of methods for sequence analysis.
THE DOT MATRIX OR DIAGRAM METHOD FOR COMPARING SEQUENCES
A.J. Gibbs and G.A. McIntyre (1970) described a new method for comparing two amino acid
or nucleotide sequences in which a graph was drawn with one sequence written across the
page and the other down the left-hand side. Whenever the same letter
appeared in both sequences, a dot was placed at the intersection of the corresponding
sequence positions on the graph (Fig. 1.2). The resulting graph was then scanned for a
series of dots that formed a diagonal, which revealed similarity, or a string of the same
characters, between the sequences. Long sequences can also be compared in this manner
on a single page by using smaller dots.
The dot matrix method quite readily reveals the presence of insertions or deletions
between sequences because they shift the diagonal horizontally or vertically by the amount
of change. Comparing a single sequence to itself can reveal the presence of a repeat of the
same sequence in the same (direct repeat) or reverse (inverted repeat or palindrome) ori-
entation. This method of self-comparison can reveal several features, such as similarity
between chromosomes, tandem genes, repeated domains in a protein sequence, regions of
low sequence complexity where the same characters are often repeated, or self-comple-
mentary sequences in RNA that can potentially base-pair to give a double-stranded struc-
ture. Because diagonals may not always be apparent on the graph due to weak similarity,
Gibbs and McIntyre counted all possible diagonals and these counts were compared to
those of random sequences to identify the most significant alignments.
Maizel and Lenk (1981) later developed various filtering and color display schemes that
greatly increased the usefulness of the dot matrix method. This dot matrix representation
of sequence comparisons continues to play an important role in analysis of DNA and pro-
tein sequence similarity, as well as repeats in genes and very long chromosomal sequences,
as described in Chapter 3 (p. 59).
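The dot matrix idea is simple enough to sketch in a few lines of Python. The example below uses the two DNA sequences of Figure 1.2, places a dot wherever the letters agree, and reports the longest unbroken diagonal run. The function names are illustrative, and no filtering or windowing of the kind later added by Maizel and Lenk is attempted.

```python
def dot_matrix(across, down):
    """Rows of booleans: True where a letter of `down` equals a letter of `across`."""
    return [[d == a for a in across] for d in down]


def longest_diagonal(across, down):
    """Longest run of consecutive dots along any diagonal, returned as a string."""
    best = ""
    run = [[0] * (len(across) + 1) for _ in range(len(down) + 1)]
    for i, d in enumerate(down, start=1):
        for j, a in enumerate(across, start=1):
            if d == a:
                run[i][j] = run[i - 1][j - 1] + 1  # extend the diagonal
                if run[i][j] > len(best):
                    best = across[j - run[i][j]:j]
    return best


def render(across, down):
    """Plain-text picture of the matrix, one sequence across and one down."""
    lines = ["  " + " ".join(across)]
    for d, row in zip(down, dot_matrix(across, down)):
        lines.append(d + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


across, down = "AGCTAGGA", "GACTAGGC"
print(render(across, down))
longest_run = longest_diagonal(across, down)
print(longest_run)
```

For these two sequences the longest diagonal is the run CTAGG, in agreement with the caption of Figure 1.2.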
Methods for DNA sequencing were developed in 1977 by Maxam and Gilbert (1977) and
Sanger et al. (1977). They are described in greater detail at the beginning of Chapter 2.

ALIGNMENT OF SEQUENCES BY DYNAMIC PROGRAMMING
Although the dot matrix method can be used to detect sequence similarity, it does not
readily resolve similarity that is interrupted by regions that do not match very well or that
are present in only one of the sequences (e.g., insertions or deletions). Therefore, one
would like to devise a method that can find what might be a tortuous path through a dot
matrix, providing the very best possible alignment, called an optimal alignment, between
the two sequences. Such an alignment can be represented by writing the sequences on suc-
cessive lines across the page, with matching characters placed in the same column and
unmatched characters placed in the same column as a mismatch or next to a gap as an
insertion (or deletion in the other sequence), as shown in Figure 1.3. To find an optimal
alignment in which all possible matches, insertions, and deletions have been considered to
find the best one is computationally so difficult that for proteins of length 300, 10^88
comparisons will have to be made (Waterman 1989).
To simplify the task, Needleman and Wunsch (1970) broke the problem down into a
progressive building of an alignment by comparing two amino acids at a time. They start-
ed at the end of each sequence and then moved ahead one amino acid pair at a time, allow-
ing for various combinations of matched pairs, mismatched pairs, or extra amino acids in
one sequence (insertion or deletion). In computer science, this approach is called dynam-
ic programming. The Needleman and Wunsch approach generated (1) every possible
alignment, each one including every possible combination of match, mismatch, and single
insertion or deletion, and (2) a scoring system to score the alignment. The object was to
determine which was the best alignment of all by determining the highest score. Thus,
every match in a trial alignment was given a score of 1, every mismatch a score of 0, and
individual gaps a penalty score. These numbers were then added across the alignment to
obtain a total score for the alignment. The alignment with the highest possible score was
defined as the optimal alignment.
Figure 1.2. A simple dot matrix comparison of two DNA sequences, AGCTAGGA and GACTAGGC.
The diagonal of dots reveals a run of similar sequence CTAGG in the two sequences.
[Figure 1.2 image: dot matrix with AGCTAGGA written across the top and GACTAGGC written down the left-hand side.]
Figure 1.3. An alignment of two sequences showing matches, mismatches, and gaps (-). The best
or optimal alignment requires that all three types of changes be allowed.
SEQUENCE A  A C G - - E V D G I
SEQUENCE B  A C G E Y - I D G I
The procedure for generating all of the possible alignments is to move sequentially
through all of the matched positions within a matrix, much like the dot matrix graph (see
above), starting at those positions that correspond to the end of one of the sequences, as
shown in Figure 1.4. At each position in the matrix, the highest possible score that can be
achieved up to that point is placed in that position, allowing for all possible starting points
in either sequence and any combination of matches, mismatches, insertions, and deletions.
The best alignment is found by finding the highest-scoring position in the graph, and then
tracing back through the graph through the path that generated the highest-scoring posi-
tions. The sequences are then aligned so that the sequence characters corresponding to this
path are matched.
Figure 1.4. Simplified example of Needleman-Wunsch alignment of sequences GATCTA and
GATCA. First, all matches in the two sequences are given a score of 1, and mismatches a score of 0
(not shown), chosen arbitrarily for this example. Second, the diagonal 1s are added sequentially, in
this case to a total score of 4. At this point the row cannot be extended by another match of 1 to a
total score of 5. However, an extension is possible if a gap is placed in GATCA to produce
GATC-A, where - is the gap. To add the gap, a penalty score is subtracted from the total match
score of 5 now appearing in the last row and column. The best alignment is found starting with the
sequence characters that correspond to the highest number and tracing back through the positions
that contributed to this highest score.
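A compact Python sketch of global alignment by dynamic programming is given below, applied to the two sequences of Figure 1.4. It follows the standard textbook formulation (fill the matrix from the start of the sequences, then trace back from the final cell), which may differ in detail from the original procedure described above. The match score of 1 and mismatch score of 0 come from the figure legend; the gap penalty of 1 is an assumption made for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
    # Trace back from the last cell through the moves that produced each score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


nw_score, nw_top, nw_bottom = needleman_wunsch("GATCTA", "GATCA")
print(nw_score)
print(nw_top)
print(nw_bottom)
```

With these parameters the five matches less one gap penalty give a total of 4, and the gap falls in GATCA as described in the figure legend.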

FINDING LOCAL ALIGNMENTS BETWEEN SEQUENCES
The above method finds the optimal alignment between two sequences, including the
entirety of each of the sequences. Such an alignment is called a global alignment. Smith and
Waterman (1981a,b) recognized that the most biologically significant regions in DNA and
protein sequences were subregions that align well and that the remaining regions made up
of less-related sequences were less significant. Therefore, they developed an important
modification of the Needleman-Wunsch algorithm, called the local alignment or Smith-
Waterman (or the Waterman-Smith) algorithm, to locate such regions. They also recog-
nized that insertions or deletions of any size are likely to be found as evolutionary changes
in sequences, and therefore adjusted their method to accommodate such changes. Finally,
they provided mathematical proof that the dynamic programming method is guaranteed
to provide an optimal alignment between sequences. The algorithm is discussed in detail
in Chapter 3 (p. 64).
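The essential change from global to local alignment can be sketched as follows. In the usual formulation of the Smith-Waterman method, no cell of the scoring matrix is allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores used here (match 1, mismatch -1, gap -1) are assumptions for illustration, and the sequences are those of Figure 1.2.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                      # start a new local alignment
                          h[i - 1][j - 1] + pair,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


sw_score, sw_top, sw_bottom = smith_waterman("AGCTAGGA", "GACTAGGC")
print(sw_score, sw_top, sw_bottom)
```

Here the poorly matching ends of the two sequences are simply left out of the reported alignment, which is the behavior that distinguishes a local from a global alignment.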
Two complementary measurements had been devised for scoring an alignment of two
sequences, a similarity score and a distance score. As shown in Figure 1.3, there are three
types of aligned pairs of characters in each column of an alignment—identical matches,
mismatches, and a gap opposite an unmatched character. Using as an example a simple
scoring system of 1 for each type of match, the similarity score adds up all of the matches
in the aligned sequences, and divides by the sum of the number of matches and mis-
matches (gaps are usually ignored). This method of scoring sequence similarity is the one
most familiar to biologists and was devised by Needleman and Wunsch and used by Smith
and Waterman. The other scoring method is a distance score that adds up the number of
substitutions required to change one sequence into the other. This score is most useful for
making predictions of evolutionary distances between genes or proteins to be used for phy-
logenetic (evolutionary) predictions, and the method was the work of mathematicians,
notably P. Sellers. The distance score is usually calculated by summing the number of
mismatches in an alignment divided by the total number of matches and mismatches. The
calculation represents the number of changes required to change one sequence into the
other, ignoring gaps. Thus, in the example shown in Figure 1.3, there are 6 matches and 1
mismatch in an alignment. The similarity score for the alignment is 6/7 = 0.86 and the
distance score is 1/7 = 0.14, if the required condition is given a simple score of 1. With this
simple scoring scheme, the similarity and distance scores add up to 1. Note also the
equivalence that the sum of the sequence lengths is equal to twice the number of matches plus
mismatches plus the number of deletions or insertions. Thus, in our example, the calculation
is 8 + 9 = 2 × (6 + 1) + 3 = 17. Usually more complex systems of scoring are used
to produce meaningful alignments, and alignments are evaluated by likelihood or odds
scores (Chapter 3), but an inverse relationship between similarity and distance scores for
the alignment still holds.
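The arithmetic of the simple similarity and distance scores can be checked with a short Python sketch. The two gapped strings below are the alignment of Figure 1.3 written with "-" for a gap (the column order is taken from the figure as extracted here, so treat it as an assumption), and the function name is illustrative.

```python
def alignment_scores(top, bottom, gap="-"):
    """Count column types in a pairwise alignment and derive the simple scores."""
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    compared = matches + mismatches       # gapped columns are ignored
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "similarity": matches / compared,
        "distance": mismatches / compared,
    }


seq_a = "ACG--EVDGI"
seq_b = "ACGEY-IDGI"
scores = alignment_scores(seq_a, seq_b)
total_length = len(seq_a.replace("-", "")) + len(seq_b.replace("-", ""))
print(scores, total_length)
```

The counts reproduce the values in the text: 6 matches, 1 mismatch, and 3 gapped columns, with the two ungapped lengths (8 and 9) summing to 17.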
A difficult problem encountered in aligning sequences is deciding whether or not a par-
ticular alignment is significant. Does a particular alignment score reveal similarity between
two sequences, or would the score be just as easily found between two unrelated sequences
(or random sequence of similar composition generated by the computer)? This problem
was addressed by S. Karlin and S. Altschul (1990, 1993) and is discussed in detail in
Chapter 3 (p. 96).
An analysis of scores of unrelated or random sequences revealed that the scores could
frequently achieve a value much higher than expected in a normal distribution. Rather, the
scores followed a distribution with a positively skewed tail, known as the extreme value dis-
tribution. This analysis provided a way to assess the probability that a score found between
two sequences could also be found in an alignment of unrelated or random sequences of
the same length. This discovery was particularly useful for assessing matches between a
query sequence and a sequence database discussed in Chapter 7. In this case, the evalua-
tion of a particular alignment score must take into account the number of sequence com-
parisons made in searching the database. Thus, if a score between a query protein sequence
and a database protein sequence is achieved with a probability of 10^-7 of being between
unrelated sequences, and 80,000 sequences were compared, then the highest expected
score (called the EXPECT score) is 10^-7 × 8 × 10^4 = 8 × 10^-3 = 0.008. A value of
0.02–0.05 is considered significant. Even when such a score is found, the alignment must
be carefully examined for shortness of the alignment, unrealistic amino acid matches, and
runs of repeated amino acids, the presence of which decreases confidence in an alignment.
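The EXPECT calculation above is a single multiplication, shown here in Python for the numbers used in the text. The second quantity, the chance of seeing at least one such score, uses the common Poisson approximation 1 - exp(-E); that step is not taken from the text and is included only as an assumption-labeled aside.

```python
import math


def expect_value(p_single, n_comparisons):
    """Expected number of chance alignments scoring this well in a database search."""
    return p_single * n_comparisons


e_value = expect_value(1e-7, 80_000)
# Assumption: chance hits follow a Poisson distribution with mean e_value.
p_at_least_one = -math.expm1(-e_value)
print(e_value, p_at_least_one)
```

For small values the two numbers are nearly equal, which is why the expect value is often read directly as a probability.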
MULTIPLE SEQUENCE ALIGNMENT
In addition to aligning a pair of sequences, methods have been developed for aligning three
or more sequences at the same time (for an early example, see Johnson and Doolittle 1986).
These methods are computer-intensive and usually are based on a sequential aligning of
the most-alike pairs of sequences. The programs commonly used are the GCG program
PILEUP (http://www.gcg.com/) and CLUSTALW (Thompson et al. 1994) (Baylor College
of Medicine, http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html). Once the
alignment of a related set of molecular sequences (a family) has been produced, highly
conserved regions (Gribskov et al. 1987) can be identified that may be common to that
particular family and may be used to identify other members of the same family. Two
matrix representations of the multiple sequence alignment called a PROFILE and a
POSITION-SPECIFIC SCORING MATRIX (PSSM) are important computational tools
for this purpose.
Multiple sequence alignments can also be the starting point for evolutionary modeling.
Each column of aligned sequence characters is examined, and then the most probable phy-
logenetic relationship or tree that would give rise to the observed changes is identified.
Another form of multiple sequence alignment is to search for a pattern that a set of DNA
or protein sequences has in common without first aligning the sequences (Stormo et al. 1982;
Stormo and Hartzell 1989; Staden 1984, 1989; Lawrence and Reilly 1990). For proteins, these
patterns may define a conserved component of a structural or functional domain. For DNA
sequences, the patterns may specify the binding site for a regulatory protein in a promoter
region or a processing signal in an RNA molecule. Both statistical and nonstatistical methods
have been widely used for this purpose. In effect, these methods sort through the sequences
trying to locate a series of adjacent characters in each of the sequences that, when aligned,
provides the highest number of matches. Neural networks, hidden Markov models, and the
expectation maximization and Gibbs sampling methods (Stormo et al. 1982; Lawrence et al.
1993; Krogh et al. 1994; Eddy et al. 1995) are examples of methods that are used. Explana-
tions and examples of these methods are described in Chapter 4.
PREDICTION OF RNA SECONDARY STRUCTURE
In addition to methods for predicting protein structure, methods for predicting RNA
secondary structure by computer were also developed at an early time. If the complement
of a sequence on an RNA molecule is repeated down the sequence in the opposite
chemical direction, the regions may base-pair and form a hairpin structure, as illustrated
in Figure 1.5.
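The condition for hairpin formation can be tested directly: a region can pair with another region of the same molecule if one is the reverse complement of the other. The Python sketch below applies this to the 20-nucleotide sequence printed with Figure 1.5. Only Watson-Crick pairs are considered, and the minimum loop size of 3 unpaired bases is an assumption made for this example; the function names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Complement of the sequence read in the opposite chemical direction."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def closing_stem_length(rna, min_loop=3):
    """Consecutive Watson-Crick pairs formed by pairing the two ends inward,
    leaving at least `min_loop` unpaired bases in the loop."""
    i, j, pairs = 0, len(rna) - 1, 0
    while j - i - 1 >= min_loop and COMPLEMENT[rna[i]] == rna[j]:
        pairs += 1
        i += 1
        j -= 1
    return pairs


hairpin = "GGCUGACCUGCAGGUCAGCC"
is_self_complementary = reverse_complement(hairpin) == hairpin
stem_pairs = closing_stem_length(hairpin)
print(is_self_complementary, stem_pairs)
```

This sequence is its own reverse complement, so its two halves can pair; with the assumed loop constraint the sketch finds a stem of 8 base pairs closed by a 4-base loop.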

Tinoco et al. (1971) generated these symmetrical regions in small oligonucleotide
molecules and tried to predict their stability based on estimates of the free energy associat-
ed with stacked base pairs in the model and of the destabilizing effects of loops, using a
table of energy values (Tinoco et al. 1971; Salser 1978). Single-stranded loops and other
unpaired regions decreased the predicted energy. Subsequently, Nussinov and Jacobson
(1980) devised a fast computer method for predicting an RNA molecule with the highest
possible number of base pairs based on the same dynamic programming algorithm used
for aligning sequences. This method was improved by Zuker and Stiegler (1981), who
added molecular constraints and thermodynamic information to predict the most ener-
getically stable structure.
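The base-pair maximization idea can be sketched with the dynamic programming recurrence commonly attributed to Nussinov and Jacobson. The Python version below is a minimal illustration rather than their original program: it counts only Watson-Crick pairs, the minimum loop size of 3 is an assumption, and it returns the number of pairs without the energy terms added later by Zuker and Stiegler.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least `min_loop`
    unpaired bases between the two members of any pair."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j is unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


toy_pairs = nussinov_max_pairs("GGGAAACCC")
hairpin_pairs = nussinov_max_pairs("GGCUGACCUGCAGGUCAGCC")
print(toy_pairs, hairpin_pairs)
```

The table is filled from short subsequences to long ones, in the same spirit as the alignment matrices described earlier in the chapter. The short test sequence GGGAAACCC is hypothetical; the second is the sequence printed with Figure 1.5.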
Another important use of RNA structure modeling is in the construction of databases
of RNA molecules. One of the most significant of these is the ribosomal RNA database
prepared by the laboratory of C. Woese (1987) (http://www.cme.msu.edu/RDP/html/index.html).
RNA secondary structure prediction is discussed in Chapter 5. Alignment,
structural modeling, and phylogenetic analysis based on these RNA sequences have
made possible the discovery of evolutionary relationships among organisms that would
not have been possible otherwise.
DISCOVERY OF EVOLUTIONARY RELATIONSHIPS USING SEQUENCES
Variations within a family of related nucleic acid or protein sequences provide an invalu-
able source of information for evolutionary biology. With the wealth of sequence infor-
mation becoming available, it is possible to track ancient genes, such as ribosomal RNA
and some proteins, back through the tree of life and to discover new organisms based on
their sequence (Barns et al. 1996). Diverse genes may follow different evolutionary histo-
ries, reflecting transfers of genetic material between species. Other types of phylogenetic
analyses can be used to identify genes within a family that are related by evolutionary
descent, called orthologs. Gene duplication events create two copies of a gene, called par-
alogs, and many such events can create a family of genes, each with a slightly altered, or
possibly new, function. Once alignments have been produced and alignment scores found,
the most closely related sequence pairs become apparent and may be placed in the outer
branches of an evolutionary tree, as shown for sequences A and B in Figure 1.1 (p. 2). The
next most-alike sequence, sequence C in Figure 1.1, will be represented by the next branch
down on the tree. Continuing this process generates a predicted pattern of evolution for
that particular gene. Once a tree has been found, the sequence changes that have taken
place in the tree branches can be inferred.
Figure 1.5. Folding of a single-stranded RNA molecule into a hairpin secondary structure. Shown are
portions of the sequence that are complementary: They can base-pair to form a double-stranded
region. G/C base pairs are the most energetic due to 3 H bonds; A/U and G/U are next most
energetic with two and one H bonds, respectively.
[Figure 1.5 image: the sequence G G C U G A C C U G C A G G U C A G C C, with regions I, II, and III marked, shown folded into a hairpin.]
The starting point for making a phylogenetic tree is a sequence alignment. For each pair
of sequences, the sequence similarity score gives an indication as to which sequences are
most closely related. A tree that best accounts for the numbers of changes (distances)
between the sequences may then be derived from these scores (Fitch and Margoliash 1967).
The method most commonly used for this purpose is the neighbor-joining method (Saitou
and Nei 1987) described in Chapter 6. Alternatively, if a reliable multiple sequence align-
ment is available, the tree that is most consistent with the observed variation found in each
column of the sequence alignment may be used. The tree that imposes the minimum num-
ber of changes (the maximum parsimony tree) is the one chosen (Felsenstein 1988).
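The distance-based approach can be illustrated with a small sketch. Note that this is not the neighbor-joining method cited above; it uses the simpler average-linkage (UPGMA-style) clustering, chosen only because it fits in a few lines. The function names, the use of the fraction of differing aligned positions as the distance, and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(aligned):
    """Join the closest clusters repeatedly; return the tree as nested tuples.

    `aligned` maps a sequence name to an aligned sequence (all the same length).
    """
    nodes = [(name, [name]) for name in aligned]    # (subtree, member names)

    def average(m1, m2):
        total = sum(p_distance(aligned[x], aligned[y]) for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(nodes) > 1:
        i, j = min(combinations(range(len(nodes)), 2),
                   key=lambda ij: average(nodes[ij[0]][1], nodes[ij[1]][1]))
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
print(upgma(toy))
```

In the toy data, A and B differ at one position and are joined first, and C is attached on the next branch down, which mirrors the arrangement described for Figure 1.1.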
In making phylogenetic predictions, one must consider the possibility that several trees
may give almost the same results. Tests of significance have therefore been derived to
determine how well the sequence variation supports the existence of a particular tree
branch (Felsenstein 1988). These developments are also discussed in Chapter 6.
IMPORTANCE OF DATABASE SEARCHES FOR SIMILAR SEQUENCES
As DNA sequencing became a common laboratory activity, genes with an important biological
function could be sequenced with the hope of learning something about the biochemical
nature of the gene product. Examples were the retrovirus-encoded v-sis and
v-src oncogenes, genes that cause cancer in animals. By comparing the predicted sequences
of the viral products with all of the known protein sequences at the time, R. Doolittle and
colleagues (1983) and W. Barker and M. Dayhoff (1982) both made the startling discovery
that these genes appeared to be derived from cellular genes. The Sis protein had a sequence
very similar to that of the platelet-derived growth factor (PDGF) from mammalian cells,
and Src to the catalytic chain of mammalian cAMP-dependent kinases. Thus, it appeared
likely that the retrovirus had acquired the gene from the host cell as some kind of genetic
exchange event and then had produced a mutant form of the protein that could compro-
mise the function of the normal protein when the virus infected another animal. Subse-
quently, as molecular biologists analyzed more and more gene sequences, they discovered
that many organisms share similar genes that can be identified by their sequence similarity.
These searches have been greatly facilitated by having genetic and biochemical informa-
tion from model organisms, such as the bacterium Escherichia coli and the budding yeast Sac-
charomyces cerevisiae. In these organisms, extensive genetic analysis has revealed the function
of genes, and the sequences of these genes have also been determined. Finding a gene in a new
organism (e.g., a crop plant) with a sequence similar to a model organism gene (e.g., yeast)
provides a prediction that the new gene has the same function as in the model organism.
Such searches are becoming quite commonplace and are greatly facilitated by programs such
as FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990).
The methods used by BLAST and other powerful methods for sequence similarity
searching are described further in the next section and in Chapter 7.
THE FASTA AND BLAST METHODS FOR DATABASE SEARCHES
As the number of new sequences collected in the laboratory increased, there was also an
increased need for computer programs that provided a way to compare these new
sequences sequentially to each sequence in the existing database of sequences, as was done
to identify successfully the function of viral oncogenes. The dynamic programming
method of Needleman and Wunsch would not work because it was much too slow for the
computers of the time; today, however, with much faster computers available, this method
can be used. W. Pearson and D. Lipman (1988) developed a program called FASTA, which
performed a database scan for similarity in a short enough time to make such scans rou-
tinely possible. FASTA provides a rapid way to find short stretches of similar sequence
between a new sequence and any sequence in a database. Each sequence is broken down
into short words a few sequence characters long, and these words are organized into a table
indicating where they are in the sequence. If one or more words are present in both
sequences, and especially if several words can be joined, the sequences must be similar in
those regions. Pearson (1990, 1996) has continued to improve the FASTA method for sim-
ilarity searches in sequence databases.
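The word-table idea can be sketched as follows. This is only an illustration of the indexing step, not the FASTA program itself: the function names, the word length of 2, and the example sequences are assumptions. Words shared by the two sequences are grouped by diagonal (the difference between their positions), because words that can be joined into one similar region fall on the same diagonal.

```python
from collections import Counter, defaultdict


def word_table(seq, k):
    """Map every word of length k to the positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def best_diagonal(query, subject, k=2):
    """Return (offset, number of shared words) for the richest diagonal, or None."""
    table = word_table(query, k)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            diagonals[i - j] += 1       # same offset means same diagonal
    return diagonals.most_common(1)[0] if diagonals else None


print(best_diagonal("HEAGAWGHEE", "PAWHEAE"))
```

Because the table is built once per sequence and then only looked up, no position-by-position comparison of the two sequences is needed, which is where the speed comes from.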
An even faster program for similarity searching in sequence databases, called BLAST,
was developed by S. Altschul et al. (1990). This method is widely used from the Web site
of the National Center for Biotechnology Information at the National Library of Medicine
in Bethesda, Maryland (http://www.ncbi.nlm.nih.gov/BLAST). The BLAST server is probably
the most widely used sequence analysis facility in the world and provides similarity search-
ing to all currently available sequences. Like FASTA, BLAST prepares a table of short
sequence words in each sequence, but it also determines which of these words are most sig-
nificant such that they are a good indicator of similarity in two sequences, and then con-
fines the search to these words (and related ones), as described in Figure 1.6. There are ver-
sions of BLAST for searching nucleic acid and protein databases, which can be used to
translate DNA sequences prior to comparing them to protein sequence databases (Altschul
et al. 1997). Recent improvements in BLAST include GAPPED-BLAST, which is threefold
faster than the original BLAST, but which appears to find as many matches in databases,
and PSI-BLAST (position-specific-iterated BLAST), which can find more distant matches
to a test protein sequence by repeatedly searching for additional sequences that match an
alignment of the query and initially matched sequences. These methods are discussed in
Chapter 7.
PREDICTING THE SEQUENCE OF A PROTEIN BY TRANSLATION OF DNA SEQUENCES
Protein sequences are predicted by translating DNA sequences that are cDNA copies of
mRNA sequences from a predicted start and end of an open reading frame. Unfortunate-
ly, cDNA sequences are much less prevalent than genomic sequences in the databases. Par-
tial sequence (expressed sequence tags, or ESTs) libraries for many organisms are available,
but these only provide a fraction of the carboxy-terminal end of the protein sequence and
usually only have about 99% accuracy. For organisms that have few or no introns in their
genomic DNA (such as bacterial genomes), the genomic DNA may be translated. For most
eukaryotic organisms with introns in their genes, the protein-encoding exons must be predicted
and then translated by methods described in Chapter 8. These genome-based predictions
are not always accurate, and thus it remains important to have cDNA sequences
of protein-encoding genes. Promoter sequences in genomes may also be analyzed for common
patterns that reflect common regulatory features. These types of analyses require
sophisticated approaches that are also discussed in Chapter 8 (Hertz et al. 1990).
Figure 1.6. Rapid identification of sequence similarity by FASTA and BLAST. FASTA looks for
short regions in these two amino acid sequences that match and then tries to extend the alignment
to the right and left. In this case, the program found by a quick and simple indexing method that
W, I, and then V occurred in the same order in both sequences, providing a good starting point for
an alignment. BLAST works similarly, but only examines matched patterns of length 3 of the more
significant amino acid substitutions that are expected to align less frequently by chance alone.
[Figure 1.6 image: portions of sequence A and sequence B aligned at the shared residues V, W, and I. Image label: Bill Pearson.]
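Translation of an open reading frame, as described in this section, can be sketched with the standard genetic code. The function names and the short test sequences are assumptions for illustration; the sketch reads only the forward strand, starts at the first ATG it finds, and makes no attempt to predict exons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon from its first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def first_orf_protein(dna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns None when there is no ATG or no in-frame stop after it.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start < 0:
        return None
    protein = translate(dna[start:])
    stop = protein.find("*")
    return protein[:stop] if stop >= 0 else None


print(first_orf_protein("ccATGAAATGGTGAtt"))
```

A fuller treatment would examine all six reading frames and, for eukaryotic genomic DNA, would first have to remove the introns.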
PREDICTING PROTEIN SECONDARY STRUCTURE
There are a large number of proteins whose sequences are known, but very few whose
structures have been solved. Solving protein structures involves the time-consuming and
highly specialized procedures of X-ray crystallography and nuclear magnetic resonance
(NMR). Consequently, there is much interest in trying to predict the structure of a protein,
given its sequence. Proteins are synthesized as linear chains of amino acids; they then form
secondary structures along the chain, such as α helices, as a result of interactions between
side chains of nearby amino acids. The region of the molecule with these secondary structures
then folds back and forth on itself to form tertiary structures that include α helices,
β sheets comprising interacting β strands, and loops (Fig. 1.7). This folding often leaves
amino acids with hydrophobic side chains facing into the interior of the folded molecule
and polar amino acids that can interact with water and the molecular environment facing
outside in loops. The amino acid sequence of the protein directs the folding pathway,
sometimes assisted by proteins called chaperonins. Chou and Fasman (1978) and Garnier
et al. (1978) searched the small structural database of proteins for the amino acids associated
with each of the secondary structure types: α helices, turns, and β strands. Sequences
of proteins whose structures were not known were then scanned to determine whether the
amino acids in each region were those often associated with one type of structure. For
example, the amino acid proline is not often found in α helices because its side chain is not
compatible with forming a helix. This method predicted the structure of some proteins
well but, in general, was about as likely to predict a correct as an incorrect structure.
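The counting step behind this kind of prediction can be sketched as follows. This is a simplified illustration rather than the published Chou-Fasman or Garnier procedure: the function names, the window length, and the two toy "structures" (H for helix, C for everything else) are invented for the example and are not real data.

```python
from collections import Counter


def propensities(examples, state):
    """Propensity of each amino acid for `state`.

    `examples` is a list of (sequence, structure) strings of equal length.
    A value above 1 means the amino acid is found in `state` more often
    than expected from how common that state is overall.
    """
    in_state, overall = Counter(), Counter()
    for seq, struct in examples:
        for aa, s in zip(seq, struct):
            overall[aa] += 1
            if s == state:
                in_state[aa] += 1
    frac_state = sum(in_state.values()) / sum(overall.values())
    return {aa: (in_state[aa] / overall[aa]) / frac_state for aa in overall}


def window_scores(seq, prop, window=4):
    """Mean propensity of each window; unseen amino acids count as neutral (1.0)."""
    return [sum(prop.get(a, 1.0) for a in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


toy = [("AEELLPGA", "HHHHHCCC"), ("GPAEAL", "CCHHHH")]   # hypothetical examples
helix = propensities(toy, "H")
print(helix["E"], helix["P"])
print(window_scores("AEELLP", helix))
```

In the toy data proline never occurs in a helix, so its propensity is 0, in line with the observation about proline made above. With a database this small the values are of course not meaningful.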
As more protein structures were solved experimentally, computational methods were
used to find those that had a similar structural fold (the same arrangement of secondary
structures connected by similar loops). These methods led to the discovery that as new
protein structures were being solved, they often had a structural fold that was already
known in a group of sequences. Thus, proteins are found to have a limited number of
folds (about 500; Chothia 1992), perhaps due to chemical restraints on protein folding or to the
existence of a single evolutionary pathway for protein structure (Gibrat et al. 1996). Furthermore,
proteins without any sequence similarity could adopt the same fold, thus greatly
complicating the prediction of structure from sequence. Methods for finding whether or
not a given protein sequence can occupy the same three-dimensional conformation as
another based on the properties of the amino acids have been devised (Bowie et al. 1991).
Databases of structural families of proteins are available on the Web and are described in
Chapter 9.
Figure 1.7. Folding of a protein from a linear chain of amino acids to a three-dimensional structure.
The folding pathway involves amino acid interactions. Many different amino acid patterns are found
in the same types of folds, thus making structure prediction from amino acid sequence a difficult
undertaking.
Amos Bairoch (Bairoch et al. 1997) developed another method for predicting the bio-
chemical activity of an unknown protein, given its sequence. He collected sequences of
proteins that had a common biochemical activity, for example an ATP-binding site, and
deduced the pattern of amino acids that was responsible for that activity, allowing for some
variability. These patterns were collected into the PROSITE database (http://www.expasy.
ch/prosite). Unknown sequences were scanned for the same patterns. Subsequently, Steve
and Jorja Henikoff (Henikoff and Henikoff 1992) examined alignments of the protein
sequences that make up each MOTIF and discovered additional patterns in the aligned
sequences called BLOCKS (see http://www.blocks.fhcrc.org/). These patterns offered an
expanded ability to determine whether or not an unknown protein possessed a particular
biochemical activity. The changes that were in each column of these aligned patterns were
counted and a new set of amino acid substitution matrices, called BLOSUM matrices, sim-
ilar to the PAM matrices of Margaret Dayhoff, were produced. One of these matrices,
BLOSUM62, is most often used for aligning protein sequences and searching databases for
similar sequences (Henikoff and Henikoff 1992) (see Chapter 7).
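Scanning an unknown sequence for a pattern of this kind can be sketched by converting the pattern into a regular expression. The converter below is a simplified assumption that handles only basic elements: a single residue, a bracketed set such as [AG], an excluded set such as {P}, the wildcard x, and repeat counts such as x(4) or x(2,4). The pattern used is the widely quoted ATP/GTP-binding (P-loop) motif, and the protein sequence is made up for the example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # core residue(s), optionally followed by (n) or (n,m)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # any amino acid
            core = "."
        elif core.startswith("{"):           # any amino acid except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = "" if low is None else "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (start position, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


P_LOOP = "[AG]-x(4)-G-K-[ST]"
print(prosite_to_regex(P_LOOP))
print(find_motif("MSGAGKSGKSTLLA", P_LOOP))   # hypothetical sequence
```

A pattern match of this kind is all or nothing, which is one reason the column-by-column counts in aligned blocks offer an expanded ability to detect a shared activity.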
Sophisticated statistical and machine-learning techniques have been used in more recent
protein structure prediction programs, and the success rate has increased. A recent
advance in this now active field of research is to organize proteins into groups or families
on the basis of sequence similarity, and to find consensus patterns of amino acid domains
characteristic of these families using the statistical methods described in Chapters 4 and 9.
There are many publicly accessible Web sites described in Chapter 9 that provide the lat-
est methods for identifying proteins and predicting their structures.
THE FIRST COMPLETE GENOME SEQUENCE
Although many viruses had already been sequenced, the first planned attempt to sequence
a free-living organism was by Fred Blattner and colleagues (Blattner et al. 1997) using the
bacterium E. coli. However, there was some concern over whether such a large sequence,
about 4 × 10^6 bp, could be obtained by the then-current sequencing technology. The first
published genome sequence was that of the single, circular chromosome of another bacterium,
Haemophilus influenzae (Fleischmann et al. 1995), by The Institute for Genomic
Research (TIGR, at http://www.tigr.org/), which had been started by researcher Craig Venter.
The project was assisted by microbiologist Hamilton Smith, who had worked with this
organism for many years. The speedup in sequencing involved using automated reading of
DNA sequencing gels through dye-labeling of bases, and breaking down the chromosome
into random fragments and sequencing these fragments as rapidly as possible without
knowledge of their location in the whole chromosome. Computer analysis of such shotgun
cloning and sequencing techniques had been developed much earlier by R. Staden at Cam-
bridge University and other workers, but the TIGR undertaking was much more ambi-
tious. In this genome project, newly read sequences were immediately entered into a com-
puter database and compared with each other to find overlaps and produce contigs of two
or more sequences with the assistance of computer programs. This procedure circumvented
the need to grow and keep track of large numbers of subclones. Although the same
sequence was often obtained up to 10 times, the sequence of the entire chromosome
(2 × 10^6 bp), less a few gaps, was rapidly assembled in the computer over a 9-month period at
a cost of about $10^6.
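The step of comparing reads with each other to find overlaps and build contigs can be sketched with a simple greedy merge. This is an illustration only and not the assembly software used in the genome project: the function names, the minimum overlap of three bases, and the short error-free reads are assumptions, and real assemblers must also cope with sequencing errors, repeats, and reads from both strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the pair of sequences with the longest overlap until none remain."""
    contigs = list(dict.fromkeys(reads))        # drop exact duplicate reads
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs                      # no overlaps left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


print(greedy_assemble(["ATGCGT", "CGTACG", "ACGGA"]))
```

Reads that share no sufficient overlap are returned as separate contigs, which corresponds to the gaps left in an assembly.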
This success heralded a large number of other sequencing projects of various prokary-
otic and eukaryotic microorganisms, with a tremendous potential payoff in terms of uti-
lizable gene products and evolutionary information about these organisms. To date, com-
pleted projects include more than 30 prokaryotes, yeast S. cerevisiae (see Cherry et al.
1997), the nematode Caenorhabditis elegans (see C. elegans Sequencing Consortium 1998),
and the fruit fly Drosophila (see Adams et al. 2000). The plant Arabidopsis thaliana and the
human genome sequencing projects are ongoing and will be completed during 2000 or
shortly thereafter.
ACEDB, THE FIRST GENOME DATABASE
As more genetic and sequence information became available for the model organisms,
interest arose in generating specific genome databases that could be queried to retrieve this
information. Such an enterprise required a new level of sharing of data and resources
between laboratories. Although there were initial concerns about copyright issues, credits,
accuracy, editorial review, and curating, eventually these concerns disappeared or became
resolved as resources on the Internet developed. The first genome database, called ACEDB
(a C. elegans database), and the methods to access this database were developed by Mike
Cherry and colleagues (Cherry and Cartinhour 1993). This database was accessible
through the Internet and allowed retrieval of sequences, information about genes and
mutants, investigator addresses, and references. Similar databases were subsequently
developed using the same methods for A. thaliana and S. cerevisiae. Presently, there is a
large number of such publicly available databases. Web access to these databases is dis-
cussed in Chapter 10 (Table 10.1, p. 482).
The Human Genome Project, a large, federally funded collaborative project, will complete
sequencing of the entire human genome by 2003. The project was developed from
an idea discussed at scientific meetings in 1984 and 1985, and a pilot project, the
Human Genome Initiative, was begun by the Department of Energy (DOE) in 1986.
National Institutes of Health funding of the project began in 1987 under the Office of
Genome Research. Currently, the project is constituted as the National Human
Genome Research Initiative. In 1998, a new commercial venture under the leadership
of Craig Venter was formed to sequence the majority of the human genome by 2001.
This group, which uses a whole genome shotgun cloning approach and intensive computer
processing of data, has already completed the Drosophila sequence and will
sequence the mouse genome following completion of the human genome. Both groups
simultaneously announced completion of the sequencing of the human genome in
2000.
REFERENCES
Adams M.D., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W.,
Hoskins R.A., Galle R.F., et al. 2000. The genome sequence of Drosophila melanogaster. Science 287:
2185–2195.
Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. 1990. Basic local alignment search tool.
J. Mol. Biol. 215: 403–410.
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped
BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res.
25: 3389–3402.
Bairoch A., Bucher P., and Hofmann K. 1997. The PROSITE database, its status in 1997. Nucleic Acids
Res. 25: 217–221.
Barker W.C. and Dayhoff M.O. 1982. Viral src gene products are related to the catalytic chain of mam-
malian cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. 79: 2836–2839.
Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal diversity, ther-
mophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. 93: 9188–9193.
Blattner F.R., Plunkett III, G., Bloch C.A., Perna N.T., Burland V., Riley M., Collado-Vides J., Glasner
J.D., Rode C.K., Mayhew G.F., Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
Mau B., and Shao Y. 1997. The complete genome sequence of Escherichia coli K-12. Science 277:
1453–1474.
Bowie J.U., Luthy R., and Eisenberg D. 1991. A method to identify protein sequences that fold into a
known three-dimensional structure. Science 253: 164–170.
C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for
investigating biology. Science 282: 2012–2018.
Cherry J.M. and Cartinhour S.W. 1993. ACEDB, a tool for biological information. In Automated DNA
sequencing and analysis (ed. M. Adams et al.). Academic Press, New York.
Cherry J.M., Ball C., Weng S., Juvik G., Schmidt R., Adler C., Dunn B., Dwight S., Riles L., Mortimer
R. K., and Botstein D. 1997. Genetic and physical maps of Saccharomyces cerevisiae. Nature (suppl.
6632) 387: 67–73.
Chothia C. 1992. Proteins. One thousand families for the molecular biologist. Nature 357: 543–544.
Chou P.Y. and Fasman G.D. 1978. Prediction of the secondary structure of proteins from their amino
acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 47: 45–147.
Dayhoff M.O., Ed. 1972. Atlas of protein sequence and structure, vol. 5. National Biomedical Research
Foundation, Georgetown University, Washington, D.C.
———. 1978. Survey of new data and computer methods of analysis. In Atlas of protein sequence and
structure, vol. 5, suppl. 3. National Biomedical Research Foundation, Georgetown University, Wash-
ington, D.C.
Doolittle R.F., Hunkapiller M.W., Hood L.E., Devare S.G., Robbins K.C., Aaronson S.A., and Antoni-
ades H.N. 1983. Simian sarcoma onc gene v-sis is derived from the gene (or genes) encoding a
platelet-derived growth factor. Science 221: 275–277.
Eddy S.R., Mitchison G., and Durbin R. 1995. Maximum discrimination hidden Markov models of
sequence consensus. J. Comput. Biol. 2: 9–23.
Ewing B. and Green P. 1998. Base-calling of automated sequence traces using phred. II. Error probabil-
ities. Genome Res. 8: 186–194.
Ewing B., Hillier L., Wendl M.C., and Green P. 1998. Base-calling of automated sequence traces using
phred. I. Accuracy assessment. Genome Res. 8: 175–185.
Felsenstein J. 1988. Phylogenies from molecular sequences: Inferences and reliability. Annu. Rev. Genet.
22: 521–565.
Fitch W.M. and Margoliash E. 1967. Construction of phylogenetic trees. Science 155: 279–284.
Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F., Kerlavage A.R., Bult C.J., Tomb
J.F., Dougherty B.A., Merrick J.M., et al. 1995. Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science 269: 496–512.
Garnier J., Osguthorpe D.J., and Robson B. 1978. Analysis of the accuracy and implications of simple
methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97–120.
Gibbs A.J. and McIntyre G.A. 1970. The diagram, a method for comparing sequences. Its use with amino
acid and nucleotide sequences. Eur. J. Biochem. 16: 1–11.
Gibrat J.F., Madej T., and Bryant S.H. 1996. Surprising similarity in structure comparison. Curr. Opin.
Struct. Biol. 6: 377–385.
Gribskov M., McLachlan A.D., and Eisenberg D. 1987. Profile analysis: Detection of distantly related
proteins. Proc. Natl. Acad. Sci. 84: 4355–4358.
Henikoff S. and Henikoff J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl.
Acad. Sci. 89: 10915–10919.

Hertz G.Z., Hartzell III, G.W., and Stormo G.D. 1990. Identification of consensus patterns in unaligned
DNA sequences known to be functionally related. Comput. Appl. Biosci. 6: 81–92.
Johnson M.S. and Doolittle R.F. 1986. A method for the simultaneous alignment of three or more amino
acid sequences. J. Mol. Evol. 23: 267–268.
Karlin S. and Altschul S.F. 1990. Methods for assessing the statistical significance of molecular sequence
features by using general scoring schemes. Proc. Natl. Acad. Sci. 87: 2264–2268.
———. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences.
Proc. Natl. Acad. Sci. 90: 5873–5877.
Krogh A., Brown M., Mian I.S., Sjölander K., and Haussler D. 1994. Hidden Markov models in compu-
tational biology. Applications to protein modeling. J. Mol. Biol. 235: 1501–1531.
Lawrence C.E. and Reilly A.A. 1990. An expectation maximization (EM) algorithm for the identification
and characterization of common sites in unaligned biopolymer sequences. Proteins Struct. Funct.
Genet. 7: 41–51.
Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., and Wootton J.C. 1993. Detecting
subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262: 208–214.
Maizel Jr., J.V. and Lenk R.P. 1981. Enhanced graphic matrix analyses of nucleic acid and protein syn-
thesis. Proc. Natl. Acad. Sci. 78: 7665–7669.
Maxam A.M. and Gilbert W. 1977. A new method for sequencing DNA. Proc. Natl. Acad. Sci. 74:
560–564.
Needleman S.B. and Wunsch C.D. 1970. A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Nussinov R. and Jacobson A.B. 1980. Fast algorithm for predicting the secondary structure of
single-stranded RNA. Proc. Natl. Acad. Sci. 77: 6309–6313.
Pearson W.R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzy-
mol. 183: 63–98.
———. 1996. Effective protein sequence comparison. Methods Enzymol. 266: 227–258.
Pearson W.R. and Lipman D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl.
Acad. Sci. 85: 2444–2448.
Saitou N. and Nei M. 1987. The neighbor-joining method: A new method for reconstructing phyloge-
netic trees. Mol. Biol. Evol. 4: 406–425.
Salser W. 1978. Globin mRNA sequences: Analysis of base pairing and evolutionary implications. Cold
Spring Harbor Symp. Quant. Biol. 42: 985–1002.
Sanger F. and Tuppy H. 1951. The amino acid sequence of the phenylalanyl chain of insulin. Biochem. J.
49: 481–490.
Sanger F., Nicklen S., and Coulson A.R. 1977. DNA sequencing with chain terminating inhibitors. Proc.
Natl. Acad. Sci. 74: 5463–5467.
Smith T.F. and Waterman M.S. 1981a. Identification of common molecular subsequences. J. Mol. Biol.
147: 195–197.
———. 1981b. Comparison of biosequences. Adv. Appl. Math. 2: 482–489.
Staden R. 1984. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 12:
505–519.
———. 1989. Methods for calculating the probabilities of finding patterns in sequences. Comput. Appl.
Biosci. 5: 89–96.
Stormo G.D. and Hartzell III, G.W. 1989. Identifying protein-binding sites from unaligned DNA frag-
ments. Proc. Natl. Acad. Sci. 86: 1183–1187.
Stormo G.D., Schneider T.D., Gold L., and Ehrenfeucht A. 1982. Use of the ‘Perceptron’ algorithm to
distinguish translational initiation sites in E. coli. Nucleic Acids Res. 10: 2997–3011.
Thompson J.D., Higgins D.G., and Gibson T.J. 1994. CLUSTAL W: Improving the sensitivity of pro-
gressive multiple sequence alignment through sequence weighting, position-specific gap penalties
and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.
Tinoco Jr., I., Uhlenbeck O.C., and Levine M.D. 1971. Estimation of secondary structure in ribonucleic
acids. Nature 230: 362–367.
Waterman M.S., Ed. 1989. Sequence alignments. In Mathematical methods for DNA sequences. CRC
Press, Boca Raton, Florida.
Woese C.R. 1987. Bacterial evolution. Microbiol. Rev. 51: 221–271.
Zuker M. and Stiegler P. 1981. Optimal computer folding of large RNA sequences using thermodynam-
ics and auxiliary information. Nucleic Acids Res. 9: 133–148.

Additional Reading
Reference Books and Special Journal Editions
Baldi P. and Brunak S. 1998. Bioinformatics: The machine learning approach. MIT Press, Cambridge,
Massachusetts.
Baxevanis A.D. and Ouellette B.F., Eds. 1998. Bioinformatics: A practical guide to the analysis of genes and
proteins. John Wiley & Sons, New York.
Doolittle R.F. 1986. Of URFS and ORFS: A primer on how to analyze derived amino acid sequences. Uni-
versity Science Books, Mill Valley, California.
———, Ed. 1990. Molecular evolution: Computer analysis of protein and nucleic acid sequences. Meth-
ods Enzymol., vol. 183. Academic Press, San Diego.
———, Ed. 1996. Computer methods for macromolecular sequence analysis. Methods Enzymol., vol.
266. Academic Press, San Diego, California.
Durbin R., Eddy S., Krogh A., and Mitchison G., Eds. 1998. Biological sequence analysis. Probabilistic
models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Gribskov M. and Devereux J., Eds. 1991. Sequence analysis primer. University of Wisconsin Biotechnol-
ogy Center Biotechnical Resource Ser. (ser. ed. R.R. Burgess). Stockton Press, New York.
Gusfield D. 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology.
Cambridge University Press, Cambridge, United Kingdom.
Martinez H., Ed. 1984. Mathematical and computational problems in the analysis of molecular
sequences (special commemorative issue honoring Margaret Oakley Dayhoff). Bull. Math. Biol.
Pergamon Press, New York.
Nucleic Acids Research. 1996–2000. Special database issues published in the January issues of volumes
22–26. Oxford University Press, Oxford, United Kingdom.
Salzberg S.L., Searls D.B., and Kasif S., Eds. 1999. Computational methods in molecular biology. New
Compr. Biochem., vol. 32. Elsevier, Amsterdam, The Netherlands.
Sankoff D. and Kruskal J.R., Eds. 1983. Time warps, string edits, and macromolecules: The theory and prac-
tice of sequence comparison. Addison-Wesley, Don Mills, Ontario.
Söll D. and Roberts R.J., Eds. 1982. The application of computers to research on nucleic acids. I. Nucle-
ic Acids Res., vol. 10. Oxford University Press, Oxford, United Kingdom.
———. 1984. The application of computers to research on nucleic acids. II. Nucleic Acids Res., vol. 12.
Oxford University Press, Oxford, United Kingdom.
von Heijne G. 1987. Sequence analysis in molecular biology — Treasure trove or trivial pursuit. Academic
Press, San Diego, California.
Waterman M.S., Ed. 1989. Mathematical analysis of molecular sequences (special issue). Bull. Math. Biol.
Pergamon Press, New York.
———. 1995. Introduction to computational biology: Maps, sequences, and genomes. Chapman and Hall,
London, United Kingdom.
Yap T.K., Frieder O., and Martino R.L. 1996. High performance computational methods for biological
sequence analysis. Kluwer Academic, Norwell, Massachusetts.
Journals That Routinely Publish Papers on Sequence Analysis
Bioinformatics (formerly Comput. Appl. Biosci. [CABIOS]). Oxford University Press, Oxford, United
Kingdom. http://bioinformatics.oupjournals.org/cabios/.
Journal of Computational Biology. Mary Ann Liebert, Larchmont, New York. http://www-
hto.usc.edu/jcb/.
Journal of Molecular Biology. Academic Press, London, United Kingdom. http://www.hbuk.co.uk/jmb.
Nucleic Acids Research (sections on Genomics and Computational Biology). Oxford University Press,
Oxford, United Kingdom. http://nar.oupjournals.org.

CHAPTER 2
Collecting and Storing Sequences in the Laboratory
DNA sequencing
Genomic sequencing
Sequencing cDNA libraries of expressed genes
Submission of sequences to the databases
Sequence accuracy
Computer storage of sequences
Sequence formats
GenBank DNA sequence entry
European Molecular Biology Laboratory data library format
SwissProt sequence format
FASTA sequence format
National Biomedical Research Foundation/Protein Information Resource sequence format
Stanford University/Intelligenetics sequence format
Genetics Computer Group sequence format
Format of sequence file retrieved from the National Biomedical Research Foundation/Protein Information Resource
Plain/ASCII. Staden sequence format
Abstract Syntax Notation sequence format
Genetic Data Environment sequence format
Conversions of one sequence format to another
READSEQ to switch between sequence formats
GCG Programs for Conversion of Sequence Formats
Multiple sequence formats
Storage of information in a sequence database
Using the database access program ENTREZ
REFERENCES

THIS CHAPTER SUMMARIZES METHODS used to collect sequences of DNA molecules and
store them in computer files. Once in the computer, the sequences can be analyzed by a
variety of methods. Additionally, assembly of the sequences of large molecules from short
sequence fragments can readily be undertaken. Assembled sequences are stored in a computer
file along with identifying features, such as DNA source (organism), gene name, and
investigator. Sequences and accessory information are then entered into a database. This
procedure organizes them so that specific ones can be retrieved by a database query program
for subsequent use. Unfortunately, most sequence analysis programs require that the
information in a sequence file be stored in a particular format. To use these programs, it is
necessary to be aware of these formats and to be able to convert one format to another.
These programs are outlined in greater detail in Chapter 3.
DNA SEQUENCING
Sequencing DNA has become a routine task in the molecular biology laboratory. Purified
fragments of DNA cut from plasmid/phage clones or amplified by polymerase chain reaction
(PCR) are denatured to single strands, and one of the strands is hybridized to an
oligonucleotide primer. In an automated procedure, new strands of DNA are synthesized
from the end of the primer by heat-resistant Taq polymerase from a pool of
deoxyribonucleotide triphosphates (dNTPs) that includes a small amount of one of four
chain-terminating nucleotides (ddNTPs). For example, using ddATP, the resulting synthesis
creates a set of nested DNA fragments, each one ending at one of the As in the sequence
through the substitution of a fluorescent-labeled ddATP, as shown in Figure 2.1. A similar
set of fragments is made for each of the other three bases, but each is labeled with a
different fluorescent ddNTP.
The combined mixture of all labeled DNA fragments is electrophoresed to separate the
fragments by size, and the ladder of fragments is scanned for the presence of each of the
four labels, producing data similar to those shown in Figure 2.2. A computer program then
determines the probable order of the bands and predicts the sequence. Depending on the
actual procedure being used, one run may generate a reliable sequence of as many as 500
nucleotides. For accurate work, a printout of the scan is usually examined for
abnormalities that decrease the quality of the sequence, and the sequence may then be edited
manually. The sequence can also be verified by making an oligonucleotide primer
complementary to the distal part of the readable sequence and using it to obtain the
sequence of the complementary strand on the original DNA template. The first sequence can
also be extended by making a second oligonucleotide matching the distal end of the readable
sequence and using this primer to read more of the original template. When the process is
fully automated, a number of priming sites may be used to obtain sequencing results that
give optimal separation of bands in each region of the sequence. By repeating this
procedure, both strands of a DNA fragment several kilobases in length can be sequenced
(Fig. 2.3).
Figure 2.1. Method used to synthesize a nested set of DNA fragments, each ending at a base position
complementary to one of the bases in the template sequence. To the left is a double-stranded DNA
molecule several kilobases in length. After denaturation, the DNA is annealed to a short
oligonucleotide primer (black arrow), which is complementary to an already sequenced region on the
molecule. New DNA is then synthesized in the presence of a fluorescently labeled chain-terminating
ddNTP for one of the four bases. The reactions produce a nested set of labeled molecules. The
resulting fragments are separated in order by length to give the sequence display shown in Fig. 2.2.
Figure 2.2. Example of a DNA sequence obtained on an ABI-Prism 377 automated sequencer. The target
DNA is denatured by heating and then annealed to a specific primer. Sequencing reactions are carried
out in a single tube containing Amplitaq (Perkin-Elmer), dNTPs, and four ddNTPs, each base labeled
with a different fluorescent dichloro-rhodamine dye. The polymerase extends synthesis from the
primer, until a ddNTP is incorporated instead of dNTP, terminating the molecule. The denaturing,
reannealing, and synthesis steps are recycled up to 25 times, excess labeled ddNTPs are removed, and
the remaining products are electrophoresed on one lane of a polyacrylamide gel. As the bands move
down the gel, the rhodamine dyes are excited by a laser within the sequencer. Each of the four ddNTP
types emits light at a different wavelength band that is detected by a digital camera. The sequence
of changes is plotted as shown in the figure and the sequence is read by a base-calling algorithm.
More recently developed machines allow sequencing of 96 samples at a time by capillary
electrophoresis using more automated procedures. The accuracy and reliability of high-throughput
sequencing have been much improved by the development of the PHRED, PHRAP, and CONSED system for
base-calling, sequence assembly, and assembled sequence editing (Ewing and Green 1998; Gordon et al.
1998).
GENOMIC SEQUENCING
To sequence larger molecules, such as human chromosomes, individual chromosomes are
purified and broken into 100-kb or larger random fragments, which are cloned into vectors
designed for large molecules, such as yeast (YAC) or bacterial (BAC) artificial
chromosomes. In a laborious procedure, the resulting library is screened for fragments called
contigs, which have overlapping or common sequences, to produce an integrated map of
the chromosome. Many levels of clone redundancy may be required to build a consensus
map because individual clones can have rearrangements, deletions, or two separate
fragments. These do not reflect the correct map and have to be eliminated. Once the correct
map has been obtained, unique overlapping clones are chosen for sequencing. However,
these molecules are too large for direct sequencing. One procedure for sequencing these
clones is to subclone them further into smaller fragments that are of sizes suitable for
sequencing, make a map of these clones, and then sequence overlapping clones (Fig. 2.4).
However, this method is slow and expensive because a great deal of time is needed to keep
track of all the subclones.
An alternative method is to sequence all the subclones, produce a computer database of
the sequences, and then have the computer assemble the sequences from the overlaps that
are found. Up to 10 levels of redundancy are used to get around the problem of a small
fraction of abnormal clones. This procedure was first used to obtain the sequence of the
1.8-Mb chromosome of the bacterium Haemophilus influenzae by The Institute for Genomic
Research (TIGR) team (Fleischmann et al. 1995). Only a few regions could not be joined
because of a problem subcloning those regions into plasmids, requiring manual sequencing
of these regions from another library of phage subclones.
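The idea of letting the computer assemble reads from the overlaps it finds can be sketched in a few lines. The following is a minimal illustration only, assuming error-free reads from a single strand and a simple greedy merge; the function names, the minimum overlap length, and the sample reads are hypothetical, and production assemblers use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # no remaining overlaps; whatever is left stays separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads covering one short region.
reads = ["ATGGCGT", "GCGTACC", "ACCTTGA"]
print(greedy_assemble(reads))
```

Because the sketch joins any two reads that share an end, two reads carrying the same repeated sequence from different regions would also be merged, which is the difficulty with repeats raised in connection with shotgun sequencing.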
Figure 2.3. Sequential sequencing of a DNA molecule using oligonucleotide primers. One of the
denatured template DNA strands is primed for sequencing by an oligonucleotide (yellow)
complementary to a known sequence on the molecule. The resulting sequence may then be used to
produce two more oligonucleotide primers downstream in the sequence, one to sequence more of the
same strand (purple) and a second (turquoise) that hybridizes to the complementary strand and
produces a sequence running backward on this strand, thus providing a way to confirm the first
sequence obtained.

Shotgun Sequencing
A controversy has arisen as to whether or not the above shotgun sequencing strategy
can be applied to genomes with repetitive sequences such as those likely to be
encountered in sequencing the human genome (Green 1997; Myers 1997). When
DNA fragments derived from different chromosomal regions have repeats of the
same sequence, they will appear to overlap. In a new whole-genome shotgun approach,
Celera Genomics is sequencing both ends of DNA fragments of short (2 kb), medium
(10 kb), and long (BAC or 100 kb) lengths. A large number of reads are then
assembled by computer. This method has been used to assemble the genome of the
fruit fly Drosophila melanogaster after removal of the most highly repetitive regions
(Myers et al. 2000) and also to assemble a significant proportion of the human
genome.
[Figure 2.4 panel labels: Map fragments; Sequence overlapping fragments; Sequence all fragments and assemble; Assembled sequence]
Figure 2.4. Methods for large-scale sequencing. A large DNA molecule 100 kb to several megabases
in size is randomly sheared and cloned into a cloning vector. In one method, a map of
various-sized fragments is first made, overlapping fragments are identified, and these are
sequenced. In a faster method that is computationally intense, fragments in different size ranges
are placed in vectors, and their ends are sequenced. Fragments are sequenced without knowledge of
their chromosomal location, and the sequence of the large parent molecule is assembled from any
overlaps found. As more and more fragments are sequenced, there are enough overlaps to cover most
of the sequence.
SEQUENCING cDNA LIBRARIES OF EXPRESSED GENES
Two common goals in sequence analysis are to identify sequences that encode proteins,
which determine all cellular metabolism, and to discover sequences that regulate the
expression of genes or other cellular processes. Genomic sequencing as described above
meets both goals. However, only a small percentage of the genomic sequence of many
organisms actually encodes proteins because of the presence of introns within coding
regions and other noncoding regions in the genome. Although there has been a great deal
of progress in developing computational methods for analyzing genomic sequences and
finding these protein-encoding regions (see Chapter 8), these methods are not completely
reliable and, furthermore, such genomic sequences are often not available. Therefore,
cDNA libraries have been prepared that have the same sequences as the mRNA molecules
produced by organisms, or else cDNA copies are sequenced directly by RT-PCR (copying
of mRNA by reverse transcriptase followed by amplification of the cDNA copy by the
polymerase chain reaction and sequencing of the product). By using cDNA sequence with the
introns removed, it is much simpler to locate protein-encoding sequences in these molecules.
The only possible difficulty is that a gene of interest may be developmentally expressed or
regulated in such a way that the mRNA is not present. This problem has been circumvented by
pooling mRNA preparations from tissues that express a large proportion of the genome, from a
variety of tissues and developing organs or from organisms subjected to several environmental
influences. An important development for computational purposes was the decision by Craig
Venter to prepare databases of partial sequences of the expressed genes, called expressed
sequence tags or ESTs, which have just enough DNA sequence to give a reasonably good idea
of the protein sequence. The translated sequence can then be compared to a database of
protein sequences with the hope of finding a strong similarity to a protein of known
function, and hence to identify the function of the cloned EST. The corresponding cDNA clone
of the gene of interest can then be obtained and the gene completely sequenced.
SUBMISSION OF SEQUENCES TO THE DATABASES
Investigators are encouraged to submit their newly obtained sequences directly to a
member of the International Nucleotide Sequence Database Collaboration, such as the
National Center for Biotechnology Information (NCBI), which manages GenBank
(http://www.ncbi.nlm.nih.gov); the DNA Databank of Japan (DDBJ;
http://www.ddbj.nig.ac.jp); or the European Molecular Biology Laboratory (EMBL)/EBI
Nucleotide Sequence Database (http://www.embl-heidelberg.de). NCBI reviews new
entries and updates existing ones, as requested. A database accession number, which is
required to publish the sequence, is provided. New sequences are exchanged daily by the
GenBank, EMBL, and DDBJ databases.
The simplest and newest way of submitting sequences is through the Web site
http://www.ncbi.nlm.nih.gov/ on a Web form page called BankIt. The sequence can also be
annotated with information about the sequence, such as mRNA start and coding regions.
The submitted form is transformed into GenBank format and returned to the submitter
for review before being added to GenBank. The other method of submission is to use
Sequin (formerly called Authorin), which runs on personal computers and UNIX
machines. The program provides an easy-to-use graphic interface and can manage large
submissions such as genomic sequence information. It is described and demonstrated on
http://www.ncbi.nlm.nih.gov/Sequin/index.html and may be obtained by anonymous FTP
from ncbi.nlm.nih.gov/sequin/. Completed files can also be E-mailed to
gb-sub@ncbi.nlm.nih.gov or can be mailed on diskette to GenBank Submissions, National
Center for Biotechnology Information, National Library of Medicine, Bldg. 38A, Room
8N-803, Bethesda, Maryland 20894.
SEQUENCE ACCURACY
It should be apparent from the above description of sequencing projects that the higher the
level of accuracy required in DNA sequences, the more time-consuming and expensive the
procedure. There is no detailed check of sequence accuracy prior to submission to GenBank
and other databases. Often, a sequence is submitted at the time of publication of the
sequence in a journal article, providing a certain level of checking by the editorial
peer-review process. However, many sequences are submitted without being published or prior
to publication. In laboratories performing large sequencing projects, such as those engaged
in the Human Genome Project or the genome projects of model organisms, the granting
agency requires a certain level of accuracy of the order of 1 possible error per 10 kb. This
level of accuracy should be sufficient for most sequence analysis applications such as
sequence comparisons, pattern searching, and translation. In other laboratories, such as
those performing a single-attempt sequencing of ESTs, the error rate may be much higher,
approximately 1 in 100, including incorrectly identified bases and inserted or deleted bases.
Thus, in translating EST sequences in GenBank and other databases, incorrect bases may
translate to the wrong amino acid. The worst problem, however, is that base
insertions/deletions will cause frameshifts in the sequence, thus making alignment with a
protein sequence very difficult. Another type of database sequence that is error-prone is a
fragment of sequence from the immunological variant of a pathogenic organism, such as the
regions in the protein coat of the human immunodeficiency virus (HIV). Although this low level
of accuracy may be suitable for some purposes such as identification, for more detailed
analyses, e.g., evolutionary analyses, the accuracy of such sequence fragments should be verified.
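The different consequences of a miscalled base and an inserted base can be seen by translating a short sequence. The sketch below is illustrative only: it assumes the standard genetic code, translates in the first reading frame, and uses a made-up 15-base sequence; the names CODON_TABLE and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA sequence; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


original = "ATGGCTGAAAAGCTG"        # hypothetical coding sequence
substituted = "ATGGATGAAAAGCTG"     # one miscalled base (C -> A at position 5)
inserted = "ATGGACTGAAAAGCTG"       # one extra base (A) inserted after position 4

for label, seq in [("original", original),
                   ("substitution", substituted),
                   ("insertion", inserted)]:
    print(f"{label:13s}{translate(seq)}")
```

The substitution changes a single amino acid, whereas the insertion alters every codon downstream of it and here introduces a premature stop.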
COMPUTER STORAGE OF SEQUENCES
Before using a sequence file in a sequence analysis program, it is important to ensure that
computer sequence files contain only sequence characters and not special characters used
by text editors. Editing a sequence file with a word processor can introduce such changes
if one is not careful to work only with text or so-called ASCII files (those containing only
the characters found on the typewriter keyboard). Most word processors normally create
files that include control characters in addition to standard ASCII characters. These
control characters will only be recognized correctly by the program that created them.
Sequence files that contain such control characters may not be analyzed correctly,
depending on whether or not the sequence analysis program filters them out. Editors usually
provide a way to save files with only standard ASCII characters, and these files will be
suitable for most sequence analysis programs.
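A quick check for stray characters can be scripted before a file is passed to an analysis program. This is a minimal sketch under stated assumptions: the set of permitted control characters (tab, line feed, carriage return) and the accepted sequence alphabet are choices made here for illustration, and both function names are hypothetical.

```python
import string


def find_suspect_bytes(raw: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for bytes that are not plain printable ASCII.

    Tab, line feed, and carriage return are tolerated (an assumption of this sketch).
    """
    allowed_control = {0x09, 0x0A, 0x0D}
    return [
        (i, b) for i, b in enumerate(raw)
        if b > 0x7E or (b < 0x20 and b not in allowed_control)
    ]


def clean_sequence(raw: bytes, alphabet: str = string.ascii_letters + "*-") -> str:
    """Keep only characters from the accepted alphabet and return them in upper case."""
    text = raw.decode("ascii", errors="ignore")
    return "".join(ch for ch in text if ch in alphabet).upper()


# Hypothetical file contents with a form feed and a non-ASCII byte mixed in.
raw = b"ATG\x0cCC\r\nGA\xa0T\n"
print(find_suspect_bytes(raw))
print(clean_sequence(raw))
```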
ASCII and Hexadecimal
Computers store sequence information as simple rows of sequence characters called
strings, which are similar to the sequences shown on the computer terminal. Each
character is stored in binary code in the smallest unit of memory, called a byte. Each
byte comprises 8 bits, with each bit having a possible value of 0 or 1, producing 256
possible combinations (values 0 to 255). By convention, many of these combinations have a
specific definition, called their ASCII equivalent. Some ASCII values are defined as keyboard
characters, others as special control characters, such as signaling the end of a line (a
line feed and a carriage return), or the end of a file full of text (end-of-file character).
A file with only ASCII characters is called an ASCII file. For convenience, all binary
values may be written in a hexadecimal format, which corresponds to our decimal
format 0, 1, ..., 9 plus the letters A, B, ..., F. Thus, hexadecimal 0F corresponds
to binary 0000 1111 and decimal 15, and FF corresponds to binary 1111 1111 and
decimal 255. A DNA sequence is usually stored and read in the computer as a series
of 8-bit words in this binary format. A protein sequence appears as a series of 8-bit
words comprising the corresponding binary form of the amino acid letters.
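The correspondence between a sequence character and its decimal, hexadecimal, and binary forms can be displayed directly. A small sketch follows; the helper name describe is hypothetical.

```python
def describe(ch: str) -> tuple[int, str, str]:
    """Return the ASCII code of a character as decimal, hexadecimal, and 8-bit binary."""
    code = ord(ch)
    return code, f"{code:02X}", f"{code:08b}"


for base in "ACGT":
    decimal, hexa, binary = describe(base)
    print(base, decimal, hexa, binary)

# The two values quoted in the text.
print(int("0F", 16), int("FF", 16))
```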

Sequence and other data files that contain non-ASCII characters also may not be transferred
correctly from one machine to another and may cause unpredictable behavior of the
communications software. Some communications software can be set to ignore such control
characters. For example, the file transfer program (FTP) has ASCII and binary modes, which may be
set by the user. The ASCII mode is useful for transferring text files, and the binary mode is
useful for transferring compressed data files, which also contain non-ASCII characters.
Most sequence analysis programs also require not only that a DNA or protein sequence
file be a standard ASCII file, but also that the file be in a particular format such as the
FASTA format (see below). The use of windows on a computer has simplified such problems,
since one merely has to copy a sequence from one window, for example, a window
that is running a Web browser on the ENTREZ Web site, and paste it into another, for
example, that of a translation program.
In addition to the standard four base symbols, A, T, G, and C, the Nomenclature
Committee of the International Union of Biochemistry has established a standard code to
represent bases in a nucleic acid sequence that are uncertain or ambiguous. The codes are
listed in Table 2.1.
For computer analysis of proteins, it is more convenient to use single-letter than
three-letter amino acid codes. For example, GenBank DNA sequence entries contain a translated
sequence in single-letter code. The standard, single-letter amino acid code was established
by a joint international committee, and is shown in Table 2.2. When the name of
only one amino acid starts with a particular letter, then that letter is used, e.g., C, cysteine.
In other cases, the letter chosen is phonetically similar (R, arginine) or close by in the
alphabet (K, lysine).
Table 2.1. Base–nucleic acid codes
Symbol   Meaning                      Explanation
G        G                            Guanine
A        A                            Adenine
T        T                            Thymine
C        C                            Cytosine
R        A or G                       puRine
Y        C or T                       pYrimidine
M        A or C                       aMino
K        G or T                       Keto
S        C or G                       Strong interactions, 3 h bonds
W        A or T                       Weak interactions, 2 h bonds
H        A, C or T; not G             H follows G in alphabet
B        C, G or T; not A             B follows A in alphabet
V        A, C or G; not T (not U)     V follows U in alphabet
D        A, G or T; not C             D follows C in alphabet
N        A, C, G or T                 Any base
Adapted from NC-IUB (1984).
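The codes in Table 2.1 translate naturally into a lookup table that a program can use to test whether a sequence is compatible with an ambiguous pattern. The sketch below is one possible arrangement; the names IUB, matches, and expand are hypothetical, and the example patterns are made up.

```python
from itertools import product

# Each symbol from Table 2.1 mapped to the bases it can stand for.
IUB = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def matches(pattern: str, seq: str) -> bool:
    """True if every base in seq is allowed by the symbol at the same position."""
    pattern, seq = pattern.upper(), seq.upper()
    return len(pattern) == len(seq) and all(
        base in IUB[symbol] for symbol, base in zip(pattern, seq)
    )


def expand(pattern: str) -> list[str]:
    """List every unambiguous sequence an ambiguous pattern can represent."""
    return ["".join(p) for p in product(*(IUB[s] for s in pattern.upper()))]


print(matches("GAYR", "GATA"))   # Y allows T, R allows A
print(matches("GAYR", "GAGA"))   # Y does not allow G
print(expand("AY"))
```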

Table 2.2. Table of standard amino acid code letters
1-letter code   3-letter code   Amino acid
A (a)           Ala             alanine
C               Cys             cysteine
D               Asp             aspartic acid
E               Glu             glutamic acid
F               Phe             phenylalanine
G               Gly             glycine
H               His             histidine
I               Ile             isoleucine
K               Lys             lysine
L               Leu             leucine
M               Met             methionine
N               Asn             asparagine
P               Pro             proline
Q               Gln             glutamine
R               Arg             arginine
S               Ser             serine
T               Thr             threonine
V               Val             valine
W               Trp             tryptophan
X               Xxx             undetermined amino acid
Y               Tyr             tyrosine
Z (b)           Glx             either glutamic acid or glutamine
Adapted from IUPAC-IUB (1969, 1972, 1983).
(a) Letters not shown are not commonly used.
(b) Note that sometimes when computer programs translate DNA sequences, they will put a
“Z” at the end to indicate the termination codon. This character should be deleted from the
sequence.

SEQUENCE FORMATS
One major difficulty encountered in running sequence analysis software is the use of
differing sequence formats by different programs. These formats all are standard ASCII files,
but they may differ in the presence of certain characters and words that indicate where
different types of information and the sequence itself are to be found. The more commonly
used sequence formats are discussed below.
GenBank DNA Sequence Entry
The format of a database entry in GenBank, the NCBI nucleic acid and protein sequence
database, is as follows: Information describing each sequence entry is given, including
literature references, information about the function of the sequence, locations of mRNAs
and coding regions, and positions of important mutations. This information is organized
into fields, each with an identifier, shown as the first text on each line. In some entries,
these identifiers may be abbreviated to two letters, e.g., RF for reference, and some
identifiers may have additional subfields. The information provided in these fields is described
in Figure 2.5 and the database organization is described in Figure 2.6. The CDS subfield in
the field FEATURES gives the amino acid sequence, obtained by translation of known and
potential open reading frames, i.e., a consecutive set of three-letter words that could be
codons specifying the amino acid sequence of a protein. The sequence entry is assumed by
computer programs to lie between the identifiers “ORIGIN” and “//”.
The sequence includes numbers on each line so that sequence positions can be located
by eye. Because the sequence count or a sequence checksum value may be used by the
computer program to verify the sequence composition, the sequence count should not be
modified except by programs that also modify the count. The GenBank sequence format often
has to be changed for use with sequence analysis software.
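Since programs take the sequence to lie between “ORIGIN” and “//”, extracting it amounts to collecting the lines between those two identifiers and discarding the position numbers and spaces. The following is a minimal sketch, not a full GenBank parser; the function name and the abbreviated sample entry are hypothetical.

```python
def genbank_sequence(entry: str) -> str:
    """Return the sequence found between the ORIGIN line and the // terminator."""
    in_sequence = False
    parts = []
    for line in entry.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True          # sequence lines start after this identifier
            continue
        if line.startswith("//"):
            break                       # end of the entry
        if in_sequence:
            # Drop position numbers and spaces, keep only the letters.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts).upper()


# Hypothetical, heavily abbreviated entry for illustration.
entry = """LOCUS       EXAMPLE1      24 bp    DNA
DEFINITION  Made-up example entry.
ORIGIN
        1 atggctgaaa agctgttcgg
       21 ataa
//
"""
print(genbank_sequence(entry))
```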
Figure 2.5. GenBank DNA sequence entry.
Figure 2.6. Organization of the GenBank database and the search procedure used by ENTREZ. In this database format, each
row is another sequence entry and each column another GenBank field. When one sequence entry is retrieved, all of these
fields will be displayed, as in Fig. 2.5. Only a few fields and simple examples are shown for illustration. A search for the term
“SOS regulon and coli” in all fields will find two matching sequences. Finding these sequences is simple because indexes have
been made listing all of the sequences that have any given term, one index for each field. Similarly, a search for transcriptional
regulator will find three sequences.

European Molecular Biology Laboratory Data Library Format
The European Molecular Biology Laboratory (EMBL) maintains DNA and protein
sequence databases. The format for each entry in these databases is shown in Figure 2.7. As
with GenBank entries, a large amount of information describing each sequence entry is
given, including literature references, information about the function of the sequence,
locations of mRNAs and coding regions, and positions of important mutations. This
information is organized into fields, each with an identifier, shown as the first text on each line.
The meaning of each of these fields is explained in Figure 2.7. These identifiers are
abbreviated to two letters, e.g., RF for reference, and some identifiers may have additional
subfields. The sequence entry is assumed by computer programs to lie between the identifiers
“SEQUENCE” and “//” and includes numbers on each line to locate parts of the sequence
visually. The sequence count or a checksum value for the sequence may be used by
computer programs to make sure that the sequence is complete and accurate. For this reason,
the sequence part of the entry should usually not be modified except with programs that
also modify this count. This EMBL sequence format is very similar to the GenBank format.
The main differences are in the use of the term ORIGIN in the GenBank format to
indicate the start of sequence; also, the EMBL entry does not include the sequence of any
translation products, which are shown instead as a different entry in the database. This sequence
format often has to be changed for use with sequence analysis software.
Figure 2.7. EMBL sequence entry format.
Note: The output of a DDBJ DNA sequence entry is almost identical to that of GenBank.
SwissProt Sequence Format
The format of an entry in the SwissProt protein sequence database is very similar to the
EMBL format, except that considerably more information about the physical and
biochemical properties of the protein is provided.
FASTA Sequence Format
The FASTA sequence format includes three parts shown in Figure 2.8: (1) a comment line
identified by a “>” character in the first column followed by the name and origin of the
sequence; (2) the sequence in standard one-letter symbols; and (3) an optional “*” which
indicates end of sequence and which may or may not be present. The presence of “*” may
be essential for reading the sequence correctly by some sequence analysis programs. The
FASTA format is the one most often used by sequence analysis software. This format
provides a very convenient way to copy just the sequence part from one window to another
because there are no numbers or other nonsequence characters within the sequence. The
FASTA sequence format is similar to the protein information resource (NBRF) format
except that the NBRF format includes a first line with a “>” character in the first column
followed by information about the sequence, a second line containing an identification
name for the sequence, and the third to last lines containing the sequence, as described
below.
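The three parts of a FASTA entry map directly onto a short reader: a line beginning with “>” starts a new record, the following lines are concatenated as sequence, and a trailing “*” is dropped if present. This is a minimal sketch with a hypothetical function name and made-up records.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (comment line, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).rstrip("*")))
            header, chunks = line[1:].strip(), []   # text after ">" is the comment
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).rstrip("*")))
    return records


# Hypothetical two-record file; the first record ends with the optional "*".
text = ">seq1 example\nATGGCT\nGAAAAG*\n>seq2\nMKV\n"
for name, seq in parse_fasta(text):
    print(name, seq)
```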
National Biomedical Research Foundation/Protein Information Resource Sequence
Format
This sequence format, which is sometimes also called the PIR format, has been used by the
National Biomedical Research Foundation/Protein Information Resource (NBRF) and
also by other sequence analysis programs. Note that sequences retrieved from the PIR
database on their Web site (http://www-nbrf.georgetown.edu) are not in this compact
format, but in an expanded format with much more information about the sequence, as
shown below. The NBRF format is similar to the FASTA sequence format but with
significant differences. An example of a PIR sequence format is given in Figure 2.9. The first line
includes an initial “>” character followed by a letter code such as P for complete
sequence or F for fragment, followed by a 1 or 2 to indicate type of sequence, then a
semicolon, then a four- to six-character unique name for the entry. There is also an essential
second line with the full name of the sequence, a hyphen, then the species of origin. In
FASTA format, the second line is the start of the sequence and the first line gives the
sequence identifier after a “>” sign. The sequence terminates with an asterisk.
Figure 2.8. FASTA sequence entry format.
Figure 2.9. NBRF sequence entry format.

Stanford University/Intelligenetics Sequence Format
Started by a molecular genetics group at Stanford University, and subsequently continued
by a company, Intelligenetics, the IG format is similar to the PIR format (Fig. 2.10), except
that a semicolon is usually placed before the comment line. The identifier on the second
line is also present. At the end of the sequence, a 1 is placed if the sequence is linear, and a
2 if the sequence is circular.
Genetics Computer Group Sequence Format
Earlier versions of the Genetics Computer Group (GCG) programs require a unique
sequence format and include programs that convert other sequence formats into GCG
format. Later versions of GCG accept several sequence formats. A converted GenBank file is
illustrated in Figure 2.11. Information about the sequence in the GenBank entry is first
included, followed by a line of information about the sequence and a checksum value. This
value (not shown) is provided as a check on the accuracy of the sequence by the addition
of the ASCII values of the sequence. If the sequence has not been changed, this value
should stay the same. If one or more sequence characters become changed through error,
a program reading the sequence will be able to determine that the change has occurred
because the checksum value in the sequence entry will no longer be correct. Lines of
information are terminated by two periods, which mark the end of information and the start of
the sequence on the next line. The rest of the text in the entry is treated as sequence. Note
the presence of line numbers. Since there is no symbol to indicate end of sequence, no text
other than sequence should be added beyond this point. The sequence should not be
altered except by programs that will also adjust the checksum score for the sequence. The
GCG sequence format may have to be changed for use with other sequence analysis
software. GCG also includes programs for reformatting sequence files.
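The principle of a checksum built by adding ASCII values can be illustrated as follows. This is a sketch of the idea as described in the text, not the algorithm actually used by the GCG programs, whose details are not given here; the function name is hypothetical.

```python
def simple_checksum(sequence: str) -> int:
    """Add the ASCII values of the sequence letters (illustrative only, not the GCG algorithm)."""
    return sum(ord(ch) for ch in sequence.upper() if ch.isalpha())


stored = simple_checksum("ACGT")      # value recorded when the file was written
print(stored)
print(simple_checksum("ACGA"))        # a changed base gives a different value
print(simple_checksum("ACTG"))        # swapping two bases gives the same value
```

A plain sum detects a changed character but not two characters that have exchanged places, since addition ignores order.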
Figure 2.10. Intelligenetics sequence entry format.
Figure 2.11. GCG sequence entry format.

Format of Sequence File Retrieved from the National Biomedical Research
Foundation/Protein Information Resource
The file format has approximately the same information as a GenBank or EMBL sequence
file but is formatted slightly differently, as in Figure 2.12. This format is presently called the
PIR/CODATA format.
Plain/ASCII.Staden Sequence Format
This sequence format is a computer file that includes only the sequence with no other
accessory information. This particular format is used by the Staden Sequence Analysis
programs (http://www.mrc-lmb.cam.ac.uk/pubseq) produced by Roger Staden at Cambridge
University (Staden et al. 2000). The sequence must be further formatted to be used for
most sequence analysis programs.
Figure 2.12. Protein Information Resource sequence format.

Abstract Syntax Notation Sequence Format
Abstract Syntax Notation (ASN.1) is a formal data description language that has been
developed by the computer industry. ASN.1
(http://www-sop.inria.fr/rodeo/personnel/hoschka/asn1.html; NCBI 1993) has been adopted by
the National Center for Biotechnology Information (NCBI) to encode data such as sequences,
maps, taxonomic information, molecular structures, and bibliographic information. These data
sets may then be easily connected and accessed by computers. The ASN.1 sequence format is a
highly structured and detailed format especially designed for computer access to the data. All
the information found in other forms of sequence storage, e.g., the GenBank format, is present.
For example, sequences can be retrieved in this format by ENTREZ (see below). However, the
information is much more difficult to read by eye than a GenBank formatted sequence.
One would normally not need to use the ASN.1 format except when running a computer
program that uses this format as input.
Genetic Data Environment Sequence Format
Genetic Data Environment (GDE) format is used by a sequence analysis system called the
Genetic Data Environment, which was designed by Steven Smith and collaborators (Smith
et al. 1994) around a multiple sequence alignment editor that runs on UNIX machines.
The GDE features are incorporated into the SEQLAB interface of the GCG software,
version 9. GDE format is a tagged-field format similar to ASN.1 that is used for storing all
available information about a sequence, including residue color. The file consists of
various fields (Fig. 2.13), each enclosed by brackets, and each field has specific lines, each with
a given name tag. The information following each tag is placed in double quotes or follows
the tag name by one or more spaces.
Figure 2.13. The Genetic Data Environment format.

36 ■ CHAPTER 2
CONVERSIONS OF ONE SEQUENCE FORMAT TO ANOTHER
READSEQ to Switch between Sequence Formats
READSEQ is an extremely useful sequence formatting program developed by D. G. Gilbert
at Indiana University, Bloomington (gilbertd@bio.indiana.edu). READSEQ can recognize
a DNA or protein sequence file in any of the formats shown in Table 2.3, identify the
format, and write a new file with an alternative format. Some of these formats are used for
special types of analyses such as multiple sequence alignment and phylogenetic analysis.
The appearance of these formats for two sample DNA sequences, seq1 and seq2, is shown
in Table 2.4. READSEQ may be reached at the Baylor College of Medicine site at
http://dot.imgen.bcm.tmc.edu:9331/seq-util/readseq.html, and the appropriate files may
also be obtained by anonymous FTP from ftp.bio.indiana.edu/molbio/readseq or
ftp.bio.indiana.edu/molbio/mac.
Data files that have multiple sequences, such as those required for multiple sequence
alignment and phylogenetic analysis using parsimony (PAUP), are also converted. Examples
of the types of files produced are shown in Table 2.4. Options to reverse-complement
and to remove gaps from sequences are included. SEQIO, another sequence conversion
program for a UNIX machine, is described at http://bioweb.pasteur.fr/docs/seqio/seqio.html
and is available for download at http://www.cs.ucdavis.edu/gusfield/seqio.html.
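The two options mentioned above, removing gaps and reverse-complementing, can be illustrated with a short Python sketch. This is not READSEQ code: the function names (parse_fasta, remove_gaps, reverse_complement) and the set of gap characters are assumptions chosen for illustration, and only unambiguous DNA bases are complemented.

```python
from typing import Dict

# Complement table for the four unambiguous DNA bases, in both cases.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def parse_fasta(text: str) -> Dict[str, str]:
    """Read FASTA text into {name: sequence}; whitespace inside sequences is dropped."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += "".join(line.split())
    return records


def remove_gaps(seq: str, gap_chars: str = "-.~") -> str:
    """Drop gap symbols (assumed here to be '-', '.', and '~')."""
    return "".join(c for c in seq if c not in gap_chars)


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


sample = ">seq1\nagctagctagctagct\n>seq2\naactaactaactaact\n"
for name, seq in parse_fasta(sample).items():
    print(name, reverse_complement(remove_gaps(seq)))
```

Because agct is its own reverse complement, seq1 is returned unchanged, whereas seq2 becomes agttagttagttagtt.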
Table 2.3. Sequence formats recognized by format conversion program READSEQ
1. Abstract Syntax Notation (ASN.1)
2. DNA Strider
3. European Molecular Biology Laboratory (EMBL)
4. Fasta/Pearson
5. Fitch (for phylogenetic analysis)
6. GenBank
7. Genetics Computer Group (GCG)^a
8. Intelligenetics/Stanford
9. Multiple sequence format (MSF)
10. National Biomedical Research Foundation (NBRF)
11. Olsen (input only)
12. Phylogenetic Analysis Using Parsimony (PAUP) NEXUS format
13. Phylogenetic Inference package (Phylip v3.3, v3.4)
14. Phylogenetic Inference package (Phylip v3.2)
15. Plain text/Staden^a
16. Pretty format for publication (output only)
17. Protein Information Resource (PIR or CODATA)
18. Zuker for RNA analysis (input only)
^a For conversion of single sequence files only. The other conversions can be performed on
files with single or multiple sequences.

Table 2.4. Multiple sequence format conversions by READSEQ
1. Fasta/Pearson format
>seq1
agctagctagctagct
>seq2
aactaactaactaact
2. Intelligenetics format
;seq1, 16 bases, 2688 checksum.
seq1
agctagctagctagct1
;seq2, 16 bases, 25C8 checksum.
seq2
aactaactaactaact1
3. GenBank format
LOCUS seq1 16 bp
DEFINITION seq1, 16 bases, 2688 checksum.
ORIGIN
1 agctagctag ctagct
//
LOCUS seq2 16 bp
DEFINITION seq2, 16 bases, 25C8 checksum.
ORIGIN
1 aactaactaa ctaact
//
4. NBRF format
>DL;seq1
seq1, 16 bases, 2688 checksum.
agctagctag ctagct*
>DL;seq2
seq2, 16 bases, 25C8 checksum.
aactaactaa ctaact*
5. EMBL format
ID seq1
DE seq1, 16 bases, 2688 checksum.
SQ 16 BP
agctagctag ctagct
//
ID seq2
DE seq2, 16 bases, 25C8 checksum.
SQ 16 BP
aactaactaa ctaact
//
6. GCG format
seq1
seq1 Length: 16 Check: 9864 ..
1 agctagctag ctagct
seq2
seq2 Length: 16 Check: 9672 ..
1 aactaactaa ctaact
7. Format for the Macintosh sequence analysis program DNA Strider
; ### from DNA Strider ;-)
; DNA sequence seq1, 16 bases, 2688 checksum.
;
agctagctagctagct
//
; ### from DNA Strider ;-)
; DNA sequence seq2, 16 bases, 25C8 checksum.
;
aactaactaactaact
//
8. Format for phylogenetic analysis programs of Walter Fitch
agc tag cta gct agc t
aac taa cta act aac t
9. Format for phylogenetic analysis programs PHYLIP of J. Felsenstein v 3.3 and 3.4.
2 16
seq1 agctagctag ctagct
seq2 aactaactaa ctaact
10. Protein Information Resource PIR/CODATA format
ENTRY seq1
TITLE seq1, 16 bases, 2688 checksum.
SEQUENCE
5 10 15 20 25 30
1 a g c t a g c t a g c t a g c t
///
ENTRY seq2
TITLE seq2, 16 bases, 25C8 checksum.
SEQUENCE
5 10 15 20 25 30
1 a a c t a a c t a a c t a a c t
///

11. GCG multiple sequence format (MSF)
/tmp/readseq.in.2449 MSF: 16 Type: N January 01, 1776 12:00 Check: 9536 ..
Name: seq1 Len: 16 Check: 9864 Weight: 1.00
Name: seq2 Len: 16 Check: 9672 Weight: 1.00
//
12. Abstract Syntax Notation (ASN.1) format
Bioseq-set ::= {
  seq-set {
    seq {
      id { local id 1 },
      descr { title "seq1" },
      inst {
        repr raw, mol dna, length 16, topology linear,
        seq-data
          iupacna "agctagctagctagct"
      } } ,
    seq {
      id { local id 2 },
      descr { title "seq2" },
      inst {
        repr raw, mol dna, length 16, topology linear,
        seq-data
          iupacna "aactaactaactaact"
      } } ,
  } }
13. NEXUS format used by the phylogenetic analysis program PAUP by David Swofford
#NEXUS
[/tmp/readseq.in.2506 -- data title]
[Name: seq1 Len: 16 Check: 2688]
[Name: seq2 Len: 16 Check: 25C8]
begin data;
dimensions ntax=2 nchar=16;
format datatype=dna interleave missing=-;
matrix
seq1 agctagctagctagct
seq2 aactaactaactaact
Two sequences in FASTA multiple sequence format (1) were used as input for the remainder of the for-
mat options (2–14).
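To make the layouts in Table 2.4 concrete, the following Python sketch writes the same two sequences in a PHYLIP-style layout (entry 9) and a simple, non-interleaved NEXUS data block (entry 13). It is an illustration only and not READSEQ code: the function names are hypothetical, the 10-character name field and the grouping of residues in tens are assumptions read off the table, and the NEXUS header lines that READSEQ writes as comments are omitted.

```python
from typing import Dict


def _aligned_length(records: Dict[str, str]) -> int:
    """Return the common sequence length, or fail if the sequences differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be present and of equal length")
    return lengths.pop()


def _group(seq: str, size: int = 10) -> str:
    """Insert a space after every `size` residues, as in the table."""
    return " ".join(seq[i:i + size] for i in range(0, len(seq), size))


def to_phylip(records: Dict[str, str]) -> str:
    """First line: number of sequences and alignment length; then one row per sequence."""
    length = _aligned_length(records)
    lines = [f"{len(records)} {length}"]
    for name, seq in records.items():
        # Assumption: names are truncated or padded to 10 characters.
        lines.append(f"{name[:10]:<10}{_group(seq)}")
    return "\n".join(lines) + "\n"


def to_nexus(records: Dict[str, str], datatype: str = "dna") -> str:
    """Write a minimal NEXUS data block with one row per sequence."""
    length = _aligned_length(records)
    lines = [
        "#NEXUS",
        "begin data;",
        f"dimensions ntax={len(records)} nchar={length};",
        f"format datatype={datatype} missing=-;",
        "matrix",
    ]
    lines += [f"{name} {seq}" for name, seq in records.items()]
    lines += [";", "endblock;"]
    return "\n".join(lines) + "\n"


records = {"seq1": "agctagctagctagct", "seq2": "aactaactaactaact"}
print(to_phylip(records))
print(to_nexus(records))
```

Both writers refuse input whose sequences differ in length, because these formats describe an alignment rather than a collection of unrelated sequences.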

GCG Programs for Conversion of Sequence Formats
The “from” programs convert sequence files from GCG format into the named format,
and the “to” programs convert the alternative format into GCG format. Shown are the
actual program names, no spaces included. There are no programs to convert to GenBank
and EMBL formats.
FROMEMBL
FROMFASTA
FROMGENBANK
FROMIG
FROMPIR
FROMSTADEN
TOFASTA
TOIG
TOPIR
TOSTADEN
In addition, the GCG programs include the following sequence formatting programs: (1)
GETSEQ, which converts a simple ASCII file being received from a remote PC to GCG for-
mat; (2) REFORMAT, which will format a GCG file that has been edited, and will also per-
form other functions; and (3) SPEW, which sends a GCG sequence file as an ASCII file to
a remote PC.
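The GCG-format entries in Table 2.4 carry a "Check:" value that lets a program detect an altered sequence. The text does not describe how this value is computed; the sketch below uses the position-weighted sum commonly described for GCG files, which is an assumption here, although it does reproduce the values printed in Table 2.4 for seq1 and seq2. The function names are illustrative.

```python
from typing import Iterable


def gcg_checksum(sequence: str) -> int:
    """Position-weighted checksum: weights cycle 1..57, total taken modulo 10000."""
    total = 0
    for position, residue in enumerate(sequence):
        total += ((position % 57) + 1) * ord(residue.upper())
    return total % 10000


def msf_checksum(sequences: Iterable[str]) -> int:
    """Assumed file-level check for an MSF file: sum of the per-sequence checks, modulo 10000."""
    return sum(gcg_checksum(seq) for seq in sequences) % 10000


print(gcg_checksum("agctagctagctagct"))  # 9864, as listed for seq1 in Table 2.4
print(gcg_checksum("aactaactaactaact"))  # 9672, as listed for seq2 in Table 2.4
print(msf_checksum(["agctagctagctagct", "aactaactaactaact"]))  # 9536, the MSF file check
```

Because each residue is weighted by its position, swapping two different residues changes the result, which a plain sum of character codes would not detect.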
MULTIPLE SEQUENCE FORMATS
Most of the sequence formats listed above can be used to store multiple sequences in tan-
dem in the same computer file. Exceptions are the GCG and raw sequence formats, which
are designed only for single sequences. GCG has an alternative multiple sequence format,
which is described below. In addition, there are formats especially designed for multiple
sequences that can also be used to show their alignments or to perform types of multiple
sequence analyses such as phylogenetic analysis. In the case of PAUP, the program will
accept MSA format and convert to the NEXUS format. These formats are illustrated below
using the same two short sequences.
1. Aligned sequences in FASTA format. The aligned sequence characters occupy the same
line and column, and gaps are indicated by a dash.
>gi|730305|
MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFL
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
>gi|404390|
----------------------APEAQVSVQPNFQPDKFL
RTQTPRAELKEKFTAFCKAQGFTEDSIVFLPQTDKCMTEQ
>gi|895868
MAALRMLWMGLVLLGLLGFPQTPAQGHDTVQPNFQQDKFL
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE-
represents the same alignment as:
MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFL
----------------------APEAQVSVQPNFQPDKFL
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE-

2. GCG multiple sequence format (MSF) produced by the GCG multiple sequence alignment
program PILEUP. The gap symbol is “~”. The length indicated is the length of the
alignment, which is the length of the longest sequence including gaps.
PileUp of: @list4
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 12
GapLengthWeight: 4
list4.msf MSF: 883 Type: P February 28, 1997 16:42 Check: 482
Name: haywire   Len: 883 Check: 3979 Weight: 1.00
Name: xpb-human Len: 883 Check: 9129 Weight: 1.00
Name: rad25     Len: 883 Check: 5359 Weight: 1.00
Name: xpb-ara   Len: 883 Check: 2015 Weight: 1.00
//
          1                                                   50
haywire   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MGPPK
xpb-human ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
rad25     MTDVEGYQPK SKGKIFPDMG ESFFSSDEDS PATDAEIDEN YDDNRETSEG
xpb-ara   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
          51                                                 100
haywire   KSRKDRSG.. GDKFGKKRRA EDEAFTQLVD DNDSLDATES EGIPGAASKN
xpb-human MGKRDRAD.. RDKKKSRKRH YED...EEDD EEDAPGNDPQ EAVPSAAGKQ
rad25     RGERDTGAMV TGLKKPRKKT KSSRHTAADS SMNQMDAKDK ALLQDTNSDI
xpb-ara   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~M KYGGKDDQKM KNIQNAEDYY
.
.
.
3. ALN form produced by multiple sequence alignment program CLUSTALW (Thompson
et al. 1994). In addition to the alignment position, the program also shows the current
sequence position at the end of each row.
Page 1.1
             1            15 16           30 31           45
1 gi|730305| MATHHTLWMGLALLG VLGDLQAAPEAQVSV QPNFQQDKFLGRWFS 45
2 gi|404390| --------------- -------APEAQVSV QPNFQPDKFLGRWFS 23
3 gi|895868  MAALRMLWMGLVLLG LLGFPQTPAQGHDTV QPNFQQDKFLGRWYS 45
4. Blocked alignment used by GDE and GCG SEQLAB (Fig. 2.14). Unlike the other examples
shown, which are all simple text files of an alignment, the following figure is a
screen display of an alignment, using GDE and SEQLAB display programs. The underlying
alignment in text format would be similar to the GCG multiple sequence alignment
file shown above.

Figure 2.14. A multiple sequence alignment editor for GCG MSF files. For information on using
multiple sequence alignment editors and for examples of other editors, see Chapter 4.

5. Format used by Fitch phylogenetic analysis programs.
agc tag cta gct agc t
aac taa cta act aac t
6. Formats used by Felsenstein phylogenetic analysis programs PHYLIP (phylogenetic
inference package): 2 for two sequences, 16 for length of alignment.
a. version 3.2
2 16 YF
b. versions 3.3 and 3.4
2 16
7. Format used by phylogenetic analysis program PAUP (phylogenetic analysis using
parsimony). ntax is number of taxa, nchar is the length of the alignment, and interleave
allows the alignment to be shown in readable blocks. The other terms describe the type
of sequence and the character used to indicate gaps.
#NEXUS
[ comments ]
begin data;
dimensions ntax=4 nchar=100;
format datatype=protein interleave gap=-;
matrix
[ 1 50]
haywire   ---------- ---------- ---------- ---------- -----MGPPK
xpb-human ---------- ---------- ---------- ---------- ----------
rad25     MTDVEGYQPK SKGKIFPDMG ESFFSSDEDS PATDAEIDEN YDDNRETSEG
xpb-ara   ---------- ---------- ---------- ---------- ----------
[ 51 100]
haywire   KSRKDRSG-- GDKFGKKRRA EDEAFTQLVD DNDSLDATES EGIPGAASKN
xpb-human MGKRDRAD-- RDKKKSRKRH YED---EEDD EEDAPGNDPQ EAVPSAAGKQ
rad25     RGERDTGAMV TGLKKPRKKT KSSRHTAADS SMNQMDAKDK ALLQDTNSDI
xpb-ara   ---------- ---------- ---------M KYGGKDDQKM KNIQNAEDYY
;
endblock;
8. The Selex format used by hidden Markov program HMMER by Sean Eddy has been
used to keep track of the alignment of small RNA molecules.
Each line contains a name, followed by the aligned sequence. A space, dash, underscore,
or period denotes a gap. Long alignments are split into multiple blocks and interleaved or
separated by blank lines. The number of sequences, their order, and their names must be
the same in every block, and every sequence must be represented in each block even if
no residues are present.
# Example selex file

seq1 ACGACGACGACG.
seq2 ..GGGAAAGG.GA
seq3 UUU..AAAUUU.A

seq1 ..ACG
seq2 AAGGG
seq3 AA...UUU
9. The block multiple sequence alignment format (see http://www.blocks.fhcrc.org/).
ID contains a short identifier for the group of sequences from which the block was made,
often the original Prosite group ID. The identifier is terminated by a semicolon, and
“BLOCK” indicates the entry type.
AC contains the block number, a seven-character group number for sequences from
which the block was made, followed by a letter (A–Z) indicating the order of the block in
the sequences. The block number is a 5-digit number preceded by BL (BLOCKS database)
or PR (PRINTS database). (min,max) gives the minimum and maximum number of amino
acids from the previous block or from the sequence start. DE describes sequences from which
the block was made. BL contains information about the block: xxx is the amino acids in the
spaced triplet found by MOTIF upon which the block is based. w is the width of the
sequence segments (columns) in the block. s is the number of sequence segments (rows)
in the block. Other values (n1, n2) describe statistical features of the block. Sequence_id is
a list of sequences. Each sequence line contains a sequence identifier, the offset from the
beginning of the sequence to the block in parentheses, the sequence segment, and a weight
for the segment.
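The field layout described for the block format can be read with a few lines of Python. The sketch below is a minimal illustration, not software from the BLOCKS site: the names Segment and parse_block are hypothetical, the weight is assumed to be an integer as in the chapter's example entry, and only the ID, AC, and BL width fields are extracted from the header lines.

```python
import re
from typing import Dict, NamedTuple


class Segment(NamedTuple):
    seq_id: str     # sequence identifier
    offset: int     # offset from the start of the sequence to the block
    residues: str   # the aligned sequence segment
    weight: int     # segment weight (assumed to be an integer)


# identifier, "(offset)", segment, weight
_SEGMENT_RE = re.compile(r"^(\S+)\s*\(\s*(\d+)\)\s+(\S+)\s+(\d+)\s*$")
_WIDTH_RE = re.compile(r"width=(\d+)")


def parse_block(text: str) -> Dict[str, object]:
    """Parse one block entry into its identifier, accession, width, and segments."""
    block: Dict[str, object] = {"id": None, "ac": None, "width": None, "segments": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "//":
            continue
        if line.startswith("ID "):
            block["id"] = line[3:].split(";")[0].strip()
        elif line.startswith("AC "):
            block["ac"] = line[3:].split(";")[0].strip()
        elif line.startswith("DE "):
            continue
        elif line.startswith("BL "):
            match = _WIDTH_RE.search(line)
            if match:
                block["width"] = int(match.group(1))
        else:
            match = _SEGMENT_RE.match(line)
            if match is None:
                raise ValueError(f"unrecognized line: {line!r}")
            segment = Segment(match.group(1), int(match.group(2)),
                              match.group(3), int(match.group(4)))
            # Every segment must be as wide as the block (the w in width=w).
            if block["width"] is not None and len(segment.residues) != block["width"]:
                raise ValueError(f"segment width mismatch for {segment.seq_id}")
            block["segments"].append(segment)
    return block


EXAMPLE = """ID GLU_CARBOXYLATION; BLOCK
AC BL00011; distance from previous block=(1,64)
DE Vitamin K-dependent carboxylation domain proteins.
BL ECA motif; width=40; seqs=34; 99.5%=1833; strength=1412
FA10_BOVIN ( 45) LEEVKQGNLERECLEEACSLEEAREVFEDAEQTDEFWSKY 31
FA7_BOVIN ( 5) LEELLPGSLERECREELCSFEEAHEIFRNEERTRQFWVSY 57
//
"""
parsed = parse_block(EXAMPLE)
print(parsed["id"], parsed["ac"], parsed["width"], len(parsed["segments"]))
```

Checking each segment against the declared width catches truncated or wrapped lines early, which is the most common way such a file is damaged in transfer.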
ID short_identifier; BLOCK
AC block_number; distance from previous block = (min,max)
DE description
BL xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
sequence_id (offset) sequence_segment sequence_weight.
//
ID GLU_CARBOXYLATION; BLOCK
AC BL00011; distance from previous block=(1,64)
DE Vitamin K-dependent carboxylation domain proteins.
BL ECA motif; width=40; seqs=34; 99.5%=1833; strength=1412
FA10_BOVIN ( 45) LEEVKQGNLERECLEEACSLEEAREVFEDAEQTDEFWSKY 31
FA10_CHICK ( 45) LEEMKQGNIERECNEERCSKEEAREAFEDNEKTEEFWNIY 46
FA10_HUMAN ( 45) LEEMKKGHLERECMEETCSYEEAREVFEDSDKTNEFWNKY 33
FA7_BOVIN ( 5) LEELLPGSLERECREELCSFEEAHEIFRNEERTRQFWVSY 57
FA7_HUMAN ( 65) LEELRPGSLERECKEEQCSFEEAREIFKDAERTKLFWISY 42
OSTC_CHICK ( 6) SGVAGAPPNPIEAQREVCELSPDCNELADELGFQEAYQRR 94
//

STORAGE OF INFORMATION IN A SEQUENCE DATABASE
As shown by the above examples, each DNA or protein sequence database entry contains much
information, including an assigned accession number(s); source organism; name of locus;
reference(s); keywords that apply to sequence; features in the sequence such as coding
regions, intron splice sites, and mutations; and finally the sequence itself. The above
information is organized into a tabular form very much like that found in a relational database.
(Additional information about databases is given in the box “Database Types.”) If one
imagines a large table with each sequence entry occupying one row, then each column will
include one of the above types of information for each sequence, and each column is called
a FIELD (see Fig. 2.6). The last column contains the sequences themselves. It is very easy
to make an index of the information in each of these fields so that a search query can locate
all the occurrences through the index. Even related sequences are cross-referenced. In
addition, the information in one database can be cross-referenced to that in another
database. The DNA, protein, and reference databases have all been cross-referenced so that
moving between them is readily accomplished (see ENTREZ section below, p. 45).

Database Types
There are several types of databases; the two principal types are the relational and
object-oriented databases. The relational database orders data in tables made up of
rows giving specific items in the database, and columns giving the features as
attributes of those items. These tables are carefully indexed and cross-referenced with
each other, sometimes using additional tables, so that each item in the database has a
unique set of identifying features. A relational model for the GenBank sequence
database has been devised at the National Center for Genome Resources
(http://www.ncgr.org/research/sequence/schema.html).
The object-oriented database structure has been useful in the development of
biological databases. The objects, such as genetic maps, genes, or proteins, each have an
associated set of utilities for analysis and display of the object and a set of attributes
such as identifying name or references. In developing the database, relationships
among these objects are identified. To standardize some commonly arising objects in
biological databases, e.g., maps, the Object Management Group (http://www.omg.org)
has formed a Life Science Research Group. The Life Science Research Group is a
consortium of commercial companies, academic institutions, and software vendors that is
trying to establish standards for displaying biological information from bioinformatics
and genomics analyses (http://www.omg.org/homepages/lsr). The Common Object Request
Broker Architecture (CORBA) is the Object Management Group’s interface for objects that
allows different computer applications to communicate with each other through a common
language, Interface Definition Language (IDL). To plan an object-oriented database by
defining the classes of objects and the relationships among these objects, a specific set
of procedures called the Unified Modeling Language (UML) has been devised by the OMG group.

DNA sequence analysis software packages often include sequence databases that are
updated regularly. The organizations that manage sequence databases also provide public
access through the internet. Using a browser such as Netscape or Explorer on a local
personal computer, these sites may be visited through the internet and a form can be filled out
with the sequence name. Once the correct sequence has been identified, the sequence is
delivered to the browser and may be saved as a local computer file, cut-and-pasted from
the browser window into another window of an analysis program or editor, or even pasted
into another browser page for analysis at a second Web site. A useful feature of browser
programs for sequence analysis is the capability of having more than one browser window
running at a time. Hence, one browser window may retrieve sequences from a
database and a second may analyze these sequences. At the time of retrieving the sequence,
several sequence formats may be available. The FASTA format, which is readily converted
into other formats and also is smaller and simpler, containing just a line of sequence
identifiers followed by the sequence without numbers, is very useful for this purpose. A list of
sequence databases accessible through the internet is provided in Table 2.5.
USING THE DATABASE ACCESS PROGRAM ENTREZ
One straightforward way to access the sequence databases is through ENTREZ, a resource
prepared by the staff of the National Center for Biotechnology Information, National
Library of Medicine, Bethesda, Maryland, and available through their web site at
http://ncbi.nlm.nih.gov/Entrez. ENTREZ provides a series of forms that can be filled out
to retrieve a DNA or protein sequence, or a Medline reference related to the molecular
biology sequence databases. After a search for either a protein or a DNA sequence is chosen
at the above address, another Web page is provided with a form to fill out for the search,
as shown in Figure 2.15.

On the ENTREZ form, make a selection in the data entry window after the term
“Search,” then enter search terms in the longer data entry window after “for.” The database
will be searched for sequence database entries that contain all of these terms or related
ones. Using Boolean logic, the search looks for database entries that include the first term
AND the second term AND each subsequent term through the last. The “Limits” link on
the ENTREZ form page is used to limit the GenBank field to be searched, and various
logical combinations of search terms may be designed by this method. These fields refer to the
GenBank fields described above in Figure 2.5. When searching for terms in a particular
field, some knowledge of the terms that are in the database can be helpful. To assist in
finding suitable terms, ENTREZ provides a list of index entries for each field.
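The combination of a per-field index and an AND search can be illustrated with a toy example in Python. This is a sketch of the general idea only and does not represent how ENTREZ is implemented: the entries, accession strings, field names, and function names below are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


def build_index(entries: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Map field -> lowercased word -> set of accessions whose field contains the word."""
    index: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for accession, fields in entries.items():
        for field, value in fields.items():
            for word in value.lower().split():
                index[field][word].add(accession)
    return index


def search_all(index: Dict[str, Dict[str, Set[str]]],
               terms: Iterable[str],
               field: Optional[str] = None) -> Set[str]:
    """Return entries that contain every term (logical AND), optionally within one field."""
    result: Optional[Set[str]] = None
    for term in terms:
        term = term.lower()
        if field is None:
            hits: Set[str] = set()
            for words in index.values():
                hits |= words.get(term, set())
        else:
            hits = set(index.get(field, {}).get(term, set()))
        # Intersecting the hit lists implements the AND between terms.
        result = hits if result is None else result & hits
    return result or set()


# Hypothetical entries; the accessions are placeholders, not real database records.
entries = {
    "ACC0001": {"organism": "Homo sapiens", "protein": "retinol binding protein"},
    "ACC0002": {"organism": "Mus musculus", "protein": "retinol binding protein"},
    "ACC0003": {"organism": "Homo sapiens", "protein": "hemoglobin beta"},
}
index = build_index(entries)
print(search_all(index, ["retinol", "sapiens"]))
print(sorted(search_all(index, ["sapiens"], field="organism")))
```

Restricting a term to one field, as the “Limits” option does, simply means consulting only that field's part of the index, so a word that appears in another field of the same entry no longer produces a match.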
For a protein search, for example, current choices for fields include accession (number),
all fields, author name, E. C. number, issue, journal name, keyword, modification date,
organism, page number, primary accession (number), properties, protein name, publica-
tion date (of reference), seqID string, sequence length, substance name, text word, title
word, volume, and sequence ID. Similar fields are shown for the DNA database search.
Later, the results of searches in separate fields may be combined to narrow down the
choices. The number of terms to be searched for and the field to be searched are the main
decisions to be made. In doing so, keep in mind that it is important to be as specific as pos-
sible, or else there may be a great many possibilities. Thus, knowing accession number,
protein name, or name of gene should be enough to find the required entry quickly. If the
same protein has been sequenced in several organisms, providing an organism name is also
helpful. When the chosen search terms and fields have been decided and submitted, a
database comprising all of the currently available sequences (called the nonredundant or
NR database) will be searched. Other database selections may also be made.
The program returns the number of matches found and provides an opportunity to narrow
this list by including more terms. When the number of matching sequences has been
narrowed to a reasonable number, the sequence may be retrieved in a chosen format in
several straightforward steps. It is important to look through the sequences to locate the
one intended. There may be several different copies of the sequence because it may have
been sequenced from more than one organism, or the sequence may be a mutant sequence,
a particular clone, or a fragment. There is no simple way to find the correct sequence
without manually checking the information provided in each sequence, but this usually takes
only a short time. Before leaving ENTREZ, it is often useful to check for sequence database
entries that are similar to the one of interest, called “neighbors” by ENTREZ. The expanded
query searches other database entries of interest, such as the same protein in another
organism, a large chromosomal sequence that includes the gene, or members of the same
gene family. While visiting the site, note that ENTREZ has been adapted to search through
a number of other biological databases, and also through Medline, and these searches are
available from the initial ENTREZ Web page.

Table 2.5. Major sequence databases accessible through the internet
1. GenBank at the National Center for Biotechnology Information, National Library of Medicine,
Bethesda, Maryland, accessible from:
http://www.ncbi.nlm.nih.gov/Entrez
2. European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England
http://www.ebi.ac.uk/embl/index.html
3. DNA DataBank of Japan (DDBJ) at Mishima, Japan
http://www.ddbj.nig.ac.jp/
4. Protein Information Resource (PIR) database at the National Biomedical Research Foundation in
Washington, DC (see Barker et al. 1998)
http://www-nbrf.georgetown.edu/pirwww/
5. The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research
in Epalinges/Lausanne
http://www.expasy.ch/cgi-bin/sprot-search-de
6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute allows both simple and
complex concurrent searches of one or more sequence databases. The SRS system may also be used on
a local machine to assist in the preparation of local sequence databases.
http://srs6.ebi.ac.uk
The databases are available at the indicated addresses and return sequence files through an internet
browser. Many of the sites shown provide access to multiple databases. The first three database centers
are updated daily and exchange new sequences daily, so that it is only necessary to access one of them.
Additional Web addresses of databases of protein families and structure, and genomic databases, are
given in Chapter 9. These databases can also provide access to sequence of a protein family or organism.

Biological databases are beginning to use “controlled vocabularies” for entering data so that
these defined terms can confidently be used for subsequent database searches.

Figure 2.15. ENTREZ Web form for protein database search. The window shown is from the protein database search option
at http://www.ncbi.nlm.nih.gov/Entrez/. The search term input window is activated by clicking, one or more search terms are
typed, and the “Go” button is clicked (top window). Batch ENTREZ, available from the main ENTREZ Web page, provides
a method for retrieving large numbers of sequences at the same time. A particular field (e.g., gene name, organism, protein
name) in the GenBank entry can also be searched, by using the “Limits” option. The request is then sent to a server in which
all key words in the sequence entries have been indexed, as in looking up a word in the index of a book. GenBank entries with
all of the requested terms can be readily identified because the index will indicate in which entry they are all found. The
machine returns the number of matches found. Clicking on the retrieve button leads to a list of the found items. Those items
chosen are retrieved in a new window format.
Retrieving a Specific Sequence
Even following the above instructions, it can be difficult to retrieve the sequence of a
specific gene or protein simply because of the sheer number of sequences in the GenBank
database and the complex problem of indexing them. For projects that require
the most currently available sequences, the NR databases should be searched. Other
projects may benefit from the availability of better curated and annotated protein
sequence databases, including PIR and SwissProt. The genomic databases described
in Chapter 10 can also provide the sequence of a particular gene or protein. Protein
sequences in the Genpro database are generated by automatic translation of DNA
sequences. When read from cDNA copies of mRNA sequences, they provide a reliable
sequence, given a certain amount of uncertainty as to the translational start site.
Many protein sequences are now predicted by translation of genomic sequences,
requiring a prediction of exons, a somewhat error-prone step described in more
detail in Chapter 8. The origin of protein sequence entries thus needs to be determined,
and if they are not from a cDNA sequence, it may be necessary to obtain and
sequence a cDNA copy of the gene.

REFERENCES
Barker W.C., Garavelli J.S., Haft D.H., Hunt L.T., Marzec C.R., Orcutt B.C., Srinivasarao G.Y., Yeh
L.-S.L., Ledley R.S., Mewes H.-W., Pfeiffer F., and Tsugita A. 1998. The PIR-International Protein
Sequence Database. Nucleic Acids Res. 26: 27–32.
Ewing B. and Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error
probabilities. Genome Res. 8: 186–194.
Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F., Kerlavage A.R., Bult C.J., Tomb
J.F., Dougherty B.A., Merrick J.M., et al. 1995. Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science 269: 496–512.
Gordon D., Abajian C., and Green P. 1998. Consed: A graphical tool for sequence finishing. Genome Res.
8: 195–202.
Green P. 1997. Against a whole-genome shotgun. Genome Res. 7: 410–417.
IUPAC-IUB: Commission on Biochemical Nomenclature. 1969. A one-letter notation for amino acid
sequences. Tentative rules. Biochem. J. 113: 1–4.
IUPAC-IUB: Commission on Biochemical Nomenclature. 1972. Symbols for amino-acid derivatives and
peptides. Recommendations 1971. J. Biol. Chem. 247: 977–983.
IUPAC-IUB: Joint Commission on Biochemical Nomenclature (JCBN). 1983. Nomenclature and
symbolism for amino acids and peptides. Corrections to recommendations. Eur. J. Biochem. 213: 2.
Myers E.W. 1997. Is whole genome sequencing feasible? In Computational methods in genome research
(ed. S. Suhai). Plenum Press, New York.
Myers E.W., Sutton G.G., Delcher A.L., Dew I.M., Fasulo D.P., Flanigan M.J., Kravitz S.A., Mobarry
C.M., Reinert K.H.J., Remington K.A., et al. 2000. A whole-genome assembly of Drosophila. Science
287: 2196–2204.
NCBI: National Center for Biotechnology Information. 1993. Manual for NCBI Software Development
Tool Kit Version 1.8. August 1, 1993. National Library of Medicine, National Institutes of Health.
NC-IUB: Nomenclature Committee of the International Union of Biochemistry. 1984. Nomenclature
for incompletely specified bases in nucleic acid sequences. Recommendations. Eur. J. Biochem. 150:
1–5.
Smith S.W., Overbeek R., Woese C.R., Gilbert W., and Gillevet P.M. 1994. The genetic data
environment: An expandable GUI for multiple sequence analysis. Comput. Appl. Biosci. 10: 671–675.
Staden R., Beal K.F., and Bonfield J.K. 2000. The Staden package, 1998. Methods Mol. Biol. 132: 115–130.
Thompson J.D., Higgins D.G., and Gibson T.J. 1994. CLUSTAL W: Improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties
and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.


undergoes, under the charge of guardians in charge of her
chastisement, the severe punishments she has incurred.
PROMPT FLIGHT HERE BELOW LEAVES THE SOUL
UNHARMED BY HER STAY HERE.
Thus, although the soul have a divine nature (or being),
though she originate in the intelligible world, she enters into a body.
Being a lower divinity, she descends here below by a voluntary
inclination, for the purpose of developing her power, and to adorn
what is below her. If she flee promptly from here below, she does
not need to regret having become acquainted with evil, and knowing
the nature of vice,177 nor having had the opportunity of manifesting
her faculties, and to manifest her activities and deeds. Indeed, the
faculties of the soul would be useless if they slumbered continuously
in incorporeal being without ever becoming actualized. The soul
herself would ignore what she possesses if her faculties did not
manifest by procession, for everywhere it is the actualization that
manifests the potentiality. Otherwise, the latter would be completely
hidden and obscured; or rather, it would not really exist, and would
not possess any reality. It is the variety of sense-effects which
illustrates the greatness of the intelligible principle, whose nature
publishes itself by the beauty of its works.
CONTINUOUS PROCESSION NECESSARY TO THE
SUPREME.
6. Unity was not to exist alone; for if unity remained self-
enclosed, all things would remain hidden in unity without having any
form, and no beings would achieve existence. Consequently, even if
constituted by beings born of unity, plurality would not exist, unless
the inferior natures, by their rank destined to be souls, issued from
those beings by the way of procession. Likewise, it was not sufficient

for souls to exist, they also had to reveal what they were capable of
begetting. It is likewise natural for each essence to produce
something beneath it, to draw it out from itself by a development
similar to that of a seed, a development in which an indivisible
principle proceeds to the production of a sense-object, and where
that which precedes remains in its own place at the same time as it
begets that which follows by an inexpressible power, which is
essential to intelligible natures. Now as this power was not to be
stopped or circumscribed in its actions by jealousy, there was need
of a continuous procession until, from degree to degree, all things
had descended to the extreme limits of what was possible;178 for it
is the characteristic of an inexhaustible power to communicate all its
gifts to everything, and not to permit any of them to be disinherited,
since there is nothing which hinders any of them from participating
in the nature of the Good in the measure that it is capable of doing
so. Since matter has existed from all eternity, it was impossible that
from the time since it existed, it should not participate in that which
communicates goodness to all things according to their receptivity
thereof.179 If the generation of matter were the necessary
consequence of anterior principles, still it must not be entirely
deprived of the good by its primitive impotence, when the cause
which gratuitously communicated being to it remained
self-enclosed.
SENSE-OBJECTS ARE NECESSARY AS REVEALERS OF
THE ETERNAL.
The excellence, power and goodness of intelligible (essences)
are therefore revealed by sense-objects; and there is an eternal
connection between intelligible (entities) that are self-existent, and
sense-objects, which eternally derive their existence therefrom by
participation, and which imitate intelligible nature to the extent of
their ability.

THE SOUL'S NATURE IS OF AN INTERMEDIATE
KIND.
7. As there are two kinds of being (or, existence), one of
sensation, and the other intelligible, it is preferable for the soul to
live in the intelligible world; nevertheless, as a result of her nature, it
is necessary for her also to participate in sense-affairs.180 Since she
occupies only an intermediate rank, she must not feel wronged at
not being the best of beings.181 Though on one hand her condition
be divine, on the other she is located on the limits of the intelligible
world, because of her affinity for sense-nature. She causes this
nature to participate in her powers, and she even receives
something therefrom, when, instead of managing the body without
compromising her own security, she permits herself to be carried
away by her own inclination to penetrate profoundly within it,
ceasing her complete union with the universal Soul. Besides, the
soul can rise above the body after having learned to feel how happy
one is to dwell on high, by the experience of things seen and
suffered here below, and after having appreciated the true Good by
the comparison of contraries. Indeed the knowledge of the good
becomes clearer by the experience of evil, especially among souls
which are not strong enough to know evil before having experienced
it.182
THE PROCESSION OF INTELLIGENCE IS AN
EXCURSION DOWNWARDS AND UPWARDS.
The procession of intelligence consists in descending to things
that occupy the lowest rank, and which have an inferior nature,183
for Intelligence could not rise to the superior Nature. Obliged to act
outside of itself, and not being able to remain self-enclosed, by a
necessity and by a law of its nature, intelligence must advance unto
the soul where it stops; then, after having communicated of itself to
that which immediately follows it, intelligence must return to the
intelligible world. Likewise, the soul has a double action in her
double relation with what is below and above her. By her first action,
the soul manages the body to which she is united; by the second,
she contemplates the intelligible entities. These alternatives work
out, for individual souls, with the course of time; and finally there
occurs a conversion which brings them back from the lower to the
higher natures.
THE UNIVERSAL SOUL, HOWEVER, IS NOT
DISTURBED BY THE URGENCIES BELOW HER.
The universal Soul, however, does not need to busy herself with
troublesome functions, and remains out of the reach of evils. She
considers what is below her in a purely contemplative manner, while
at the same time remaining related to what is above her. She is
therefore enabled simultaneously on one side to receive, and on the
other to give, since her nature compels her to relate herself closely
with the objects of sense.184
THE SOUL DOES NOT ENTIRELY ENTER INTO THE
BODY.
8. Though I should set myself in opposition to popular views, I
shall set down clearly what seems to me the true state of affairs.
Not the whole soul enters into the body. By her higher part, she ever
remains united to the intelligible world; as, by her lower part, she
remains united to the sense-world. If this lower part dominates, or
rather, if it be dominated (by sensation) and troubled, it hinders us
from being conscious of what the higher part of the soul
contemplates. Indeed that which is thought impinges on our
consciousness only in case it descends to us, and is felt. In general,
we are conscious of what goes on in every part of the soul only
when it is felt by the entire soul. For instance, appetite, which is the
actualization of lustful desire, is by us cognized only when we
perceive it by the interior sense or by discursive reason, or by both
simultaneously. Every soul has a lower part turned towards the
body, and a higher part turned towards divine Intelligence. The
universal Soul manages the universe by her lower part without any
kind of trouble, because she governs her body not as we do by any
reasoning, but by intelligence, and consequently in a manner entirely
different from that adopted by art. The individual souls, each of
whom administers a part of the universe,185 also have a part that
rises above their body; but they are distracted from thought by
sensation, and by a perception of a number of things which are
contrary to nature, and which come to trouble them, and afflict
them. Indeed, the body that they take care of constitutes but a part
of the universe, is incomplete, and is surrounded by exterior objects.
That is why it has so many needs, why it desires luxuriousness, and
why it is deceived thereby. On the contrary, the higher part of the
soul is insensible to the attraction of these transitory pleasures, and
leads an undisturbed life.

FIFTH ENNEAD, BOOK FOUR.
How What is After the First Proceeds
Therefrom; of the One.
NECESSITY OF THE EXISTENCE OF THE FIRST.
1. Everything that exists after the First is derived therefrom,
either directly or mediately, and constitutes a series of different
orders such that the second can be traced back to the First, the third
to the second, and so forth. Above all beings there must be
Something simple and different from all the rest which would exist in
itself, and which, without ever mingling with anything else, might
nevertheless preside over everything, which might really be the One,
and not that deceptive unity which is only the attribute of essence,
and which would be a principle superior even to being, unreachable
by speech, reason, or science. For if it be not completely simple,
foreign to all complexity and composition, and be not really one, it
could not be a principle. It is sovereignly absolute only because it is
simple and first. For what is not first, is in need of superior things;
what is not simple has need of being constituted by simple things.
The Principle of everything must therefore be one and only. If it
were admitted that there was a second principle of that kind, both
would constitute but a single one. For we do not say that they are
bodies, nor that the One and First is a body; for every body is
composite and begotten, and consequently is not a principle; for a
principle cannot be begotten.186 Therefore, since the principle of
everything cannot be corporeal, because it must be essentially one,
it must be the First.
THE FIRST NECESSARILY BEGETS A SECOND,
WHICH MUST BE PERFECT.
If something after the One exist, it is no more the simple One,
but the multiple One. Whence is this derived? Evidently from the
First, for it could not be supposed that it came from chance; that
would be to admit that the First is not the principle of everything.
How then is the multiple One derived from the First? If the First be
not only perfect, but the most perfect, if it be the first Power, it must
surely, in respect to power, be superior to all the rest, and the other
powers must merely imitate it to the limit of their ability. Now we see
that all that arrives at perfection cannot unfruitfully remain in itself,
but begets and produces. Not only beings capable of choice, but
even those lacking reflection or soul, have a tendency to impart to
other beings what is in them; as, for instance, fire emits heat, snow
emits cold; and plant-juices (dye and soak) into whatever they
happen to touch. All things in nature imitate the First principle by
seeking to achieve immortality by procreation, and by manifestation
of their qualities. How then would He who is sovereignly perfect,
who is the supreme Good, remain absorbed in Himself, as if a
sentiment of jealousy hindered Him from communicating Himself, or
as if He were powerless, though He is the power of everything? How
then would He remain principle of everything? He must therefore
beget something, just as what He begets must in turn beget. There
must therefore be something beneath the First. Now this thing
(which is immediately beneath the First), must be very venerable,
first because it begets everything else, then because it is begotten
by the First, and because it must, as being the Second, outrank and
surpass everything else.

INTELLIGENCE CANNOT BE THE FIRST, AND RANKS
ALL ELSE.
2. If the generating principle were intelligence, what it begot
would have to be inferior to intelligence, and nevertheless
approximate it, and resemble it more than anything else. Now as the
generating principle is superior to intelligence, the first begotten
thing is necessarily intelligence. Why, however, is the generating
principle not intelligence? Because the act of intelligence is thought,
and thought consists in seeing the intelligible; for it is only by its
conversion towards it that intelligence achieves a complete and
perfect existence. In itself, intelligence is only an indeterminate
power to see; only by contemplation of the intelligible does it
achieve the state of being determined. This is the reason of the
saying, The ideas and numbers, that is, intelligence, are born from
the indefinite doubleness, and the One. Consequently, instead of
being simple, intelligence is multiple. It is composed of several
elements; these are doubtless intelligible, but what intelligence sees
is none the less multiple. In any case, intelligence is simultaneously
the object thought, and the thinking subject; it is therefore already
double.
THE FIRST AND SECOND AS HIGHER AND LOWER
INTELLIGIBLE ENTITIES.
But besides this intelligible (entity, namely, intelligence), there is
another (higher) intelligible (the supreme Intelligible, the First). In
what way does the intelligence, thus determined, proceed from the
(First) Intelligible? The Intelligible abides in itself, and has need of
nothing else, while there is a need of something else in that which
sees and thinks (that is, that which thinks has need of contemplating
the supreme Intelligible). But even while remaining within Himself,
the Intelligible (One) is not devoid of sentiment; all things belong to
Him, are in Him, and with Him. Consequently, He has the conception
of Himself, a conception which implies consciousness, and which
consists in eternal repose, and in a thought, but in a thought
different from that of intelligence. If He begets something while
remaining within Himself, He begets it precisely when He is at the
highest point of individuality. It is therefore by remaining in His own
state that He begets what He begets; He procreates by
individualizing. Now as He remains intelligible, what He begets
cannot be anything else than thought; therefore thought, by
existing, and by thinking the Principle whence it is derived (for it
could not think any other object), becomes simultaneously
intelligence and intelligible; but this second intelligible differs from
the first Intelligible from which it proceeds, and of which it is but the
image and the reflection.
THE SECOND IS THE ACTUALIZATION OF THE
POTENTIALITY OF THE FIRST.
But how is an actualization begotten from that self-limited
(intelligible)? We shall have to draw a distinction between an
actualization of being, and an actualization out of the being of each
thing (actualized being, and actualization emanating from being).
Actualized being cannot differ from being, for it is being itself. But
the actualization emanating from being—and everything necessarily
has an actualization of this kind—differs from what produces it. It is
as if with fire: there is a difference between the heat which
constitutes its being, and the heat which radiates exteriorly, while
the fire interiorly realizes the actualization which constitutes its
being, and which makes it preserve its nature. Here also, and far
more so, the First remains in His proper state, and yet
simultaneously, by His inherent perfection, by the actualization which
resides in Him, has been begotten the actualization which, deriving
its existence from so great a power, nay, from supreme Power, has
arrived at, or achieved essence and being. As to the First, He was
above being; for He was the potentiality of all things, already being
all things.
HOW THE FIRST IS ABOVE ALL BEING.
If this (actualization begotten by the First, this external
actualization) be all things, then that (One) is above all things, and
consequently above being. If then (this external actualization) be all
things, and be before all things, it does not occupy the same rank as
the remainder (of all other things); and must, in this respect also, be
superior to being, and consequently also to intelligence; for there is
Something superior to intelligence. Essence is not, as you might say,
dead; it is not devoid of life or thought; for intelligence and essence
are identical. Intelligible entities do not exist before the intelligence
that thinks them, as sense-objects exist before the sensation which
perceives them. Intelligence itself is the things that it thinks, since
their forms are not introduced to them from without. From where
indeed would intelligence receive these forms? Intelligence exists
with the intelligible things; intelligence is identical with them, is one
with them. Reciprocally, intelligible entities do not exist without their
matter (that is, Intelligence).

FOURTH ENNEAD, BOOK NINE.
Whether All Souls Form a Single One?
IF ALL SOULS BE ONE IN THE WORLD-SOUL, WHY
SHOULD THEY NOT TOGETHER FORM ONE?
1. Just as the soul of each animal is one, because she is entirely
present in the whole body, and because she is thus really one,
because she does not have one part in one organ, and some other
part in another; and just as the sense-soul is equally one in all the
beings which feel, and just as the vegetative soul is everywhere
entirely one in each part of the growing plants; why then should
your soul and mine not form a single unity? Why should not all souls
form but a single one? Why should not the universal (Soul) which is
present in all beings, be one because she is not divided in the
manner of a body, being everywhere the same? Why indeed should
the soul in myself form but one, and the universal (Soul) likewise not
be one, similarly, since no more than my own is this universal (Soul)
either material extension, or a body? If both my soul and yours
proceed from the universal (Soul), and if the latter be one, then
should my soul and yours together form but a single one. Or again,
on the supposition that the universal (Soul) and mine proceed from
a single soul, even on this hypothesis would all souls form but a
single one. We shall have to examine in what (this Soul which is but)
one consists.

SOULS MAY NOT FORM A NUMERIC UNITY, BUT MAY
FORM A GENERIC UNITY.
Let us first consider if it may be affirmed that all souls form but
one in the sense in which it is said that the soul of each individual is
one. It seems absurd to maintain that my soul and yours form but
one in this (numerical) sense; for then you would be feeling
simultaneously with my feeling, and you would be virtuous when I
was, and you would have the same desires as I, and not only would
we both have the same sentiments, but even the identical
sentiments of the universal (Soul), so that every sensation felt by me
would have been felt by the entire universe. If in this manner all the
souls form but one, why is one soul reasonable, and the other
unreasonable, why is the one in an animal, and the other in a plant?
On the other hand, if we do not admit that there is a single Soul, we
will not be able to explain the unity of the universe, nor find a single
principle for (human) souls.
THE UNITY OF THE PRINCIPLE OF SEVERAL SOULS
NEED NOT IMPLY THEIR BEING IDENTICAL.
2. In the first place, if the souls of myself and of another man
form but one soul, this does not necessarily imply their being
identical with their principle. Granting the existence of different
beings, the same principle need not experience in each the same
affections. Thus, humanity may equally reside in me, who am in
motion, as in you, who may be at rest, although in me it moves, and
it rests in you. Nevertheless, it is neither absurd nor paradoxical to
insist that the same principle is both in you and in me; and this does
not necessarily make us feel the identical affections. Consider a
single body: it is not the left hand which feels what the right one
does, but the soul which is present in the whole body. To make you
feel the same as I do, our two bodies would have to constitute but a
single one; then, being thus united, our souls would perceive the
same affections. Consider also that the All remains deaf to a
multitude of impressions experienced by the parts of a single and
same organism, and that so much the more as the body is larger.
This is the state of affairs, for instance, with the large whales which
do not feel the impression received in some one part of their body,
because of the smallness of the movement.
SYMPATHY DOES NOT FORCE IDENTITY OF
SENSATION.
It is therefore by no means necessary that when one member of
the universe experiences an affection, the latter be clearly felt by the
All. The existence of sympathy is natural enough, and it could not be
denied; but this does not imply identity of sensation. Nor is it absurd
that our souls, while forming a single one should be virtuous and
vicious, just as it would be possible that the same essence be in
motion in me, but at rest in you. Indeed, the unity that we attribute
to the universal (Soul) does not exclude all multiplicity, such a unity
as befits intelligence. We may however say that (the soul) is
simultaneously unity and plurality, because she participates not only
in divisible essence in the bodies, but also in the indivisible, which
consequently is one. Now, just as the impression perceived by one
of my parts is not necessarily felt all over my body, while that which
happens to the principal organ is felt by all the other parts, likewise,
the impressions that the universe communicates to the individual are
clearer, because usually the parts perceive the same affections as
the All, while it is not evident that the particular affections that we
feel would be also experienced by the Whole.
UNITY OF ALL BEINGS IMPLIED BY SYMPATHY,
LOVE, AND MAGIC ENCHANTMENT.

3. On the other hand, observation teaches us that we
sympathize with each other, that we cannot see the suffering of
another man without sharing it, that we are naturally inclined to
confide in each other, and to love; for love is a fact whose origin is
connected with the question that occupies us. Further, if
enchantments and magic charms mutually attract individuals,
leading distant persons to sympathize, these effects can only be
explained by the unity of soul. (It is well known that) words
pronounced in a low tone of voice (telepathically?) affect a distant
person, and make him hear what is going on at a great distance.
Hence appears the unity of all beings, which demands the unity of
the Soul.
WHAT OF THE DIFFERENCES OF RATIONALITY, IF
THE SOUL BE ONE?
If, however, the Soul be one, why is some one soul reasonable,
another irrational, or some other one merely vegetative? The
indivisible part of the soul consists in reason, which is not divided in
the bodies, while the part of the divisible soul in the bodies (which,
though being one in herself, nevertheless divides herself in the
bodies, because she sheds sentiment everywhere), must be
regarded as another power of the soul (the sensitive power);
likewise, the part which fashions and produces the bodies is still
another power (the vegetative power); nevertheless, this plurality of
powers does not destroy the unity of the soul. For instance, in a
grain of seed there are also several powers; nevertheless this grain
of seed is one, and from this unity is born a multiplicity which forms
a unity.
THE POWERS OF THE SOUL ARE NOT EXERCISED
EVERYWHERE BECAUSE THEY DIFFER.

But why do not all the powers of the soul act everywhere? Now
if we consider the Soul which is one everywhere, we find that
sensation is not similar in all its parts (that is, in all the individual
souls); that reason is not in all (but in certain souls exclusively); and
that the vegetative power is granted to those beings who do not
possess sensation, and that all these powers return to unity when
they separate from the body.
THE BODY'S POWER OF GROWTH IS DERIVED
FROM THE WHOLE, AND THE SOUL; BUT NOT FROM
OUR SOUL.
If, however, the body derive its vegetative power from the Whole
and from this (universal) Soul which is one, why should it not derive
it also from our soul? Because that which is nourished by this power
forms a part of the universe, which possesses sensation only at the
price of suffering. As to the sense-power which rises as far as the
judgment, and which is united to every intelligence, there was no
need for it to form what had already been formed by the Whole, but
it could have given its forms if these forms were not parts of the
Whole which produces them.
THE UNITY OF THE SOULS IS A CONDITION OF
THEIR MULTIPLICITY.
4. Such justifications will preclude surprise at our deriving all
souls from unity. But completeness of treatment demands an
explanation of how all souls are but a single one. Is this due to their
proceeding from a single Soul, or because they all form a single one?
If all proceed from a single one, did this one divide herself, or did
she remain whole, while begetting the multitude of souls? In this
case, how could an essence beget a multitude like her, while herself
remaining undiminished? We shall invoke the help of the divinity (in
solving this problem); and say that the existence of the one single
Soul is the condition of the existence of the multitude of souls, and
that this multitude must proceed from the Soul that is one.
THE SOUL CAN BEGET MANY BECAUSE SHE IS AN
INCORPOREAL ESSENCE.
If the Soul were a body, then would the division of this body
necessarily produce the multitude of souls, and this essence would
be different in its different parts. Nevertheless, as this essence would
be homogeneous, the souls (between which it would divide itself)
would be similar to each other, because they would possess a single
identical form in its totality, but they would differ by their body. If
the essence of these souls consisted in the bodies which would serve
them as subjects, they would be different from each other. If the
essence of these souls consisted in their form, they would, in form,
be but one single form; in other terms, there would be but one same
single soul in a multitude of bodies. Besides, above this soul which
would be one, but which would be spread abroad in the multitude of
bodies, there would be another Soul which would not be spread
abroad in the multitude of bodies; it would be from her that would
proceed the soul which would be the unity in plurality, the multiple
image of the single Soul in a single body, just as a single seal, by
impressing the same figure on a multitude of pieces of wax, would
be distributing this figure in a multitude of impressions. In this case
(if the essence of the soul consisted in her form) the soul would be
something incorporeal, and as she would consist in an affection of
the body, there would be nothing astonishing in that a single quality,
emanating from a single principle, might be in a multitude of
subjects simultaneously. Last, if the essence of the soul consisted in
being both things (being simultaneously a part of a homogeneous
body and an affection of the body), there would be nothing
surprising (if there were a unity of essence in a multitude of
subjects). We have thus shown that the soul is incorporeal, and an
essence; we must now consider the results of this view.
HOW AN ESSENCE CAN BE ONE IN A MULTITUDE OF
SOULS IS ILLUSTRATED BY SEED.
5. How can an essence be single in a multitude of souls? Either
this one essence is entire in all souls, or this one and entire essence
begets all souls while remaining (undiminished) in itself. In either
case, the essence is single. It is the unity to which the individual
souls are related; the essence gives itself to this multitude, and yet
simultaneously the essence does not give itself; it can give of itself
to all individual souls, and nevertheless remain single; it is powerful
enough to pass into all simultaneously, and to be separated from
none; thus its essence remains identical, while being present in a
multitude of souls. This is nothing astonishing; all of science is
entirely in each of its parts, and it begets them without itself ceasing
to remain entire within itself. Likewise, a grain of seed is entire in
each of its parts in which it naturally divides itself; each of its parts
has the same properties as the whole seed; nevertheless the seed
remains entire, without diminution; and if the matter (in which the
seed resides) offer it any cause of division, all the parts will not any
the less form a single unity.
THIS MIRACLE IS EXPLAINED BY THE USE OF THE
CONCEPTION OF POTENTIALITY.
It may be objected that in science a part is not the total science.
Doubtless, the notion which is actualized, and which is studied to the
exclusion of others, because there is special need of it, is only
partially an actualization. Nevertheless, in a latent manner it
potentially comprises all the other notions it implies. Thus, all the
notions are contained in each part of the science, and in this respect
each part is the total science; for what is only partially actualized
(potentially) comprises all the notions of science. Each notion that
one wishes to render explicit is at one's disposition; and this in every
part of the science that is considered; but if it be compared with the
whole science, it seems to be there only potentially. It must not,
however, be thought that the particular notion does not contain
anything of the other notions; in this case, there would be nothing
systematic or scientific about it; it would be nothing more than a
sterile conception. Being a really scientific notion, it potentially
contains all the notions of the science; and the genuine scientist
knows how to discover all its notions in a single one, and how to
develop its consequences. The geometrical expert shows in his
demonstrations how each theorem contains all the preceding ones,
to which he harks back by analysis, and how each theorem leads to
all the following ones, by deduction.
DIFFICULT AS THESE EXPLANATIONS ARE, THEY
ARE CLEAR INTELLIGIBLY.
These truths excite our incredulity, because here below our
reason is weak, and it is confused by the body. In the intelligible
world, however, all the verities are clear, and each is evident, by
itself.

SIXTH ENNEAD, BOOK NINE.
Of the Good and the One.
UNITY NECESSARY TO EXISTENCE OF ALL BEINGS.
1. All beings, both primary and those that are so called on
any pretext whatsoever, are beings only because of their unity. What,
indeed would they be without it? Deprived of their unity, they would
cease to be what they are said to be. No army can exist unless it be
one. So with a choric ballet or a flock. Neither a house nor a ship
can exist without unity; by losing it they would cease to be what
they are.187 So also with continuous quantities which would not exist
without unity. On being divided by losing their unity, they
simultaneously lose their nature. Consider further the bodies of
plants and animals, of which each is a unity. On losing their unity by
being broken up into several parts, they simultaneously lose their
nature. They are no more what they were, they have become new
beings, which themselves exist only so long as they are one. What
effects health in us, is that the parts of our bodies are co-ordinated
in unity. Beauty is formed by the unity of our members. Virtue is our
soul's tendency to unity, and becoming one through the harmony of
her faculties.
THE SOUL MAY IMPART UNITY, BUT IS NOT UNITY.
The soul imparts unity to all things when producing them,
fashioning them, and forming them. Should we, therefore, after
rising to the Soul, say that she not only imparts unity, but herself is
unity in itself? Certainly not. The soul that imparts form and figure to
bodies is not identical with form, and figure. Therefore the soul
imparts unity without being unity. She unifies each of her
productions only by contemplation of the One, just as she produces
man only by contemplating Man-in-himself, although adding to that
idea the implied unity. Each of the things that are called one has
a unity proportionate to its nature (being); so that they
participate in unity more or less according as they share essence188
(being). Thus the soul is something different from unity;
nevertheless, as she exists in a degree higher (than the body), she
participates more in unity, without being unity itself; indeed she is
one, but the unity in her is no more than contingent. There is a
difference between the soul and unity, just as between the body and
unity. A discrete quantity such as a company of dancers, or choric
ballet, is very far from being unity; a continuous quantity
approximates that further; the soul gets still nearer to it, and
participates therein still more. Thus from the fact that the soul could
not exist without being one, the identity between the soul and unity
is suggested. But this may be answered in two ways. First, other
things also possess individual existence because they possess unity,
and nevertheless are not unity itself; as, though the body is not
identical with unity, it also participates in unity. Further, the soul is
manifold as well as one, though she be not composed of parts. She
possesses several faculties, discursive reason, desire, and perception
—all of them faculties joined together by unity as a bond. Doubtless
the soul imparts unity to something else (the body), because she
herself possesses unity; but this unity is by her received from some
other principle (namely, from unity itself).
BEING AND ESSENCE IDENTICAL WITH UNITY.
2. (Aristotle189) suggests that in each of the individual beings
which are one, being is identical with unity. Are not being and
essence identical with unity, in every being and in every essence, in
a manner such that on discovering essence, unity also is discovered?
Is not being in itself unity in itself, so that if being be intelligence,
unity also must be intelligence, as intelligence which, being essence
in the highest degree, is also unity in the first degree, and which,
imparting essence to other things, also imparts unity to them? What
indeed could unity be, apart from essence and being? As man, and
a man are equivalent,190 essence must be identical with unity; or,
unity is the number of everything considered individually; and as one
object joined to another is spoken of as two, so an object alone is
referred to as one.
UNITY IS NOT A NUMBERING DEVICE, BUT IS
IDENTICAL WITH EXISTENCE.
If number belongs to the class of beings, evidently the latter
must include unity also; and we shall have to discover what kind of a
being it is. If unity be no more than a numbering device invented by
the soul, then unity would possess no real existence. But we have
above observed that each object, on losing unity, loses existence
also. We are therefore compelled to investigate whether essence and
unity be identical either when considered in themselves, or in each
individual object.
EVEN UNIVERSAL ESSENCE CONTAINS
MANIFOLDNESS.
If the essence of each thing be manifoldness, and as unity
cannot be manifoldness, unity must differ from essence. Now man,
being both animal and rational, contains a manifoldness of elements
of which unity is the bond. There is therefore a difference between
man and unity; man is divisible, while unity is indivisible. Besides,
universal Essence, containing all essences, is still more manifold.
Therefore it differs from unity; though it does possess unity by
participation. Essence possesses life and intelligence, for it cannot be
considered lifeless; it must therefore be manifold. Besides, if essence
be intelligence, it must in this respect also be manifold, and must be
much more so if it contain forms; for the idea191 is not genuinely
one. Both as individual and general it is rather a number; it is one
only as the world is one.
BESIDES, ABSOLUTE UNITY IS THE FIRST, WHICH
INTELLIGENCE IS NOT.
Besides, Unity in itself is the first of all; but intelligence, forms
and essence are not primary. Every form is manifold and composite,
and consequently must be something posterior; for parts are prior to
the composite they constitute. Nor is intelligence primary, as appears
from the following considerations. For intelligence, existence is
necessarily thought; and the best intelligence, which does not
contemplate exterior objects, must think what is above it; for, on
turning towards itself, it turns towards its principle. On the one hand,
if intelligence be both thinker and thought, it implies duality, and is
not simple or unitary. On the other hand, if intelligence contemplate
some object other than itself, this might be nothing more than some
object better than itself, placed above it. Even if intelligence
contemplate itself simultaneously with what is better than it, even so
intelligence is only of secondary rank. We may indeed admit that the
intelligence which has such a nature enjoys the presence of the
Good, of the First, and that intelligence contemplates the First; but
nevertheless at the same time intelligence is present to itself, and
thinks itself as being all things. Containing such a diversity,
intelligence is far from unity.
UNITY AS ABOVE ALL THINGS, INTELLIGENCE AND
ESSENCE.

Thus Unity is not all things, for if so, it would no longer be unity.
Nor is it Intelligence, for since intelligence is all things, unity too
would be all things. Nor is it essence, since essence also is all things.
UNITY IS DIFFICULT TO ASCERTAIN BECAUSE THE
SOUL IS FEARFUL OF SUCH ABSTRUSE
RESEARCHES.
3. What then is unity? What is its nature? It is not surprising that
this is so difficult to say, when it is difficult to explain even what
essence or form consists of. But, nevertheless, forms are the basis of
our knowledge. Every time that the soul advances towards what is
formless, not being able to understand it because it is indeterminate,
and so to speak has not received the impression of a distinctive
type, the soul withdraws therefrom, fearing she will meet nonentity.
That is why, in the presence of such things she grows troubled, and
descends with pleasure. Then, withdrawing therefrom, she, so to
speak, lets herself fall till she meets some sense-object, on which
she pauses, and recovers; just as the eye which, fatigued by the
contemplation of small objects, gladly turns back to large ones.
When the soul wishes to see by herself, then seeing only because
she is the object that she sees, and, further, being one because she
forms but one with this object, she imagines that what she sought
has escaped, because she herself is not distinct from the object that
she thinks.
THE PATH OF SIMPLIFICATION TO UNITY.
Nevertheless a philosophical study of unity will take the
following course. Since it is Unity that we seek, since it is the
principle of all things, the Good, the First that we consider, those
who will wish to reach it must not withdraw from that which is of
primary rank to decline to what occupies the last, but they must
withdraw their souls from sense-objects, which occupy the last
degree in the scale of existence, to those entities that occupy the
first rank. Such a man will have to free himself from all evil, since he
aspires to rise to the Good. He will rise to the principle that he
possesses within himself. From the manifold that he was he will
again become one. Only under these conditions will he contemplate
the supreme principle, Unity. Thus having become intelligence,
having trusted his soul to intelligence, educating and establishing
her therein, so that with vigilant attention she may grasp all that
intelligence sees, he will, by intelligence, contemplate unity, without
the use of any senses, without mingling any of their perceptions with
the flashes of intelligence. He will contemplate the purest Principle,
through the highest degree of the purest Intelligence. So when a
man applies himself to the contemplation of such a principle and
represents it to himself as a magnitude, or a figure, or even a form,
it is not his intelligence that guides him in this contemplation, for
intelligence is not destined to see such things; it is sensation, or
opinion, the associate of sensation, which is active in him.
Intelligence is only capable of informing us about things within its
sphere.
UNITY AS THE UNIFORM IN ITSELF AND FORMLESS
SUPERFORM.
Intelligence can see both the things that are above it, those
which belong to it, and the things that proceed from it. The things
that belong to intelligence are pure; but they are still less pure and
less simple than the things that are above Intelligence, or rather
than what is above it; this is not Intelligence, and is superior to
Intelligence. Intelligence indeed is essence, while the principle above
it is not essence, but is superior to all beings. Nor is it essence, for
essence has a special form, that of essence, and the One has no
shape, not even an intelligible one. As Unity is the nature that begets all
things, Unity cannot be any of them. It is therefore neither any
particular thing, nor quantity, nor quality, nor intelligence, nor soul,
nor what is movable, nor what is stable; it is neither in place nor
time; but it is the uniform in itself, or rather it is formless, as it is
above all form, above movement and stability. These are my views
about essence and what makes it manifold.192
WHY IT IS NOT STABLE, THOUGH IT DOES NOT
MOVE.
But if it does not move, why does it not possess stability?
Because either of these things, or both together, are suitable to
nothing but essence. Besides, that which possesses stability is stable
through stability, and is not identical with stability itself;
consequently it possesses stability only by accident, and would no
longer remain simple.
BEING A PRIMARY CAUSE, UNITY IS NOTHING
CONTINGENT.
Nor let anybody object that something contingent is attributed to
Unity when we call it the primary cause. It is to ourselves that we
are then attributing contingency, since it is we who are receiving
something from Unity, while Unity remains within itself.
UNITY CANNOT BE DEFINED; WE CAN ONLY REFER
TO IT BY OUR FEELINGS OF IT.
Speaking strictly, we should not say that the One is this or that (that
is, we should not apply any name to it). We can do no more than
turn around it, so to speak, trying to express what we feel (in regard
to it); for at times we approach Unity, and at times withdraw from it
as a result of our uncertainty about it.
