Introduction to bioinformatics Arthur M. Lesk

Introduction to Bioinformatics
Arthur M. Lesk University of Cambridge
In nature's infinite book of secrecy
A little I can read.
- Antony and Cleopatra
OXFORD UNIVERSITY PRESS
Great Clarendon Street,
Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University's objective of excellence in research, scholarship, and education by publishing
worldwide in
Oxford New York
Athens Auckland Bangkok Bogotá Buenos Aires Cape Town
Chennai Dar es Salaam Delhi Florence Hong Kong Istanbul Karachi
Kolkata Kuala Lumpur Madrid Melbourne Mexico City Mumbai Nairobi
Paris São Paulo Shanghai Singapore Taipei Tokyo Toronto Warsaw
with associated companies in Berlin Ibadan
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
Copyright © Arthur M. Lesk, 2002
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2002, reprinted 2002
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly
permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries
concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford
University Press, at the address above
You must not circulate this book in any other binding or cover and you must impose this same condition on
any acquirer
British Library Cataloguing in Publication Data
Data available
Library of Congress Cataloging in Publication Data
Data available
ISBN (Pbk) 0 19 925196 7
Typeset by Newgen Imaging Systems (P) Ltd, Chennai, India
Printed in Great Britain
on acid-free paper by The Bath Press, Bath
Dedicated to Eda, with whom I have merged my genes.

Table of Contents
Introduction to Bioinformatics
Preface
Plan of the Book
Chapter 1 - Introduction
Chapter 2 - Genome Organization and Evolution
Chapter 3 - Archives and Information Retrieval
Chapter 4 - Alignments and Phylogenetic Trees
Chapter 5 - Protein Structure and Drug Discovery
Conclusions

Preface
On 26 June, 2000, the sciences of biology and medicine changed forever. Prime Minister of the United
Kingdom Tony Blair and President of the United States Bill Clinton held a joint press conference, linked via
satellite, to announce the completion of the draft of the Human Genome. The New York Times ran a banner
headline: 'Genetic Code of Human Life is Cracked by Scientists'. The sequence of three billion bases was the
culmination of over a decade of work, during which the goal was always clearly in sight and the only questions
were how fast the technology could progress and how generously the funding would flow. The Box shows
some of the landmarks along the way.
Next to the politicians stood the scientists. John Sulston, Director of The Sanger Centre in the UK, had been a
key player since the beginning of high-throughput sequencing methods. He had grown with the project from its
earliest 'one man and a dog' stages to the current international consortium. In the US, appearing with
President Clinton were Francis Collins, director of the US National Human Genome Research Institute,
representing the US publicly-funded efforts; and J. Craig Venter, President and Chief Scientific Officer of
Celera Genomics Corporation, representing the commercial sector. It is difficult to introduce these two without
thinking, 'In this corner ... and in this corner ...' Although the two sides never actually came to blows, there
was certainly intense competition, and in the later stages a race.
The race was more than an effort to finish first and receive scientific credit for priority. Indeed, it was a race
after which the contestants would be tested not for whether they had taken drugs, but whether they and others
could discover them. Clinical applications were a prime motive for support of the human genome project. Once
the courts had held that gene sequences were patentable - with enormous potential payoffs for drugs based
on them - the commercial sector rushed to submit patents on sets of sequences that they determined, and the
academic groups rushed to place each bit of sequence that they determined into the public domain to prevent
Celera - or anyone else - from applying for patents.
The academic groups lined up against Celera were a collaborating group of laboratories primarily but not
exclusively in the UK and USA. These included The Sanger Centre in England, Washington University in St.
Louis, Missouri, the Whitehead Institute at the Massachusetts Institute of Technology in Cambridge,
Massachusetts, Baylor College of Medicine in Houston, Texas, the Joint Genome Institute at Lawrence
Livermore National Laboratory in Livermore, California, and the RIKEN Genomic Sciences Center, now in
Yokohama, Japan.
Both sides could dip into deep pockets. Celera had its original venture capitalists; its current parent company,
PE Corporation; and, after going public, anyone who cared to take a flutter. The Sanger Centre was supported
by the UK Medical Research Council and The Wellcome Trust. The US academic labs were supported by the
US National Institutes of Health, and Department of Energy.
Landmarks in the Human Genome Project
1953  Watson-Crick structure of DNA published.
1975  F. Sanger, and independently A. Maxam and W. Gilbert, develop methods for sequencing DNA.
1977  Bacteriophage φX-174 sequenced: first 'complete genome.'
1980  US Supreme Court holds that genetically-modified bacteria are patentable. This decision was the original basis for patenting of genes.
1981  Human mitochondrial DNA sequenced: 16 569 base pairs.
1984  Epstein-Barr virus genome sequenced: 172 281 base pairs.
1990  International Human Genome Project launched - target horizon 15 years.
1991  J. C. Venter and colleagues identify active genes via Expressed Sequence Tags - sequences of initial portions of DNA complementary to messenger RNA.
1992  Complete low resolution linkage map of the human genome.
1992  Beginning of the Caenorhabditis elegans sequencing project.
1992  Wellcome Trust and United Kingdom Medical Research Council establish The Sanger Centre for large-scale genomic sequencing, directed by J. Sulston.
1992  J. C. Venter forms The Institute for Genomic Research (TIGR), associated with plans to exploit sequencing commercially through gene identification and drug discovery.
1995  First complete sequence of a bacterial genome, Haemophilus influenzae, by TIGR.
1996  High-resolution map of human genome - markers spaced by ~ 600 000 base pairs.
1996  Completion of yeast genome, first eukaryotic genome sequence.
May 1998  Celera claims to be able to finish human genome by 2001. Wellcome responds by increasing funding to Sanger Centre.
1998  Caenorhabditis elegans sequence published.
1 September, 1999  Drosophila melanogaster genome sequence announced, by Celera; released Spring 2000.
1999  Human Genome Project states goal: working draft of human genome by 2001 (90% of genes sequenced to >95% accuracy).
1 December, 1999  Sequence of first complete human chromosome published.
26 June, 2000  Joint announcement of complete draft sequence of human genome.
2003  Fiftieth anniversary of discovery of the structure of DNA. Target date for completion of high-quality human genome sequence by public consortium.
On 26 June, 2000 the contestants agreed to declare the race a tie, or at least a carefully out-of-focus photo
finish.
The human genome is only one of the many complete genome sequences known. Taken together, genome
sequences from organisms distributed widely among the branches of the tree of life give us a sense, only
hinted at before, of the very great unity in detail of all life on Earth. They have changed our perceptions, much
as the first pictures of the Earth from space engendered a unified view of our planet.
The sequencing of the human genome ranks with the Manhattan project that produced atomic
weapons during the Second World War, and the space program that sent people to the Moon, as one of the
great bursts of technological achievement of the last century. These projects share a grounding in
fundamental science, and large-scale and expensive engineering development and support. For biology,
neither the attitudes nor the budgets will ever be the same. Soon a 'one man and a dog project' will refer only
to an afternoon's undergraduate practical experiment in sequencing and comparison of two mammalian
genomes.
The human genome is fundamentally about information, and computers were essential both for the
determination of the sequence and for the applications to biology and medicine that are already flowing from it.
Computing contributed not only the raw capacity for processing and storage of data, but also the
mathematically-sophisticated methods required to achieve the results. The marriage of biology and computer
science has created a new field called bioinformatics.
Today bioinformatics is an applied science. We use computer programs to make inferences from the data
archives of modern molecular biology, to make connections among them, and to derive useful and interesting
predictions.
This book is aimed at students and practising scientists who need to know how to access the data archives of
genomes and proteins, the tools that have been developed to work with these archives, and the kinds of
questions that these data and tools can answer. In fact, there are a lot of sources of this information. Sites
treating topics in bioinformatics are sprawled out all over the Web. The challenge is to select an essential core
of this material and to describe it clearly and coherently, at an introductory level.
It is assumed that the reader already has some knowledge of modern molecular biology, and some facility at
using a computer. The purpose of this book is to build on and develop this background. It is suitable as a
textbook for advanced undergraduates or beginning postgraduate students. Many worked-out examples are
integrated into the text, and references to useful web sites and recommended reading are provided.
Problems test and consolidate understanding, provide opportunities to practise skills, and explore additional
topics. Three types of problems appear at the ends of chapters. Exercises are short and straightforward
applications of material in the text. Problems also involve no information not contained in the text, but require
lengthier answers or in some cases calculations. The third category, 'Weblems,' requires access to the World
Wide Web. Weblems are designed to give readers practice with the tools required for further study and
research in the field.
What has made it possible to try to write such a book now is the extent to which the World Wide Web has
made easily accessible both the archives themselves and the programs that deal with them. In the past, it was
necessary to install programs and data on one's own system, and run calculations locally. Of course this
meant that everything was dependent on the facilities available. Now it is possible to channel all the work
through an interface to the Web. The web site linked with this book will ease the transition (see inside front
cover). To ensure that readers will be able freely to pursue discussions in the book onto the Web, descriptions
of and references to commercial software have been avoided, although many commercial packages are of
very high quality.
A serious problem with the web is its volatility. Sites come and go, leaving trails of dead links in their wake.
There are so many sites that it is necessary to try to find a few gateways that are stable - not only continuing
to exist but also kept up-to-date in both their contents and links. I have suggested some such sites, but many
others are just as good. The problem is not to create a long list of useful sites - this has been done many
times, and is relatively easy - but to create a short one - this is much harder!
Some computing is introduced in this book based on the widely available language PERL. Examples of simple
PERL programs appear in the context of biological problems. Many simple PERL tasks are assigned as
exercises or problems at the ends of the chapters.
Where might the reader turn next? This book is designed as a companion volume - in current parlance, a
'prequel' - to Introduction to Protein Architecture: The Structural Biology of Proteins (Oxford University Press,
2001), and that title is of course recommended. Other books on sequence analysis range from those oriented
towards biology to others in the field of computer science. The goal is that each reader will come to recognize
his or her own interests, and be equipped to follow them up.
I am grateful to many colleagues for discussions and advice during the preparation of this book, and to the
universities of Uppsala, Umeå, Rome 'Tor Vergata' and Cambridge for the opportunity to try out this material.
I thank S. Aparicio, T. Baglin, D. Baker, A. Bench, M. Brand, G. Bricogne, R.W. Carrell, C. Chothia, D.
Crowther, T. Dafforn, R. Foley, A. Friday, M.B. Gerstein, T. Gibson, T. J. P. Hubbard, J. Irving, J. Karn, K.
Karplus, B. Kieffer, E.V. Koonin, M. Krichevsky, P. Lawrence, D. Liberles, A. Lister, E.L. Lesk, M.E. Lesk, V.E.
Lesk, V.I. Lesk, L. Lo Conte, D.A. Lomas, J. Magré, C. Mitchell, J. Moult, E. Nacheva, H. Parfrey, A. Pastore,
D. Penny, F.W. Roberts, G.D. Rose, B. Rost, J. Sulston, M. Segal, E.L. Sonnhammer, R. Srinivasan, R.
Staden, G. H. Thomas, A. Tramontano, A.A. Travers, A. Venkitaraman, G. Vriend, J.C. Whisstock, S.H. White,
C. Wu, and M. Zuker for advice and critical reading.
I thank the staff of Oxford University Press for their skills and patience in producing the book.
A.M.L.
Cambridge
January 2002

Plan of the Book
My goal is that readers of this book will emerge with
• An appreciation of the nature of the very large amount of detailed information about ourselves and
other species that has become available.
• A sense of the range of applications of bioinformatics to molecular biology, clinical medicine,
pharmacology, biotechnology, agriculture, forensic science, anthropology and other disciplines.
• A useful knowledge of the techniques by which, through the World Wide Web, we gain access to
the data and the methods for their analysis.
• An appreciation of the role of computers and computer science in the investigations and
applications of the data.
• Confidence in the reader's basic skills in information retrieval, and calculations with the data, and in
the ability to extend these skills by self-directed 'field work' on the Web.
• A sense of optimism that the data and methods of bioinformatics will create profound advances in
our understanding of life, and improvements in the health of humans and other living things.
Plan of the book
• Chapter 1 sets the stage and introduces all of the major players: DNA and protein sequences and
structures, genomes and proteomes, databases and information retrieval, the World Wide Web and
computer programming. Before developing individual topics in detail it is important to see the
framework of their interactions.
• Chapter 2 presents the nature of individual genomes, including the Human Genome, and the
relationships among them, from the biological point of view.
• Chapter 3 imparts basic skills in using the Web in bioinformatics. It describes archival databanks,
and leads the reader through sample sessions, involving information retrieval from some of the major
databases in molecular biology.
• Chapter 4 treats the analysis of relationships among sequences - alignments and phylogenetic
trees. These methods underlie some of the major computational challenges of bioinformatics:
detecting distant relatives, understanding relationships among genomes of different organisms, and
tracing the course of evolution at the species and molecular levels.
• Chapter 5 moves into three dimensions, treating protein structure and folding. Sequence and
structure must be seen as full partners, with bioinformatics developing methods for moving back and
forth between them as fluently as possible. Understanding protein structures in detail is essential for
determining their mechanisms of action, and for clinical and pharmacological applications.

Chapter 1: Introduction
Overview
Biology has traditionally been an observational rather than a deductive science. Although recent developments
have not altered this basic orientation, the nature of the data has radically changed. It is arguable that until
recently all biological observations were fundamentally anecdotal - admittedly with varying degrees of
precision, some very high indeed. However, in the last generation the data have become not only much more
quantitative and precise, but, in the case of nucleotide and amino acid sequences, they have become discrete.
It is possible to determine the genome sequence of an individual organism or clone not only completely, but in
principle exactly. Experimental error can never be avoided entirely, but for modern genomic sequencing it is
extremely low.
Not that this has converted biology into a deductive science. Life does obey principles of physics and
chemistry, but for now life is too complex, and too dependent on historical contingency, for us to deduce its
detailed properties from basic principles.
A second obvious property of the data of bioinformatics is their very large volume. Currently the
nucleotide sequence databanks contain 16 × 10⁹ bases (abbreviated 16 Gbp). If we use the approximate size
of the human genome - 3.2 × 10⁹ letters - as a unit, this amounts to five HUman Genome Equivalents (or
huges, an apt name). For a comprehensible standard of comparison, 1 huge is comparable to the number of
characters appearing in six complete years of issues of The New York Times. The database of
macromolecular structures contains 16 000 entries, the full three-dimensional coordinates of proteins, of
average length ~400 residues. Not only are the individual databanks large, but their sizes are increasing at a
very high rate. Figure 1.1 shows the growth over the past decade of GenBank (archiving nucleic acid
sequences) and the Protein Data Bank (PDB) (archiving macromolecular structures). It would be precarious to
extrapolate.
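The unit conversion quoted above is easy to check by hand, but a short sketch makes it reusable for other databank sizes. The function name `to_huges` and the constant names are illustrative assumptions, not part of the text; the numbers are the ones given in the paragraph.

```python
# Figures quoted in the text.
DATABANK_BASES = 16e9        # nucleotide sequence databanks: 16 x 10^9 bases
HUMAN_GENOME_BASES = 3.2e9   # approximate size of the human genome


def to_huges(n_bases: float) -> float:
    """Express a number of bases in HUman Genome Equivalents (huges)."""
    return n_bases / HUMAN_GENOME_BASES


if __name__ == "__main__":
    print(f"{to_huges(DATABANK_BASES):.1f} huges")
```

Passing a different base count to `to_huges` gives the same comparison for any other archive size.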
Figure 1.1: (a) Growth of GenBank, the US National Center for Biotechnology Information genetic sequence
archival databank. (b) Growth of Protein Data Bank, archive of three-dimensional biological macromolecular
structures.

This quality and quantity of data have encouraged scientists to aim at commensurately ambitious goals:
• To have it said that they 'saw life clearly and saw it whole.' That is, to understand integrative
aspects of the biology of organisms, viewed as coherent complex systems.
• To interrelate sequence, three-dimensional structure, interactions, and function of individual
proteins, nucleic acids and protein-nucleic acid complexes.
• To use data on contemporary organisms as a basis for travel backward and forward in time - back
to deduce events in evolutionary history, forward to greater deliberate scientific modification of
biological systems.
• To support applications to medicine, agriculture and other scientific fields.
A scenario
For a fast introduction to the role of computing in molecular biology, imagine a crisis - sometime in the future -
in which a new biological virus creates an epidemic of fatal disease in humans or animals. Laboratory
scientists will isolate its genetic material - a molecule of nucleic acid consisting of a long polymer of four
different types of residues - and determine the sequence. Computer programs will then take over.
Screening this new genome against a data bank of all known genetic messages will characterize the virus and
reveal its relationship with viruses previously studied [10]. The analysis will continue, with the goal of
developing antiviral therapies. Viruses contain protein molecules which are suitable targets for drugs that will
interfere with viral structure or function. Like the nucleic acids, the proteins are also linear polymers; the
sequences of their residues, amino acids, are messages in a twenty-letter alphabet. From the viral DNA
sequences, computer programs will derive the amino acid sequences of one or more viral proteins crucial for
replication or assembly [01].
From the amino acid sequences, other programs will compute the structures of these proteins according to the
basic principle that the amino acid sequences of proteins determine their three-dimensional structures, and
thereby their functional properties. First, data banks will be screened for related proteins of known structure
[15]; if any are found, the problem of structure prediction will be reduced to its 'differential form' - the prediction
of the effects on a structure of changes in sequence - and the structures of the targets will be predicted by
methods known as homology modelling [25]. If no related protein of known structure is found, and a viral
protein appears genuinely new, the structure prediction must be done entirely ab initio [55]. This situation will
arise ever more infrequently as our databank of known structures grows more nearly complete, and our ability
to detect distant relatives reliably grows more powerful.
Knowing the viral protein structure will make it possible to design therapeutic agents. Proteins have sites on
their surfaces crucial for function, that are vulnerable to blocking. A small molecule, complementary in shape
and charge distribution to such a site, will be identified or designed by a program to serve as an antiviral drug
[50]; alternatively, one or more antibodies may be designed and synthesized to neutralize the virus [50].
This scenario is based on well-established principles, and I have no doubt that someday it will be implemented
as described. One reason why we cannot apply it today against AIDS is that many of the problems are as yet
unsolved. (Another is that viruses know how to defend themselves.) Computer scientists reading this book
may have recognized that the numbers in square brackets are not literature citations, but follow the convention
of D.E. Knuth in his classic books, The Art of Computer Programming, in indexing the difficulty of the problem!
Numbers below 30 correspond to problems for which solutions already exist; higher numbers signal themes of
current research.
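As a small illustration of the rating convention, the steps of the scenario can be tabulated with their bracketed numbers and sorted into the two groups the text describes. This is only a sketch: the step descriptions are paraphrases, and the names `SCENARIO_STEPS`, `SOLVED_BELOW` and `classify` are hypothetical.

```python
# Difficulty ratings as they appear in square brackets in the scenario.
# The step descriptions are paraphrased from the text.
SCENARIO_STEPS = [
    ("screen the new genome against known sequences", 10),
    ("derive viral protein sequences from the genome sequence", 1),
    ("screen data banks for related proteins of known structure", 15),
    ("predict structure by homology modelling", 25),
    ("predict structure entirely ab initio", 55),
    ("design a small-molecule antiviral drug", 50),
    ("design neutralizing antibodies", 50),
]

# "Numbers below 30 correspond to problems for which solutions already exist."
SOLVED_BELOW = 30


def classify(rating: int) -> str:
    """Label a rating as an already-solved problem or a current research theme."""
    return "solved" if rating < SOLVED_BELOW else "current research"


if __name__ == "__main__":
    for step, rating in SCENARIO_STEPS:
        print(f"[{rating:02d}] {classify(rating):16s} {step}")
```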
Finally, it should be recognized that purely experimental approaches to the problem of developing antiviral
agents may well continue to be more successful than theoretical ones for many years.
Life in space and time
It is difficult to define life, and it may be necessary to modify its definition - or to live, uncomfortably, with the
old one - as computers grow in power and the silicon-life interface gains in intimacy. For now, try this: A
biological organism is a naturally-occurring, self-reproducing device that effects controlled manipulations of
matter, energy and information.
From the most distant perspective, life on Earth is a complex self-perpetuating system distributed in space and
time. It is of the greatest significance that in many cases it is composed of discrete individual organisms, each
with a finite lifetime and in most cases with unique features.
Spatially, starting far away and zooming in progressively, one can distinguish, within the biosphere, local
ecosystems, which are stable until their environmental conditions change or they are invaded. Occupying each
ecosystem are sets of species, which evolve by Darwinian selection or genetic drift. The generation of variants
may arise from natural mutation, or the recombination of genes in sexual reproduction, or direct gene transfer.
Each species is composed of organisms carrying out individual if not independent activities. Organisms are
composed of cells. Every cell is an intimate localized ecosystem, not isolated from its environment but
interacting with it in specific and controlled ways. Eukaryotic cells contain a complex internal structure of their
own, including nuclei and other subcellular organelles, and a cytoskeleton. And finally we come down to the
level of molecules.
Life is extended not only in space but in time. We see today a snapshot of one stage in a history of life that
extends back in time for at least 3.5 billion years. The theory of natural selection has been extremely
successful in rationalizing the process of life's development. However, historical accident plays too dominant a
role in determining the course of events to allow much detailed prediction. Nor does fossil DNA afford
substantial access to any historical record at the molecular level. Instead, we must try to read the past in
contemporary genomes. US Supreme Court Justice Felix Frankfurter once wrote that '...the American
constitution is not just a document, it is a historical stream.' This is also true of genomes, which contain
records of their own development.
Dogmas: central and peripheral
The information archive in each organism - the blueprint for potential development and activity of any
individual - is the genetic material, DNA, or, in some viruses, RNA. DNA molecules are long, linear, chain
molecules containing a message in a four-letter alphabet (see Box). Even for microorganisms the message is
long, typically 10⁶ characters. Implicit in the structure of the DNA are mechanisms for self-replication and for
translation of genes into proteins. The double-helix, and its internal self-complementarity providing for accurate
replication, are well known (see Plate I). Near-perfect replication is essential for stability of inheritance; but
some imperfect replication, or mechanism for import of foreign genetic material, is also essential, else
evolution could not take place in asexual organisms.
Plate I: Double-helix of DNA.
The strands in the double-helix are antiparallel; directions along each strand are named 3' and 5' (for positions
in the deoxyribose ring). In translation to protein, the DNA sequence is always read in the 5' → 3' direction.
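Because the two strands are complementary and antiparallel, the sequence of one strand fixes the sequence of the other. A minimal sketch, assuming sequences are written 5' → 3' as lower-case strings over the letters a, c, g, t (the function name `reverse_complement` is an illustrative choice):

```python
# Watson-Crick pairing: a pairs with t, c pairs with g.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(strand: str) -> str:
    """Return the partner strand of a DNA sequence, also written 5' -> 3'.

    Complementing gives the partner read 3' -> 5'; reversing the result
    restores the conventional 5' -> 3' reading direction.
    """
    return strand.lower().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("atgc"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.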
The implementation of genetic information occurs, initially, through the synthesis of RNA and proteins.
Proteins are the molecules responsible for much of the structure and activities of organisms. Our hair, muscle,
digestive enzymes, receptors and antibodies are all proteins. Both nucleic acids and proteins are long, linear
chain molecules. The genetic 'code' is in fact a cipher: Successive triplets of letters from the DNA sequence
specify successive amino acids; stretches of DNA sequences encipher amino acid sequences of proteins.
Typically, proteins are 200–400 amino acids long, requiring 600–1200 letters of the DNA message to specify
them. Synthesis of RNA molecules, for instance the RNA components of the ribosome, is also directed by
DNA sequences. However, in most organisms not all of the DNA is expressed as proteins or RNAs. Some
regions of the DNA sequence are devoted to control mechanisms, and a substantial amount of the genomes
of higher organisms appears to be 'junk'. (Which in part may mean merely that we do not yet understand its
function.)
The four naturally-occurring nucleotides in DNA (RNA)
a adenine    g guanine    c cytosine    t thymine (u uracil)
The twenty naturally-occurring amino acids in proteins
Non-polar amino acids
G glycine    A alanine    P proline    V valine
I isoleucine    L leucine    F phenylalanine    M methionine
Polar amino acids
S serine    C cysteine    T threonine    N asparagine
Q glutamine    H histidine    Y tyrosine    W tryptophan
Charged amino acids
D aspartic acid    E glutamic acid    K lysine    R arginine
Other classifications of amino acids can also be useful. For instance, histidine, phenylalanine, tyrosine, and
tryptophan are aromatic, and are observed to play special structural roles in membrane proteins.
Amino acid names are frequently abbreviated to their first three letters, for instance Gly for glycine; except
for isoleucine, asparagine, glutamine and tryptophan, which are abbreviated to Ile, Asn, Gln and Trp,
respectively. The rare amino acid selenocysteine has the three-letter abbreviation Sec and the one-letter
code U.
It is conventional to write nucleotides in lower case and amino acids in upper case. Thus atg = adenine-
thymine-guanine and ATG = alanine-threonine-glycine.
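The case convention and the abbreviation rule in this box can be made concrete with a short sketch. The dictionaries below simply transcribe the box; the function names `spell_out` and `three_letter` are hypothetical, chosen for illustration.

```python
# One-letter codes transcribed from the box above.
NUCLEOTIDES = {"a": "adenine", "g": "guanine", "c": "cytosine",
               "t": "thymine", "u": "uracil"}
AMINO_ACIDS = {
    "G": "glycine", "A": "alanine", "P": "proline", "V": "valine",
    "I": "isoleucine", "L": "leucine", "F": "phenylalanine", "M": "methionine",
    "S": "serine", "C": "cysteine", "T": "threonine", "N": "asparagine",
    "Q": "glutamine", "H": "histidine", "Y": "tyrosine", "W": "tryptophan",
    "D": "aspartic acid", "E": "glutamic acid", "K": "lysine", "R": "arginine",
}
# The four names whose abbreviation is not simply the first three letters.
EXCEPTIONS = {"isoleucine": "Ile", "asparagine": "Asn",
              "glutamine": "Gln", "tryptophan": "Trp"}


def spell_out(sequence: str) -> str:
    """Expand a sequence: lower case means nucleotides, upper case amino acids."""
    if sequence.islower():
        table = NUCLEOTIDES
    elif sequence.isupper():
        table = AMINO_ACIDS
    else:
        raise ValueError("mixed-case sequences are ambiguous")
    return "-".join(table[symbol] for symbol in sequence)


def three_letter(name: str) -> str:
    """Three-letter abbreviation of an amino acid name."""
    return EXCEPTIONS.get(name, name[:3].capitalize())


if __name__ == "__main__":
    print(spell_out("atg"))
    print(spell_out("ATG"))
    print([three_letter(n) for n in ("glycine", "tryptophan", "aspartic acid")])
```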
In DNA the molecules comprising the alphabet are chemically similar, and the structure of DNA is, to a first
approximation, uniform. Proteins, in contrast, show a great variety of three-dimensional conformations. These
are necessary to support their very diverse structural and functional roles.
The amino acid sequence of a protein dictates its three-dimensional structure. For each natural amino acid
sequence, there is a unique stable native state that under proper conditions is adopted spontaneously. If a
purified protein is heated, or otherwise brought to conditions far from the normal physiological environment, it
will 'unfold' to a disordered and biologically-inactive structure. (This is why our bodies have mechanisms to
maintain nearly-constant internal conditions.) When normal conditions are restored, protein molecules will
generally readopt the native structure, indistinguishable from the original state.
The spontaneous folding of proteins to form their native states is the point at which Nature makes the giant
leap from the one-dimensional world of genetic and protein sequences to the three-dimensional world we
inhabit. There is a paradox: The translation of DNA sequences to amino acid sequences is very simple to
describe logically; it is specified by the genetic code. The folding of the polypeptide chain into a precise three-
dimensional structure is very difficult to describe logically. However, translation requires the immensely
complicated machinery of the ribosome, tRNAs and associated molecules; but protein folding occurs
spontaneously.

The standard genetic code
ttt Phe   tct Ser   tat Tyr   tgt Cys
ttc Phe   tcc Ser   tac Tyr   tgc Cys
tta Leu   tca Ser   taa STOP  tga STOP
ttg Leu   tcg Ser   tag STOP  tgg Trp
ctt Leu   cct Pro   cat His   cgt Arg
ctc Leu   ccc Pro   cac His   cgc Arg
cta Leu   cca Pro   caa Gln   cga Arg
ctg Leu   ccg Pro   cag Gln   cgg Arg
att Ile   act Thr   aat Asn   agt Ser
atc Ile   acc Thr   aac Asn   agc Ser
ata Ile   aca Thr   aaa Lys   aga Arg
atg Met   acg Thr   aag Lys   agg Arg
gtt Val   gct Ala   gat Asp   ggt Gly
gtc Val   gcc Ala   gac Asp   ggc Gly
gta Val   gca Ala   gaa Glu   gga Gly
gtg Val   gcg Ala   gag Glu   ggg Gly
Alternative genetic codes appear, for example, in organelles - chloroplasts and mitochondria.
The functions of proteins depend on their adopting the native three-dimensional structure. For example, the
native structure of an enzyme may have a cavity on its surface that binds a small molecule and juxtaposes it to
catalytic residues. We thus have the paradigm:
• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
Most of the organized activity of bioinformatics has been focused on the analysis of the data related to these
processes.
So far, this paradigm does not include levels higher than the molecular level of structure and organization,
including, for example, such questions as how tissues become specialized during development or, more
generally, how environmental effects exert control over genetic events. In some cases of simple feedback
loops, it is understood at the molecular level how increasing the amount of a reactant causes an increase in
the production of an enzyme that catalyses its transformation. More complex are the programs of development
during the lifetime of an organism. These fascinating problems about the information flow and control within an
organism have now come within the scope of mainstream bioinformatics.
Observables and data archives
A databank includes an archive of information, a logical organization or 'structure' of that information, and tools
to gain access to it. Databanks in molecular biology cover nucleic acid and protein sequences,
macromolecular structures, and function. They include:
• Archival databanks of biological information
  o DNA and protein sequences, including annotation
  o nucleic acid and protein structures, including annotation
  o databanks of protein expression patterns
• Derived databanks: These contain information collected from the archival databanks, and from the
  analysis of their contents. For instance:
  o sequence motifs (characteristic 'signature patterns' of families of proteins)
  o mutations and variants in DNA and protein sequences
  o classifications or relationships (connections and common features of entries in
    archives; for instance a databank of sets of protein sequence families, or a
    hierarchical classification of protein folding patterns)
• Bibliographic databanks
• Databanks of web sites
  o databanks of databanks containing biological information
  o links between databanks
Database queries seek to identify a set of entries (e.g. sequences or structures) on the basis of specified
features or characteristics, or on the basis of similarity to a probe sequence or structure. The most common
query is: 'I have determined a new sequence, or structure - what do the databanks contain that is like it?' Once
a set of sequences or structures similar to the probe object is fished out of the appropriate database, the
researcher is in a position to identify and investigate their common features.
The mechanism of access to a databank is the set of tools for answering questions such as:
• 'Does the databank contain the information I require?' (Example: In which databanks can I
  find amino acid sequences of alcohol dehydrogenases?)
• 'How can I assemble selected information from the databank in a useful form?' (Example:
  How can I compile a list of globin sequences, or even better, a table of aligned globin
  sequences?)
• Indices of databanks are useful in asking 'Where can I find some specific piece of
  information?' (Example: What databanks contain the amino acid sequence of porcupine
  trypsin?) Of course if I know and can specify exactly what I want the problem is relatively
  straightforward.
A databank without effective modes of access is merely a data graveyard. How to achieve effective access is
an issue of database design that ideally should remain hidden from users. It has become clear that effective
access cannot be provided by bolting a query system onto an unstructured archive. Instead, the logical
organization of the storage of the information must be designed with the access in mind - what kinds of
questions users will want to ask - and the structure of the archive must mesh smoothly with the information-
retrieval software.
A variety of possible kinds of database queries can arise in bioinformatics. These include:
1. Given a sequence, or fragment of a sequence, find sequences in the database that are
   similar to it. This is a central problem in bioinformatics. We share such string-matching
   problems with many fields of computer science. For instance, word processing and
   editing programs support string-search functions.
2. Given a protein structure, or fragment, find protein structures in the database that are
   similar to it. This is the generalization of the string matching problem to three dimensions.
3. Given a sequence of a protein of unknown structure, find structures in the database that
   adopt similar three-dimensional structures. One is tempted to cheat - to look in the
   sequence data banks for proteins with sequences similar to the probe sequence: For if
   two proteins have sufficiently similar sequences, they will have similar structures.
   However, the converse is not true, and one can hope to create more powerful search
   techniques that will find proteins of similar structure even though their sequences have
   diverged beyond the point where they can be recognized as similar by sequence
   comparison.
4. Given a protein structure, find sequences in the data bank that correspond to similar
   structures. Again, one can cheat by using the structure to probe a structure data bank,
   but this can give only limited success because there are so many more sequences known
   than structures. It is, therefore, desirable to have a method that can pick out the structure
   from the sequence.
(1) and (2) are solved problems; such searches are carried out thousands of times a day. (3) and (4) are
active fields of research.
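Query (1) can be illustrated by a deliberately naive PERL sketch: slide the probe along each database sequence, without gaps, and count matching positions. This is not how production search programs work; the helper name best_ungapped, the entry names, and the toy sequences are all invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# best_ungapped($probe, $target): hypothetical helper.
# Slides the probe along the target without gaps and returns the offset
# and the number of matching characters of the best juxtaposition,
# or (-1, -1) if the target is shorter than the probe.
sub best_ungapped {
    my ($probe, $target) = @_;
    my $m = length($probe);
    my ($best_offset, $best_matches) = (-1, -1);
    foreach my $offset (0 .. length($target) - $m) {
        my $matches = 0;
        foreach my $k (0 .. $m - 1) {        # compare position by position
            $matches++ if substr($probe, $k, 1) eq substr($target, $offset + $k, 1);
        }
        if ($matches > $best_matches) {      # keep the first best juxtaposition
            ($best_offset, $best_matches) = ($offset, $matches);
        }
    }
    return ($best_offset, $best_matches);
}

# toy "database": names and sequences invented for illustration
my %database = (
    entry1 => "ccatgcatcccgg",
    entry2 => "ttttttttttttt",
    entry3 => "gcatgcaaccctt",
);
my $probe = "atgcatccc";

foreach my $name (sort keys %database) {
    my ($offset, $matches) = best_ungapped($probe, $database{$name});
    print "$name: best offset $offset, $matches of ", length($probe), " positions match\n";
}
```

Every probe character is compared with every database character, so the running time grows as the product of the two lengths. Practical search programs avoid most of this work.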
Tasks of even greater subtlety arise when one wishes to study relationships between information contained in
separate databanks. This requires links that facilitate simultaneous access to several databanks. Here is an
example: 'For which proteins of known structure involved in diseases of purine biosynthesis in humans, are
there related proteins in yeast?' We are setting conditions on: known structure, specified function, detection of
relatedness, correlation with disease, specified species. The growing importance of simultaneous access to
databanks has led to research in databank interactivity - how can databanks 'talk to one another', without too
great a sacrifice of the freedom of each one to structure its own data in ways appropriate to the individual
features of the material it contains.
A problem that has not yet arisen in molecular biology is control of updates to the archives. A database of
airline reservations must prevent many agents from selling the same seat to different travellers. In
bioinformatics, users can read or extract information from archival databanks, or submit material for
processing by the staff of an archive, but not add or alter entries directly. This situation may change. From a
practical point of view, the amount of data being generated is increasing so rapidly as to swamp the ability of
archive projects to assimilate it. There is already movement towards greater involvement of scientists at the
bench in preparing data for the archive.
Although there are good arguments for unique control over the archives, there is no need to limit the ways to
access them - colloquially, the design of 'front ends'. Specialized user communities may extract subsets of the
data, or recombine data from different sources, and provide specialized avenues of access. Such 'Boutique
databases' depend on the primary archives as the source of the information they contain, but redesign the
organization and presentation as they see fit. Indeed, different derived databases can slice and dice the same
information in different ways. A reasonable extrapolation suggests the concept of specialized 'virtual
databases', grounded in the archives but providing individual scope and function, tailored to the needs of
individual research groups or even individual scientists.
Curation, annotation, and quality control
The scientific and medical communities are dependent on the quality of databanks. Indices of quality, even if
they do not permit correction of mistakes, may help us avoid arriving at wrong conclusions.
Databank entries comprise raw experimental results, and supplementary information, or annotations. Each of
these has its own sources of error.
The most important determinant of the quality of the data themselves is the state of the art of the experiments.
Older data were limited by older techniques; for instance, amino acid sequences of proteins used to be
determined by peptide sequencing, but almost all are now translated from DNA sequences. One consequence
of the data explosion is that most data are new data, governed by current technology, which in most cases
does quite a good job.
Annotations include information about the source of the data and the methods used to determine them. They
identify the investigators responsible, and cite relevant publications. They provide links to related information
in other databanks. In sequence databanks, annotations include feature tables: lists of segments of the
sequences that have biological significance - for instance, regions of a DNA sequence that code for proteins.
These appear in computer-parsable formats, and their contents may be restricted to a controlled vocabulary.
Until recently, a typical DNA sequence entry was produced by a single research group, investigating a gene
and its products in a coherent way. Annotations were grounded in experimental data and written by
specialists. In contrast, full-genome sequencing projects offer no experimental confirmation of the expression
of most putative genes, nor characterization of their products. Curators at databanks base their annotations on
the analysis of the sequences by computer programs.
Annotation is the weakest component of the genomics enterprise. Automation of annotation is possible only to
a limited extent; getting it right remains labour-intensive, and allocated resources are inadequate. But the
importance of proper annotation cannot be overestimated. P. Bork has commented that errors in gene
assignments vitiate the high quality of the sequence data themselves.
Growth of genomic data will permit improvement in the quality of annotation, as statistical methods increase in
accuracy. This will allow improved reannotation of entries. The improvement of annotations will be a good
thing. But the inevitable concomitant, that annotation will be in flux, is disturbing. Will completed research
projects have to be revisited periodically, and conclusions reconsidered? The problem is aggravated by the
proliferation of web sites, with increasingly dense networks of links. They provide useful avenues for
applications. But the web is also a vector of contagion, propagating errors in raw data, in immature data
subsequently corrected but the corrections not passed on, and variant annotations.
The only possible solution is a distributed and dynamic error-correction and annotation process. Distributed, in
that databank staff will have neither the time nor the expertise for the job; specialists will have to act as
curators. Dynamic, in that progress in automation of annotation and error identification/correction will permit
reannotation of databanks. We will have to give up the safe idea of a stable databank composed of entries that
are correct when first distributed and stay fixed. Databanks will become a seething broth of information
growing in size, and maturing - we must hope - in quality.

The World Wide Web
It is likely that all readers will have used the World Wide Web, for reference material, for news, for access to
databases in molecular biology, for checking out personal information about individuals - friends or colleagues
or celebrities - or just for browsing. Fundamentally, the Web is a means of interpersonal and intercomputer
contact over networks. It provides a complete global village, containing the equivalent of library, post office,
shops, and schools.
You, the user, run a browser program on your own computer. Common browsers are Netscape and Internet
Explorer. With these browser programs you can read and display material from all over the world. A browser
also presents control information, allowing you to follow trails forward and back or to interrupt a side-trip. And it
allows you to download information to your local computer.
The material displayed contains embedded links that allow you to jump around to other pages and sites,
adding new dimensions to your excursions. The interconnections animate the Web. What makes the human
brain so special is not the absolute number of neurons, but the density of interconnections between them.
Similarly, it is not only the number of entries that makes the Web so powerful, but their reticulation.
The links are visible in the material you are viewing at any time. Running your browser program, you view a
page, or frame. This view will contain active objects: words, or buttons, or pictures. These are usually
distinguished by a highlighted colour. Selecting them will effect a transfer to a new page. At the same time,
you automatically leave a trail of 'electronic breadcrumbs' so that you can return to the calling link, to take up
further perusal of the page you started from.
The Web can be thought of as a giant world wide bulletin board. It contains text, images, movies, and sounds.
Virtually anything that can be stored on a computer can be made available and accessed via the Web. An
interesting example is a page describing the poetry of William Butler Yeats. The highest level page contains
material appropriate for a table of contents. Via links displayed on this top page, you can see printed text of
different poems. You can compare different editions. You can access critical analysis of the poems. You can
see versions of some poems in Yeats' manuscripts. For some poems, there is even a link to an audio file, from
which you can hear Yeats himself reading the poem.
Links can be internal or external. Internal links may take you to other portions of the text of a current
document, or to images, movies, or sounds. External links may allow you to move down to more specialized
documents, up to more general ones (perhaps providing background to technical material), sideways to
parallel documents (other papers on the same subject), or over, to directories that show what other relevant
material is available.
The main thing to do, to get started using the Web effectively, is to find useful entry points. Once a session is
launched, the links will take you where you want to go. Among the most important sites are search engines
that index the entire Web and permit retrieval by keywords. You can enter one or more terms, such as
'phosphorylase', 'allosteric change', 'crystal structure', and the search program will return a list of links to sites
on the Web that contain these terms. You will thereby identify sites relevant to your interest.
Once you have completed a successful session, the intersession memory facilities of the browsers allow you,
when you next log in, to pick up cleanly where you left off. During any session, when you find yourself viewing a
document to which you will want to return, you can save the link in a file of bookmarks or favourites. In a
subsequent session you can return to any site on this list directly, not needing to follow the trail of links that led
you there in the first place.
A personal home page is a short autobiographical sketch (with links, of course). Your professional colleagues
will have their own home pages which typically include name, institutional affiliation, addresses for paper and
electronic mail, telephone and fax numbers, a list of publications and current research interests. It is not
uncommon for home pages to include personal information, such as hobbies, pictures of the individual with his
or her spouse and children, and even the family dog!
Nor is the Web solely a one-way street. Many Web documents include forms in which you can enter
information, and launch a program that returns results within your session. Search engines are common
examples. Many calculations in bioinformatics are now launched via such web servers. If the calculations are
lengthy the results may not be returned within the session, but sent by e-mail.
The hURLy-bURLy
Even brief experience with the Web will bring you into contact with the strange-looking character strings that
identify locations. These are URLs - Uniform Resource Locators. They specify the format of the material and
its location. After all, every document on the Web must be a file on some computer somewhere. An example
of a URL is:

http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html
This is the URL of a useful tutorial about Finding Information on the Internet. The prefix http:// stands for
hypertext transfer protocol. This tells your browser to retrieve the document using http, by far the most
common protocol on the Web. The next section, www.lib.berkeley.edu, is the name of a computer, in this case
in the central library at the University of California at Berkeley. The rest of the URL specifies the location
and name of the file on that computer, the contents of which your browser will display.
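The three parts of a URL can be pulled apart with a single pattern match. The following PERL sketch is a simplification: the helper name parse_url is hypothetical, and the pattern handles only URLs of the plain form protocol://computer/file. For real work a tested library, such as the URI module from CPAN, is a safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_url($url): hypothetical helper that splits a simple URL into
# protocol, computer name, and file path. Returns an empty list if the
# string does not have the form protocol://computer/path.
sub parse_url {
    my ($url) = @_;
    if ($url =~ m{^([a-z][a-z0-9+.-]*)://([^/]+)(/.*)?$}i) {
        return ($1, $2, defined $3 ? $3 : "/");   # a missing path means the top-level page
    }
    return ();
}

my ($protocol, $host, $path) =
    parse_url("http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html");
print "protocol: $protocol\n";   # http
print "computer: $host\n";       # www.lib.berkeley.edu
print "file:     $path\n";       # /TeachingLib/Guides/Internet/FindInfo.html
```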
Electronic publication
More and more publications are appearing on the Web. A scientific journal may post only its table of contents,
a table of contents together with abstracts of articles, or even full articles. Many institutional publications -
newsletters and technical reports - appear on the Web. Many other magazines and newspapers are showing
up as well. You might want to try http://www.nytimes.com. Many printed publications now contain references to
Web links containing supplementary material that never appears on paper.
We are in an era of a transition to paper-free publishing. It is already a good idea to include, in your own
printed articles, your e-mail address and the URL of your home page.
Electronic publication raises a number of questions. One concerns peer review. How can we guarantee the
same quality for electronic publication that we rely on for printed journals? Are electronic publications
'counted', in the publish-or-perish sense of judging the productivity (if not the quality) of a job candidate? A
well-known observer of the field has offered the sobering (if possibly exaggerated) prediction: 'The first time
Harvard or Stanford gives tenure to someone for electronic publication, 90% of the scientific journals will
disappear overnight'.
Computers and computer science
Bioinformatics would not be possible without advances in computing hardware and software. Fast and high-
capacity storage media are essential even to maintain the archives. Information retrieval and analysis require
programs; some fairly straightforward and others extremely sophisticated. Distribution of the information
requires the facilities of computer networks and the World Wide Web.
Computer science is a young and flourishing field with the goal of making most effective use of information
technology hardware. Certain areas of theoretical computer science impinge most directly on bioinformatics.
Let us consider them with reference to a specific biological problem: 'Retrieve from a database all sequences
similar to a probe sequence'. A good solution of this problem would appeal to computer science for:
• Analysis of algorithms. An algorithm is a complete and precise specification of a method for solving
  a problem. For the retrieval of similar sequences, we need to measure the similarity of the probe
  sequence to every sequence in the database. It is possible to do much better than the naive
  approach of checking every pair of positions in every possible juxtaposition, a method that even
  without allowing gaps would require a time proportional to the product of the number of
  characters in the probe sequence times the number of characters in the database. A speciality in
  computer science known colloquially as 'stringology' focuses on developing efficient methods for
  this type of problem, and analysing their effective performance.
• Data structures, and information retrieval. How can we organize our data for efficient response to
  queries? For instance, are there ways to index or otherwise preprocess the data to make our
  sequence-similarity searches more efficient? How can we provide interfaces that will assist the
  user in framing and executing queries?
• Software engineering. Hardly ever anymore does anyone write programs in the native language of
  computers. Programmers work in higher-level languages, such as C, C++, PERL ('Practical
  Extraction and Report Language') or even FORTRAN. The choice of programming language
  depends on the nature of the algorithm and associated data structure, and the expected use of
  the program. Of course most complicated software used in bioinformatics is now written by
  specialists. Which brings up the question of how much programming expertise a bioinformatician
  needs.
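One way to preprocess a database, offered here only as a sketch, is to index every word of k consecutive characters. A probe is then compared only with entries that share at least one word with it. The helper names build_index and candidates, the word length, and the toy sequences are assumptions made for the illustration; they do not describe any particular search program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_index($k, %db): hypothetical helper. Returns a reference to a hash
# that maps each word of length $k to a list of [entry name, position] pairs.
sub build_index {
    my ($k, %db) = @_;
    my %index;
    foreach my $name (sort keys %db) {
        my $seq = $db{$name};
        foreach my $pos (0 .. length($seq) - $k) {
            push @{ $index{ substr($seq, $pos, $k) } }, [$name, $pos];
        }
    }
    return \%index;
}

# candidates($index, $k, $probe): hypothetical helper. Looks up each word
# of the probe and counts, per entry, how many index hits it collects.
sub candidates {
    my ($index, $k, $probe) = @_;
    my %hits;
    foreach my $pos (0 .. length($probe) - $k) {
        my $word = substr($probe, $pos, $k);
        next unless exists $index->{$word};
        $hits{ $_->[0] }++ foreach @{ $index->{$word} };
    }
    return \%hits;
}

# toy "database": names and sequences invented for illustration
my %database = (
    entry1 => "ccatgcatcccgg",
    entry2 => "ttttttttttttt",
    entry3 => "gcatgcaaccctt",
);
my $k     = 4;
my $index = build_index($k, %database);      # done once, before any query
my $hits  = candidates($index, $k, "atgcatccc");
foreach my $name (sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits) {
    print "$name shares $hits->{$name} word(s) of length $k with the probe\n";
}
```

The index is built once and reused for every query. Entries that share no word with the probe are never examined, at the price of missing similarities too weak to preserve any complete word.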
Programming
Programming is to computer science what bricklaying is to architecture. Both are creative; one is an art and
the other a craft.

Many students of bioinformatics ask whether it is essential to learn to write complicated computer
programs. My advice (not agreed upon by everyone in the field) is: 'Don't. Unless you want to specialize in
it'. For working in bioinformatics, you will need to develop expertise in using tools available on the Web.
Learning how to create and maintain a web site is essential. And of course you will need facility in the use of
the operating system of your computer. Some skill in writing simple scripts in a language like PERL provides
an essential extension to the basic facilities of the operating system.
On the other hand, the size of the data archives, and the growing sophistication of the questions we wish to
address, demand a healthy respect. Truly creative programming in the field is best left to specialists, well-
trained in computer science. Nor does using programs, via highly polished (not to say flashy) Web interfaces,
provide any indication of the nature of the activity involved in writing and debugging programs. Bismarck once
said: 'Those who love sausages or the law should not watch either being made'. Perhaps computer programs
should be added to his list.
I recommend learning some basic skills with PERL, a powerful tool that makes it easy to carry out many
useful simple tasks. PERL also has the advantage of being available on most computer
systems.
How should you learn enough PERL to be useful in bioinformatics? Many institutions run courses. Learning
from colleagues is fine, depending on the ratio of your adeptness to their patience. Books are available. A very
useful approach is to find lessons on the Web - ask a search engine for 'PERL tutorial' and you will turn up
many useful sites that will lead you by the hand through the basics. And of course use it in your work as much
as you can. This book will not teach you PERL, but it will provide opportunities to practice what you learn
elsewhere.
Examples of simple PERL programs appear in this book. The strength of PERL at character-string handling
makes it suitable for sequence analysis tasks in biology. Here is a very simple PERL program to translate a
nucleotide sequence into an amino acid sequence according to the standard genetic code. The first line,
#!/usr/bin/perl, is a signal to the UNIX (or LINUX) operating system that what follows is a PERL
program. Within the program, all text commencing with a #, through to the end of the line on which it appears,
is merely comment. The line __END__ signals that the program is finished and what follows is the input data.
(All material that the reader might find useful to have in computer-readable form, including all programs,
appears on the web site associated with this book: http://www.oup.com/uk/lesk/bioinf)
#!/usr/bin/perl
#translate.pl -- translate nucleic acid sequence to protein sequence
#                according to standard genetic code

# set up table of standard genetic code
%standardgeneticcode = (
    "ttt"=> "Phe", "tct"=> "Ser", "tat"=> "Tyr", "tgt"=> "Cys",
    "ttc"=> "Phe", "tcc"=> "Ser", "tac"=> "Tyr", "tgc"=> "Cys",
    "tta"=> "Leu", "tca"=> "Ser", "taa"=> "TER", "tga"=> "TER",
    "ttg"=> "Leu", "tcg"=> "Ser", "tag"=> "TER", "tgg"=> "Trp",
    "ctt"=> "Leu", "cct"=> "Pro", "cat"=> "His", "cgt"=> "Arg",
    "ctc"=> "Leu", "ccc"=> "Pro", "cac"=> "His", "cgc"=> "Arg",
    "cta"=> "Leu", "cca"=> "Pro", "caa"=> "Gln", "cga"=> "Arg",
    "ctg"=> "Leu", "ccg"=> "Pro", "cag"=> "Gln", "cgg"=> "Arg",
    "att"=> "Ile", "act"=> "Thr", "aat"=> "Asn", "agt"=> "Ser",
    "atc"=> "Ile", "acc"=> "Thr", "aac"=> "Asn", "agc"=> "Ser",
    "ata"=> "Ile", "aca"=> "Thr", "aaa"=> "Lys", "aga"=> "Arg",
    "atg"=> "Met", "acg"=> "Thr", "aag"=> "Lys", "agg"=> "Arg",
    "gtt"=> "Val", "gct"=> "Ala", "gat"=> "Asp", "ggt"=> "Gly",
    "gtc"=> "Val", "gcc"=> "Ala", "gac"=> "Asp", "ggc"=> "Gly",
    "gta"=> "Val", "gca"=> "Ala", "gaa"=> "Glu", "gga"=> "Gly",
    "gtg"=> "Val", "gcg"=> "Ala", "gag"=> "Glu", "ggg"=> "Gly"
);

# process input data
while ($line = <DATA>) {                        # read in line of input
    print "$line";                              # transcribe to output
    chomp($line);                               # remove end-of-line character
    @triplets = unpack("a3" x (length($line)/3), $line);  # pull out successive triplets
    foreach $codon (@triplets) {                # loop over triplets
        print "$standardgeneticcode{$codon}";   # print out translation of each
    }                                           # end loop on triplets
    print "\n\n";                               # skip line on output
}                                               # end loop on input lines

# what follows is input data
__END__
atgcatccctttaat
tctgtctga
Running this program on the given input data produces the output:
atgcatccctttaat
MetHisProPheAsn

tctgtctga
SerValTER
Even this simple program displays several features of the PERL language. The file contains background data
(the genetic code translation table), statements that tell the computer to do something with the input (i.e. the
sequence to be translated), and the input data (appearing after the __END__ line). Comments summarize
sections of the program, and also describe the effect of each statement.
The program is structured as blocks enclosed in curly brackets: {...}, which are useful in controlling the flow of
execution. Within blocks, individual statements (each ending in a ;) are executed in order of appearance. The
outer block is a loop:
while ($line = <DATA>) {
...
}
<DATA> refers to the lines of input data (appearing after the __END__). The block is executed once for each
line of input; that is, while there is any line of input remaining.
Three types of data structures appear in the program. The line of input data, referred to as $line, is a simple
character string. It is split into an array or vector of triplets. An array stores several items in a linear order, and
individual items of data can be retrieved from their positions in the array. For ease of looking up the amino acid
coded for by any triplet, the genetic code is stored as an associative array. An associative array, or hash table,
is a generalization of a simple or sequential array. If the elements of a simple array are indexed by
consecutive integers, the elements of an associative array are indexed by any character strings, in this case
the 64 triplets. We process the input triplets in order of their appearance in the nucleotide sequence, but we
need to access the elements of the genetic code table in an arbitrary order as dictated by the succession of
triplets. A simple array or vector of character strings is appropriate for processing successive triplets, and the
associative array is appropriate for looking up the amino acids that correspond to them.
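The difference between the two kinds of array shows up in a few lines. This sketch is separate from the translation program: the sequence and the variable names are chosen for illustration only. It stores triplets in a simple array, indexed by position, and counts them in an associative array, indexed by the triplet itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = "atgcatcatttt";          # invented example sequence

# simple array: triplets stored, and retrieved, by position
my @triplets = $sequence =~ /(...)/g;   # every complete triplet, in order
print "second triplet: $triplets[1]\n";       # cat

# associative array (hash): counts stored, and retrieved, by triplet
my %count;
$count{$_}++ foreach @triplets;
print "cat occurs $count{cat} times\n";       # 2

foreach my $codon (sort keys %count) {  # keys come back in no fixed order, so sort them
    print "$codon $count{$codon}\n";
}
```

Note that the array preserves the order of the triplets in the sequence, whereas the hash has no inherent order; it answers only the question 'what is stored under this key?'.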
Here is another PERL program that illustrates additional aspects of the language.[1] This program
reassembles the sentence:
All the world's a stage,
And all the men and women merely players;
They have their exits and their entrances,
And one man in his time plays many parts.
after it has been chopped into random overlapping fragments (\n in the fragments represents end-of-line in the
original):
the men and women merely players;\n
one man in his time
All the world's
their entrances,\nAnd one man
stage,\nAnd all the men and women
They have their exits and their entrances,\n
world's a stage,\nAnd all
in his time plays many parts.
merely players;\nThey have
#!/usr/bin/perl
#assemble.pl -- assemble overlapping fragments of strings

# input of fragments
while ($line = <DATA>) {                  # read in fragments, 1 per line
    chomp($line);                         # remove trailing end-of-line character
    push(@fragments,$line);               # copy each fragment into array
}
# now array @fragments contains fragments

# we need two relationships between fragments:
# (1) which fragment shares no prefix with suffix of another fragment
#     * This tells us which fragment comes first
# (2) which fragment shares longest suffix with a prefix of another
#     * This tells us which fragment follows any fragment

# First set array of prefixes to the default value "noprefixfound".
#   Later, change this default value when a prefix is found.
#   The one fragment that retains the default value must come first.
# Then loop over pairs of fragments to determine maximal overlap.
#   This determines successor of each fragment
#   Note in passing that if a fragment has a successor then the
#   successor must have a prefix

foreach $i (@fragments) {                 # initially set prefix of each fragment
    $prefix{$i} = "noprefixfound";        #   to "noprefixfound"
}                                         # this will be overwritten when a prefix is found

# for each pair, find longest overlap of suffix of one with prefix of the other
#   This tells us which fragment FOLLOWS any fragment
foreach $i (@fragments) {                 # loop over fragments
    $longestsuffix = "";                  # initialize longest suffix to null
    foreach $j (@fragments) {             # loop over fragment pairs
        unless ($i eq $j) {               # don't check fragment against itself
            $combine = $i . "XXX" . $j;   # concatenate fragments, with fence XXX
            if ($combine =~ /([\S ]{2,})XXX\1/              # check for repeated sequence
                && length($1) > length($longestsuffix)) {   # keep longest overlap
                $longestsuffix = $1;      # retain longest suffix
                $successor{$i} = $j;      # record that $j follows $i
            }
        }
    }
    if (defined $successor{$i}) {         # if $j follows $i then $j must have a prefix
        $prefix{$successor{$i}} = "found";
    }
}
foreach (@fragments) {     # find fragment that has no prefix; that's the start
    if ($prefix{$_} eq "noprefixfound") {$outstring = $_;}
}
$test = $outstring;                       # start with fragment without prefix
while ($successor{$test}) {               # append fragments in order
    $test = $successor{$test};            # choose next fragment
    $outstring = $outstring . "XXX" . $test;   # append to string
    $outstring =~ s/([\S ]+)XXX\1/$1/;    # remove overlapping segment
}
$outstring =~ s/\\n/\n/g;                 # change signal \n to real carriage return
print "$outstring\n";                     # print final result
__END__
the men and women merely players;\n
one man in his time
All the world's
their entrances,\nAnd one man
stage,\nAnd all the men and women
They have their exits and their entrances,\n
world's a stage,\nAnd all
in his time plays many parts.
merely players;\nThey have
This kind of calculation is important in assembling DNA sequences from overlapping fragments. (For
difficulties created by sequences containing repetitions, see Problem 1.4.)
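The suffix-prefix overlap idea behind assemble.pl can also be sketched in Python. This is only an illustration under stated assumptions, not the book's program: the function names overlap and assemble are hypothetical, the fragments are a simplified subset without the \n signal, and the sketch assumes that all fragments are distinct and form a single unbranched chain.

```python
def overlap(a, b, min_len=2):
    """Length of the longest suffix of a that is also a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(fragments, min_len=2):
    """Chain fragments by their longest suffix-prefix overlaps."""
    successor = {}          # fragment -> (following fragment, overlap length)
    has_prefix = set()      # fragments that follow some other fragment
    for a in fragments:
        best = 0
        for b in fragments:
            if a == b:      # don't check a fragment against itself
                continue
            n = overlap(a, b, min_len)
            if n > best:
                best = n
                successor[a] = (b, n)
        if a in successor:
            has_prefix.add(successor[a][0])
    starts = [f for f in fragments if f not in has_prefix]
    if len(starts) != 1:
        raise ValueError("fragments do not form a single chain")
    current = starts[0]
    result = current
    seen = {current}
    while current in successor:
        nxt, n = successor[current]
        if nxt in seen:
            raise ValueError("fragments form a cycle")
        result += nxt[n:]   # append only the part beyond the overlap
        seen.add(nxt)
        current = nxt
    return result


fragments = [
    "the men and women merely players;",
    "All the world's",
    "stage, And all the men and women",
    "world's a stage, And all",
]
print(assemble(fragments))
```

Repeated segments break the assumption of a single chain, which is the difficulty the text refers to for sequences containing repetitions.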
Should your programming ambitions go beyond simple tasks, check out the Bioperl Project, a source of freely
available PERL programs and components in the field of bioinformatics (see http://bio.perl.org/).
[1] This section may be skipped on a first reading.
Biological classification and nomenclature
Back to the eighteenth century, when academic life at least was in some respects simpler.
Biological nomenclature is based on the idea that living things are divided into units called species - groups of
similar organisms with a common gene pool. (Why living things should be 'quantized' into discrete species is a
very complicated question.) Linnaeus, a Swedish naturalist, classified living things according to a hierarchy:
Kingdom, Phylum, Class, Order, Family, Genus and Species (see Box). Modern taxonomists have added
additional levels. For identification it generally suffices to specify the binomial: Genus and Species; for
instance Homo sapiens for human or Drosophila melanogaster for fruit fly. Each binomial uniquely specifies a
species that may also be known by a common name; for instance, Bos taurus = cow. Of course, most species
have no common names.
Originally the Linnaean system was only a classification based on observed similarities. With the discovery of
evolution it emerged that the system largely reflects biological ancestry. The question of which similarities truly
reflect common ancestry must now be faced. Characteristics derived from a common ancestor are called
homologous; for instance an eagle's wing and a human's arm. Other apparently similar characteristics may
have arisen independently by convergent evolution; for instance, an eagle's wing and a bee's wing: The most
recent common ancestor of eagles and bees did not have wings. Conversely, truly homologous characters
may have diverged to become very dissimilar in structure and function. The bones of the human middle ear
are homologous to bones in the jaws of primitive fishes; our eustachian tubes are homologues of gill slits. In
most cases experts can distinguish true homologies from similarities resulting from convergent evolution.

Classifications of Human and Fruit fly
            Human        Fruit fly
Kingdom     Animalia     Animalia
Phylum      Chordata     Arthropoda
Class       Mammalia     Insecta
Order       Primata      Diptera
Family      Hominidae    Drosophilidae
Genus       Homo         Drosophila
Species     sapiens      melanogaster
Sequence analysis gives the most unambiguous evidence for the relationships among species. The system
works well for higher organisms, for which sequence analysis and the classical tools of comparative anatomy,
palaeontology and embryology usually give a consistent picture. Classification of microorganisms is more
difficult, partly because it is less obvious how to select the features on which to classify them and partly
because a large amount of lateral gene transfer threatens to overturn the picture entirely.
Ribosomal RNAs turned out to have the essential feature of being present in all organisms, with the right
degree of divergence. (Too much or too little divergence and relationships become invisible.)
On the basis of 16S ribosomal RNAs, C. Woese divided living things most fundamentally into three Domains
(a level above Kingdom in the hierarchy): Bacteria, Archaea and Eukarya (see Fig. 1.2). Bacteria and archaea
are prokaryotes; their cells do not contain nuclei. Bacteria include the typical microorganisms responsible for
many infectious diseases, and, of course, Escherichia coli, the mainstay of molecular biology. Archaea
comprise extreme thermophiles and halophiles, sulphate reducers and methanogens. We ourselves are
Eukarya - organisms containing cells with nuclei, including yeast and all multicellular organisms.
Figure 1.2: Major divisions of living things, derived by C. Woese on the basis of 16S RNA sequences.
A census of the species with sequenced genomes reveals emphasis on bacteria, because of their clinical
importance, and the relative ease of sequencing genomes of prokaryotes. However, fundamentally we may
have more to learn about ourselves from archaea than from bacteria. For despite the obvious differences in
lifestyle, and the absence of a nucleus, archaea are in some ways more closely related on a molecular level to
eukarya than to bacteria. It is also likely that the archaea are the closest living organisms to the root of the tree
of life.

Figure 1.2 shows the deepest level of the tree of life. The Eukarya branch includes animals, plants, fungi and
single-celled organisms. At the ends of the eukarya branch are the metazoa (multicellular animals) (Fig.
1.3). We and our closest relatives are deuterostomes (Fig. 1.4).
Figure 1.3: Phylogenetic tree of metazoa (multicellular animals). Bilaterians include all animals that share a left-
right symmetry of body plan. Protostomes and deuterostomes are two major lineages that separated at an early
stage of evolution, estimated at 670 million years ago. They show very different patterns of embryological
development, including different early cleavage patterns, opposite orientations of the mature gut with respect to the
earliest invagination of the blastula, and the origin of the skeleton from mesoderm (deuterostomes) or ectoderm
(protostomes). Protostomes comprise two subgroups distinguished on the basis of 18S RNA (from the small
ribosomal subunit) and HOX gene sequences. Morphologically, Ecdysozoa have a molting cuticle - a hard outer
layer of organic material. Lophotrochozoa have soft bodies. (Based on Adoutte, A., Balavoine, G., Lartillot, N.,
Lespinet, O., Prud'homme, B., and de Rosa, R. (2000) 'The new animal phylogeny: reliability and implications',
Proceedings of the National Academy of Sciences USA 97, 4453-6.)
Figure 1.4: Phylogenetic tree of vertebrates and our closest relatives. Chordates, including vertebrates, and
echinoderms are all deuterostomes.
Use of sequences to determine phylogenetic relationships
Previous sections have treated sequence databanks and biological relationships. Here are examples of the
application of retrieval of sequences from databanks and sequence comparisons to analysis of biological
relationships.

Example 1.1
Retrieve the amino acid sequence of horse pancreatic ribonuclease.
Use the ExPASy server at the Swiss Institute of Bioinformatics. The URL is
http://www.expasy.ch/cgi-bin/sprot-search-ful. Type in the keywords horse pancreatic ribonuclease followed by
the ENTER key. Select RNP_HORSE and then FASTA format (see Box: FASTA format). This will produce the
following (the first line has been truncated):
>sp|P00674|RNP_HORSE RIBONUCLEASE PANCREATIC (EC 3.1.27.5) (RNASE 1) ...
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEP
LADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTS
QKERHIIVACEGNPYVPVHFDASVEVST
which can be cut and pasted into other programs.
For example, we could retrieve several sequences and align them (see Box: Sequence Alignment). Patterns of
similarity among aligned sequences are useful in assessing closeness of relationships.
FASTA format
A very common format for sequence data is derived from conventions of FASTA, a program for FAST
Alignment by W.R. Pearson. Many programs use FASTA format for reading sequences, or for reporting
results.
A sequence in FASTA format:
• Begins with a single-line description. A > must appear in the first column. The rest of the title line is
arbitrary but should be informative.
• Subsequent lines contain the sequence, one character per residue.
• Use one-letter codes for nucleotides or amino acids specified by the International Union of
Biochemistry and International Union of Pure and Applied Chemistry (IUB/IUPAC).
See http://www.chem.qmw.ac.uk/iupac/misc/naabb.html
and http://www.chem.qmw.ac.uk/iupac/AminoAcid/
Use Sec and U as the three-letter and one-letter codes for selenocysteine:
http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html
• Lines can have different lengths; that is, 'ragged right' margins.
• Most programs will accept lower case letters as amino acid codes.
An example of FASTA format: Bovine glutathione peroxidase
>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
MCAAQRSAAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASLUGTTVRDYTQMNDLQRRLG
PRGLVVLGFPCNQFGHQENAKNEEILNCLKYVRPGGGFEPNFMLFEKCEVNGEKAHPLFAFLREVLPTPS
DDATALMTDPKFITWSPVCRNDVSWNFEKFLVGPDGVPVRRYSRRFLTIDIEPDIETLLSQGASA
The title line contains the following fields:
> is obligatory in column 1
gi|121664 is the geninfo number, an identifier assigned by the US National Center for Biotechnology
Information (NCBI) to every sequence in its ENTREZ databank. The NCBI collects sequences from a variety
of sources, including primary archival data collections and patent applications. Its gi numbers provide a
common and consistent 'umbrella' identifier, superimposed on different conventions of source databases.
When a source database updates an entry, the NCBI creates a new entry with a new gi number if the
changes affect the sequence, but updates and retains its entry if the changes affect only non-sequence
information, such as a literature citation.

sp|P00435 indicates that the source database was SWISS-PROT, and that the accession number of the
entry in SWISS-PROT was P00435.
GSHC_BOVIN GLUTATHIONE PEROXIDASE is the SWISS-PROT identifier of sequence and species,
(GSHC_BOVIN), followed by the name of the molecule.
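The FASTA conventions described in this box are simple enough to read with a few lines of code. The following Python sketch is an illustration only (the book's own examples use PERL); the function names parse_fasta and split_title are hypothetical, and the title parsing assumes the 'tag|identifier|...|entry name' layout shown in the examples above, which not every FASTA file follows.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):            # '>' in column 1 starts a new record
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)             # sequence lines may be ragged
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def split_title(title):
    """Split a title line into database tags, entry name and description."""
    ids, _, description = title.partition(" ")
    fields = ids.split("|")
    # Pair each database tag with its identifier, e.g. gi -> 121664, sp -> P00435.
    tags = dict(zip(fields[0:-1:2], fields[1::2]))
    return tags, fields[-1], description


example = """>sp|P00674|RNP_HORSE RIBONUCLEASE PANCREATIC (EC 3.1.27.5) (RNASE 1) ...
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEP
LADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTS
QKERHIIVACEGNPYVPVHFDASVEVST
"""

records = parse_fasta(example)
title, sequence = records[0]
print(split_title(title))
print(len(sequence), "residues")
```

Joining the sequence lines before any further processing means that the 'ragged right' margins permitted by the format do not matter downstream.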
Sequence alignment
Sequence alignment is the assignment of residue-residue correspondences. We may wish to find:
• a Global match: align all of one sequence with all of the other.
    And.--so,.from.hour.to.hour,.we.ripe.and.ripe
    ||||    |||||||||||||||||||||||||   ||||||
    And.then,.from.hour.to.hour,.we.rot-.and.rot-
This illustrates mismatches, insertions and deletions.
• a Local match: find a region in one sequence that matches a region of the other.
      My.care.is.loss.of.care,.by.old.care.done,
        |||||||||    |||||||||||||   |||||| ||
    Your.care.is.gain.of.care,.by.new.care.won
For local matching, overhangs at the ends are not treated as gaps. In addition to mismatches, seen in
this example, insertions and deletions within the matched region are also possible.
• a Motif match: find matches of a short sequence in one or more regions internal to a long
one. In this case one mismatching character is allowed. Alternatively one could demand
perfect matches, or allow more mismatches or even gaps.
            match
             ||||
    for the watch to babble and to talk is most tolerable
or:
                                   match
                                    ||||
    Any thing that's mended is but patched: virtue that transgresses is
        match                                        match
         ||||                                         ||||
    but patched with sin; and sin that amends is but patched with virtue
• a Multiple alignment: a mutual alignment of many sequences.
    no.sooner.---met.---------but.they.-look'd
    no.sooner.look'd.---------but.they.-lo-v'd
    no.sooner.lo-v'd.---------but.they.-sigh'd
    no.sooner.sigh'd.---------but.they.--asked.one.another.the.reason
    no.sooner.knew.the.reason.but.they.-------------sought.the.remedy
    no.sooner.               .but.they.
The last line shows characters conserved in all sequences in the alignment.
See Chapter 4 for an extended discussion of alignment.
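Of the four kinds of match, the motif match with a fixed number of allowed mismatches is the easiest to express in code. The following Python sketch is an illustration only; the function name motif_hits and its max_mismatches parameter are hypothetical, and gaps are not handled (allowing gaps needs the alignment methods of Chapter 4).

```python
def motif_hits(motif, text, max_mismatches=1):
    """Return 0-based start offsets where motif matches text
    with at most max_mismatches mismatching characters (no gaps)."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(motif, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


line = "but patched with sin; and sin that amends is but patched with virtue"
for start in motif_hits("match", line):
    print(start, line[start:start + len("match")])
```

Setting max_mismatches to 0 demands perfect matches, and larger values allow more mismatches, as described in the text.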
Example 1.2
Determine, from the sequences of pancreatic ribonuclease from horse (Equus caballus), minke whale
(Balaenoptera acutorostrata) and red kangaroo (Macropus rufus), which two of these species are most closely
related.
Knowing that horse and whale are placental mammals and kangaroo is a marsupial, we expect horse and whale
to be the closest pair. Retrieving the three sequences as in the previous example and pasting the following:
>RNP_HORSE
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEP
LADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTS

QKERHIIVACEGNPYVPVHFDASVEVST
>RNP_BALAC
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHES
LEDVKAVCSQKNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTS
QKEKHIIVACEGNPYVPVHFDNSV
>RNP_MACRU
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPK
SVVDAVCHQENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSN
LNKQIIVACEGQYVPVHFDAYV
into the multiple-sequence alignment program CLUSTAL-W (http://www.ebi.ac.uk/clustalw/)
(or alternatively, T-coffee: http://www.ch.embnet.org/software/TCoffee.html)
produces the following:
CLUSTAL W (1.8) multiple sequence alignment

RNP_HORSE       KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
RNP_BALAC       RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
RNP_MACRU       -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                 *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

RNP_HORSE       KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
RNP_BALAC       KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
RNP_MACRU       ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

RNP_HORSE       DASVEVST 128
RNP_BALAC       DNSV---- 124
RNP_MACRU       DAYV---- 122
                *  *
In this table, an * under the sequences indicates a position that is conserved (the same in all sequences), and :
and . indicate positions at which all sequences contain residues of very similar physicochemical character (:), or
somewhat similar physicochemical character (.).

Large patches of the sequences are identical. There are numerous substitutions but only one internal deletion.
By comparing the sequences in pairs, the number of identical residues shared among pairs in this alignment (not
the same as counting *s) is:
Number of identical residues in aligned Ribonuclease A sequences (out of a total of 122–128 residues)
Horse and Minke whale            95
Minke whale and Red kangaroo     82
Horse and Red kangaroo           75
Horse and whale share the most identical residues. The result appears significant, and therefore confirms our
expectations. Warning: Or is the logic really the other way round?
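The pairwise counts in the table can be reproduced directly from the alignment. The following Python sketch is an illustration only: the aligned sequences are copied from the CLUSTAL W output above with the three blocks joined, the function name identical_residues is hypothetical, and a column is counted only when both sequences carry the same residue (two gaps do not count as an identity).

```python
from itertools import combinations

# Aligned sequences from the CLUSTAL W output; '-' marks a gap.
ALIGNED = {
    "Horse": (
        "KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ"
        "KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF"
        "DASVEVST"
    ),
    "Minke whale": (
        "RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ"
        "KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF"
        "DNSV----"
    ),
    "Red kangaroo": (
        "-ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ"
        "ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF"
        "DAYV----"
    ),
}


def identical_residues(a, b, gap="-"):
    """Count aligned columns in which both sequences have the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != gap)


for first, second in combinations(ALIGNED, 2):
    count = identical_residues(ALIGNED[first], ALIGNED[second])
    print(f"{first} and {second}: {count}")
```

A raw identity count of this kind says nothing about statistical significance, which is the point of the warning above.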
Let's try a harder one:
Example 1.3
The two living genera of elephant are represented by the African elephant (Loxodonta africana) and the Indian
(Elephas maximus). It has been possible to sequence the mitochondrial cytochrome b from a specimen of the
Siberian woolly mammoth (Mammuthus primigenius) preserved in the Arctic permafrost. To which modern
elephant is this mammoth more closely related?
Retrieving the sequences and running CLUSTAL-W:
African elephant MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
Siberian mammoth MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
Indian elephant  MTHTRKSHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
                 *** ******:**:**********************************************

African elephant TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
Siberian mammoth TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
Indian elephant  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
                 ************************************************************

African elephant LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPCIGTNLVEWIWGGFSVDKATLNRFFA 180
Siberian mammoth LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
Indian elephant  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
                 ********************************** ***:*********************

African elephant LHFILPFTMIALAGVHLTFLHETGSNNPLGLISDSDKIPFHPYYTIKDFLGLLILILLLL 240
Siberian mammoth LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
Indian elephant  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
                 :********:********************* *************************:**

African elephant LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
Siberian mammoth LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
Indian elephant  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
                 ******************************************************:*****

African elephant LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
Siberian mammoth LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
Indian elephant  LGLMPFLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFS 360
                 **:**:**********************: *** ************* ************

African elephant IILAFLPIAGMIENYLIK 378
Siberian mammoth IILAFLPIAGMIENYLIK 378
Indian elephant  IILAFLPIAGMIENYLIK 378
                 **********:*******
The mammoth and African elephant sequences have 10 mismatches, and the mammoth and Indian elephant
sequences have 14 mismatches. It appears that mammoth is more closely related to African elephants.
However, this result is less satisfying than the previous one. There are fewer differences. Are they significant? (It
is harder to decide whether the differences are significant because we have no preconceived idea of what the
answer should be.)
This example raises a number of questions:
1. We 'know' that African and Indian elephants and mammoths must be close relatives - just look
at them. But could we tell from these sequences alone that they are from closely related
species?
2. Given that the differences are small, do they represent evolutionary divergence arising from
selection, or merely random noise or drift? We need sensitive statistical criteria for judging the
significance of the similarities and differences.
As background to such questions, let us emphasize the distinction between similarity and homology. Similarity
is the observation or measurement of resemblance and difference, independent of the source of the
resemblance. Homology means, specifically, that the sequences and the organisms in which they occur are
descended from a common ancestor, with the implication that the similarities are shared ancestral
characteristics. Similarity of sequences (or of macroscopic biological characters) is observable in data
collectable now, and involves no historical hypotheses. In contrast, assertions of homology are statements of
historical events that are almost always unobservable. Homology must be an inference from observations of
similarity. Only in a few special cases is homology directly observable; for instance in family pedigrees

showing unusual phenotypes such as the Hapsburg lip, or in laboratory populations, or in clinical studies that
follow the course of viral infections at the sequence level in individual patients.
The assertion that the cytochromes b from African and Indian elephants and mammoths are homologous
means that there was a common ancestor, presumably containing a unique cytochrome b, that by alternative
mutations gave rise to the proteins of mammoths and modern elephants. Does the very high degree of
similarity of the sequences justify the conclusion that they are homologous; or are there other explanations?
• It might be that a functional cytochrome b requires so many conserved residues that cytochromes b
from all animals are as similar to one another as the elephant and mammoth proteins are. We
can test this by looking at cytochrome b sequences from other species. The result is that
cytochromes b from other animals differ substantially from those of elephants and mammoths.
• A second possibility is that there are special requirements for a cytochrome b to function well in an
elephant-like animal, that the three cytochrome b sequences started out from independent
ancestors, and that common selective pressures forced them to become similar. (Remember that
we are asking what can be deduced from cytochrome b sequences alone.)
• The mammoth may be more closely related to the African elephant, but since the time of the last
common ancestor the cytochrome b sequence of the Indian elephant has evolved faster than that
of the African elephant or the mammoth, accumulating more mutations.
• Still a fourth hypothesis is that all common ancestors of elephants and mammoths had very
dissimilar cytochromes b, but that living elephants and mammoths gained a common gene by
transfer from an unrelated organism via a virus.
Suppose, however, that we take the similarity of the elephant and mammoth sequences to be high enough to
be evidence of homology. What then about the ribonuclease sequences in the previous example?
Are the larger differences among the pancreatic ribonucleases of horse, whale and kangaroo evidence that
they are not homologues?
How can we answer these questions? Specialists have undertaken careful calibrations of sequence similarities
and divergences, among many proteins from many species for which the taxonomic relationships have been
worked out by classical methods. In the example of pancreatic ribonucleases, the reasoning from similarity to
homology is justified. The question of whether mammoths are closer to African or Indian elephants is still too
close to call, even using all available anatomical and sequence evidence. Analyses of sequence similarities
are now sufficiently well established that they are considered the most reliable methods for establishing
phylogenetic relationships, even though sometimes - as in the elephant example - the results may not be
significant, while in other cases they even give incorrect answers. There are a lot of data available, effective
tools for retrieving what is necessary to bring to bear on a specific question, and powerful analytic tools. None
of this replaces the need for thoughtful scientific judgement.
Use of SINES and LINES to derive phylogenetic relationships
Major problems with inferring phylogenies from comparisons of gene and protein sequences are (1) the wide
range of variation of similarity, which may dip below statistical significance, and (2) the effects of different rates
of evolution along different branches of the evolutionary tree. In many cases, even if sequence similarities
confidently establish relationships, it may be impossible to decide the order in which sets of taxa have split.
The phylogeneticist's dream - features that have 'all-or-none' character, and the appearance of which is
irreversible so that the order of branching events can be decided - is in some cases afforded by certain non-
coding sequences in genomes.
SINES and LINES (Short and Long Interspersed Nuclear Elements) are repetitive non-coding sequences that
form large fractions of eukaryotic genomes - at least 30% of human chromosomal DNA, and over 50% of
some higher plant genomes. Typically, SINES are ~70–500 base pairs long, and up to 10^6 copies may appear.
LINES may be up to 7000 base pairs long, and up to 10^5 copies may appear. SINES enter the genome by
reverse transcription of RNA. Most SINES contain a 5' region homologous to tRNA, a central region unrelated
to tRNA, and a 3' AT-rich region.
Features of SINES that make them useful for phylogenetic studies include:
• A SINE is either present or absent. Presence of a SINE at any particular position is a property
that entails no complicated and variable measure of similarity.
• SINES are inserted at random in the non-coding portion of a genome. Therefore, appearance
of similar SINES at the same locus in two species implies that the species share a common
ancestor in which the insertion event occurred. No analogue of convergent evolution muddies
this picture, because there is no selection for the site of insertion.
• SINE insertion appears to be irreversible: no mechanism for loss of SINES is known, other
than rare large-scale deletions that include the SINE. Therefore, if two species share a SINE
at a common locus, absence of this SINE in a third species implies that the first two species
must be more closely related to each other than either is to the third.
• Not only do SINES show relationships, they imply which species came first. The last common
ancestor of species containing a common SINE must have come after the last common
ancestor linking these species and another that lacks this SINE.
N. Okada and colleagues applied SINE sequences to questions of phylogeny.
Whales, like Australians, are mammals that have adopted an aquatic lifestyle. But what - in the case of the
whales - are their closest land-based relatives? Classical palaeontology linked the order Cetacea - comprising
whales, dolphins and porpoises - with the order Artiodactyla - even-toed ungulates (including for instance
cattle). Cetaceans were once thought to have diverged before the common ancestor of the three extant
Artiodactyl suborders: Suiformes (pigs), Tylopoda (including camels and llamas), and Ruminantia (including
deer, cattle, goats, sheep, antelopes, giraffes, etc.). To place cetaceans properly among these groups, several
studies were carried out with DNA sequences. Comparisons of mitochondrial DNA, and genes for pancreatic
ribonuclease, γ-fibrinogen, and other proteins, suggested that the closest relatives of the whales are
hippopotamuses, and that cetaceans and hippopotamuses form a separate group within the artiodactyls,
most closely related to the Ruminantia (see Weblem 1.7).
Analysis of SINES confirms this relationship. Several SINES are common to Ruminantia, hippopotamuses and
cetaceans. Four SINES appear in hippopotamuses and cetaceans only. These observations imply the
phylogenetic tree shown in Figure 1.5, in which the SINE insertion events are marked. [Note added in proof:
New fossils of land-based ancestors of whales confirm the link between whales and artiodactyls. This is a
good example of the complementarity between molecular and palaeontological methods: DNA sequence
analysis can specify relationships among living species quite precisely, but only with fossils can one
investigate the relationships among their extinct ancestors.]
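The all-or-none reasoning with SINES can be sketched as a computation on presence/absence data. The following Python fragment is an illustration only, not the analysis of Okada and colleagues: the locus labels locus_A and locus_B, the single representative species per group, and the function names compatible and closest_relatives are all assumptions made for the example. Each insertion is represented by the set of species that carry it.

```python
# Species carrying an insertion at each locus (simplified, hypothetical data:
# 'cow' stands for Ruminantia; locus_A and locus_B are made-up labels).
insertions = {
    "locus_A": {"cow", "hippopotamus", "whale"},   # shared by ruminants, hippos, cetaceans
    "locus_B": {"hippopotamus", "whale"},          # hippos and cetaceans only
    "gpi": {"pig", "peccary"},                     # pigs and peccaries only
}


def compatible(groups):
    """True if every pair of groups is nested or disjoint,
    i.e. all insertions fit on a single tree without conflict."""
    sets = list(groups)
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if a & b and not (a <= b or b <= a):
                return False
    return True


def closest_relatives(species, insertions):
    """Members of the smallest insertion-defined group containing the species."""
    groups = [members for members in insertions.values() if species in members]
    if not groups:
        return set()
    return set(min(groups, key=len)) - {species}


print(compatible(insertions.values()))
print(closest_relatives("whale", insertions))
print(closest_relatives("cow", insertions))
```

Because insertions are taken to be irreversible, a smaller group nested inside a larger one corresponds to a later insertion event, which is how the order of branching is read off.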
Figure 1.5: Phylogenetic relationships among cetaceans and other artiodactyl subgroups, derived from analysis
of SINE sequences. Small arrowheads mark insertion events. Each arrowhead indicates the presence of a
particular SINE or LINE at a specific locus in all species to the right of the arrowhead. Lower case letters identify
loci, upper-case letters identify sequence patterns. For instance, the ARE2 pattern sequence appears only in pigs,
at the ino locus. The ARE pattern appears twice in the pig genome, at loci gpi and pro, and in the peccary genome
at the same loci. The ARE insertion occurred in a species ancestral to pigs and peccaries but to no other species in
the diagram. This implies that pigs and peccaries are more closely related to each other than to any of the other
animals studied. (From Nikaido, M., Rooney, A.P., and Okada, N. (1999) 'Phylogenetic relationships among
cetartiodactyls based on insertions of short and long interspersed elements: Hippopotamuses are the closest
extant relatives of whales', Proceedings of the National Academy of Sciences USA 96, 10261-6. (©1999, National
Academy of Sciences, USA))
Searching for similar sequences in databases: PSI-BLAST
A common theme of the examples we have treated is the search of a database for items similar to a probe.
For instance, if you determine the sequence of a new gene, or identify within the human genome a gene
responsible for some disease, you will wish to determine whether related genes appear in other species. The
ideal method is both sensitive - that is, it picks up even very distant relationships - and selective - that is, all
the relationships that it reports are true.

Database search methods involve a tradeoff between sensitivity and selectivity. Does the method find all or
most of the 'hits' that are actually present, or does it miss a large fraction? Conversely, how many of the 'hits' it
reports are incorrect? Suppose a database contains 1000 globin sequences. Suppose a search of this
database for globins reported 900 results, 700 of which were really globin sequences and 200 of which were
not. This result would be said to have 300 false negatives (misses) and 200 false positives. Lowering a
tolerance threshold will reduce the number of false negatives but increase the number of false positives. Often one is willing to
work with low thresholds to make sure of not missing anything that might be important; but this requires
detailed examination of the results to eliminate the resulting false positives.
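The arithmetic of the globin example can be written out explicitly. The following Python sketch is an illustration only; the function name search_quality is hypothetical, and it assumes that sensitivity is measured as the fraction of true members that are found and selectivity as the fraction of reported hits that are true.

```python
def search_quality(total_true, reported, reported_true):
    """Summarize a database search from three counts:
    total_true    - sequences in the database that really belong to the family
    reported      - hits returned by the search
    reported_true - returned hits that really belong to the family
    """
    return {
        "false_negatives": total_true - reported_true,   # misses
        "false_positives": reported - reported_true,     # incorrect hits
        "sensitivity": reported_true / total_true,
        "selectivity": reported_true / reported,
    }


# The globin example: 1000 globins in the database, 900 hits, 700 of them real.
quality = search_quality(total_true=1000, reported=900, reported_true=700)
for name, value in quality.items():
    print(name, value)
```

A lower threshold changes the counts in opposite directions: reported_true rises towards total_true, but reported rises faster, so the false positives grow.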
A powerful tool for searching sequence databases with a probe sequence is PSI-BLAST, from the US National
Center for Biotechnology Information (NCBI). PSI-BLAST stands for 'Position-Specific Iterated Basic
Local Alignment Search Tool'. An earlier program, BLAST, worked by identifying local regions of similarity
without gaps and then piecing them together. The PSI in PSI-BLAST refers to enhancements that identify
patterns within the sequences at preliminary stages of the database search, and then progressively refine
them. Recognition of conserved patterns can sharpen both the selectivity and sensitivity of the search. PSI-
BLAST involves a repetitive (or iterative) process, as the emergent pattern becomes better defined in
successive stages of the search.
Example 1.4
Homologues of the human PAX-6 gene. PAX-6 genes control eye development in a widely divergent set of
species (see Box, p. 33). The human PAX-6 gene encodes the protein appearing in SWISS-PROT entry
P26367. To run PSI-BLAST, go to the following URL: http://www.ncbi.nlm.nih.gov/blast/psiblast.cgi.
Enter the sequence, and use the default options for selections of the database to search, and the similarity
matrix used.
The program returns a list of entries similar to the probe, sorted in decreasing order of statistical significance.
(Extracts from the response are shown in the Box, Results of PSI-BLAST search for human PAX-6 protein.) A
typical line appears as follows:
pir||I45557 eyeless, long form - fruit fly (Drosophila melano... 255 7e-67
The first item on the line is the database and corresponding entry number (separated by ||), in this case the PIR
(Protein Information Resource) entry I45557. It is the Drosophila homologue eyeless. The number 255 is a
score for the match detected, and the significance of this match is measured by E = 7 × 10^-67. E is related to the
probability that the observed degree of similarity could have arisen by chance: E is the number of sequences that
would be expected to match as well as or better than the one being considered, if the same database were probed
with random sequences. E = 7 × 10^-67 means that it is extremely unlikely that even one random sequence would
match as well as the Drosophila homologue. Values of E below about 0.05 would be considered significant; at
least they might be worth considering. For borderline cases, you would ask: are the mismatches conservative? Is
there any pattern or are the matches and mismatches distributed randomly through the sequences? There is an
elusive concept, the texture of an alignment, that you will become sensitive to.
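The score and E-value can be pulled out of a summary line such as the one above with a few lines of code. The following Python sketch is an illustration only (the book's own examples use PERL); the function name parse_hit is hypothetical, and it assumes the layout shown above: an identifier of the form database||entry, a description, then the score and the E-value as the last two fields. Other identifier styles, such as emb|...|, would need different handling.

```python
def parse_hit(line):
    """Split one PSI-BLAST summary line into its parts."""
    fields = line.split()
    identifier = fields[0]
    description = " ".join(fields[1:-2])
    score = int(fields[-2])
    e_text = fields[-1]
    if e_text.startswith("e"):      # assumption: 'e-100' abbreviates '1e-100'
        e_text = "1" + e_text
    e_value = float(e_text)
    database, _, entry = identifier.partition("||")
    return {
        "database": database,
        "entry": entry,
        "description": description,
        "score": score,
        "e_value": e_value,
    }


hit = parse_hit("pir||I45557 eyeless, long form - fruit fly (Drosophila melano... 255 7e-67")
print(hit)
print("significant" if hit["e_value"] < 0.05 else "borderline or not significant")
```

The 0.05 cut-off is the rough guide given in the text, not a hard rule; borderline hits still call for inspection of the alignment itself.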
Note that if there are many sequences in the databank that are very similar to the probe sequence, they will head
the list. In this case, there are many very similar PAX genes in other mammals. You may have to scan far down
the list to find a distant relative that you consider interesting.
In fact, the program has matched only a portion of the sequences. The full alignment is shown in the Box,
Complete pairwise sequence alignment of human PAX-6 protein and Drosophila melanogaster eyeless. (See
Exercise 1.5.)
Example 1.5
What species contain homologues of human PAX-6 detectable by PSI-BLAST?
PSI-BLAST reports the species in which the identified sequences occur (see Box, Results of PSI-BLAST search
for human PAX-6 protein). These appear, embedded in the text of the output, in square brackets; for instance:
emb|CAA56038.1| (X79493) transcription factor [Drosophila melanogaster]
(In the section reporting E-values, the species names may be truncated.)
The following PERL program extracts species names from the PSI-BLAST output.
#!/usr/bin/perl
#extract species from psiblast output
# Method:
# For each line of input, check for a pattern of form [Drosophila melanogaster]
# Use each pattern found as the index in an associative array
# The value corresponding to this index is irrelevant
# By using an associative array, subsequent instances of the same
# species will overwrite the first instance, keeping only a unique set
# After processing of input complete, sort results and print.
while (<>) {                                # read line of input
    if (/\[([A-Z][a-z]+ [a-z]+)\]/) {       # select lines containing strings of form
                                            # [Drosophila melanogaster]
        $species{$1} = 1;                   # make or overwrite entry in
    }                                       # associative array
}
foreach (sort(keys(%species))) {            # in alphabetical order,
    print "$_\n";                           # print species names
}
There are 52 species found (see Box: Species recognized by PSI-BLAST 'hits' to probe sequence human PAX-6).
The program makes use of PERL's rich pattern recognition resources to search for character strings of the form
[Drosophila melanogaster]. We want to specify the following pattern:
• an opening square bracket,
• followed by a word beginning with an upper case letter followed by a variable number of lower
case letters,
• then a space between words,
• then a word all in lower case letters,
• then a closing square bracket.
This kind of pattern is called a regular expression and appears in the PERL program in the following form:
\[([A-Z][a-z]+ [a-z]+)\]
Building blocks of the pattern specify ranges of characters:
• [A-Z] = any letter in the range A, B, C,... Z
• [a-z] = any letter in the range a, b, c,... z
We can specify repetitions:
• [A-Z] = one upper case letter
• [a-z]+ = one or more lower case letters
and combine the results:
[A-Z][a-z]+ [a-z]+ = an upper case letter followed by one or more lower case letters (the genus
name), followed by a blank, followed by one or more lower case letters (the species name).
Enclosing these in parentheses: ([A-Z][a-z]+ [a-z]+) tells PERL to save the material that matched the
pattern for future reference. In PERL this matched material is designated by the variable $1. Thus if the input line
contained [Drosophila melanogaster], the statement
$species{$1} = 1;
would effectively be:
$species{"Drosophila melanogaster"} = 1;
Finally, we want to include the brackets surrounding the genus and species name, but brackets signify character
ranges. Therefore, we must precede the brackets by backslashes: \[...\], to give the final pattern:
\[([A-Z][a-z]+ [a-z]+)\]
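To see the finished pattern at work, the following sketch applies it to three lines taken from the Box, Results of PSI-BLAST search for human PAX-6 protein. The subroutine name species_in, the array @lines and the printed messages are assumptions made for illustration; the pattern itself, with the backslashes before the square brackets, is the one built up above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the species name found in a line, or undef if the pattern does not match.
sub species_in {
    my ($line) = @_;
    return $line =~ /\[([A-Z][a-z]+ [a-z]+)\]/ ? $1 : undef;
}

my @lines = (
    'emb|CAA56038.1| (X79493) transcription factor [Drosophila melanogaster]',
    'dbj|BAA20936.1| (AB002408) mdkPax-6 [Oryzias sp.]',
    'pir||A41644 homeotic protein aniridia - human',
);

foreach my $line (@lines) {
    my $name = species_in($line);
    if (defined $name) {
        print "matched: $name\n";
    } else {
        print "no match: $line\n";
    }
}
```

Only the first line matches. The second shows a limitation of the pattern: the full stop in [Oryzias sp.] is not a lower case letter, so the entry is skipped. The third carries no bracketed species name at all, as is usual for PIR-style entries.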
The use of the associative array to retain only a unique set of species is another instructive aspect of the
program. Recall that an associative array is a generalization of an ordinary array or vector, in which the elements
are not indexed by integers but by arbitrary strings. A second reference to an associative array with a previously
encountered index string may possibly change the value in the array but not the list of index strings. In this case
we do not care about the value but just use the index strings to compile a unique list of species detected.
Multiple references to the same species will merely overwrite the first reference, not make a repetitive list.
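The uniqueness property can be seen in a small sketch. Here the value stored is a count of occurrences rather than the constant 1 used in the program; that variation, and the sample list @seen, are assumptions made purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Species names as they might be extracted from the output, with repeats.
my @seen = ('Homo sapiens', 'Xenopus laevis', 'Homo sapiens',
            'Cynops pyrrhogaster', 'Xenopus laevis', 'Homo sapiens');

my %species;
$species{$_}++ foreach @seen;    # a repeated index string updates its value; no new entry is made

foreach my $name (sort keys %species) {
    print "$name\t$species{$name}\n";
}
print scalar(keys %species), " distinct species\n";
```

Six names go in but only three index strings come out. Counting the keys in this way is also how a total such as the 52 species reported above could be obtained.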
Et in terra PAX hominibus, muscisque ...
The eyes of the human, fly and octopus are very different in structure. Conventional wisdom, noting the
immense selective advantage conferred by the ability to see, held that eyes arose independently in different
phyla. It therefore came as a great surprise that a gene controlling human eye development has a
homologue governing eye development in Drosophila.
The PAX-6 gene was first cloned in the mouse and human. It is a master regulatory gene, controlling a
complex cascade of events in eye development. Mutations in the human gene cause the clinical condition
aniridia, a developmental defect in which the iris of the eye is absent or deformed. The PAX-6 homologue in
Drosophila - called the eyeless gene - has a similar function of control over eye development. Flies mutated
in this gene develop without eyes; conversely, expression of this gene in a fly's wing, leg, or antenna
produces ectopic (= out of place) eyes. (The Drosophila eyeless mutant was first described in 1915. Little did
anyone then suspect a relation to a mammalian gene.)
Not only are the insect and mammalian genes similar in sequence, they are so closely related that their
activity crosses species boundaries. Expression of the mouse PAX-6 gene in the fly causes ectopic eye
development just as expression of the fly's own eyeless gene does.
PAX-6 has homologues in other phyla, including flatworms, ascidians, sea urchins and nematodes. The
observation that rhodopsins - a family of proteins containing retinal as a common chromophore - function as
light-sensitive pigments in different phyla is supporting evidence for a common origin of different
photoreceptor systems. The genuine structural differences in the macroscopic anatomy of different eyes
reflect the divergence and independent development of higher-order structure.
Results of PSI-BLAST search for human PAX-6 protein
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro
A. Schaeffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David
J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= sp|P26367|PAX6_HUMAN PAIRED BOX PROTEIN PAX-6
(OCULORHOMBIN) (ANIRIDIA, TYPE II PROTEIN) - Homo sapiens (Human).
(422 letters)
Sequences with E-value BETTER than threshold
Score E
Sequences producing significant alignments: (bits) Value
ref|NP_037133.1| paired box homeotic gene 6 >gi|2495314|sp|P7... 730 0.0
ref|NP_000271.1| paired box gene 6, isoform a >gi|417450|sp|P... 730 0.0
pir||A41644 homeotic protein aniridia - human 728 0.0
gb|AAA59962.1| (M77844) oculorhombin [Homo sapiens] >gi|18935... 728 0.0
prf||1902328A PAX6 gene [Homo sapiens] 724 0.0
emb|CAB05885.1| (Z83307) PAX6 [Homo sapiens] 723 0.0
ref|NP_001595.2| paired box gene 6, isoform b 721 0.0
ref|NP_038655.1| paired box gene 6 >gi|543296|pir||S42234 pai... 721 0.0
dbj|BAA23004.1| (D87837) PAX6 protein [Gallus gallus] 717 0.0
gb|AAF73271.1|AF154555_1 (AF154555) paired domain transcripti... 714 0.0
sp|P55864|PAX6_XENLA PAIRED BOX PROTEIN PAX-6 >gi|1685056|gb|... 713 0.0
gb|AAB36681.1| (U76386) paired-type homeodomain Pax-6 protein... 712 0.0
gb|AAB05932.1| (U64513) Xpax6 [Xenopus laevis] 712 0.0
sp|P47238|PAX6_COTJA PAIRED BOX PROTEIN PAX-6 (PAX-QNR) >gi|4... 710 0.0
dbj|BAA24025.1| (D88741) PAX6 SL [Cynops pyrrhogaster] 707 0.0
gb|AAD50903.1|AF169414_1 (AF169414) paired-box transcription ... 706 0.0
dbj|BAA13680.1| (D88737) Xenopus Pax-6 long [Xenopus laevis] 703 0.0
sp|P26630|PAX6_BRARE PAIRED BOX PROTEIN PAX[ZF-A] (PAX-6) >gi... 699 0.0
dbj|BAA24024.1| (D88741) PAX6 LL [Cynops pyrrhogaster] 697 0.0
gb|AAD50901.1|AF169412_1 (AF169412) paired-box transcription ... 696 0.0
emb|CAA68835.1| (Y07546) PAX-6 protein [Astyanax mexicanus] >... 693 0.0
pir||I50108 paired box transcription factor Pax-6 - zebra fis... 689 0.0
sp|O73917|PAX6_ORYLA PAIRED BOX PROTEIN PAX-6 >gi|3115324|emb... 686 0.0
gb|AAC96095.1| (AF061252) Pax-family transcription factor 6.2 ... 684 0.0
emb|CAA68837.1| (Y07547) PAX-6 protein [Astyanax mexicanus] 683 0.0
emb|CAA16493.1| (AL021531) PAX6 [Fugu rubripes] 646 0.0
gb|AAF73273.1|AF154557_1 (AF154557) paired domain transcripti... 609 e-173
dbj|BAA24023.1| (D88741) PAX6 SS [Cynops pyrrhogaster] 609 e-173
prf||1717390A pax gene [Danio rerio] 609 e-173
gb|AAD50904.1|AF169415_1 (AF169415) paired-box transcription ... 605 e-172
dbj|BAA13681.1| (D88738) Xenopus Pax-6 short [Xenopus laevis] 600 e-171
dbj|BAA24022.1| (D88741) PAX6 LS [Cynops pyrrhogaster] 599 e-170
gb|AAD50902.1|AF169413_1 (AF169413) paired-box transcription ... 595 e-169
gb|AAF73270.1| (AF154554) paired domain transcription factor ... 594 e-169
gb|AAB07733.1| (U67887) XLPAX6 [Xenopus laevis] 592 e-168
gb|AAA40109.1| (M77842) oculorhombin [Mus musculus] 455 e-127
emb|CAA11364.1| (AJ223440) Pax6 [Branchiostoma floridae] 440 e-122
gb|AAB40616.1| (U59830) Pax-6 [Loligo opalescens] 437 e-122
pir||A57374 paired box transcription factor Pax-6 - sea urchi... 437 e-121
pir||JC6130 paired box transcription factor Pax-6 - Ribbonwor... 396 e-109
gb|AAD31712.1|AF134350_1 (AF134350) transcription factor Toy ... 380 e-104
gb|AAB36534.1| (U77178) paired box homeodomain protein TPAX6 ... 377 e-104
emb|CAA71094.1| (Y09975) Pax-6 [Phallusia mammilata] 342 4e-93
dbj|BAA20936.1| (AB002408) mdkPax-6 [Oryzias sp.] 338 6e-92
pir||S60252 paired box transcription factor vab-3 - Caenorhab... 336 2e-91
pir||T20900 hypothetical protein F14F3.1 - Caenorhabditis ele... 336 2e-91
pir||S36166 paired box transcription factor Pax-6 - rat (frag... 335 5e-91
sp|P47237|PAX6_CHICK PAIRED BOX PROTEIN PAX-6 >gi|2147404|pir... 333 2e-90
dbj|BAA75672.1| (AB017632) DjPax-6 [Dugesia japonica] 329 4e-89
gb|AAF64460.1|AF241310_1 (AF241310) transcription factor PaxB... 290 2e-77
gb|AAF73274.1| (AF154558) paired domain transcription factor ... 287 1e-76
pdb|6PAX|A Chain A, Crystal Structure Of The Human Pax-6 Pair... 264 1e-69
pir||C41061 paired box homolog Pax6 - mouse (fragment) 261 9e-69
gb|AAC18658.1| (U73855) Pax6 [Bos taurus] 259 4e-68
pir||I45557 eyeless, long form - fruit fly (Drosophila melano... 255 7e-67
gb|AAF59318.1| (AE003843) ey gene product [Drosophila melanog... 255 7e-67
... many additional "hits" deleted ...
... two selected alignments follow ...
----------------------------------------------------------------------------
Alignments
>ref|NP_037133.1| paired box homeotic gene 6
sp|P70601|PAX6_RAT PAIRED BOX PROTEIN PAX-6
gb|AAB09042.1| (U69644) paired-box/homeobox protein [Rattus norvegicus]
Length = 422
Score = 730 bits (1865), Expect = 0.0
Identities = 362/422 (85%), Positives = 362/422 (85%)
Query: 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY
Sbjct: 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
Query: 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV
Sbjct: 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
Query: 121 SSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTXXXXXXX 180
SSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPT
Sbjct: 121 SSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQ 180
Query: 181 XXXXXNTNSISSNGEDSDEAQMXXXXXXXXXXNRTSFTQEQIEALEKEFERTHYPDVFAR 240
NTNSISSNGEDSDEAQM NRTSFTQEQIEALEKEFERTHYPDVFAR
Sbjct: 181 EGQGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFAR 240
Query: 241 ERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNXXXXXXXXXXXXXXVYQPIP 300
ERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASN VYQPIP
Sbjct: 241 ERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIP 300
Query: 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT
Sbjct: 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
Query: 361 SPSVNGRSYDTYTPPHMQTHMNSQPMXXXXXXXXXLIXXXXXXXXXXXXXXXDMSQYWPR 420
SPSVNGRSYDTYTPPHMQTHMNSQPM LI DMSQYWPR
Sbjct: 361 SPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPR 420
Query: 421 LQ 422
LQ
Sbjct: 421 LQ 422
>pir||I45557 eyeless, long form - fruit fly (Drosophila melanogaster)
emb|CAA56038.1| (X79493) transcription factor [Drosophila melanogaster]
Length = 838
Score = 255 bits (644), Expect = 7e-67
Identities = 124/132 (93%), Positives = 128/132 (96%)
Query: 5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG 64
HSGVNQLGGVFV GRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG
Sbjct: 38 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG 97
Query: 65 SIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSIN 124
SIRPRAIGGSKPRVAT EVVSKI+QYKRECPSIFAWEIRDRLL E VCTNDNIPSVSSIN
Sbjct: 98 SIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSIN 157
Query: 125 RVLRNLASEKQQ 136
RVLRNLA++K+Q
Sbjct: 158 RVLRNLAAQKEQ 169
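The 'Identities' figure in each alignment header is the number of aligned positions at which query and subject carry the same residue. A minimal sketch of that count follows, applied to the last block of the eyeless alignment above. The subroutine name count_identities is hypothetical, and the sketch assumes two aligned strings of equal length in which a gap is written as '-' and is never counted as an identity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count positions at which two aligned sequences carry the same residue.
sub count_identities {
    my ($s1, $s2) = @_;
    die "aligned strings must have equal length\n" unless length($s1) == length($s2);
    my $same = 0;
    for my $i (0 .. length($s1) - 1) {
        my $x = substr($s1, $i, 1);
        my $y = substr($s2, $i, 1);
        $same++ if $x eq $y && $x ne '-';    # gap-to-gap columns are not identities
    }
    return $same;
}

my $query = 'RVLRNLASEKQQ';    # Query residues 125-136
my $sbjct = 'RVLRNLAAQKEQ';    # Sbjct residues 158-169
my $n = count_identities($query, $sbjct);
printf "%d/%d identical (%.0f%%)\n", $n, length($query), 100 * $n / length($query);
```

For this block the count is 9 of 12. The three mismatched positions are those marked + in the middle line of the BLAST output; they contribute to 'Positives' but not to 'Identities'.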
Complete pairwise sequence alignment of human PAX-6 protein and Drosophila melanogaster
eyeless
PAX6_human ---------------------------------MQNSHSGVNQLGGVFVNGRPLPDSTRQ 27
eyeless MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKGHSGVNQLGGVFVGGRPLPDSTRQ 60
::.************.**********
PAX6_human KIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKI 87
eyeless KIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKI 120
*****************************************************.******
PAX6_human AQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQ----------- 136
eyeless SQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTS 180
:*******************.*.*********************::*:*
PAX6_human ------------MG-------------------------------------------ADG 141
eyeless AGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGATNSGEGSEQEA 240
:* :.
PAX6_human MYDKLRMLNGQTGS--------------------WGTRP--------------------- 160
eyeless IYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSW 300
:*:***:** * .: ***.
PAX6_human -------GWYPG-------TSVP------------------------------GQP---- 172
eyeless PPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIKSLASIGHQRNCP 360
.*** :*.* *:
PAX6_human ----------TQDGCQQQEGG---GENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQ 219
eyeless VATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTN 420
** *.:* * ***:*. :** :::: * ** *************:
PAX6_human EQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQAS 279
eyeless DQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPN 480
:**::********************.**.**************************** ..
PAX6_human NTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLG----------------------- 316
eyeless STGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPT 540
.* : . **: :*: *:. :. **: *** *
PAX6_human ------------------------------------------------------------
eyeless LGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHA 600
PAX6_human ----------------------------RTDTALTNTYSALPPMPSFTMANNLPMQPPVP 348
eyeless QGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSP 660
:* :::::*.*:.*:***. : *: ** *
PAX6_human S-------QTSSYSCMLPTSP---------------------------------SVNGRS 368
eyeless IPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSG 720
:* *.* :. * * .* .
PAX6_human YDTYTP-----------------------------PHMQTHMNSQP----------MGTS 389
eyeless YEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHS 780
*:. :. ** :* * :. *
PAX6_human GTTSTGLISPGVS----------------VPVQVPGS----EPDMSQYWPRLQ----- 422
eyeless SGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV 838
. *:. ***.** .* ...*: *. .*::.
Species recognized by PSI-BLAST 'hits' to probe sequence human PAX-6
Acropora millepora           Herdmania curvata
Archegozetes longisetosus    Homo sapiens
Astyanax mexicanus           Hydra littoralis
Bos taurus                   Hydra magnipapillata
Branchiostoma floridae       Hydra vulgaris
Branchiostoma lanceolatum    Ilyanassa obsoleta
Caenorhabditis elegans       Lampetra japonica
Canis familiaris             Lineus sanguineus
Carassius auratus            Loligo opalescens
Chrysaora quinquecirrha      Mesocricetus auratus
Ciona intestinalis           Mus musculus
Coturnix coturnix            Notophthalmus viridescens
Cynops pyrrhogaster          Oryzias latipes
Danio rerio                  Paracentrotus lividus
Drosophila mauritiana        Petromyzon marinus
Drosophila melanogaster      Phallusia mammilata
Drosophila sechellia         Podocoryne carnea
Drosophila simulans          Ptychodera flava
Drosophila virilis           Rattus norvegicus
Dugesia japonica             Schistosoma mansoni
Ephydatia fluviatilis        Strongylocentrotus purpuratus
Fugu rubripes                Sus scrofa
Gallus gallus                Takifugu rubripes
Girardia tigrina             Tribolium castaneum
Halocynthia roretzi          Triturus alpestris
Helobdella triserialis       Xenopus laevis
Introduction to protein structure
With protein structures we leave behind the one-dimensional world of nucleotide and amino acid sequences
and enter the spatial world of molecular structures. Some of the facilities for archiving and retrieving molecular
biological information survive this change pretty well intact, some must be substantially altered, and others do
not make it at all.
Biochemically, proteins play a variety of roles in life processes: there are structural proteins (e.g. viral coat
proteins, the horny outer layer of human and animal skin, and proteins of the cytoskeleton); proteins that
catalyse chemical reactions (the enzymes); transport and storage proteins (haemoglobin); regulatory proteins,
including hormones and receptor/signal transduction proteins; proteins that control genetic transcription; and
proteins involved in recognition, including cell adhesion molecules, and antibodies and other proteins of the
immune system.
Proteins are large molecules. In many cases only a small part of the structure - an active site - is functional,
the rest existing only to create and fix the spatial relationship among the active site residues. Proteins evolve
by structural changes produced by mutations in the amino acid sequence. The primary paradigm of evolution
is that changes in DNA generate variability in protein structure and function, which affect the reproductive
fitness of the individual, on which natural selection acts.
Approximately 15 000 protein structures are now known. Most were determined by X-ray crystallography or
Nuclear Magnetic Resonance (NMR). From these we have derived our understanding both of the functions of
individual proteins - for example, the chemical explanation of catalytic activity of enzymes - and of the general
principles of protein structure and folding.
Chemically, protein molecules are long polymers typically containing several thousand atoms, composed of a
uniform repetitive backbone (or main chain) with a particular sidechain attached to each residue (see Fig. 1.6).
The amino acid sequence of a protein records the succession of sidechains.
Figure 1.6: The polypeptide chains of proteins have a mainchain of constant structure and sidechains that vary in
sequence. Here Si-1, Si and Si+1 represent sidechains. The sidechains may be chosen, independently, from the set
of 20 standard amino acids. It is the sequence of the sidechains that gives each protein its individual structural and
functional characteristics.
The polypeptide chain folds into a curve in space; the course of the chain defines a 'folding pattern'. Proteins
show a great variety of folding patterns. Underlying these are a number of common structural features. These
include the recurrence of explicit structural paradigms - for example, α-helices and β-sheets (Fig. 1.7) - and
common principles or features such as the dense packing of the atoms in protein interiors. Folding may be
thought of as a kind of intramolecular condensation or crystallization. (See Chapter 5.)
Figure 1.7: Standard secondary structures of proteins. (a) α-helix. (b) β-sheet. Broken lines indicate H-bonds. (b)
illustrates a parallel β-sheet, in which all strands point in the same direction. Antiparallel β-sheets, in which all pairs
of adjacent strands point in opposite directions, are also common. In fact, β-sheets can be formed by any
combination of parallel and antiparallel strands.
The hierarchical nature of protein architecture
The Danish protein chemist K.U. Linderstrøm-Lang described the following levels of protein structure: The
amino acid sequence - the set of primary chemical bonds - is called the primary structure. The assignment of
helices and sheets - the hydrogen-bonding pattern of the mainchain - is called the secondary structure. The
assembly and interactions of the helices and sheets are called the tertiary structure. For proteins composed of
more than one subunit, J.D. Bernal called the assembly of the monomers the quaternary structure. In some
cases, evolution can merge proteins - changing quaternary to tertiary structure. For example, five separate
enzymes in the bacterium E. coli, that catalyse successive steps in the pathway of biosynthesis of aromatic
amino acids, correspond to five regions of a single protein in the fungus Aspergillus nidulans. Sometimes
homologous monomers form oligomers in different ways; for instance, globins form tetramers in mammalian
haemoglobins, and dimers - using a different interface - in the ark clam Scapharca inaequivalvis.
It has proved useful to add additional levels to the hierarchy:
• Supersecondary structures. Proteins show recurrent patterns of interaction between helices
and sheets close together in the sequence. These supersecondary structures include the
α-helix hairpin, the β-hairpin, and the β-α-β unit (Fig. 1.8).


rest on a basis of personal responsibility. Where there can be no reliance on personal responsibility,
there can be no freedom. In most fields of moral activity this sense of personal responsibility is
acquired at a fairly early stage of social progress. Sexual morality is the last field of morality
to be brought within the sphere of personal responsibility. The community imposes the various complex
and artificial laws of sexual morality on its members, especially its female members, and naturally
it is always very distrustful of their ability to observe these laws, and takes great care to leave
them, as far as possible, no personal responsibility in the matter. But training in self-restraint,
when carried through a long series of generations, is the best preparation for freedom. The law
imposed on the earlier generations has been, as the old theology explained the matter, the
schoolmaster to bring the later generations to Christ; or, as the new science expresses exactly the
same idea, the later generations have become immune and have finally acquired a kind of exemption
from the virus that would have destroyed the earlier generations.

The process by which a people acquires a sense of personal responsibility is slow, and perhaps it
cannot be adequately acquired at all by races lacking a high degree of nervous organization. This is
especially true of sexual morality, as has often been shown when a higher civilization comes into
contact with a lower one. It has happened again and again that missionaries (very much against their
own wishes, needless to say), by overthrowing the strict moral system they found and substituting
for it the freedom of European customs among peoples wholly unprepared for such freedom, have had a
most harmful effect on morality. This has been the case among the formerly well-organized and highly
moral Baganda of Central Africa, as recorded in an official report by Colonel Lambkin (British
Medical Journal, Oct. 3, 1908).
As regards Polynesia too, R. L. Stevenson pointed out in his interesting book In the South Seas
(ch. V) that, whereas before the coming of the whites the Polynesians were on the whole chaste, and
young people were carefully watched, it is now quite otherwise.
Even in Fiji, where, according to Lord Stanmore (who was High Commissioner of the Pacific and an
independent critic), the efforts of the missionaries have been “wonderfully successful”, where all
at least nominally call themselves Christians, and where life and the character of the people have
thereby been greatly changed, chastity has suffered much. This was shown by a commission on the
condition of the native races in Fiji. Mr. Titchett, who reports on this commission (Australasian
Review of Reviews, Oct., 1897), remarks: “Not a few of the witnesses heard by the commission declare
that moral progress in Fiji turns out to be curiously patchy work. The abolition of polygamy, for
example, has not proved in every respect favourable to the women. The woman has to do the heavy work
in Fiji; and when the maintenance of the husband was divided among four wives, the burden on each
wife was less than now, when it must be borne by one. In heathen times the chastity of the woman was
guarded by the club; an unfaithful wife or an unmarried mother was summarily put to death.
Christianity has abolished club-law, and moral restraint alone, or fear of the punishments of the
world hereafter, does not, for the limited imagination of the Fijians, quite take its place. So the
standard of chastity in Fiji is distressingly low”.
We must always remember that, when the highly organized system of mixed spiritual and physical
restraints is removed, chastity becomes more delicate and unstable in balance. The controlling
influence of personal responsibility, valuable and essential as it is, cannot continuously and
unbrokenly hold in check the volcanic forces of the passion of love, even in high civilizations.
“No perfection of moral constitution in a woman,” Hinton has rightly said, “no power of will, no
wish and resolution to be ‘good’, no force of religion or control of custom, can secure what is
called the virtue of woman. The feeling of absolute devotion with which some man may inspire her
will sweep them all away. Where society chooses to build itself on that basis, it inevitably chooses
disorder, and as long as it continues to choose it, it will always have the same result”.
We must go further into this personal responsibility in matters of sexual morality, in the form in
which it makes itself felt among us, and inquire into all that it implies. The most important point
is undoubtedly economic independence. It is really of such importance that one can hardly say that
moral responsibility exists in the best sense of the word where economic independence is lacking.
Moral responsibility and economic independence are really identical; they are but two sides of the
same social fact. The responsible person is the person who can answer for his actions and, if need
be, pay for them. The economically dependent person can take on a criminal responsibility; he can go
with an empty purse to prison or to death. But in the ordinary sphere of everyday morality that
great penalty is not demanded of him; if he goes against the wishes of his family or his friends or
his community, they can turn their backs on him, but they cannot usually demand the extreme
penalties of the law against him. He can exercise his own personal responsibility, he can freely
choose his own way and maintain himself in it before the eyes of his fellow men, on condition that
he is able to pay for it. His personal responsibility has little or no meaning if it is not at the
same time economic independence.
As civilized societies come to maturity, women begin to gain an ever greater measure both of moral
responsibility and of economic independence. Every new freedom of women and every apparent equality
of men and women, even when it actually takes on the appearance of superiority, is unreal if it is
not based on economic independence. It is then merely tolerated; it is the freedom given to a child
because it asks so sweetly, or because it will perhaps scream if refused. This is merely
parasitism³⁹. The basis of economic independence ensures a more real freedom. Even in societies
which by law and custom hold women in strict subjection, the woman who happens to possess property
enjoys a high degree of independence as well as of responsibility⁴⁰. The growth of a high
civilization seems indeed to be so closely bound up with the economic freedom and independence of
women that it is hard to say which is cause and which effect. Herodotus, in his fine account of
Egypt, a land he regarded as more admirable than all other lands, noted with astonishment that the
women left the men at home to work the loom and themselves went to market to do business or to
trade⁴¹. It is the economic factor in social life that secures the moral responsibility of women
and that chiefly determines the position of the wife in relation to her husband⁴².

In this respect civilization in its latest stage returns to the same point it occupied at the
beginning, when, as already noted, we found greater equality with men and at the same time greater
economic independence⁴³.
In all the leading modern civilized countries, custom and law have, during the last century,
combined to secure an ever greater economic independence for women. In some respects England has
taken the lead, by being the first to develop the capitalist system and gradually enrolling women in
the ranks of the workers⁴⁴, which made inevitable the change in the law that, in 1882, secured to a
married woman the possession of her own earnings. The same movement, with the same results, is seen
elsewhere. In the United States, as in England, there is a large and rapidly growing army of five
million women who earn their own living, and their position relative to male workers is even better
than in England. In France, from twenty-five to twenty-seven per cent of the workers in most of the
chief industries (the liberal professions, commerce, agriculture, factory industries) are women, and
in some of the largest, such as the domestic and textile industries, more women than men are
employed. In Japan, it is said, three-fifths of the factory workers are women, and all the textile
industries are in the hands of women⁴⁵. This movement is a visible expression of the modern
conception of personal rights, personal worth and personal responsibility, which, as Hobhouse
remarks, has compelled women to take their lives into their own hands, and which has at the same
time made the old marriage laws an anachronism and swept from the world the outworn idea of feminine
innocence as nothing but a piece of false sentiment⁴⁶.
There can be no doubt that the entry of women into the field of
industrial labour, in rivalry with men and under much the same
conditions, raises serious questions of another kind. That
civilization in general tends towards the economic independence and
the moral responsibility of women is evident. But it is by no means
absolutely certain that it is best for women, and therefore for the
community, that they should pursue all the ordinary trades and
occupations, and under the same conditions. Not only have the
conditions of trades and professions developed in accordance with the
special aptitudes of men, but the fact that the sexual process by
which the race is propagated demands an incomparably greater amount
of time and energy from women than from men prevents women, as a
rule, from devoting themselves as exclusively as men to industrial
work. To some biologists, indeed, it seems clear that woman should
not work at all outside the home and the school. “Any nation that
works its women is damned,” says Woods Hutchinson (The Gospel
According to Darwin, p. 199). This is an extreme view. Yet Hobson,
too, who considers this question from the economic side, regards the
influence of industry in driving women from the home as “a tendency
antagonistic to civilization.” The neglect of the home, he says, is,
“on the whole, the worst injury that modern industry has inflicted on
our lives, and it is difficult to see how it can be made good by any
increase of material products. Factory life for women, save in
extremely rare cases, undermines the moral and physical health of the
family. The demands of factory life are incompatible with the
position of a good mother, a good wife, or a good housewife. Save in
quite extreme cases, no increase of the family wage can balance these
losses, whose worth stands on a qualitatively higher level” (J. A.
Hobson, Evolution of Modern Capitalism, Ch. XII; compare what is said
in Chapter I of this work). It is now beginning to be recognized that
the early pioneers of the women’s movement, who worked to remove “the
subjection of woman,” were themselves still dominated by the old
ideals of that subjection, according to which the masculine sex is in
all respects the superior. Whatever was good for a man, they thought,
must be good for a woman also. That has been the source of all that
made the first expressions of the “women’s movement” so unsteady, and
sometimes so pathetic and absurd. It was not perceived that, above
all, women must assert their rights to their own womanhood as mothers
of the race, and thereby as the first lawgivers in the sphere of sex
and in the great field of life that depends on sex. This special
position of woman will probably require an adjustment of economic
conditions to her needs, though it is not likely that such an
adjustment would infringe upon her independence or her
responsibility. We have had, as Juliette Adam says, the rights of
men, which sacrificed the rights of woman, followed by the rights of
woman, which sacrificed the child; these must be followed by the
rights of the child, which restore the family to honour. It has
already been necessary to touch on this point in the first chapter of
this book, and it will be necessary again in the last chapter.
The question of the means by which the economic independence of
women will be completely secured, and of the part the community
must play in safeguarding it, with due regard to woman’s special
child-bearing functions, is, from the standpoint that concerns
us at the moment, a side issue. There can, however, be no doubt
as to the reality of the movement in that direction, whatever
doubt there may be as to the final adjustment of the details.
Here we need only point to some of the general and more plainly
visible changes by which the growth of woman’s responsibility
affects sexual morality.
The first and most noticeable way in which this sense of moral
responsibility works is an insistence on reality in the
relations between the sexes. The moral irresponsibility of
woman, together with her economic dependence, has led her to
regard the sexual event that is biologically of the greatest
importance as merely a gay and trivial event, at most as an
event that has given her a triumph over her rivals and over the
superior male, who, for his part, willingly lends himself for
the moment to the part of the vanquished. “Gallantry to the
ladies,” we are told of the hero of the greatest and most
typical English novel, “was among his principles of honour, and
he held it as much incumbent on him to accept a challenge to
love as if it had been a challenge to fight”; he heroically goes
home with a lady of high rank whom he meets at a masquerade,
though he was at the time deeply in love with a girl whom he
afterwards marries.[47] The woman whose power lies only in her
charms, and who is free to lay the burden of responsibility on
the man’s shoulders,[48] can easily play the part of the
seductress and thereby exercise independence and authority in
the only forms open to her. The man, for his part, introducing
the idea of “honour” into a field from which the natural idea of
responsibility has been banished, is ready, at a lady’s bidding,
to descend into the arena, as in the old legend, and retrieve
her glove, even if he afterwards flings it contemptuously in her
face. The old conception of gallantry, which Tom Jones so well
embodies, is the direct outcome of a system that involves the
moral irresponsibility and economic dependence of women, and it
is opposed alike to the conceptions of sexual equality that have
prevailed in earlier and later stages of civilization and to the
biological traditions of natural courtship in the world at
large.
In controlling their own sexual lives, and in clearly realizing
that the responsibility for such control can no longer be
shifted to the shoulders of the other sex, women will indirectly
influence the sexual lives of men, as men already influence
those of women. In what way that influence will mainly be
exerted cannot yet be foretold. According to some, just as men
formerly bought their wives and demanded prenuptial virginity in
the article so purchased, so nowadays, among the better classes,
women are able to buy their husbands, and in their turn are
inclined to demand chastity.[49] That, however, is too simple a
way of viewing the matter. It is enough to point out that women
are not attracted by virginal innocence in a man, and that they
often have good reason to regard such innocence with
suspicion.[50] Yet we may well believe that women will more and
more prefer to exercise a certain criticism of their husband’s
past. However much a woman may instinctively wish her husband to
be initiated in the art of courtship, she may often well doubt
whether the best initiation is to be had from the ordinary
prostitute. Prostitution, as we have seen, is in the end no more
compatible with complete sexual responsibility than the
patriarchal marriage system with which it has been closely bound
up. It is an arrangement determined mainly by the needs of men,
however much it may incidentally have met various needs of
women. Men arranged that one group of women should be set apart
to serve exclusively their sexual needs, while another group
should be brought up in asceticism as candidates for the
privilege of ministering to the needs of their household and
family. That this has in many respects been an excellent system
is sufficiently shown by the fact that it has flourished for so
long, in spite of the influences opposed to it. But it is
evidently possible only during a certain stage of civilization
and in connection with a particular social organization. It does
not fully accord with a democratic stage of civilization, which
involves the economic independence and the sexual responsibility
of both sexes alike in all classes of society. It is possible
that women may begin to recognize this fact sooner than men.

It is also believed by many that women will recognize that a
high degree of moral responsibility is not easily compatible
with the practice of dissimulation, and that economic
independence will deprive deceit, which is always the refuge of
the weak, of whatever moral justification it might possess.
Here, however, it is necessary to speak with caution, or we
should become unjust to women. We must note that in the sexual
sphere men, too, are often the weak, and are apt to resort to
the refuge of the weak. With the recognition of that fact we
must also recognize that many of the foolish opinions which have
prevailed for centuries in the masculine mind in contemplating
feminine ways have been largely caused by deception in women.
Men have constantly committed the double error of either
overlooking the dissimulation of women or attaching too much
weight to it. This fact has always helped to make the inevitably
difficult path of women through the maze of sexual conduct still
more difficult. Pepys, who gives so vivid and so frank a picture
of the virtues and failings of the ordinary masculine mind,
tells how once, when he was visiting Mrs. Martin, her sister
Doll went out to fetch a bottle of wine and came back indignant
because a Dutchman had pulled her into a stable and tried to
tumble her. As Pepys had himself often taken liberties with her,
it seemed to him that her indignation with the Dutchman was “the
best instance of woman’s falseness in the world.”[51] He assumes
without question that a woman who has granted the privilege of
familiarity to a man whom she knows and, we may hope, respects,
ought also to be prepared to accept with pleasure the brutal
attentions of the first drunken stranger she meets in the
street.
It was the assumption of insincerity in women that led the
ultra-masculine Pepys into a rather foolish error. At this point
we meet with what has seemed to some a serious obstacle to the
full moral responsibility of women. Dissimulation, say Lombroso
and Ferrero, is in woman “almost physiological,” and they give
various grounds for this statement.[52] The theologians, for
their part, have reached the same conclusion. “A confessor must
not immediately believe a woman’s words,” says Father Gury, “for
women are habitually inclined to lie.”[53] This tendency, which
women as a sex are believed to possess, however free from it a
great number of individual women may be, may truly be said to be
largely the result of the subjection of women, and therefore
likely to disappear as that subjection disappears. In so far,
however, as it is “almost physiological,” and based on
ineradicable feminine qualities such as modesty, sensitiveness,
and sympathy, which have an organic basis in the feminine
constitution and can therefore never wholly change, it seems
scarcely likely that feminine dissimulation will disappear. The
best that can be expected is that it will be held in check by
the developed sense of moral responsibility, and that, once
reduced to its simple natural proportions, it will be recognized
as intelligible.
It is unnecessary to remark that there can be no question of any
innate moral superiority of one sex over the other. This question was
fully treated many years ago by one of the most delicate moralists of
love. “Taking everything together,” concluded Senancour (De l’Amour,
vol. II, p. 85), “we have no reason to assert the superiority of one
sex over the other. Both sexes, with their errors and good
intentions, equally fulfil the ends of nature. We may well believe
that in each of the two divisions of the human species the sum of
good and evil is very nearly equal. If, for instance, as regards
love, we compare the visibly licentious conduct of men with the
apparent reserve of women, it would be a false estimate, for the
number of faults committed by women with men is necessarily the same
as that of men with women. There are among us fewer scrupulous men
than perfectly honest women, but it is easy to see how the balance is
restored. If this question of the moral superiority of one sex over
the other were not insoluble, it would still remain very complicated
with regard to the whole species, or even the whole nation, and any
dispute here seems idle.”
This conclusion accords with the generally compensatory and
complementary relation of women to men.
Recently, in an inquiry into the question whether women are morally
inferior to men, with special reference to aptitude for loyalty (La
Revue, Jan. 1, 1909), in which several distinguished French men and
women gave their opinions, some declared that women are usually the
superiors; others regarded it rather as a question of difference than
of superiority or inferiority; all agreed that, when they enjoy the
same independence as men, women are as loyal as men.
It is undoubtedly true that, partly as a result of old
traditions and education, partly of genuinely feminine
characteristics, many women are diffident as regards their right
to moral responsibility and disinclined to assume it. And an
attempt has been made to justify their attitude by asserting
that woman’s part in life is by nature one of self-sacrifice,
or, to put the statement in a more technical form, that women
are by nature masochistic; and that there is, as Krafft-Ebing
says, a natural “sexual subjection” of woman. It is by no means
clear that this statement is absolutely true, and if it were
true it would not serve to abolish the moral responsibility of
women.
Bloch (Beiträge zur Aetiologie der Psychopathia Sexualis, Part II, p.
178), in agreement with Eulenburg, emphatically denies that any such
natural “sexual subjection” of women exists, regards it as
artificially produced, the result of the socially inferior position
of women, and argues that such subjection is in a far higher degree a
physiological characteristic of men than of women. It seems clear
that the notion that women are especially prone to self-sacrifice has
little biological validity. Self-sacrifice that is compelled, whether
by physical or moral force, is not worthy of the name; when it is
deliberate, it is simply the sacrifice of a lesser good in order to
gain a greater good. In the same way, a man who eats a good dinner
might be said to “sacrifice” his hunger. Even within the sphere of
traditional morality, the woman who sacrifices her “honour” for the
sake of her love for a man has, by her sacrifice, gained something
that she values more. “What a triumph it is for a woman,” a woman has
said, “to give joy to the man she loves!” And in a morality founded
on a sound basis no “sacrifice” is here called for. It may rather be
said that the biological laws of courtship fundamentally demand more
self-sacrifice from the male than from the female. Thus, according to
Gérard the lion-hunter, the lioness gives herself to the strongest of
her lion lovers; she encourages them to fight among themselves for
the supremacy, while she lies on her belly to watch the combat and
lashes her tail with delight. Every female is courted by many males,
but she accepts only one; it is not of the female that erotic
self-sacrifice is demanded, but of the male. That is indeed part of
the divine compensation of nature, for since the greater part of the
burden of sex rests on the female, it is fitting that she should be
less called upon for renunciation.
Thus it seems probable that the growth of moral responsibility
will tend to make a woman’s conduct more intelligible to
others;[54] it will in any case tend to make others concern
themselves less with it. This applies very specially to the
relations of the sexes. Formerly it was men who had to exercise
themselves in many forms of virtue, while only one virtue was
open to women. That is no longer possible. When we charge woman
with the chief responsibility for her own sexual conduct, we
thereby deprive that conduct of its conspicuously public
character as a virtue or a vice. Sexual union, for a woman as
much as for a man, is a physiological fact; it may also be a
spiritual fact; but it is not a social fact. It is, on the
contrary, an act which, more than any other, requires retirement
and secrecy for its accomplishment. That is indeed a general
human, almost zoological, fact. Moreover, this demand for
secrecy is more especially made by woman, by virtue of her
greater modesty, which, as we have reason to believe, has a
biological basis. Not until a child is born or conceived has the
community any right to interest itself in the sexual acts of its
members. The sexual act concerns the community no more than any
other private physiological act. It is an impertinence, if not
an outrage, to make inquiry here. But the birth of a child is a
social event. Not what goes into the womb but what comes out of
it concerns society. Society is invited to receive a new
citizen. It is entitled to demand that that citizen shall be
worthy of a place in its midst, and that he shall be properly
introduced by a responsible father and a responsible mother. The
whole of sexual morality, as Ellen Key has said, revolves round
the child.
At this final point in our discussion of sexual morality we may
perhaps perceive the immense change involved in the development
of moral responsibility in women. So long as all responsibility
was denied to women, so long as a father or a husband, backed by
the community, held himself responsible for the woman’s sexual
conduct, for her “virtue,” it was necessary that the whole of
sexual morality should revolve round the entrance to the vagina.
It became absolutely essential to the maintenance of morality
that all eyes of the community should be constantly directed to
that point, and the whole of marriage law had to be directed to
it as well. That is no longer possible. When a woman assumes her
own moral responsibility, in sexual as in other matters, it
becomes not only intolerable but meaningless for the community
to pry into her most intimate physiological or spiritual acts.
She is herself directly responsible to society as soon as she
performs a social act, and not before.
With regard to motherhood especially, the realization of all
that is involved in the new moral responsibility of women is
significant. Under a system of morality by which a man is left
free to take responsibility for his sexual acts, while a woman
is not equally free to do the same, a premium is placed on
sexual acts that do not result in procreation, and a penalty is
placed on the acts that lead to procreation. The reason is that
in the former class of acts it is chiefly men who find
gratification, and that in the latter class it is chiefly women
who find gratification. For the tragedy of the old sexual
morality was that, while it held men alone responsible for
sexual acts in which both the man and the woman took part, women
were made both socially and legally incapable of availing
themselves of the fact of male responsibility unless they had
fulfilled the conditions which men had laid down for them, and
which men nevertheless did not impose on themselves. The act of
sexual intercourse, which was the act in which men found most
pleasure, was under all circumstances an act of little social
consequence; the act of bringing a child into the world, which
is for women the most truly satisfying of all sexual acts, was
regarded as a crime unless the woman had previously fulfilled
the conditions demanded by the man. That was perhaps the most
unfortunate and certainly the most unnatural of the results of
the patriarchal regulation of society. It has never existed in
any great State where women have possessed legislative power.
It has, of course, been said by abstract theorists that women have
the matter in their own hands. They must never love a man until they
have safely secured him in the legal bonds of marriage. Such an
argument is futile, for it ignores the fact that, while love and even
monogamy are natural, legal marriage is merely an external form, with
a very feeble power of subjugating natural impulses, except when
those impulses are weak, and no power at all of subjugating them
permanently. Civilization involves the growth of foresight and of
self-control in both sexes; but it is foolish to expose these finest
and latest outgrowths of civilization to a strain which they could
never bear. How foolish it is has been shown, briefly and pointedly,
by Lea in his admirable History of Sacerdotal Celibacy.
