Structural genomics

Structural Genomics
• Structural genomics is concerned with sequencing and understanding the content of genomes.
• The first step in characterizing a genome is to prepare its maps:
◦ Genetic maps
◦ Physical maps
• These maps provide information about the relative locations of:
◦ genes,
◦ molecular markers, and
◦ chromosome segments.

Genetic Maps
• Genetic maps (also called linkage maps) provide a rough approximation of the locations of genes relative to the locations of other known genes.
• These maps are based on the genetic function of recombination.
• Individuals heterozygous at two or more genetic loci are crossed, and the frequency of recombination between loci is determined by examining the progeny.
• If the recombination frequency between two loci is 50%, the loci are located on different chromosomes or are far apart on the same chromosome.

• If the recombination frequency is less than 50%, the loci are linked.
• For linked genes, the rate of recombination is approximately proportional to the physical distance between the loci.
• Distances on genetic maps are measured in percent recombination, expressed in centimorgans (cM) or map units; 1 cM corresponds to 1% recombination.
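
The relationship between progeny counts, recombination frequency, and map distance can be sketched in a few lines of Python. This is only an illustration: the function names and the testcross counts are hypothetical, and the conversion simply treats 1% recombination as 1 cM.

```python
def recombination_frequency(recombinant: int, total: int) -> float:
    """Fraction of progeny that carry a recombinant combination of alleles."""
    if total <= 0 or not 0 <= recombinant <= total:
        raise ValueError("need total > 0 and 0 <= recombinant <= total")
    return recombinant / total


def map_distance_cm(rf: float) -> float:
    """Map distance in centimorgans, taking 1% recombination as 1 cM."""
    return 100.0 * rf


def are_linked(rf: float) -> bool:
    """Loci are treated as linked when recombination is below 50%."""
    return rf < 0.5


# Hypothetical testcross: 1000 progeny scored, 120 of them recombinant.
rf = recombination_frequency(120, 1000)
print(f"RF = {rf:.1%}, distance = {map_distance_cm(rf):.1f} cM, linked = {are_linked(rf)}")
```

A recombination frequency of exactly 50% is reported as unlinked, which matches the rule above: such loci are either on different chromosomes or far apart on the same one.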

Limitations of Genetic Maps
• The first limitation is resolution, or detail.
◦ The human genome, for example, contains about 3.4 billion base pairs of DNA and has a total genetic distance of about 4000 cM, an average of 850,000 bp/cM.
◦ Even if a marker occurred every centimorgan (which is unrealistic), the resolution in regard to the physical structure of the DNA would still be quite low.
• The second limitation is that genetic maps do not always correspond accurately to physical distances between genes.
◦ Genetic maps are based on rates of crossing over, which vary, so the distances on a genetic map are only approximations of real physical distances along a chromosome.
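
The resolution figure quoted above is a simple ratio, which the following Python sketch reproduces. The constant and function names are illustrative, and the conversion function is deliberately naive: it applies the genome-wide average, whereas real crossover rates vary along a chromosome.

```python
GENOME_SIZE_BP = 3_400_000_000  # about 3.4 billion bp, as quoted in the text
GENETIC_LENGTH_CM = 4000        # about 4000 cM, as quoted in the text

# Average physical distance represented by one centimorgan.
bp_per_cm = GENOME_SIZE_BP / GENETIC_LENGTH_CM
print(f"average: {bp_per_cm:,.0f} bp per cM")


def naive_physical_distance_bp(distance_cm: float) -> float:
    """Rough physical distance using the genome-wide average only."""
    return distance_cm * bp_per_cm
```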

Physical Maps
• Physical maps are based on the direct analysis of DNA, and they place genes in relation to distances measured in numbers of base pairs, kilobases, or megabases.
• A common type of physical map is one that connects pieces of genomic DNA that have been cloned in bacteria or yeast.
• Physical maps generally have higher resolution and are more accurate than genetic maps.
• Physical maps are often used to order cloned DNA fragments.

• A number of techniques exist for creating physical maps, including:
◦ Restriction mapping, which determines the positions of restriction sites on DNA;
◦ Sequence-tagged site (STS) mapping, which locates the positions of short unique sequences of DNA on a chromosome;
◦ Fluorescent in situ hybridization (FISH), by which markers can be visually mapped to locations on chromosomes;
◦ DNA sequencing.

Genetic and physical maps may differ in relative distances and even in the position of genes on a chromosome.
• The figure compares the genetic map of chromosome III of yeast with a physical map determined by DNA sequencing.
• There are some discrepancies between the distances and even among the positions of some genes.
• In spite of these limitations, genetic maps have been critical to the development of physical maps and the sequencing of whole genomes.
A physical map is analogous to a neighborhood map that shows the location of every house along a street, whereas a genetic map is analogous to a highway map that shows the locations of major towns and cities.

Restriction Mapping
• Restriction mapping determines the relative positions of restriction sites on a piece of DNA.
• A piece of DNA is cut with a restriction enzyme, and the resulting fragments are separated by gel electrophoresis.
• The number of restriction sites in the DNA and the distances between them can be determined from the number and positions of bands on the gel.

• Example:
◦ We have a sample of a linear 13,000-bp (13-kb) DNA fragment.
◦ The first sample is cut with the restriction enzyme EcoRI;
◦ A second sample of the same DNA is cut with BamHI;
◦ A third sample is cut with both EcoRI and BamHI (double digest).
◦ The resulting fragments are separated and sized by gel electrophoresis.
◦ Question: what are the positions of the EcoRI and BamHI restriction sites on the original 13-kb fragment?

• Most restriction mapping is done with several restriction enzymes, used alone and in various combinations, producing many restriction fragments.
• With long pieces of DNA (greater than 30 kb), computer programs are used to determine the restriction maps.
• Restriction mapping may be facilitated by tagging one end of a large DNA fragment with radioactivity or by identifying the end with the use of a probe.
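
The logic of single and double digests can be explored with a small in silico digestion. The sketch below is not the solution to the example above, because the gel data are not given here; the site positions are hypothetical and serve only to show how a proposed map predicts the fragment sizes for each digest.

```python
def digest(length_bp: int, sites: list[int]) -> list[int]:
    """Fragment sizes (largest first) after cutting a linear molecule at the given sites."""
    cuts = [0] + sorted(set(sites)) + [length_bp]
    return sorted((b - a for a, b in zip(cuts, cuts[1:])), reverse=True)


LENGTH = 13_000
ECORI_SITES = [2_000, 8_000]  # hypothetical positions, not taken from the example
BAMHI_SITES = [5_000]         # hypothetical position, not taken from the example

ecori_only = digest(LENGTH, ECORI_SITES)
bamhi_only = digest(LENGTH, BAMHI_SITES)
double = digest(LENGTH, ECORI_SITES + BAMHI_SITES)

print("EcoRI :", ecori_only)
print("BamHI :", bamhi_only)
print("double:", double)
```

A candidate map is accepted only if its predicted fragments match all three lanes of the gel. Note that two fragments of equal size in the double digest would run as a single band, which is one reason why several enzymes and combinations are normally used.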

DNA-Sequencing Methods
• The most detailed physical maps are based on direct DNA sequence information.
• Between 1975 and 1977, Frederick Sanger and his colleagues created the dideoxy sequencing method, which is based on the elongation of DNA;
• Allan Maxam and Walter Gilbert developed a second method, which is based on the chemical degradation of DNA.

How are the primers constructed when we don't know the sequence of the DNA?
• By cloning the target DNA in a vector that contains sequences recognized by a common primer (called universal sequencing primer sites) on either side of the site where the target DNA will be inserted.
• The target DNA is then isolated from the vector and will contain universal sequencing primer sites at each end.

Automated dideoxy method

Sequencing an Entire Genome
• The ultimate goal of structural genomics is to determine the ordered nucleotide sequences of entire genomes of organisms.
• The main obstacle to this task is the immense size of most genomes.
◦ Bacterial genomes are usually at least several million base pairs long.
◦ Many eukaryotic genomes are billions of base pairs long and are distributed among dozens of chromosomes.
◦ For technical reasons, it is not possible to begin sequencing at one end of a chromosome and continue straight through to the other end;
◦ Only small fragments of DNA, usually from 500 to 700 nucleotides, can be sequenced at one time.

• The DNA must therefore be broken into thousands or millions of smaller fragments that can then be sequenced.
• This creates a second problem:
◦ Putting these short sequences back together in the correct order.
• Two approaches are used to resolve this task:
◦ Map-based sequencing
◦ Whole-genome shotgun sequencing
The genome of bacteriophage λ, consisting of about 49,000 bp, was completed in 1982.
In 1995, the first genome of a free-living organism (Haemophilus influenzae) was sequenced by Craig Venter and Claire Fraser of The Institute for Genomic Research (TIGR) and Hamilton Smith of Johns Hopkins University.
By 1996, the genome of the first eukaryotic organism (yeast) had been determined, followed by the genomes of Escherichia coli (1997), Caenorhabditis elegans (1998), and Drosophila melanogaster (2000).
The first draft of the human genome was completed in June 2000.

Map-based approach
• The map-based approach requires the initial creation of detailed genetic and physical maps of the genome, which provide known locations of genetic markers at regularly spaced intervals along each chromosome.
• These markers can later be used to help align the short, sequenced fragments into their correct order.
• After the genetic and physical maps are available, chromosomes or large pieces of chromosomes are separated by:
◦ Pulsed-field gel electrophoresis (PFGE): large molecules of DNA or whole chromosomes are separated in a gel by periodically alternating the orientation of an electrical current.
◦ Flow cytometry: chromosomes are sorted optically by size.

• Each chromosome (or sometimes the entire genome) is then cut up by partial digestion with restriction enzymes.
• Partial digestion produces a set of large overlapping DNA fragments.
• These fragments are then cloned by using cosmids, yeast artificial chromosomes (YACs), or bacterial artificial chromosomes (BACs).
• The clones are screened with specific probes.
A set of two or more overlapping DNA fragments that form a contiguous stretch of DNA is called a contig.
This approach was used in 1993 to create a contig of the human Y chromosome consisting of 196 overlapping YAC clones.

Each clone can be cut with a series of restriction enzymes, and the resulting fragments are then separated by gel electrophoresis.
A computer program is then used to examine the restriction patterns of all the clones and look for areas of overlap.
The overlap is then used to arrange the clones in order.
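
The idea of ordering clones by shared restriction fragments can be illustrated with a toy Python sketch. It is not the program used in any real project: the clone names and fragment sizes are hypothetical, fragment sizes are compared exactly (real gels require a tolerance for sizing error), and the clones are assumed to form a single unbranched chain.

```python
from itertools import combinations

# Hypothetical fingerprints: restriction-fragment sizes (bp) observed for each clone.
FINGERPRINTS = {
    "cloneC": {2700, 4300, 950},
    "cloneA": {1200, 3400, 5100},
    "cloneB": {5100, 800, 2700},
}


def order_clones(fingerprints, min_shared=1):
    """Order clones into a contig by chaining clones that share fragments."""
    neighbours = {name: set() for name in fingerprints}
    for a, b in combinations(fingerprints, 2):
        if len(fingerprints[a] & fingerprints[b]) >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Start from one end of the chain: a clone with exactly one overlapping neighbour.
    start = next(name for name in fingerprints if len(neighbours[name]) == 1)
    order, seen = [start], {start}
    while True:
        unvisited = [n for n in neighbours[order[-1]] if n not in seen]
        if not unvisited:
            return order
        order.append(unvisited[0])
        seen.add(unvisited[0])


print(order_clones(FINGERPRINTS))
```

Here cloneA and cloneB share the 5100-bp fragment and cloneB and cloneC share the 2700-bp fragment, so cloneB is placed between the other two. The orientation of the whole contig (A to C or C to A) cannot be decided from overlap alone.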

Whole-genome shotgun sequencing
• Small-insert clones are prepared directly from genomic DNA and sequenced.
• Powerful computer programs then assemble the entire genome by examining overlap among the small-insert clones.

Whole-genome shotgun sequencing utilizes sequence overlap to align sequenced fragments.
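
Assembly by sequence overlap can be sketched as a greedy merge of reads. This is a teaching sketch under strong assumptions, not a description of any production assembler: the reads are hypothetical, error-free, all from the same strand, and free of repeats, and the function names are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest suffix-prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads  # nothing left to merge; each entry is a contig
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]


# Hypothetical reads sampled from one short sequence.
reads = ["AGCAAGTC", "ATGGCCTT", "CCTTAGCA"]
print(greedy_assemble(reads))
```

Real genomes contain repeated sequences and sequencing errors, which is why practical assemblers use more elaborate methods than this greedy merge.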

The Human Genome Project
• The Human Genome Project is an effort to sequence the entire human genome.
• The project began in 1990, and a rough draft of the sequence was completed by two competing teams:
◦ an international consortium of publicly supported investigators, and
◦ the private company Celera Genomics.
• Both teams finished a rough draft of the genome sequence in 2000.

Data Supporting Structural Genomics
• In addition to the DNA sequence of an entire genome, several other types of data are useful for genomic projects and have been the focus of sequencing efforts, including:
◦ single-nucleotide polymorphisms (SNPs)
◦ expressed-sequence tags (ESTs)

Single-Nucleotide Polymorphisms
• Single-nucleotide polymorphisms (SNPs) are single-base-pair differences in DNA sequence between individual members of a species.
• They arise through mutation.
• Single-nucleotide polymorphisms are numerous and are present throughout genomes.
• In a comparison of the same chromosome from two different people, a SNP can be found approximately every 1000 bp.
• Because of their variability and widespread occurrence throughout the genome, SNPs are valuable as markers in linkage studies.
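
Finding single-base differences between two copies of the same region can be sketched as a position-by-position comparison. The sketch assumes the two sequences are already aligned and of equal length; the sequences and the function name are hypothetical.

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for every mismatch.

    Assumes the sequences are pre-aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Hypothetical sequences from the same chromosomal region in two people.
person_1 = "ATGCCGTTAGACCTGA"
person_2 = "ATGCCGTTAGGCCTGA"
print(find_snps(person_1, person_2))
```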

Expressed-Sequence Tags
• Another type of data identified by sequencing projects consists of databases of expressed-sequence tags (ESTs).
• In most eukaryotic organisms, only a small percentage of the DNA actually encodes proteins; in humans, less than 2% of the DNA encodes the amino acids of proteins.
• If only protein-encoding genes are of interest, it is often more efficient to examine RNA than the entire genomic DNA sequence.
• RNA can be examined by using ESTs, which are markers associated with DNA sequences that are expressed as RNA.
• RNA is copied into cDNA by reverse transcriptase. Short stretches of the cDNA fragments are then sequenced, and the sequence obtained (called a tag) provides a marker that identifies the DNA fragment.
• Expressed-sequence tags can be used to find active genes in a particular tissue or at a particular point in development.

Functional Genomics
Functional genomics attempts to understand the function of the information in genomes.

Goals of functional genomics
• The goals of functional genomics include:
◦ identifying all the RNA molecules transcribed from a genome (the transcriptome), and
◦ identifying all the proteins encoded by the genome (the proteome).
• Functional genomics exploits both bioinformatics and laboratory-based experimental approaches in its search to define the function of DNA sequences.

• Several methods for identifying genes and assessing their functions are available, including:
◦ in situ hybridization,
◦ DNA footprinting,
◦ experimental mutagenesis,
◦ use of transgenic animals and knockouts.

Predicting Function from Sequence
• The nucleotide sequence of a gene can be used to predict the amino acid sequence of the protein that it encodes.
• The protein can then be synthesized or isolated and its properties studied to determine its function.
• This biochemical approach to understanding gene function is both time consuming and expensive.
• A major goal of functional genomics has been to develop computational methods that allow gene function to be identified from DNA sequence alone.
• This would bypass the laborious process of isolating and characterizing individual proteins.
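
The first step, predicting an amino acid sequence from a nucleotide sequence, can be sketched with the standard genetic code. The sketch assumes the input is the coding strand, already in the correct reading frame and without introns; the function name and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence (frame 1) into one-letter amino acids, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks a codon with an unknown base
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical open reading frame: ATG GCC AAA TAA
print(translate("ATGGCCAAATAA"))
```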

Homology searches
Homology searches rely on comparing DNA and protein sequences from the same and different organisms.
Genes that are evolutionarily related are said to be homologous.
Homologous genes found in different species that evolved from the same gene in a common ancestor are called orthologs.
For example, both mouse and human genomes contain a gene that encodes the alpha subunit of hemoglobin.
The mouse and human alpha-hemoglobin genes are said to be orthologs, because both genes evolved from an alpha-hemoglobin gene in a mammalian ancestor common to mice and humans.

Homologous genes in the same organism are called paralogs.
• Paralogs arise by duplication of a single gene in the evolutionary past.
Within the human genome is a gene that encodes the alpha subunit of hemoglobin and another homologous gene that encodes the beta subunit of hemoglobin.
These two genes arose because an ancestral gene underwent duplication and the resulting two genes diverged through evolutionary time, giving rise to the alpha- and beta-subunit genes; these two genes are paralogs.
• Homologous genes (both orthologs and paralogs) often have the same or related functions; so, after a function has been assigned to a particular gene, it can provide a clue to the function of a homologous gene.

Database Search for Orthologs
Databases containing genes and proteins found in a wide array of organisms are available for homology searches.
Powerful computer programs have been developed for scanning these databases to look for particular sequences.
A commonly used homology search program is BLAST (Basic Local Alignment Search Tool).
Suppose a geneticist sequences a genome and locates a gene that encodes a protein of unknown function.
A homology search conducted on databases containing the DNA or protein sequences of other organisms may identify one or more orthologous sequences.
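
The notion of searching a database for locally similar sequences can be illustrated with a small local-alignment score. This is not how BLAST itself works (BLAST uses a faster heuristic); the sketch uses Smith-Waterman-style dynamic programming instead, and the scoring values, the tiny "database", and the function names are all assumptions made for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman-style local alignment score between two sequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_hit(query: str, database: dict[str, str]) -> tuple[str, int]:
    """Name and score of the database entry with the highest local alignment score."""
    scores = {name: local_alignment_score(query, seq) for name, seq in database.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical database of sequences from other organisms.
database = {"geneX": "TTTACGTACGGA", "geneY": "GGGGCCCC"}
print(best_hit("ACGTACG", database))
```

A high-scoring hit only suggests homology; assigning a function still depends on what is already known about the matching sequence.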

Database Search for Paralogs
• Computer programs can also search a single genome for paralogs.
• Eukaryotic organisms often contain families of genes that have arisen by duplication of a single gene.
• If a paralog is found and its function has been previously assigned, this function can provide information about a possible function of the unknown gene.

Other sequence comparisons
• Complex proteins often have specific domains.
• Each domain has a characteristic arrangement of amino acids.
• For example, certain DNA-binding proteins attach to DNA in the same way;
• all these proteins have a common DNA-binding domain.
• Many protein domains have been characterized, and their molecular functions have been determined.
• A newly identified gene can be scanned against a database of known domains.
• If it encodes one or more domains of known function, the function of the domain can provide important information about a possible function of the new gene.
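
Scanning a protein against known domains can be sketched with regular expressions. The two patterns below are simplified motifs loosely modelled on the C2H2 zinc-finger and P-loop (Walker A) consensus sequences; they are illustrative only, real domain databases use more sensitive profile methods, and the protein sequences and function name are hypothetical.

```python
import re

# Simplified, illustrative motif patterns (one-letter amino acid codes).
DOMAIN_PATTERNS = {
    "C2H2 zinc finger (simplified)": re.compile(r"C.{2,4}C.{12}H.{3,5}H"),
    "P-loop / Walker A (simplified)": re.compile(r"[AG].{4}GK[ST]"),
}


def scan_domains(protein: str) -> list[tuple[str, int, int]]:
    """Return (domain name, 1-based start, end) for every motif match in the protein."""
    hits = []
    for name, pattern in DOMAIN_PATTERNS.items():
        for m in pattern.finditer(protein.upper()):
            hits.append((name, m.start() + 1, m.end()))
    return hits


# Hypothetical protein sequences.
print(scan_domains("MKCPECGKSFSRSDELTRHIRTHTG"))
print(scan_domains("MLGPSGSGKSTLLR"))
```

A match to a DNA-binding motif, for example, suggests that the new gene may encode a DNA-binding protein, which then has to be confirmed experimentally.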

Phylogenetic Profile: Another computational method for predicting protein function
• Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as the involvement of two different proteins in the same biological pathway.
• It was first introduced by Pellegrini and co-workers in 1999.
• In this method, the presence-and-absence pattern of a particular protein is examined across a set of organisms whose genomes have been sequenced.

Phylogenetic Profile
• If two proteins are either both present or both absent in all genomes surveyed, the two proteins may be functionally related.
• Genes (proteins) that are found and lost together are likely to be involved in a common function, for example by:
◦ 1) being involved in the same biological pathway, which is therefore incomplete without all its members in a given genome, or
◦ 2) being beneficial for the phenotype in a particular environment.
• Consider the following proteins in four bacterial species:
◦ E. coli: protein 1, protein 2, protein 3, protein 4, protein 5, protein 6
◦ Species A: protein 1, protein 2, protein 3, protein 6
◦ Species B: protein 1, protein 3, protein 4, protein 6
◦ Species C: protein 2, protein 4, protein 5

The phylogenetic profile reveals that proteins 1, 3, and 6 are either all present or all absent in all species; so these proteins might be functionally related.
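
The profile for the four species listed above can be built and compared in a few lines of Python. The presence/absence data come from the example; the function names are invented for illustration, and grouping by identical profiles is the simplest possible comparison (real analyses usually tolerate small differences between profiles).

```python
from collections import defaultdict

# Proteins present in each genome, from the example above.
genomes = {
    "E. coli": {1, 2, 3, 4, 5, 6},
    "Species A": {1, 2, 3, 6},
    "Species B": {1, 3, 4, 6},
    "Species C": {2, 4, 5},
}


def phylogenetic_profiles(genomes):
    """Map each protein to a tuple of 1/0 flags, one per genome, in genome order."""
    proteins = sorted(set().union(*genomes.values()))
    return {
        p: tuple(int(p in present) for present in genomes.values())
        for p in proteins
    }


def co_occurring_groups(profiles):
    """Groups of two or more proteins that share exactly the same profile."""
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    return [members for members in groups.values() if len(members) > 1]


profiles = phylogenetic_profiles(genomes)
for protein, profile in profiles.items():
    print(f"protein {protein}: {profile}")
print("candidate functional groups:", co_occurring_groups(profiles))
```
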
In the accompanying figure, proteins P2 and P7 have the same pattern of presence and absence in the analyzed genomes. We could therefore infer that they are involved in the same biological process.

Gene Neighbor Analysis
• Genes that encode functionally related proteins are often closely linked in bacteria.
• For example, if two genes are consistently linked in the genomes of several bacteria, they might be functionally related.
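
Gene neighbor analysis can be sketched by counting how often two genes sit next to each other across genomes. The gene orders below are hypothetical, the chromosomes are treated as linear for simplicity (most bacterial chromosomes are circular), and the function name is invented for illustration.

```python
from collections import Counter

# Hypothetical gene orders along the chromosome in three bacteria.
gene_orders = {
    "bacterium_1": ["geneA", "geneB", "geneC", "geneD"],
    "bacterium_2": ["geneD", "geneA", "geneB", "geneE"],
    "bacterium_3": ["geneC", "geneE", "geneB", "geneA"],
}


def conserved_neighbours(gene_orders, min_genomes):
    """Adjacent gene pairs (in either orientation) found in at least min_genomes genomes."""
    counts = Counter()
    for order in gene_orders.values():
        # A set, so that each pair is counted once per genome.
        pairs = {frozenset(pair) for pair in zip(order, order[1:])}
        counts.update(pairs)
    return {
        tuple(sorted(pair)): n
        for pair, n in counts.items()
        if n >= min_genomes
    }


print(conserved_neighbours(gene_orders, min_genomes=3))
```

In this toy data set only geneA and geneB are neighbors in all three genomes, so they would be the candidates for a functional relationship.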

mRNA Expression: Two dominant approaches
• RNA sequencing
• DNA microarrays

Microarrays
• A microarray monitors the expression level of each gene:
◦ Is it turned on or off in a particular biological condition?
◦ Is this on/off state different between two biological conditions?
• A microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene.

Gene Expression and Microarrays
• The development of microarrays has allowed the expression of thousands of genes to be monitored simultaneously.
• Microarrays rely on nucleic acid hybridization, in which a known DNA fragment is used as a probe to find complementary sequences.
• The probe is usually fixed to some type of solid support, such as a nylon filter or a glass slide.
• A solution containing a mixture of DNA or RNA is applied to the solid support; any nucleic acid that is complementary to the probe will bind to it.
• Nucleic acids in the mixture are labeled with a radioactive or fluorescent tag so that molecules bound to the probe can be easily detected.

How can we examine changes in gene expression?
• Suppose we have two types of cells:
◦ Experimental cells: mRNA is converted into cDNA and labeled with red fluorescent nucleotides.
◦ Control cells: mRNA is converted into cDNA and labeled with green fluorescent nucleotides.
• The labeled cDNAs are mixed and hybridized to a DNA chip, which contains DNA probes from different genes.
• Hybridization of the red (experimental) and green (control) cDNAs is proportional to the relative amounts of mRNA in the samples.

• Red indicates overexpression of a gene in the experimental cells (more red-labeled cDNA hybridizes).
• Green indicates underexpression of a gene in the experimental cells (more green-labeled cDNA hybridizes).
• Yellow indicates equal expression (equal hybridization of red- and green-labeled cDNAs).
• Black (no color) indicates no expression.
• Microarrays allow the expression of thousands of genes to be monitored simultaneously, which makes it possible:
◦ to study which genes are active in particular tissues, and
◦ to investigate how gene expression changes in the course of biological processes such as development or disease progression.
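
The color rules above can be expressed as a small classifier on the red and green intensities of a spot. The background cutoff and the two-fold threshold are assumptions chosen for illustration, not values from the slides, and the function name is hypothetical.

```python
import math


def classify_spot(red: float, green: float, background: float = 50.0, fold: float = 2.0) -> str:
    """Classify one microarray spot from its red (experimental) and green (control) intensities.

    background and fold are illustrative thresholds, not standard values.
    """
    if red <= background and green <= background:
        return "black"  # no detectable expression in either sample
    # Log ratio of experimental to control signal; +1 avoids division by zero.
    log_ratio = math.log2((red + 1.0) / (green + 1.0))
    if log_ratio >= math.log2(fold):
        return "red"    # overexpressed in the experimental cells
    if log_ratio <= -math.log2(fold):
        return "green"  # underexpressed in the experimental cells
    return "yellow"     # roughly equal expression


# Hypothetical intensity pairs (red, green) for four spots.
for red, green in [(4000, 500), (500, 4000), (1000, 1100), (10, 20)]:
    print(red, green, classify_spot(red, green))
```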

Genome-wide Mutagenesis
• One of the best methods for determining the function of a gene is to examine the phenotypes of individual organisms that possess a mutation in the gene.
• To conduct a mutagenesis screen, random mutations are induced in a population of organisms, creating new phenotypes.
• In a mutagenesis screen, the random induction of mutations on a genome-wide basis and the mapping of those mutations with molecular markers are coupled and automated.
• Mutagenesis screens can be used to search for specific genes encoding a particular function or trait.
