Genome exploration in A-T G-C space: an introduction to 'DNA walking'
Jonathan Blakes
Submitted for the degree MSc Biotechnology and Computation
Julie Newdoll, Dawn of the Double Helix, Oil/Mixed, 2002

Exponential growth of DNA sequences
GenBank release notes, December 2007: "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months."
83,874,179,730 base pairs in 80,388,382 entries

Genome Browsers
  • EnsEMBL: EMBL-EBI, Sanger Institute, Wellcome Trust
  • UCSC: University of California at Santa Cruz

Can we understand this information a priori, summarise it and preserve fine structure?
 
GraphDNA – DNA Walker
 
 
Mapping
Mapping
  • 4 rotations for each of the 3 previous mappings
  • 2 reflections of each of those
  • 24 possible combinations of cardinal vectors
  • These are the 3 most parsimonious mappings of those 24
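The count of 24 can be checked directly: assigning the four bases to the four cardinal vectors gives 4! = 24 mappings, and grouping them by which bases sit at opposite ends of an axis leaves 3 classes of 8 (4 rotations × 2 reflections). The following is a minimal sketch; the names `CARDINALS` and `opposing_pairs` are illustrative and do not come from the original work.

```python
from itertools import permutations
from collections import defaultdict

# Unit steps along the four cardinal directions (x, y).
CARDINALS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def opposing_pairs(mapping):
    """Return the two base pairs that sit at opposite ends of an axis."""
    pairs = set()
    for a, (ax, ay) in mapping.items():
        for b, (bx, by) in mapping.items():
            if a < b and (ax + bx, ay + by) == (0, 0):
                pairs.add(a + "-" + b)
    return frozenset(pairs)

# Group all 24 base-to-vector assignments by their opposing pairs.
classes = defaultdict(list)
for perm in permutations(CARDINALS):
    mapping = dict(zip("ACGT", perm))
    classes[opposing_pairs(mapping)].append(mapping)

for pairs, members in classes.items():
    print(sorted(pairs), len(members))
```

Each printed class corresponds to one of the three mappings named on the following slides (pairs are printed in alphabetical order, so G-C appears as C-G).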
A-T G-C
A-G C-T
A-C G-T
A-T G-C
A-T G-C is consistently smallest
A-T G-C walks:
  • contain more information in less space
  • are simply easier to print
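A DNA walk and the size comparison between mappings can be sketched as follows. The orientation of each mapping (which base points up, down, left or right) is an assumption made for illustration; only the opposing pairs are given in the slides. `dna_walk` and `bounding_box` are hypothetical helper names.

```python
from itertools import accumulate

# Assumed orientations: the first pair shares the vertical axis,
# the second pair shares the horizontal axis.
AT_GC = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}
AG_CT = {"A": (0, 1), "G": (0, -1), "C": (1, 0), "T": (-1, 0)}
AC_GT = {"A": (0, 1), "C": (0, -1), "G": (1, 0), "T": (-1, 0)}

def dna_walk(seq, mapping):
    """Return every (x, y) point visited, starting at the origin.

    Characters that are not in the mapping (for example N) are skipped.
    """
    steps = (mapping[base] for base in seq.upper() if base in mapping)
    return list(accumulate(
        steps,
        lambda p, s: (p[0] + s[0], p[1] + s[1]),
        initial=(0, 0),
    ))

def bounding_box(points):
    """Return (width, height) of the smallest box containing the walk."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs), max(ys) - min(ys))

seq = "AATTGGCC"
for name, mapping in [("A-T G-C", AT_GC), ("A-G C-T", AG_CT), ("A-C G-T", AC_GT)]:
    print(name, bounding_box(dna_walk(seq, mapping)))
```

Comparing the bounding boxes of the same sequence under the three mappings is one simple way to test the "consistently smallest" claim on real data.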
 
 
 
 
Genome Exploration
Human chromosome 1 250,000,000 bases
S. cerevisiae  chromosome 1
EnsEMBL annotation
 
Duplications
Can detect duplications by eye or algorithmically:
  • small (~4 base) sequences occur several times in each larger sequence
  • for each small sequence, calculate the line between occurrences
  • if the angle and length of 2 or more lines are consistent, then draw the lines to reveal possible duplications
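The algorithmic procedure above can be sketched as follows. This is one possible reading, not the original implementation: lines are drawn between consecutive occurrences of each k-mer, and two lines count as "consistent" when they have the same displacement vector (same angle and same length). The function names and the `min_support` parameter are hypothetical, and the input is assumed to contain only uppercase A, C, G and T.

```python
from collections import defaultdict

# Assumed orientation of the A-T G-C mapping (A up, T down, G right, C left).
AT_GC = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}

def walk_points(seq):
    """points[i] is the walk position just before base i is read."""
    x = y = 0
    points = [(0, 0)]
    for base in seq:
        dx, dy = AT_GC[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

def consistent_lines(seq, k=4, min_support=2):
    """Group lines between repeated k-mers by their displacement vector.

    Returns {vector: [(start_point, end_point), ...]} for every vector
    shared by at least `min_support` lines.
    """
    points = walk_points(seq)
    occurrences = defaultdict(list)
    for i in range(len(seq) - k + 1):
        occurrences[seq[i:i + k]].append(i)

    lines = defaultdict(list)
    for starts in occurrences.values():
        for i, j in zip(starts, starts[1:]):
            vector = (points[j][0] - points[i][0], points[j][1] - points[i][1])
            lines[vector].append((points[i], points[j]))
    return {v: segs for v, segs in lines.items() if len(segs) >= min_support}

unit = "AAGGAC"
print(consistent_lines(unit * 2))
```

In the toy input, every repeated 4-mer is displaced by the net step of one repeat unit, so all lines share a single vector. On real sequence, short k-mers recur by chance, so a higher `min_support` would likely be needed to suppress noise.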
 
Comparison with published data
This is a 7-fold contiguous duplication in the male Y chromosome: members of the TSPY (testis-specific Y-encoded protein) family, identified by Skaletsky et al., Nature 423 (2003), using a combination of a whole-chromosome dotplot with a 2-kb window and a custom Perl script running BLAST alignments of all 5-kb sequence segments, in 2-kb steps, of the entire MSY (male-specific region of the Y chromosome).
Figure labels: exons, introns

Phylogenetics
  • Phylogenetics is the reconstruction of evolutionary relatedness from primary sequence information: DNA or protein
  • Traditional phylogenetic methods all start from a multiple sequence alignment, built with tools such as ClustalW and T-Coffee
  • It is hard to find an optimal alignment for many long sequences
  • Want to use simple measures, such as Manhattan or Euclidean distance derived from DNA walks, to produce phylogenies without alignment
  • Are these comparable to alignment methods?
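The slides do not say exactly how a distance is derived from a pair of walks. One simple possibility, assumed here purely for illustration, is to compare the end points of the two walks. The helpers `endpoint` and `distance_matrix` and the mapping orientation are hypothetical.

```python
import math

# Assumed orientation of the A-T G-C mapping (A up, T down, G right, C left).
AT_GC = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}

def endpoint(seq, mapping=AT_GC):
    """Return the final (x, y) position of the walk; unknown bases are skipped."""
    x = y = 0
    for base in seq.upper():
        dx, dy = mapping.get(base, (0, 0))
        x, y = x + dx, y + dy
    return (x, y)

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def distance_matrix(seqs, metric=euclidean):
    """seqs: {name: sequence}. Returns a nested dict of pairwise distances."""
    ends = {name: endpoint(seq) for name, seq in seqs.items()}
    return {a: {b: metric(ends[a], ends[b]) for b in ends} for a in ends}

seqs = {"s1": "AAGG", "s2": "TTCC", "s3": "AAGC"}
print(distance_matrix(seqs, manhattan))
```

An end-point distance only reflects overall base composition, which is one reason a more complex measure might be needed for accurate trees.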
 
Phylogeny algorithms
Published distance matrix from Gilfillan GD, et al., Microbiology 144 (1998) 829-838, of 7 aligned 1798-nucleotide-long small rRNA sequences of Candida and Saccharomyces species
Panel labels: neighbour joining, UPGMA
UPGMA method
Tree construction
Distance matrix
Output
Newick format: string representation of a tree:
(Bovine:0.69395, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939, Mouse:1.21460);
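Going from a distance matrix to a Newick string with UPGMA can be sketched as below. This is a minimal textbook-style version, not the code used for the slides: it repeatedly merges the closest pair of clusters, places the new node at half their distance, and averages distances to the remaining clusters weighted by cluster size. The function name `upgma` and the nested-dict input format are assumptions.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a Newick string from pairwise distances.

    names: list of leaf labels.
    dist:  nested dict with dist[a][b] defined for every pair of labels.
    """
    # cluster id -> (newick subtree, number of leaves, height above the leaves)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[names[i]][names[j]]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        closest = min(d, key=d.get)          # pair with the least distance
        i, j = sorted(closest)
        (ti, ni, hi), (tj, nj, hj) = clusters.pop(i), clusters.pop(j)
        height = d.pop(closest) / 2          # new node sits halfway between them
        subtree = f"({ti}:{height - hi:.4f},{tj}:{height - hj:.4f})"
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            # Mean distance over all leaves in the merged cluster.
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = (subtree, ni + nj, height)
        next_id += 1

    (tree, _, _), = clusters.values()
    return tree + ";"

dist = {
    "A": {"B": 2.0, "C": 6.0},
    "B": {"A": 2.0, "C": 6.0},
    "C": {"A": 6.0, "B": 6.0},
}
print(upgma(["A", "B", "C"], dist))
```

UPGMA always produces a tree in which every leaf is the same distance from the root, so its output differs in shape from trees, such as the example above, whose branch lengths vary between leaves.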
Phylogenies with each mapping
Can summing 3 mappings eliminate bias?
No.
Possible improvements:
  • A more complex distance measure, perhaps a composite of small sequence distances
  • Larger sequences (human / chimpanzee chromosome), where mapping bias may be informative rather than destructive

Conclusion
DNA walks can:
  • summarise information about nucleotide content in DNA sequences
  • visualise tandem repeats
  • uncover more distant relationships such as duplications and retroviral genomes without expert knowledge or complex algorithms
  • be overlaid with annotations from Ensembl to function as an alternative to linear genome browsers
  • be used to construct phylogenetic relationships, but these are inaccurate
A-T G-C is the most useful mapping for viewing 2D walks

Future
3D walks:
  • mappings where one nucleotide opposes another result in information loss, which can be helpful, but we can't know what we are missing
  • use a tetrahedral mapping: each step starts from the centre and proceeds to a corner
  • should produce a 3D structure like proteins, but much bigger
  • can recover the 2D walk by viewing orientation

Acknowledgments
Biosciences
  • Dr. Gary Robinson (supervisor, Biosciences)
  • Dr. Jürgen Schmidt (course convenor)
  • Dr. Anthony Baines (Bioinformatics lecturer)
Computing
  • Dr. Colin Johnson (supervisor, Computing)
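A tetrahedral walk can be sketched with the four alternating corners of a cube as step vectors. Which base goes to which corner is an assumption made here; with this particular choice, looking straight down the z axis gives a walk in which A opposes T and G opposes C, i.e. the A-T G-C walk rotated by 45 degrees. `TETRA`, `walk_3d` and `project_xy` are hypothetical names.

```python
# Corners of a regular tetrahedron centred on the origin (assumed assignment).
# The four vectors sum to zero, so no base opposes any single other base.
TETRA = {
    "A": (1, 1, 1),
    "T": (-1, -1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
}

def walk_3d(seq):
    """Return every (x, y, z) point visited; unknown bases are skipped."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in seq.upper():
        if base in TETRA:
            dx, dy, dz = TETRA[base]
            x, y, z = x + dx, y + dy, z + dz
            points.append((x, y, z))
    return points

def project_xy(points):
    """View the walk along the z axis by dropping the z coordinate."""
    return [(x, y) for x, y, _ in points]

points = walk_3d("AATGC")
print(points)
print(project_xy(points))
```

With the same corner assignment, dropping x or y instead of z gives views in which A opposes G, or A opposes C, so the three axis-aligned views correspond to the three 2D mappings.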


20080110 Genome exploration in A-T G-C space: an introduction to DNA walking


Editor's Notes

  • #4: Using our knowledge of gene structure: start codons (ATG), stop codons (UAA, UAG, UGA), and BLASTs (sequence alignments) of mRNA transcripts with DNA
  • #5: An alphabet of 4 bases, with words millions and millions of letters long with no punctuation.
  • #20: Do you know what is at the top of this mountain? It is the origin of replication of the B. burgdorferi chromosome that, incidentally, has just been experimentally mapped. As a consequence, when you are climbing the mountain you are reading the lagging strand for replication and when you are going down the mountain you are reading the leading strand for replication. What is usually found, but not always if you remember Synechocystis, is that the leading strand is enriched in keto (G or T) bases and that the lagging strand is enriched in amino (A or C) bases. Interestingly, these biases have been reported for both eubacteria and archaea, mitochondria, chloroplasts, viruses and plasmids, but not for eukaryotes up to now. In B. burgdorferi the biases are so important that they affect the amino acid content of proteins and you can guess if a protein was encoded on the leading or the lagging strand solely from its amino acid content. The term 'chirochore' was coined to describe fragments of the genome corresponding to a mountainside, that is a DNA fragment more or less homogeneous for the base composition biases. This is a purely descriptive term without reference to any mechanism, reminiscent of 'isochore' for the description of DNA fragments with a homogeneous G+C content in some vertebrate chromosomes. On the other hand, the term 'replichore' was introduced to designate the two oppositely replicated halves of the chromosome between the origin and the terminus in bacteria. The good thing is that chirochore and replichore boundaries are the same in bacteria: the origins of replication are found at the top of the mountains while the termini are found at the bottom of the lowlands. This strongly suggests that these kinds of genome landscapes have something to do with replication.
    Genomic tectonics: A simple model to explain the universality of the phenomenon is based on the spontaneous deamination of cytosines that induce C→T mutations. The rate of this deamination is highly increased in single-stranded DNA, probably because of greater accessibility to the solvent in this state than in the double-stranded state. During replication the lagging strand is continuously protected by the newly synthesized leading strand, but the leading strand has to maintain a transient single-stranded state while waiting for the next Okazaki fragment to be long enough to restore the double-stranded state (see Fig. 6). This fundamental asymmetry of replication may explain the universality of the observed systematic biases in base composition; these are at least compatible with the hypothesis. The protection against cytosine deamination may differ between species and explain the variability in intensity of biases.
    The cytosine deamination theory is nice but it is just a theory. The fundamental limit of bioinformatics is that without experimental data you can discuss in silico results endlessly and fruitlessly. I would be very curious to know the minimum inhibitory concentration (MIC) of chemicals such as bisulphite, which catalyses the deamination of cytosine, for bacteria in which compositional biases are important (for instance B. burgdorferi, Treponema pallidum, Neisseria meningitidis). These compounds may not only inhibit replication but also transcription because during transcription a transient single-stranded DNA state is necessary too. It would also be interesting to determine toxicological information for eukaryotic cells.
  • #31: Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
  • #32: Unweighted Pair Group Method with Arithmetic Mean (UPGMA): the simplest method; cluster the pair with the least distance, then calculate the new distances using the mean