GENE315: Genome size and complexity
Paul Gardner
March 14, 2023
Introduction
▶ I started at Otago in 2018.
▶ Taught bioinformatics, genomics & biostatistics for 6 years at
the University of Canterbury.
▶ Worked for ∼ 10 years in Europe. I’ve lived in Bielefeld,
Copenhagen and Cambridge. Taught courses in Denmark,
Cambridge, Poland, Portugal, Australia, USA, and NZ.
▶ Originally from Whāngārā, married with three children.
Overview of lectures 03-08
▶ 3. Genome size and complexity.
▶ 4. Comparative genomics & genome annotation.
▶ 5. Comparative genomics & sequence alignment.
▶ 6. Comparative genomics & homology search.
▶ 7. Measuring genome function.
Objectives for lecture 03
▶ An appreciation for:
▶ “Junk DNA” & the C-value paradox.
▶ The ENCODE project.
▶ Critical appraisals of genome function.
What is a gene?
NIH:
Wikipedia:
Ohno S (1972) So Much “Junk DNA” in our Genome. Brookhaven Symposium on Biology. – available on
Blackboard.
Summary of Susumu Ohno’s main points about junk DNA
▶ The human genome is roughly 3 × 10⁹ bases, ≈ 750 times E.
coli.
▶ Assuming the same gene density as E. coli, humans will have
≈ 3 million genes.
▶ HOWEVER, if gene density is preserved, then salamanders
have 10 times the number of genes that we do.
▶ There may be a strict upper limit to the number of gene loci.
▶ Therefore, only a fraction of human DNA is “genic”.
▶ Argues, based upon mutation rates, that the number of
genes must be less than 1/(mutation rate) to avoid meltdown,
▶ and that the number of genes is ≈ 30,000, so about 6% of
our DNA is genic.
▶ Speculates about some hypothetical functions of non-coding
DNA, e.g. chromosome structure, promoter/operators,
inhibiting non-sense/frameshift mutations, ...
▶ Speculates that our genome is littered with the “fossil”
remains of extinct genes.
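Ohno's back-of-the-envelope arithmetic can be checked in a few lines. This is a minimal sketch using only the round figures quoted above; the E. coli genome size, the E. coli gene count and the DNA per gene locus are values implied by those figures, not numbers taken from the slide, and the mutation rate is an illustrative assumption.

```python
# Figures quoted on the slide (Ohno, 1972).
HUMAN_GENOME_BP = 3e9          # roughly 3 x 10^9 bases
HUMAN_TO_ECOLI_RATIO = 750     # human genome is ~750 times E. coli
NAIVE_HUMAN_GENES = 3e6        # gene count if E. coli gene density held
OHNO_GENE_ESTIMATE = 30_000    # Ohno's estimate of the human gene number
GENIC_FRACTION = 0.06          # "about 6% of our DNA is genic"

# ASSUMPTION (not on the slide): an illustrative per-locus mutation
# rate per generation, used only to show the 1/(mutation rate) bound.
ASSUMED_MUTATION_RATE = 1e-5

# Values implied by the slide's round numbers.
ecoli_genome_bp = HUMAN_GENOME_BP / HUMAN_TO_ECOLI_RATIO
bp_per_gene_at_ecoli_density = HUMAN_GENOME_BP / NAIVE_HUMAN_GENES
implied_ecoli_genes = ecoli_genome_bp / bp_per_gene_at_ecoli_density

# The same gene density applied to a genome 10 times the size of ours.
naive_salamander_genes = 10 * NAIVE_HUMAN_GENES

# Upper limit on gene loci from the mutation-load argument.
max_gene_loci = 1 / ASSUMED_MUTATION_RATE

# DNA per locus needed for 30,000 genes to fill 6% of the genome.
implied_bp_per_locus = GENIC_FRACTION * HUMAN_GENOME_BP / OHNO_GENE_ESTIMATE

print(f"Implied E. coli genome: {ecoli_genome_bp:.0f} bp")
print(f"Implied E. coli genes: {implied_ecoli_genes:.0f}")
print(f"Naive salamander gene count: {naive_salamander_genes:.0f}")
print(f"Gene-locus ceiling at the assumed rate: {max_gene_loci:.0f}")
print(f"Implied DNA per gene locus: {implied_bp_per_locus:.0f} bp")
```

The point of the exercise is the mismatch: a constant gene density predicts millions of genes, while the mutation-load bound allows far fewer.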
Vertebrate genomes & junk DNA
▶ Vary in length by two orders of magnitude, e.g. bird genomes
are ≈ 1Gb, while salamander genomes are ≈ 32Gb (giant
lungfish genome is ≈ 43Gb).
▶ Variation largely driven by decaying remnants of transposons.
▶ ≈ 300Mb of sequence is conserved across the vertebrates.
▶ Randomly generated sequences when inserted into genomes
are also transcribed, promoters are very degenerate.
▶ I.e. many different sequences can be bound by a transcription
factor
▶ Which suggests the number of functional elements should not
scale with genome length.
Ohno S (1972) So much “junk” DNA in our genome. In Evolution of Genetic Systems, Brookhaven Symp. Biol.
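To see why a roughly fixed amount of conserved sequence matters, here is a small sketch that expresses ≈ 300 Mb as a percentage of each genome size mentioned so far. It assumes, as a simplification, that the same 300 Mb is present in every genome; the function name `conserved_percent` is made up for this example.

```python
CONSERVED_MB = 300  # sequence conserved across the vertebrates (from the slide)

# Approximate genome sizes in Gb, as quoted in the lecture.
genome_sizes_gb = {
    "bird": 1,
    "human": 3,
    "salamander": 32,
    "giant lungfish": 43,
}

def conserved_percent(genome_gb, conserved_mb=CONSERVED_MB):
    """Percentage of a genome covered by the conserved sequence."""
    return 100 * conserved_mb / (genome_gb * 1000)

for species, size_gb in genome_sizes_gb.items():
    print(f"{species}: {conserved_percent(size_gb):.1f}% conserved")
```

Under this assumption the conserved share falls from about 30% in a bird to under 1% in a salamander, which is what you would expect if functional elements do not scale with genome length.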
Mammalian genome variation & junk DNA
▶ But salamanders are special... (if the genomes were
larger in the birds than in the salamanders we could
rationalise it by how complex birds are).
▶ Let’s look at just the mammalian extremes of genome size
then... the plains (red) viscacha rat, the mountain viscacha rat and a
bat!
Which of these mammals is more complex?
Southern bent-wing bat (Miniopterus orianae bassanii): ~1.6 Gb genome
Mountain viscacha rat (Octomys mimax): ~3.2 Gb genome
Plains viscacha rat (Tympanoctomys barrerae): ~6.2 Gb genome
Evans et al. (2017) Evolution of the Largest Mammalian Genome. GBE.
30 years later we forgot about Ohno’s line of reasoning...
Editorial (2000) The nature of the number. Nature genetics.
Willyard (2018) New human gene tally reignites debate. Nature.
Hatje et al. (2019) The Protein-Coding Human Genome: Annotating High-Hanging Fruits. BioEssays.
Genome sizes & number of genes
[Figure. Panel A: Nucleus, Cytoplasm, ER, Golgi, Lysosome, Mitochondria (scale 1 to 33). Panel B: number of protein-coding genes (10¹ to 10⁵) and genome length (10⁴ to 10¹⁰), log scales, grouped as Animal, Fungi, Plant, Protists, Archaea, Bacteria and Phage; labelled genomes: Paramecium, Arabidopsis, Homo, Gemmata, Escherichia, Saccharomyces, Lokiarchaeum, Mycoplasma, Nanoarchaeum, Escherichia phage. Further labels: Interactions, Transect.]
▶ Prediction: “In fact, there seems to be a strict upper limit for
the number of gene loci which we can afford to keep in our
genome” (Ohno, 1972)
▶ Eukaryotic genomes can vary widely in size, with little change
in the number of protein-coding genes.
Gardner, Gumy & Fineran (2019) Unpublished.
Related to junk DNA is the “C-value paradox”
▶ C-values (a.k.a. genome size) vary enormously between
closely related species.
▶ Vertebrates: birds (1 Gb), bats (1.6 Gb), human (3 Gb),
tuatara (5 Gb), salamander (32 Gb), lungfish (130 Gb).
▶ Allium (onions): 7 Gb to 32 Gb.
▶ There is little relationship between the “perceived
complexity” (or the corresponding number of genes) of a
species and the size of a genome.
▶ Given that C-values are relatively constant within
species/cells, yet bear no relationship to complexity or the
presumed number of genes, this is somewhat of a paradox.
https://en.wikipedia.org/wiki/C-value#C-value_paradox
Eddy, SR (2012) The C-value paradox, junk DNA and ENCODE. Current Biology.
Palazzo & Gregory (2014) The Case for Junk DNA. PLOS Genetics.
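The size of the spread is easier to appreciate as a fold-change. A minimal sketch using the vertebrate C-values listed on this slide (round figures only; variable names are chosen for this example):

```python
import math

# Approximate vertebrate genome sizes in Gb, as listed on the slide.
vertebrate_genome_gb = {
    "birds": 1.0,
    "bats": 1.6,
    "human": 3.0,
    "tuatara": 5.0,
    "salamander": 32.0,
    "lungfish": 130.0,
}

smallest = min(vertebrate_genome_gb, key=vertebrate_genome_gb.get)
largest = max(vertebrate_genome_gb, key=vertebrate_genome_gb.get)

# Fold difference between the extremes, and the same on a log10 scale.
fold_range = vertebrate_genome_gb[largest] / vertebrate_genome_gb[smallest]
orders_of_magnitude = math.log10(fold_range)

# Every genome expressed relative to the human genome.
relative_to_human = {
    name: size / vertebrate_genome_gb["human"]
    for name, size in vertebrate_genome_gb.items()
}

print(f"{largest} vs {smallest}: {fold_range:.0f}-fold "
      f"({orders_of_magnitude:.1f} orders of magnitude)")
for name, ratio in relative_to_human.items():
    print(f"{name}: {ratio:.2f} x human")
```

If genome size tracked complexity or gene number, a lungfish would need dozens of times more genes than a human, which is the heart of the paradox.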
Drivers of genome size variation
C-value variation is largely driven by transposable elements and
other repetitive DNA elements
https://sandwalk.blogspot.com/2018/03/whats-in-your-genome-pie-chart.html
http://www.dfam.org/entry/DF0000001/hits
Protein functions, 2023 – still lots of unknowns
cytoskeletal protein 3% (627)
transporter 5% (1031)
scaffold/adaptor protein 4% (746)
protein-binding activity modulator 4% (793)
RNA metabolism protein 4% (827)
gene-specific transcriptional regulator 7% (1516)
defense/immunity protein 3% (664)
metabolite interconversion enzyme 9% (1939)
protein modifying enzyme 8% (1622)
transmembrane signal receptor 6% (1151)
Unclassified 33% (6695)
Other 14% (2980)
Unknown 0% (0)
20,851 Human Proteome Functions 2023
Gardner (2023)
Data from Panther: http://www.pantherdb.org
The ENCODE Project 2012
▶ 443 authors, from genome/analysis centres around the world
▶ Aim to map functional elements across the human genome
▶ Generated LOTS of important data that we are still using:
▶ Transcription: RNA-seq, CAGE (Cap Analysis Gene
Expression), RNA-PET (paired end tag)
▶ Translation: mass spectrometry
▶ Transcription-factor-binding sites: ChIP-seq and DNase-seq
▶ Chromatin structure: DNase-seq, FAIRE-seq, histone
ChIP-seq and MNase-seq
▶ DNA methylation sites: RRBS assay
▶ Largely used 3 cell lines: erythroleukaemia K562;
B-lymphoblastoid GM12878 & H1 embryonic stem cells
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome.
Nature.
ENCODE 2012
http://www.nature.com/encode/
ENCODE 2012: LOTS of cool results (30 papers)
E.g. transcription is correlated (r = 0.9) with transcription factor
binding (left) and cancer associated SNPs are linked to chromatin
structure, which may influence gene expression (right).
ENCODE 2012
▶ Acknowledged that “Comparative genomic studies suggest
that 3% to 8% of bases are under purifying (negative)
selection”
▶ Yet conclude “The vast majority (80.4%) of the human
genome participates in at least one biochemical RNA- and/or
chromatin-associated event in at least one cell type.”
▶ “Many non-coding variants in individual genome sequences lie
in ENCODE-annotated functional regions; this number is at
least as large as those that lie in protein-coding genes.”
▶ “Single nucleotide polymorphisms (SNPs) associated with
disease by GWAS are enriched within non-coding functional
elements”
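The gap between the two headline numbers is worth putting in bases. A rough sketch, assuming a 3 × 10⁹ base genome (the round figure used earlier in the lecture) and, for the lower bound, that every conserved base also counts as biochemically active:

```python
GENOME_BP = 3e9  # round human genome size used earlier in the lecture

# Fractions quoted by ENCODE (2012).
CONSERVED_LOW, CONSERVED_HIGH = 0.03, 0.08   # under purifying selection
BIOCHEMICALLY_ACTIVE = 0.804                 # at least one biochemical event

def to_mb(fraction, genome_bp=GENOME_BP):
    """Convert a fraction of the genome into megabases."""
    return fraction * genome_bp / 1e6

conserved_mb = (to_mb(CONSERVED_LOW), to_mb(CONSERVED_HIGH))
active_mb = to_mb(BIOCHEMICALLY_ACTIVE)

# ASSUMPTION: all conserved bases fall inside the active set, so this is
# the minimum amount of sequence that is active but not detectably conserved.
min_active_not_conserved_mb = to_mb(BIOCHEMICALLY_ACTIVE - CONSERVED_HIGH)

print(f"Conserved: {conserved_mb[0]:.0f} to {conserved_mb[1]:.0f} Mb")
print(f"Biochemically active: {active_mb:.0f} Mb")
print(f"Active but not conserved: at least {min_active_not_conserved_mb:.0f} Mb")
```

On these figures, more than 2 Gb of the genome shows biochemical activity without detectable selective constraint, which is the discrepancy the critiques focus on.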
The ENCODE press led to some “interesting” headlines:
And there was a predictable scientific backlash:
▶ Eddy, SR (2012) The C-value paradox, junk DNA and
ENCODE. Current Biology.
▶ Q&A style discussion of C-values, junk DNA and ENCODE
▶ Eddy, SR (2013) The ENCODE project: Missteps
overshadowing a success. Current Biology.
▶ “all reproducible biochemical events were claimed to be
‘critical’ and ‘needed’.”
▶ “Far from disproving junk DNA, ENCODE’s operationalized
definition of function included junk DNA”.
▶ “genomes contain a lot of DNA that is not critically important
for host functions, much of which arises from mobile element
replication”.
▶ Sean Eddy proposes “The Random Genome Project: the
missing negative control”.
▶ “Biology is noisy” therefore an assessment of noise in
transcription, translation & DNA binding would allow an
assessment of how much of the ENCODE results are noise,
and how much is genuinely functional.
▶ Recent progress: Deciphering eukaryotic gene-regulatory logic
with 100 million random promoters
And further responses
▶ Doolittle WF (2013) Is junk DNA bunk? A critique of
ENCODE. PNAS.
▶ If the number of functional elements were to rise significantly
with C-value, then one of the following would have to hold:
▶ (i) organisms with larger genomes are more complex
phenotypically
▶ NB. Organism/phenotypic complexity is generally not well
defined.
▶ (ii) ENCODE’s definition of a functional element identifies
many sites that would not be considered functional or
phenotype-determining by standard uses in biology
▶ (iii) the same phenotypic functions are often determined in a
more diffuse fashion in larger-genomed organisms
▶ Graur D et al. (2013) On the immortality of television
sets: “function” in the human genome according to the
evolution-free gospel of ENCODE. Genome Biology and
Evolution.
▶ “a very forcefully worded critique” – W. Ford Doolittle
▶ “angry, dogmatic, scattershot, sometimes inaccurate” – Sean
Eddy
Note: Transcription & translation are likely to be noisy
processes
[Figure: the genome splits into functional and junk DNA. Each part may be untranscribed or transcribed (the transcriptome), and transcribed regions may in turn be untranslated or translated (the proteome).]
Promoters/Transcription factor binding motifs are
degenerate
▶ “Degenerate” refers to the fact that many different
sequences can promote transcription
▶ This implies there is lots of potential for random, noisy
transcription
http://jaspar.genereg.net/
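How often does a degenerate motif turn up purely by chance? The sketch below counts matches to a degenerate motif in a random sequence and compares the count with the expected number. It is a toy negative control, not a model of real promoters: the motif (the textbook TATA-box consensus TATAWAWR), the uniform base composition and the function names are all assumptions made for this example, and only one strand is scanned.

```python
import random
import re

# IUPAC nucleotide codes used in this example.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT",
}

def match_probability(motif):
    """Chance that a random k-mer (uniform bases) matches the motif."""
    p = 1.0
    for symbol in motif:
        p *= len(IUPAC[symbol]) / 4
    return p

def expected_hits(motif, genome_length):
    """Expected number of matches on one strand of a random sequence."""
    return (genome_length - len(motif) + 1) * match_probability(motif)

def count_hits(motif, sequence):
    """Count matches, including overlapping ones, using a lookahead."""
    pattern = "(?=" + "".join("[" + IUPAC[s] + "]" for s in motif) + ")"
    return len(re.findall(pattern, sequence))

def random_genome(length, seed=0):
    """Random DNA with equal base frequencies (a simplifying assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

motif = "TATAWAWR"
genome = random_genome(200_000)
observed = count_hits(motif, genome)
expected = expected_hits(motif, len(genome))

print(f"Observed matches in random sequence: {observed}")
print(f"Expected matches by chance: {expected:.1f}")
print(f"Expected in 3e9 random bases: {expected_hits(motif, 3_000_000_000):.0f}")
```

Even this fairly specific 8-base motif is expected roughly once every 8 kb of random sequence, so a genome-sized random sequence would carry hundreds of thousands of chance matches.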
How would you determine if a genomic region is
“functional”?
▶ Lecture #7
Humans are not the most (or least) complex species!
Big brains, hands, sure. But we can’t breathe underwater,
regrow limbs, don’t have tentacles, can’t fly unassisted, can’t
photosynthesise, melt under mild radiation, turn inside out in a
vacuum, get cancers - circulatory disease - infectious disease, do
stupid things, get too hot and too cold, ...
Humans are weak and puny!
Great Chain of Being
https://en.wikipedia.org/wiki/Category:Obsolete_biological_theories
The main points
▶ C-values (genome sizes) vary dramatically, even between
closely related species.
▶ The “C-value paradox” is that genome size is not proportional
to the perceived organism “complexity”.
▶ Many “functional” non-coding elements exist in the genome
(e.g. ncRNAs, promoters, etc.)
▶ Most eukaryotic genomes are littered with
“junk” DNA.
▶ This is largely pseudogenised transposable elements and
endogenous retroviruses
▶ NB. junk ≠ inactive AND non-coding ≠ junk
▶ ENCODE is a valuable resource for identifying “biochemical”
activity in cell lines.
▶ Some of the ENCODE conclusions and press are controversial.
▶ Reports of the death of junk DNA are greatly exaggerated.
Self-evaluation exercises
▶ Outline the C-value paradox, include in your answer a
definition of junk DNA.
▶ What are the main sources of junk DNA?
▶ Describe the ENCODE project and main conclusions.
▶ Outline an approach to distinguish between functional and
noisy cellular processes.
▶ The figures below have been used to associate proportions of
noncoding DNA with “complexity”. Using your knowledge of
the C-value paradox, write a critique of this result.
Mattick (2004) The hidden genetic program of complex organisms. Scientific American.
Taft et al. (2007) The relationship between non-protein-coding DNA and eukaryotic complexity. BioEssays.
Self-evaluation exercises
▶ Use ChatGPT to generate an essay on “genome function and
junk DNA”. Critique the essay based upon the references
presented in this lecture.
AI generated art by https://creator.nightcafe.studio/studio, “Junk DNA”, Preset: Surreal
Suggested reading
▶ Important: Ohno S (1972) So Much “Junk DNA” in our
Genome. Brookhaven Symposium on Biology.
▶ Important: Eddy, SR (2013) The ENCODE project: Missteps
overshadowing a success. Current Biology.
▶ Extra: Palazzo & Gregory (2014) The Case for Junk DNA.
PLOS Genetics.
Questions relating to my lectures can be asked & viewed
here:
https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv-
qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing
The End
Karitane, 23 Oct 2020.
