Introduction to Bioinformatics
2. Genetics Background
Course 341
Department of Computing
Imperial College, London
© Simon Colton
Coursework
 1 coursework – worth 20 marks
– Work in pairs
 Retrieving information from a database
 Using Perl to manipulate that information
The Robot Scientist
 Performs experiments
 Learns from results
– Using machine learning
 Plans more experiments
 Saves time and money
 Team member:
– Stephen Muggleton
Biological Nomenclature
 Need to know the meaning of:
– Species, organism, cell, nucleus, chromosome, DNA
– Genome, gene, base, residue, protein, amino acid
– Transcription, translation, messenger RNA
– Codons, genetic code, evolution, mutation, crossover
– Polymer, genotype, phenotype, conformation
– Inheritance, homology, phylogenetic trees
Substructure and Effect
(Top Down/Bottom Up)
Species
Organism
Cell
Nucleus
Chromosome
DNA strand
Gene
Base
Protein
Amino Acid
Folds into
Affects the Function of
Affects the Behaviour of
Prescribes
Cells
 Basic unit of life
 Different types of cell:
– Skin, brain, red/white blood
– Different biological function
 Cells produced by cells
– Cell division (mitosis)
– 2 daughter cells
 Eukaryotic cells
– Have a nucleus
Nucleus and Chromosomes
 Each cell has nucleus
 Rod-shaped particles inside
– Are chromosomes
– Which we think of in pairs
 Different number for species
– Human (46), tobacco (48)
– Goldfish (94), chimp (48)
– Usually paired up
 X & Y Chromosomes
– Humans: Male (XY), Female (XX)
– Birds: reversed, Male (ZZ), Female (ZW)
DNA Strands
 Chromosomes are same in every cell of organism
– Supercoiled DNA (Deoxyribonucleic acid)
 Take a human, take one cell
– Determine the structure of all chromosomal DNA
– You’ve just read the human genome (for 1 person)
– Human genome project
 13 years, 3.2 billion chemicals (bases) in human genome
 Other genomes being/been decoded:
– Pufferfish, fruit fly, mouse, chicken, yeast, bacteria
DNA Structure
 Double Helix (Crick & Watson)
– 2 coiled matching strands
– Backbone of sugar phosphate pairs
 Nitrogenous Base Pairs
– Roughly 20 atoms in a base
– Adenine pairs with Thymine [A,T]
– Cytosine pairs with Guanine [C,G]
– Weak bonds (can be broken)
– Form long chains called polymers
 Read the sequence on 1 strand
– GATTCATCATGGATCATACTAAC
Differences in DNA
[Figure labels: "2% tiny", "Roughly 4%", "Share Material"]
 DNA differentiates:
– Species/race/gender
– Individuals
 We share DNA with
– Primates, mammals
– Fish, plants, bacteria
 Genotype
– DNA of an individual
 Genetic constitution
 Phenotype
– Characteristics of the resulting organism
 Nature and nurture
Genes
 Chunks of DNA sequence
– Between 600 and 1200 bases long
– 32,000 human genes, 100,000 genes in tulips
 Large percentage of human genome
– Is “junk”: does not code for proteins
 “Simpler” organisms such as bacteria
– Are much more evolved (have hardly any junk)
– Viruses have overlapping genes (zipped/compressed)
 Often the active part of a gene is split into exons
– Separated by introns
The Synthesis of Proteins
 Instructions for generating Amino Acid sequences
– (i) DNA double helix is unzipped
– (ii) One strand is transcribed to messenger RNA
– (iii) RNA acts as a template
 ribosomes translate the RNA into the sequence of amino acids
 Amino acid sequences fold into a 3d molecule
 Gene expression
– Every cell has every gene in it (has all chromosomes)
– Which ones produce proteins (are expressed) & when?
Transcription
 Take one strand of DNA
 Write out the counterparts to each base
– G becomes C (and vice versa)
– A becomes T (and vice versa)
 Change Thymine [T] to Uracil [U]
 You have transcribed DNA into messenger RNA
 Example:
Start: GGATGCCAATG
Intermediate: CCTACGGTTAC
Transcribed: CCUACGGUUAC
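
The transcription recipe above can be sketched in a few lines of Python. This is only an illustration of the slide's simplified procedure (complement each base, then swap T for U); the function and variable names are assumptions made for this example.

```python
# Simplified transcription as described on the slide:
# 1. write the counterpart of each base (G<->C, A<->T)
# 2. change thymine (T) to uracil (U)
# The names below are illustrative, not from the course material.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def transcribe(dna: str) -> str:
    """Return the messenger RNA for one DNA strand (slide convention)."""
    intermediate = "".join(DNA_COMPLEMENT[base] for base in dna.upper())
    return intermediate.replace("T", "U")


print(transcribe("GGATGCCAATG"))  # CCUACGGUUAC
```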
Genetic Code
 How the translation occurs
 Think of this as a function:
– Input: triples of base letters (codons)
– Output: amino acid
– Example: ACC becomes threonine (T)
 Gene sequences end with:
– TAA, TAG or TGA
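
Treating the genetic code as a function from codons to amino acids can be sketched as a lookup table. The sketch below assumes the standard genetic code, written in the conventional T, C, A, G ordering with `*` marking a stop codon; the names `CODON_TABLE` and `amino_acid` are made up for this example.

```python
from itertools import product

# Standard genetic code, one letter per codon, with codons enumerated in
# T, C, A, G order (third base varying fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acid(codon: str) -> str:
    """Map one codon (DNA or RNA letters) to a one-letter amino acid code."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


print(amino_acid("ACC"))  # T (threonine)
print([c for c, aa in CODON_TABLE.items() if aa == "*"])  # ['TAA', 'TAG', 'TGA']
```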
Genetic Code
A=Ala=Alanine
C=Cys=Cysteine
D=Asp=Aspartic acid
E=Glu=Glutamic acid
F=Phe=Phenylalanine
G=Gly=Glycine
H=His=Histidine
I=Ile=Isoleucine
K=Lys=Lysine
L=Leu=Leucine
M=Met=Methionine
N=Asn=Asparagine
P=Pro=Proline
Q=Gln=Glutamine
R=Arg=Arginine
S=Ser=Serine
T=Thr=Threonine
V=Val=Valine
W=Trp=Tryptophan
Y=Tyr=Tyrosine
Example Synthesis
 TCGGTGAATCTGTTTGAT
Transcribed to:
 AGCCACUUAGACAAACUA
Translated to:
 SHLDKL
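
The worked example above can be reproduced end to end with a short Python sketch. It assumes the standard genetic code and the slide's simplified transcription (complement, then T to U); all function names are illustrative.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def transcribe(dna: str) -> str:
    """Complement each base, then replace T with U (slide convention)."""
    return "".join(DNA_COMPLEMENT[b] for b in dna.upper()).replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        aa = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("TCGGTGAATCTGTTTGAT")
print(mrna)             # AGCCACUUAGACAAACUA
print(translate(mrna))  # SHLDKL
```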
Proteins
 DNA codes for
– strings of amino acids
 Amino acid strings
– Fold up into complex 3D molecule
– 3D structures: conformations
– Between 200 & 400 “residues”
– Folds are proteins
 Residue sequences
– Always fold to same conformation
 Proteins play a part
– In almost every biological process
Evolution of Genes: Inheritance
 Evolution of species
– Caused by reproduction and survival of the fittest
 But actually, it is the genotype which evolves
– Organism has to live with it (or die before reproduction)
– Three mechanisms: inheritance, mutation and crossover
 Inheritance: properties from parents
– Embryo has cells with 23 pairs of chromosomes
– Each pair: 1 chromosome from father, 1 from mother
– Most important factor in offspring’s genetic makeup
Evolution of Genes: Mutation
 Genes alter (slightly) during reproduction
– Caused by errors, from radiation, from toxicity
– 3 possibilities: deletion, insertion, substitution (alteration)
 Deletion: ACGTTGACTC → ACGTGACTC
 Insertion: ACGTTGACTC → AGCGTTGACTC
 Substitution: ACGTTGACTC → ACGATGACTT
 Mutations are almost always deleterious
– A single change has a massive effect on translation
– Causes a different protein conformation
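
The three kinds of mutation on this slide can be sketched as simple string operations. The helper names and the 0-based positions are assumptions chosen to reproduce the slide's examples; note that the substitution example changes two bases, so it is applied twice.

```python
# Point mutations as string edits. Positions are 0-based and were chosen
# to reproduce the slide's examples; the helper names are illustrative.
def delete(seq: str, pos: int) -> str:
    """Remove the base at position pos."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert base before position pos."""
    return seq[:pos] + base + seq[pos:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at position pos."""
    return seq[:pos] + base + seq[pos + 1:]


original = "ACGTTGACTC"
print(delete(original, 3))       # ACGTGACTC
print(insert(original, 1, "G"))  # AGCGTTGACTC
print(substitute(substitute(original, 3, "A"), 9, "T"))  # ACGATGACTT
```

A deletion or insertion shifts every later codon boundary, which is one reason a single change can alter the whole translated sequence.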
Evolution of Genes:
Crossover (Recombination)
 DNA sections are swapped
– From male and female genetic input to offspring DNA
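
As a rough illustration of swapping DNA sections, the sketch below performs a single-point crossover on two strings. This is a deliberately simplified model (one cut point, equal-length sequences), not a description of biological recombination; the function name and the toy sequences are made up for this example.

```python
# Single-point crossover: cut both sequences at the same position and
# swap the tails. A simplified model; real recombination is more involved.
def crossover(a: str, b: str, point: int) -> tuple[str, str]:
    """Return the two sequences obtained by swapping everything after point."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes sequences of equal length")
    return a[:point] + b[point:], b[:point] + a[point:]


# Toy inputs chosen so the swapped sections are easy to see.
child_1, child_2 = crossover("AAAAAA", "CCCCCC", 2)
print(child_1)  # AACCCC
print(child_2)  # CCAAAA
```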
Bioinformatics Application #1
Phylogenetic trees
 Understand our evolution
 Genes are homologous
– If they share a common ancestor
 By looking at DNA seqs
– For particular genes
– See who evolved from who
 Example:
– Mammoth most related to
 African or Indian Elephants?
 LUCA:
– Last Universal Common Ancestor
– Roughly 4 billion years ago
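
One very simple way to compare DNA sequences for the same gene is to count the positions at which they differ (the Hamming distance) and treat the smallest count as the closest relative. The sketch below assumes equal-length, already aligned sequences; the sequences and candidate names are invented placeholders, not real elephant or mammoth data, and real phylogenetic methods are considerably more sophisticated.

```python
# Hamming distance between aligned sequences as a crude relatedness score.
# All sequences and names below are invented placeholders.
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_relative(query: str, candidates: dict[str, str]) -> str:
    """Return the name of the candidate with the fewest differences."""
    return min(candidates, key=lambda name: hamming(query, candidates[name]))


query = "ACGTTGACTC"
candidates = {
    "candidate_a": "ACGTTGACTT",  # 1 difference from the query
    "candidate_b": "ACCTAGACTT",  # 3 differences from the query
}
print(closest_relative(query, candidates))  # candidate_a
```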
Genetic Disorders
 Disorders have fuelled much genetics research
– Remember that genes have evolved to function
 Not to malfunction
 Different types of genetic problems
 Down's syndrome: three copies of chromosome 21
 Cystic fibrosis:
– Single base-pair mutation disables a protein
– Restricts the flow of ions into certain lung cells
– Lung is less able to expel fluids
Bioinformatics Application #2
Predicting Protein Structure
 Proteins fold to set up an active site
– Small, but highly effective (sub)structure
– Active site(s) determine the activity of the protein
 Remember that translation is a function
– Always same structure given same set of codons
– Is there a set of rules governing how proteins fold?
– No one has found one yet
– “Holy Grail” of bioinformatics
Protein Structure Knowledge
 Both protein sequence and structure
– Are being determined at an exponential rate
 1.3+ Million protein sequences known
– Found with projects like Human Genome Project
 20,000+ protein structures known
– Found using techniques like X-ray crystallography
 Takes between 1 month and 3 years
– To determine the structure of a protein
– Process is getting quicker
Sequence versus Structure
[Chart: number of known protein sequences and protein structures by year, 1985 to 2000; vertical axis from 0 to 500,000]
Database Approaches
 Slow(er) rate of finding protein structure
– Still a good idea to pursue the Holy Grail
 Structure is much more conservative than sequence
– 1.3m genes, but only 2,000 – 10,000 different conformations
 First approach to structure prediction:
– Store [sequence,structure] pairs in a database
– Find ways to score similarity of residue sequences
– Given a new sequence, find closest matches
 A good match will possibly mean similar protein shape
 E.g., sequence identity > 35% will give a good match
– Rest of the first half of the course is about these issues
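
The database approach can be sketched as a lookup that scores a new sequence against stored [sequence, structure] pairs. The sketch below assumes the crudest possible score (percent identity of equal-length sequences, with no alignment or gaps) and a toy database; the residue sequences, structure labels, and function names are all hypothetical.

```python
# Nearest-match lookup over [sequence, structure] pairs.
# Scoring is plain percent identity on equal-length sequences; real
# methods align sequences first. All data here is hypothetical.
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions holding the same residue."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(query: str, database: dict[str, str], threshold: float = 35.0):
    """Return (sequence, structure, identity) for the closest stored
    sequence, or None if nothing scores above the threshold."""
    scored = [
        (percent_identity(query, seq), seq, structure)
        for seq, structure in database.items()
        if len(seq) == len(query)
    ]
    if not scored:
        return None
    identity, seq, structure = max(scored)
    return (seq, structure, identity) if identity > threshold else None


database = {"SHLDKL": "fold_1", "MKVLAA": "fold_2"}  # toy entries
print(best_match("SHLEKL", database))  # closest entry is SHLDKL / fold_1
print(best_match("WWWWWW", database))  # None
```

The 35% default mirrors the rule of thumb quoted on the slide; it is a parameter so that other cut-offs can be tried.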
Potential (Big) Payoffs
of Protein Structure Prediction
 Protein function prediction
– Protein interactions and docking
 Rational drug design
– Inhibit or stimulate protein activity with a drug
 Systems biology
– Putting it all together: “E-cell” and “E-organism”
– In-silico modelling of biological entities and processes
Further Reading
 Human Genome Project at Sanger Centre
– http://www.sanger.ac.uk/HGP/
 Talking glossary of genetic terms
– http://www.genome.gov/glossary.cfm
 Primer on molecular genetics
– http://www.ornl.gov/TechResources/Human_Genome/publicat/primer/toc.html
