M.Alroy Mascrenghe 1
Introduction……
2000
n  A major event happened that was to change the course of human history
n  It was a joint British and American effort
n  Nothing to do with Iraq!
n  It was a race: who would finish first?
n  Race test: not whether they have taken drugs but whether they can produce them!
n  The human genome was sequenced (working draft announced)
A Situ…somewhere in the
near future
n  A virus (not the ‘I love you’ virus) creates an epidemic
n  Geneticists and bioinformaticians roll up their sleeves
n  The genetic material of the virus is compared with the existing base of known genetic material from other viruses, whose characteristics are already known
n  From the genetic material, computer programs derive the proteins the virus needs to survive
n  Once a protein's sequence and structure are known, medicines can be designed against it
What is Bioinformatics?
n  The marriage between computer science and molecular biology
l  The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists
n  ‘Information technology applied to the management and analysis of biological data’
l  Storage and analysis are two of the important functions; bioinformaticians build tools for each
[Figure: Bioinformatics at the overlap of Biology, Chemistry, Statistics and Computer Science]
What is..
n  This is the age of information technology
n  However, storing information is nothing new
n  Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
n  ‘Bioinformatics tries to determine what info is biologically important’
Basics of Molecular Biology….
DNA & Genes
n  DNA is where the genetic information is stored
n  Traits such as blonde hair and blue eyes are inherited through it
n  Gene - the basic unit of heredity
l  There are genes for characteristics, e.g. a gene for blonde hair
n  Genes carry their information as a sequence of nucleotides
n  Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately from the DNA they lie on
n  Genes are made up of nucleotides
Nucleotide (nt)
n  Each nt is made up of
l  Sugar
l  Phosphate group
l  Base
n  The base is the only difference between one nt and another
n  There are 4 different bases
n  G(uanine), A(denine), T(hymine), C(ytosine)
n  The information lies in the order of the nucleotides: the order is the info
n  Genes can be many thousands of nt long
n  The complete set of genetic instructions is called the genome
Chromosomes
n  DNA strands make up chromosomes
n  Analogy
l  Letters - nt
l  Sentences - genes
l  Individual volumes of the Encyclopaedia Britannica - chromosomes
l  All volumes together - genome
Double Helix
n  The DNA is a double helix
n  Each strand has complementary
information
n  Each base in one strand is bonded to a particular partner base in the other strand
l  G - C
l  A - T
n  For example -
l  AATGC one strand
l  TTACG other strand
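
The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the names PAIRS, complement and reverse_complement are made up for this sketch and are not part of the original slides.

```python
# Pairing rule from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base partner strand for a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Partner strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("AATGC"))  # TTACG, as on the slide
```

reverse_complement is included because the two strands of a helix are conventionally read in opposite directions; the slide itself only shows the base-by-base pairing.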
Proteins
n  Proteins are a very important class of biological molecules
n  Amino acids make up proteins
n  There are 20 different amino acids
n  The function of a protein depends on the order of its amino acids
Proteins…
n  The information required to assemble amino acids (aa) into proteins is stored in DNA
n  DNA sequence determines amino acid sequence
n  Amino acid sequence determines protein structure
n  Protein structure determines protein function
n  A substance called RNA carries the info stored in the DNA, which in turn is used to make proteins
n  Storage - DNA
n  Information transfer - RNA
n  RNA is the messenger boy!
Central dogma
DNA --transcription (RNA polymerase)--> RNA --translation (ribosomes)--> Protein
Proteins…..
n  Since there are 20 amino acids to encode, one nt cannot correspond to one aa (4 bases give only 4 codes), and neither can pairs of nt (4 x 4 = 16 codes)
n  So protein information is carried in triplet codes, called codons (4 x 4 x 4 = 64 codes)
n  The codons that do not correspond to an amino acid are stop codons - UAA, UAG, UGA (RNA has U instead of T)
n  Some codons are used as start codons - AUG marks the start and also codes for methionine
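
A minimal sketch of reading codons, assuming a deliberately tiny codon table. Only AUG and the three stop codons come from the slide; the other entries, the function names and the demo sequence are illustrative choices, and the real genetic code has 64 entries.

```python
# Partial codon table (assumption: just enough entries for the demo below).
CODON_TABLE = {
    "AUG": "Met",  # start codon, also codes for methionine
    "UUU": "Phe", "GGC": "Gly", "GCA": "Ala", "AAA": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: same letters, with U instead of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read triplets from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")  # ??? = not in demo table
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("CCATGTTTGGCTAAGG")
print(mrna)             # CCAUGUUUGGCUAAGG
print(translate(mrna))  # ['Met', 'Phe', 'Gly']
```

The counting argument on the slide is the same arithmetic as 4**1, 4**2 and 4**3 in Python: only the triplet gives at least 20 distinct codes.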
M.Alroy Mascrenghe 18
Protein Structure
n  Protein structure shows wide variety, as opposed to DNA, whose structure is uniform
n  X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to determine the structure
n  Structure is related to function, or rather structure determines function
n  Although proteins are created as a linear chain of aa, they fold into a 3D structure
n  If you stretch them and let go, they return to this structure - this is the native structure of a protein
n  Only in its native structure does a protein function well
n  Even after translation is over, a protein goes through some changes to its structure
Gene Expression
n  Gene expression - the process of transcribing DNA and translating the resulting RNA to make protein
n  Where do the genes begin in a chromosome?
n  How does the transcription machinery (RNA polymerase) identify the beginning of a gene to make a protein?
n  A single nt cannot mark the beginning of a gene, as each nt occurs too frequently
n  But a particular combination of nucleotides can
n  Promoter sequences - orders of nt which mark the beginning of a gene
Bioinformatics Techniques…..
Prediction and Pattern Recognition
n  The two main areas of bioinformatics are
n  Pattern recognition
l  Recognizing that ‘a particular sequence or structure has been seen before’ and that a particular characteristic can be associated with it
n  Prediction
l  From a sequence (what we know) we can predict the structure and function (what we don’t know)
Dot plots….
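n  A simple way of evaluating similarity between two sequences
n  One sequence is laid along one axis of a grid and the other sequence along the other axis
n  Wherever the characters of the two sequences match, the grid is marked
n  Runs of marks along a diagonal show stretches where the two sequences agree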
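
A dot plot is easy to sketch as text. In the snippet below the function name dot_plot and the '*' and '.' markers are arbitrary choices for illustration; the two sequences are the ones used in the alignment slides.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """One row per character of seq_b, one column per character of seq_a.

    '*' marks a matching pair of characters, '.' a non-matching pair.
    """
    return ["".join("*" if a == b else "." for a in seq_a) for b in seq_b]


for row in dot_plot("TTACTATA", "TAGATA"):
    print(row)
```

Real dot-plot tools usually filter the grid with a sliding window to suppress chance single-character matches; this sketch marks every match.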
Alignments
n  A pairing of the characters of two or more sequences, used to judge their similarity
n  E.g.
l  TTACTATA
l  TAGATA
n  There are many ways to align the above two sequences
l  1.
n  TTACTATA
n  TAGATA
l  2.
n  TTACTATA
n  TAGATA
l  3.
n  TTACTATA
n  TAGATA
n  So which one do we choose, and on what basis?
n  The solution is to assign a match score and a mismatch score, and choose the alignment with the best total
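
As a rough illustration of scoring candidate alignments, the sketch below slides the shorter sequence along the longer one without gaps and totals a match score and a mismatch score at each offset. The values +1 and -1 and the name ungapped_score are assumptions for this example, not values taken from the slides.

```python
def ungapped_score(long_seq: str, short_seq: str, offset: int,
                   match: int = 1, mismatch: int = -1) -> int:
    """Score short_seq laid against long_seq starting at `offset`, with no gaps."""
    window = long_seq[offset:offset + len(short_seq)]
    return sum(match if a == b else mismatch for a, b in zip(window, short_seq))


a, b = "TTACTATA", "TAGATA"
scores = {off: ungapped_score(a, b, off) for off in range(len(a) - len(b) + 1)}
print(scores)  # {0: 0, 1: -2, 2: 0}
```

With these scores, offsets 0 and 2 tie and offset 1 is clearly worse, which shows that the choice of scoring scheme decides which alignment is preferred.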
Gaps
n  Introduce gaps and a penalty score for gaps
n  TTACTATA
n  T_A_GATA
n  In gap scoring, a single indel two characters long is preferred to two indels each one character long
n  However, not all gaps are bad
l  TTGCAATCT
l  CAA
l  How do we align?
l  ---CAA---
l  These end gaps are not biologically significant, so they should not be penalized
l  This is a semi-global alignment
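
One common way to prefer a single long indel over several short ones is an affine gap penalty: a fixed cost for opening a gap plus a smaller cost per gapped character. The sketch below assumes an opening cost of 2 and an extension cost of 1; both numbers and the name gap_cost are illustrative only.

```python
def gap_cost(length: int, gap_open: int = 2, gap_extend: int = 1) -> int:
    """Affine penalty for one indel of `length` characters (0 if there is no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_indel = gap_cost(2)        # 2 + 1*2 = 4
two_short_indels = 2 * gap_cost(1)  # 2 * (2 + 1*1) = 6
print(one_long_indel, two_short_indels)  # 4 6
```

Because the opening cost is paid once per indel, the single two-character indel is penalized less than two separate one-character indels.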
Scoring Matrix
n  For DNA/protein sequence alignment we create a matrix that gives a score for every pair of characters
n  A aligned with A scores 1
n  A aligned with T scores -5
n  A aligned with C scores -1
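
A scoring matrix can be held as a dictionary keyed by pairs of characters. In the sketch below only the three A-row values come from the slide; filling every other match with 1, every other mismatch with -1, and making the matrix symmetric are assumptions made so that the example is complete.

```python
BASES = "ACGT"

# Defaults (assumption): every match scores 1, every mismatch scores -1.
MATRIX = {(x, y): (1 if x == y else -1) for x in BASES for y in BASES}
# The A/T value given on the slide, applied symmetrically (assumption).
MATRIX[("A", "T")] = MATRIX[("T", "A")] = -5


def score_columns(seq_a: str, seq_b: str) -> int:
    """Total matrix score over the columns of two equal-length, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(MATRIX[(x, y)] for x, y in zip(seq_a, seq_b))


print(score_columns("AAC", "ATA"))  # 1 + (-5) + (-1) = -5
```

Protein alignments use the same idea with a 20 x 20 matrix.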
Dynamic Programming
n  As the query sequences get longer, and as the difference in length between the two sequences grows, more gaps have to be inserted in various places
n  We cannot perform an exhaustive search
n  Combinatorial explosion occurs - too many combinations to search
n  Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, so each subproblem is solved only once
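
A compact sketch of dynamic programming for global alignment, in the style usually attributed to Needleman and Wunsch. The scores (match 1, mismatch -1, gap -1) and the function name are assumptions for illustration; the function returns only the best score, not the alignment itself.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    return table[-1][-1]


print(global_alignment_score("TTACTATA", "TAGATA"))  # 2
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled once, so the work grows with the product of the two lengths rather than with the number of possible alignments.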
Databases
n  Sequence info is stored in databases
n  So that it can be manipulated easily
n  The db (next slide) are located at different places
n  They exchange info on a daily basis so that they are up to date and in sync
n  Primary db - sequence data
Major Primary DB

Nucleic Acid      Protein
EMBL (Europe)     PIR - Protein Information Resource
GenBank (USA)     MIPS
DDBJ (Japan)      SWISS-PROT (University of Geneva, now with EBI)
                  TrEMBL (a supplement to SWISS-PROT)
                  NRL-3D
Composite DB
n  As there are many db, which one should we search? Some are good in some aspects and weak in others
n  A composite db is the answer - it has several db as its base data
n  Searches on a composite db are indexed and streamlined so that the same stored sequence is not searched twice in different db
Composite DB
n  OWL has these as its primary db
l  SWISS PROT (top priority)
l  PIR
l  GenBank
l  NRL-3D
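
The idea of a non-redundant composite with a priority order can be sketched as follows. This is a toy model only, not a description of how OWL is actually built; the function name, the entry identifiers and the short sequences are all made up for the example.

```python
def build_composite(sources):
    """Merge source databases given in priority order.

    sources: list of (db_name, {entry_id: sequence}) pairs, highest priority first.
    Returns {sequence: (db_name, entry_id)}, so each distinct sequence is stored
    once and attributed to the highest-priority source that contains it.
    """
    composite = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            composite.setdefault(sequence, (db_name, entry_id))
    return composite


# Hypothetical toy data; the priority order follows the slide.
sources = [
    ("SWISS-PROT", {"sp1": "MKTAY", "sp2": "MGLSD"}),
    ("PIR", {"pir1": "MKTAY", "pir2": "MVLSP"}),
]
composite = build_composite(sources)
print(len(composite))      # 3 distinct sequences
print(composite["MKTAY"])  # ('SWISS-PROT', 'sp1')
```

A search over the composite then visits the duplicated sequence once instead of once per source database.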
Secondary db
n  Store secondary structure info or results of searches of the primary db

Secondary DB   Primary Source
PROSITE        SWISS-PROT
PRINTS         OWL
Database Searches
n  Many genes have been sequenced and identified, so we know what they do
n  Their sequences are stored in databases
n  So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases
n  Since the databases hold a large number of sequences, we cannot do a full sequence alignment against each and every one
n  So heuristics must be used
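
One widely used kind of heuristic is to index short words (k-mers) of every database sequence and fully align only the sequences that share enough words with the query. The sketch below illustrates that seeding idea only; it is not the algorithm of any particular search tool, and the function names, the word length k=3 and the threshold min_shared=2 are arbitrary assumptions.

```python
from collections import defaultdict


def build_kmer_index(database: dict, k: int = 3) -> dict:
    """Map every k-mer to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidates(query: str, index: dict, k: int = 3, min_shared: int = 2) -> list:
    """Names of sequences sharing at least min_shared distinct k-mers with the query."""
    hits = defaultdict(int)
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for name in index.get(kmer, ()):
            hits[name] += 1
    return sorted(name for name, count in hits.items() if count >= min_shared)


# Hypothetical toy database.
database = {"s1": "TTACTATA", "s2": "GGGGCCCC", "s3": "CTATAGG"}
index = build_kmer_index(database)
print(candidates("ACTATA", index))  # ['s1', 's3']
```

Only the candidates returned here would go on to a full alignment, which is where the saving in time comes from; the price is that a distant relative sharing no exact word with the query can be missed.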
Areas in Bioinformatics…
Genomics
n  In a multicellular organism, each cell type carries out gene expression in a different way, although every cell has the same genetic content
n  I.e. all the information a liver cell needs to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them
Genomics - Finding Genes
n  A gene in sequence data - a needle in a haystack
n  However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data
n  In a whole array of nt, we try to find a set of nt and mark its borders as a gene
n  This is one of the challenges of bioinformatics
n  Neural networks and dynamic programming are being employed
Organism                Genome size (Mb = bp * 1,000,000)   Gene number   Web site
Yeast (Saccharomyces)   13.5                                 6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit flies             180                                  13,601        http://flybase.bio.indiana.edu
Homo sapiens            3,000                                45,000        http://www.ncbi.nlm.nih.gov/genome/
Proteomics
n  The proteome is the sum total of an organism's proteins
n  More difficult than genomics
l  DNA has 4 building blocks; proteins have 20
l  DNA has a simple chemical makeup; proteins are complex
l  DNA can be duplicated; proteins can't
n  We are entering the ‘post genome era’
n  Meaning much has been done with the genes - not that it is over
Proteomics…..
n  The relationship between an RNA and the protein it codes for is usually far from direct
n  After translation, proteins do change
l  So the aa sequence does not tell us anything about the post-translational changes
n  Proteins are not active until they are combined into a larger complex or moved to the relevant location inside or outside the cell
n  So the aa sequence only hints at these things
n  Also, proteins must be handled more carefully in labs, as they tend to change on contact with inappropriate materials
Protein Structure Prediction
n  One of the biggest challenges of bioinformatics, and especially of biochemistry
n  There is currently no algorithm that consistently predicts the structure of proteins
Structure Prediction methods
n  Comparative modeling
l  The target protein's structure is modeled by comparison with related proteins
l  Proteins with similar sequences and known structures are searched for
Phylogenetics
n  The taxonomical system reflects evolutionary relationships
n  Phylogenetic trees show evolutionary relationships as a picture/graph
n  Rooted trees - there is a single common ancestor
n  Unrooted trees - show only the relationships
n  Phylogenetic tree reconstruction algorithms are also an area of research
Applications….
Medical Implications
n  Pharmacogenomics
l  Not all drugs work on all patients; some good drugs cause death in some patients
l  So by doing a gene analysis before the treatment, the harmful drugs can be avoided
l  Also, drugs which cause death in most patients can be used on the minority whose genes are well suited to them - volunteers wanted!
l  Customized treatment
n  Gene Therapy
l  Replace or supply the defective or missing gene
l  E.g. insulin, and Factor VIII for haemophilia
n  BioWeapons (??)
Diagnosis of Disease
n  Diagnosis of disease
l  Identifying the genes which cause a disease will help detect the disease at an early stage, e.g. Huntington's disease
n  Symptoms - uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
n  Death in 10-15 years
n  The gene responsible for the disease has been identified
n  It contains excessively repeated sections of CAG
n  So once a couple has been analyzed, they can be counseled
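
Counting repeated CAG sections is a simple pattern-matching task. The sketch below reports the longest uninterrupted run of CAG in a sequence; the function name and the demo sequence are made up, and no clinical threshold is implied.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Length, in repeat units, of the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


# Hypothetical fragment: three CAG units, a break, then two more.
print(longest_cag_run("ATGCAGCAGCAGTTCAGCAG"))  # 3
```

In practice the count would be taken within the relevant gene region and compared against clinically established ranges, which are outside the scope of this sketch.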
Drug Design
n  Developing a drug can take up to 15 years and $700 million
n  One of the goals of bioinformatics is to reduce the time and cost involved
n  The process
l  Discovery
n  Computational methods can improve this
l  Testing
Discovery
Target identification
l  Identify the molecule on which the germ relies for its survival
l  Then develop another molecule, i.e. a drug, which will bind to the target
l  So the germ will no longer be able to interact with the target
l  Proteins are the most common targets
Discovery…
n  For example, HIV produces HIV protease, a protein which in turn cuts other proteins
n  HIV protease has an active site where it binds to other molecules
n  So an HIV drug will go and bind to that active site
l  Easier said than done!
Discovery…
n  Lead compounds are the
molecules that go and bind to
the target protein’s active site
n  Traditionally this has been a trial
and error method
n  Now this is being moved into the
realm of computers
Related Computer Technology………….
PERL
n  Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
n  The default CGI language
n  It started out as a scripting language but has become a fully fledged language
n  It has everything now, even web service support
n  http://bio.perl.org
The place of XML & Web
Services
n  Various markup languages are being created (Gene Markup Language etc.) to represent sequence/gene data
n  Web Services - program-to-program interaction, making the web application-centric as opposed to human-centric
n  So this has to be platform and language independent
n  Protocols like SOAP help in this regard
n  In bioinformatics, various databases, platforms, languages etc. are in use
n  So web services help achieve platform independence and program interaction
n  Since sequence databases come in various formats and on various platforms, SOAP also helps in this regard
The place of GRID
n  GRID - new kid on the block
n  Using many computers to fulfill a single computational task
n  Bioinformatics is an ideal application, as it has to deal with large amounts of data in alignments and searches
n  E-science initiative in the UK
n  ORACLE 10g - promoted as the world's first GRID database
Data bases and Mining
n  A lot of the sequence databases are publicly available
n  As there is a DB involved, various data mining techniques are used to pull the data out
n  As there is a lot of literature (articles etc.) in this area, data mining on the literature, rather than on the sequence data, has also become a PhD topic for many
European Molecular Biology
Network (EMBnet)
n  A central system for sharing and centralizing up-to-date bio info, and for training
n  Some of the EMBnet sites are:
n  SEQNET
l  http://www.seqnet.dl.ac.uk
n  UCL
l  http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
n  EBI - European Bioinformatics Institute
l  www.ebi.ac.uk
References
n  Dan E. Krane and Michael L. Raymer
l  Basic Concepts of Bioinformatics
n  Arthur M Lesk
l  Intro to Bioinformatics
n  T.K. Attwood & D. J. Parry-Smith
l  Intro to Bioinformatics
n  The genetic Revolution
l  Dr Patrick Dixon
n  Prof David Gilbert’s Site
l  http://www.brc.dcs.gla.ac.uk/~drg/
Thank You!

Bioinformatics

  • 2. M.Alroy Mascrenghe 2 2000 n  A Major event happened that was to change the course of human history n  It was a joint British and American effort n  nothing to do with IRAQ! n  It was a race – who will complete first n  Race Test – not whether they have taken drugs but whether they can produce them! n  Human genome was sequenced
  • 3. M.Alroy Mascrenghe 3 A Situ…somewhere in the near future n  A virus –not ‘I love you’ virus- creates an epidemic n  Geneticists and bioinformaticians role on their sleeves n  Genetic material of the virus is compared with the existing base of known genetic material of other viruses n  As the characteristics of the other viruses are known n  From genetic material computer programs will derive the proteins necessary for the survival of the virus n  When the protein (sequence and structure) is known then medicines can be designed
  • 4. M.Alroy Mascrenghe 4 What is n  The marriage between computer science and molecular biology l  The algorithm and techniques of computer science are being used to solve the problems faced by molecular biologists n  ‘Information technology applied to the management and analysis of biological data’ l  Storage and Analysis are two of the important functions – bioinformaticians build tools for each
  • 5. M.Alroy Mascrenghe 5 Biology Chemistry Statistics Computer Science Bioinformatics
  • 6. M.Alroy Mascrenghe 6 What is.. n  This is the age of the Information Technology n  However storing info is nothing new n  Information to the volume of Britannica Encyclopedia is stored in each of our cells n  ‘Bioinformatics tries to determine what info is biologically important’
  • 8. M.Alroy Mascrenghe 8 DNA & Genes n  DNA is where the genetic information is stored n  Blonde hair and blue eyes are inherited by this n  Gene - The basic unit of heredity l  There are genes for characteristics i.e. a gene for blond hair etc n  Genes contain the information as a sequence of nucleotides n  Genes are abstract concepts – like longitude and latitudes in the sense that you cannot see them separately n  Genes are made up of nucleotides
  • 10. M.Alroy Mascrenghe 10 Nucleotide (nt) n  Each nt I made up of l  Sugar l  Phospate group l  Base n  The base it (nt) contains makes the only difference between one nt and the other n  There are 4 different bases n  G(uanine),A(denine),T(hymine),C(ytosine) n  The information is in the order of nucleotide and the order is the info n  Genes can be many thousands of nt long n  The complete set of genetic instructions is called genomes
  • 11. M.Alroy Mascrenghe 11 Chromosomes n  DNA strings make chromosomes n  Analogy l  Letters - nt l  Sentences – genes l  Individual volumes of Britannica encyclopedia – chromosomes l  All voles together - Genome
  • 12. M.Alroy Mascrenghe 12 Double Helix n  The DNA is a double helix n  Each strand has complementary information n  Each particular base in one strand is bonded with another particular base in the next strand l  G - C l  A - T n  For example - l  AATGC one strand l  TTACG other strand
  • 13. M.Alroy Mascrenghe 13 Proteins n  Proteins are very important biological feature n  Amino Acids make up the proteins n  20 different amino acids are there n  The function of a protein is dependant on the order of the amino acids
  • 14. M.Alroy Mascrenghe 14 Proteins… n  The information required to make aa is stored in DNA n  DNA sequence determines amino acid sequence n  Amino Acid sequence determines protein structure n  Protein structure determines protein function n  A Substance called RNA is used to carry the Info stored in the DNA that in turn is used to make proteins n  Storage - DNA n  Information Transfer – RNA n  RNA is the message boy!
  • 15. M.Alroy Mascrenghe 15 Central dogma DNA transcription RNA Translation Protein RNA Polymerase Ribosomes
  • 17. M.Alroy Mascrenghe 17 Proteins….. n  Since there are 20 amino acids to translate one nt cannot correspond to one aa, neither can it correspond as twos n  So in triplet codes – codon – protein information is carried n  The codons that do not correspond to a protein are stop codons – UAA, UAG, UGA (RNA has U instead of T) n  Some codons are used as start codons - AUG as well as to code methionine
  • 18. M.Alroy Mascrenghe 18 Protein Structure n  Shows a wide variety as opposed to the DNA whose structure is uniform n  X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to figure out the structure n  Structure is related to the function or rather structure determines the function n  Although proteins are created as a linear structure of aa chain they fold into 3 d structure. n  If you stretch them and leave them they will go back to this structure – this is the native structure of a protein n  Only in the native structure the proteins functions well n  Even after the translation is over protein goes through some changes to its structure
  • 19. M.Alroy Mascrenghe 19 Gene Expression n  Gene Expression – the process of Transcripting a DNA and translating a RNA to make protein n  Where do the genes begin in a chromosome? n  How does the RNA identify the beginning of a gene to make a protein n  A single nt cannot be taken to point out the beginning of a gene as they occur frequently n  But a particular combination of a nucleotide can be n  Promoter sequences – the order of nt which mark the beginning of a gene
  • 21. M.Alroy Mascrenghe 21 Prediction and Pattern Recognition n  The two main areas of bioinformatics are n  Pattern recognition l  ‘A particular sequence or structure has been seen before’ and that a particular characteristic can be associated with it n  Prediction l  From a sequence (what we know) we can predict the structure and function (what we don’t know)
  • 22. M.Alroy Mascrenghe 22 Dot plots…. n  Simple way of evaluating similarity between two sequences n  In a graph one sequence is on one side the next on the other side n  Where there are matches between the two sequences the graph is marked
  • 24. M.Alroy Mascrenghe 24 Alignments n  A match for similarity between the characters of two or more sequences n  Eg. l  TTACTATA l  TAGATA n  There are so many ways to align the above two sequences l  1. n  TTACTATA n  TAGATA l  2. n  TTACTATA n  TAGATA l  3. n  TTACTATA n  TAGATA n  So which one do we choose and on what basis? n  Solution is to Provide a match score and mismatch score
  • 25. M.Alroy Mascrenghe 25 Gaps n  Introduce gaps and a penalty score for gaps n  TTACTATA n  T_A_GATA n  In gap scores a single indel which is two characters long is preferred to two indels which are each one character long n  However not all gaps are bad l  TTGCAATCT l  CAA l  How do we align? l  ---CAA--- l  These gaps are not biologically significant l  Semi Global Alignments
  • 26. M.Alroy Mascrenghe 26 Scoring Matrix n  For DNA/protein sequence alignment we create a matrix n  If A and A score is 1 n  If A and T score is -5 n  If A and C score is -1
  • 27. M.Alroy Mascrenghe 27 Dynamic Programming n  As the length of the query sequences increase and the difference of length between the two sequence also increases –more gaps has to be inserted in various places n  We cannot perform an exhaustive search n  Combinatorial explosion occurs – too much combinations to search for n  Dynamic programming is a way of using heuristics to search in the most promising path
  • 28. M.Alroy Mascrenghe 28 Databases n  Sequence info is stored in databases n  So that they can be manipulated easily n  The db (next slide) are located at diff places n  They exchange info on a daily basis so that they are up-to-date and are in sync n  Primary db – sequence data
  • 29. Major Primary DB Nucleic Acid Protein EMBL (Europe) PIR - Protein Information Resource GenBank (USA) MIPS DDBJ (Japan) SWISS-PROT University of Geneva, now with EBI TrEMBL A supplement to SWISS- PROT NRL-3D
  • 30. M.Alroy Mascrenghe 30 Composite DB n  As there are many db which one to search? Some are good in some aspects and weak in others? n  Composite db is the answer – which has several db for its base data n  Search on these db is indexed and streamlined so that the same stored sequence is not searched twice in different db
  • 31. M.Alroy Mascrenghe 31 Composite DB n  OWL has these as their primary db l  SWISS PROT (top priority) l  PIR l  GenBank l  NRL-3D
  • 32. M.Alroy Mascrenghe 32 Secondary db n  Store secondary structure info or results of searches of the primary db Compo DB Primary Source PROSITE SWISS-PROT PRINTS OWL
  • 33. M.Alroy Mascrenghe 33 Database Searches n  We have sequenced and identified genes. So we know what they do n  The sequences are stored in databases n  So if we find a new gene in the human genome we compare it with the already found genes which are stored in the databases. n  Since there are large number of databases we cannot do sequence alignment for each and every sequence n  So heuristics must be used again.
  • 34. M.Alroy Mascrenghe 34 Areas in Bioinformatics…
  • 35. M.Alroy Mascrenghe 35 Genomics n  Because of the multicellular structure, each cell type does gene expression in a different way –although each cell has the same content as far as the genetic n  i.e. All the information for a liver cell to be a liver cell is also present on nose cell, so gene expression is the only thing that differentiates
  • 36. M.Alroy Mascrenghe 36 Genomics - Finding Genes n  Gene in sequence data – needle in a haystack n  However as the needle is different from the haystack genes are not diff from the rest of the sequence data n  Is whole array of nt we try to find and border mark a set o nt as a gene n  This is one of the challenges of bioinformatics n  Neural networks and dynamic programming are being employed
  • 37.
      Organism     | Genome Size (Mb; 1 Mb = 1,000,000 bp) | Gene Number | Web Site
      Yeast        | 13.5                                  | 6,241       | http://genome-www.stanford.edu/Saccharomyces
      Fruit Flies  | 180                                   | 13,601      | http://flybase.bio.indiana.edu
      Homo Sapiens | 3,000                                 | 45,000      | http://www.ncbi.nlm.nih.gov/genome/
  • 38. Proteomics
      • The proteome is the sum total of an organism's proteins.
      • Proteomics is more difficult than genomics:
          • DNA has an alphabet of 4 nucleotides; proteins have an alphabet of 20 amino acids.
          • DNA has a simple chemical makeup; proteins are complex.
          • DNA can be duplicated; proteins cannot.
      • We are entering the 'post-genome era'.
      • This means that much has been done with genes, not that the work is over.
  • 39. Proteomics…
      • An RNA and the protein it codes for are usually very different, so the relationship between them is not direct.
      • Proteins change after translation.
          • So the amino acid (aa) sequence tells us nothing about post-translational changes.
      • Many proteins are not active until they are combined into a larger complex or moved to the relevant location inside or outside the cell.
      • So the aa sequence only hints at these things.
      • Proteins must also be handled more carefully in the lab, as they tend to change when they come into contact with an inappropriate material.
  • 40. Protein Structure Prediction
      • This is one of the biggest challenges of bioinformatics, and especially of biochemistry.
      • There is currently no algorithm that consistently predicts the structure of proteins.
  • 41. Structure Prediction methods
      • Comparative Modeling
          • The target protein's structure is modeled by comparison with related proteins.
          • Proteins with similar sequences are searched for known structures.
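The first step of comparative modeling, choosing a related protein of known structure, can be sketched as follows. The example uses Python's standard `difflib.SequenceMatcher` ratio as a rough stand-in for sequence similarity; a real pipeline would use a proper sequence alignment, and the template identifiers and sequences here are invented for illustration.

```python
from difflib import SequenceMatcher


def pick_template(target, known_structures):
    """Pick the known-structure protein whose sequence is most similar to the target.

    `known_structures` maps a (hypothetical) structure id to its amino acid
    sequence. The similarity is difflib's ratio in [0, 1], a crude
    approximation of what an alignment-based identity score would give.
    """
    scored = [
        (SequenceMatcher(None, target, seq, autojunk=False).ratio(), struct_id)
        for struct_id, seq in known_structures.items()
    ]
    best_score, best_id = max(scored)
    return best_id, best_score


if __name__ == "__main__":
    templates = {"tmplA": "MKTAYIAKQK", "tmplB": "GGGGGGGGGG"}
    print(pick_template("MKTAYIAKQR", templates))
```

The structure of the chosen template would then serve as the starting point for building a model of the target.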
  • 42. Phylogenetics
      • The taxonomical system reflects evolutionary relationships.
      • Phylogenetic trees depict evolutionary relationships as a picture or graph.
      • Rooted trees have a single common ancestor.
      • Unrooted trees just show the relationships.
      • Phylogenetic tree reconstruction algorithms are also an area of research.
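As one example of a tree reconstruction algorithm, the sketch below implements UPGMA, a simple clustering method that repeatedly joins the two closest groups and produces a rooted tree. UPGMA is chosen here only because it is short; the slide does not name a specific algorithm. The species names and distances are made-up toy values, and branch lengths are omitted to keep the code small.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Build a rooted tree (as nested tuples) from a pairwise distance matrix.

    `labels` is a list of names and `matrix[i][j]` the distance between
    labels[i] and labels[j]. At each step the two closest clusters are merged,
    and distances to the new cluster are size-weighted averages.
    """
    trees = {i: label for i, label in enumerate(labels)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                          # cluster id -> leaf count
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(trees) > 1:
        a, b = min(dist, key=dist.get)                     # closest pair (a < b)
        merged_size = sizes[a] + sizes[b]
        for c in trees:
            if c in (a, b):
                continue
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            dist[(c, next_id)] = (sizes[a] * d_ac + sizes[b] * d_bc) / merged_size
        trees[next_id] = (trees[a], trees[b])
        sizes[next_id] = merged_size
        for old in (a, b):
            del trees[old], sizes[old]
        # Drop every distance that still refers to a merged cluster.
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        next_id += 1
    return next(iter(trees.values()))


if __name__ == "__main__":
    names = ["human", "chimp", "mouse"]
    distances = [[0, 2, 8],
                 [2, 0, 8],
                 [8, 8, 0]]
    print(upgma(names, distances))
```

UPGMA assumes that all lineages change at the same rate, which is one reason other reconstruction methods remain an active research topic.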
  • 44. Medical Implications
      • Pharmacogenomics
          • Not all drugs work on all patients; some good drugs cause death in some patients.
          • So by doing a gene analysis before treatment, the harmful drugs can be avoided.
          • Also, drugs that cause death in most patients can be used on the minority whose genes are well suited to them (volunteers wanted!).
          • Customized treatment
      • Gene Therapy
          • Replace or supply the defective or missing gene.
          • E.g. insulin, and Factor VIII for haemophilia
      • BioWeapons (??)
  • 45. Diagnosis of Disease
      • Identifying the genes that cause a disease helps detect it at an early stage, e.g. Huntington disease:
          • Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
          • Death within 10-15 years
          • The gene responsible for the disease has been identified.
          • It contains excessively repeated sections of CAG.
          • So once the gene has been analyzed, the couple can be counseled.
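Counting repeated CAG sections is a simple string problem. The Python sketch below finds the longest uninterrupted run of CAG in a DNA string; the example sequence is invented, and the code says nothing about how many repeats are clinically significant.

```python
import re


def longest_cag_run(dna):
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


if __name__ == "__main__":
    # Invented fragment: five CAG units in a row, then an interrupted sixth.
    fragment = "GCC" + "CAG" * 5 + "CAACAG" + "CCG"
    print(longest_cag_run(fragment))
```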
  • 46. Drug Design
      • Developing a drug can take up to 15 years and cost $700 million.
      • One of the goals of bioinformatics is to reduce the time and cost involved.
      • The process
          • Discovery: computational methods can improve this.
          • Testing
  • 47. Discovery
      • Target identification
          • Identify the molecule on which the germ relies for its survival.
          • Then develop another molecule, i.e. a drug, that will bind to the target.
          • The germ will then be unable to interact with the target.
          • Proteins are the most common targets.
  • 48. Discovery…
      • For example, HIV produces HIV protease, a protein that in turn breaks down other proteins.
      • HIV protease has an active site where it binds to other molecules.
      • So an HIV drug will bind to that active site.
          • Easier said than done!
  • 49. Discovery…
      • Lead compounds are the molecules that bind to the target protein's active site.
      • Traditionally, finding them has been a matter of trial and error.
      • Now this work is being moved into the realm of computers.
  • 50. Related Computer Technology…
  • 51. PERL
      • Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings.
      • The default CGI language
      • It started out as a scripting language but has become a fully fledged language.
      • It has everything now, even web service support.
      • http://bio.perl.org
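The kind of character manipulation meant here is easy to show with two everyday sequence operations. The sketch is written in Python rather than Perl, purely to keep all examples in these notes in one language; the function names are illustrative and do not come from any particular library.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


if __name__ == "__main__":
    print(reverse_complement("ATGCGT"))
    print(gc_content("ATGC"))
```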
  • 52. The place of XML & Web Services
      • Various markup languages, such as Gene Markup Language, are being created to represent sequence and gene data.
      • Web services provide program-to-program interaction, making the web application centric as opposed to human centric.
      • So they have to be platform and language independent.
      • Protocols like SOAP help in this regard.
      • Bioinformatics uses many different databases, platforms and languages.
      • So web services help achieve platform independence and program interaction.
      • Since sequence databases come in various formats and on various platforms, SOAP also helps in this regard.
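To show why a markup representation helps programs exchange sequence data, here is a small Python sketch that parses an XML record with the standard library. The element and attribute names (`sequence_record`, `organism`, `sequence`) are a made-up schema for illustration only; they are not taken from any real gene markup language.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the schema is invented for this example.
RECORD = """<sequence_record id="demo1" source="exampleDB">
  <organism>Saccharomyces cerevisiae</organism>
  <sequence type="dna">ATGGCGTACGTT</sequence>
</sequence_record>"""


def parse_record(xml_text):
    """Turn one XML sequence record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    seq = root.find("sequence")
    return {
        "id": root.get("id"),
        "source": root.get("source"),
        "organism": root.findtext("organism"),
        "type": seq.get("type"),
        "sequence": seq.text.strip(),
    }


if __name__ == "__main__":
    print(parse_record(RECORD))
```

Because the record describes itself, any program on any platform with an XML parser can read it, which is the property that web services rely on.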
  • 53. The place of GRID
      • GRID: the new kid on the block
      • Uses many computers to fulfill a single computational task.
      • Bioinformatics is an ideal application, as it has to deal with large amounts of data in alignments and searches.
      • E-science initiative in the UK
      • ORACLE 10g, marketed as the world's first GRID database
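The principle of splitting one task over many machines can be imitated on a single computer. In the Python sketch below, a pool of worker threads stands in for grid nodes: the sequence list is cut into chunks, each worker counts motif matches in its chunk, and the partial results are added up. The function names, the motif and the toy sequences are assumptions for illustration; a real grid adds scheduling, data transfer and fault tolerance.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(chunk, motif):
    """Work done by one 'node': count the sequences that contain the motif."""
    return sum(motif in seq for seq in chunk)


def split(items, n_chunks):
    """Deal the items out into n_chunks roughly equal parts."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def distributed_count(sequences, motif, workers=4):
    """Split the search across workers and combine the partial counts."""
    chunks = split(sequences, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_matches, chunks, [motif] * len(chunks))
        return sum(partial_counts)


if __name__ == "__main__":
    sequences = ["ATGCAG", "TTTT", "CAGCAG", "GGGG", "ACAGT"]
    print(distributed_count(sequences, "CAG"))
```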
  • 54. Databases and Mining
      • Many of the sequence databases are publicly available.
      • As databases are involved, various data mining techniques are used to pull the data out.
      • As there is a lot of literature (articles etc.) in this area, data mining of the literature, rather than of the sequence data, has also become a PhD topic for many.
  • 55. European Molecular Biology Network (EMBnet)
      • A central network for sharing and centralizing up-to-date biological information and for training
      • Some of the EMBnet sites are:
          • SEQNET: http://www.seqnet.dl.ac.uk
          • UCL: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
          • EBI (European Bioinformatics Institute): www.ebi.ac.uk
  • 56. References
      • Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics
      • Arthur M. Lesk, Introduction to Bioinformatics
      • T. K. Attwood and D. J. Parry-Smith, Introduction to Bioinformatics
      • Dr Patrick Dixon, The Genetic Revolution
      • Prof David Gilbert's site: http://www.brc.dcs.gla.ac.uk/~drg/