Bioinformatics life sciences_v2015

Introduction to bioinformatics and
computational biology

Lab for Bioinformatics and
computational genomics
10 “genome hackers”
mostly engineers (statistics)
42 scientists
technicians, geneticists, clinicians
>100 people
hardware engineers,
mathematicians, molecular biologists

What is Bioinformatics ?
• Application of information technology to
the storage, management and analysis of
biological information (Facilitated by the
use of computers)
– Sequence analysis?
– Molecular modeling (HTX) ?
– Phylogeny/evolution?
– Ecology and population studies?
– Medical informatics?
– Image Analysis ?
– Statistics ? AI ?
– Heavy current or light current (power engineering or electronics) ?

• Medicine (Pharma)
– Genome analysis allows the targeting of genetic
diseases
– The effect of a disease or of a therapeutic on
RNA and protein levels can be elucidated
– Knowledge of protein structure facilitates drug
design
– Understanding of genomic variation allows the
tailoring of medical treatment to the individual’s
genetic make-up
• The same techniques can be applied to crop (Agro)
and livestock improvement (Animal Health)
Promises of genomics and bioinformatics

Math
Informatics
Bioinformatics, a life science discipline … management of expectations
Theoretical Biology
Computational Biology
(Molecular)
Biology
Computer Science
Bioinformatics
Discovery Informatics – Computational Genomics
Interface Design
AI, Image Analysis
structure prediction (HTX)
Sequence Analysis
Expert Annotation
NP
Datamining

• Timeline: Margaret
Dayhoff …

Nature: the human genome
Setting the stage …

Biological Research
Adapted from John McPherson, OICR

And this is just the beginning ….
Next Generation Sequencing is
here

Read Length is Not As Important For Resequencing
Figure: % of paired k-mers with uniquely assignable location (0% to 100%) versus length of k-mer reads (8 to 20 bp), plotted for E. coli and human
Jay Shendure
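
The idea behind the plot above can be sketched in a few lines of Python. This is a simplified illustration, not the analysis behind the figure: it counts single k-mers rather than paired ones, and the function name `unique_kmer_fraction` and the toy sequence are assumptions made for the example.

```python
from collections import Counter


def unique_kmer_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    unique_positions = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique_positions / len(kmers)


# Hypothetical toy sequence containing a repeated "ACGT".
toy = "ACGTACGTTGCA"
for k in (2, 4, 6):
    print(k, round(unique_kmer_fraction(toy, k), 2))
```

On a real genome the same count would be run over every position of the reference; the fraction rises with k until most k-mers are unique.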

Paired End Reads are Important!
Repetitive DNA
Unique DNA
Single read maps to
multiple positions
Paired read maps uniquely
Read 1 Read 2
Known Distance
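
A minimal sketch of why the known distance helps, under simplifying assumptions: exact string matching, both reads on the same strand, and the distance measured between read start positions. The names `find_all` and `place_pair` and the toy genome are hypothetical.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where the read occurs (overlaps allowed)."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def place_pair(genome: str, read1: str, read2: str, distance: int) -> list[tuple[int, int]]:
    """Keep only placements where read2 starts `distance` bases after read1."""
    hits2 = set(find_all(genome, read2))
    return [(p, p + distance) for p in find_all(genome, read1) if p + distance in hits2]


# Toy genome: "ACGT" is repeated, "CATG" is unique.
genome = "ACGTTTTTACGTGGGGGCATG"
print(find_all(genome, "ACGT"))                # read 1 alone has two candidate positions
print(place_pair(genome, "ACGT", "CATG", 9))   # the pair has one consistent placement
```

Real paired-end mappers also handle the opposite-strand orientation of the second read and tolerate a range of insert sizes; both are left out here.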

Single Molecule Sequencing
Helicos Biosciences Corp.
Microscope slide
Single DNA
molecule
dNTP-Cy3
* * *
*
primer
Super-cooled
TIRF microscope
Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh

Next next generation sequencing
Third generation sequencing
Now sequencing

Pacific Biosciences: A Third Generation Sequencing Technology
Eid et al 2008

Ultra-low-cost SINGLE molecule sequencing

Genome Size
DOGS: Database Of Genome Sizes
E. coli = 4.2 x 10^6
Yeast = 18 x 10^6
Arabidopsis = 80 x 10^6
C. elegans = 100 x 10^6
Drosophila = 180 x 10^6
Human/Rat/Mouse = 3000 x 10^6
Lily = 300 000 x 10^6
With ... : 99.9 %
To primates: 99%

Identity
The extent to which two (nucleotide or amino acid)
sequences are invariant.
Homology
Similarity attributed to descent from a common ancestor.
Definitions
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
+ K ++ + + GTW++MA+ L + A V T + +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Orthologous
Homologous sequences in different species
that arose from a common ancestral gene
during speciation; may or may not be responsible
for a similar function.
Paralogous
Homologous sequences within a single species
that arose by gene duplication.
Definitions

• Simple identity, which scores only identical amino
acids as a match.
• Genetic code changes, which scores the
minimum number of nucleotide changes needed to change
a codon for one amino acid into a codon for the
other.
• Chemical similarity of amino acid side chains,
which scores as a match two amino acids which
have a similar side chain, such as hydrophobic,
charged and polar amino acid groups.
• The Dayhoff percent accepted mutation (PAM)
family of matrices, which scores amino acid pairs
on the basis of the expected frequency of
substitution of one amino acid for the other during
protein evolution.
• The blocks substitution matrix (BLOSUM) amino
acid substitution tables, which scores amino acid
pairs based on the frequency of amino acid
substitutions in aligned sequence motifs called
blocks which are found in protein families
Overview

BLOSUM (BLOck – SUM) scoring
Block = ungapped alignment
E.g. amino acids D, N, V, A
    a b c d e f
1   D D N A A V
2   D N A V D D
3   N N V A V V
S = 3 sequences
W = 6 aa
N = (W*S*(S-1))/2 = 18 pairs

A. Observed pairs
    a b c d e f
1   D D N A A V
2   D N A V D D
3   N N V A V V

Pair counts f_ij:
     D     N     A     V
D    1
N    4     1
A    1     1     1
V    3     1     4     1

Relative frequency table g_ij = f_ij / 18:
     D     N     A     V
D    .056
N    .222  .056
A    .056  .056  .056
V    .167  .056  .222  .056
Probability of obtaining a pair
if randomly choosing pairs
from block

B. Expected pairs
Residue counts in the block (DDNAAV, DNAVDD, NNVAVV) and frequencies P_i:
D  DDDDD  5/18
N  NNNN   4/18
A  AAAA   4/18
V  VVVVV  5/18
P{Draw DN pair} = P{Draw D, then N or Draw N, then D}
P{Draw DN pair} = P_D P_N + P_N P_D = 2 * (5/18) * (4/18) = .123

Random relative frequency table e_ij:
     D     N     A     V
D    .077
N    .123  .049
A    .123  .099  .049
V    .154  .123  .123  .077
Probability of obtaining a pair of
each amino acid drawn
independently from block

C. Summary (A/B)
s_ij = log2(g_ij / e_ij)
(s_ij) is the basic BLOSUM score matrix
Notes:
• Observed pairs in blocks contain information about
relationships at all levels of evolutionary distance
simultaneously (cf. Dayhoff's close relationships)
• The actual algorithm generates observed + expected pair
distributions by accumulation over a set of approx. 2000
ungapped blocks of varying width (w) + depth (s)
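
Steps A to C can be reproduced for the three-sequence example block with a short Python sketch. It only illustrates the basic log-odds calculation on this one block; the published BLOSUM matrices additionally cluster similar sequences, pool many blocks, and scale and round the scores, none of which is done here. Variable names are chosen for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2

block = ["DDNAAV", "DNAVDD", "NNVAVV"]

# A. Observed pairs: count every unordered residue pair within each column.
observed = Counter()
for column in zip(*block):
    for a, b in combinations(column, 2):
        observed[tuple(sorted((a, b)))] += 1
total_pairs = sum(observed.values())  # W*S*(S-1)/2 = 18

# B. Expected pairs from the residue frequencies P_i in the block.
residues = Counter("".join(block))
n_residues = sum(residues.values())
p = {aa: n / n_residues for aa, n in residues.items()}


def expected(a: str, b: str) -> float:
    """Probability of drawing the unordered pair (a, b) independently."""
    return p[a] * p[b] if a == b else 2 * p[a] * p[b]


# C. Log-odds score s_ij = log2(g_ij / e_ij) for every observed pair.
scores = {
    pair: log2((count / total_pairs) / expected(*pair))
    for pair, count in observed.items()
}

for pair, s in sorted(scores.items()):
    print(pair, round(s, 2))
```

Pairs seen more often than chance (such as D/N and A/V in this block) get positive scores; pairs seen less often get negative scores.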

• blosum30,35,40,45,50,55,60,62,65,70,75,80,85,90
• transition frequencies observed directly by identifying
blocks that are at least
– 45% identical (BLOSUM45)
– 50% identical (BLOSUM50)
– 62% identical (BLOSUM62) etc.
• No extrapolation made
• High blosum - closely related sequences
• Low blosum - distant sequences
• blosum45 ≈ pam250
• blosum62 ≈ pam160
• blosum62 is the most popular matrix
The BLOSUM Series

• Church of the Flying Spaghetti Monster
• http://www.venganza.org/about/open-letter

– Henikoff and Henikoff have compared the
BLOSUM matrices to PAM by evaluating how
effectively the matrices can detect known members
of a protein family from a database when searching
with the ungapped local alignment program
BLAST. They conclude that overall the BLOSUM
62 matrix is the most effective.
• However, all the substitution matrices investigated
perform better than BLOSUM 62 for a proportion of
the families. This suggests that no single matrix is
the complete answer for all sequence comparisons.
• It is probably best to complement the BLOSUM 62
matrix with comparisons using the PAM250 matrix and
Overington structurally derived matrices.
– It seems likely that as more protein three
dimensional structures are determined, substitution
tables derived from structure comparison will give
the most reliable data.
Overview

Rat versus
mouse RBP
Rat versus
bacterial
lipocalin

• Exhaustive …
– All combinations:
• Algorithm
– Dynamic programming (much faster)
• Heuristics
– Needleman-Wunsch for global
alignments
(Journal of Molecular Biology, 1970)
– Later adapted by Smith and Waterman
for local alignment
Alignments

A metric …
GACGGATTAG, GATCGGAATAG
GA-CGGATTAG
GATCGGAATAG
+1 (a match), -1 (a mismatch),-2 (gap)
9*1 + 1*(-1)+1*(-2) = 6
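
The metric above can be checked with a small Python function that scores an alignment that has already been made. The function name `alignment_score` is an assumption for this sketch; the scoring values are the ones given on the slide.

```python
def alignment_score(row1: str, row2: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Sum column scores of two aligned rows; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 9 matches, 1 mismatch, 1 gap
```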

Needleman-Wunsch-edu.pl
The Score Matrix
----------------
Seq1(j)        1    2    3    4    5    6    7
Seq2       *    C    K    H    V    F    C    R
(i)  *    0   -1   -2   -3   -4   -5   -6   -7
1    C   -1    1    0   -1   -2   -3   -4   -5
2    K   -2    0    2    1    0   -1   -2   -3
3    K   -3   -1    1    1    0   -1   -2   -3
4    C   -4   -2    0    0    0   -1    0   -1
5    F   -5   -3   -1   -1   -1    1    0   -1
6    C   -6   -4   -2   -2   -2    0    2    1
7    K   -7   -5   -3   -3   -3   -1    1    1
8    C   -8   -6   -4   -4   -4   -2    0    0
9    V   -9   -7   -5   -5   -3   -3   -1   -1

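Each cell takes the best of three moves:
A (diagonal): diagonal_score = matrix(i-1,j-1) + MATCH
if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise matrix(i-1,j-1) + MISMATCH
B (up): up_score = matrix(i-1,j) + GAP
C (left): left_score = matrix(i,j-1) + GAP
matrix(i,j) = max(diagonal_score, up_score, left_score)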

Seq1:CKHVFCRVCI
Seq2:CKKCFC-KCV
++--++--+- score = 0
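
The matrix fill can be sketched in Python as follows. This is not the course script Needleman-Wunsch-edu.pl; it is a minimal re-implementation of the same recurrence, and the scoring values (match +1, mismatch -1, gap -1) are inferred from the score matrix shown, not stated on the slide. Rows follow Seq2 (i) and columns follow Seq1 (j), as in the matrix.

```python
def nw_matrix(seq1: str, seq2: str, match: int = 1,
              mismatch: int = -1, gap: int = -1) -> list[list[int]]:
    """Fill the Needleman-Wunsch score matrix (rows: seq2, columns: seq1)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # Initialization: aligning a prefix against nothing costs one gap per residue.
    for j in range(1, cols):
        matrix[0][j] = j * gap
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            diagonal = matrix[i - 1][j - 1] + pair  # A: (mis)match
            up = matrix[i - 1][j] + gap             # B: gap in seq1
            left = matrix[i][j - 1] + gap           # C: gap in seq2
            matrix[i][j] = max(diagonal, up, left)
    return matrix


for row in nw_matrix("CKHVFCR", "CKKCFCKCV"):
    print(" ".join(f"{value:3d}" for value in row))
```

The bottom-right cell holds the score of the best global alignment; recovering the alignment itself requires a traceback from that cell, which is omitted here.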

• Practicum: use similarity function in
initialization step -> scoring tables
• Time Complexity
• Use random proteins to generate
histogram of scores from aligned
random sequences

Time complexity with needleman-wunsch.pl
Sequence Length (aa)   Execution Time (s)
10                     0
25                     0
50                     0
100                    1
500                    5
1000                   19
2500                   559
5000                   Memory could not be written

Average around -64 !
-80
-78
-76
-74
-72 **
-70 *******
-68 ***************
-66 *************************
-64 ************************************************************
-60 ***********************
-58 ***************
-56 ********
-54 ****
-52 *
-50
-48
-46
-44
-42
-40
-38
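
A histogram like the one above can be generated with the following Python sketch. The sequence length, number of trials, scoring values and uniform amino acid frequencies are all assumptions for illustration (the slide does not state the settings used), so the resulting average will differ from -64. The score function keeps only two matrix rows in memory, which avoids the memory failure seen with long sequences when only the score is needed.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nw_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only two rows of the matrix in memory."""
    previous = [j * gap for j in range(len(seq1) + 1)]
    for i, b in enumerate(seq2, start=1):
        current = [i * gap]
        for j, a in enumerate(seq1, start=1):
            pair = match if a == b else mismatch
            current.append(max(previous[j - 1] + pair,
                               previous[j] + gap,
                               current[j - 1] + gap))
        previous = current
    return previous[-1]


def random_protein(length, rng):
    """Random sequence with uniform amino acid frequencies (an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def score_histogram(trials=200, length=50, bin_width=2, seed=0):
    """Bin the alignment scores of random sequence pairs."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(trials):
        score = nw_score(random_protein(length, rng), random_protein(length, rng))
        bins[score - score % bin_width] += 1
    return bins


histogram = score_histogram()
for start in sorted(histogram):
    print(f"{start:4d} {'*' * histogram[start]}")
```

Scores of real alignments can then be compared against this background distribution to judge whether they are better than expected by chance.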

If the sequences are similar, the path
of the best alignment should be very
close to the main diagonal.
Therefore, we may not need to fill the
entire matrix, rather, we fill a narrow
band of entries around the main
diagonal.
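A banded algorithm fills in only a band of
width 2k+1 around the main diagonal, so the work
grows with k times the sequence length rather than
with the product of the two lengths.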
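
A minimal Python sketch of the banded idea, assuming the same +1/-1/-1 scoring as a default and a hypothetical function name `banded_nw_score`. Cells outside the band are treated as unreachable. For clarity each row is still allocated at full length; only the 2k+1 cells inside the band are actually computed.

```python
def banded_nw_score(seq1, seq2, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells with |i - j| <= k."""
    n, m = len(seq1), len(seq2)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    unreachable = float("-inf")
    # Row 0: only the first k gap cells lie inside the band.
    previous = [j * gap if j <= k else unreachable for j in range(n + 1)]
    for i in range(1, m + 1):
        current = [unreachable] * (n + 1)
        if i <= k:
            current[0] = i * gap
        for j in range(max(1, i - k), min(n, i + k) + 1):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            current[j] = max(previous[j - 1] + pair,  # diagonal
                             previous[j] + gap,       # up
                             current[j - 1] + gap)    # left
        previous = current
    return previous[n]


print(banded_nw_score("GACGGATTAG", "GATCGGAATAG", k=1, gap=-2))
```

The result equals the full-matrix score only when the best alignment stays inside the band; if k is too small for dissimilar sequences, the banded score is a lower bound.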

Phylogenetic methods may be used to
solve crimes, test purity of products, and
determine whether endangered species
have been smuggled or mislabeled:
– Vogel, G. 1998. HIV strain analysis debuts in
murder trial. Science 282(5390): 851-853.
– Lau, D. T.-W., et al. 2001. Authentication of
medicinal Dendrobium species by the internal
transcribed spacer of ribosomal DNA. Planta
Med 67:456-460.
Examples

– Epidemiologists use phylogenetic methods to
understand the development of pandemics,
patterns of disease transmission, and
development of antimicrobial resistance or
pathogenicity:
• Basler, C.F., et al. 2001. Sequence of the 1918
pandemic influenza virus nonstructural gene (NS)
segment and characterization of recombinant viruses
bearing the 1918 NS genes. PNAS, 98(5):2746-2751.
• Ou, C.-Y., et al. 1992. Molecular epidemiology of HIV
transmission in a dental practice. Science
256(5060):1165-1171.
• Bacillus anthracis:
Examples

• Finding a structural homologue
• Blast
–versus PDB database or PSI-
blast (E<0.005)
–Domain coverage at least 60%
• Avoid Gaps
–Prefer few gaps and
reasonable similarity scores
over many gaps and high
similarity scores
Modeling

Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap
Majority-rule consensus
Taxa: Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Euplotes (8), Tetrahymena (9), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7)
Bootstrap support values shown on the tree: 100, 96, 84, 100, 100, 100

Overview
Personalized Medicine,
Biomarkers …
… Molecular Profiling
First Generation Molecular Profiling
Next Generation Molecular Profiling
Next Generation Epigenetic Profiling
Concluding Remarks

CONFIDENTIAL
Defining Epigenetics
• Reversible changes in gene
expression/function
• Without changes in DNA
sequence
• Can be inherited from
precursor cells
• Allows integration of intrinsic
and environmental signals
(including diet)
Methylation | Epigenetics | Oncology | Biomarker
Genome
DNA
Gene Expression
Epigenome
Chromatin
Phenotype
I NEXT-GEN | PharmacoDX | CRC

Epigenetic Regulation:
Post-Translational Modifications to Histones and Base Changes in DNA
• Epigenetic modifications of histones and DNA include:
– Histone acetylation (Ac) and methylation (Me), and DNA methylation (Me)

MGMT Biology
O6 Methyl-Guanine
Methyl Transferase
Essential DNA Repair Enzyme
Removes alkyl groups from damaged guanine
bases
Healthy individual:
- MGMT is an essential DNA repair enzyme
Loss of MGMT activity makes individuals susceptible
to DNA damage and prone to tumor development
Glioblastoma patient on alkylator chemotherapy:
- Patients with MGMT promoter methylation
have longer progression-free survival (PFS) and overall survival (OS)
when alkylating agents are used as chemotherapy

MGMT Promoter
Methylation Predicts
Benefit from DNA-Alkylating Chemotherapy
Post-hoc subgroup analysis of a temozolomide clinical trial in primary glioblastoma
patients shows benefit for patients with MGMT promoter methylation
Median overall survival with radiotherapy plus temozolomide:
21.7 months (methylated MGMT gene)
12.7 months (non-methylated MGMT gene)
Adapted from Hegi et al.
NEJM 2005
352(10):1036-8.
Study with 207 patients

Genome-wide methylation
by methylation sensitive restriction enzymes

by probes

# samples
# markers
…. by next generation sequencing
Discovery
Verification
Validation

MBD_Seq
DNA Sheared
Immobilized
Methyl Binding Domain
Condensed Chromatin
DNA Sheared

Immobilized
Methyl binding domain
MgCl2
Next Gen Sequencing
GA Illumina: 100 million reads
MBD_Seq

MBD_Seq
MGMT = dual core

# markers
MBD_Seq
Discovery
1-2 million
methylation
cores

Data integration
Correlation tracks
methylation methylation
expression expression
Corr =-1 Corr = 1
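
One way to compute such a correlation value is the Pearson coefficient between methylation and expression across samples. The sketch below is an assumption about how a correlation track could be derived, not the method used for these slides, and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical values for one locus across five samples.
methylation = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [10.0, 8.0, 6.0, 4.0, 2.0]  # expression falls as methylation rises
print(round(pearson(methylation, expression), 3))
```

A value near -1 marks loci where higher methylation goes with lower expression; a value near +1 marks the opposite pattern.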

Correlation track
in GBM @ MGMT
+1
-1

# markers
MBD_Seq
454_BT_Seq
MSP
Discovery
Verification
Validation

GCATCGTGACTTACGACTGATCGATGGATGCTAGCAT
unmethylated alleles
less methylation
methylated alleles
more methylation
Deep Sequencing

Deep MGMT
Heterogenic complexity


biobix
wvcrieki
biobix.be
bioinformatics.be
