Golden Rules of Bioinformatics

An Introduction to Bioinformatics Tools
Part 1: Golden Rules of Bioinformatics
Leighton Pritchard and Peter Cock

On Confidence
“Ignorance more frequently begets confidence than does
knowledge: it is those who know little, not those who know much,
who so positively assert...”
- Charles Darwin

Table of Contents
Rule 0
Rule 1
Rule 2
Rule 3
Conclusions

Zeroth Golden Rule of Bioinformatics
• No-one knows everything about everything - talk to people!
  • local bioinformaticians, mailing lists, forums, Twitter, etc.
• Keep learning - there are lots of resources
• There is no free lunch - no method works best on all data
• The worst errors are silent - share worries, problems, etc.
• Share expertise (see first item)

Subgroups
• You are in group A, B, C or D - this decides your dataset:
expnA.tab, expnB.tab, expnC.tab, expnD.tab
• You will use R at the command-line to analyse your data

The biological question
• Your dataset expn?.tab describes (log) expression data for
two genes: gene1 and gene2
• Expression measured at eleven time points (including control)
• Q: Are gene1 and gene2 coregulated?
• How do we answer this question?

Reformulating the biological question
• A: We cannot determine this from expression data alone
• Reformulate the question:
  • NewQ: Is there evidence that gene1 and gene2 expression
    profiles are correlated?
    (is gene1 expression ∝ gene2 expression?)
• How do we answer this new question?

Starting the analysis
• Change directory to where Exercise 1 data is located, and
start R.
$ cd ../../data/ex1_expression/
$ R

Load and inspect data in R
> data = read.table("expnA.tab", sep="\t", header=TRUE)
> head(data)
  gene1 gene2
1    10  8.04
2     8  6.95
3    13  7.58
4     9  8.81
5    11  8.33
6    14  9.96

Load and inspect data in R
> mean(data$gene1)
[1] 9
> mean(data$gene2)
[1] 7.500909
> sd(data$gene1)
[1] 3.316625
> sd(data$gene2)
[1] 2.031568
> cor(data)
          gene1     gene2
gene1 1.0000000 0.8164205
gene2 0.8164205 1.0000000

Results
measure      expnA  expnB  expnC  expnD
mean(gene1)  9      9      9      9
mean(gene2)  7.5    7.5    7.5    7.5
sd(gene1)    3.3    3.3    3.3    3.3
sd(gene2)    2.0    2.0    2.0    2.0
cor(data)    0.816  0.816  0.816  0.816
• r = 0.816 (P < 0.005) in every experiment
• Can we conclude that gene1 and gene2 are coexpressed in
each experiment?
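
The quoted P-value can be checked from the reported numbers alone. The sketch below is a minimal illustration, assuming a two-sided test of zero Pearson correlation with n = 11 (the eleven time points); the variable names are illustrative and not part of the exercise files.

```r
# Assumed inputs, taken from the slides: r from cor(data), n from
# "eleven time points (including control)".
r <- 0.8164205
n <- 11

# Under the null hypothesis of no correlation, this statistic follows a
# t distribution with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided P-value.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(t_stat)
print(p_value)
```

A small P-value only says that a linear association this strong is unlikely under the null model; it says nothing about whether a straight line is a sensible description of the data.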

Plot the data in R
> plot(data)

Always plot the data
Which gene pairs are coexpressed?

Always plot the data
Is the matrix of (Pearson) correlation values potentially misleading?
> data = anscombe
> cor(data)[1:4, 5:8]
           y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
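
A possible way to see all four cases side by side is sketched below. It uses R's built-in anscombe data frame (columns x1..x4 and y1..y4) and treats each x/y pair as one "experiment"; the pairing and the names pairs_summary and pair1..pair4 are assumptions made for illustration.

```r
# Summary statistics for each of the four x/y pairs in 'anscombe'.
pairs_summary <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), r = cor(x, y))
})
colnames(pairs_summary) <- paste0("pair", 1:4)
print(round(pairs_summary, 2))

# The summaries agree closely; the scatter plots do not.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Pair", i))
}
par(op)
```

Drawing the four panels in one figure makes it hard to miss that near-identical summary statistics can come from very different relationships.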

Sometimes real correlation doesn’t mean anything

First Golden Rule of Bioinformatics
• Always inspect the raw data (trends, outliers, clustering)
• What is the question? Can the data answer it?
• Communicate with data collectors! (don’t be afraid of pedantry)
  • Who? When? How?
• You need to understand the experiment to analyse it (easier if
you helped design it).
• Be wary of block effects (experimenter, time, batch, etc.)

Exercise 2
• You are in group A, B, C or D - this decides your database:
dbA, dbB, dbC, dbD
• You will use BLAST at the command-line to analyse your data
• You will use script at the command-line to record your work

Exercise 2
• Start recording your actions by entering script at the
command line
$ script
Script started, output file is typescript

Exercise 2
• Change directory to the ex2_blast directory
• Run BLAST with the appropriate database
• Exit script
$ cd ../ex2_blast
$ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
$ exit
exit
Script done, output file is typescript

Exercise 2
• You can view the typescript file with cat
$ cat typescript
Script started on Fri May 9 10:45:12 2014
lpritc@lpmacpro:$ cd ../ex2_blast
[...]

Exercise 2
Query= query protein sequence
Length=400
                                                                      Score
Sequences producing significant alignments:                          (Bits)
PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3
> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486
 Score = 34.3 bits (77), Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)
Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95
Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104

Exercise 2
• What is a reasonable E-value threshold to call a 'match'?
  • 1e-05, 0.001, 0.1, 10?
         dbA   dbB    dbC    dbD
E-value  0.45  0.002  4e-06  0.019
• Five orders of magnitude difference in E-value, depending on
database choice - Why?

Exercise 2
• E-values depend on database size
• Bit score and alignment do not depend on database size
           dbA         dbB      dbC    dbD
E-value    0.45        0.002    4e-06  0.019
Bit score  34.3        34.3     34.3   34.3
Sequences  100,001     501      1      5,001
Letters    48,650,486  210,866  486    2,066,510
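
The dependence on database size can be sketched with the simplified Karlin-Altschul relation E ≈ m · n · 2^(-S'), where m is the query length, n the number of letters in the database and S' the bit score. This is a rough illustration only: BLAST also applies effective-length corrections, so the values below will not reproduce the reported E-values exactly, but the linear scaling with n is the point.

```r
# Values taken from the table above and the BLAST report.
bit_score  <- 34.3
query_len  <- 400
db_letters <- c(dbA = 48650486, dbB = 210866, dbC = 486, dbD = 2066510)
reported   <- c(dbA = 0.45, dbB = 0.002, dbC = 4e-06, dbD = 0.019)

# Simplified estimate: same alignment, same bit score, different search space.
approx_evalue <- query_len * db_letters * 2^(-bit_score)

print(signif(approx_evalue, 2))
print(signif(approx_evalue / reported, 2))  # ratio of estimate to reported value
```

Because the bit score is fixed, any E-value threshold is implicitly a statement about the size of the database that was searched.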

Exercise 2
• E-values differ, but the query matches a choline
transporter-like protein quite well...
  • Doesn’t it?
• After all, a biological match is a biological match...
  • Isn’t it?

Exercise 2
Query= query protein sequence
Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3    4e-06
> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486
 Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)
Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95
Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104

Exercise 2
• Sequence accessions (PITG_?????T0) are correct in the
databases
• Sequence functional descriptions are randomly shuffled:
lengths do not match in BLAST output
• dbA contains only three different sequences: two are repeated
50,000 times
• query.fasta is random sequence, not a real protein
  • Shuffled from all P. infestans proteins
  • No nr or Pfam matches

Second Golden Rule of Bioinformatics
• Do not trust the software: it is not an authority
• Software does not distinguish meaningful from meaningless
data
• Software has bugs
• Algorithms have assumptions, conditions, and applicable
domains
• Some problems are inherently hard, or even insoluble
• You must understand the analysis/algorithm
• Always sanity test
• Test output for robustness to parameter (including data)
choice
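
One possible sanity test is a negative control: a composition-preserving shuffle of a real sequence, which should not produce biologically meaningful hits. The sketch below is an assumption-laden illustration; shuffle_sequence is a hypothetical helper, and the input is just the aligned subject fragment from the BLAST report with gaps removed.

```r
# Hypothetical helper: permute the residues of a sequence so that length and
# amino-acid composition are preserved but residue order is destroyed.
shuffle_sequence <- function(seq) {
  residues <- strsplit(seq, "")[[1]]
  paste(sample(residues), collapse = "")
}

set.seed(1)  # fixed seed so the control is reproducible

original <- "ELMVPMLYSLYLVVLFHLPVSAYYPHTASMTAHELQGAVITILVAETPSIIIQFAK"
control  <- shuffle_sequence(original)

print(control)
# Same composition, different order.
print(table(strsplit(original, "")[[1]]))
print(table(strsplit(control, "")[[1]]))
```

Running the same search with such a control gives a feel for what scores and E-values meaningless input can achieve against a given database.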

Exercise 3
• Rule: If there is a vowel on one side of the card, there must
be an even number on the other side.
• Four cards show the faces E, K, 4 and 7
• Which cards must be turned over to determine if this rule (if
a card shows a vowel on one face, the opposite face is even)
holds true?

Exercise 3
This is the Wason Selection Task
• If you chose E and 4
  • You are in the typical majority group
  • You are not correct
  • You have been a victim of confirmation bias (System 1
    thinking)
• If you chose E and 7
  • Congratulations!
  • Your choice was capable of falsifying the rule.

Exercise 3
Rule: If there is a vowel on one side of the card, there must be an
even number on the other side.
Card  Outcome    Rule
E     Even       Can be true even if rule false
E     Odd        violated
K     Even       na
K     Odd        na
4     Vowel      Can be true even if rule false
4     Consonant  na
7     Vowel      violated
7     Consonant  na
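
The table can be reproduced by brute force. The sketch below assumes each card has a letter on one face and a number on the other, and uses one representative hidden face of each kind (4 and 7 for numbers, E and K for letters); the function names are illustrative.

```r
is_vowel <- function(ch) ch %in% c("A", "E", "I", "O", "U")
is_even  <- function(ch) as.integer(ch) %% 2 == 0

# The rule is violated only by a vowel paired with an odd number.
rule_violated <- function(letter, number) is_vowel(letter) && !is_even(number)

# A card is worth turning over only if some hidden face could violate the rule.
can_falsify <- function(visible) {
  if (visible %in% LETTERS) {
    hidden <- c("4", "7")   # representative even and odd numbers
    any(sapply(hidden, function(h) rule_violated(visible, h)))
  } else {
    hidden <- c("E", "K")   # representative vowel and consonant
    any(sapply(hidden, function(h) rule_violated(h, visible)))
  }
}

cards <- c("E", "K", "4", "7")
must_turn <- cards[sapply(cards, can_falsify)]
print(must_turn)  # "E" "7"
```

Turning over the 4 cannot falsify the rule whatever is on its other face, which is why that choice is uninformative.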

Exercise 3
• This is equivalent to functional classification, e.g.:
  • Rule: If there is a CRN/RxLR/T3SS domain, the protein must
    be an effector.

Exercise 3
• Confirmation Bias (Wason Selection Task)
  • An uninformative experiment is performed
  • http://en.wikipedia.org/wiki/Wason_selection_task
• Affirming the Consequent (a related formal fallacy)
  1. If P, then Q
  2. Q
  3. Therefore, P
  • Experimental results are misinterpreted
  • http://en.wikipedia.org/wiki/Affirming_the_consequent
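
That affirming the consequent is invalid can be checked with a small truth table. This is a minimal sketch; the column and variable names are illustrative.

```r
# All combinations of truth values for P and Q.
truth <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))

# Material implication: "if P, then Q" is false only when P is true and Q false.
truth$P_implies_Q <- !truth$P | truth$Q

# Rows where both premises ("if P, then Q" and "Q") hold.
premises_hold <- truth$P_implies_Q & truth$Q

# A counterexample is a row where the premises hold but the conclusion P fails.
counterexamples <- truth[premises_hold & !truth$P, ]
print(counterexamples)
```

The single counterexample (P false, Q true) corresponds to a protein that is an effector without carrying the domain named in the rule.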

Third Golden Rule of Bioinformatics
• Everyone has expectations of their data/experiment
• Beware cognitive errors, such as confirmation bias!
• System 1 vs. System 2 ≈ intuition vs. reason
• Think statistically!
• Large datasets can be counterintuitive and appear to confirm a
large number of contradictory hypotheses
• Always account for multiple tests
• Avoid “data dredging”: intensive computation is not an
adequate substitute for expertise
• Use test-driven development of analyses and code
• Use examples that pass and fail

In Conclusion
• Always communicate!
  • worst errors are silent
• Don’t trust the data
  • formatting/validation/category errors - check!
  • suitability for scientific question
• Don’t trust the software
  • software is not an authority
  • always benchmark, always validate
• Don’t trust yourself
  • beware cognitive errors
  • think statistically
  • biological “stories” can be constructed from nonsense
