An Introduction to Bioinformatics Tools
Part 1: Golden Rules of Bioinformatics
Leighton Pritchard and Peter Cock
On Confidence
“Ignorance more frequently begets confidence than does
knowledge: it is those who know little, not those who know much,
who so positively assert. . .”
- Charles Darwin
Table of Contents
Rule 0
Rule 1
Rule 2
Rule 3
Conclusions
Zeroth Golden Rule of Bioinformatics
• No-one knows everything about everything - talk to people!
• local bioinformaticians, mailing lists, forums, Twitter, etc.
• Keep learning - there are lots of resources
• There is no free lunch - no method works best on all data
• The worst errors are silent - share worries, problems, etc.
• Share expertise (see first item)
Subgroups
• You are in group A, B, C or D - this decides your dataset:
expnA.tab, expnB.tab, expnC.tab, expnD.tab
• You will use R at the command-line to analyse your data
The biological question
• Your dataset expn?.tab describes (log) expression data for
two genes: gene1 and gene2
• Expression measured at eleven time points (including control)
• Q: Are gene1 and gene2 coregulated?
• How do we answer this question?
Reformulating the biological question
• Q: Are gene1 and gene2 coregulated?
• A: We cannot determine this from expression data alone
• Reformulate the question:
• NewQ: Is there evidence that gene1 and gene2 expression
profiles are correlated?
(is gene1 expression ∝ gene2 expression?)
• How do we answer this new question?
Starting the analysis
• Change directory to where Exercise 1 data is located, and
start R.
$ cd ../../data/ex1_expression/
$ R
Load and inspect data in R
> data = read.table("expnA.tab", sep="\t", header=TRUE)
> head(data)
  gene1 gene2
1    10  8.04
2     8  6.95
3    13  7.58
4     9  8.81
5    11  8.33
6    14  9.96
Load and inspect data in R
> mean(data$gene1)
[1] 9
> mean(data$gene2)
[1] 7.500909
> sd(data$gene1)
[1] 3.316625
> sd(data$gene2)
[1] 2.031568
> cor(data)
          gene1     gene2
gene1 1.0000000 0.8164205
gene2 0.8164205 1.0000000
Results
measure      expnA  expnB  expnC  expnD
mean(gene1)  9      9      9      9
mean(gene2)  7.5    7.5    7.5    7.5
sd(gene1)    3.3    3.3    3.3    3.3
sd(gene2)    2.0    2.0    2.0    2.0
cor(data)    0.816  0.816  0.816  0.816
• r = 0.816 (P < 0.005) in every experiment
• Can we conclude that gene1 and gene2 are coexpressed in
each experiment?
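The summary table can be reproduced in one pass. The sketch below is a minimal illustration that assumes the four expn?.tab files correspond to the four x/y pairs of R's built-in anscombe dataset (the expnA values shown earlier match x1/y1; the mapping of B, C and D to pairs 2 to 4 is an assumption). The names pairs and summary_table are illustrative only.

```r
# Assumption: expnA..expnD correspond to Anscombe pairs 1..4.
pairs <- lapply(1:4, function(i) {
  data.frame(gene1 = anscombe[[paste0("x", i)]],
             gene2 = anscombe[[paste0("y", i)]])
})
names(pairs) <- paste0("expn", LETTERS[1:4])

# One column per dataset, one row per summary measure.
summary_table <- sapply(pairs, function(d) {
  c(mean_gene1 = mean(d$gene1),
    mean_gene2 = mean(d$gene2),
    sd_gene1   = sd(d$gene1),
    sd_gene2   = sd(d$gene2),
    r          = cor(d$gene1, d$gene2),
    p_value    = cor.test(d$gene1, d$gene2)$p.value)
})

print(round(summary_table, 3))
```

Every column of summary_table is near-identical, which is exactly why the summary statistics alone cannot tell the four experiments apart.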
Plot the data in R
> plot(data)
Always plot the data
Which gene pairs are coexpressed?
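To compare all four groups side by side, the plots can be drawn in a single 2x2 panel. This is a sketch using the built-in anscombe dataset as a stand-in for the four expn?.tab files (an assumption); the axis labels and the dashed least-squares line are additions for illustration.

```r
# Draw the four Anscombe pairs in a 2x2 grid, each with its fitted line.
op <- par(mfrow = c(2, 2))
slopes <- numeric(4)
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                 # ordinary least-squares fit
  slopes[i] <- coef(fit)[["x"]]    # keep the slope for comparison
  plot(x, y, xlab = "gene1", ylab = "gene2",
       main = paste("Anscombe pair", i))
  abline(fit, lty = 2)
}
par(op)                            # restore the previous layout

print(round(slopes, 3))
```

The fitted lines are almost the same in every panel, yet the point clouds look nothing alike: one roughly linear, one curved, and two dominated by a single outlying point.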
Always plot the data
Is the matrix of (Pearson) correlation values potentially misleading?
> data = anscombe
> cor(data)[1:4, 5:8]
           y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
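Only the diagonal of this matrix pairs each x with its own y; the off-diagonal entries correlate the x values of one dataset with the y values of another. A short sketch for pulling out just the matched pairs (the names cross and matched are illustrative):

```r
# Correlations between every x column and every y column of anscombe.
cross <- cor(anscombe)[1:4, 5:8]

# The diagonal holds the within-dataset correlations (x1~y1, x2~y2, ...).
matched <- diag(cross)
names(matched) <- paste0("pair", 1:4)

print(matched)
```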
Sometimes real correlation doesn’t mean anything
First Golden Rule of Bioinformatics
• Always inspect the raw data (trends, outliers, clustering)
• What is the question? Can the data answer it?
• Communicate with data collectors! (don’t be afraid of
pedantry)
• Who? When? How?
• You need to understand the experiment to analyse it (easier if
you helped design it).
• Be wary of block effects (experimenter, time, batch, etc.)
Exercise 2
• You are in group A, B, C or D - this decides your database
dbA, dbB, dbC, dbD
• You will use BLAST at the command-line to analyse your data
• You will use script at the command-line to record your work
Exercise 2
• Start recording your actions by entering script at the
command line
$ script
Script started, output file is typescript
Exercise 2
• Change directory to the ex2_blast directory
• Run BLAST with the appropriate database
• Exit script
$ cd ../ex2_blast
$ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
$ exit
exit
Script done, output file is typescript
Exercise 2
• You can view the typescript file with cat
$ cat typescript
Script started on Fri May 9 10:45:12 2014
lpritc@lpmacpro:$ cd ../ex2_blast
[...]
Exercise 2
Query= query protein sequence
Length=400
                                                                      Score
Sequences producing significant alignments:                          (Bits)

PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...   34.3

> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486

 Score = 34.3 bits (77), Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104
Exercise 2
• What is a reasonable E-value threshold to call a 'match'?
• 1e-05, 0.001, 0.1, 10?
         dbA   dbB    dbC    dbD
E-value  0.45  0.002  4e-06  0.019
• Five orders of magnitude difference in E-value, depending on
database choice - Why?
Exercise 2
• E-values depend on database size
• Bit score and alignment do not depend on database size
           dbA         dbB      dbC    dbD
E-value    0.45        0.002    4e-06  0.019
Bit score  34.3        34.3     34.3   34.3
Sequences  100,001     501      1      5,001
Letters    48,650,486  210,866  486    2,066,510
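The dependence on database size can be made concrete with the standard approximation E ≈ m × n × 2^(-S), where m is the query length, n the total number of letters in the database and S the bit score. The sketch below applies it to the figures in the table. It is an approximation only: BLAST itself uses edge-corrected effective lengths, so the values it reports are smaller than those computed here. The function name approx_evalue is illustrative.

```r
# Approximate E-value from a bit score: E ~ m * n * 2^(-S).
# Assumption: raw lengths are used, not BLAST's effective lengths.
approx_evalue <- function(bit_score, query_len, db_letters) {
  query_len * db_letters * 2^(-bit_score)
}

# Database sizes (total letters) taken from the table above.
db_letters <- c(dbA = 48650486, dbB = 210866, dbC = 486, dbD = 2066510)

# Same query (400 residues) and same bit score (34.3) in every database.
e <- approx_evalue(bit_score = 34.3, query_len = 400, db_letters = db_letters)

print(signif(e, 2))
```

Because the query and bit score are fixed, the approximate E-values differ only through n: the ratio between any two databases equals the ratio of their letter counts.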
Exercise 2
• E-values differ, but the query matches a choline
transporter-like protein quite well. . .
• Doesn’t it?
• After all, a biological match is a biological match. . .
• Isn’t it?
Exercise 2
Query= query protein sequence
Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...   34.3    4e-06

> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486

 Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104
Exercise 2
• Sequence accessions (PITG_?????T0) are correct in the
databases
• Sequence functional descriptions are randomly shuffled:
lengths do not match in BLAST output
• dbA contains only three different sequences: two are repeated
50,000 times
• query.fasta is random sequence, not a real protein
• Shuffled from all P. infestans proteins
• No nr or Pfam matches
Second Golden Rule of Bioinformatics
• Do not trust the software: it is not an authority
• Software does not distinguish meaningful from meaningless
data
• Software has bugs
• Algorithms have assumptions, conditions, and applicable
domains
• Some problems are inherently hard, or even insoluble
• You must understand the analysis/algorithm
• Always sanity test
• Test output for robustness to parameter (including data)
choice
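One cheap sanity test suggested by Exercise 2 is to check how many distinct sequences a database actually contains before trusting statistics that depend on its size. The sketch below works on FASTA lines already read into R (for a real file, readLines() would supply them). The function name count_distinct_sequences and the toy records are hypothetical.

```r
# Count FASTA records and how many distinct sequences they contain.
count_distinct_sequences <- function(fasta_lines) {
  fasta_lines <- fasta_lines[nzchar(fasta_lines)]   # drop blank lines
  is_header <- startsWith(fasta_lines, ">")
  record_id <- cumsum(is_header)                    # record index for each line
  # Join the sequence lines belonging to each record into one string.
  seqs <- tapply(fasta_lines[!is_header], record_id[!is_header],
                 paste, collapse = "")
  c(records = sum(is_header), distinct = length(unique(seqs)))
}

# Toy example: three records, two of which hold the same sequence.
toy <- c(">seq1 first copy", "MKT", "AAL",
         ">seq2 second copy", "MKTAAL",
         ">seq3 different", "MPPQ")

result <- count_distinct_sequences(toy)
print(result)
```

A large gap between records and distinct, as in dbA, is a warning that the database size (and therefore every E-value computed against it) is inflated by redundancy.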
Exercise 3
• Rule: If there is a vowel on one side of the card, there must
be an even number on the other side.
• Which cards (showing E, K, 4 and 7) must be turned over to
determine whether this rule (if a card shows a vowel on one face,
the opposite face is even) holds true?
Exercise 3
This is the Wason Selection Task
• If you chose E and 4
• You are in the typical majority group
• You are not correct
• You have been a victim of confirmation bias (System 1
thinking)
• If you chose E and 7
• Congratulations!
• Your choice was capable of falsifying the rule.
Exercise 3
Rule: If there is a vowel on one side of the card, there must be an
even number on the other side.
Card  Outcome (hidden face)  Rule
E     Even                   can be true even if rule false
E     Odd                    violated
K     Even                   n/a
K     Odd                    n/a
4     Vowel                  can be true even if rule false
4     Consonant              n/a
7     Vowel                  violated
7     Consonant              n/a
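The table's logic can be checked mechanically: a card is worth turning over only if some hidden face could violate the rule. A minimal sketch, assuming the four visible faces are E, K, 4 and 7 as in the table (the helper names are illustrative):

```r
# Visible faces of the four cards.
cards <- c("E", "K", "4", "7")

is_vowel  <- function(x) x %in% c("A", "E", "I", "O", "U")
is_number <- function(x) grepl("^[0-9]+$", x)
is_odd    <- function(x) {
  is_number(x) & (suppressWarnings(as.integer(x)) %% 2 == 1)
}

# The rule "vowel => even" can only be violated by a vowel/odd card, so
# only a visible vowel or a visible odd number can expose a violation.
can_falsify <- is_vowel(cards) | is_odd(cards)

print(cards[can_falsify])  # "E" "7"
```

Turning over 4 cannot falsify the rule, because the rule says nothing about what must be behind an even number.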
Exercise 3
• This is equivalent to functional classification, e.g.:
• Rule: If there is a CRN/RxLR/T3SS domain, the protein must
be an effector.
Exercise 3
• Confirmation Bias (Wason Selection Task)
• An uninformative experiment is performed
• http://en.wikipedia.org/wiki/Wason_selection_task
• Affirming the Consequent (a related formal fallacy)
1. If P, then Q
2. Q
3. Therefore, P
• Experimental results are misinterpreted
• http://en.wikipedia.org/wiki/Affirming_the_consequent
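That affirming the consequent is invalid can be shown by enumerating the truth table: there is a case where both premises hold and the conclusion is false. A short sketch (the column names are illustrative):

```r
# All combinations of truth values for P and Q.
tt <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))

# Premise 1: "if P then Q" is false only when P is true and Q is false.
tt$if_P_then_Q <- !tt$P | tt$Q

# Rows where both premises ("if P then Q" and "Q") are true.
premises_hold <- tt$if_P_then_Q & tt$Q
print(tt[premises_hold, ])

# The argument is valid only if P is true in every such row.
argument_valid <- all(tt$P[premises_hold])
print(argument_valid)  # FALSE
```

The counterexample is P false, Q true: in the effector example, a protein can be an effector (Q) without carrying the domain (P).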
Third Golden Rule of Bioinformatics
• Everyone has expectations of their data/experiment
• Beware cognitive errors, such as confirmation bias!
• System 1 vs. System 2 ≈ intuition vs. reason
• Think statistically!
• Large datasets can be counterintuitive and appear to confirm a
large number of contradictory hypotheses
• Always account for multiple tests.
• Avoid “data dredging”: intensive computation is not an
adequate substitute for expertise
• Use test-driven development of analyses and code
• Use examples that pass and fail
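The need to account for multiple tests can be illustrated with pure noise. The sketch below correlates one random "reference" profile with 1,000 random profiles of eleven time points each, then counts nominally significant results before and after a Benjamini-Hochberg adjustment. The simulation set-up, the seed and the 0.05 threshold are all assumptions chosen for illustration.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
n_genes <- 1000        # number of random profiles tested
n_timepoints <- 11     # same length as the expression profiles above

# No real relationship exists: every profile is independent noise.
reference <- rnorm(n_timepoints)
p_values <- replicate(
  n_genes,
  cor.test(reference, rnorm(n_timepoints))$p.value
)

# Count "significant" correlations with and without correction.
raw_hits      <- sum(p_values < 0.05)
adjusted_hits <- sum(p.adjust(p_values, method = "BH") < 0.05)

cat("Unadjusted hits:", raw_hits, "\n")
cat("BH-adjusted hits:", adjusted_hits, "\n")
```

With a 0.05 threshold, roughly 5% of the unadjusted tests are expected to look significant by chance alone, and each of them could seed a plausible-sounding biological "story".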
In Conclusion
• Always communicate!
• worst errors are silent
• Don’t trust the data
• formatting/validation/category errors - check!
• suitability for scientific question
• Don’t trust the software
• software is not an authority
• always benchmark, always validate
• Don’t trust yourself
• beware cognitive errors
• think statistically
• biological “stories” can be constructed from nonsense
