Understanding Biological Function in Times of High Throughput and Low Output

Iddo Friedberg
Iowa State University
http://iddo-friedberg.net
@iddux

Big Data in my lab
Gene block evolution
Images and Genomes
Host/Microbiome
Database error and bias
Critical Assessment of Protein Function Annotations

Big Data in my lab
Database error and bias
Critical Assessment of Protein Function Annotations
Understanding methods
Understanding the data

Understanding Methods: The Critical Assessment of Protein Function Annotations
Pedja
Wyatt
Sean
Tal
Alex

Large Data Biology has a Bad Rap?
"So we now have a culture which is based on
everything must be high-throughput. I like to
call it low-input, high-throughput, no-output
biology"
– Sydney Brenner

Motivation: The Knowledge Gap
● The gap between data and information
[Figure: Data vs. Information]
Temperton & Giovannoni Curr. Opin. Microbiology (2012)

Errors Accumulate in Databases
Schnoes A et al (2009) PLoS Computational Biology, 5 (12)

Assigning Function to Proteins
Low-ish throughput
High throughput
Machine learning

Most Proteins are Annotated Electronically
[Figure: share of annotations by evidence type (Experimental, Computational, Electronic, Other), axis 0 to 100, for Arabidopsis, Mouse, Cow, Zebrafish, Chicken, Human]
Compiled from the GOA project, EBI, 6/2011

Problems
● Most genes are annotated electronically
● Databases have a high error rate which is growing
● Homology transfer is less effective
Solutions?
Assess accuracy of annotation software
Write better software

Challenges in Picking Targets
● Can't use databases: circularity problem
● Experimental groups have a small “sharing timeframe”
● Function description too vague for precise GO annotation
● There are “unknown unknowns”
Choose an annotated protein
Prediction method uses said annotation to predict function
Circular logic... is circular

Choosing Assessment Benchmarks
Time →    Challenge opens     Submission deadline       Assessment time
          Function unknown    Function still unknown    Function still unknown
          Function unknown    Function still unknown    Function known (Benchmark?)
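The timeline above suggests a simple selection rule: a target can serve as a benchmark only if it had no experimental annotation at the submission deadline and gained one by assessment time. A minimal sketch of that rule follows. The function name `select_benchmark`, the snapshot dictionaries, and the protein IDs are hypothetical, made up for illustration; they are not the CAFA organizers' actual pipeline.

```python
from typing import Dict, Set

# Hypothetical snapshot type: protein ID -> experimentally supported GO terms.
Snapshot = Dict[str, Set[str]]


def select_benchmark(targets: Set[str],
                     at_deadline: Snapshot,
                     at_assessment: Snapshot) -> Snapshot:
    """Keep targets that were unannotated at the deadline and annotated later."""
    benchmark: Snapshot = {}
    for protein in targets:
        before = at_deadline.get(protein, set())
        after = at_assessment.get(protein, set())
        # No annotation before the deadline avoids circularity;
        # a new annotation afterwards gives something to score against.
        if not before and after:
            benchmark[protein] = set(after)
    return benchmark


# Illustrative data only (assumed, not from the talk).
targets = {"P1", "P2", "P3"}
at_deadline = {"P3": {"GO:0005515"}}          # P3 was already annotated
at_assessment = {"P2": {"GO:0003677"},        # P2 gained an annotation
                 "P3": {"GO:0005515", "GO:0005524"}}

benchmark = select_benchmark(targets, at_deadline, at_assessment)
print(benchmark)  # {'P2': {'GO:0003677'}}
```

P1 is dropped because its function is still unknown at assessment time, and P3 is dropped because predictors could have seen its annotation.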

Molecular Function precision/recall
[Figure: precision/recall curves, with BLAST and Naive baselines]

Biological Process precision/recall
[Figure: precision/recall curves, with BLAST and Naive baselines]
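For readers unfamiliar with how a precision/recall curve is traced for function prediction, the sketch below scores one protein's predicted GO terms against its experimentally determined terms at several score thresholds. It is a simplified illustration under stated assumptions: the function name, scores, and terms are hypothetical, and the ontology structure and averaging over many proteins, which a real assessment would need, are ignored.

```python
from typing import Dict, Optional, Set, Tuple


def precision_recall(predicted_scores: Dict[str, float],
                     true_terms: Set[str],
                     threshold: float) -> Tuple[Optional[float], Optional[float]]:
    """Precision and recall for one protein at one score threshold.

    Returns None for a quantity whose denominator is zero.
    """
    # Terms scored at or above the threshold count as predicted.
    predicted = {t for t, s in predicted_scores.items() if s >= threshold}
    true_positives = len(predicted & true_terms)
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / len(true_terms) if true_terms else None
    return precision, recall


# Illustrative scores and annotations (assumed, not from the talk).
scores = {"GO:0005515": 0.9, "GO:0003677": 0.6, "GO:0005524": 0.2}
truth = {"GO:0003677", "GO:0046872"}

# Sweeping the threshold yields the points of a precision/recall curve.
for tau in (0.1, 0.5, 0.8):
    print(tau, precision_recall(scores, truth, tau))
```

Lowering the threshold admits more predicted terms, which usually trades precision for recall.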

Case Study: hPNPase
Gadi Schuster (Technion)

Successful Methods?
[Figure: log(obs/exp) per method category, panels: Biological Process, Molecular Function; axis from -0.8 to 0.8. Categories: profile-profile alignments, literature, ortholog, sequence properties, protein interactions, gene expression, phylogeny, sequence alignments, other functional information, machine learning based method, sequence-profile alignments]
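One plausible reading of the log(obs/exp) axis is an enrichment score: how often a method category appears among the top-performing methods, compared with how often it would appear there if top methods were a random sample of all methods. The sketch below works under that assumption. The function name, the base-10 logarithm, and the counts are all assumptions for illustration, not values taken from the figure.

```python
import math


def log_obs_exp(top_count: int, top_total: int,
                all_count: int, all_total: int) -> float:
    """log10 of observed over expected use of a category among top methods."""
    if min(top_count, top_total, all_count, all_total) <= 0:
        raise ValueError("all counts must be positive")
    # Expected count if top methods were a random draw from all methods.
    expected = top_total * all_count / all_total
    return math.log10(top_count / expected)


# Hypothetical counts: a category used by 6 of 10 top methods
# and by 30 of 100 methods overall.
score = log_obs_exp(top_count=6, top_total=10, all_count=30, all_total=100)
print(round(score, 3))  # 0.301
```

A positive value means the category is over-represented among the top methods, and a negative value means it is under-represented.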

CAFA2 vs. CAFA1
CAFA2 was held in 2014-2015
More targets (100,000 vs. 50,000)
More groups (56 vs. 29)

CAFA2 vs. CAFA1
CAFA2 was held in 2014-2015
100,000 targets
147 participants
Methods have improved

CAFA Conclusions & What's Next
● Homology transfer still rules.
● Combined methods work best
● Molecular Function is easier to predict than Biological Process
● Generally, the field can use improvement
● Comparison of metrics is very much needed
– Why do methods perform differently under different metrics?
– Is there a “best” metric? What is “best”?
● Databases are biased

Leaf terms Molecular Function
protein binding
protein homodimerization activity
zinc ion binding
transcription activator activity
chromatin binding
transcription repressor activity
transcription factor activity
two-component sensor activity
specific transcriptional repressor activity
DNA binding
calcium ion binding
identical protein binding
manganese ion binding
ATP binding
beta-galactoside alpha-2,3-sialyltransferase activity
magnesium ion binding
enzyme binding
electron carrier activity
structural constituent of ribosome
metal ion binding
David Ream (MU)
Alexander Thorman (MU)
Alexandra Schnoes (UCSF)
Protein Binding
Activity

Annotations per article
Schnoes et al PLoS Comp Biol (2013)

Information is in an inverse relationship to the number of proteins annotated
[Figure: information content vs. number of proteins annotated per study (bins: 1, <10, <100, ≥100); panels: Molecular Function, Biological Process, Cellular Component. Schnoes et al (2013)]
Single throughput (1 protein/study): High information (12 bits)
High throughput (≥ proteins/study): Low information (3 bits)

High Throughput Experiments
The Good
● Exclusively annotate genes otherwise unknown
● Fewer $$$
● Fast results
● Consistency
The Bad
● Bias our knowledge
● Bias priors for function prediction programs
● Are less informative than low-throughput experiments

Data that Will Drive Computation
Current
● Fragment-based sequencing
● Proteomics
● Metabolomics
● Documents
● Network data
● Images
Future
● Whole chromosome sequencing
● Epigenomics
● Integrating images: phenomics → genomics relationships

Thank you
http://iddo-friedberg.net
