Automation of Biological Data Analysis and Report Generation
Dmitry Grapov, PhD
Bots write the darndest things
http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK
•fill in the template (easy)
•human-guided automation (e.g. MetaboAnalyst, intermediate)
•intelligent/reactive writing (e.g. ~AI, advanced)
http://narrativescience.com/
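The "fill in the template" level can be sketched in a few lines. This is a minimal illustration using Python's standard `string.Template`; the template wording, the field names and the event values are assumptions made up for the example, not taken from any real bot.

```python
from string import Template

# Hypothetical sentence template; the placeholders are illustrative only.
REPORT_TEMPLATE = Template("A magnitude $magnitude earthquake struck near $place.")


def fill_report(event: dict) -> str:
    """Return the template with every placeholder replaced by event values."""
    # substitute() raises KeyError when a field is missing, so an incomplete
    # record fails loudly instead of producing a half-filled sentence.
    return REPORT_TEMPLATE.substitute(event)


# Illustrative input record (made-up values).
event = {"magnitude": 2.7, "place": "Westwood, California"}
print(fill_report(event))
```

Failing on a missing field is a deliberate choice here: the bot either writes a complete sentence or hands the record back to a human.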
Humans + Bots
Interaction:
•Bots and humans combine in guided analyses
•Humans: make choices (based on bot guides)
•Bots: automate!
Facilitate:
•workflow logging and template creation
•reproducible results
Bot: Initial data and meta data parsing and quality validation (need: template input)
Human: data cleaning and experimental design identification (use: multiple choice, dynamic GUI)
Bot: instantiation of complex workflows
Human: overview of bot assumptions and results
Bot: Numerical and text output generation
Humans + Bots write darndender things?
Choose Your Own Life Adventure!
https://github.com/dgrapov/AdventureR
Data Analysis Tasks
Visualization (how does it look?)
• histograms, density plots, box plots, line plots, scatter plots, networks, etc.
Statistical Analysis (what is statistically significant?)
• summary tables, ANOVA, FDR adjustment, power analysis, etc.
Exploration (what are the major patterns/trends?)
• clustering, PCA, ICA, etc.
Predictive Modeling (what explains my hypothesis?)
• mixed effects, partial least squares (O-/PLS/-DA), etc.
Network Analysis and Mapping (how are things related?)
• Functional analysis: pathway enrichment or overrepresentation
• Networks: biochemical, structural, mass spectral and empirical networks
• Mapping: projection of analysis results onto network
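As one concrete instance of the statistical tasks listed above, FDR adjustment is commonly done with the Benjamini-Hochberg procedure. The sketch below is a plain-Python version of that procedure; the slides do not say which FDR method is used, so treating it as Benjamini-Hochberg is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values never
    # increase as the raw p-values decrease (monotonicity step).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, e.g. one per metabolite.
p = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(p))
```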
WCMC Data Analysis Reports™
Input template: BinBase
Tasks:
•inference of experimental goals from sample meta data
•mapping variables to external databases
Report:
•Statistical analysis
•Clustering
•PCA
•O-PLS-DA
•Biochemical enrichment
•Network mapping
Tools:
Automation Challenges
Data cleaning and quality validation
•use: quality control samples; identify: precision/accuracy, normalization, batch corrections; mitigate: outliers, missing values, batch effects, etc.
Identification of experimental goals
•use: meta data; identify: main and accessory effects; choose: statistics, multivariate tests and visualizations
Integration of multiple tasks to evolve robust analyses
•tasks: statistics, multivariate, functional, networks, database mapping, etc.
Data analysis report generation
•use: R, LaTeX, markdown
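The "identification of experimental goals" step can be illustrated with a simple heuristic over sample meta data. This is only a sketch under stated assumptions: the function name `suggest_tests`, the replicate threshold and the rule "two levels suggests a two-group test, more levels suggests one-way ANOVA" are all hypothetical choices made for the example, and a human would still review the suggestions.

```python
from collections import Counter


def suggest_tests(metadata, min_replicates=3):
    """Map each usable metadata column to a suggested univariate test.

    metadata: list of dicts, one per sample, e.g. {"diet": "chow"}.
    """
    suggestions = {}
    columns = sorted({key for row in metadata for key in row})
    for column in columns:
        counts = Counter(
            row[column] for row in metadata if row.get(column) is not None
        )
        # A candidate factor needs at least two levels, each with enough
        # replicates; unique IDs and constant columns are skipped.
        if len(counts) < 2 or min(counts.values()) < min_replicates:
            continue
        if len(counts) == 2:
            suggestions[column] = "two-group test (e.g. t-test)"
        else:
            suggestions[column] = "one-way ANOVA"
    return suggestions


# Made-up sample sheet: 9 samples, 2 sexes, 3 diets, unique IDs.
samples = [
    {
        "id": f"s{i}",
        "sex": "F" if i % 2 else "M",
        "diet": ["chow", "high-fat", "fasted"][i % 3],
    }
    for i in range(9)
]
print(suggest_tests(samples))
```

The output is a suggestion list rather than a decision, which fits the human-guided model: the bot proposes, the human confirms which effects are main and which are accessory.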
Challenges to automated metabolite ID mapping
Stereochemistry?
Search: catechin
Best Match: Catechin
Biologically relevant: D-catechin
Synonyms?
Search: UDP GlcNAc
FAIL: UDP GlcNAc
PASS: UDP-GlcNAc
Strategies for automated metabolite ID mapping (from synonym)
#1: CTS+
•Use CTS to translate from synonyms to KEGG (KID) and PubChem (CID)
•Use KEGGREST and PUG to filter and choose most appropriate IDs
•Use fuzzy matching and word similarity metrics (e.g. Damerau–Levenshtein distance)
#2: Web query
•Use KEGGREST + PubChem PUG to translate synonyms to IDs
•For KEGG ID: synonym → SID → KID
#3: Curated DB
•Generate a curated DB for KEGG and CID translations +
•Include InChI Keys
•Map to other DBs
•Allow fuzzy matching on synonyms
•e.g. IDEOM http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
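Fuzzy matching on synonyms can be sketched with an edit distance. The code below implements the optimal string alignment variant (often called restricted Damerau–Levenshtein), which counts insertions, deletions, substitutions and adjacent transpositions. The helper `best_match`, its case-insensitive comparison and its `max_distance` threshold are assumptions for illustration, not part of CTS, KEGGREST or IDEOM.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ca" <-> "ac".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def best_match(query, synonyms, max_distance=2):
    """Return the closest synonym (case-insensitive), or None if too far."""
    if not synonyms:
        return None
    scored = [(osa_distance(query.lower(), s.lower()), s) for s in synonyms]
    distance, match = min(scored)
    return match if distance <= max_distance else None


# The space-versus-hyphen synonym is one edit away from the stored name.
print(osa_distance("UDP GlcNAc", "UDP-GlcNAc"))
print(best_match("UDP GlcNAc", ["UDP-GlcNAc", "UDP-glucose"]))
```

A small threshold keeps near-miss spellings while rejecting unrelated names; string similarity on its own cannot choose between stereoisomers, so that case still needs structural identifiers such as InChI Keys or a human choice.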
Interactive Analysis and Report Generation
knitr (http://yihui.name/knitr/)
Analysis Report Generation
•Analysis on rails or open sandbox
•Humans facilitate robust results generation + Bots ensure reproduction
•Generation of Methods and Results should be automatable
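One way to picture automated Methods text is to log each analysis step as it runs and render the log into sentences afterwards. The sketch below assumes a hypothetical log format, step names and sentence templates; it is not how knitr or Devium work, only an illustration of the logging-then-writing idea.

```python
# Hypothetical step names and sentence templates, for illustration only.
METHOD_SENTENCES = {
    "normalize": "Data were normalized using {method}.",
    "anova": "Group differences were tested by ANOVA with {correction} adjustment.",
    "pca": "Principal components analysis was carried out on {scaling} data.",
}


def log_step(log, name, **params):
    """Append one analysis step and its parameters to the workflow log."""
    log.append({"step": name, "params": params})


def write_methods(log):
    """Render the logged steps, in order, as a Methods paragraph."""
    sentences = []
    for entry in log:
        template = METHOD_SENTENCES.get(entry["step"])
        if template is None:
            # Unknown steps are flagged for a human rather than guessed at.
            sentences.append(f"[describe step: {entry['step']}]")
        else:
            sentences.append(template.format(**entry["params"]))
    return " ".join(sentences)


workflow_log = []
log_step(workflow_log, "normalize", method="median scaling")
log_step(workflow_log, "pca", scaling="autoscaled")
print(write_methods(workflow_log))
```

Because the text is generated from the same log that drove the analysis, rerunning the workflow reproduces both the results and their description.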
Devium 2.0
Human-guided automated data analysis and report generator
Human-guided automation could help ensure robust results by leaving to humans the choices that are otherwise difficult to automate.
https://github.com/dgrapov/DeviumWeb
MetaMapR
Linking data analysis and biology
https://github.com/dgrapov/MetaMapR
Integration of complex workflows is key to automation.
Future Goals
+ Workflows for complex experiments (e.g. time-course)
+ Biochemical functional analysis (pathway enrichment)
+ GUI for report generation (Devium 2.0)
+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)
+ Scientific literature mining (RapportR)
+ Interactive plots and networks (JavaScript)
dgrapov@ucdavis.edu
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154
