Easier, Better, Faster, Stronger: Building ML Platforms to Accelerate Model Development for Biological Data
© 2018 – Freenome Inc. | Proprietary & confidential.
What’s in this talk?
● Challenges in applying machine learning (ML) to biological data
● What is an ML platform and why should I build one?
● Speedups for common research workflows

Challenges with Applying ML to Biological Data

Time spent doing ML on biological data
● Biological data is noisy and has tons of biases
● Interpretation: Are our results real? Will the model generalize?
Diagram inspired by Karpathy’s slide in ScaledML 2018

Confounding
● What is confounding?
○ Your model learns to predict based on signal that does not generalize to your use case
● This is especially prevalent in biological datasets!
○ Data may not represent the real world
○ Batch effects, lab process changes, sample sourcing, demographics
● Identifying confounding is difficult!
[Figure: three Data → Model examples]
○ Data → Model → good cross-validation (CV) performance
○ Data → Model learns "From institution A, older than 75, and male → cancer; from institution B, younger than 60, and female → healthy; ..." → good CV performance, bad real-world performance
○ Data → Model learns "Upregulation of genes A, B, C and downregulation of genes X, Y, Z → cancer; otherwise → healthy" → good CV performance, good real-world performance
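
One quick way to probe for the confounded case on this slide is to ask how well the suspected confounders alone predict the label. The sketch below is an illustration, not the platform's actual check: the function names, the field names (`institution`, `sex`, `label`), and the toy cohort are all assumptions. The score is computed in-sample, so it is optimistic and only useful as a warning signal.

```python
from collections import Counter, defaultdict


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def confounder_only_accuracy(samples, confounder_keys, label_key="label"):
    """Accuracy of predicting the label from confounder values alone.

    Every distinct combination of confounder values predicts its own
    majority label. No biological features are used at all.
    """
    labels_by_group = defaultdict(list)
    for sample in samples:
        group = tuple(sample[key] for key in confounder_keys)
        labels_by_group[group].append(sample[label_key])
    correct = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_group.values()
    )
    return correct / len(samples)


# Hypothetical toy cohort: institution and sex almost determine the label.
samples = (
    [{"institution": "A", "sex": "M", "label": "cancer"}] * 9
    + [{"institution": "A", "sex": "M", "label": "healthy"}] * 1
    + [{"institution": "B", "sex": "F", "label": "healthy"}] * 9
    + [{"institution": "B", "sex": "F", "label": "cancer"}] * 1
)

baseline = majority_baseline_accuracy([s["label"] for s in samples])
confounded = confounder_only_accuracy(samples, ["institution", "sex"])
print(f"majority baseline: {baseline:.2f}, confounders only: {confounded:.2f}")
```

When the confounder-only score sits far above the majority baseline, a model with good CV performance may simply be rediscovering sample sourcing or demographics.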

Sharing experiments leads to more efficiency
In all cases
● Proposed techniques can be tested in a variety of use cases by many people when added to the platform
● Well-established techniques can be used as building blocks in future analysis
● A new technique is used only when we see reproducible increases in performance and/or decreases in confounding
When using bio data
● By default, suspect confounders in any good result
○ A result stands only if it holds up against a suite of shared confounder analyses
● Focus on specific biological hypotheses or measurements which give insight into mechanisms
● Build visualization tools specifically for dataset statistics & low dimensional data representations vs confounders

What is an ML platform and why should I build one?

From the user’s perspective
[Diagram: Experiment specification → ML Platform → Trained Model, Classification Metrics, Visualization → Scientists interpret results]
Experiment specification defines:
● Data
● Preprocessing/pretraining models
● Classification model
● Validation scheme
Classification Metrics reports:
● Aggregate performance
● Performance on different subgroups of the data
● Classification confounding indicators
Trained Model persists a model that can be reloaded and used on incoming data
Visualization creates plots of:
● Dataset statistics
● Low dimensional data embeddings by suspected confounders
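
To make the user's view concrete, here is a minimal sketch of what an experiment specification and a subgroup report could look like. The slides do not show the real format, so the `ExperimentSpec` fields, the `report` function, and the toy predictions are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """Hypothetical specification; field names are illustrative only."""
    dataset: str               # which data to fetch
    preprocessing: tuple       # ordered preprocessing/pretraining step names
    model: str                 # classification model name
    validation: str            # validation scheme name
    subgroup_keys: tuple = ()  # metadata fields to break metrics down by


def accuracy(rows):
    """Fraction of rows whose prediction matches the true label."""
    return sum(r["y_true"] == r["y_pred"] for r in rows) / len(rows)


def report(spec, rows):
    """Aggregate accuracy plus accuracy per subgroup for each key in the spec."""
    result = {"aggregate": accuracy(rows)}
    for key in spec.subgroup_keys:
        groups = defaultdict(list)
        for row in rows:
            groups[row[key]].append(row)
        result[key] = {name: accuracy(members) for name, members in groups.items()}
    return result


spec = ExperimentSpec(
    dataset="toy_cohort",
    preprocessing=("log_transform",),
    model="logistic_regression",
    validation="5_fold_cv",
    subgroup_keys=("institution",),
)

# Hypothetical held-out predictions with sample metadata attached.
predictions = [
    {"y_true": 1, "y_pred": 1, "institution": "A"},
    {"y_true": 1, "y_pred": 1, "institution": "A"},
    {"y_true": 0, "y_pred": 0, "institution": "A"},
    {"y_true": 0, "y_pred": 0, "institution": "A"},
    {"y_true": 1, "y_pred": 0, "institution": "B"},
    {"y_true": 0, "y_pred": 1, "institution": "B"},
    {"y_true": 1, "y_pred": 1, "institution": "B"},
    {"y_true": 0, "y_pred": 0, "institution": "B"},
]

metrics = report(spec, predictions)
print(metrics)
```

A large gap between subgroups, as between the two institutions here, is the kind of signal the aggregate number hides.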

Let’s take a look under the hood
[Architecture diagram: Experiment specification → ML Platform → Visualizations, Metrics]
Components inside the ML Platform:
● Experiment runner
● Data fetcher
● Experiment saver
● Preprocessing library
● Visualizer
● Pretraining library
● Model
● Model library
● Fold creation
● Visualization library
● Metrics library
Key: Exploratory analysis/visualization, Prediction
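
The components in this diagram can be pictured as small libraries that an experiment runner looks up by name. The following is a rough sketch under that assumption; every function, class, and dictionary name is a hypothetical stand-in, and the pretraining, saving, and visualization components are left out for brevity.

```python
def fetch_toy_dataset():
    """Data fetcher: returns (feature, label) pairs."""
    return [(float(x), int(x > 5)) for x in range(10)]


def interleaved_folds(n_samples, n_folds):
    """Fold creation: sample i is held out in fold i % n_folds."""
    for fold in range(n_folds):
        train = [i for i in range(n_samples) if i % n_folds != fold]
        test = [i for i in range(n_samples) if i % n_folds == fold]
        yield train, test


class ThresholdModel:
    """Model: thresholds one feature midway between the two class means."""

    def fit(self, pairs):
        negatives = [x for x, y in pairs if y == 0]
        positives = [x for x, y in pairs if y == 1]
        mean_neg = sum(negatives) / len(negatives)
        mean_pos = sum(positives) / len(positives)
        self.threshold = (mean_neg + mean_pos) / 2
        return self

    def predict(self, x):
        return int(x > self.threshold)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Each library is a name -> implementation lookup.
DATA_FETCHERS = {"toy": fetch_toy_dataset}
PREPROCESSING_LIBRARY = {"halve": lambda x: x / 2}
FOLD_CREATORS = {"interleaved": interleaved_folds}
MODEL_LIBRARY = {"threshold": ThresholdModel}
METRICS_LIBRARY = {"accuracy": accuracy}


def run_experiment(spec):
    """Experiment runner: fetch, preprocess, cross-validate, score."""
    data = DATA_FETCHERS[spec["data"]]()
    for step in spec["preprocessing"]:
        data = [(PREPROCESSING_LIBRARY[step](x), y) for x, y in data]
    y_true, y_pred = [], []
    folds = FOLD_CREATORS[spec["validation"]](len(data), spec["n_folds"])
    for train, test in folds:
        model = MODEL_LIBRARY[spec["model"]]().fit([data[i] for i in train])
        for i in test:
            x, y = data[i]
            y_true.append(y)
            y_pred.append(model.predict(x))
    return {name: METRICS_LIBRARY[name](y_true, y_pred) for name in spec["metrics"]}


spec = {
    "data": "toy",
    "preprocessing": ["halve"],
    "validation": "interleaved",
    "n_folds": 5,
    "model": "threshold",
    "metrics": ["accuracy"],
}
results = run_experiment(spec)
print(results)
```

Because the runner only ever looks things up by name, adding one entry to a library makes it available to every experiment without touching the runner.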

ML Platform makes experimental iteration faster
● Scientists can focus on their speciality instead of infrastructure
● Scientists can easily build on and reproduce the findings of fellow scientists
● The most robust interpretation techniques are uniformly applied to every finding

Common workflows

An ML platform makes idea exploration a 1-step process
In the next series of slides, we’ll show that building upon the existing corpus of shared work makes new exploration faster.

Perform exploratory analysis on a new preprocessing method
[Architecture diagram repeated from "Let’s take a look under the hood". Key: Reused Infrastructure]
● Write one new function in the preprocessing library for testing data assumptions
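
As an illustration of "one new function", the sketch below registers preprocessing steps in a shared lookup with a decorator. The registry, the decorator, and both transforms are assumptions made for this example; the slides do not describe how the real preprocessing library is organized.

```python
import math

# Hypothetical shared registry: step name -> function over a list of values.
PREPROCESSING_LIBRARY = {}


def register_preprocessor(name):
    """Decorator that adds a function to the shared preprocessing library."""
    def decorator(fn):
        if name in PREPROCESSING_LIBRARY:
            raise ValueError(f"preprocessor {name!r} is already registered")
        PREPROCESSING_LIBRARY[name] = fn
        return fn
    return decorator


@register_preprocessor("zscore")
def zscore(values):
    """Existing step: center to zero mean and scale to unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


@register_preprocessor("log1p")
def log1p_counts(values):
    """The one new function: log(1 + x), testing the assumption that
    count-like data is better behaved on a log scale."""
    return [math.log1p(v) for v in values]


def apply_steps(values, step_names):
    """Apply registered steps in order, as an experiment runner might."""
    for name in step_names:
        values = PREPROCESSING_LIBRARY[name](values)
    return values


raw_counts = [0, 9, 99, 999]
processed = apply_steps(raw_counts, ["log1p", "zscore"])
print(processed)
```

Everything else (fetching, folds, models, metrics) stays untouched, which is the point of the slide.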

Perform exploratory analysis on a new data type
[Architecture diagram repeated from "Let’s take a look under the hood"]
● Add new data to the cloud for testing different features

Create a new validation scheme
[Architecture diagram repeated from "Let’s take a look under the hood"]
● Write one new function for testing a new kind of validation
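
A validation scheme that fits the confounding concerns raised earlier is to hold out one whole group (for example, one institution) per fold, so the model is always scored on a source it never saw. The function below is a sketch of such a fold creator with hypothetical names. scikit-learn's `LeaveOneGroupOut` provides the same kind of splitter if that library is available.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group.

    `groups` holds one group label per sample, e.g. the source institution.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical sample-to-institution assignment.
institutions = ["A", "A", "B", "B", "B", "C"]
folds = list(leave_one_group_out(institutions))

for held_out, train, test in folds:
    print(f"hold out {held_out}: train={train} test={test}")
```

If performance drops sharply under this scheme compared with ordinary random folds, the model was likely leaning on institution-specific signal.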

Create a new model and test how it compares to other models
[Architecture diagram repeated from "Let’s take a look under the hood"]
● Write a new model and add it to the model library for testing inference methods
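
Comparing a new model against existing ones is straightforward when every model shares one interface and is scored on identical folds. The sketch below assumes a `fit`/`predict` interface and a dictionary-based model library; both models, the data, and the folds are toy examples invented for illustration.

```python
from collections import Counter


class MajorityClassModel:
    """Existing baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroidModel:
    """New model: predicts the class whose mean feature value is closest."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda label: abs(x - self.centroids_[label]))
            for x in X
        ]


MODEL_LIBRARY = {"majority_class": MajorityClassModel}
MODEL_LIBRARY["nearest_centroid"] = NearestCentroidModel  # the one addition


def compare_models(X, y, folds):
    """Score every model in the library on the same folds."""
    scores = {}
    for name, model_cls in MODEL_LIBRARY.items():
        correct = total = 0
        for train, test in folds:
            model = model_cls().fit([X[i] for i in train], [y[i] for i in train])
            predictions = model.predict([X[i] for i in test])
            correct += sum(p == y[i] for p, i in zip(predictions, test))
            total += len(test)
        scores[name] = correct / total
    return scores


X = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1]
y = [0, 0, 0, 0, 1, 1, 1, 1]
folds = [([0, 1, 4, 5], [2, 3, 6, 7]), ([2, 3, 6, 7], [0, 1, 4, 5])]

scores = compare_models(X, y, folds)
print(scores)
```

Keeping the folds fixed across models means score differences reflect the models rather than the split.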

Takeaways
● ML in biotech needs robust infrastructure for model interpretation
● Building on top of the findings from fellow scientists decreases iteration time and improves results
● This benefit requires a commitment to a scientific platform

Freenome's Biological Machine Learning Platform
