Agile large-scale machine-learning pipelines in drug discovery

Ola Spjuth
Department of Pharmaceutical Biosciences and Science for Life Laboratory
Uppsala University, Sweden
ola.spjuth@farmbio.uu.se

Outline
• My research in perspective
• Our approach to machine learning in ligand-based
modeling
• Challenges when data grows
• Automation workflows/pipelines
• HPC, Cloud Computing and Big Data Analytics

From data to insights
• We have access to a wealth
of information
• Data mining and predictive
modeling can be useful

History: Bioclipse – an open source
workbench for the life sciences
O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg.
Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397
Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source
workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.

How is the compound
metabolized?
Are any of its metabolites
reactive/toxic?
Here?
Here?
Is it toxic?
Chemical liabilities (drug safety, alerts)
Adverse effects?
Can we, based on existing experimental studies, IT,
and statistical models, predict the outcome for new
compounds?

Starting out in 2008 with a challenge:
• Build a system with predictive models which runs on
the client
– Initial problem: Site-of-metabolism prediction

Site-of-metabolism (SOM) predictions – MetaPrint2D
L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means
of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362
Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J
Chem Inf Model. 2007 Mar-Apr;47(2):583-90.
[Figure labels: Reaction database; Mapping; MetaPrint2D database; Circular fingerprints; Highest, medium, and low probability of metabolism]

Next challenge: Extend to general predictive models
• Fast predictive models, allow for instant updates
upon structural changes
• Span from virtual screening to lead optimization

Bioclipse Decision Support
• Integrate various predictive methods
– Similarity searches (InChI, signatures, fingerprints)
– Structural alerts (toxicophores)
– QSAR models (classification, regression)
• Visual interpretation
– Highlight important substructures
O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and S. Boyer. Integrated decision support
for assessing chemical liabilities. Accepted in J. Chem. Inf. Model., 2011.

Ligand-based predictive modeling
Quantitative Structure-Activity
Relationship (QSAR)
– Start with a dataset of
chemical structures with
measured property to model
(inhibition, toxicity, etc)
– Describe chemicals using
descriptors
– Make use of statistical
modeling to relate chemical
structures to a response

Machine learning pipelines
Preprocessing
Model building
Validation
Reporting
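The four stages above can be sketched as plain functions chained together. This is only an illustration under stated assumptions: the single-descriptor least-squares model, the toy data, and all function names are hypothetical and are not the methods used in the talk.

```python
from statistics import mean

def preprocess(rows):
    """Drop rows with a missing response and scale the descriptor to [0, 1]."""
    rows = [(x, y) for x, y in rows if y is not None]
    lo = min(x for x, _ in rows)
    hi = max(x for x, _ in rows)
    return [((x - lo) / (hi - lo), y) for x, y in rows]

def build_model(train):
    """Fit y = a + b*x by ordinary least squares (one descriptor)."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def validate(model, test):
    """Root-mean-square error on held-out data."""
    return mean((model(x) - y) ** 2 for x, y in test) ** 0.5

def report(rmse):
    """Format the validation result as a one-line summary."""
    return f"RMSE on held-out data: {rmse:.3f}"

def run_pipeline(rows, n_test=2):
    """Chain the stages; the last n_test rows are held out for validation."""
    data = preprocess(rows)
    train, test = data[:-n_test], data[-n_test:]
    model = build_model(train)
    return report(validate(model, test))

# Toy data: (descriptor value, measured response); None marks a missing value.
toy = [(1.0, 2.0), (2.0, 4.1), (3.0, None), (4.0, 8.0), (5.0, 9.9), (6.0, 12.1)]
summary = run_pipeline(toy)
```

Keeping each stage as a separate function is what makes it possible to swap one component (for example the model) without touching the others.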

QSAR modeling
• Signatures [1] descriptor in CDK [2]
– Canonical representation of atom
environments
• Support Vector Machine (SVM)
– Robust modeling
[1] Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and
Computer Sciences, 2003, 43, 707-720.
[2] Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E.
Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.
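To make "canonical representation of atom environments" concrete, here is a deliberately simplified sketch. It is not the CDK implementation or the full algorithm of Faulon et al.: it ignores bond orders and ring closures, and the dictionary-based molecule encoding is an assumption made for illustration. Each atom is described by a string built from its neighbours up to a given height, with child strings sorted so the result does not depend on atom numbering.

```python
from collections import Counter

def atom_signature(atoms, bonds, atom, height, parent=None):
    """Canonical string for the environment of `atom` up to `height` bonds away."""
    if height == 0:
        return atoms[atom]
    # Sorting the child strings makes the result independent of atom order.
    children = sorted(
        atom_signature(atoms, bonds, nb, height - 1, parent=atom)
        for nb in bonds[atom] if nb != parent
    )
    return atoms[atom] + ("(" + "".join(children) + ")" if children else "")

def signature_descriptor(atoms, bonds, height):
    """Count how often each atom signature occurs in the molecule."""
    return Counter(atom_signature(atoms, bonds, a, height) for a in atoms)

# Heavy atoms of ethanol: C0-C1-O2 (hydrogens omitted).
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {0: [1], 1: [0, 2], 2: [1]}
ethanol_h1 = signature_descriptor(atoms, bonds, height=1)

# The same molecule with a different atom numbering gives the same counts.
atoms_renumbered = {0: "O", 1: "C", 2: "C"}
bonds_renumbered = {0: [1], 1: [2, 0], 2: [1]}
ethanol_h1_renumbered = signature_descriptor(atoms_renumbered, bonds_renumbered, height=1)
```

The resulting counts form a sparse vector that can be fed to a learner such as an SVM.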

Local interpretation of nonlinear QSAR models
• Method
– Compute gradient of
decision function for
prediction
– Extract descriptor(s) with
largest component in the
gradient
• Demonstrated on RF, SVM,
and PLS
Carlsson, L., Helgee, E. A., and Boyer, S.
Interpretation of nonlinear QSAR models applied to Ames mutagenicity data.
J Chem Inf Model 49, 11 (Nov 2009), 2551–2558.
Lars Carlsson,
AstraZeneca R&D
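The two method steps (compute the gradient of the decision function, then pick the descriptor with the largest component) can be sketched numerically. The decision function below is a hypothetical stand-in for a trained model, and the central-difference gradient is an assumption of this sketch rather than the exact procedure of the cited paper.

```python
import math

def decision_function(x):
    """Toy nonlinear model standing in for a trained model (hypothetical)."""
    return math.tanh(0.1 * x[0] + 2.0 * x[1] - 0.5 * x[2] * x[2])

def gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def most_important_descriptor(f, x):
    """Index and value of the gradient component with the largest magnitude."""
    g = gradient(f, x)
    i = max(range(len(g)), key=lambda k: abs(g[k]))
    return i, g[i]

# Explain one prediction: which descriptor would change the output the most?
index, component = most_important_descriptor(decision_function, [1.0, 0.0, 0.2])
```

Because the gradient is evaluated at the compound being predicted, the explanation is local: a different compound can single out a different descriptor.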

Next challenge: Simple model building
• Build a solution where:
– Scientists can build accurate models without modeling
expertise, in order to aid their decision making
– Combine these models with other models

Simple model building with graphical wizards

Next challenge: Predict using distributed services
• OpenTox - European project for creating an
interoperable framework for toxicity predictions
• Academia and industry
• Parts
– Ontology and API
– Query and invocation of predictive services
– Methods and algorithms
– Authentication and authorization

Bioclipse Decision Support
Model
discovery
predictions

Bioclipse and OpenTox
Collaboration with

Summary of Bioclipse Decision Support
• Flexible, general method
– Apply to any collection of molecules
• State-of-the-art machine-learning methods
• Handles large data sets
• Fast predictions

Advantages of the Decision Support (DS) method
• Fast: Can run on local computer
– “Instant predictions”, “calculate as you draw”
• Interpretable results: Can be used for
hypothesis generation
• General: Apply any modeling technique to any
data set
• Extensible: Very easy to add new components
• Open: Free, open source

Observations
• Predictive drug discovery is becoming
data-intensive
– High throughput technologies
• Drug/chemical screening
• Molecular biology (omics)
– More and bigger publicly available data
sources
• Data is continuously updated
→ We need scalable and automated
methods for predictive modeling

Challenges with bigger data sets for machine learning
• Modeling time increases
– Reduce/avoid parameter tuning
– Run on high-performance e-infrastructures
– Use approximate methods
• Not all implementations can handle large data set sizes
– Use sparse implementations
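As an illustration of why sparse implementations help, the sketch below stores a fingerprint as a mapping from index to count, so memory and computation scale with the number of non-zero entries rather than with the full descriptor length. The helper names and the toy vectors are hypothetical.

```python
def to_sparse(dense):
    """Keep only non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(a, b):
    """Dot product that only visits indices stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

# Two mostly-zero fingerprints; real ones can have very many positions.
fp_a = to_sparse([0, 3, 0, 0, 1, 0, 0, 0])
fp_b = to_sparse([0, 2, 0, 5, 0, 0, 0, 0])
similarity = sparse_dot(fp_a, fp_b)
```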

Determine parameter intervals
for modeling (sweetspot)
J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg.
Benchmarking study of parameter variation when using signature fingerprints together
with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211–3217.
SVM: Cost and Gamma parameters
Signatures: Heights
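Narrowing the parameter intervals matters because the number of models to train is the product of the number of values tried for each parameter. The sketch below enumerates such a grid; the specific exponents and heights are hypothetical placeholders, not the intervals reported in the benchmarking study.

```python
from itertools import product

def log2_grid(low_exp, high_exp, step=1):
    """Values 2**low_exp ... 2**high_exp, a common way to scan SVM parameters."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1, step)]

def parameter_grid(costs, gammas, heights):
    """All combinations of SVM cost, SVM gamma, and signature height."""
    return [
        {"cost": c, "gamma": g, "height": h}
        for c, g, h in product(costs, gammas, heights)
    ]

# Hypothetical intervals chosen only for illustration.
grid = parameter_grid(
    costs=log2_grid(0, 4, step=2),      # 1, 4, 16
    gammas=log2_grid(-9, -3, step=3),   # 2**-9, 2**-6, 2**-3
    heights=[1, 2, 3],
)
```

With three values per parameter the grid already holds 27 combinations, each of which is a full model-building run on a large data set.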

Example 1: Modeling a large number of observations
Jonathan Alvarsson

Example 2: Target predictions
Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L,
Wikberg JE, Noeske T. Ligand-based target prediction
with signature fingerprints.
J Chem Inf Model. 2014 Oct 27;54(10):2647-53

Challenge with running on HPC
• Reduce manual work
– Automate data preprocessing and modeling
– Support modeling life cycle (build, validate, document,
version, publish, re-train …)
• Automating model building is not trivial
– Aim: Agile, component-based architecture

Example application: Training on a large
number of datasets
Aim: Build models for hundreds
of targets
– Challenge to extract data
– Challenge to automate model
building
Data sources
Samuel Lampa

Automating analysis on HPC clusters
• Workflow systems can aid
development and deployment
• We used the Luigi workflow system
• Integrate with queuing system
(SLURM)
Train and
assess model
Samuel Lampa
https://github.com/spotify/luigi
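The core idea of a workflow system, tasks that declare their dependencies and are skipped when their output already exists, can be sketched in plain Python. The classes below are hypothetical stand-ins and not Luigi's actual API; in a real setup each `run` would typically submit a batch job through the queuing system (SLURM), which is not shown here.

```python
class Task:
    """Minimal stand-in for a workflow task (not Luigi's actual classes)."""
    requires = ()

    def __init__(self, store):
        self.store = store  # dict acting as the output file system

    def complete(self):
        return type(self).__name__ in self.store

    def run(self):
        raise NotImplementedError

def build(task_cls, store, log):
    """Run dependencies first and skip tasks whose output already exists."""
    task = task_cls(store)
    if task.complete():
        return store[task_cls.__name__]
    inputs = [build(dep, store, log) for dep in task_cls.requires]
    store[task_cls.__name__] = task.run(*inputs)
    log.append(task_cls.__name__)
    return store[task_cls.__name__]

class ExtractData(Task):
    def run(self):
        # Toy data set: (descriptor value, class label).
        return [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]

class TrainModel(Task):
    requires = (ExtractData,)

    def run(self, data):
        # "Model": a threshold halfway between the two class means.
        neg = [x for x, y in data if y == 0]
        pos = [x for x, y in data if y == 1]
        return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

class AssessModel(Task):
    requires = (ExtractData, TrainModel)

    def run(self, data, threshold):
        # Toy assessment on the training data itself, for brevity only.
        hits = sum((x > threshold) == bool(y) for x, y in data)
        return hits / len(data)

store, log = {}, []
accuracy = build(AssessModel, store, log)
build(AssessModel, store, log)  # second call finds every task complete
```

The log records each task exactly once, which is the property that lets an interrupted run on a cluster resume where it stopped.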

Example ML pipeline
(unpublished data)

Publishing models
• Publish models for easy access
and consumption
• We used the p2 (OSGi) provisioning
system
[Diagram labels: model versions v. 1.1, v. 1.2, v. 1.3; Use models]

Reactive/continuous modeling
[Diagram labels: Data sources; Coordinate; Integrate; Version; Monitor; Train and assess model; Publish models; Archive models; User (Bioclipse)]

Model-building workflows (WFs) on HPC are not trivial
• Many workflow systems exist
– DSLs vs APIs
– Dynamic input/output in e.g. cross-validation not
supported out of the box
• Time-consuming to create WFs
• Workflows can be useful but are not (yet) the silver
bullet we sought
O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen,
M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D.
Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.
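The point about dynamic input/output in cross-validation can be made concrete: the set of output files depends on the number of folds, which is only known at run time, so a workflow that must declare its outputs statically cannot express it directly. The sketch below is a hypothetical illustration; the file naming scheme and the data set name are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def fold_outputs(dataset_name, n_samples, k):
    """Output names a workflow would have to declare; they depend on k."""
    return {
        f"{dataset_name}.fold{i}.model": (len(train), len(test))
        for i, (train, test) in enumerate(k_fold_indices(n_samples, k))
    }

# Hypothetical data set of 10 compounds evaluated with 3-fold cross-validation.
outputs = fold_outputs("target_x", n_samples=10, k=3)
```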

Could cloud computing improve things?

QSAR Modeling on Amazon Elastic Compute Cloud (EC2)
[Figure: time (hours) versus number of cores (1, 2, 4, 8, 16), with one series each for 20k, 75k, 150k, and 300k]
B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson,
and O. Spjuth. “Scaling predictive modeling in drug
development with cloud computing”.
J. Chem. Inf. Model., 2015, 55 (1), pp 19-25

Private clouds
• We set up an OpenStack system at UPPMAX (our HPC
center)
• Primarily Infrastructure as a Service (IaaS) – users can
run virtual machines
• Platform-as-a-Service (PaaS): Hadoop and Spark
– Our question: Can this be useful for model building?

Managing Virtual Machine Images
• Open catalogue of virtual machine images (VMIs)
• Hosted at Uppsala University
www.bioimg.org
M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth.
BioImg.org: A catalogue of virtual machine images for the life sciences.
Accepted in Bioinformatics and Biology Insights.

Cloud computing enables Big Data Analytics
• Hadoop
– Open-source MapReduce implementation, suited for massively parallel
tasks
– Distributed execution, high availability, fault tolerant, can
be run on commodity hardware
– E.g. Google, Facebook and Twitter use it
• Hadoop Distributed File System (HDFS) distributes data across
nodes; computing is done in parallel
– “bring computations to data”
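The MapReduce pattern itself fits in a few lines. The sketch below runs in a single process and only mimics the three phases (map, shuffle, reduce); it does not use Hadoop, and the records and function names are hypothetical. Each inner list stands for the block of data held by one node.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, 1) for every element symbol in one record."""
    return [(symbol, 1) for symbol in record.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

def map_reduce(partitions):
    # Each partition would live on a different node; mapping is independent.
    mapped = [pair for part in partitions for rec in part for pair in map_phase(rec)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

partitions = [["C C O", "C N"], ["O O C"]]
counts = map_reduce(partitions)
```

Because `map_phase` needs only its own record, the map step can run where the data already resides, which is the sense of "bring computations to data".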

Hadoop (MapReduce) for massively parallel analysis

Evaluating Hadoop for next-generation sequencing
• Compare Hadoop and HPC
– Create pipelines that are as identical as possible
– Calculate efficiency as function of data size
– Conclusion: Hadoop pipeline scales better
than HPC and is economical for current data
sizes
Alexey Siretskiy, former
postdoc at UPPMAX
A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth.
A quantitative assessment of the Hadoop framework for analyzing
massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26.
A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively
Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE
10th International Conference on (2014), vol. 1, pp. 317–323.

SPARK
• Adds caching to Hadoop
(MapReduce) – in-memory
computing
• Good for iterative
algorithms
• We applied it to ligand-based
virtual screening
With Åke Edlund,
HPCViz, KTH
L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative
MapReduce for Parallel Virtual Screening. Cloud Computing
Technology and Science (CloudCom), 2013 IEEE 5th
International Conference on, vol. 2, pp. 27-32, 2-5 Dec. 2013

Large-scale machine learning on Spark
• Ongoing project: Create a large-scale machine
learning pipeline for QSAR using Spark ML as
alternative to Luigi workflow system
– Apply to large data sets
– Apply to many data sets
– Compare Spark with workflows on a batch system
– Aim: Use for Reactive Modeling

Some conclusions so far on cloud computing and
Hadoop/Spark for bioinformatics
• Cloud computing
– Easy provisioning of infrastructures, services and platforms
• Hadoop
– Scalable and efficient, but at the price of software incompatibility
• Spark
– Improves on Hadoop with in-memory computing and a more intuitive
interface
• Current working hypothesis: Spark is more advantageous
than workflows on batch systems for machine
learning pipelines

Conformal prediction
Seeks to answer: “How good is your prediction?”
• Traditional machine learning algorithms:
– Simple predictions (e.g. “Class A”, 8.45)
• Conformal predictions
– Prediction intervals for a given confidence level
– Based on a consistent and well-defined mathematical
framework [1]
[1] Vovk, V.; Gammerman, A.; Shafer, G. “Algorithmic learning in a random world”; Springer: New York, 2005.
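A minimal sketch of how a prediction interval for a given confidence level can be obtained, here using inductive (split) conformal regression with the absolute error as nonconformity score. The pre-trained model, the calibration data, and the function names are hypothetical; the projects mentioned in this talk may use other nonconformity measures.

```python
import math

def conformal_interval(predict, calibration, x_new, confidence):
    """Inductive conformal regression with absolute-error nonconformity."""
    scores = sorted(abs(y - predict(x)) for x, y in calibration)
    n = len(scores)
    rank = math.ceil((n + 1) * confidence)  # 1-based rank of the score to use
    if rank > n:
        return (-math.inf, math.inf)        # too few calibration examples
    q = scores[rank - 1]
    center = predict(x_new)
    return (center - q, center + q)

def predict(x):
    """Hypothetical pre-trained point-prediction model."""
    return 2.0 * x

# Calibration examples held out from training: (descriptor, measured value).
calibration = [(1, 2.5), (2, 3.0), (3, 6.5), (4, 8.0), (5, 11.0),
               (6, 11.5), (7, 14.0), (8, 17.5), (9, 18.5)]
interval_80 = conformal_interval(predict, calibration, 10, confidence=0.8)
interval_95 = conformal_interval(predict, calibration, 10, confidence=0.95)
```

With only nine calibration examples the 95% interval is unbounded, which illustrates that the confidence level a conformal predictor can meaningfully support is limited by the size of the calibration set.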

Conformal predictions
Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling. A
transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014),
1596–603.

Some projects on Conformal Predictions
• CP Feature Highlighting
• CP in Spark
• Large-scale model building in
cheminformatics and virtual
screening
– Ongoing projects
Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal
Prediction Classification Models. Statistical Learning and Data Sciences. Springer
International Publishing; 2015. pp. 323–334.
Capuccini M, Carlsson L, Norinder U., and Spjuth O. Conformal Prediction in
Spark: Large-Scale Machine Learning with Confidence. Submitted.

Two pilots for clinical data management

e-Science (cyberinfrastructure, “big data”)
“Systematic and advanced use of computers in
research”
– High-performance computing
– Distributed data, “Big data”
– Enabling science!
www.e-science.se www.essenceofescience.se

Acknowledgements
Workflows
Samuel Lampa
David Kreil
Maciej Kańduła
BioImg.org
Martin Dahlö
Frédéric Haziza
Mentell Design
Hadoop & Spark
Alexey Siretskiy
Åke Edlund
Izhar ul Hassan
Marco Capuccini
Staffan Arvidsson
Cloud computing
Frédéric Haziza
Tore Sundqvist
Behrooz Torabi
Salman Toor
Andreas Hellander
Predictive modeling
Lars Carlsson
Ernst Ahlberg-Helgee
Martin Eklund
Ulf Norinder
Wesley Schaal
Jonathan Alvarsson
Bioclipse
Arvid Berg
Egon Willighagen
All Bioclipse and CDK
contributors

Thank you
ola.spjuth@farmbio.uu.se
