ReComp: preserving the value of big data insights over time
Panta Rhei (Heraclitus, through Plato)
Paolo Missier
Paolo.Missier@ncl.ac.uk
November 2015
Cloud CDT seminar series, Newcastle
Painting by Johannes Moreelse
Generating Analytical knowledge

Pipeline: Specification → Deployment → Execution → KA (KA = Knowledge Asset)
Dependencies: algorithms, libs, packages; system; external state (DBs); input data; config

Example: machine learning with Python and scikit-learn, to learn a model that recognises an activity pattern
• Specification: model training
• Dependencies: scikit-learn, Numpy, Pandas
• Deployment: Python 3, Ubuntu x.y.z, on an Azure VM (Ubuntu on Azure)
• Execution inputs: training + testing dataset, config
• KA: the trained model
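To make this example concrete, here is a minimal sketch of the kind of model-training step the slide describes, using scikit-learn; the dataset, features, and labels are illustrative assumptions, not material from the talk.

```python
# Minimal sketch of the slide's example: training an activity-recognition
# model with scikit-learn. The dataset and feature names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))          # e.g. accelerometer features
y = rng.integers(0, 3, size=1000)       # e.g. activity labels: walk/run/rest

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# The trained `model` is the Knowledge Asset (KA); its dependencies include
# the scikit-learn/Numpy/Pandas versions, the OS image, and the training data.
```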
Generating Analytical knowledge

Pipeline: Specification → Deployment → Execution → KA (KA = Knowledge Asset)

Example: a workflow to identify mutations in a patient's genome
• Specification: workflow specification ("analyse input genome")
• Dependencies: GATK/Picard/BWA; the workflow manager (and its own dependencies)
• Deployment: WF manager on a Linux VM cluster on Azure (Ubuntu on Azure)
• Execution inputs: input genome, config; external state: reference genome, variants DBs
• KA: the patient's variants
Rate of change

What changes, and how frequently? Inputs sit on a spectrum:
• Long-lived / slow-changing: reference DBs, historical time series data
• Short-lived / fast-changing: data streams, the current Twitter graph
• Input data and external DBs fall at various points along this spectrum
How fast does knowledge advance?
• Life Sciences knowledge:
• Genes (GenBank, Ensembl), Proteins, SNPs, Human Variants DBs (ClinVar)
• Life Sciences ontologies (GO, HPO,…)
• The human genome assembly
• The collection of all PubMed articles
• DBPedia, Wikipedia, etc.
• All current {Twitter, FB, G+, …} users and their connections
• A map of all buildings in a city, with their location and footprint
• The Hubble Atlas of Ancient Galaxies
• The catalogue of all known Exoplanets (about 2000)
How do we know which changes are relevant?
What analytics?
Genomics
• Diagnosis of rare genetic diseases
• Analyse soil, water composition (metagenomics)
Social media analytics, e.g. Twitter content analysis
• Sentiment analysis
• Topic discovery
• Emergency response
• Fostering new communities
Climate modelling
• Predicting local climate changes
• Ecology: understanding change by monitoring local species
Environment risk assessment
• Flood modelling and simulation
Case study: NGS data processing pipeline (Genomics)

Pipeline (multiple samples processed in parallel; Stage 1 is per-sample, Stage 2 calls variants across samples, Stage 3 is per-sample again):
raw sequences → align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate → annotated variants (plus coverage information)

Step details:
• Align: aligns the sample sequence to the HG19 reference genome using the BWA aligner.
• Clean: cleaning and duplicate elimination (Picard tools).
• Recalibrate alignments: corrects for systematic bias in the quality scores assigned by the sequencer (GATK).
• Calculate coverage: computes the coverage of each read.
• Call variants: variant calling operates on multiple samples simultaneously; samples are split into chunks, and the haplotype caller detects both SNVs and longer indels.
• Recalibrate variants: attempts to reduce the false-positive rate from the caller.
• Filter variants: VCF subsetting by filtering, e.g. removing non-exomic variants.
• Annotate: Annovar functional annotations (e.g. MAF, synonymy, SNPs, …), followed by in-house annotations.
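To make the stages more concrete, below is a minimal, illustrative Python sketch of the first Stage 1 steps as external tool invocations. The file names and flags are assumptions and would need adjusting to the installed BWA/Picard/GATK versions; intermediate conversions (e.g. SAM to BAM) are omitted.

```python
# Illustrative sketch of Stage 1 of the pipeline as external tool calls;
# paths, file names, and exact flags are assumptions, not the actual pipeline.
import subprocess

def run(cmd):
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Align raw sequences against the HG19 reference with BWA
# (in practice bwa mem writes SAM to stdout, to be redirected and converted).
run(["bwa", "mem", "hg19.fa", "sample.fastq"])
# Clean: duplicate elimination with Picard tools.
run(["java", "-jar", "picard.jar", "MarkDuplicates",
     "INPUT=sample.bam", "OUTPUT=dedup.bam", "METRICS_FILE=dup.txt"])
# Recalibrate alignment quality scores with GATK.
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
     "-R", "hg19.fa", "-I", "dedup.bam", "-o", "recal.table"])
```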
Case study: metagenomics

From environment to DNA sequence:
Sample → size fractionation → one of three extraction routes → sequencing → analysis?
• mRNA extraction → metatranscriptome
• DNA extraction → metagenome (metagenomics)
• PCR → amplicon
Case study: flood modelling in Newcastle
CityCAT (City Catchment Analysis Tool)
A unique software tool for modelling, analysis and
visualisation of surface water flooding
• High resolution flood model
• Integrates hydraulic modelling algorithms
• Subsurface flow modelling
• Topography (DEMs from LIDAR)
• Physical structures (buildings etc.)
• Land-use data
• Outputs high resolution grid of flood depths
• Extensively tested
• Multi-platform
• Integrated into CONDOR and Microsoft Azure
What kind of changes affect these analytics tasks?

LS  Diagnosis of rare genetic diseases
    Knowledge: PubMed; Human Variants DBs; the human genome assembly; SNP DBs
    Algorithms and tools: numerous algorithms and tools used for sequence alignment, cleaning, variant calling, …
LS  Metagenomics
    Knowledge: collections of known DNA sequences for multiple species
    Algorithms and tools: same as for genomics
SM  Sentiment analysis
    Knowledge: past predictive models
    Algorithms and tools: content-analysis NLP tools; statistical model learning (classification)
SM  Topic discovery
    Algorithms and tools: clustering algorithms
SM  Emergency response
    Algorithms and tools: content-analysis NLP tools; predictive models, topical trend analysis
SM  Fostering new communities
    Algorithms and tools: hubs & authorities algorithms, clustering
CS  Predicting local climate changes
    Knowledge: historical and current time series at multiple resolutions; past and current models
    Algorithms and tools: statistical model learning
CS  Ecology: understand change by monitoring local species
    Knowledge: local species count & behaviour observations
    Algorithms and tools: statistical model learning
CE  Flood modelling and simulation
    Knowledge: local topography, location of buildings
    Algorithms and tools: simulation packages (e.g. CityCAT)
Volume: how many data products are affected?

LS  Diagnosis of rare genetic diseases: the 100K Genome Project in the UK alone; thousands of samples in Newcastle alone
LS  Metagenomics: a few thousand (EBI Metagenomics portal)
SM  Sentiment analysis: the number of users whose sentiment is being analysed
SM  Topic discovery: a few clusters, containing a large number of tweets
SM  Emergency response: a few key decisions
SM  Fostering new communities: a few key users
CS  Predicting local climate changes: local effects
CS  Ecology: understand change by monitoring local species: local effects
CE  Flood modelling and simulation: local effects
How fast do these products become obsolete?

Timescales range across minutes, hours, days, months, and years for:
• LS  Diagnosis of rare genetic diseases
• LS  Metagenomics
• SM  Sentiment analysis
• SM  Topic discovery
• SM  Emergency response
• SM  Fostering new communities
• CS  Predicting local climate changes
• CS  Ecology: monitoring local species
• CE  Flood modelling and simulation
How sensitive are data products to change?

(Considered across the same applications: LS diagnosis of rare genetic diseases; LS metagenomics; SM sentiment analysis; SM topic discovery; SM emergency response; SM fostering new communities; CS predicting local climate changes; CS ecology: monitoring local species; CE flood modelling and simulation.)
How much do they cost?

Note: cost per product vs. cost over all products.
Cost components:
- Design
- Development
- System
- Runtime
(Considered across the same applications as above.)
Case study: NGS data processing pipeline (Genomics)

(Same pipeline as shown earlier: raw sequences → align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate → annotated variants.)
Workflow Deployment on the Azure Cloud

• One Azure VM hosts the Azure Blob store and the e-SC db backend.
• A second Azure VM hosts the e-Science Central main server, exposing a JMS queue, a REST API, and a Web UI; clients (a web browser or a rich client app) drive workflow invocations, with e-SC control data and workflow data flowing between the components.
• Worker roles run the workflow engines, backed by the e-SC blob store.

Workflow engines module configuration: 3 nodes, 24 cores.
Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04.
Cost

[Chart: cost in GBP (0-18) vs. number of samples (0-24), for three configurations: 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores).]
Changes in reference knowledge (ClinVar DB)
Case study: the Metagenomics portal at the EBI

From environment to DNA sequence:
Sample → size fractionation → {mRNA extraction (metatranscriptome) | DNA extraction (metagenome) | PCR (amplicon)} → sequencing → analysis?

EBI metagenomics portal:
• An open resource for the archiving and analysis of metagenomics and metatranscriptomics data
• A generic, yet standardised, analysis platform for all metagenomics studies
• Offers a service that small groups would struggle to achieve
• Pipeline overview: submission of sequence data for archiving and analysis → data analysis using selected EBI and external software tools → data presentation and visualisation through a web interface

Marine datasets: the portal contains over 30 marine metagenomes (millions of sequences).
Case study: flood modelling in Newcastle

(CityCAT, the City Catchment Analysis Tool, as introduced earlier.)
Fusing UO (Urban Observatory) data and modelling
CityCAT flood model + traffic data + weather data
The ReComp project

Aims: to create a decision support system for
1. detecting changes that affect time-sensitive analytical knowledge,
2. assessing its reprocessing options, and
3. estimating their cost.

ReComp DSS: takes change events, utility functions, and priority rules as inputs, together with previously computed KAs and their metadata; it produces prioritised KAs, cost estimates, and a reproducibility assessment.

Funded by the EPSRC (Making Sense from Data), Feb. 2016 - Feb. 2019, with 2 Research Associates.
In collaboration with:
- Newcastle Civil Engineering (Phil James)
- Department of Clinical Neurosciences, Cambridge University (Prof. Patrick Chinnery)
ReComp: Target operating region

[Charts: volume (low → high) plotted against rate of change (fast → slow), and volume plotted against cost (low → high), with the ReComp target region marked.]
Recomputation analysis: abstraction

[Diagram: at times t1, t2, t3, knowledge assets KA1-KA5 are produced, each depending on a subset of the assets {a, b, c, d}. Change events a → a′, b → b′, c → c′ arrive over time; each event marks the KAs that depend on the changed asset as candidates for recomputation.]
Recomputation analysis: conceptual steps

Assume we have a growing universe KA of Knowledge Assets.
Each ka ∈ KA has dependencies dep(ka) on other assets in a set DA (input data, algorithms, libs, …).

ReComp analysis steps:
Monitor and detect relevant change events {da_i → da_i′} with da_i ∈ DA.
For each change event {da_i → da_i′}:
• Identify the candidate recomputation population KA_rec ⊆ KA:
  • all ka ∈ KA such that da_i ∈ dep(ka)
• For each ka ∈ KA_rec:
  • Estimate the effect of recomputing ka using da_i′ instead of da_i
    • a quantitative estimate of the impact of the change da_i → da_i′
  • Determine the time and cost associated with recomputing ka
• Use these estimates, along with utility functions, to rank KA_rec
• Carry out the top-k recomputations given a budget: ka → ka′
• Perform post-hoc analysis to improve the estimation models:
  • Compare actual effects with estimates
  • Differential data analysis: Δ(ka, ka′)
  • Change cause analysis: has any other element contributed to Δ(ka, ka′)?
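A minimal sketch of the selection loop these steps describe, assuming toy stand-ins for the impact, cost, and utility functions (all names here are illustrative, not ReComp's actual API):

```python
# Illustrative sketch of the ReComp selection loop described above.
from dataclasses import dataclass, field

@dataclass
class KnowledgeAsset:
    name: str
    deps: set = field(default_factory=set)   # names of assets in DA it depends on

def recomp_candidates(kas, changed):
    """KA_rec: all assets that depend on the changed item da_i."""
    return [ka for ka in kas if changed in ka.deps]

def rank_for_budget(kas, changed, impact_est, cost_est, utility, budget):
    """Rank candidates by utility(impact, cost), then select within budget."""
    scored = sorted(((utility(impact_est(ka, changed), cost_est(ka)), ka)
                     for ka in recomp_candidates(kas, changed)),
                    key=lambda t: t[0], reverse=True)
    selected, spent = [], 0.0
    for _score, ka in scored:
        c = cost_est(ka)
        if spent + c <= budget:
            selected.append(ka)
            spent += c
    return selected

# Example with toy estimators (assumptions, not the project's models):
kas = [KnowledgeAsset("KA1", {"a", "b", "c"}), KnowledgeAsset("KA2", {"a", "b"})]
top = rank_for_budget(kas, "a",
                      impact_est=lambda ka, ch: len(ka.deps),
                      cost_est=lambda ka: 1.0,
                      utility=lambda impact, cost: impact / cost,
                      budget=1.5)
print([ka.name for ka in top])   # recompute these, then run post-hoc analysis
```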
Recomputation analysis through sampling

[Flow: change events → monitor → identify recomputation candidates → small-scale sampling recomputations, used to assess the effects of change, estimate recomputation cost, and assess reproducibility cost → prioritisation, subject to utility and budget → large-scale recomputation. Meta-K collects the metadata produced along the way.]
Recomputation analysis through modelling

[Flow: change events feed a Change Impact Model (estimate change impact) and a Cost Model (estimate reproducibility cost/effort); these drive the identification of recomputation candidates and prioritisation over the target population, subject to utility and budget; large-scale recomputation then feeds model updates back into both models.]

Change impact model: Δ(x, x′) → Δ(y, y′) -- challenging!!
Can we do better??
Metadata + Analytics

The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.

[Same flow as above, now driven by Meta-K: logs, provenance, and dependencies feeding the change impact and cost models.]
High level architecture

• ReComp decision dashboard: Execute, Curate, Select/prioritise.
• Meta-Knowledge Repository, holding Research Objects.
• Analysis components: Change Impact Analysis, Cost Estimation, Differential Analysis, Reproducibility Assessment.
• Domain knowledge: utility functions, priority policies, data similarity functions.
• Prospective provenance curation (YesWorkflow).
• Runtime monitors (logging, runtime provenance recorder), one per target environment: Python, and other analytics environments.
• WP1 metadata feeds: provenance, logs, data and process versions, process dependencies.
Project objectives
Obj 1.
To investigate analytics techniques aimed at supporting re-computation decisions
Obj 2.
To research techniques for assessing under what conditions it is practically feasible
to re-compute an analytical process.
• Specific target system environments:
• Python / Jupyter
The eScience Central workflow manager (developed at Newcastle)
Obj 3.
To create a decision support system for the selective recomputation of complex
data-centric analytical processes and demonstrate its viability on two target case
studies
• Genomics (human variant analysis)
• Urban Observatory (flood modelling)
Expected outcomes
Research Outcomes:
Algorithms that operate on metadata to perform:
• impact analysis
• cost estimation
• differential data and change cause analysis of past and new knowledge
outcomes
• estimation of reproducibility effort
System Outcomes:
• A software framework consisting of domain-independent, reusable components,
which implement the metadata infrastructure and the research outcomes
• A user-facing decision support dashboard.
It must be possible to integrate the framework with domain-specific components, to
support specific scenarios, exemplified by our case studies.
Challenge 1: estimating impact and cost

[Same modelling flow as above: the Change Impact Model (estimate change impact) and the Cost Model (estimate reproducibility cost/effort) drive prioritisation over the target population; large-scale recomputation feeds model updates back into both models.]

Change impact model: Δ(x, x′) → Δ(y, y′) -- challenging!!
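One plausible shape for the cost side, in the spirit of [18] (predicting execution time from input features), is a regression over past execution logs; the features and numbers below are made up for illustration:

```python
# Illustrative sketch of a cost model: predict the runtime of a recomputation
# from input features, using past executions (e.g. from Meta-K logs).
import numpy as np
from sklearn.linear_model import LinearRegression

# Past executions: [input size in GB, number of samples] -> runtime in minutes
X_past = np.array([[1.0, 2], [2.0, 4], [4.0, 8], [8.0, 16]])
t_past = np.array([10.0, 19.0, 41.0, 80.0])

cost_model = LinearRegression().fit(X_past, t_past)

# Estimated runtime (a proxy for cost) of recomputing a KA with new inputs;
# the model would be refined as more execution logs accrue.
print(cost_model.predict(np.array([[3.0, 6]])))
```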
Challenge 2: managing the metadata
How do we generate / capture / store / index / query across multiple metadata
types and formats?
Relevant Metadata:
• Logs of past executions, automatically collected;
• Provenance traces:
• Runtime (“retrospective”) provenance
• Automatically collected data dependency graph captured from the
computation
• Process structure (“prospective provenance”)
• obtained by manually annotating a script
• External data and system dependencies, process and data versions, and system
requirements
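As a sketch of how these heterogeneous items might sit together in one Meta-K record, here is an illustrative (assumed, not the project's) schema:

```python
# Minimal sketch of a Meta-K record combining the metadata types listed above
# (log, retrospective provenance, prospective provenance, versions).
from dataclasses import dataclass, field

@dataclass
class MetaKRecord:
    ka_name: str
    log: dict                      # runtime stats from the execution log
    retrospective_prov: list      # data-dependency edges, e.g. PROV-style triples
    prospective_prov: str         # process structure, e.g. an annotated-script URI
    versions: dict = field(default_factory=dict)  # data/process/system versions

rec = MetaKRecord(
    ka_name="patient42-variants",
    log={"runtime_min": 41.0, "cores": 24},
    retrospective_prov=[("variants.vcf", "wasDerivedFrom", "sample42.fastq")],
    prospective_prov="yw://pipelines/ngs-exome#v3",   # hypothetical identifier
    versions={"GATK": "3.4", "ref_genome": "HG19", "ClinVar": "2015-11"},
)
```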
Challenge 3: Reproducibility

Recall the earlier example: a workflow to identify mutations in a patient's genome. Its specification runs under a WF manager on a Linux VM cluster on Azure; its dependencies include GATK/Picard/BWA and the workflow manager itself (with its own dependencies); its inputs are the patient's genome and a config; its external state includes the reference genome and variants DBs.

What happens when any of the dependencies change?
Challenge 4: reusability of the solution across cases
• How do we make case-specific solutions generic?
• How do we make the DSS reusable?
• Refactor: Generic framework + case-specific components
• This is hard: most elements are case-specific!
• Metadata formats
• Metadata capture
• Change impact
• Cost models
• Utility functions
• …
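A minimal sketch of the "generic framework + case-specific components" refactoring, with all names illustrative: the framework defines interfaces, and each case study supplies plugins.

```python
# Illustrative plugin architecture: the framework depends only on interfaces;
# case studies (genomics, flood modelling, ...) supply the implementations.
from abc import ABC, abstractmethod

class ImpactModel(ABC):
    @abstractmethod
    def estimate(self, ka, change) -> float: ...

class GenomicsImpactModel(ImpactModel):
    """Hypothetical case-specific plugin for the variant-analysis study."""
    def estimate(self, ka, change) -> float:
        # e.g. fraction of a patient's variants touched by a reference-DB update
        return 0.1

class ReCompFramework:
    def __init__(self, impact_model: ImpactModel):
        self.impact_model = impact_model   # injected, case-specific component

    def rank(self, kas, change):
        return sorted(kas, key=lambda ka: self.impact_model.estimate(ka, change),
                      reverse=True)

fw = ReCompFramework(GenomicsImpactModel())   # swap plugins per case study
```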
Available technology components

• W3C PROV model for describing data dependencies (provenance)
• DataONE "metacat" for data and metadata management
• The eScience Central Workflow Management System
  • Natively provenance-aware
• noWorkflow: an (experimental) Python provenance recorder
• Cloud resources: Azure, and our own private cloud (CIC)

[High-level architecture diagram as shown earlier.]
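As an illustration of the first bullet, a minimal sketch of recording a data dependency with W3C PROV, assuming the `prov` Python package; the identifiers are illustrative:

```python
# Minimal sketch of recording a data dependency in W3C PROV, using the
# `prov` Python package; names and namespaces are made up for the example.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("recomp", "http://example.org/recomp/")

raw   = doc.entity("recomp:sample42.fastq")
vcf   = doc.entity("recomp:patient42.vcf")
align = doc.activity("recomp:align-and-call")

doc.used(align, raw)                 # the activity consumed the raw sequences
doc.wasGeneratedBy(vcf, align)       # ...and produced the variants file
doc.wasDerivedFrom(vcf, raw)         # the dependency ReComp needs to track

print(doc.get_provn())               # serialise as PROV-N
```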
Specific areas for PhD research
Modelling and analytics:
• Impact and cost estimation
• […]
Software engineering
• Generic framework + plugins architecture
• Metadata management
• Capture, storage, index, query
• Reproducibility for recomputation
• […]
Case studies
• Genomics
• Flood modelling / smart cities
• […]
Summary
• Value from Big Data analytics may decay as the resources it is
built on change
• Resources = {data, external state, algorithms, libs, …}
• Value = “Knowledge Assets” (KA)
• When should such value be restored?
• How do you estimate the cost of re-computation?
• How do you prioritise over a large pool of KAs for a given budget?
ReComp:
• A decision support tool aimed at answering these questions
• Through a metadata management infrastructure with metadata
analytics on top
References
[1] V. Stodden, F. Leisch, and R. D. Peng, Implementing Reproducible Research. CRC Press, 2014.
[2] R. Peng, "Reproducible Research in Computational Science," Science, vol. 334, no. 6060, pp. 1226-1227, Dec. 2011.
[3] R. Qasha, J. Cala, and P. Watson, "Towards Automated Workflow Deployment in the Cloud using TOSCA," in Procs. IEEE 8th International Conference on Cloud Computing (IEEE CLOUD 2015), 2015.
[4] D. C. Koboldt, L. Ding, E. Mardis, and R. Wilson, "Challenges of sequencing human genomes," Brief. Bioinform., Jun. 2010.
[5] A. Nekrutenko, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biol., vol. 11, no. 8, p. R86, 2010.
[6] J. Cala, Y. X. Xu, E. A. Wijaya, and P. Missier, "From scripted HPC-based NGS pipelines to workflows on the cloud," in Procs. C4Bio workshop, co-located with the 2014 CCGrid conference, 2014.
[7] P. Missier, E. Wijaya, R. Kirby, and M. Keogh, "SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use," in Procs. 11th International Conference on Data Integration in the Life Sciences, 2015.
[8] D. G. MacArthur, T. A. Manolio, D. P. Dimmock, H. L. Rehm, et al., "Guidelines for investigating causality of sequence variants in human disease," Nature, vol. 508, no. 7497, pp. 469-476, Apr. 2014.
[9] H. Johnson, R. S. Kovats, G. McGregor, J. Stedman, M. Gibbs, and H. Walton, "The impact of the 2003 heat wave on daily mortality in England and Wales and the use of rapid weekly mortality estimates," Euro Surveill., vol. 10, no. 7, pp. 168-171, 2005.
[10] T. Holderness, S. Barr, R. Dawson, and J. Hall, "An evaluation of thermal Earth observation for characterizing urban heatwave event dynamics using the urban heat island intensity metric," International Journal of Remote Sensing, vol. 34, no. 3, pp. 864-884, 2013.
[11] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, et al., "The Open Provenance Model - Core Specification (v1.1)," Futur. Gener. Comput. Syst., vol. 27, no. 6, pp. 743-756, 2011.
[12] H. Hiden, P. Watson, S. Woodman, and D. Leahy, "e-Science Central: Cloud-based e-Science and its application to chemical property modelling," Newcastle University Technical Report series, http://www.ncl.ac.uk/computing/research/techreports/, 2011.
[13] T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, et al., "YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts," in Procs. 10th Intl. Digital Curation Conference (IDCC), 2015.
[14] S. Bechhofer, D. De Roure, M. Gamble, C. Goble, and I. Buchan, "Research Objects: Towards Exchange and Reuse of Digital Knowledge," in Procs. Int'l Workshop on Future of the Web for Collaborative Science (FWCS) at WWW 2010, 2010.
[15] S. Woodman, H. Hiden, P. Watson, and P. Missier, "Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning," in Procs. WORKS 2011, 2011.
[16] L. Murta, V. Braganholo, F. Chirigati, D. Koop, and J. Freire, "noWorkflow: Capturing and Analyzing Provenance of Scripts," in Procs. IPAW 2014, 2014.
[17] L. Moreau, P. Missier, K. Belhajjame, R. B'Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, T. Lebo, J. McCusker, S. Miles, J. Myers, S. Sahoo, and C. Tilmes, "PROV-DM: The PROV Data Model," 2012.
[18] T. Miu and P. Missier, "Predicting the Execution Time of Workflow Activities Based on Their Input Features," in Procs. WORKS, 2012.
[19] P. Missier, S. Woodman, H. Hiden, and P. Watson, "Provenance and data differencing for workflow reproducibility analysis," Concurr. Comput. Pract. Exp., 2013.

Editor's Notes

• #2: The times they are a-changin'.
• #22: How do we go from the environment to a whole-genome shotgun sequencing project? This slide presents the typical approach schematically. From the sample there is usually an extraction process, for example washing the bacteria from the solid matter in a soil sample, typically followed by some size fractionation. Historically this has focused on the bacterial component, but it is now moving in both directions of size, to viruses and picoeukaryotes. After the micro-organisms in a given size range have been isolated, the DNA is extracted and processed ready for sequencing. The sequencing approach most widely used today is Illumina, having originally been 454, though some of this depends on the nature of the study. With the cost of sequencing ever decreasing, the bottleneck in the process is now the analysis of the DNA samples. Most samples submitted to the portal are about ten times larger than the average bacterial genome, and a study may involve a series of samples.
• #32: S1. Identify re-computation candidates and understand the impact of changes in Information Assets on a corpus of knowledge outcomes: which outcomes are affected by the changes, and to what extent? This step defines the target re-computation population. S2. Estimate effects, costs, and benefits of re-computation across the target population (S1). S3. Establish re-computation priorities within the target population, based on a budget for computational resources, a problem-specific definition of utility functions and prioritisation policy, and the estimates from S2. S4. Selectively carry out priority re-computations, when the processes are reproducible. S5. Differential data analysis and change cause analysis: assess the effects of the re-computation. This involves understanding how the new outcomes differ from the original (differential data analysis), and which of the changes in the process are responsible for the changes observed in the outcomes (change cause analysis). The latter analysis helps data scientists understand the actual effect of an improved process post hoc, and also has the potential to improve future effect estimates.
• #33: Problem: this is "blind" and expensive. Can we do better?
• #35: These items are partly collected automatically, and partly as manual annotations. They include: logs of past executions, automatically collected, to be used for post-hoc performance analysis and estimation of future resource requirements and thus costs (S1); runtime provenance traces and prospective provenance, where the former are automatically collected graphs of data dependencies, captured from the computation [11], and the latter are formal descriptions of the analytics process, obtained from the workflow specification, or more generally by manually annotating a script (both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects); and external data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.