Bertram Ludäscher
ludaesch@illinois.edu

Center for Informatics Research in Science & Scholarship (CIRSS)
School of Information Sciences (formerly: GSLIS)
& National Center for Supercomputing Applications (NCSA)
& Department of Computer Science
From Data to Knowledge with
Workflows & Provenance
Outline
•  Scientific Workflows
–  Examples, Features
•  Data Cleaning and Curation
•  Provenance & Reproducible Science
–  "Prospective Provenance" (a.k.a. workflows)
–  Retrospective Provenance
•  YesWorkflow
–  Yes, Scripts can be Workflows, too!
•  Other stuff
–  Time allowing ...
SPIN'16	@	NCSA
Introductions should come first!
•  MS Computer Science, U Karlsruhe (K.I.T.)
•  PhD Computer Science, U Freiburg, Germany
•  Research Scientist, UC San Diego, SDSC
•  Dept. of Computer Science, UC Davis
•  School of Information Sciences, U of Illinois
•  Natl. Center for Supercomputing Applications
Scientific Workflows: ASAP
•  Automation
–  wfs to automate computational aspects of science
•  Scaling (exploit and optimize machine cycles)
–  wfs should make use of parallel compute resources
–  wfs should be able to handle large data
•  Abstraction, Evolution, Reuse (human cycles)
–  wfs should be easy to (re-)use, evolve, share
•  Provenance
–  wfs should capture processing history, data lineage
⇒ traceable data- and wf-evolution
⇒ Reproducible Science
[Figure: screenshots of the Trident Workbench and VisTrails workflow systems]
Once upon a time …
10 Essential functions of a scientific workflow system
1.  Automate programs and services scientists already use.
2.  Schedule invocations of programs and services correctly and efficiently, in parallel where possible.
3.  Manage dataflow to, from, and between programs and services.
4.  Enable scientists (not just developers) to author or modify workflows easily.
5.  Predict what a workflow will do when executed: prospective provenance.
6.  Record what happened during workflow execution: retrospective provenance.
7.  Reveal retrospective provenance: how workflow products were derived from inputs via programs and services.
8.  Organize intermediate and final data products as desired by users.
9.  Enable scientists to version, share and publish their workflows.
10.  Empower scientists who wish to automate additional programs and services themselves.
These functions (not just dataflow & actors) distinguish scientific workflow automation from general scientific software development.
Src: Timothy McPhillips
WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences
(Amber Hartman; Eisen Lab; UC Davis)
[Figure: WATERS pipeline diagram. Boxes shown:]
•  Find OTUs (OTUHunter)
•  Assign Taxonomy (STAP)
•  Profile alignment (STAP or Infernal)
•  Build phylogenetic tree (RaxML or Quicktree)
•  View tree: Dendroscope
•  UniFrac: tree & environment file
•  Assembled contigs
•  Chimera check (Mallard)
•  Diversity statistics. Text: OTU list, Chao1, Shannon. Graphs: rarefaction curves, rank-abundance curves
•  Visualization tools: Cytoscape networks & heat map
Some boxes carry the markers "+/- cipres" (once) and "+/- cluster" (three times).
Executable WATERS Workflow in Kepler
Example Bioinformatics Workflow: Motif-Catcher
Marc Facciotti et al.
UC Davis Genome Center
Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012
A Data-Streaming Workflow over Sensor Data
"Plumbing" workflow
•  Monitor and control supercomputer simulations
–  50+ composite actors (subworkflows)
–  4 levels of hierarchy
–  1000+ atomic (Java) actors
[Figure: nested subworkflows annotated with their sizes: 43 actors, 3 levels; 196 actors, 4 levels; 30 actors; 206 actors, 4 levels; 137 actors; 33 actors; 150; 123 actors; 66 actors; 12 actors; 243 actors, 4 levels]
Norbert Podhorszki, ORNL (then: UC Davis)
Scientific Workflow Design: Some Challenges
And the graphical UI makes our scientific workflows
so much easier to develop, understand and maintain!
More “Plumbing” (beware the Boolean Select)
Cabellos et al. Computer Physics Communications 182, 2011
Modeling & Design: The limits of my language mean the limits of my world
•  Vanilla Process Network
•  Functional Programming Dataflow Network
•  XML Transformation Network
•  Collection-oriented Modeling & Design framework (COMAD): "Look Ma: No Shims!"
Problems with [too many] Shims and Wires
•  Shims need to be placed and connected
–  Tedious, error-prone
•  Distract from scientifically meaningful actors
–  Non-descriptive workflows: worth sharing?
•  Data organization is encoded in workflow structure
–  Not robust to data changes
•  Shims often lead to complex designs
–  Imagine all previous "design patterns" intertwined
–  GOTO-programming
COMAD/VDAL: Raising the level of abstraction
–  Localized control-flow
–  Data management not done via wires
–  Actors are coupled not by wire but by data!
Pipelined Collection-Oriented Workflows
Collection-Oriented Modeling & Design (COMAD)
–  fully embrace the assembly line metaphor
–  data = tagged nested collections
–  e.g. represented as flattened, pipelined (XML) token streams
Actors (like assembly line workers) pass on what they don't work on
T McPhillips, S Bowers, D Zinn, B Ludäscher
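The assembly-line idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token format, the tag names and the `count_bases` actor are hypothetical and are not the COMAD or Kepler API. The actor reacts to the tokens it cares about and forwards everything else untouched, so no shim is needed to route data around it.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical token format for a flattened nested collection:
# ("open", tag), ("close", tag) or ("data", tag, value).
Token = Tuple


def count_bases(stream: Iterable[Token]) -> Iterator[Token]:
    """Actor that only reacts to 'sequence' data tokens.

    Every incoming token is passed on unchanged. After each sequence the
    actor adds one new 'length' token, so downstream actors still see the
    complete nested collection.
    """
    for token in stream:
        yield token  # pass on everything, including what we do not work on
        if token[0] == "data" and token[1] == "sequence":
            yield ("data", "length", len(token[2]))


def demo() -> list:
    stream = [
        ("open", "project"),
        ("data", "note", "ignored by the actor"),
        ("open", "sample"),
        ("data", "sequence", "ACGTAC"),
        ("close", "sample"),
        ("close", "project"),
    ]
    return list(count_bases(stream))
```

Because the actor is a generator over the stream, several such actors can be chained without any wiring beyond passing one stream into the next.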
Two different workflow designs
•  Hardwiring vs. configurable data/collection management
•  brittle vs. change-resilient designs
•  scientist can recognize napkin drawing/conceptual model
•  Human cycles are expensive
ADIOS in Kepler
ADIOS in COMAD
From Data Life-Cycle to Curation Life-Cycle
Uncanny Resemblance: Eye of Jupiter
(If you have “visions”… )
DCC Curation Lifecycle
Data Cleaning (Scientific & Business Apps)
How do you clean data? (Syntax)
•  Regular Expressions (regex)
•  Write your own scripts
–  … with regex
–  … in Python!
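A small sketch of what "scripts with regex in Python" can look like. The record format and the day/month/year date convention are assumptions made for the example, and `clean_record` is a hypothetical helper, not part of any tool named in the talk.

```python
import re


def clean_record(raw: str) -> str:
    """Collapse whitespace and rewrite D/M/YYYY dates in ISO form."""
    # Syntactic cleanup: runs of whitespace become a single space.
    text = re.sub(r"\s+", " ", raw).strip()

    # Assumption: dates are written day/month/year, e.g. 7/6/2016.
    def to_iso(match: re.Match) -> str:
        day, month, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"

    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", to_iso, text)
```

For example, `clean_record("  Quercus   alba ;  7/6/2016 ")` yields a single-spaced record with the date rewritten as `2016-06-07`.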
Kurator Project (Data Curation Workflows)
From “Climate Gate” to Reproducible Science
Capturing provenance is crucial for
transparency, interpretation, debugging, …
=> repeatable experiments,
=> reproducible science
=> need workflow-system agnostic model
Provenance: The Fine Arts
•  One of these has been sold for nearly $180m.
•  The other could be worth as much or more.
•  Which is which?
•  What is the difference?
Provenance in Science
•  What's so "provenance" about this?
•  Grand Canyon's rock layers are a record of the early geologic history of North America. The ancestral Puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
Natural History: Understanding what happened…
Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-seeker2004)
Computational Provenance
•  Origin and processing history of an artifact
–  usually: data (products), figures, ...
–  sometimes: workflow (and script) evolution …
•  Different sub-communities:
–  Provenance in databases
–  Provenance in (scientific) workflows
–  ... programming languages, systems/security, …
Different Kinds of Data Provenance in Workflows
•  Workflow Modeling & Design (a.k.a. prospective provenance, "Workflow-land")
•  Runtime Provenance (a.k.a. traces, logs, retrospective provenance, "Trace-land")
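The two kinds of provenance can be contrasted in a toy Python sketch. Everything here is hypothetical (the `traced` decorator, the `PLAN` list and the two steps); it is not how Kepler, YesWorkflow or noWorkflow record provenance. The plan is known before anything runs (prospective), while the trace is filled in during execution (retrospective).

```python
import functools

# Prospective provenance: the intended steps, known before execution.
PLAN = ["load", "scale"]

# Retrospective provenance: one record per call, filled in at run time.
TRACE = []


def traced(func):
    """Record inputs and output of each call to func (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        TRACE.append({"step": func.__name__, "inputs": args,
                      "kwargs": kwargs, "output": result})
        return result
    return wrapper


@traced
def load(path):
    # Stand-in for reading a file; fixed values keep the sketch self-contained.
    return [1.0, 2.0, 3.0]


@traced
def scale(values, factor):
    return [v * factor for v in values]


def run():
    TRACE.clear()
    return scale(load("input.csv"), 2.0)
```

After `run()`, the step names in `TRACE` can be compared against `PLAN` to check that what happened matches what was planned.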
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
–  Four Corners; AD 600–1500. Climate change influenced Mesa Verde migrations in the late 13th century AD. The study uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates the joint information in tree rings and a climate signal to identify the "best" tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
[Figure: graph of the R script, annotated with a "?". Node labels: GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
YesWorkflow: Yes, scripts are workflows, too!
•  Script vs Workflows/ASAP:
–  Automation: *****
–  Scaling: **
–  Abstraction: *
–  Provenance: **
YW annotations: Model your Workflow!
YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
•  YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
[Figure: YW workflow graph of the script. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path. Data with file URI templates:]
•  sample_spreadsheet: file:cassette_{cassette_id}_spreadsheet.csv
•  calibration_image: file:calibration.img
•  run_log: file:run/run_log.txt
•  rejection_log: file:/run/rejected_samples.txt
•  raw_image: file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
•  corrected_image: file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
•  collection_log: file:run/collected_images.csv
Paleoclimate Reconstruction (EnviRecon.org)
•  … explained using YesWorkflow!
[Figure: YesWorkflow graph of the paleoclimate reconstruction script, from GetModernClimate and SubsetAllData through CAR_Analysis_unique, CAR_Analysis_union and CAR_Reconstruction_union to CAR_Reconstruction_union_output, which writes ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif and ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif]
Kyle B., (computational) archaeologist:
"It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow
João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher
Using Provenance from Script Runs
Example from the log-file:
2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img

But how was that image derived?? ("Provenance for Self!")
noWorkflow: not only Workflow!
[Figure: noWorkflow provenance graph of one simulate_data_collection run. Nodes are function calls and variable assignments labeled with script line numbers (for example 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow), plus the files touched: cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and the raw and corrected images run/raw/q55/{sample_id}/e{energy}/image_{frame_number}.raw and run/data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img for DRT240 (10000, 11000, 12000 eV) and DRT322 (10000, 11000 eV), frames 001 and 002.]
•  Scripts have provenance, too!
•  Transparently capture some/all provenance from Python script runs.
•  Use filter queries to "zoom" into relevant parts ..
noWorkflow lineage of an image file
Provenance information about Python function calls, variable assignments, etc.
$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"
[Figure: filtered noWorkflow lineage graph for simulate_data_collection, ending in calibration.img and run/data/DRT240/DRT240_11000eV_002.img. Node labels (script line number, name, value):]
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
.. auto-"make" this!
$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
	now helper df_style.py
	now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv
YesWorkflow: Yes, scripts are Workflows, too!
•  Use YW annotations @begin...@end, @in, @out to reveal hidden conceptual workflow (prospective provenance)
•  Script isn't changed:
–  annotations via comments (=> language independent)
•  For understanding and sharing the "big picture"
•  Query and visualize!
[Figure: YW workflow graph of simulate_data_collection. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log, sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id]
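To make the annotation idea concrete, here is a minimal sketch. It uses only the keywords named on the slide (@begin, @end, @in, @out) inside ordinary comments; the annotated script, its block names and the `extract_annotations` helper are hypothetical, and the regex scan is a stand-in for illustration, not the YesWorkflow tool or its parser.

```python
import re

# A tiny annotated script, held in a string so the sketch is self-contained.
# The annotations are plain comments, so the script itself is unchanged.
SCRIPT = '''
# @begin convert_readings
# @in raw_readings
# @out celsius_readings

# @begin to_celsius
# @in raw_readings
# @out celsius_readings
celsius_readings = [(f - 32) * 5 / 9 for f in raw_readings]
# @end to_celsius

# @end convert_readings
'''

# Matches a comment of the form "# @keyword name".
ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_annotations(source: str):
    """Return (keyword, name) pairs in the order they appear in comments."""
    return [m.groups() for m in ANNOTATION.finditer(source)]


def steps(source: str):
    """Names of all blocks opened with @begin."""
    return [name for kw, name in extract_annotations(source) if kw == "begin"]
```

Since the scan only looks at comment text, the same approach would apply to any language whose comments can carry the annotations.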
Alternate YW Views
[Figure: three renderings of the simulate_data_collection model]
•  Process view (steps only): initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity
•  Data view
•  Workflow view (steps together with their data)
What is the lineage of "corrected_image"?
From here on "upwards": What led (leads) to this?
.. and what is irrelevant and should be pruned??
[Figure: full YW workflow graph of simulate_data_collection]
What is the lineage of corrected_image?
Subgraph resulting from lineage query on YW workflow model
[Figure: pruned YW graph of simulate_data_collection. Steps: load_screening_results, calculate_strategy, collect_data_set, transform_images. Data: cassette_id, sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image, corrected_image]
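A lineage query of this kind is an upstream traversal of the workflow graph. The sketch below is an assumption-laden illustration: the step and data names come from the slides, but the input/output wiring in `MODEL` is reconstructed by hand for the example, and the dictionary encoding and the `lineage` function are hypothetical rather than YesWorkflow's own query mechanism.

```python
# Hypothetical encoding of the workflow model: step -> (inputs, outputs).
# The wiring is an assumption reconstructed for illustration only.
MODEL = {
    "initialize_run": ([], ["run_log"]),
    "load_screening_results": (
        ["cassette_id", "sample_spreadsheet"],
        ["sample_name", "sample_quality"]),
    "calculate_strategy": (
        ["sample_name", "sample_quality", "sample_score_cutoff", "data_redundancy"],
        ["accepted_sample", "rejected_sample", "num_images", "energies"]),
    "log_rejected_sample": (["cassette_id", "rejected_sample"], ["rejection_log"]),
    "collect_data_set": (
        ["cassette_id", "accepted_sample", "num_images", "energies"],
        ["sample_id", "energy", "frame_number", "raw_image"]),
    "transform_images": (
        ["sample_id", "energy", "frame_number", "raw_image", "calibration_image"],
        ["corrected_image", "total_intensity", "pixel_count"]),
    "log_average_image_intensity": (
        ["cassette_id", "sample_id", "frame_number", "total_intensity", "pixel_count"],
        ["collection_log"]),
}


def lineage(target, model=MODEL):
    """Return all steps and data items upstream of target (target included)."""
    # Map each data item to the step that produces it.
    producers = {out: step for step, (_, outs) in model.items() for out in outs}
    seen, todo = set(), [target]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in producers:      # data item: go to the step that made it
            todo.append(producers[node])
        if node in model:          # step: go to everything it consumed
            todo.extend(model[node][0])
    return seen
```

Under this assumed wiring, `lineage("corrected_image")` keeps the four steps and the data items named in the pruned graph, and drops the logging branches such as `run_log` and `rejection_log`.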
YesWorkflow: Conceptual workflow model
noWorkflow: Python trace model
[Figure: the YW workflow graph and the noWorkflow trace graph of simulate_data_collection, each paired with the subgraph returned by its lineage query]
But how do we bridge this gap???
Would like to use YW model to query NW data!
Some bridges are precarious…
… and new bridge-building can be stressful
… even if just painting over.
Habemus Pons! We've got the Bridge!
The bridge is the journey..
(The journey is the destination)
Lineage of image file in terms of YW model, with details from NW provenance
Secret Reproducible Sauce
•  Combining provenance information from noWorkflow and YesWorkflow
•  Using all the good stuff:
–  make, docker, Prolog, SQL, Graphviz
•  Open source
–  github.com/yesworkflow-org/yw-noworkflow
–  github.com/gems-uff/yin-yang-demo
•  Have a closer look at the demo!
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
Workflow inputs:
cassette_id
sample_score_cutoff
sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv)
calibration_image (file:calibration.img)
Steps and their outputs:
initialize_run: run_log (file:run/run_log.txt)
load_screening_results: sample_name, sample_quality
calculate_strategy: rejected_sample, accepted_sample, num_images, energies
log_rejected_sample: rejection_log (file:/run/rejected_samples.txt)
collect_data_set: sample_id, energy, frame_number, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw)
transform_images: corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), total_intensity, pixel_count, corrected_image_path
log_average_image_intensity: collection_log (file:run/collected_images.csv)
•  URI templates link conceptual entities to runtime provenance “left behind” by the script author …
•  … facilitating provenance reconstruction
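The idea behind the URI templates can be sketched in a few lines of Python: turn a template into a regular expression and read the variable bindings back off a file path. This is an illustrative sketch, not YesWorkflow's own implementation; the helper names `template_to_regex` and `bindings` are made up here, and the sketch assumes that variable values never contain `/`, `_` or `.`.

```python
import re


def template_to_regex(template):
    """Compile a URI template into a regex with one named group per {variable}."""
    if template.startswith("file:"):
        template = template[len("file:"):]      # drop the URI scheme
    # re.split with a capturing group alternates: literal, variable, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    seen = set()
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)          # literal text between variables
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable: same value again
        else:
            seen.add(part)
            # Assumption: values contain no '/', '_' or '.'
            pattern += f"(?P<{part}>[^/_.]+)"
    return re.compile(pattern + r"\Z")


def bindings(template, path):
    """Return the variable bindings for path, or None if it does not match."""
    match = template_to_regex(template).match(path)
    return match.groupdict() if match else None
```

For example, matching `run/raw/q55/DRT240/e11000/image_002.raw` against the `raw_image` template recovers `cassette_id`, `sample_id`, `energy` and `frame_number` without any runtime recording.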
Q1:	What	samples	did	the	script	run	collect	images	
from?	
Q2: What energies were used for image collection from sample DRT322?
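Questions about samples and energies reduce to collecting variable bindings from the raw image paths. A minimal sketch, assuming the paths follow the `raw_image` URI template and that energies and frame numbers are plain digits; the regular expression is written by hand from that template, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Hand-written from the raw_image URI template:
# run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
RAW_IMAGE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw\Z"
)


def energies_by_sample(paths):
    """Map each sample_id to the sorted energies seen in its raw image paths."""
    found = defaultdict(set)
    for path in paths:
        match = RAW_IMAGE.match(path)
        if match:                               # ignore logs and other files
            found[match["sample_id"]].add(int(match["energy"]))
    return {sample: sorted(energies) for sample, energies in found.items()}
```

The keys of the result are the samples that images were collected from, and the entry for `DRT322` lists its energies. For the directory listing on these slides that entry would be `[10000, 11000]`.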
Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
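Going from a corrected image back to its raw image runs the templates in the other direction: read the bindings off the corrected file name, then fill them into the `raw_image` template. The sketch below assumes the naming scheme of the two URI templates; the cassette id is not part of the corrected file name, so it is left as a wildcard. The function name and the use of `fnmatchcase` are choices made for this illustration.

```python
import re
from fnmatch import fnmatchcase

# File-name part of the corrected_image template:
# {sample_id}_{energy}eV_{frame_number}.img  (accepts 'eV' or 'ev')
CORRECTED = re.compile(
    r"(?P<sample_id>[^_/]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img\Z",
    re.IGNORECASE,
)


def raw_image_for(corrected_name, candidate_paths):
    """Return the raw image paths that the corrected image was derived from."""
    match = CORRECTED.match(corrected_name)
    if match is None:
        return []
    # Fill the raw_image template; '*' stands in for the unknown cassette_id.
    pattern = "run/raw/*/{sample_id}/e{energy}/image_{frame_number}.raw".format(
        **match.groupdict()
    )
    return [path for path in candidate_paths if fnmatchcase(path, pattern)]
```

In the matching path, the component filled in for `*` is the cassette id (`q55` in the example run), so the same lookup also recovers which cassette a corrected image came from.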
Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
New Project! (internships next summer!)
http://wholetale.org/

From Data to Knowledge with Workflows & Provenance

  • 2. •  Scien2fic Workflows –  Examples, Features •  Data Cleaning and Cura5on •  Provenance & Reproducible Science –  “Prospec5ve Provenance” (a.k.a. workflows) –  Retrospec5ve Provenance •  YesWorkflow –  Yes, Scripts can be Workflows, too! •  Other stuff –  Time allowing .. Outline 2 SPIN'16 @ NCSA
  • 3. Introduc2ons should come first! •  MS Computer Science, U Karlsruhe (K.I.T.) •  PhD Computer Science, U Freiburg, Germany •  Research Scien5st, UC San Diego, SDSC •  Dept. of Computer Science, UC Davis •  School of Informa5on Sciences, U of Illinois •  Natl. Center for Supercompu5ng Applica5ons 3 SPIN'16 @ NCSA
  • 4. Scientific Workflows: ASAP •  Automation –  wfs to automate computational aspects of science •  Scaling (exploit and optimize machine cycles) –  wfs should make use of parallel compute resources –  wfs should be able handle large data •  Abstraction, Evolution, Reuse (human cycles) –  wfs should be easy to (re-)use, evolve, share •  Provenance –  wfs should capture processing history, data lineage è traceable data- and wf-evolution è  Reproducible Science Trident Workbench VisTrails 4 Es war einmal … SPIN'16 @ NCSA
  • 5. 10 Essen2al func2ons of a scien2fic workflow system 1.  Automate programs and services scien5sts already use. 2.  Schedule invoca5ons of programs and services correctly and efficiently – in parallel where possible. 3.  Manage dataflow to, from, and between programs and services. 4.  Enable scien2sts (not just developers) to author or modify workflows easily. 5.  Predict what a workflow will do when executed: prospec/ve provenance. 6.  Record what happened during workflow execu5on: retrospec/ve provenance. 7.  Reveal retrospec2ve provenance – how workflow products were derived from inputs via programs and services. 8.  Organize intermediate and final data products as desired by users. 9.  Enable scien5sts to version, share and publish their workflows. 10.  Empower scien2sts who wish to automate addi2onal programs and services themselves. These func2ons (not just dataflow & actors) dis2nguish scien/fic workflow automa/on from general scien2fic so[ware development. SPIN'16 @ NCSA 5 Src: Timothy McPhillips
  • 7. Executable WATERS Workflow in Kepler SPIN'16 @ NCSA 7
  • 9. Motif-Catcher workflow, implemented in Kepler S Köhler et al. Improved Mo5f Detec5on in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012 SPIN'16 @ NCSA 9
  • 10. A Data-Streaming Workflow over Sensor Data SPIN'16 @ NCSA 10
  • 11. •  Monitor and control supercomputer simula5ons –  50+ composite actors (subworkflows) –  4 levels of hierarchy –  1000+ atomic (Java) actors 43 actors, 3 levels 196 actors, 4 levels 30 actors 206 actors, 4 levels 137 actors 33 actors 150 123 actors 66 actors 12 actors 243 actors, 4 levels Norbert Podhorszki ORNL (then: UC Davis) “Plumbing” workflow SPIN'16 @ NCSA 11
  • 12. Scien2fic Workflow Design: Some Challenges And the graphical UI makes our scientific workflows so much easier to develop, understand and maintain! SPIN'16 @ NCSA 12
  • 13. More “Plumbing” (beware the Boolean Select) Cabellos et al. Computer Physics Communica*ons 182, 2011 SPIN'16 @ NCSA 13
  • 14. Modeling & Design: Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt Vanilla Process Network Func2onal Programming Dataflow Network XML Transforma2on Network Collec2on-oriented Modeling & Design framework (COMAD) “Look Ma: No Shims!” SPIN'16 @ NCSA 14
  • 15. Problems with [too many] Shims and Wires •  Shims need to be placed and connected –  Tedious, error-prone •  Distract from scien5fic meaningful actors –  Non-descrip5ve workflows – worth sharing? •  Data Organiza5on is encoded in workflow structure –  Not robust to data changes •  Shims ouen lead to complex designs –  Imagine all previous `design-pawerns’ intertwined –  GOTO-programming COMAD/VDAL: Raising the level of abstrac/on   Localized control-flow   Data management not done via wires   Actors are coupled not by wire but by data! SPIN'16 @ NCSA 15
  • 16. Collec5on-Oriented Modeling & Design (COMAD) –  fully embrace the assembly line metaphor –  data = tagged nested collec2ons –  e.g. represented as flawened, pipelined (XML) token streams: Pipelined Collec2on-Oriented Workflows Actors (like assembly line workers), pass on what they don’t work on T McPhillips, S Bowers, D Zinn, B Ludäscher SPIN'16 @ NCSA 16
  • 17. Two different workflow designs •  Hardwiring vs. configurable data/collec5on management •  briwle vs. change resilient designs •  scien5st can recognize napkin drawing/conceptual model •  Human cycles are expensive SPIN'16 @ NCSA 17
  • 20. From Data Life-Cycle to Curation Life-Cycle Uncanny Resemblance: Eye of Jupiter (If you have “visions”… ) DCC Curation Lifecycle SPIN'16 @ NCSA 20
  • 24. Kurator Project (Data Curation Workflows) SPIN'16 @ NCSA 24
  • 25. From “Climate Gate” to Reproducible Science Capturing provenance is crucial for transparency, interpretation, debugging, … => repeatable experiments, => reproducible science => need workflow-system agnostic model SPIN'16 @ NCSA 25
  • 32. GetModernClimate PRISM_annual_growing_season_precipitation SubsetAllData dendro_series_for_calibration dendro_series_for_reconstruction CAR_Analysis_unique cellwise_unique_selected_linear_models CAR_Analysis_union cellwise_union_selected_linear_models CAR_Reconstruction_union raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors CAR_Reconstruction_union_output ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif master_data_directory prism_directory tree_ring_datacalibration_years retrodiction_years ? YesWorkflow: Yes, scripts are workflows, too! •  Script vs Workflows/ASAP: – Automation: ***** – Scaling: ** – Abstraction: * – Provenance: ** 32 SPIN'16 @ NCSA
  • 34. YesWorkflow: Prospec2ve & Retrospec5ve Provenance … (almost) for free! •  YW annota5ons in the script (R, Python, Matlab) are used to recreate the workflow view from the script … 34 cassette_id sample_score_cutoff sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv calibration_image file:calibration.img initialize_run run_log file:run/run_log.txt load_screening_results sample_namesample_quality calculate_strategy rejected_sample accepted_sample num_images energies log_rejected_sample rejection_log file:/run/rejected_samples.txt collect_data_set sample_id energy frame_number raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw transform_images corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img total_intensitypixel_count corrected_image_path log_average_image_intensity collection_log file:run/collected_images.csv YW! SPIN'16 @ NCSA
  • 35. GetModernClimate PRISM_annual_growing_season_precipitation SubsetAllData dendro_series_for_calibration dendro_series_for_reconstruction CAR_Analysis_unique cellwise_unique_selected_linear_models CAR_Analysis_union cellwise_union_selected_linear_models CAR_Reconstruction_union raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors CAR_Reconstruction_union_output ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif master_data_directory prism_directory tree_ring_datacalibration_years retrodiction_years Paleoclimate Reconstruc2on (EnviRecon.org) 35 •  … explained using YesWorkflow! Kyle B., (computa5onal) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told." SPIN'16 @ NCSA
  • 38. module.__build_class__ module.__build_class__ simulate_data_collection 180 return 180 run_logger 201 return 201 new_image_file 230 parser 231 cassette_id 236 add_option 241 add_option 246 add_option 248 set_usage 251 parse_args 251 args 251 options 254 module.len 24 cassette_id 24 sample_score_cutoff 24 data_redundancy 24 calibration_image_file 30 exists 33 exists 32 filepath 34 module.remove 33 exists 32 filepath 34 module.remove 33 exists 32 filepath 34 module.remove 36 run_log 37 write 38 str(sample_score_cutoff) 38 write 38 str(sample_score_cutoff) 49 str.format 49 sample_spreadsheet_file 50 spreadsheet_rows cassette_q55_spreadsheet.csv 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 72 str.format 72 write 73 open 73 rejection_log 74 str.format 74 TextIOWrapper.write 50 spreadsheet_rows 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 90 str.format 90 write 91 sample_id 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image calibration.img 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... 
_{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... 
_{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 50 spreadsheet_rows 50 spreadsheet_rows(sample_spreadsheet_file) 51 str.format 51 write 50 sample_name 50 sample_quality 61 calculate_strategy 61 rejected_sample 61 energies 61 accepted_sample 61 num_images 90 str.format 90 write 91 sample_id 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 92 collect_next_image(casset ... 
_{frame_number:03d}.raw') 93 str.format 93 write 92 energy 92 frame_number 92 intensity 92 raw_image_file 106 str.format 106 transform_image 106 corrected_image_file 106 total_intensity 106 pixel_count 107 str.format 107 write 118 average_intensity 119 open 119 collection_log_file 120 module.writer 120 collection_log 121 writer.writerow 92 collect_next_image 50 spreadsheet_rows 128 return run/run_log.txt run/rejected_samples.txt run/raw/q55/DRT240/e10000/image_001.raw run/data/DRT240/DRT240_10000eV_001.img run/collected_images.csv run/raw/q55/DRT240/e10000/image_002.raw run/data/DRT240/DRT240_10000eV_002.img run/raw/q55/DRT240/e11000/image_001.raw run/data/DRT240/DRT240_11000eV_001.img run/raw/q55/DRT240/e11000/image_002.raw run/data/DRT240/DRT240_11000eV_002.img run/raw/q55/DRT240/e12000/image_001.raw run/data/DRT240/DRT240_12000eV_001.img run/raw/q55/DRT240/e12000/image_002.raw run/data/DRT240/DRT240_12000eV_002.img run/raw/q55/DRT322/e10000/image_001.raw run/data/DRT322/DRT322_10000eV_001.img run/raw/q55/DRT322/e10000/image_002.raw run/data/DRT322/DRT322_10000eV_002.img run/raw/q55/DRT322/e11000/image_001.raw run/data/DRT322/DRT322_11000eV_001.img run/raw/q55/DRT322/e11000/image_002.raw run/data/DRT322/DRT322_11000eV_002.img noWorkflow: not only Workflow! 38 •  Scripts have provenance, too! •  Transparently capture some/all provenance from Python script runs. •  Use filter queries to “zoom” into relevant parts .. SPIN'16 @ NCSA
  • 39. noWorkflow lineage of an image file. Provenance information about Python function calls, variable assignments, etc.
    Query: $ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"
    Makefile rule (.. auto-“make” this!):
    $(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
        now helper df_style.py
        now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv
    [Figure: filtered lineage graph for simulate_data_collection. Nodes (script line, name = value): 230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>; 251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55']); 251 args = ['q55']; 251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>; 24 cassette_id = 'q55'; 24 sample_score_cutoff = 12.0; 24 data_redundancy = 0.0; 24 calibration_image_file = 'calibration.img'; 49 str.format; 49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'; 50 spreadsheet_rows(sample_spreadsheet_file); 50 sample_name = 'DRT240'; 50 sample_quality = 45; 61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]); 61 accepted_sample = 'DRT240'; 61 num_images = 2; 61 energies = [10000, 11000, 12000]; 91 sample_id = 'DRT240'; 92 collect_next_image(casset ... _{frame_number:03d}.raw'); 92 energy = 11000; 92 frame_number = 2; 92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'; 106 str.format; 106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img'). File nodes: calibration.img, run/data/DRT240/DRT240_11000eV_002.img.]
  • 40. YesWorkflow: Yes, scripts are Workflows, too! •  Use YW annotations @begin...@end, @in, @out to reveal the hidden conceptual workflow (prospective provenance) •  Script isn't changed: –  annotations via comments (=> language independent) •  For understanding and sharing the “big picture” •  Query and visualize! [Figure: YW workflow graph of simulate_data_collection. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log. Workflow inputs: sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id.]
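As a rough illustration of the annotation idea, the sketch below embeds @begin/@end/@in/@out comments in a script fragment and pulls them out with a small parser. This is not the YesWorkflow tool itself: `extract_blocks` is a hypothetical helper, the fragment's `transform(...)` call is made up, and the ports listed for transform_images are read off the workflow figure rather than taken from the actual script.

```python
import re

# A script fragment with YW-style annotations in ordinary comments.
# The code line in the middle is hypothetical and is never executed here.
SCRIPT = '''
# @begin transform_images
# @in raw_image
# @in calibration_image
# @out corrected_image
# @out total_intensity
# @out pixel_count
corrected, total, count = transform(raw, calibration)  # hypothetical call
# @end transform_images
'''

ANNOTATION = re.compile(r"@(begin|end|in|out)\s+(\S+)")


def extract_blocks(source):
    """Return {block name: {"in": [...], "out": [...]}} from '#' comments."""
    blocks = {}
    open_blocks = []  # stack of @begin names not yet closed by @end
    for line in source.splitlines():
        comment = line.partition("#")[2]  # text after the first '#', if any
        for tag, name in ANNOTATION.findall(comment):
            if tag == "begin":
                blocks[name] = {"in": [], "out": []}
                open_blocks.append(name)
            elif tag == "end":
                if open_blocks:
                    open_blocks.pop()
            elif open_blocks:
                # @in / @out belong to the innermost open block
                blocks[open_blocks[-1]][tag].append(name)
    return blocks


if __name__ == "__main__":
    print(extract_blocks(SCRIPT))
```

Because only comments are read, the same approach carries over to any language with line comments by swapping the comment marker; this simple version would be confused by a '#' inside a string literal.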
  • 41. Alternate YW Views: Process view, Data view, Workflow view. [Figure: three renderings of simulate_data_collection. The process view shows only the steps (initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity); the workflow view shows the steps together with the data that flows between them.]
  • 42. What is the lineage of “corrected_image”? From here on “upwards”: What led (leads) to this? .. and what is irrelevant and should be pruned?? [Figure: full YW workflow graph of simulate_data_collection, with corrected_image as the starting point of the query.]
  • 43. What is the lineage of corrected_image? Subgraph resulting from lineage query on YW workflow model. [Figure: the subgraph keeps the steps load_screening_results, calculate_strategy, collect_data_set, transform_images and the data sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id, sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image, corrected_image.]
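A lineage query of this kind amounts to an upstream reachability search over the workflow graph. The sketch below encodes the steps with their inputs and outputs and walks "upwards" from a data item. The step and data names come from the slides, but which data feeds which step is an assumption read off the figure, and `lineage` is a hypothetical helper, not the YW query interface.

```python
# step name -> (inputs, outputs); the wiring is assumed from the workflow figure
STEPS = {
    "initialize_run": (["cassette_id", "sample_score_cutoff"], ["run_log"]),
    "load_screening_results": (["cassette_id", "sample_spreadsheet"],
                               ["sample_name", "sample_quality"]),
    "calculate_strategy": (["sample_name", "sample_quality",
                            "sample_score_cutoff", "data_redundancy"],
                           ["accepted_sample", "rejected_sample",
                            "num_images", "energies"]),
    "log_rejected_sample": (["cassette_id", "rejected_sample"], ["rejection_log"]),
    "collect_data_set": (["cassette_id", "accepted_sample", "num_images", "energies"],
                         ["sample_id", "energy", "frame_number", "raw_image"]),
    "transform_images": (["sample_id", "energy", "frame_number",
                          "raw_image", "calibration_image"],
                         ["corrected_image", "total_intensity", "pixel_count"]),
    "log_average_image_intensity": (["cassette_id", "sample_id", "frame_number",
                                     "total_intensity", "pixel_count"],
                                    ["collection_log"]),
}


def lineage(target, steps=STEPS):
    """Return the target plus every step and data item upstream of it."""
    producer = {data: step
                for step, (_, outputs) in steps.items()
                for data in outputs}
    seen = set()
    frontier = [target]
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in steps:
            frontier.extend(steps[node][0])   # a step depends on its inputs
        elif node in producer:
            frontier.append(producer[node])   # data depends on its producing step
        # anything else is a workflow input and has no upstream nodes
    return seen


if __name__ == "__main__":
    print(sorted(lineage("corrected_image")))
```

Under these assumed edges the result contains the same nodes as the subgraph on slide 43; run_log, rejection_log, collection_log and the logging steps are pruned because nothing on the path to corrected_image depends on them.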
  • 44. YesWorkflow: Conceptual workflow model (lineage query). noWorkflow: Python trace model (lineage query). But how do we bridge this gap??? Would like to use YW model to query NW data! [Figure: side by side, the YW workflow graph of simulate_data_collection with its corrected_image lineage subgraph, and the noWorkflow trace graph with its filtered lineage for run/data/DRT240/DRT240_11000eV_002.img.]
  • 48. Secret Reproducible Sauce •  Combining provenance information from noWorkflow and YesWorkflow •  Using all the good stuff: – make, docker, Prolog, SQL, Graphviz •  Open source – github.com/yesworkflow-org/yw-noworkflow – github.com/gems-uff/yin-yang-demo •  Have a closer look at the demo!
  • 49. YW-RECON: Prospective & Retrospective Provenance … (almost) for free! •  URI-templates link conceptual entities to runtime provenance “left behind” by the script author … •  … facilitating provenance reconstruction
    [Figure: YW workflow graph (steps initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity) with these data nodes annotated by URI templates:]
    sample_spreadsheet   file:cassette_{cassette_id}_spreadsheet.csv
    calibration_image    file:calibration.img
    run_log              file:run/run_log.txt
    rejection_log        file:/run/rejected_samples.txt
    raw_image            file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
    corrected_image      file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
    collection_log       file:run/collected_images.csv
    [Directory listing left behind by the run:]
    run/
    ├── raw
    │   └── q55
    │       ├── DRT240
    │       │   ├── e10000
    │       │   │   ├── image_001.raw
    │       │   │   ...
    │       │   │   └── image_037.raw
    │       │   └── e11000
    │       │       ├── image_001.raw
    │       │       ...
    │       │       └── image_037.raw
    │       └── DRT322
    │           ├── e10000
    │           │   ├── image_001.raw
    │           │   ...
    │           │   └── image_030.raw
    │           └── e11000
    │               ├── image_001.raw
    │               ...
    │               └── image_030.raw
    ├── data
    │   ├── DRT240
    │   │   ├── DRT240_10000eV_001.img
    │   │   ...
    │   │   └── DRT240_11000eV_037.img
    │   └── DRT322
    │       ├── DRT322_10000eV_001.img
    │       ...
    │       └── DRT322_11000eV_030.img
    ├── collected_images.csv
    ├── rejected_samples.txt
    └── run_log.txt
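One way to see how URI templates support provenance reconstruction is to turn a template into a regular expression and match it against the files a run left behind; each match yields bindings for the template variables. The sketch below is an assumption-laden illustration, not YW-RECON's implementation: `template_to_regex` and `bind` are hypothetical helpers, and the corrected-image template is written with a `run/` prefix so that it matches the directory listing (the slide shows it as `file:data/...`).

```python
import re

RAW_IMAGE = "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
# Assumption: 'run/' prefix added so the template matches the listing.
CORRECTED_IMAGE = "file:run/data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template):
    """Compile a URI template into a regex with one named group per variable."""
    if template.startswith("file:"):
        template = template[len("file:"):]
    parts = re.split(r"\{(\w+)\}", template)  # literal, var, literal, var, ...
    seen = set()
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)            # literal text
        elif part in seen:
            pattern += f"(?P={part})"             # repeated variable: same value
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+)"       # variable: one path segment piece
    return re.compile(pattern)


def bind(template, path):
    """Return the variable bindings if path matches template, else None."""
    match = template_to_regex(template).fullmatch(path)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(bind(RAW_IMAGE, "run/raw/q55/DRT240/e11000/image_002.raw"))
    print(bind(CORRECTED_IMAGE, "run/data/DRT240/DRT240_11000eV_002.img"))
```

A variable that occurs twice in a template, such as {sample_id} in the corrected-image template, is compiled to a backreference, so a file whose directory and file name disagree on the sample does not match.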
  • 50. Q1: What samples did the script run collect images from? [Figure: as on slide 49, the YW model annotated with URI templates next to the run/ directory listing.]
  • 51. Q2: What energies were used for image collection from sample DRT322? [Figure: as on slide 49, the YW model annotated with URI templates next to the run/ directory listing.]
  • 52. Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img? [Figure: as on slide 49, the YW model annotated with URI templates next to the run/ directory listing.]
  • 53. Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img? [Figure: as on slide 49, the YW model annotated with URI templates next to the run/ directory listing.]
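Questions like Q3 and Q5 can be approached as a join: parse the corrected image's file name into sample, energy, and frame number, substitute those values into the raw-image template, and look for matching files. The sketch below does this with `str.format` and `fnmatch` over a handful of paths taken from the directory listing. It is only an illustration under those assumptions; `raw_images_for` and `cassette_ids_for` are hypothetical helpers, and the demo (per the slides) answers such queries with Prolog and SQL instead.

```python
import re
from fnmatch import fnmatchcase

RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_NAME = re.compile(
    r"(?P<sample_id>[^_/]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img")

# A few raw-image paths that appear in the directory listing.
FILES = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]


def raw_images_for(corrected_name, files):
    """Q3: raw image paths whose sample, energy and frame match the corrected image."""
    match = CORRECTED_NAME.fullmatch(corrected_name)
    if match is None:
        return []
    # The cassette is unknown from the file name alone, so leave it as a wildcard.
    pattern = RAW_TEMPLATE.format(cassette_id="*", **match.groupdict())
    return [path for path in files if fnmatchcase(path, pattern)]


def cassette_ids_for(corrected_name, files):
    """Q5: cassette ids, read from the third path segment of the matching raw images."""
    return {path.split("/")[2] for path in raw_images_for(corrected_name, files)}


if __name__ == "__main__":
    print(raw_images_for("DRT322_11000eV_030.img", FILES))    # Q3
    print(cassette_ids_for("DRT240_10000eV_001.img", FILES))  # Q5
    print({path.split("/")[3] for path in FILES})             # Q1: samples
    print({path.split("/")[4] for path in FILES
           if path.split("/")[3] == "DRT322"})                # Q2: energies of DRT322
```

The cassette id never appears in the corrected image's name; it is recovered only because the raw-image template shares sample_id, energy, and frame_number with the corrected-image template.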