Computing Workflows for Biologists
Based on:
Shade & Teal, Computing Workflows for Biologists: A Roadmap, PLOS Biology
Data Carpentry data organization lessons
•  How many people here plan to analyze data with a computer in their work?
•  Are you working with other people on this analysis?
•  Do other people need to understand your analysis?
•  Do you need to remember and understand your analysis?
Elements of computing
•  How data was generated (metadata)
•  Data
•  Data cleaning steps
•  Data analysis steps
•  Final plots and charts
Data!
•  Keep raw data raw
•  Use meaningful names
•  Organize your data so computers can read it
Keep raw data raw
•  What is raw data?
•  Why should I leave it alone?
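One way to act on "keep raw data raw" is to record a checksum of each raw file and mark the file read-only, so that accidental edits are either blocked or detectable later. A minimal Python sketch; the function names are illustrative and not from the slides, and the file path passed in is whatever your own raw file is called:

```python
import hashlib
import os
import stat


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_raw_file(path):
    """Record the checksum, then remove write permission from the file."""
    checksum = sha256_of(path)
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return checksum


def is_unchanged(path, recorded_checksum):
    """True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

Storing the returned checksum in your notes lets you confirm later, with is_unchanged, that the file you are analyzing is still the file you collected.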
Use meaningful names
Organize your data so computers can read it
(let’s talk about spreadsheets)
http://www.datacarpentry.org/spreadsheet-ecology-lesson/00-intro.html
… also avoid formatting errors
Organizing data in spreadsheets
The cardinal rules of using spreadsheet programs for data:
•  Put all your variables in columns - the thing you're measuring, like 'weight' or 'temperature'.
•  Put each observation in its own row.
•  Don't combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think about whether that's the only way you'll want to be able to use or sort that data.
•  Leave the raw data raw - don't mess with it!
•  Export the cleaned data to a text-based format like CSV. This ensures that anyone can use the data, and it is the format required by most data repositories.
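The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the column names (including the unit suffix in weight_g) and the combined "sex_weight" cell are all hypothetical, made up to show one cell holding two pieces of information.

```python
import csv
import io

# Hypothetical messy records: one cell mixes sex and weight ("F 42.0").
messy_rows = [
    {"plot": "1", "species": "DM", "sex_weight": "F 42.0"},
    {"plot": "1", "species": "DO", "sex_weight": "M 51.5"},
]


def tidy(rows):
    """Split the combined cell so each variable gets its own column."""
    cleaned = []
    for row in rows:
        sex, weight = row["sex_weight"].split()
        cleaned.append({
            "plot": row["plot"],
            "species": row["species"],
            "sex": sex,
            "weight_g": weight,
        })
    return cleaned


def to_csv_text(rows):
    """Serialize cleaned rows as CSV text, one observation per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv_text(tidy(messy_rows)))
```

The messy records are never modified; the cleaned rows are new objects, which mirrors the "leave the raw data raw" rule.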
Computing Workflows for Biologists: An Overview
Formatting problems
http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes.html
A Roadmap for the Computing Biologist
•  Consider the overarching goals of the analysis
•  Adopt an Iterative, Branching Pattern to Systematically Explore Options
•  Reproducibility Checkpoints
•  Taking Notes for Computational Analysis
•  Shared Responsibility: The Team Approach to Reproducibility and Data Management
Shade and Teal, Computing Workflows for Biologists: A Roadmap
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002303
Consider the Overarching Goals of the Analysis
•  Working to address a given hypothesis will motivate different analysis strategies than conducting data exploration
Reproducibility Checkpoints
Reproducibility checkpoints are places in a workflow devoted to scrutinizing its integrity
-  the workflow (or step in the workflow) can be seamlessly used (it doesn’t crash halfway or return error messages)
-  the outcomes are consistent and validated across multiple, identical iterations
-  results should make biological sense
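The first two criteria can be checked mechanically; whether results make biological sense still needs a person. A minimal Python sketch, assuming the workflow step is a function that returns JSON-serializable output. The helper names and the subsampling steps are hypothetical examples, not taken from the paper:

```python
import hashlib
import json
import random


def output_fingerprint(result):
    """Hash a JSON-serializable result so runs can be compared cheaply."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def checkpoint(step, runs=3):
    """Run `step` several times and return True if all outputs are identical.

    An exception raised by `step` propagates to the caller, which also
    counts as failing the checkpoint.
    """
    fingerprints = {output_fingerprint(step()) for _ in range(runs)}
    return len(fingerprints) == 1


def subsample_with_seed():
    """Hypothetical analysis step: a seeded random subsample is repeatable."""
    rng = random.Random(42)
    return sorted(rng.sample(range(1000), 10))


def subsample_without_seed():
    """Same step without a fixed seed; repeated runs will almost surely differ."""
    return sorted(random.sample(range(1000), 10))


print(checkpoint(subsample_with_seed))
```

Running checkpoint on the unseeded version is a quick way to see why steps that involve randomness need a recorded seed.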
Adopt an Iterative, Branching Pattern to Systematically Explore Options
Taking Notes for Computational Analysis
•  Take notes like you would for experimental work
•  Comment code
•  Use version control (GitHub/GitLab)
What needs to go in notes:
-  Software versions used
-  Description of what the software is doing/goal of that step
-  Brief notes on deviations from default options
-  Workflows can include different software (e.g., PANDAseq to QIIME to R), and notes should also include all “formatting steps” needed to move between tools. Hopefully you don’t need to format much manually; avoid it if possible
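Part of this note-taking can be automated. The sketch below builds a plain-text notes entry for one workflow step and looks up installed Python package versions with the standard library. It is an illustration only: the field names and the notes_entry helper are assumptions, and versions of external tools such as PANDAseq or QIIME would still need to be recorded by hand or by whatever version command each tool provides.

```python
import platform
from datetime import date
from importlib import metadata


def package_version(name):
    """Return the installed version of a Python package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def notes_entry(step, goal, packages, deviations=""):
    """Build a plain-text notes entry for one workflow step."""
    lines = [
        f"Date: {date.today().isoformat()}",
        f"Step: {step}",
        f"Goal: {goal}",
        f"Python: {platform.python_version()}",
    ]
    for name in packages:
        lines.append(f"{name}: {package_version(name)}")
    lines.append(f"Deviations from defaults: {deviations or 'none'}")
    return "\n".join(lines)
```

Appending each entry to a notes file kept under version control alongside the code keeps the record next to the analysis it describes.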
Shared Responsibility: The Team Approach to Reproducibility and Data Management
We posit that integrity in computational analysis of biological data is enhanced if there is a sense of shared responsibility for ensuring reproducible workflows.
Research teams that work together to develop and debug code, perform internal reproducibility checkpoints for each other, and generally hold one another accountable for high-quality results likely will enjoy a low manuscript retraction rate, high level of confidence in their results, and strong sense of collaboration.
You, your lab mates and PI need to value the time it takes to do analyses reproducibly and correctly
Shared responsibility
•  Shared storage and workspace can facilitate access to all group data
•  Using version control repositories can provide access to code and documentation (GitHub, Dropbox)
•  Setting expectations for ‘reproducibility checkpoints’ (team “hackathons”: open-computer group meetings dedicated to analysis)
•  Paper reviews
•  Looking for help/support outside the lab (bioinformatics or user groups, office hours, StackOverflow)
Looking for help
https://github.com/mblmicdiv/course2016/blob/master/bioinfo-resources.md
You are not alone
Survey responses
Exercise
http://tinyurl.com/mbl-workflows
