Scaling Analysis Responsibly
Hilary Parker
@hspter
#rcatladies
Not So Standard Deviations
@keegsdur
“We just don’t have enough analysts!”
“Let’s scale by building the perfect BI tool!”
That sounds great!
We should automate some of the things that are slowing you down
PRODUCT TEAM
DATA
http://xkcd.com/
That seems perfectly reasonable!
Let’s just enlist some folks from engineering to help you with it
DATA
PRODUCT TEAM
DATA
ENG
Sure thing!
...and finally can it add this last graph?
several months pass…
ENG
Sure! File a ticket!
Can we add these 132 extra metrics to the testing?
PRODUCT TEAM
You can’t do that, your family-wise error rate will tend to 1!!
ENG
PRODUCT TEAM
DATA
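The family-wise error rate claim can be checked with a quick calculation. This is a minimal sketch, assuming the 132 metrics are tested independently at a per-test significance level of 0.05. Both are assumptions, since the slide states neither. Under them, the probability of at least one false positive is 1 - (1 - alpha)^m:

```r
# Family-wise error rate (FWER) for m independent tests at level alpha.
# Independence is an assumption; real product metrics are usually correlated.
fwer <- function(m, alpha = 0.05) {
  1 - (1 - alpha)^m
}

m <- 132                                # number of extra metrics from the slide
uncorrected <- fwer(m)                  # close to 1
bonferroni_alpha <- 0.05 / m            # Bonferroni-corrected per-test level
corrected <- fwer(m, bonferroni_alpha)  # stays at or below 0.05

print(round(c(uncorrected = uncorrected, corrected = corrected), 4))
```

With these assumed values the uncorrected rate is about 0.999, which is the "tends to 1" in the speech bubble. The Bonferroni correction is shown only as one common remedy, not as the speaker's recommendation.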
ENG
That’s a reasonable expectation for an internal product. I’m on it!
I’d really like this tool to be more stable.
PRODUCT TEAM
Our test violates a subtle statistical assumption for this new application, and we need to gut this stable product!
ENG
PRODUCT TEAM
DATA
Almost impossible to avoid 2-against-1 dysfunction as product teams become “self-service” with engineering support
Invariably becomes a race to the bottom as internal competition for the simplest tool emerges
Stability prioritized over flexibility
(In tech)
Building = Owning
Analysis Developer!
“Analysis Developer”
Someone on the analyst team who develops reproducible, flexible analyses in R and helps all analysts scale their work
I’ll work with the analysis developer on my team!
We should automate some of the things that are slowing you down
PRODUCT TEAM
DATA
Avoids common types of dysfunction
Allows for flexible, accurate analysis
Analysts acquire marketable skills!
Instead of creating dashboards or using static BI tools...
http://dilbert.com/strip/2007-05-16
A series of R packages, highly specific to the business case, with “mix and match” elements for rapidly creating common reports.
library("internal_package")
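To make the “mix and match” idea concrete, here is a minimal sketch of the kind of small, composable functions such an internal package might export. The function names, column names, and data are all hypothetical. They are defined inline so the example runs on its own; in practice they would live in the package loaded with library().

```r
# Hypothetical building blocks an internal package might export.

# Summarise a metric by group using base R only.
summarise_metric <- function(data, metric, by) {
  out <- aggregate(data[[metric]], by = list(data[[by]]), FUN = mean)
  names(out) <- c(by, paste0("mean_", metric))
  out
}

# Turn a two-column summary table into plain-text report lines.
format_report <- function(summary_tbl, title) {
  c(title, paste0(summary_tbl[[1]], ": ", summary_tbl[[2]]))
}

# Made-up example data.
orders <- data.frame(
  region  = c("east", "east", "west", "west"),
  revenue = c(100, 120, 80, 90)
)

# Mix and match: any summary can be fed to any formatter.
report <- format_report(
  summarise_metric(orders, metric = "revenue", by = "region"),
  title = "Mean revenue by region"
)
writeLines(report)
```

Because each function does one job, an analyst can swap the summary step or the formatting step without touching the rest of the report.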
Scaling Analysis Responsibly
Instead of “assembly line” data processing…
Close 2-way partnership with data engineers to optimize the creation of datasets for certain common analyses.
The assembly line handoff from scientist to engineer creates [an uncreative] environment. The trick is to create an environment that allows for autonomy, ownership, and focus for everyone involved. - Jeff Magnusson
http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/
Instead of PM anxiously watching dashboards…
https://www.youtube.com/watch?v=CCbWyYr82BM
Analysts can create shorter-lived, reproducible reports
Manage expectations about the report’s shorter lifespan, but point out that the report will require less work from teams once created
Productionize in the short term with cron jobs
Can add in more stats this way! Y/Y turns into semiparametric models, etc.
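As a sketch of what productionizing with cron might look like, the script below computes a year-over-year change and writes a dated report file. The file names, the schedule in the comment, and the data are all hypothetical. A real report would more likely render a full document, and the Y/Y step is where a richer model could later be swapped in.

```r
# report.R: a hypothetical short-lived report script.
# Example crontab entry (assumed path) to run it at 07:00 every Monday:
#   0 7 * * 1 Rscript /path/to/report.R

# Year-over-year relative change for a metric.
yoy_change <- function(current, previous) {
  (current - previous) / previous
}

# Add the Y/Y column and write a dated CSV; returns the output path.
run_report <- function(data, out_dir = tempdir(), run_date = Sys.Date()) {
  data$yoy <- yoy_change(data$this_year, data$last_year)
  file_name <- paste0("weekly_report_", format(run_date, "%Y-%m-%d"), ".csv")
  out_file <- file.path(out_dir, file_name)
  write.csv(data, out_file, row.names = FALSE)
  out_file
}

# Made-up input data.
metrics <- data.frame(
  metric    = c("orders", "signups"),
  last_year = c(200, 50),
  this_year = c(250, 45)
)

path <- run_report(metrics)
result <- read.csv(path)
print(result)
```

Writing a dated file on each run keeps earlier reports available for comparison, which suits a report that is expected to be retired after a while.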
“The Problem with Dashboards (And A Solution)” by Stephanie Evergreen
http://stephanieevergreen.com/problem-with-dashboards/
http://dilbert.com/strip/2004-04-05
Instead of promotion based on deliverables…
Consider skill acquisition for analyst promotion
For analysis developers, base promotion on whether they were able to help other analysts become more efficient
Support for skill acquisition!
Education support for learning better analysis development methods for all analysts
Internally created resources
Instead of PMs self-teaching analysis based on what’s presented in dashboarding tools...
https://xkcd.com/605/
PMs can use the tools built for educating analysts if they want to “ramp up” on analytical skills like R
This way you can bake in statistical education as well.
“Isn’t this just package development?”
No!
Ad-hoc spreadsheet work
  + scripting
R workflows
  + reproducibility, some functions, “analysis testing”
Reproducible R analyses
  + workplace-wide audience, documentation, testing
  - problem-specific writeups and functions
Internal package development
  + industry-wide audience
  - company-specific code and functions
External package development
Analysis Developer
Open-Source Developer
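The slides do not define “analysis testing”. One plausible reading is checking assumptions about the data before an analysis runs. Here is a minimal sketch under that assumption, with a hypothetical helper, hypothetical column names, and made-up data:

```r
# Hypothetical data checks run before an analysis.
# Returns a character vector describing every problem found.
check_analysis_inputs <- function(data) {
  problems <- character(0)
  if (anyNA(data$revenue)) {
    problems <- c(problems, "revenue contains missing values")
  }
  if (any(data$revenue < 0, na.rm = TRUE)) {
    problems <- c(problems, "revenue contains negative values")
  }
  if (anyDuplicated(data$order_id) > 0) {
    problems <- c(problems, "order_id is not unique")
  }
  problems
}

clean <- data.frame(order_id = 1:3, revenue = c(10, 20, 30))
dirty <- data.frame(order_id = c(1, 1, 2), revenue = c(10, NA, -5))

print(check_analysis_inputs(clean))  # no problems found
print(check_analysis_inputs(dirty))  # three problems reported
```

Collecting all problems instead of stopping at the first one lets the analyst see everything wrong with a dataset in one pass.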
Analysis Developer
Stop trying to scale with static BI tools -- this will (almost) always lead to dysfunction
Instead, scale by increasing analyst efficiency using R and education!
Hire Analysis Developers to help with all this!
Thanks!
Hilary Parker
@hspter
