R and Reproducibility
A Proposal
David Smith
useR! 2014
What is Reproducibility?
“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.”
CRAN Task View on Reproducible Research (Kuhn)
• Method + Environment -> Results
• A process for:
– Sharing the method
– Describing the environment
– Recreating the results
xkcd.com/242/
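One way to "describe the environment" is to record the R version, platform, and attached package versions alongside the results. The following is a minimal sketch in base R; the helper name `describe_environment` and the fields it returns are illustrative assumptions, not something shown in the talk.

```r
# Hypothetical helper: capture the parts of the environment that affect results.
describe_environment <- function() {
  info <- utils::sessionInfo()
  attached <- info$otherPkgs  # NULL when no non-base packages are attached
  versions <- if (is.null(attached)) {
    character(0)
  } else {
    vapply(attached, function(p) as.character(p$Version), character(1))
  }
  list(
    r_version = R.version.string,
    platform  = info$platform,
    recorded  = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    packages  = versions
  )
}

env <- describe_environment()
str(env)
```

Saving this list next to the output (for example with `saveRDS()`) gives a reader the "Environment" half of Method + Environment.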
Why care about reproducibility?
Academic / Research
• Verify results
• Advance Research
Business
• Production code
• Reliability
• Reusability
• Regulation
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
R and Reproducibility
Results
• Hand-assembled
• Sweave/knitr/DeployR/Shiny
Interfaces
• R GUI / DevelopR / RStudio
• Batch / Web Services
Platform
• OS / Virtualization
• Hardware Architecture
Packages
• CRAN
• BioConductor / GitHub / …
R Engine
• R Version
• Base + Recommended pkgs
Observations
• R versions are pretty manageable
– Major versions just once a year
– Patches rarely introduce incompatible changes
• Good solutions for literate programming
– Interfaces help
• OS/Hardware not the major cause of problems
• The big problem is with packages
– CRAN is in a state of continual flux
Package Problem #1: The User
http://xkcd.com/234/
I heard you need to create a TPS Report. Here, I’ve got an R script that does that already.
Oh, you need to download these 5 packages first.
I already did, and it still doesn’t work!
Well, it worked when I wrote it 3 weeks ago.
Grr. Package updates…
Package Problem #2: The Author
http://xkcd.com/970/
Time to update my package on CRAN!
>> Dependent packages that now fail to build: 67
>> Resubmit your package and try again
Crap.
Package Problem #3: The Update
http://xkcd.com/664/
3 days later…
Woot! A new version of R is out! I have 10 minutes now, time to download and install!
… package not found …
… can’t install package…
… error …
The Proposal
• Change the default way R handles packages
– Install packages local to projects
• “Snapshot” CRAN daily
– Make it easy to get & use package versions used in script development
Not a new idea!
– Ooms, “Possible Directions for Improving Dependency Versioning in R”, R Journal 5/1
– BioConductor Project
– Revolution R Enterprise
– Linux distros
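The two proposal items (project-local installs and dated snapshots) can be sketched in base R. This is an illustration under stated assumptions: the base URL `https://snapshots.example.org` is a placeholder rather than a real service, and the helper names `snapshot_repo` and `use_project_library` are hypothetical.

```r
# Hypothetical base URL for a dated CRAN snapshot server (placeholder, not a real service).
SNAPSHOT_BASE <- "https://snapshots.example.org"

# Build the repository URL for one snapshot date (YYYY-MM-DD).
snapshot_repo <- function(date, base = SNAPSHOT_BASE) {
  d <- as.Date(date)  # errors if the string is not a recognizable date
  paste0(base, "/", format(d, "%Y-%m-%d"))
}

# Create a library folder inside the project and put it first on the library path,
# so that install.packages() and library() use it before the user or system library.
use_project_library <- function(project_dir) {
  lib <- file.path(project_dir, "library")
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))
  invisible(lib)
}

lib <- use_project_library(tempfile("project"))
repo <- snapshot_repo("2014-06-27")
```

With these in place, a call such as `install.packages("ggplot2", repos = repo)` would install into the project folder from the dated repository, assuming such a repository exists at that URL.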
Example
• R script file using 6 most popular packages
Sharing a script reproducibly
… and simply
# Run with R 3.1.0
require(RRT)
mran_set(snapshot="2014-06-27")
# find packages used in this project
# get package versions used by script author
# install locally to this project
require(ggplot2)
require(data.table)
require(knitr)
# …
RRT: The R Reproducibility Toolkit
• Open Source R Package (GPLv2)
• From an R project folder:
– Detect packages & dependencies used in project
– Download and install from MRAN
– Versions selected according to script date
– Find and use packages from local install
github.com/RevolutionAnalytics/RRT
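The "detect packages used in project" step could be approximated by scanning source files for `library()`, `require()`, and `pkg::fun` references. The sketch below uses the parser's token table from base R; it is not RRT's actual implementation, the helper names are hypothetical, and it does not resolve dependencies of the packages it finds.

```r
# Hypothetical helper: list packages referenced in a character vector of R code.
packages_in_code <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  if (is.null(tokens)) return(character(0))
  tokens <- tokens[tokens$terminal, ]
  tokens <- tokens[order(tokens$line1, tokens$col1), ]

  # library(pkg) / require(pkg): the package name is two tokens after the function name.
  calls <- which(tokens$token == "SYMBOL_FUNCTION_CALL" &
                   tokens$text %in% c("library", "require"))
  args <- tokens[calls[calls + 2 <= nrow(tokens)] + 2, ]
  args <- args[args$token %in% c("SYMBOL", "STR_CONST"), ]
  attached <- gsub("[\"']", "", args$text)

  # pkg::fun and pkg:::fun references.
  qualified <- tokens$text[tokens$token == "SYMBOL_PACKAGE"]

  sort(unique(c(attached, qualified)))
}

# Scan every .R file under a project folder.
packages_in_project <- function(project_dir) {
  files <- list.files(project_dir, pattern = "\\.[Rr]$",
                      recursive = TRUE, full.names = TRUE)
  found <- lapply(files, function(f) packages_in_code(readLines(f)))
  sort(unique(as.character(unlist(found))))
}

packages_in_code(c("require(ggplot2)", "knitr::kable(head(mtcars))"))
```

Working from parser tokens rather than a regular expression means commented-out `library()` calls are ignored.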
MRAN - Implementation
A downstream CRAN mirror with daily snapshots
• Use rsync to mirror CRAN daily
– Only downloads changed packages
• Use zfs to store incremental snapshots
– Storage only required for new packages
• Organize snapshots into a labelled hierarchy
– Access package versions by date of use
• CRAN snapshot server hosted by cloud provider
– Provisioned for availability and latency
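"Access package versions by date of use" implies mapping a requested date to a stored snapshot. A minimal sketch of that lookup follows; the list of available dates, the year/month/day folder layout, and both function names are assumptions made for illustration, since the slides do not specify how the hierarchy is labelled.

```r
# Hypothetical snapshot labels, one per day on which the mirror was taken.
available <- as.Date(c("2014-06-25", "2014-06-26", "2014-06-27", "2014-06-29"))

# Pick the most recent snapshot taken on or before the requested date.
resolve_snapshot <- function(requested, available) {
  requested <- as.Date(requested)
  candidates <- available[available <= requested]
  if (length(candidates) == 0) {
    stop("no snapshot on or before ", format(requested))
  }
  max(candidates)
}

# Map a snapshot date to its folder in an assumed year/month/day hierarchy.
snapshot_path <- function(date, root = "snapshots") {
  file.path(root, format(as.Date(date), "%Y/%m/%d"))
}

snapshot_path(resolve_snapshot("2014-06-28", available))
```

Falling back to the nearest earlier snapshot keeps a script usable even if a day was missed by the mirror job.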
Future work
• Just getting started!
• Snapshot binaries and source packages
• Other repos (BioConductor, GitHub, user)
• Institution-level package duplication
– CRAN “behind the firewall”
• User-defined package versions
• Checks on R versions
• Suggestions welcome!
github.com/RevolutionAnalytics/RRT
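As one possible shape for the "checks on R versions" item, a script could stop early when the running R is older than the version it was written for. This is a sketch only; `check_r_version` is a hypothetical name, and treating the recorded version as a minimum (rather than an exact match) is an assumption.

```r
# Hypothetical helper: fail fast if the running R is older than the script expects.
check_r_version <- function(required) {
  current <- getRversion()
  if (current < required) {
    stop("script was written for R ", required,
         " or later; running R ", as.character(current))
  }
  invisible(TRUE)
}

check_r_version("3.1.0")
```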
Thank You!
David Smith
david@revolutionanalytics.com
blog.revolutionanalytics.com
Possible Solution
• Bundle all packages with scripts
• Packrat solves this very well
– Project + package dependencies stored in GitHub
• But:
– Contributes to package fragmentation
– Adds friction to the sharing process
– Doesn’t address the problem for R generally
CRAN vs GitHub
CRAN
• “Repository of Record”
– Default for R users
• Strict quality checking
• Handles dependencies
• Binaries built
– But only current versions saved
• Manual update process
• Dependent on volunteer support
GitHub
• Frictionless publishing / updates
– RStudio integration
• Social development
– Pull requests FTW
• Ease of updates
• Fragmented – no unified directory of packages
• Permanence – accounts closed / repos deleted
A downstream CRAN solution?
“I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages.”
-- R-core member, r-devel, March 2014
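Setting a different default repository, as the quote suggests, takes only the standard `repos` option. In the sketch below the URL is a placeholder assumption and would need to be replaced with a real repository before use.

```r
# Hypothetical snapshot repository URL; substitute a real repository before use.
snapshot_url <- "https://snapshots.example.org/2014-06-27"

# Replace the CRAN entry in the `repos` option, keeping any other repositories.
repos <- getOption("repos")
repos["CRAN"] <- snapshot_url
options(repos = repos)

# install.packages("ggplot2") would now look for packages at snapshot_url.
getOption("repos")[["CRAN"]]
```

Placing the same lines in a project-level `.Rprofile` would make the setting the default for every session started in that project.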
Snapshot CRAN repository: requirements
• Availability
• Latency
• Bandwidth
• Storage
• Binary package archives
• Other enhancements?
Proposal
“Development Branch” “Stable Branch”
Defaults are important!!
[Diagram labels: MRAN, CRAN, Downstream, Reproducible]

Editor's Notes

  • #3: http://xkcd.com/242/
  • #19: https://stat.ethz.ch/pipermail/r-devel/2014-March/068552.html