Alex Liu Harvard Forest Presentation

Improving RDataTracker Accessibility and Functionality: R Markdown and Caching
Siqing (Alex) Liu, Amherst College
Mentors: Barbara Lerner, Emery Boose
Harvard Forest, Summer 2016

Background: Provenance
• A record of ownership of a work of art or an antique, used as a guide to authenticity or quality (Oxford English Dictionary)
Image: The Beautiful Princess, allegedly a rediscovered work by Leonardo da Vinci

Background: Data Provenance
• Why is it important for science?
• Validity
• Reproducibility
• Reproducibility Crisis

Data Provenance in R: RDataTracker
Diagram labels: Data; R Script; Data Derivation Graph (DDG)

Problem: Organizing Data Derivation Graphs (DDGs)

Solution: R Markdown
• What is R Markdown?
• File format that allows simple creation of formatted and interactive output from R

Organizing Scripts with R Markdown

Problem: Repeated Execution
• Intensive Calculations
• Complex
• Over large data sets
• Takes a significant amount of time each run
• Minor changes take a lot of time

Solution: Caching
• What is Caching?
• Storing intermediate values so they don’t need to be reprocessed
• Problem: When to Re-execute?
• Dependencies
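The idea on this slide, storing intermediate values and re-executing only when a dependency changes, can be sketched in a few lines. This is a minimal illustration in Python rather than R, and it is not how knitr or RDataTracker implement caching. The class `StepCache`, its `run` method, and the `runs` counter are hypothetical names invented for this example. A step is re-executed only when the hash of its source text and input values differs from the stored one.

```python
import hashlib
import pickle


class StepCache:
    """Cache step results, keyed on the step's source text and input values."""

    def __init__(self):
        self._store = {}  # step name -> (dependency key, pickled result)
        self.runs = 0     # number of times a step was actually executed

    def _key(self, source, inputs):
        # Hash both the code and the values it depends on.
        digest = hashlib.sha256()
        digest.update(source.encode("utf-8"))
        digest.update(pickle.dumps(inputs))
        return digest.hexdigest()

    def run(self, name, source, func, *inputs):
        key = self._key(source, inputs)
        cached = self._store.get(name)
        if cached is not None and cached[0] == key:
            # Dependencies unchanged: reuse the stored value.
            return pickle.loads(cached[1])
        # Code or inputs changed (or first run): execute and store.
        result = func(*inputs)
        self.runs += 1
        self._store[name] = (key, pickle.dumps(result))
        return result


cache = StepCache()
data = [1, 2, 3, 4]
first = cache.run("total", "sum(data)", sum, data)          # executed
second = cache.run("total", "sum(data)", sum, data)         # served from cache
third = cache.run("total", "sum(data)", sum, data + [5])    # input changed, executed again
```

Hashing the inputs is one simple way to answer "when to re-execute?"; it trades the cost of hashing large inputs against the cost of recomputing the step.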

Caching with Knitr: Chunk-by-Chunk Dependencies

DDG: Command-by-Command Dependencies
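To make the contrast with chunk-level dependencies concrete, the sketch below derives dependencies one command at a time. It is an assumption-laden illustration in Python, not the RDataTracker algorithm: the helper names `reads_and_writes` and `command_dependencies` are hypothetical, and the analysis only looks at plain variable names (it ignores augmented assignments, attributes, and function side effects).

```python
import ast


def reads_and_writes(statement):
    """Return (names read, names assigned) for one statement of source code."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(statement)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    return reads, writes


def command_dependencies(statements):
    """Map each statement index to the earlier statements whose values it uses."""
    last_writer = {}  # variable name -> index of the statement that last assigned it
    deps = {}
    for index, statement in enumerate(statements):
        reads, writes = reads_and_writes(statement)
        deps[index] = sorted(
            {last_writer[name] for name in reads if name in last_writer}
        )
        for name in writes:
            last_writer[name] = index
    return deps


script = [
    "raw = [3, 1, 2]",
    "ordered = sorted(raw)",
    "label = 'site A'",
    "summary = (label, ordered[0])",
]
deps = command_dependencies(script)
```

Here `deps[3]` lists statements 1 and 2, so editing only the `label` line would invalidate the `summary` command but not the sort. If all four commands sat in one chunk, chunk-level tracking would re-run the whole chunk.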

Conclusion: Accessibility and Functionality
• R Markdown
• Accessibility: Directly read in R Markdown files
• Functionality: Automatically organize nodes in DDG
• Caching
• Accessibility: Make repeated execution of compute-intensive scripts faster
• Functionality: Accurate, command-by-command caching and tracing

Further Work
• More complete caching
• Compression vs. Speed
• Side Effects
• Creating a plot
• Writing a file
• Printing results to console
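Side effects matter for caching because skipping a command also skips whatever it printed, plotted, or wrote. One possible approach for the console case, sketched here in Python under stated assumptions, is to record the printed text alongside the cached value and replay it on a cache hit. The function `run_cached` and the `_cache` dictionary are hypothetical names for illustration. Plots and file writes would need their own handling, which this sketch does not attempt.

```python
import contextlib
import io

_cache = {}  # key -> (result, text the step printed when it ran)


def run_cached(key, func, *args):
    """Run func once per key; on later calls, replay its console output."""
    if key in _cache:
        result, text = _cache[key]
        print(text, end="")  # replay the recorded side effect
        return result
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args)  # capture anything the step prints
    text = buffer.getvalue()
    print(text, end="")       # still show the output on the first run
    _cache[key] = (result, text)
    return result


def mean_with_report(values):
    mean = sum(values) / len(values)
    print(f"mean = {mean}")
    return mean


value = run_cached("mean", mean_with_report, [2, 4, 6])
value_again = run_cached("mean", mean_with_report, [2, 4, 6])
```

Both calls print the same line, but `mean_with_report` itself runs only once.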

Acknowledgements
Emery Boose
Harvard Forest
Harvard University
Barbara Lerner
Dept. of Computer Science
Mount Holyoke College
