Improving RDataTracker Accessibility and Functionality: R Markdown and Caching
Siqing (Alex) Liu, Amherst College
Mentors: Barbara Lerner, Emery Boose
Harvard Forest, Summer 2016
Background: Provenance
• A record of ownership of a work of art or an antique, used as a guide to authenticity or quality (Oxford English Dictionary)
[Image: The Beautiful Princess, allegedly a rediscovered work by Leonardo da Vinci]
Background: Data Provenance
• Why is it important for science?
• Validity
• Reproducibility
• Reproducibility Crisis
Data Provenance in R: RDataTracker
[Diagram: Data, R Script, Data Derivation Graph (DDG)]
Problem: Organizing Data Derivation Graphs (DDGs)
Solution: R Markdown
• What is R Markdown?
• A file format that allows simple creation of formatted and interactive output from R
Alex Liu Harvard Forest Presentation
Organizing Scripts with R Markdown
Problem: Repeated Execution
• Intensive calculations
• Complex
• Over large data sets
• Each run takes a significant amount of time
• Even minor changes take a lot of time to re-run
[Figure panels: Original, Modified]
Solution: Caching
• What is Caching?
• Storing intermediate values so they don’t need to be reprocessed
• Problem: When to Re-execute?
• Dependencies
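The idea on this slide can be sketched in a few lines. The example below is only an illustration in Python, not RDataTracker's or knitr's implementation: the `StepCache` class, its `run` method, and the `slow_mean` function are hypothetical names invented here. It stores each intermediate value under a key derived from the step's source text and its inputs, so a step is re-executed only when one of those dependencies changes.

```python
import hashlib
import pickle


class StepCache:
    """Hypothetical cache: results are keyed by a step's source text and inputs."""

    def __init__(self):
        self._store = {}  # key -> cached result
        self.runs = 0     # number of real (non-cached) executions

    def _key(self, source, inputs):
        # The key changes whenever the code or any input value changes.
        h = hashlib.sha256()
        h.update(source.encode("utf-8"))
        h.update(pickle.dumps(sorted(inputs.items())))
        return h.hexdigest()

    def run(self, source, func, **inputs):
        key = self._key(source, inputs)
        if key not in self._store:
            self.runs += 1
            self._store[key] = func(**inputs)
        return self._store[key]


def slow_mean(values):
    # Stand-in for an expensive calculation.
    return sum(values) / len(values)


cache = StepCache()
a = cache.run("mean(values)", slow_mean, values=[1, 2, 3])     # executed
b = cache.run("mean(values)", slow_mean, values=[1, 2, 3])     # cache hit
c = cache.run("mean(values)", slow_mean, values=[1, 2, 3, 4])  # input changed, executed again
```

Only two of the three calls actually execute `slow_mean`; the second call returns the stored value because neither the code nor the input changed.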
Caching with knitr: Chunk-by-Chunk Dependencies
DDG: Command-by-Command Dependencies
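To make the contrast with chunk-level caching concrete, here is a minimal sketch of command-level dependency tracking. It is written in Python for illustration and is not how RDataTracker builds its DDG; the helper names `reads_and_writes` and `stale_commands` and the sample commands are assumptions made for this example. Each command is analyzed for the variables it reads and assigns, and after one command is edited, only the commands that read an affected variable are marked for re-execution. The sketch ignores side effects and augmented assignments.

```python
import ast


def reads_and_writes(stmt_source):
    """Return (names read, names assigned) for one statement."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(stmt_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    return reads, writes


def stale_commands(commands, changed_index):
    """Indices of commands that must re-run after commands[changed_index] is edited."""
    stale = [changed_index]
    dirty = set(reads_and_writes(commands[changed_index])[1])
    for i in range(changed_index + 1, len(commands)):
        reads, writes = reads_and_writes(commands[i])
        if reads & dirty:
            # Depends on an affected variable: re-run, and its outputs become affected too.
            stale.append(i)
            dirty |= writes
        else:
            # Unaffected command: anything it assigns is valid again.
            dirty -= writes
    return stale


commands = [
    "raw = load()",       # 0
    "clean = raw * 2",    # 1
    "other = 10",         # 2
    "total = clean + 1",  # 3
    "label = other + 1",  # 4
]

after_edit_1 = stale_commands(commands, 1)
after_edit_0 = stale_commands(commands, 0)
```

Editing command 1 marks only commands 1 and 3 for re-execution; commands 2 and 4 can reuse cached values. If all five commands sat in a single chunk, a chunk-level cache would have to re-run the whole chunk.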
ddg.cache
[Chart: Run Time; label: Modified]
Conclusion: Accessibility and Functionality
• R Markdown
• Accessibility: Directly read in R Markdown files
• Functionality: Automatically organize nodes in DDG
• Caching
• Accessibility: Make repeated execution of compute-intensive scripts faster
• Functionality: Accurate, command-by-command caching and tracing
Further Work
• More complete caching
• Compression vs. Speed
• Side Effects
• Creating a plot
• Writing a file
• Printing results to console
Acknowledgements
Emery Boose
Harvard Forest
Harvard University
Barbara Lerner
Dept. of Computer Science
Mount Holyoke College
