1
Open Data Science Conference and iRODS User Group meeting
Raminder Singh
Research Data Services
Research Technologies, Indiana University
July 7th, 2016
2
ODSC East 2016
https://www.odsc.com/boston
3
Technologies Discussed
• Julia is a high-level, high-performance dynamic programming language for technical computing with familiar syntax. It
provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical
function library.
• Stan is a platform for statistical modeling, used for data analysis and prediction in the social, biological, and
physical sciences, engineering, and business.
• Scikit-learn is a Python library with classification, regression, and clustering algorithms, including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with other
libraries like NumPy and SciPy.
• Apache Spark is a cluster computing framework whose application programming interface is centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster
of machines and maintained in a fault-tolerant way.
• Apache Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into
many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and
analysis.
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for evaluating these programs.
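
The Map/Reduce paradigm described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample fragments are assumptions made for this example. Each input fragment is mapped independently, which is what lets a real cluster execute or re-execute a fragment on any node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is mapped on its own; no fragment depends on another.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input, split into three small fragments of work.
fragments = ["open data science", "data science conference", "open data"]
counts = word_count(fragments)
print(counts)
```

In a real deployment the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one process to keep the sketch self-contained.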
4
Keynote Speakers
5
About Companies of Keynote Speakers
• Booz Allen Hamilton: Core business is the provision of management, technology, and security services
to civilian government agencies. http://www.boozallen.com/datascience
• RapidMiner: Integrated environment for machine learning, data mining, text mining, predictive analytics,
and business analytics. https://rapidminer.com/
• CrowdFlower: Data enrichment and data mining offered as software as a service. https://www.crowdflower.com/
6
Other Interesting Speakers
7
Topics for Training Workshops
• Using R for Data Analytics
– https://github.com/zachmayer/forecast
• Building a Real-time Recommender System with Spark
ML, Kafka, and the PANCAKE STACK
– http://advancedspark.com/
• Analyzing Open Data in Healthcare
using Public APIs and Reproducible Workflows
– https://github.com/jhajagos/health-open-data-workshop
8
List of Good Talks Available Online
• Kirk Borne – “2 Most Important Things in Data Science”
– https://www.opendatascience.com/conferences/odsc-east-2016-kirk-borne-the-2-most-important-things-in-data-science/
• Experiment
• Data collection
• Tomorrow’s Map Room: Data Portals
– https://www.opendatascience.com/blog/tomorrows-map-room-data-portals/
• Interactive Data Visualizations in R with Shiny and ggplot2
– https://www.opendatascience.com/conferences/odsc-east-2016-joe-cheng-zev-ross-interactive-data-visualizations-in-r-with-shiny-and-ggplot2/
• Bokeh is a Python interactive visualization library that targets modern web browsers for
presentation. It is comparable to Shiny in R or D3 in JavaScript. http://bokeh.pydata.org
– https://www.opendatascience.com/conferences/odsc-east-2016-peter-wang-interactive-viz-of-a-billion-points-with-bokeh-datashader/
• Exaptive Xap Store is an 'app store' for data applications. Exaptive is standardizing a set of libraries to be
used to create networks. http://www.exaptive.com/data-application-gallery
9
10
Objectives for Attending
• iRODS features and architecture
• User Community
• Use Cases and Solutions built over iRODS
• Future development and directions
Questions
• Can I write rules in other languages?
• Is it possible to attach it to existing storage?
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical
policy recommendations?
11
12
13
iRODS Implements Four Main Functions
Data Virtualization: iRODS provides a logical representation of files stored
in physical storage locations. We call this logical view a virtual file system,
and the capabilities it provides data virtualization.
Data Discovery: iRODS keeps information about data, called metadata, in its
catalog. Metadata is extremely useful for data discovery, that is, locating
relevant data within large data sets.
Workflow Automation: Once data is stored and available in the catalog, it
often needs to be migrated, secured, or otherwise processed.
Secure Collaboration: Data is most useful when it’s in the hands of the
right people. There is a recognized need in the public research community
to publish data sets that accompany written articles.
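
As a rough illustration of the first three functions, the toy model below maps logical paths to physical replicas (virtualization), searches by metadata (discovery), and runs a policy hook when data is registered (workflow automation). It is a sketch under stated assumptions, not iRODS code: `ToyCatalog`, `DataObject`, `replicate_raw_data`, and all paths and resource names are hypothetical, and access control (secure collaboration) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str                 # the name users see
    replicas: list                    # physical locations holding the bytes
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Hypothetical in-memory stand-in for a data catalog."""

    def __init__(self):
        self._objects = {}
        self._policies = []

    def on_register(self, policy):
        # Workflow automation: policies run whenever data is registered.
        self._policies.append(policy)

    def register(self, logical_path, physical_location, **metadata):
        obj = DataObject(logical_path, [physical_location], dict(metadata))
        self._objects[logical_path] = obj
        for policy in self._policies:
            policy(obj)
        return obj

    def resolve(self, logical_path):
        # Data virtualization: one logical name, any number of replicas.
        return self._objects[logical_path].replicas

    def find(self, **criteria):
        # Data discovery: select objects whose metadata matches all criteria.
        return sorted(
            path
            for path, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


def replicate_raw_data(obj):
    """Example policy: give every raw data object a second (archive) replica."""
    if obj.metadata.get("kind") == "raw":
        filename = obj.logical_path.rsplit("/", 1)[-1]
        obj.replicas.append("archive:/tape/" + filename)


catalog = ToyCatalog()
catalog.on_register(replicate_raw_data)
catalog.register("/zone/home/alice/run1.csv", "disk1:/vault/a1", kind="raw", project="p1")
catalog.register("/zone/home/alice/plot.png", "disk2:/vault/b7", kind="figure", project="p1")

print(catalog.resolve("/zone/home/alice/run1.csv"))
print(catalog.find(project="p1", kind="raw"))
```

The point of the design is that users only ever handle logical paths; where the replicas live, and which policies fire, stays behind the catalog.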
14
15
16
18
EMC2 Case of Adaptive Hierarchical Metadata Using MetaLnx
19
20
Getting R to talk to iRODS
Bernhard Sonderegger, Nestlé Institute of Health Sciences
• The R language is an environment with a large and highly active user community in the field of data
science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS
data objects and metadata from the R language. Information is passed to the R functions as native R
objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using
standard R constructs.
• To maximize performance and maintain a simple architecture, the implementation relies heavily on the
icommands C++ code wrapped using Rcpp bindings.
• The R-irods package has been engineered to have semantics equivalent to the icommands and can
easily be used as a basis for further customization. At NIHS we have created an ontology-aware
package on top of R-irods to ensure consistent metadata annotations and to facilitate query
construction.
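
The idea of handing metadata to analysis code as native tabular objects can be sketched as follows. The slides do not show the R-irods interface, so this is not that package's API; it is a Python illustration that assumes only that iRODS metadata consists of attribute-value-unit (AVU) triples attached to data objects. The function name `avus_to_rows` and the sample paths and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical query result: (logical path, attribute, value, unit).
avus = [
    ("/zone/study/s1.csv", "species", "human", ""),
    ("/zone/study/s1.csv", "age", "42", "years"),
    ("/zone/study/s2.csv", "species", "mouse", ""),
]


def avus_to_rows(avus):
    """Pivot AVU triples into one row (dict) per data object."""
    rows = defaultdict(dict)
    for path, attribute, value, unit in avus:
        # Append the unit when there is one, e.g. "42 years".
        rows[path][attribute] = f"{value} {unit}".strip()
    return [{"path": path, **attrs} for path, attrs in sorted(rows.items())]


rows = avus_to_rows(avus)
for row in rows:
    print(row)
```

A list of per-object rows like this maps directly onto a data frame, which is the kind of native structure the abstract describes passing to existing analysis code.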
21
22
23
24
Review
Questions
• Can I write rules in other languages?
– YES
• Is it possible to attach it to existing storage?
– YES. There are tools to load the data.
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical
policy recommendations?
– A reference implementation of the RDA recommendations is available at
https://github.com/DICE-UNC/policy-workbook. It needs some work to update and test the
policies with the latest version of iRODS.
25
iRODS User Group Meeting notes and slides
• http://irods.org/documentation/articles/irods-user-group-meeting-2016/ - Use case slides
• http://irods.org/wp-content/uploads/2016/06/technical-overview-2016-web.pdf - Tech report
• http://slides.com/irods/ - Workshop slides
• https://github.com/DICE-UNC/policy-workbook - RDA policies implementation
• http://www.cyverse.org/ - iRODS as a service
• http://irods.org/documentation/articles/ - Other articles
• http://www.odum.unc.edu/
• http://datafed.org/about/use-cases/
• http://renci.org/news/virtual-institute-for-social-research/


Editor's Notes

  • #2: Intro Slide
  • #3: Stan : http://mc-stan.org/
  • #5: CrowdFlower combines the best of human and machine intelligence to enrich data for the world's most innovative companies.
  • #12: User community