Building a Tiered Digital Storage
Environment Based on User-Defined
Metadata to Enable eResearch
David Fellinger
Data Management Technologist
iRODS Consortium
October 24, 2019
iRODS: Data Management at Scale 2
Modern RAM
(1965)
1K non-volatile
Magnetic Core
1µs access
Historical Reference
https://en.wikipedia.org/wiki/Magnetic-core_memory
Modern Hard Disk
Storage
(1956)
5MB
Random Access @3ms
Historical Reference
https://en.wikipedia.org/wiki/IBM_305_RAMAC#/media/File:IBM_350_RAMAC.jpg
Modern Digital Storage
Tape
(1951)
1.4M per 1500 feet
READ at 100 ips
Historical Reference
https://www.computerhistory.org/revolution/early-computer-companies/5/100/1491
Hierarchical Storage Management (HSM)
• Early HPC users saw an immediate need for HSM
– Technology dating to the mid-1960s enabled HPC users to apportion storage
based on access time and cost.
– CSIRO developed the Drum and Display (DAD) operating system as one of
the first HSMs in the 1960s.
• IBM and others began HSM development including Workstation Data
Save (WDS) for AIX in the early 1980s.
• All HSM software initially moved data based on file
attributes:
– File creation time and date
– File name
– Name extension
– Access controls
• All HSMs were focused on moving data between vertical tiers and not
on horizontal data distribution tiers.
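The attribute-based approach described above can be illustrated with a minimal sketch. It is not taken from any real HSM product: the tier names, the age thresholds, and the `choose_tier` function are all hypothetical, and only creation time and name extension are used as inputs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileAttributes:
    name: str
    created: datetime


def choose_tier(f: FileAttributes, now: datetime) -> str:
    """Pick a vertical tier from file-system attributes only.

    The tiers ("fast", "disk", "tape") and thresholds are assumptions
    made for illustration.
    """
    age = now - f.created
    extension = f.name.rsplit(".", 1)[-1].lower() if "." in f.name else ""
    if extension == "bak":
        return "tape"  # backups go straight to the cheapest tier
    if age > timedelta(days=365):
        return "tape"
    if age > timedelta(days=30):
        return "disk"
    return "fast"
```

Note that nothing in this policy looks at what the file contains, which is the limitation the following slides address.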
iRODS is the Next Generation HSM and Data Manager
• The Integrated Rule-Oriented Data System (iRODS) has been designed
by the iRODS Consortium with four key characteristics:
iRODS is:
• Open Source
• Distributed
• Data Centric
• Metadata Driven
Metadata Driven
• iRODS moves data based on user-defined metadata, for example:
– Latitude, longitude, altitude
– Anomalies in genomic sequences
– Data collection points
– Instrumentation details enabling the data collection
– Specific relevance in a research area
• The use of Rich Metadata enables:
– Discovery
– Data grouping based on content to enable analysis
– Data movement to analytic platforms
• iRODS is data centric: metadata can be extracted from file headers
or from the file content itself.
– Metadata extraction is based on set rules to produce a collection
– Data can be apportioned instantly based on metadata
– Metadata can include citation instances and can change dynamically
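Metadata-driven discovery and placement can be sketched in plain Python. This is a toy model, not the iRODS API: the in-memory `catalog`, the `find` and `place` functions, the logical paths, and the resource names are all hypothetical. The only detail borrowed from iRODS is that user-defined metadata is stored as attribute-value-unit (AVU) triples.

```python
from typing import NamedTuple


class AVU(NamedTuple):
    attribute: str
    value: str
    unit: str = ""


# Hypothetical catalog: logical path -> user-defined metadata.
catalog = {
    "/zone/survey/site_a.nc": [AVU("latitude", "35.9", "deg"), AVU("instrument", "lidar")],
    "/zone/survey/site_b.nc": [AVU("latitude", "-33.8", "deg"), AVU("instrument", "lidar")],
    "/zone/genomes/s1.bam": [AVU("anomaly", "chr7_deletion")],
}


def find(catalog, attribute, predicate):
    """Discovery: paths whose metadata has `attribute` with a value passing `predicate`."""
    return sorted(
        path
        for path, avus in catalog.items()
        if any(a.attribute == attribute and predicate(a.value) for a in avus)
    )


def place(avus):
    """Placement: choose a (hypothetical) target resource from metadata alone."""
    attrs = {a.attribute: a.value for a in avus}
    if "anomaly" in attrs:
        return "analysis_scratch"
    if attrs.get("instrument") == "lidar":
        return "project_disk"
    return "archive"
```

For example, `find(catalog, "latitude", lambda v: float(v) > 0)` groups the northern-hemisphere observations, and `place` decides where an object should live without consulting its name or age.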
The Rise of Sensor Data and “Big Data”
• It can be argued that sensor data has
changed the paradigm of HPC.
– Huge amounts of data must be
collected
– The data is characterized by Volume,
Variety, Velocity, and Value
– The data must be organized for
analysis
– In many instances the data must be
moved to a file system close to the
analytic element
– An analytic process must be started
only when the full data set is available
• All steps must guarantee provenance
of the collected data to assure
Veracity.
• Full automation includes moving the
results to a data distribution file
system.
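Two of the requirements above, starting analysis only once the full data set is available and guaranteeing provenance, can be combined in a small completeness gate. This is a sketch under assumptions: the `ready_for_analysis` function and the idea of a manifest of checksums recorded at collection time are illustrative, not something the slides specify.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used here as a simple integrity and provenance check."""
    return hashlib.sha256(data).hexdigest()


def ready_for_analysis(expected: dict, received: dict) -> bool:
    """True only when every expected object has arrived unmodified.

    expected: name -> checksum recorded at collection time (assumed manifest)
    received: name -> bytes that actually arrived
    """
    return all(
        name in received and checksum(received[name]) == digest
        for name, digest in expected.items()
    )
```

A workflow would poll or be notified as objects arrive and launch the analytic process only once this gate returns true.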
Arcot Rajasekar DataNet Federation Consortium
iRODS Automating, Gathering, and Organizing Data
Migration to HPC and Policy Driven Analysis
• iRODS ties to the machine scheduler to
move data at the correct time for analysis.
• Data with similar metadata characteristics are
selected for the analysis process.
• Data are moved to a parallel file system for
processing.
• iRODS moves the data but is not in the
process path.
• When the scheduler signals that the process
has concluded, iRODS can move the results
from the scratch file system to a data
distribution file system and purge the scratch space.
• iRODS can provide notifications that the
reduced or analyzed data is available.
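The stage-in, process, stage-out sequence above can be sketched as a single function. Everything here is hypothetical: the dictionary-based `catalog`, the `scratch` stand-in for a parallel file system, the `run_workflow` name, and the notification strings. A real deployment would be driven by scheduler events rather than a direct function call.

```python
def run_workflow(catalog, selection, analyze):
    """Stage in, analyze, stage out; return the notifications emitted.

    catalog: logical name -> {"location": str, "data": object}
    selection: names chosen for analysis (for example by a metadata query)
    analyze: callable standing in for the scheduled analytic process
    """
    events = []
    scratch = {}

    # Stage in: copy the selected objects to scratch (the parallel file system).
    for name in selection:
        scratch[name] = catalog[name]["data"]
    events.append(f"staged {len(scratch)} objects to scratch")

    # The analysis reads scratch directly; the data manager is not in its I/O path.
    result = analyze(list(scratch.values()))

    # On completion: publish the result, record its location, purge scratch.
    catalog["result"] = {"location": "distribution", "data": result}
    scratch.clear()
    events.append("result available on distribution")
    return events
```

Calling `run_workflow(catalog, ["x", "y"], sum)` on a catalog of numeric objects stages two objects, records the summed result under the distribution location, and returns the two notifications in order.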
Automated Management Through Synchronization to the Cloud
• iRODS can migrate data to any file system while maintaining location data in the catalog.
• Data can be synchronized to the cloud or a federated partner.
• Notifications can be provided at each step of the process and an audit report can be
generated at any time.
Packaged Capabilities Allow Tracking and Managing Data Through
Publication
• iRODS provides eight packaged capabilities
which can be configured and deployed to
serve the needs of the data center.
• Organizations can seamlessly address their
immediate needs.
• Additional capabilities can be added or
reconfiguration can occur as the need arises.
• A plugin architecture allows customization to
address any data migration need.
Automation to Enable the Establishment of an Archive
Secure Federation Enables Geographically Protected Data Archives
Deployment: CyVerse
Diagram available from: https://www.cyverse.org/about
Accessed 25 September 2019
Deployment: EUDAT CDI
Diagram available from: https://eudat.eu/eudat-cdi
Accessed 26 September 2019
Conclusion
• eResearch has evolved to accommodate sensor and
other types of “big data”.
• The use of user-defined and extracted metadata
improves the disposition of data at every level.
• iRODS can enable complete workflow control, data
lifecycle management, and present discoverable data
sets with assured traceability and reproducibility.
The iRODS Consortium (iRODS.org)
The iRODS Consortium
• Leads software development and support of iRODS
• Hosts iRODS Events
• Tiered membership model
Questions?
Thank you!
David Fellinger
davef@renci.org
iRODS.org