Active Data: Data Life Cycle Management
Across Heterogeneous Systems and
Infrastructures
Anthony Simonet, Gilles Fedak (INRIA)
Matei Ripeanu, Samer Al-Kiswany (UBC)
Kyle Chard, Ian Foster (ANL/UC)
Hot Topics in High-Performance Distributed Computing Workshop
IBM Almaden Research Center
San Jose, California
March 12, 2015
1/13
G. Fedak() Active Data March 12, 2015
Big Data ...
Huge and growing volume of information originating from multiple sources:
Instruments, Simulations, Big Science, Open Data, Internet
... or Big Bottlenecks?
How to scale the infrastructure?
End-to-end performance improvement, inter-system optimization.
How to improve the productivity of data-intensive scientists?
Data-oriented programming languages, data quality, improved automation and error recovery.
Data Life Cycle
Definition
Data Life Cycle (DLC) is the course of operational stages through which
data pass from the time when they enter a set of systems to the time
when they leave it.
[Figure: data life cycle stages: Acquisition, Preprocessing, Storage, Analysis]
Challenges:
Expose a high-level view of the DLC across distributed systems and infrastructures
Expose interactions between the infrastructure and the DLC (e.g., failures)
Active Data
Active Data:
Allows reasoning about data sets handled by heterogeneous software and infrastructures.
A formal model that captures the essential life cycle stages and properties: creation, deletion, faults, replication, error checking, ...
A programming model for easily developing data life cycle management applications.
Allows legacy systems to expose their intrinsic data life cycle.
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions: data operations
[Figure: Petri net with places Created, Written, Read and Terminated and transitions t1 to t4; the token (•) is in Created.]
Each token has a unique identifier, corresponding to the actual data
item’s.
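The model described above can be sketched in a few lines of Java. This is a minimal illustration and not the Active Data API: the class and method names (`LifeCycleSketch`, `Transition`, `create`, `fire`, `placeOf`) are hypothetical, and the arcs chosen for t1 to t4 are an assumption, since the slide does not spell them out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Petri-net life cycle model; not the Active Data API.
class LifeCycleSketch {
    // A transition (data operation) moves a token from one place to another.
    record Transition(String name, String from, String to) {}

    private final Map<String, Transition> transitions = new HashMap<>();
    // Token identifier (the data item's identifier) -> current place (data state).
    private final Map<String, String> marking = new HashMap<>();

    void addTransition(String name, String from, String to) {
        transitions.put(name, new Transition(name, from, to));
    }

    // A new data item enters the life cycle in the "Created" place.
    void create(String tokenId) {
        marking.put(tokenId, "Created");
    }

    // Fire a transition for one token; it must be enabled, i.e. the token
    // must currently sit in the transition's input place.
    void fire(String transitionName, String tokenId) {
        Transition t = transitions.get(transitionName);
        if (t == null) {
            throw new IllegalArgumentException("Unknown transition " + transitionName);
        }
        if (!t.from().equals(marking.get(tokenId))) {
            throw new IllegalStateException(transitionName + " is not enabled for " + tokenId);
        }
        marking.put(tokenId, t.to());
    }

    String placeOf(String tokenId) {
        return marking.get(tokenId);
    }

    // Assumed arcs for the four transitions shown on the slide.
    static LifeCycleSketch demoModel() {
        LifeCycleSketch m = new LifeCycleSketch();
        m.addTransition("t1", "Created", "Written");
        m.addTransition("t2", "Written", "Read");
        m.addTransition("t3", "Read", "Terminated");
        m.addTransition("t4", "Written", "Terminated");
        return m;
    }

    public static void main(String[] args) {
        LifeCycleSketch m = demoModel();
        m.create("file-42");
        m.fire("t1", "file-42");
        System.out.println(m.placeOf("file-42")); // Written
        m.fire("t2", "file-42");
        m.fire("t3", "file-42");
        System.out.println(m.placeOf("file-42")); // Terminated
    }
}
```

Keying the marking by token identifier mirrors the idea that each token stands for exactly one data item, so the state of any item can be looked up directly.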
[Figure: same Petri net; the token (•) is now in Written.]
A transition is fired whenever a data state changes.
public void handler() {
    computeMD5();
}
Clients can attach code to transitions.
The code is executed whenever the transition fires.
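As a rough illustration of attaching code to a transition, the sketch below registers a handler that computes an MD5 digest each time a transition fires. It is not the Active Data API: `HandlerSketch`, `subscribe`, `fire` and `md5Hex` are hypothetical names, and passing the data as a byte array is an assumption made to keep the example self-contained. Only the JDK's `MessageDigest` is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of transition handlers; not the Active Data API.
class HandlerSketch {
    // Transition name -> handlers that clients attached to it.
    private final Map<String, List<Consumer<byte[]>>> handlers = new HashMap<>();

    void subscribe(String transition, Consumer<byte[]> handler) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    // Called by the system when the transition fires for a data item.
    void fire(String transition, byte[] data) {
        for (Consumer<byte[]> h : handlers.getOrDefault(transition, List.of())) {
            h.accept(data);
        }
    }

    // Hex-encoded MD5 digest of the data.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        HandlerSketch ad = new HandlerSketch();
        ad.subscribe("t1", data -> System.out.println("MD5 = " + md5Hex(data)));
        ad.fire("t1", "hello".getBytes(StandardCharsets.UTF_8));
    }
}
```

Handlers registered for other transitions are simply not invoked, which is what lets several clients observe the same life cycle independently.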
Active Data Framework
[Figure: Life Cycle View spanning File, File transfer, Dataset and Metadata life cycles, with tagged tokens, a guard, code execution and notification.]
Framework features:
Captures data events in legacy systems
High-level, life-cycle-centered view of data
Single namespace for all the files, datasets and metadata
Powerful filters based on data tags
Taggers can be installed on transitions
Guarded transitions: handlers execute only on tokens that carry specific tags
Publish/subscribe transitions
Custom user reactions to data progress
Custom code execution
Custom notifications (Twitter, email, gdoc, IFTTT, ...)
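The tagging and guard features listed above can be sketched as follows. This is an illustration under stated assumptions, not the Active Data API: `GuardSketch`, `Token`, `Subscription` and `publish` are hypothetical, the transition names `File.Created` and `File.Transferred` are invented for the example, and the tagging rule (files ending in `.h5`) is arbitrary. The argument order of `subscribeTo` (handler, then guard) follows the handler code shown later in the deck.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of tagged tokens and guarded subscriptions; not the Active Data API.
class GuardSketch {
    static final class Token {
        final String uid;
        final Set<String> tags = new HashSet<>();
        Token(String uid) { this.uid = uid; }
    }

    // A subscription pairs a transition name with a guard and a handler.
    record Subscription(String transition, Predicate<Token> guard, Consumer<Token> handler) {}

    private final List<Subscription> subscriptions = new ArrayList<>();

    void subscribeTo(String transition, Consumer<Token> handler, Predicate<Token> guard) {
        subscriptions.add(new Subscription(transition, guard, handler));
    }

    // Publish a transition: run every handler subscribed to it whose guard accepts the token.
    void publish(String transition, Token token) {
        for (Subscription s : subscriptions) {
            if (s.transition().equals(transition) && s.guard().test(token)) {
                s.handler().accept(token);
            }
        }
    }

    public static void main(String[] args) {
        GuardSketch ad = new GuardSketch();
        // Tagger: a handler installed on a transition that tags HDF5 files (assumed rule).
        ad.subscribeTo("File.Created",
                t -> { if (t.uid.endsWith(".h5")) t.tags.add("hdf5"); },
                t -> true);
        // Guarded handler: only runs for tokens carrying the "hdf5" tag.
        ad.subscribeTo("File.Transferred",
                t -> System.out.println("notify: " + t.uid),
                t -> t.tags.contains("hdf5"));

        for (Token t : List.of(new Token("scan-1.h5"), new Token("notes.txt"))) {
            ad.publish("File.Created", t);
            ad.publish("File.Transferred", t);
        }
        // prints "notify: scan-1.h5" only
    }
}
```

Separating the tagger from the guarded handler means the filtering rule can change without touching the code that reacts to it.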
Use Case: Advanced Photon Source
[Figure: APS workflow across Detector, Local Storage, Compute Cluster, Globus and Globus Catalog]
1. Local transfer
2. Extract metadata
3. Globus transfer
4. Swift parallel analysis
3 to 5 TB of data per week on this detector
Raw data are pre-processed and registered in the Globus Catalog:
Data are curated by several applications
Data are shared among scientific users
Data Surveillance Framework
4 goals (that would otherwise require a lot of scripting and hacking):
Monitoring Data Set Progress
Better Automation
Sharing & Notification
Error Discovery & Recovery
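The first goal, monitoring data set progress, could be approached by recording the last place reached by each file and summarising the counts per place. The sketch below is only an assumption about how such a monitor might look: `ProgressSketch`, `onTransition` and `summary` are hypothetical names and not part of Active Data; the place names reuse the "System.Place" form that appears in the deck's handler code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of data set progress monitoring; not the Active Data API.
class ProgressSketch {
    // File identifier -> last life cycle place reported for that file.
    private final Map<String, String> placeOfFile = new LinkedHashMap<>();

    // Record that a transition moved a file into a new place.
    void onTransition(String fileId, String newPlace) {
        placeOfFile.put(fileId, newPlace);
    }

    // Number of files currently in each place, sorted by place name.
    Map<String, Integer> summary() {
        Map<String, Integer> counts = new TreeMap<>();
        for (String place : placeOfFile.values()) {
            counts.merge(place, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ProgressSketch monitor = new ProgressSketch();
        monitor.onTransition("f1", "Detector.Created");
        monitor.onTransition("f2", "Detector.Created");
        monitor.onTransition("f1", "Shared storage.Created");
        System.out.println(monitor.summary());
        // {Detector.Created=1, Shared storage.Created=1}
    }
}
```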
APS Data Life Cycle Model
[Figure: Petri-net life cycle models for Detector, Globus transfer, Shared storage, Globus transfer (second instance), Globus Catalog and Swift. Places shown: Created, Succeeded, Failed, Terminated. Transitions shown: Start transfer, End transfer, Start Swift, Success, Failure, End, Extract, Update, Remove, Initialize, Set, Derive.]
Data life cycle model composed of 6 systems.
Error Detection & Recovery
Example scenario
Recover from system-wide errors: faulty acquired files are detected only
after Swift fails to process them.
In this situation, the user manually:
Drops the whole dataset
Removes any associated file and metadata
Re-acquires the dataset using the same parameters
E.D.&R. implementation
Use case: APS data life cycle model
[Figure: two Active Data clients attached to the APS life cycle model. One installs a Tagger that filters data likely to fail. The other installs a Guard and a Failure and Recovery Handler that removes metadata from the catalog by running the Globus catalog UI scripts.]
Handler Code
// Assumes java.io.File, java.io.IOException and org.apache.commons.io.FileUtils
// are imported, and that `ad` is the Active Data client instance.
TransitionHandler handler = new TransitionHandler() {
    public void handler(Transition t, boolean isLocal, Token[] inTokens, Token[] outTokens) {
        // Get the dataset identifier
        LifeCycle lc = ad.getLifeCycle(inTokens[0]);
        String datasetId = String.valueOf(lc.getTokens("Shared storage.Created")[0].getUid());
        try {
            // Remove the dataset annotations from the catalog
            String url = "https://catalog.globus.org/dataset/" + datasetId;
            Process p = Runtime.getRuntime().exec("catalog_client.py remove " + url);
            p.waitFor();
            // Remove the local copy of the dataset (Java does not expand "~")
            String path = System.getProperty("user.home") + "/aps/" + datasetId;
            FileUtils.deleteDirectory(new File(path));
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException("Cleanup failed for dataset " + datasetId, e);
        }
        // Publish the "Detector.End" transition
        Token root = lc.getTokens("Detector.Created")[0];
        ad.publishTransition("Detector.End", lc);
        // Notify the user
        sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId);
    }
};
// Guard: the handler runs only for tokens carrying the failure tag
HandlerGuard guard = new HandlerGuard() {
    public boolean accept(Transition t, Token[] inTokens, Token[] outTokens) {
        return inTokens[0].hasTag(" f a i l u r e c o r r u p t e d ");
    }
};
ad.subscribeTo("Swift.Failure", handler, guard);
Conclusion
Active Data
exposes the data life cycle across heterogeneous systems and infrastructures
provides a transition-based programming model for DLC management applications
supports monitoring, automation, error detection & recovery
enables cross-system optimizations: incremental computing, data staging, caching, throttling, etc.
Perspectives:
Use AD to deploy data management software stack on IaaS (Asma
Ben Cheick, Heithem Abbes, Univ. Tunis)
Big Data Apache stack X-optimization (H. He, CAS, Beijing)
Volunteer & crowd computing (M. Moca, BBU, Romania)
Thank you!
Questions?
