eXtreme DataCloud is co-funded by the Horizon2020
Framework Program – Grant Agreement 777367
Copyright © Members of the XDC Collaboration, 2017-2020
Data Management for extreme scale computing
The XDC project
Daniele Cesini
daniele.cesini<at>extreme-datacloud.eu
XDC Objectives
The eXtreme DataCloud is a software development and
integration project
Develops scalable technologies for federating storage
resources and managing data in highly distributed computing
environments
Focus on efficient, policy-driven and Quality of Service-based data management (DM)
The targeted platforms are the current and next-generation e-Infrastructures deployed in Europe
European Open Science Cloud (EOSC)
The e-infrastructures used by the represented communities
24/01/2018 D.Cesini - The eXtreme DataCloud Project 2
XDC Foundations
XDC builds on
the INDIGO data management activity
the experience of the project partners in data management
Improve existing, production-quality federated data management services
By adding missing functionalities requested by research communities
The results must be coherently harmonized across the European e-Infrastructures
INDIGO PaaS Orchestrator
INDIGO CDMI Server
Represented research communities
XDC Consortium
8 partners, 7 countries
7 research communities represented + EGI
XDC total budget: 3.07 million euros
XDC started on 1 November 2017 and will run for 27 months
ID  Partner      Country  Represented community         Tools and systems
1   INFN (Lead)  IT       HEP/WLCG                      INDIGO-Orchestrator, INDIGO-CDMI(*)
2   DESY         DE       Research with Photons (XFEL)  dCache
3   CERN         CH       HEP/WLCG                      EOS, DYNAFED, FTS
4   AGH          PL                                     ONEDATA
5   ECRIN        [ERIC]   Medical data
6   UC           ES       Lifewatch
7   CNRS         FR       Astro [CTA and LSST]
8   EGI.eu       NL       EGI communities
The New Functionalities
Intelligent & Automated Dataset Distribution
Orchestration to realize a policy-driven data management
Data distribution policies based on Quality of Service (e.g. disk vs tape vs SSD)
supporting geographically distributed resources (cross-site)
Software lifecycle management
Data pre-processing during ingestion
Data management based on access patterns
Move unused data to 'glacier-like' storage and "hot" data to fast storage, at the infrastructure level
Smart caching
Transparent access to remote data without the need for an a priori copy
Metadata management
Sensitive data handling
secure storage and encryption
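As an illustration of data management based on access patterns, here is a minimal sketch of a tiering rule. The tier names, the thresholds and the FileStats fields are all assumptions made for this example; the slides do not specify how XDC decides what counts as "hot" or unused data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical QoS classes; the slides only mention disk, tape and SSD.
FAST, STANDARD, ARCHIVE = "ssd", "disk", "tape"


@dataclass
class FileStats:
    path: str
    last_access: datetime
    reads_last_week: int


def choose_tier(stats: FileStats, now: datetime,
                hot_reads: int = 10,
                cold_after: timedelta = timedelta(days=180)) -> str:
    """Pick a storage tier from simple access statistics (illustrative thresholds)."""
    if stats.reads_last_week >= hot_reads:
        return FAST        # "hot" data goes to fast storage
    if now - stats.last_access >= cold_after:
        return ARCHIVE     # unused data goes to 'glacier-like' storage
    return STANDARD


now = datetime(2018, 1, 24)
hot = FileStats("/data/run42.root", now - timedelta(hours=1), reads_last_week=25)
cold = FileStats("/data/run01.root", now - timedelta(days=400), reads_last_week=0)
warm = FileStats("/data/run30.root", now - timedelta(days=20), reads_last_week=2)
print(choose_tier(hot, now), choose_tier(cold, now), choose_tier(warm, now))
```

A real policy engine would likely take such thresholds from per-community configuration rather than from function defaults.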
Policy-driven Data Management
Intelligent & Automated Dataset Distribution
A typical workflow
Initially the data will be stored on low-latency devices for fast access
To ensure data safety, the data will be replicated to a second storage device and migrated to custodial systems, which might be tape or S3 appliances
Eligible users will get permission to restore archived data if necessary
After a grace period, access control will be changed from "private" to "open access"
Data management based on access patterns
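The typical workflow above can be sketched as a small policy function. This is a toy model under stated assumptions: the Dataset class, the location labels, the "restorer" role and the one-year grace period are hypothetical names and values chosen for illustration, not part of any XDC component.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Dataset:
    name: str
    created: date
    # Data start on low-latency devices for fast access.
    locations: set = field(default_factory=lambda: {"low-latency"})
    access: str = "private"


def apply_policy(ds: Dataset, today: date,
                 grace: timedelta = timedelta(days=365)) -> Dataset:
    """Apply the workflow steps; calling it repeatedly is harmless."""
    ds.locations.add("replica")      # second copy for data safety
    ds.locations.add("custodial")    # tape or S3-style archive
    if today - ds.created >= grace:
        ds.access = "open access"    # embargo ends after the grace period
    return ds


def can_restore(user_roles: set, ds: Dataset) -> bool:
    """Only eligible users may restore archived data (hypothetical role name)."""
    return "custodial" in ds.locations and "restorer" in user_roles


ds = Dataset("survey-2017", created=date(2017, 11, 1))
apply_policy(ds, today=date(2018, 1, 24))
early_access = ds.access
apply_policy(ds, today=date(2019, 1, 24))
late_access = ds.access
print(early_access, "->", late_access, sorted(ds.locations))
```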
Data pre-processing
Data pre-processing during ingestion
Automatically run user-defined applications and workflows when data are uploaded
e.g. for skimming, indexing, metadata extraction, consistency checks
Implement a solution to discover new data at specific locations
Create the functions to request the INDIGO PaaS Orchestrator to execute specific applications on the computing resources of the infrastructure
Implement a high-level workflow engine that will execute applications defined by the users
Implement the data mover to store the processed data in the final destination
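The pre-processing steps listed above (discover new data, run a user-defined application, move the result to its final destination) can be sketched locally as follows. This is only an illustration: discovery is done by naive directory polling, and the user application is called as a plain Python function where the real system would submit a request to the INDIGO PaaS Orchestrator. The function names discover_new, ingest and line_index are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Set


def discover_new(inbox: Path, seen: Set[str]) -> List[Path]:
    """Return files in `inbox` that were not seen before (simple polling)."""
    fresh = sorted(p for p in inbox.iterdir() if p.is_file() and p.name not in seen)
    seen.update(p.name for p in fresh)
    return fresh


def ingest(inbox: Path, final: Path,
           app: Callable[[Path], Path], seen: Set[str]) -> List[Path]:
    """Run the user-defined `app` on each new file and move its output to `final`."""
    stored = []
    for src in discover_new(inbox, seen):
        result = app(src)                      # stand-in for an orchestrated job
        target = final / result.name
        shutil.move(str(result), str(target))  # the "data mover" step
        stored.append(target)
    return stored


def line_index(src: Path) -> Path:
    """Example user application: write a tiny index (line count) for the file."""
    out = src.with_suffix(".index")
    out.write_text(f"{src.name} {len(src.read_text().splitlines())}\n")
    return out


with tempfile.TemporaryDirectory() as tmp:
    inbox, final = Path(tmp, "inbox"), Path(tmp, "final")
    inbox.mkdir()
    final.mkdir()
    (inbox / "events.txt").write_text("a\nb\nc\n")
    seen: Set[str] = set()
    first = ingest(inbox, final, line_index, seen)
    second = ingest(inbox, final, line_index, seen)  # nothing new the second time
    index_text = (final / "events.index").read_text()
    print([p.name for p in first], second, index_text.strip())
```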
Smart caching
Develop a global caching infrastructure supporting the following building
blocks:
dynamic integration of satellite sites by existing data centres
creation of standalone caches modelled on existing web solutions
federation of the above to create a large scale caching infrastructure
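To make the idea of a standalone cache concrete, here is a minimal read-through cache with least-recently-used eviction. It is a generic sketch, not the XDC caching design: the ReadThroughCache class, the byte-based capacity and the in-memory "origin" dictionary standing in for a remote data centre are all assumptions of this example.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve reads locally when possible, otherwise fetch from the origin and keep a copy."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int):
        self._fetch = fetch
        self._capacity = capacity_bytes
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self._size = 0
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)   # mark as most recently used
            return self._items[key]
        self.misses += 1
        data = self._fetch(key)            # transparent access to the remote copy
        if len(data) <= self._capacity:    # objects larger than the cache are not kept
            self._items[key] = data
            self._size += len(data)
            while self._size > self._capacity:
                _, evicted = self._items.popitem(last=False)  # drop least recently used
                self._size -= len(evicted)
        return data


# Hypothetical origin: three 4-byte objects held by a remote site.
origin = {"a": b"x" * 4, "b": b"y" * 4, "c": b"z" * 4}
cache = ReadThroughCache(origin.__getitem__, capacity_bytes=8)
for key in ["a", "b", "a", "c", "b"]:
    cache.read(key)
print(cache.hits, cache.misses)
```

Federating several such caches would additionally require a way to locate which cache holds an object, which this sketch does not attempt.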
Onedata developments
Unified data access platform at the PaaS level, at exascale
Multi-region support in Onedata
Advanced metadata management with no pre-defined schema
Encryption Services and Secure Storage
Sensitive data management and key storage within Onedata
Metadata handling use cases
LIFEWATCH
Metadata management to handle heterogeneous and large datasets
Different data types, formats, sources and ways to access
e.g. Copernicus data: ~16 PB per year
Used as input for water quality forecasting systems
Use of standards like EML (Ecological Metadata Language) and adoption of best practices like the FAIR+R principles
CTA
The CTA distributed archive is based on the "Open Archival Information System" (OAIS) ISO standard. Event data are in files (FITS format) containing all metadata.
Metadata are extracted from the ingested files and automatically loaded into the metadata database.
Metadata will be used for subsequent queries of the archive.
The system should be able to manage replicas, tapes, disks, etc., with data from low level to high level.
ECRIN
Clinical trial data objects available for sharing with others
a variety of access mechanisms
a wide variety of different locations:
a growing number of general and specialised data repositories
trial registries
publications
the original researchers' institutions
'discoverability' will become much worse in the future as more and more material is made available for sharing
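For the CTA use case, where metadata are extracted from ingested FITS files, the following sketch shows what a minimal extraction step could look like. It relies only on the basic FITS header layout (80-character cards, keyword in the first 8 columns, "= " as value indicator, an END card at the end) and deliberately ignores features such as CONTINUE cards; production code would more likely use an existing FITS library. The sample header and its TELESCOP value are invented for the example.

```python
def parse_fits_header(raw: bytes) -> dict:
    """Parse simple keyword/value cards from a FITS header (simplified sketch)."""
    meta = {}
    for offset in range(0, len(raw), 80):
        card = raw[offset:offset + 80].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY and blank cards carry no keyword value
        text = card[10:].strip()
        if text.startswith("'"):
            # String value: runs to the closing quote ('' is an escaped quote).
            end = 1
            while True:
                end = text.index("'", end)
                if text[end:end + 2] == "''":
                    end += 2
                    continue
                break
            value = text[1:end].replace("''", "'").rstrip()
        else:
            token = text.split("/", 1)[0].strip()  # drop the trailing comment
            if token in ("T", "F"):
                value = token == "T"
            else:
                value = token  # fall back to the raw text if it is not numeric
                for cast in (int, float):
                    try:
                        value = cast(token)
                        break
                    except ValueError:
                        pass
        meta[keyword] = value
    return meta


def make_card(text: str) -> bytes:
    """Pad a header line to the fixed 80-byte card length."""
    return text.ljust(80).encode("ascii")


# Invented sample header; real headers are also padded to 2880-byte blocks.
header = b"".join([
    make_card("SIMPLE  =                    T / conforms to FITS standard"),
    make_card("BITPIX  =                    8"),
    make_card("NAXIS   =                    0"),
    make_card("TELESCOP= 'CTA-sim '           / hypothetical value"),
    make_card("EXPOSURE=               1800.5 / seconds"),
    make_card("COMMENT illustrative header only"),
    make_card("END"),
])
metadata = parse_fits_header(header)
print(metadata)
```

The resulting dictionary is the kind of record that could then be loaded into a metadata database and queried.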
XDC high level architecture
Project Structure
WP1: Project Management (NA1)
WP2: New functionalities definition and Research Communities support (NA2)
WP3: Software Lifecycle Management, Pilot Infrastructures & Exploitation (SA1)
WP4: Orchestration and policy-driven data management (JRA1)
WP5: Unified cross-federation data management (JRA2)
Management bodies
The ELG is responsible for maintaining active relationships with the infrastructure and technology providers, discussing synergies, strategies, roadmaps and the requirements workflow for the software released by the project.
The plan for the next couple of years
Main Milestones
Research communities' requirements for new functionalities collected
Research communities' requirements analysis performed
Project architecture detailed
Development schedule defined
Event with User Communities
XDC reference release 1
XDC reference release 2
Functionalities and scalability demonstrated
Timeline markers on the slide: PM3, PM6, PM9 (joint with DEEP in Santander), PM12, PM24, PM27
Conclusion
XDC has an ambitious development plan for data management services
We want to support very diverse use cases and requirements
We really need a modular and flexible approach to building our platform
We will support standard protocols as much as possible to make the solutions as general as possible
Sustainability of the products:
Contribute all the changes developed by XDC upstream to the original projects
Involve the user communities in exploiting the XDC outputs in their production environments
Push XDC developments into the EOSC Service Catalogue
XDC Contacts
Website: www.extreme-datacloud.eu
@XtremeDataCloud on Twitter
Mailing list: info<at>extreme-datacloud.eu

The Extreme Data Cloud (XDC) Project
