www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Big Data Solutions for the Climate Community
ESGF – ENES – GEF – EGI interoperability
Xavier Pivan, CERFACS
Christian Page, CERFACS
EUDAT Conference, Porto
22 – 25 January 2018
Summary
Context & motivation. Introduction to the Generic Execution Framework (GEF)
GEF – EGI Federated Cloud interoperability
ENES use case: live demo or video
Performance quantification
ESGF – ENES – (GEF – EGI) interoperability
Context & Motivation
GEF presentation
Current situation
Climate Research Community
• Data available for scientific analysis: a strong upward trend
  • Limitations in data access mean limitations in data analytics and scientific results
• Download locally, then analyze: a workflow that cannot be sustained
  • Climate researchers
  • Impact researchers
Current situation
Practical Example: Climate Community
Federation Service
• Temperature at 850 hPa field
• 10 climate models
• 1960-1990 & 2040-2070 = 60 years = 21 915 days
• Daily fields = 1 field per day
• Global spatial scale 100 km resolution
TOTAL: 6 754 500 fields to download
~100 KB per 2D field = 626 GB
After analysis and post-processing
• Anomaly of the average of the two periods over a specific country for each climate model
• Result: ten 2D fields over a small domain
• Estimated data size after post-processing: 1 MB
Data reduction...
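The data reduction in this example can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide and assumes decimal units (1 KB = 1000 bytes), which the slide does not specify. With that assumption the field count times the field size comes out somewhat above the quoted 626 GB, but of the same order of magnitude. The function names are illustrative.

```python
# Figures quoted on the slide. Decimal units (1 KB = 1000 bytes) are an assumption.
FIELDS_TO_DOWNLOAD = 6_754_500   # "TOTAL" number of 2D fields
KB_PER_FIELD = 100               # approximate size of one 2D field
RAW_VOLUME_GB = 626              # download volume quoted on the slide
RESULT_VOLUME_MB = 1             # estimated size after post-processing


def estimated_volume_gb(fields, kb_per_field):
    """Return the download volume in GB for equally sized fields."""
    return fields * kb_per_field / 1_000_000  # KB -> GB


def reduction_factor(raw_gb, result_mb):
    """Return how many times smaller the result is than the raw download."""
    return raw_gb * 1_000 / result_mb  # convert GB to MB before dividing


if __name__ == "__main__":
    volume = estimated_volume_gb(FIELDS_TO_DOWNLOAD, KB_PER_FIELD)
    factor = reduction_factor(RAW_VOLUME_GB, RESULT_VOLUME_MB)
    print(f"Estimated download: {volume:.2f} GB")
    print(f"Reduction factor:   {factor:,.0f}x")
```

Whatever unit convention is used, the result is several hundred thousand times smaller than the download, which is the point of the slide.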
Current situation
Status of the CMIP5 data archive:
1.8 PB for 59,000 data sets, stored in 4.3 million files on 23 ESGF data nodes
CMIP5 data is about 50 times the CMIP3 volume
Extrapolation to CMIP6:
CMIP6 has a more complex experiment structure than CMIP5.
Expectations: more models, finer spatial resolution and larger ensembles
Factor of 20: 36 PB in 86 million files
Factor of 50: 90 PB in 215 million files
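The two CMIP6 scenarios are a plain scaling of the CMIP5 figures. A minimal sketch of that arithmetic follows, assuming volume and file count scale by the same factor, as the slide does. The function name is illustrative.

```python
# CMIP5 archive figures quoted on the slide.
CMIP5_VOLUME_PB = 1.8
CMIP5_FILES_MILLION = 4.3


def extrapolate(factor):
    """Scale the CMIP5 volume (PB) and file count (millions) by one factor."""
    volume_pb = round(CMIP5_VOLUME_PB * factor, 1)
    files_million = round(CMIP5_FILES_MILLION * factor, 1)
    return volume_pb, files_million


if __name__ == "__main__":
    for factor in (20, 50):
        volume_pb, files_million = extrapolate(factor)
        print(f"Factor of {factor}: {volume_pb} PB in {files_million} million files")
```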
GEF – a few years ago
About the EUDAT Generic Execution Framework
Generic tool to encapsulate calculations using Docker technology
Generic -> convenient for all communities (Climate, Earth Science or Literature)
Community admins can create specific services
Generic Execution Framework (GEF)
[Figure from the GEF GitHub repository]
GEF – EGI interoperability
GEF – EGI interoperability
European Grid Infrastructure (EGI)
Federation of cloud e-infrastructures
World-wide but mainly in Europe
Generate large computing resources on EGI
Use the GEF Docker rule engine on EGI
=> How to generate resources?
=> How to make EGI computing resources and GEF interoperable?
Generate EGI resources & set up GEF on EGI
Inputs
• Docker image (GEF service repository)
• User x509 credential
• User private key
• Contextualisation file (.yml)
• Imported JSON file with inputs from the EGI AppDB: endpoint, resource template, operating system
• Some other inputs: …
jOCCI API (Java)
• Create proxy
• Instantiate VM with EGI AppDB: returns the VM id (URL)
• Describe VM attributes using the VM id: public IP, VM computing state
GEF set-up on the VM
1) Create client/server TLS certificate
2) Bind EGI's Docker daemon to the VM's IP & port 2376
3) Start the GEF with the EGI VM Docker endpoint + client certificate path
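Step 3 needs two things: the Docker endpoint of the VM and the path to the client certificate. The slide does not show how GEF itself consumes them, so the sketch below only illustrates the same pairing with the standard Docker client environment variables (`DOCKER_HOST`, `DOCKER_TLS_VERIFY`, `DOCKER_CERT_PATH`). The function name, the example IP address and the certificate directory are hypothetical.

```python
from typing import Dict


def docker_client_env(vm_ip: str, cert_dir: str, port: int = 2376) -> Dict[str, str]:
    """Build the environment a Docker client needs to reach a TLS-protected daemon.

    cert_dir is expected to hold ca.pem, cert.pem and key.pem, which is the
    layout the Docker CLI looks for under DOCKER_CERT_PATH.
    """
    return {
        "DOCKER_HOST": f"tcp://{vm_ip}:{port}",  # daemon bound to the VM's IP and port
        "DOCKER_TLS_VERIFY": "1",                # verify the daemon's certificate
        "DOCKER_CERT_PATH": cert_dir,            # client certificate and key location
    }


if __name__ == "__main__":
    # Hypothetical values: a documentation-range IP and a placeholder directory.
    for name, value in docker_client_env("203.0.113.10", "/path/to/certs").items():
        print(f"export {name}={value}")
```

Port 2376 is the conventional port for a Docker daemon exposed over TLS, which matches the binding in step 2.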
ENES use case: icclim
icclim stands for: Index Calculation CLIMate
Open-source Python library developed by CERFACS for the climate community
Performs a calculation on a netCDF file (e.g. an average)
Returns the result in a new netCDF file
Encapsulate the icclim calculation
ENES_UseCase_icclim.py
Averages the netCDF file downloaded inside the container and returns a new netCDF file
Data input
• File format: netCDF
• Variable: tas
• Calculation: mean
Data output
• Format: netCDF
• Variable: tas
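The calculation itself is a mean over time of the `tas` variable. The sketch below shows that operation on plain nested lists. It is not the icclim API and does no netCDF input or output: the function name and the tiny input are placeholders standing in for the real script.

```python
from statistics import fmean


def time_mean(fields):
    """Average a sequence of equally shaped 2D fields cell by cell.

    fields is a list of time steps; each time step is a list of rows.
    """
    if not fields:
        raise ValueError("at least one time step is required")
    n_rows = len(fields[0])
    n_cols = len(fields[0][0])
    return [
        [fmean(step[i][j] for step in fields) for j in range(n_cols)]
        for i in range(n_rows)
    ]


if __name__ == "__main__":
    # Two placeholder time steps on a 2 x 2 grid (not real temperature data).
    tas = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[3.0, 4.0], [5.0, 6.0]],
    ]
    print(time_mean(tas))
```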
ENES use case: big picture before the demo
Time to relax
GEF – live demo – video
GEF – EGI performance quantification
Measurement Methodology
The ENES use case workflow is split into two main phases:
• Downloading the data
• Computing on the data
Quantification of the performance:
• Measuring the time of each phase
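Timing the two phases separately can be sketched as follows. This is an illustration of the methodology, not the measurement code used for the talk: `timed` and `run_workflow` are hypothetical names, and the download and compute steps are passed in as callables.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, timings):
    """Record the wall-clock duration of a phase, in seconds, under its name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


def run_workflow(download, compute):
    """Run the download phase, then the compute phase, timing each one."""
    timings = {}
    with timed("download", timings):
        data = download()
    with timed("compute", timings):
        result = compute(data)
    return result, timings


if __name__ == "__main__":
    # Placeholder phases standing in for the real download and icclim calculation.
    result, timings = run_workflow(lambda: [1, 2, 3], lambda d: sum(d) / len(d))
    print(result, timings)
```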
GEF – EGI performance quantification
Comparison between EGI, CERFACS and a domestic network on a 3.41 GB dataset
EGI download speed is 7 to 10 times faster
Computing is about 2 to 3 times faster
Weakness of this quantification: only 1 sample
[Figures: network comparison; computing performance comparison]
GEF – EGI performance quantification
Experiment: run the ENES use case on data volumes ranging from 1 GB to 200 GB (multiple files)
Fit the results with a linear curve
Project to potentially larger data volumes, up to 2 TB
Analyse the downloading and computing times
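The fit-then-project step can be sketched with an ordinary least-squares line. The code below is a generic illustration, not the fitting tool used for the talk, and the sample points are synthetic placeholders rather than the measured timings.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(slope, intercept, volume):
    """Evaluate the fitted line at a data volume outside the measured range."""
    return slope * volume + intercept


if __name__ == "__main__":
    # Synthetic placeholder points lying on y = 2x + 1 (not measurements).
    volumes = [1, 2, 3, 4]
    times = [3, 5, 7, 9]
    slope, intercept = fit_line(volumes, times)
    print(slope, intercept, project(slope, intercept, 10))
```

Projecting from a 1 GB to 200 GB range up to 2 TB is a tenfold extrapolation, so the result is only as reliable as the assumption that the behaviour stays linear.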
GEF – EGI performance quantification
Downloading time
[Figure: downloading time vs data volume, with linear fit curve]
Downloading time is perfectly linear
GEF – EGI performance quantification
Computing time
[Figure: computing time vs data volume, with linear fit curve]
Computing time is approximately linear
GEF – EGI performance quantification
Downloading + computing time
[Figure: workflow time vs data volume, with linear fit curve]
The total is linear as well, being the sum of two linear terms
GEF – EGI performance: projection up to 2 TB
Download time: about 42 hours
Computing time: 20 hours
Total time: 62 hours
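A few derived quantities follow directly from the projected times. The sketch below computes the total, the share of the total spent downloading, and the implied average download throughput, assuming decimal units (1 TB = 1,000,000 MB). The function names are illustrative.

```python
# Projected values quoted on the slide for a 2 TB workload.
DOWNLOAD_HOURS = 42
COMPUTE_HOURS = 20
VOLUME_TB = 2


def total_hours(download_hours, compute_hours):
    """Total workflow time when the two phases run one after the other."""
    return download_hours + compute_hours


def download_share(download_hours, compute_hours):
    """Fraction of the total workflow time spent downloading."""
    return download_hours / (download_hours + compute_hours)


def throughput_mb_per_s(volume_tb, hours):
    """Average throughput in MB/s, assuming 1 TB = 1,000,000 MB."""
    return volume_tb * 1_000_000 / (hours * 3600)


if __name__ == "__main__":
    print(f"Total: {total_hours(DOWNLOAD_HOURS, COMPUTE_HOURS)} hours")
    print(f"Download share: {download_share(DOWNLOAD_HOURS, COMPUTE_HOURS):.1%}")
    print(f"Implied download rate: {throughput_mb_per_s(VOLUME_TB, DOWNLOAD_HOURS):.1f} MB/s")
```

The download share comes out at about two thirds, in line with the ~66% quoted later in the talk.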
GEF – EGI performance conclusion
Projection to 2 TB:
Performance is linear, which is acceptable
Parallelisation should reduce both downloading and computing time
This means creating several VMs, i.e. cloud orchestration
This could be performed with jOCCI
icclim has an option to process the data in fragments
-> Find the best value for a given data volume
GEF – EGI performance conclusion
[Figure annotation: area where using multiple VMs on EGI could reduce the total time]
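As a rough guide to what multiple VMs could bring, the sketch below divides the projected workload evenly across VMs. It assumes perfect scaling, no orchestration overhead and no shared bandwidth bottleneck at the data source, so it gives an optimistic lower bound on the total time, not a prediction. The function name is illustrative.

```python
def ideal_parallel_hours(download_hours, compute_hours, n_vms):
    """Best-case total time when the work is split evenly across n_vms VMs."""
    if n_vms < 1:
        raise ValueError("n_vms must be at least 1")
    return (download_hours + compute_hours) / n_vms


if __name__ == "__main__":
    # 42 h download and 20 h computing are the projected values for 2 TB.
    for n_vms in (1, 2, 4, 8):
        print(f"{n_vms} VM(s): {ideal_parallel_hours(42, 20, n_vms):.2f} hours at best")
```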
GEF – EGI performance conclusion
This would imply establishing some "convention"
Allowing users to perform calculations up to a certain amount of data
Special authorization if the calculation exceeds that amount
Get closer to the data
Downloading files is the most costly phase (~66% of the total time)
This phase can be reduced considerably by getting closer to the data
This implies computing nodes and data nodes working together
This is the purpose of the ESGF – ENES – GEF – EGI interoperability
Extending the interoperability
Climate for Impact (C4I)
European portal for climate research
Earth System Grid Federation (ESGF)
The American CWT server allows calculations to be performed close to the data
Generic Execution Framework (GEF)
European service developed by EUDAT
European Grid Infrastructure (EGI)
European e-infrastructure
Climate for Impact Portal
ESGF – IS-ENES – EGI – GEF “Big Picture”
What C4I looks like:
Conclusion
GEF – EGI interoperability:
Performs data reduction
Offers faster downloading and computing
New solutions to explore:
Cloud orchestration
Quantify how much faster downloading and computing become with multiple VMs and Docker Swarm
icclim optimisation
ESGF – ENES – GEF – EGI in development
Very likely to strengthen our big data approach
Should be close to operational this year
