PyModESt: A Python Framework for Staging of Geo-referenced Data on the Collaborative Climate Community Grid (C3-Grid)
Henning Bergmeyer
German Aerospace Center, Simulation and Software Technology
Henning.Bergmeyer [at] dlr.de

Talk Overview
- C3-Grid: A D-Grid 1 Infrastructure Project
- DLR Use Cases on C3-Grid
- Concept and Realization of Data Providers
- Implementing Staging Scripts with PyModESt (Modular Extendable Stager in Python)
- Interoperability Issues and Goals

C3-Grid Layered Architecture (simplified)
[Figure: layered architecture diagram]
- Layers: Users' View, Grid Middleware, Resources
- Components: Portal, Workflow Management, Data Management, Data Information System, Grid-Service, Archive
- Archive back ends (heterogeneous!): dCache, OGSA-DAI, flat files, …
- Sites: WDC Climate, WDC Mare, WDC RSAT, DWD, DKRZ, PIK, GKSS, AWI, MPI-M, IFM-Geomar, FU Berlin, Uni Cologne

C3-Grid: A D-Grid 1 Project
- D-Grid: German Grid Initiative
  - Phase 1: Infrastructures and Services for Scientific Communities
- C3-Grid (2005-2009)
  - Transparent grid infrastructure for the German climate research community (Globus Toolkit 4.0.8)
  - Data: simplified, uniform discovery and retrieval of distributed heterogeneous data on the web portal
  - Workflows: computational standard tools for remote pre-/post-processing and workflows
  - Partners: meteorological data providers and users, computer scientists
- Some future plans
  - Include more data and compute providers
  - Partners contribute to IPCC-AR5 and provide some replicas of the data on C3-Grid
  - Enhance security mechanisms: Shibboleth, SLCs, SAML

DLR Use Cases on C3-Grid
- World Data Center for Remote Sensing of the Atmosphere
  - Data provider for global satellite sensor data, e.g. "Global Ozone Monitoring Experiment"
  - Data located on a web server (free)
- Institute for Physics of the Atmosphere
  - Workflow developer for model-driven chemical weather forecasts for flight route planning
  - Data provider for a data selection visualization tool
  - Data located in the file system (free)

Data Providers I: Portal-Based Discovery
- Advanced Search
- Browse by Data Set
- Available data sets are annotated using an ISO 19115/19139 metadata profile
- The C3-Grid project provides a documented online metadata editor form
- Metadata is published on an OAI server (e.g. DLESE jOAI)
- Metadata is harvested via OAI-PMH by the Data Information System
- Portal integration: "panFMP" (available on Sourceforge and panfmp.org)

Data Providers II: Uniform Data Access
- Data Download Assistant in the portal
- The Grid-Service receives a standard set of selection constraints
  - Data set (Object ID)
  - Variables as CF names (Climate and Forecast Metadata Convention), e.g. log_surface_pressure, mole_fraction_of_ozone_in_air, …
  - Regional bounds (longitude, latitude)
  - Vertical bounds and vertical coordinate reference system
  - Time period (not ISO 8601 restricted)
  - Data set specific constraints
- The data provider transparently receives
  - the Distinguished Name of the requesting user for its authorization mechanism
  - a C3-Grid-wide unique workflow ID
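
As a rough illustration of the constraint set listed above, the following sketch collects the fields in one Python record and checks them for basic consistency. The class name, field names and default values are assumptions made for this example; they are not the PyModESt or C3-Grid request schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class StageConstraints:
    """Hypothetical container for the selection constraints of one request."""
    object_id: str                       # data set (Object ID)
    cf_names: list[str]                  # requested variables as CF names
    lon_min: float = -180.0              # regional bounds
    lon_max: float = 180.0
    lat_min: float = -90.0
    lat_max: float = 90.0
    alt_min: Optional[float] = None      # vertical bounds, if any
    alt_max: Optional[float] = None
    vertical_crs: Optional[str] = None   # vertical coordinate reference system
    time_begin: Optional[datetime] = None
    time_end: Optional[datetime] = None
    extra: dict[str, str] = field(default_factory=dict)  # data set specific

    def validate(self) -> None:
        """Raise ValueError if the constraints are obviously inconsistent."""
        if not self.cf_names:
            raise ValueError("at least one variable must be requested")
        if not (-90.0 <= self.lat_min <= self.lat_max <= 90.0):
            raise ValueError("latitude bounds out of range or reversed")
        if self.time_begin and self.time_end and self.time_begin > self.time_end:
            raise ValueError("time period ends before it begins")
```

Longitude bounds are deliberately not range-checked here, since a request may legitimately wrap around the date line.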

Data Providers III: Delivering Data
- Deliver only the data fulfilling the constraints
  - Extract the corresponding parts of the base data
  - Reduce the file size needed for remote transfers
- Deliver exactly 1 data file and 1 metadata file
  - Always produce metadata for data files
  - A provider may offer more than 1 file type (NetCDF, HDF, GRB)
  - Compress files on request (.tar.gz)
- Place the result in the local DMS work space
  - GridFTP-accessible directory
  - Managed by the central C3-Grid DMS
- Estimate the time to complete the request and the storage space needed for the result

Becoming a C3-Grid Data Provider
- Prepare data and ISO metadata
- Grid certificate for own server and local grid users
- System set-up through an admin
  - Middleware: Globus Toolkit 4.0.8
  - Configure firewall, authorize certificate of DMS
  - GNDMS software of Zuse Institute Berlin
    - Basic installation using ant
    - Configuration for local setup using variables in a special shell script
  - MDS entry in the Resource Information System (RIS)
- Providing data sets: implementation of scripts for data staging and estimation
  - Data providers know their tools and their data!

General Implementation of File Stagers for C3-Grid
- Receive and interpret request constraints on STD-IN in XML or property file format, then handle one of the following cases
- Case 1: Estimation Request
  - Verify constraints, estimate result file size and staging time
  - Offer a contract in property or XML format on STD-OUT
  - Do NOT process any base data files for this
- Case 2: Stage Request
  - Retrieve the data and produce one result data file
  - Produce a corresponding metadata file
  - Take care of concurrent service executions when using temporary files
- Case 3: Cancel Request
  - Clean up temporary files from interrupted requests
- The implementation is open: extend the Grid-Service in Java or call any executable as an "External Stager"
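
The request/response cycle described above can be sketched as a small external stager that reads a property-style request, dispatches on the request mode and writes its answer. This is a minimal sketch, not the C3-Grid wire format: the key names (`mode`, `size`, `seconds`) and the mode values are hypothetical, and only simple `key=value` lines are handled.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def format_properties(props):
    """Serialize a dict as key=value lines in a stable (sorted) order."""
    return "".join(f"{key}={value}\n" for key, value in sorted(props.items()))


def run(stdin, stdout, handlers):
    """Read one request, call the handler for its mode, write the response.

    `handlers` maps a mode name (hypothetical: 'estimate', 'stage', 'cancel')
    to a function taking the parsed request and returning a response dict.
    """
    props = parse_properties(stdin.read())
    mode = props.get("mode")
    if mode not in handlers:
        raise ValueError(f"unknown or missing request mode: {mode!r}")
    stdout.write(format_properties(handlers[mode](props)))
```

A starter script would call `run(sys.stdin, sys.stdout, handlers)`; an estimation handler must return its offer without touching any base data files.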

PyModESt Staging Process Skeleton
[Figure: process flow diagram]
- Main flow: initialize environment, read stage request, select DataProcessor, choose request mode, handle the request, write responses
  - handle exceptions, tidy work space
- Handle Estimation Request: authorize, retrieve metadata, adjust constraints, estimate stage time and file size
- Handle Cancel Request: find abandoned temporary files
- Handle Stage Request: (authorize), prepare work space, retrieve metadata, adjust constraints, retrieve data, update metadata, transfer files to target
- Diagram legend: highlighted steps mark the implementation effort required of the data provider (DP) when using PyModESt
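
The "prepare work space ... transfer files to target ... tidy work space" part of the skeleton can be illustrated with a private temporary directory per request, which avoids clashes between concurrent executions and leaves nothing behind after a failure. This is a generic sketch using the Python standard library; the function name is hypothetical and it does not reproduce PyModESt's own temporary file management.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_private_workspace(produce, target_dir):
    """Run `produce(work_dir)` in a fresh directory and move its result away.

    `produce` must return the Path of one result file inside `work_dir`.
    The work directory is removed afterwards, whether or not staging failed.
    """
    # mkdtemp creates a uniquely named directory, so concurrent requests
    # never share temporary files.
    work_dir = Path(tempfile.mkdtemp(prefix="stage-"))
    try:
        result = produce(work_dir)
        target = Path(target_dir) / result.name
        shutil.move(str(result), str(target))
        return target
    finally:
        # Tidy up even when produce() raised an exception.
        shutil.rmtree(work_dir, ignore_errors=True)
```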

What the DP Does and What Is Done for Her
- Automatically:
  - Std-in/std-out communication
  - Stage request validation
  - Complete metadata file handling and operations
  - Creation of Python variables for request constraints (float, datetime, str)
  - Temporary file management (preventing concurrency conflicts and storage leaks)
  - Choice of processing method by OID
  - Thesaurus: 2-way variable name translation
  - Compression (tar.gz)
  - Logging
  - Error handling (log, service response, tidy-up)
  - Gauss grid calculations
  - Calling external tools and capturing their std-output
- Manually:
  - Associate OIDs with data processors
  - Associate CF names with variable indices
  - Retrieve and package data using your well-known tools
  - Enter precise result attributes for the metadata update
  - Estimate file size and staging time
  - Authorize user or deny access
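
The two-way variable name translation can be pictured as a per-data-set dictionary plus its inverse. The sketch below assumes a mapping from C3-Grid level names to data-set-local names, keyed by object ID; the class and method names are hypothetical and are not the C3Thesaurus API.

```python
class TwoWayThesaurus:
    """Translate variable names between C3-Grid scope and data set scope."""

    def __init__(self, scope_map):
        # scope_map: {object_id: {c3_name: local_name}}
        self._to_local = scope_map
        self._to_c3 = {}
        for object_id, names in scope_map.items():
            inverse = {local: c3 for c3, local in names.items()}
            if len(inverse) != len(names):
                # Two C3 names sharing one local name cannot be inverted.
                raise ValueError(f"ambiguous local names for {object_id}")
            self._to_c3[object_id] = inverse

    def to_local(self, c3_names, object_id):
        """Map C3-Grid level names to the names used inside the data set."""
        return [self._to_local[object_id][name] for name in c3_names]

    def to_c3(self, local_names, object_id):
        """Map data-set-local names back to C3-Grid level names."""
        return [self._to_c3[object_id][name] for name in local_names]
```

An unknown name raises `KeyError`, which a stager would turn into a rejected request rather than silently dropping the variable.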

PyModESt Modular Extensibility
- Data Processor module: implement data set specific operations
    __init__(c3env, stage_request)
    retrieveAndFilterDataFiles()
    updateMetaData(c3_metadata)
    estimateFileSize()                 returns long
    estimateStageTime(stage_moment)    returns timedelta
- Define variable name associations between data set scopes for the helper module C3Thesaurus:
    SCOPE_MAP = {
        "g2.de.dlr.wdc.ERS2.GOME.L3.VCD.MONTHLYMEAN.O3": {
            "mean_ozone_VCD#nonCF": "mean",
            "standard_deviation_ozone_VCD#nonCF": "strd_dev",
        }
    }
- The starter script associates data sets and data processors:
    PROCESSOR_TYPES = {
        "g2.de.dlr.wdc.CWF": netcdf_extraction.IPA_NCDF_Processor,
        "g2.de.dlr.wdc.ERS…": wdc_hdf_processing.WDCHDFProcessor,
    }
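
To make the module interface above concrete, here is a minimal sketch of a data processor class with the five listed methods and a registry lookup. The method names follow the slide; everything else (the `DemoProcessor` class, the `demo.dataset` object ID, the constant estimates, and the exact-match lookup in `select_processor`) is an assumption for illustration, not PyModESt code.

```python
from datetime import datetime, timedelta


class DemoProcessor:
    """Hypothetical data processor for one data set."""

    def __init__(self, c3env, stage_request):
        self.c3env = c3env
        self.stage_request = stage_request

    def retrieveAndFilterDataFiles(self):
        # Data set specific: fetch base data and cut it to the constraints.
        raise NotImplementedError

    def updateMetaData(self, c3_metadata):
        # Data set specific: describe what was actually delivered.
        raise NotImplementedError

    def estimateFileSize(self):
        # Placeholder estimate in bytes.
        return 1024

    def estimateStageTime(self, stage_moment):
        # Placeholder estimate, independent of the planned staging moment.
        return timedelta(seconds=60)


# Registry in the spirit of PROCESSOR_TYPES; the key is made up.
PROCESSOR_TYPES = {"demo.dataset": DemoProcessor}


def select_processor(object_id, c3env, stage_request):
    """Instantiate the processor registered for `object_id` (exact match)."""
    try:
        processor_type = PROCESSOR_TYPES[object_id]
    except KeyError:
        raise LookupError(f"no data processor for {object_id!r}") from None
    return processor_type(c3env, stage_request)
```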

PyModESt Metadata Update
    md.removeQuicklook()
    md.filterContentInfo(self.thesaurus.translateToC3(
        self.requested_vars, self.object_id))
    md.setHorizontalBounds(self.lon_min, self.lon_max,
                           self.lat_min, self.lat_max)
    md.setVerticalExtent(self.alt_min, self.alt_max,
                         self.alt_verticalcrs)
    md.updateTimePeriod(self.timeperiod_begin_date,
                        self.timeperiod_stop_date)
    # The new object ID is the original ID plus a UTC time stamp.
    md.setObjectId(self.object_id + "." +
                   datetime.datetime.utcnow().isoformat().replace(":", "-"))
    md.addLineageProcessStep(
        PROCESS_DESCRIPTION,
        datetime.datetime.utcnow(),
        self.stage_request.object_ids[0],
        RESPONSIBLE_PERSON,
        "http://wis.wmo.int/2006/catalogues/gmxCodelists.xml#CI_RoleCode_distributor",
        INSTITUTE_IDENTIFIER)

Estimation for Offer Contracts
- File size
  - Use "GaussianGridHelper" to calculate table index ranges on Gaussian grids:
      gauss_grid_hlp = RegularGaussianGridHelper(
          src_lat_min, src_lat_max, lat_delta, lat_len,
          src_lon_min, src_lon_max, lon_delta, lon_len)
      (lat_idx_min, lat_idx_max, lat_idx_len,
       lon_idx_min, lon_idx_max, lon_idx_len) = \
          gauss_grid_hlp.calculateRegionIndices(
              lat_min, lat_max, lon_min, lon_max)
  - For raster / table data, simply multiply and sum up
  - Difficult for data on an irregular coordinate system (e.g. time series)
- Staging time
  - Return a datetime.timedelta value
  - It is next to impossible to be precise
  - Currently, DLR implementations generously over-estimate with a constant: timedelta(seconds=60)
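
The "multiply and sum up" rule for raster data can be sketched as follows. This is an illustration for an equidistant latitude/longitude grid only; it does not reproduce RegularGaussianGridHelper, and the function names, the tolerance and the 4-byte value size are assumptions.

```python
import math
from datetime import timedelta


def index_range(src_min, delta, length, req_min, req_max):
    """Inclusive index range of grid points lying within [req_min, req_max].

    The grid has `length` points at src_min, src_min + delta, ...
    Returns (idx_min, idx_max, count); count is 0 if nothing matches.
    """
    eps = 1e-9  # tolerance against floating-point noise at the boundaries
    idx_min = max(0, math.ceil((req_min - src_min) / delta - eps))
    idx_max = min(length - 1, math.floor((req_max - src_min) / delta + eps))
    return idx_min, idx_max, max(0, idx_max - idx_min + 1)


def estimate_file_size(lat_len, lon_len, time_steps, variables, bytes_per_value=4):
    """Payload size in bytes for a dense raster (headers are ignored)."""
    return lat_len * lon_len * time_steps * variables * bytes_per_value


def estimate_stage_time(stage_moment):
    """Constant, generous over-estimate, as in the DLR implementations."""
    return timedelta(seconds=60)
```

For a 1-degree grid starting at -90 (latitude, 181 points) and -180 (longitude, 360 points), a request for latitudes 10 to 20 and longitudes 0 to 9.5 selects 11 by 10 grid points; twelve time steps of two variables then need 11 * 10 * 12 * 2 * 4 = 10560 bytes.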

Examples: DLR Use Cases
- WDC-RSAT: ERS2.GOME.L3.VCD.MONTHLYMEAN.O3 (1995–2005)
  - Base data: 1 file / month, file format HDF4, HTTP download
  - Retrieval and processing using the PyHDF library
    - create a new HDF file
    - iterate over the months covering the requested time period
    - adjust the attributes describing the data
- IPA: Chemical Weather Forecast Demo Data Set (2005)
  - 1 file with 8 time steps / day, file format NetCDF, local file system
  - Retrieval and processing using the external command line tool Climate Data Operators (cdo)
    - iterate over the files corresponding to the time period (cdo)
    - adjust the attributes describing the data by analyzing the result file (cdo)
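
For the IPA case, the call to the external tool might look like the sketch below, which builds a cdo command line that selects variables and a longitude/latitude box from one NetCDF file. The operators `selname` and `sellonlatbox` and the chaining syntax are standard CDO; the helper names and the file names in the usage note are hypothetical, and this is not the actual IPA_NCDF_Processor code.

```python
import subprocess


def build_cdo_command(var_names, lon_min, lon_max, lat_min, lat_max,
                      infile, outfile):
    """Return the argument list for one chained cdo call.

    Chained operators are applied from right to left: variables are
    selected first, then the region is cut out.
    """
    return [
        "cdo",
        f"-sellonlatbox,{lon_min},{lon_max},{lat_min},{lat_max}",
        "-selname," + ",".join(var_names),
        infile,
        outfile,
    ]


def run_cdo(args):
    """Run cdo and return its standard output; raises if cdo fails."""
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout
```

A stager would call `run_cdo(build_cdo_command(["o3"], -10.0, 30.0, 35.0, 60.0, "in.nc", "out.nc"))` once per base file in the requested time period; passing an argument list instead of a shell string avoids quoting problems with user-supplied constraints.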

Benefits of Using PyModESt
- Easy to understand and modify by non-computer scientists
  - Intuitive configuration using dictionaries
  - No compilation necessary
  - Use Python types (float, str, dict, ...)
  - Use Python standard libraries (datetime, timedelta, math, …)
- Add new data sets by copying and customizing DataProcessors (a documented DataProcessor template is provided)
- Easy integration of command line tools (CDO…)
- Rich set of useful libraries available (HDF, Numeric, iso8601, PyParsing…)
- Integration of Java and C/C++ libraries is possible with JPype and ctypes

Open Issues
- CF names for variables on C3-Grid level
  - Data set indices in most cases use institution-specific or file-format-specific conventions => automatic translation necessary
  - Not all variables in use have CF standard names yet => new names have to be chosen and discussed on community level
    - slows down development
    - new names are difficult to discover when unknown to the user
    - requires documentation on the portal and reading by the user
  - A centralized naming and translation service would be helpful
- Support for better staging time estimation
- Authorization using Short Lived Certificates (proxies) and SAML assertions ("GapSLCs" beginning soon)

Summary
- The C3-Grid is a collaboration infrastructure for the climate science community, based on the Globus Toolkit 4.0.8 and WSRF grid services.
- Its main aim is to make the infrastructure transparent to users: heterogeneous data resources are abstracted for easy data discovery and access, and users can execute basic manipulation and analysis tasks as well as complex distributed workflows.
- PyModESt is a Python framework for the convenient, modular implementation of staging scripts for the C3-Grid. It frees meteorological data providers from doing the work of system admins and vice versa.
- Open issues for interoperability and extensibility are a semantically defined, unified naming and translation service for variable names and the still outstanding integration of authorization based on Shibboleth, SLCs and SAML assertions.
