Empowering Transformational Science
Chelle Gentemann (Farallon Institute) twitter: @ChelleGentemann
Ryan Abernathey (Columbia / LDEO) twitter: @rabernat
Aimee Barciauskas (Development Seed) twitter: @_aimeeb
(there are lots of links in this presentation! click away!)
SWOT
NISAR
NASA Physical Oceanography Program
Communities build open science.
Open science is more efficient.
Efficient science leads to transformational results.
What impacts the velocity of science? Data, Software, & Compute
Data: time to find, access, clean, & format data for analysis
Software: what tools are easily available?
Compute: access to compute == speed of results
80% Data preparation (download, clean, & organize files)
10% Batch processing
10% Think about science
Traditional methods of data access
cannot leverage large volumes of data
https://earthdata.nasa.gov/eosdis/cloud-evolution
SWOT
NISAR
Data, Software, Compute
Analytics Optimized Data Store (AODS)
a few examples of
AODS formats
Current method -
NetCDF files - organized into ‘reasonable’ data sizes per file, usually by orbit, granule, or
day. The filename encodes the date, sensor, and version. Reading usually involves
calculating the filename, then opening, reading, processing, and closing each file.
Analytics Optimized Data Store (one example of many different formats)
Zarr - makes large datasets easily accessible to distributed computing. The data are
stored in directories, each holding chunks that correspond to the dataset dimensions.
Zarr libraries read the metadata so that only the chunks necessary to complete a
subsetting request are read.
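As a rough illustration of the chunk selection described above, the sketch below works out which chunks of a regular grid overlap a requested index range. It is a simplified model and not how the zarr library is implemented; the function name `chunks_for_slice` and the array and chunk shapes are made up for the example.

```python
from itertools import product


def chunks_for_slice(starts, stops, chunk_shape):
    """Return the grid indices of every chunk that overlaps the
    half-open index range [start, stop) along each dimension."""
    per_dim = []
    for start, stop, size in zip(starts, stops, chunk_shape):
        first = start // size       # chunk holding the first requested index
        last = (stop - 1) // size   # chunk holding the last requested index
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


# Hypothetical array of shape (time=1000, lat=720, lon=1440) stored in
# chunks of shape (100, 180, 360): 10 * 4 * 4 = 160 chunks in total.
needed = chunks_for_slice(
    starts=(0, 200, 400),
    stops=(1000, 300, 700),
    chunk_shape=(100, 180, 360),
)
print(len(needed), "of 160 chunks are needed")
```

In this made-up case a full time series over a small region touches only a small fraction of the chunks, which is the point of storing data in chunks aligned with its dimensions.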
Technology advances -
Lazy loading - sometimes called asynchronous loading in web development - defers initialization
of an object until the point at which it is needed. Widely used in web pages. Here it delays
reading data until it is needed for compute.
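Deferring a read until the data is needed can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not how Xarray or any other library implements it; `LazyArray` and `read_from_disk` are hypothetical names.

```python
class LazyArray:
    """Holds a recipe for loading data and runs it only on first use."""

    def __init__(self, loader):
        self._loader = loader   # callable that performs the expensive read
        self._data = None       # nothing is loaded at construction time
        self.loads = 0          # counts how many times the read actually ran

    def values(self):
        if self._data is None:
            self._data = self._loader()
            self.loads += 1
        return self._data


def read_from_disk():
    # Stand-in for an expensive file or network read (hypothetical).
    return [float(i) for i in range(1_000_000)]


arr = LazyArray(read_from_disk)   # returns immediately, nothing is read yet
loads_before = arr.loads
total = sum(arr.values())         # the read happens here, on first access
total_again = sum(arr.values())   # served from the cached copy
```

Creating the object costs almost nothing; the cost of the read is paid once, at the moment a computation asks for the values.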
Advanced OSS libraries:
Xarray - library for analyzing multi-dimensional arrays, with support for lazy loading.
Dask - breaks a large computational problem into a network of smaller problems for
distribution across multiple processors
Intake - lightweight set of tools for loading and sharing data in data science projects
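The idea of breaking one large computation into many smaller ones can be sketched with Python's standard library. This uses `concurrent.futures` as a stand-in and is not Dask itself; the threads only illustrate the structure of the computation, and `partial_sum` and `chunked_mean` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Small problem: sum and count for one chunk."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size, workers=4):
    """Mean of a non-empty list, combined from per-chunk partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 101))
result = chunked_mean(data, chunk_size=10)
print(result)
```

Each chunk is processed independently and only the small partial results are combined at the end, so no single worker ever needs the whole dataset.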
What does a data store look like?
NetCDF: organized so that each file can fit into RAM, usually by day, orbit, or granule
Zarr: organization and format invisible to user, data accessed by metadata
Time to access data?
Modern software tools use lazy loading to access large datasets
https://nbviewer.jupyter.org/github/cgentemann/Biophysical/blob/master/Test_AVISO_zarr_simple_version.ipynb
Accessing netCDF data: 11 minutes (depends on computer)
1 - user creates list of filenames
2 - access dataset by reading the metadata distributed through files
Accessing Zarr data: 0.1 seconds (metadata consolidated)
1 - access dataset by reading the consolidated metadata
Calculate mean over region
NetCDF - 12 minutes
Zarr - 4 seconds
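One reason consolidated metadata opens so much faster is that it replaces many small reads with a single one. The sketch below mimics that difference with JSON files in a temporary directory. The file names and layout are invented for the example and are not the actual netCDF or Zarr formats, and it counts reads rather than measuring time.

```python
import json
import tempfile
from pathlib import Path


def write_store(root, n_files):
    """Create n small metadata files plus one consolidated copy."""
    consolidated = {}
    for i in range(n_files):
        meta = {"name": f"granule_{i:04d}", "shape": [180, 360]}
        (root / f"granule_{i:04d}.json").write_text(json.dumps(meta))
        consolidated[meta["name"]] = meta
    (root / "consolidated.json").write_text(json.dumps(consolidated))


def read_distributed(root):
    """One open-and-parse per file."""
    metas, reads = {}, 0
    for path in sorted(root.glob("granule_*.json")):
        meta = json.loads(path.read_text())
        metas[meta["name"]] = meta
        reads += 1
    return metas, reads


def read_consolidated(root):
    """A single open-and-parse for the whole store."""
    return json.loads((root / "consolidated.json").read_text()), 1


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_store(root, n_files=365)
    distributed, n_distributed = read_distributed(root)
    consolidated, n_consolidated = read_consolidated(root)

print(n_distributed, "reads vs", n_consolidated)
```

Both paths recover the same metadata; the difference is how many separate reads it takes, which matters most when every read is a request to remote storage.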
My version of lazy loading before I knew Python: on bedrest, pregnant with twins
STOP ------------- THIS IS DIFFERENT ------------------
1 line of code to access a 28-year, global, 25km dataset
1 line of code to select a region, calculate mean, & plot time series
in LESS than 1 minute
Data, Software, Compute
Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
SciPy
Data, Software, Compute
Analytics Optimized Data Store (AODS)
Data Provider’s $ Data Consumer’s $
Scalable Parallel Computing Frameworks
Agency driven solutions
Grass-Roots Solutions
Pangeo Architecture
Jupyter for interactive data analysis on remote systems
Xarray provides data structures and an intuitive interface for interacting with datasets
Cloud / HPC: parallel computing system allows users to deploy clusters of compute nodes for
data processing. Dask tells the nodes what to do.
Distributed storage: “Analytics Optimized Data Stores” stored on globally available
distributed storage.
@pangeo_data
How can data providers reduce barriers?
Reimagine how cloud data access and tools can enable
transformational science
Publish cloud-optimized data
Interactive tutorials
Contribute to OSS tools
Increase user interactions/feedback
How does minimizing barriers to data change science?
Levels the playing field for all who want to contribute
Impacts: Reduce Time to Science
Traditional Project Timeline
80% Data preparation (download, clean, & organize files)
10% Batch processing
10% Think about science
Cloud-based Project Timeline
5% Load AODS
5% Parallel processing
90% Think about science
Impacts: Reproducibility
Traditional Project Code
# step 1: open data (stored on local hard drive)
>>> data = open_data("/path/to/private/files")
Error: files not found
Cloud-based Project Code
# step 1: open data (globally accessible)
>>> data = open_data("http://catalog.pangeo.io/path/to/dataset")
# step 2: process data
>>> process(data)
Reproducibility in data-driven science requires more than just code!
Thank you!
Open source science
What impacts the velocity of progress?
Data, Software, & Compute

