Empowering Transformational Science

Chelle Gentemann (Farallon Institute) twitter: @ChelleGentemann
Ryan Abernathey (Columbia / LDEO) twitter: @rabernat
Aimee Barciauskas (Development Seed) twitter: @_aimeeb
(there are lots of links in this presentation! click away!)
SWOT
NISAR
NASA Physical Oceanography Program

Communities build open science.
Open science is more efficient.
Efficient science leads to transformational results.

What impacts the velocity of science?
Data, Software, & Compute
Data: time to find, access, clean, & format data for analysis
Software: what tools are easily available?
Compute: access to compute == speed of results
80% Data Preparation (download, clean, & organize files)
10% Batch Processing
10% Think about science

Traditional methods of data access
cannot leverage large volumes of data

https://earthdata.nasa.gov/eosdis/cloud-evolution
SWOT
NISAR
Data, Software, Compute

Analytics Optimized Data Store (AODS)
a few examples of
AODS formats
Current method -
NetCDF files - organized into ‘reasonable’ data sizes per file, usually by orbit, granule, or day. The filename carries information about date, sensor, and version. Reading usually involves calculating the filename, then opening, reading, processing, and closing each file.
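The filename-driven workflow described above can be sketched as follows. This is a minimal illustration, not any data provider's actual layout: the naming pattern and the helper `daily_filenames` are hypothetical, and the open/read/close step is left as a comment.

```python
from datetime import date, timedelta


def daily_filenames(start, end, sensor="sst", version="v01"):
    """Build one filename per day (hypothetical naming pattern)."""
    names = []
    day = start
    while day <= end:
        names.append(f"{sensor}_{day:%Y%m%d}_{version}.nc")
        day += timedelta(days=1)
    return names


files = daily_filenames(date(2020, 1, 1), date(2020, 1, 3))
for name in files:
    # In a real workflow each file would be opened, read, processed,
    # and closed here, one file at a time.
    print(name)
```

A multi-decade daily record means thousands of such files, each visited in turn.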
Analytics Optimized Data Store (one example of many different formats)
Zarr - makes large datasets easily accessible to distributed computing. The data are stored in directories, each holding chunks of an array split along the dataset dimensions. Zarr libraries read the metadata first, then read only the chunks needed to complete a subsetting request.
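To make the chunk-selection idea concrete, here is a toy sketch in plain Python. It is not the zarr library's implementation: the function `chunks_for_request`, the array shape, and the chunk sizes are all made up for illustration. It only shows how chunk metadata lets a reader work out which pieces a request touches.

```python
import math


def chunks_for_request(chunks, start, stop):
    """Return the chunk index tuples that overlap a subsetting request.

    chunks: chunk length along each dimension (taken from the metadata)
    start, stop: half-open index ranges, one entry per dimension
    """
    per_dim = [
        range(lo // size, math.ceil(hi / size))
        for size, lo, hi in zip(chunks, start, stop)
    ]
    keys = [()]
    for dim_range in per_dim:
        keys = [key + (i,) for key in keys for i in dim_range]
    return keys


# Toy metadata: 365 days x 720 lat x 1440 lon, chunked as (30, 360, 360).
meta = {"shape": (365, 720, 1440), "chunks": (30, 360, 360)}

# Request: the first 60 days over a small lat/lon box.
needed = chunks_for_request(meta["chunks"], (0, 100, 400), (60, 200, 700))
total = math.prod(math.ceil(n / c) for n, c in zip(meta["shape"], meta["chunks"]))
print(f"read {len(needed)} of {total} chunks: {needed}")
```

Only the chunks listed in `needed` would have to be fetched; the rest of the store is never touched.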
Technology advances -
Lazy loading - defers initialization of an object until the point at which it is needed. The pattern is widely used for webpages, where it is sometimes called asynchronous loading. Applied to data, it delays reading until the values are needed for compute.
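A minimal sketch of the lazy-loading pattern, using only the Python standard library. The class `LazyArray` and the function `expensive_read` are hypothetical stand-ins; real libraries such as Xarray implement the idea with far more machinery.

```python
from functools import cached_property


class LazyArray:
    """Hold a loader function and call it only on first access."""

    def __init__(self, loader):
        self._loader = loader
        self.load_count = 0

    @cached_property
    def values(self):
        # Runs once; the result is cached on the instance afterwards.
        self.load_count += 1
        return self._loader()


def expensive_read():
    # Stand-in for reading a large array from disk or object storage.
    return [float(i) for i in range(1_000)]


arr = LazyArray(expensive_read)
print(arr.load_count)  # nothing has been read yet
mean = sum(arr.values) / len(arr.values)
print(arr.load_count, mean)  # the data were read once, on first use
```

Creating the object costs almost nothing; the read happens only when a computation asks for the values.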
Advanced OSS libraries:
Xarray - library for analyzing labeled multi-dimensional arrays; supports lazy loading.
Dask - breaks a large computational problem into a network of smaller problems for distribution across multiple processors.
Intake - lightweight set of tools for loading and sharing data in data science projects
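The divide-and-combine idea behind Dask can be sketched with the standard library alone. This is not Dask's API: a thread pool stands in for the scheduler, and `partial_sum` and `chunked_mean` are hypothetical names chosen for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Small task: reduce one chunk to (sum, count)."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size, workers=4):
    """Split values into chunks, reduce each in a worker, then combine."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 101))
print(chunked_mean(data, chunk_size=10))
```

Each chunk is reduced independently, so the small tasks can run on separate workers; only the tiny (sum, count) pairs need to be gathered at the end.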

What does a data store look like?
NetCDF: organized so that each file can fit into RAM, usually by day, orbit, or granule
Zarr: organization and format invisible to user; data accessed by metadata

Time to access data?
https://nbviewer.jupyter.org/github/cgentemann/Biophysical/blob/master/Test_AVISO_zarr_simple_version.ipynb
Modern software tools use lazy loading
to access large datasets
Accessing netCDF data: 11 minutes (depends on computer)
1 - user creates list of filenames
2 - access dataset by reading the metadata distributed through files
Accessing Zarr data: 0.1 seconds (metadata consolidated)
1 - access dataset by reading the consolidated metadata
Calculate mean over region
NetCDF - 12 minutes
Zarr - 4 seconds
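One reason opening the dataset is so much faster with consolidated metadata is the number of read requests involved. The toy sketch below counts them. The store class, variable names, and array shape are invented for illustration, and the key names are only loosely modeled on a Zarr store; this is not a benchmark and does not reproduce the timings above.

```python
import json


class CountingStore:
    """A dict-backed key-value store that counts read requests."""

    def __init__(self, items):
        self._items = dict(items)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._items[key]


# Hypothetical variables, each with its own small metadata document.
variables = ["var_a", "var_b", "var_c", "var_d"]
items = {f"{name}/.zarray": json.dumps({"shape": [100, 720, 1440]}) for name in variables}
# Consolidated copy: every metadata document gathered under a single key.
items[".zmetadata"] = json.dumps({"metadata": {k: json.loads(v) for k, v in items.items()}})

# Scattered metadata: one read per variable.
scattered = CountingStore(items)
meta_a = {name: json.loads(scattered.get(f"{name}/.zarray")) for name in variables}

# Consolidated metadata: a single read describes the whole dataset.
consolidated = CountingStore(items)
meta_b = json.loads(consolidated.get(".zmetadata"))["metadata"]

print(scattered.reads, consolidated.reads)
```

With thousands of files or variables, and with each read being a network request, the gap between the two counts grows accordingly.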
My version of lazy loading before I knew Python - on bedrest, pregnant with twins
STOP ------------- THIS IS DIFFERENT ------------------
1 line of code to access a 28-year, global, 25km dataset
1 line of code to select a region, calculate mean, & plot time series
in LESS than 1 minute

Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
SciPy

Analytics Optimized Data Store (AODS)
Data Provider’s $
Data Consumer’s $
Scalable Parallel Computing Frameworks

Pangeo Architecture
Jupyter for interactive data analysis on remote systems
Cloud / HPC
Xarray provides data structures and an intuitive interface for interacting with datasets
Parallel computing system allows users to deploy clusters of compute nodes for data processing. Dask tells the nodes what to do.
Distributed storage
“Analytics Optimized Data Stores” stored on globally-available distributed storage.
@pangeo_data

How can data providers reduce barriers?
Reimagine how cloud data access and tools can enable transformational science
Publish cloud-optimized data
Interactive tutorials
Contribute to OSS tools
Increase user interactions/feedback

How does minimizing barriers to data change science?
Levels the playing field for all who want to contribute

Impacts: Reduce Time to Science
Traditional Project Timeline
80% Data Preparation (download, clean, & organize files)
10% Batch Processing
10% Think about science
Cloud-based Project Timeline
5% Load AODS
5% Parallel Processing
90% Think about science

Impacts: Reproducibility
Traditional Project Code
# step 1: open data (stored on local hard drive)
>>> data = open_data("/path/to/private/files")
Error: files not found
Cloud-based Project Code
# step 1: open data (globally accessible)
>>> data = open_data("http://catalog.pangeo.io/path/to/dataset")
# step 2: process data
>>> process(data)
Reproducibility in data-driven science requires more than just code!

Thank you!
Open source science
What impacts the velocity of progress?
Data, Software, & Compute
