Transforming Science Through Data-driven Discovery
How Cyverse.org enables scalable
data discoverability and re-use
Matt Vaughn, co-PI
@mattdotvaughn
vaughn@tacc.utexas.edu
History and Context
~ $100m direct NSF
investment over 10
years
Currently working to
sustain its successes
beyond 2018
iPlant 2008
Empowering a
New Plant Biology
iPlant 2013
Cyberinfrastructure
for Life Science
CyVerse 2016
Transforming Science
Through Data-Driven
Discovery
Plant Science Cyberinfrastructure Collaborative
A "new type of organization" that is "community-
driven" uniting "biologists, computer and information
scientists and experts from other disciplines working
in an integrated team" to provide "computational and
cyberinfrastructure capabilities and expertise that are
capable of handling large and heterogeneous plant
biology data sets"
What is Cyberinfrastructure?
•Data storage and retrieval
•Software (system & user)
•Computing capability
•Human expertise and support
Organized into systems that solve problems of size
and scope that would not otherwise be solvable
Platform Overview
Ready to use
Platforms
Foundational
Capabilities
Established CI
Components
Extensible
Services
Ease of Use
Adoption and Outputs
• Over 40K registered users (15-20%
active)
• Millions of computing hours on
XSEDE, campus HPC, Cyverse
systems, and commercial cloud
• 2+ PB user data stored in CyVerse
Data Store
• Hundreds of publications, courses,
and discoveries
• Spin-off technologies
• Jetstream: NSF production
cloud
• Syndicate: Software-defined
storage system
• Agave API: Multitenant
science PaaS
• Communities such as iAnimal,
iMicrobe, iPlant.UK
• 3rd party software resources
using it as a platform
Federation
Metadata
Finding and re-using Data (1)
iRODS (2+PB)
ElasticSearch
Tucson
Resources
Austin
Resources
Catalog Servers
CSHL
Resource
iPlant.UK
Resources
Data Store APIs
Agave API
AWS S3
Public FTP
SFTP
At the heart of all Cyverse applications is a data-centric
architecture, designed to be scaled and extended
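As a rough illustration of how a metadata catalog backed by ElasticSearch might be queried, the sketch below builds a request body in the standard Elasticsearch query DSL. The nested field name "metadata", its "attribute" and "value" subfields, and the helper name avu_query are assumptions made for this example; the actual CyVerse index mapping is not described in the slides.

```python
import json


def avu_query(attribute, value, size=10):
    """Build an Elasticsearch query body matching one attribute/value pair.

    Assumes each indexed file stores its metadata in a nested field named
    "metadata" with "attribute" and "value" subfields (hypothetical mapping).
    """
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "metadata",
                "query": {
                    "bool": {
                        "must": [
                            # exact match on the attribute name
                            {"term": {"metadata.attribute": attribute}},
                            # analyzed full-text match on the value
                            {"match": {"metadata.value": value}},
                        ]
                    }
                },
            }
        },
    }


print(json.dumps(avu_query("organism", "Arabidopsis thaliana"), indent=2))
```

A nested query is used here so that the attribute and the value must match within the same metadata entry rather than anywhere in the document.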
Finding and re-using Data (2)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured
templates
• Share with collaborators or any
Cyverse user
The Cyverse Discovery Environment Data Window
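The "AVU metadata + structured templates" bullet can be sketched in a few lines. iRODS attaches metadata to files as attribute-value-unit (AVU) triples; the template contents, the attribute names, and the missing_attributes helper below are assumptions for illustration, not the Discovery Environment's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVU:
    """One attribute-value-unit triple, the metadata model used by iRODS."""
    attribute: str
    value: str
    unit: str = ""  # the unit is optional and often left empty


# Hypothetical template: attribute names a curated data set must carry.
SEQUENCING_TEMPLATE = ("organism", "library_strategy", "read_length")


def missing_attributes(avus, template):
    """Return template attributes that have no non-empty value in avus."""
    present = {a.attribute for a in avus if a.value.strip()}
    return [name for name in template if name not in present]


avus = [
    AVU("organism", "Arabidopsis thaliana"),
    AVU("read_length", "150", "bp"),
]
print(missing_attributes(avus, SEQUENCING_TEMPLATE))  # ['library_strategy']
```

A template in this sense is just a named list of expected attributes, which lets free-form AVUs and structured metadata coexist on the same file.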
Finding and re-using Data (3)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured
templates
• Share with collaborators or any
Cyverse user
Google Drive, for big data
The Cyverse Discovery Environment Data Window
Finding and re-using Software (1)
• Extendable App Catalog
• Provide Dockerfile + GUI
specification
• Develop VM image
• Deploy application web
service
Info view for a Cyverse Discovery Environment application
Finding and re-using Software (2)
• Extendable App Catalog
• Provide Dockerfile + GUI
specification
• Develop VM image
• Deploy application web
service
• Require links to
documentation, example files
and usage, appropriate
software and domain
ontologies
Public or shared Atmosphere VM images tagged with “GWAS”
Finding and re-using Software (3)
• Extendable App Catalog
• Provide Dockerfile + GUI
specification
• Develop VM image
• Deploy application web
service
• Require links to
documentation, example files
and usage, appropriate
software and domain
ontologies
• Give credit to app author and
software author
Application and Data catalogs available to 3rd parties
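The catalog requirements listed on this slide (documentation links, example files, ontology terms, and credit for both app author and software author) amount to a submission checklist. A minimal sketch of such a check follows; the field names and the check_submission helper are hypothetical and do not reflect the real catalog schema.

```python
# Hypothetical field names mirroring the requirements on the slide.
REQUIRED_FIELDS = (
    "documentation_url",
    "example_files",
    "ontology_terms",
    "app_author",
    "software_author",
)


def check_submission(record):
    """Return the required fields that are absent or empty in record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


submission = {
    "name": "example-aligner",  # hypothetical app name
    "documentation_url": "https://example.org/docs",
    "example_files": ["reads.fastq"],
    "ontology_terms": [],  # empty, so it is reported as missing
    "app_author": "A. Contributor",
    "software_author": "B. Developer",
}
print(check_submission(submission))  # ['ontology_terms']
```

Keeping app_author and software_author as separate fields is what allows credit to go to both parties when they differ.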
Cyverse Data Commons (1)
Data Commons Landing Page (1.0)
Persistent URL for each data set. No authentication
required. Fast browsing and retrieval.
NCBI SRA Submission Workflow in DE
Cyverse is the analysis home for a lot of genomics
data. To get it off our systems, we need to help get it
into the SRA!
Cyverse Data Commons (2)
Actively facilitating publication and discovery of data stored with CyVerse
Candidate
Research
Data @
Data Store
Identify,
organize,
rename
files and
folders
Prepare a
DataCite
metadata
document
Submit to
Cyverse
Curation
Team
Data
snapshot
made
public. DOI
issued.
Candidate
VM image
Document
contents &
capabilities
Prepare a
DataCite
metadata
document
Submit to
Cyverse
Curation
Team
Public
image
released.
DOI issued.
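The "Prepare a DataCite metadata document" step can be illustrated with a completeness check. The six names below correspond to the mandatory properties of the DataCite Metadata Schema 4.x; the flat dict layout, the placeholder values, and the missing_properties helper are assumptions for this sketch and are not the DataCite XML serialization or the CyVerse curation tooling.

```python
# Mandatory properties of the DataCite Metadata Schema 4.x.
MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_properties(record):
    """Return mandatory DataCite properties that are absent or empty."""
    return [name for name in MANDATORY if not record.get(name)]


# Placeholder values throughout; no real data set is described here.
draft = {
    "creators": ["Researcher, Example"],
    "titles": ["Example genotype data set"],
    "publisher": "CyVerse Data Commons",
    "publicationYear": 2016,
    "resourceType": "Dataset",
}
print(missing_properties(draft))  # ['identifier']
```

In this sketch the draft has no identifier yet, which matches the workflow on the slide: the DOI is issued only at the final step, after submission to the curation team.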
Summary
• Cyverse is a model for providing cyberinfrastructure to diverse
bioscience user communities
• State of the art has shifted at least twice since we started work
• Had to overcome initial reticence to “give data” to Cyverse
• Still hard to get developers and providers to maintain after
contributing
• Cost recovery model - We have started using the term ‘subsidized’
rather than free but it might be too late.
• Natural synergy between our organization and ODEN objectives
Transforming Science Through Data-driven Discovery
Parker Antin
Nirav Merchant
Eric Lyons
Matt Vaughn
@mattdotvaughn
vaughn@tacc.utexas.edu
Doreen Ware
Dave Micklos
CyVerse is supported by the National Science Foundation under Grant No. DBI-0735191 and DBI-1265383.
CyVerse Executive Team

Editor's Notes

  • #3: (Brief) History and Context. In the mid-2000s, there was a realization inside the NSF that biology had some unique CI challenges that were not being met. Plant Genome was already spending on full-genome characterization projects (Arabidopsis 2010, etc.). Big data was on the horizon; NGS was just emerging. BIO-specific CI: chose plant sciences due to strong communities and sharing culture. Funded iPlant in 2008. The project spent its first 18 months assessing the immediate and future needs of plant science, then began developing CI. Renewed in 2013, with a broadened mandate to cover BIO in general, excepting human disease. Rebranded in 2016 as part of a strategy to operate sustainably after the initial program is over.
  • #4: What is Cyberinfrastructure? Before diving into specifics, define Cyberinfrastructure. This is remarkably similar to the definition of a Commons. So, our charge was: blend data storage + computing capability, reproducible analysis, and human expertise.
  • #5: Platform Overview. Vertically integrated set of offerings that serve a variety of users (technical skill, science use case, geographic location, etc.). Data storage is centralized, sharing is easy. Tied to ability to analyze in situ. Ease of use <-> Ease of re-use. Everything below the consumer-facing layer: LEGO building blocks. At the bottom: federation is baked in. We own almost no hardware! This is key. Hard to sustain!
  • #6: Adoption and Outputs (END 6:00). So, what if you build it and they don’t come? Luckily, they did. On average, we serve as many users as other major CI investments like leadership-class clusters or the XSEDE project. But different users! Home to lots of training and consulting (~25% effort). CyVerse has spun out at least three successful open data ecosystem products.
  • #7: Finding and re-using Data 1. EARLY DESIGN DECISION: availability of a scalable “Data Store”. OPTIONAL: you don’t have to keep all your data there, but we hope to add sufficient value that you do. Tech stack: there was nothing ready to go. Combines iRODS + ElasticSearch + Agave APIs. Currently 2+ PB of user files. At UA this is purchased as needed. At TACC, it is sliced from our Corral storage offering. CSHL and iPlant.UK are federating in. Agave APIs give us access to other storage protocols like S3, SFTP, FTP, Azure, etc.
  • #8: Finding and re-using Data 2. Why don’t you just give us Google, Dropbox, Box? Data Store APIs let us implement the Data Window GUI. Here’s an example from CyVerse’s DE workbench. Comprehensive, easy data management, but petascale. Aside: provenance is tracked under the hood, but we don’t expose it via the UI yet.
  • #9: Finding and re-using Data 3. Google Drive for scientific big data. Can do local caching as well, but it is hard to do native support well. This has been our story to date on data; more in a minute.
  • #10: Finding and re-using Software (1). Reagents (data) and protocols (apps) both must be sharable and reusable. Software -> Application Catalog. Each front-end GUI has its own concept and implementation, but they share common infrastructure or are interoperable. Here’s the DE, our flagship workbench application. Deploying apps to these catalogs involves a Docker or VM image plus GUI specifications (written in some DSL or metadata form). About half of the applications in CyVerse are community contributed.
  • #11: Finding and re-using Software (2). Here’s the ATMOSPHERE image catalog. We mandate provision of help docs and example data/usage.
  • #12: Finding and re-using Software (3). Here’s an example of a CyVerse App and Data available in a 3rd PARTY APPLICATION. Give credit and attribution to the App contributor as well as the primary software author (if different).
  • #13: Turning back to data. CyVerse Data Commons (1) (12:00). To date, the CyVerse strategy around data has been "Bring it in, use and discover within the platform, we won’t lock you in". This was not selfish: adopters needed a clear path and we wanted to be sure our CI was externally reliable. We’ve been working to broaden this approach as our technology has matured, under a banner called “CyVerse Data Commons”. We hold a lot of primary (1°) data. Some of it has a natural home, like NCBI. We have taken responsibility to help that happen. In other cases, it makes sense to publish in place: there is no natural repository, the data is too large to move, or there is an expectation that re-users will perform extensive re-analysis on it. We accomplish this now with “Community Data” and deep-linking. Improving offerings over the course of 2016 …
  • #14: CyVerse Data Commons (2). Here are two example workflows being implemented. Both result in a persistent, resolvable identifier. Note: the VM workflow is already implemented in our sister project Jetstream. Images can be exported and are being published at IU ScholarWorks. Uses the DataCite schema. Indexed by public search engines. Feeds into our ElasticSearch-based metadata service to allow easy search and retrieval. The search API will be publicly accessible later this year.
  • #15: Bullet points