Research Automation
for Data-Driven Discovery
Ian Foster
Argonne National Laboratory &
The University of Chicago
foster@anl.gov
A productivity crisis in research
Data volumes are growing much faster than Moore’s law …
(10,000x more over 6 years for genome data)
Kahn, Science, 331 (6018): 728-729
But most labs have extremely limited resources
Heidorn: NSF grants in 2007 < $350,000: 80% of awards, 50% of grant $$
"Well, in our country," said Alice …
"you'd generally get to somewhere else
— if you run very fast for a long time,
as we've been doing."
"A slow sort of country!" said the
Queen. "Now, here, you see, it
takes all the running you can do,
to keep in the same place. If you
want to get somewhere else, you
must run at least twice as fast as that!"
The challenge of staying competitive
https://bit.ly/2l4gfgu
How industry handles complexity
cloud4scieng.org
Industry software builds on powerful platform services
Cloud platforms have transformed how software is
developed and delivered
Can we do the same for science?
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted
implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission: thus
reduce costs, increase quality, promote interoperability
What capabilities?
• Auth: Manage identities, authentication, and authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
globus.org
Science services
operated by UChicago for researchers worldwide
Monitor transfer
Monitor activities
Manage data
Automate and outsource with
REST APIs and Python SDK
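As a rough illustration of what outsourcing a task to a REST API can look like from Python, the sketch below builds (but does not send) an authenticated request asking a service to move files from A to B. The base URL, the `/transfers` path, and the JSON field names are hypothetical placeholders chosen for this example; they are not the Globus API, whose actual interface is documented at globus.org.

```python
import json
import urllib.request


def build_transfer_request(base_url, access_token, source, destination, items):
    """Build a POST request asking a (hypothetical) service to copy files.

    `items` is a list of (source_path, destination_path) pairs.
    The URL layout and payload fields are assumptions made for illustration.
    """
    payload = {
        "source": source,
        "destination": destination,
        "items": [
            {"source_path": src, "destination_path": dst} for src, dst in items
        ],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/transfers",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Bearer tokens are a common way to authorize REST calls.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_transfer_request(
        "https://transfer.example.org/v1",  # hypothetical service
        "TOKEN",                            # placeholder credential
        "lab-storage",
        "campus-archive",
        [("/raw/run42.h5", "/archive/run42.h5")],
    )
    print(request.get_method(), request.full_url)
    # Sending is deliberately left out; against a real service it would be
    # urllib.request.urlopen(request).
```

The point of the pattern is that the caller only describes the task; retries, performance tuning, and monitoring are left to the hosted service.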
Automate and outsource with
REST APIs and Python SDK
[Figure labels: UK, NIST, NSF, NSF, NSF, DOE, NSF, Canada]
Automate and outsource:
Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
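The publication steps listed above can be sketched locally with the Python standard library. This is a minimal stand-in under stated assumptions, not how the Materials Data Facility or Globus implement publication: the "permanent location" is an ordinary directory, the identifier is a random UUID with a made-up `local:` prefix rather than a registered persistent identifier, and the index is an in-memory dictionary. The function names `sha256_of` and `publish` are hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
import uuid
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum, reading the file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(source: Path, archive_dir: Path, metadata: dict, index: dict) -> dict:
    """Run the five publication steps for one file and return its record."""
    # 1. Move (here: copy) to the permanent location.
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    shutil.copy2(source, target)

    record = {
        # 4. Assign an identifier (placeholder UUID, not a DOI or handle).
        "identifier": f"local:{uuid.uuid4()}",
        "path": str(target),
        # 2. Compute and record the checksum.
        "sha256": sha256_of(target),
        "size_bytes": target.stat().st_size,
        # 3. Record the user-supplied metadata.
        "metadata": dict(metadata),
    }

    # 5. Index for discovery: map each metadata word to matching identifiers.
    for value in metadata.values():
        for word in str(value).lower().split():
            index.setdefault(word, []).append(record["identifier"])
    return record


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "sample.csv"
        data_file.write_text("temperature,strain\n300,0.01\n")
        search_index = {}
        result = publish(
            data_file,
            Path(tmp) / "archive",
            {"title": "Alloy strain measurements"},
            search_index,
        )
        print(json.dumps(result, indent=2))
        print(search_index.get("alloy"))
```

In a hosted setting each of these steps would be a call to a service rather than local code, which is the outsourcing the slide describes.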
[Figure labels: Data Publication, Indexing, materialsdatafacility.org, 2 petabytes, 100 Gbps, Globus APIs]
Automate and outsource:
Publication and discovery
Programmatic access (Python, Jupyter)
Web browse and search
Example: NCAR’s Research Data Archive
Globus used for
• Single sign on via streamlined account provisioning
• Data sharing
• Data downloads
Beyond transfer
(Experimental)
Cloud platforms have transformed how software is
developed and delivered
We can do the same for science
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted
implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission, to
reduce costs, increase quality, promote interoperability
We have identified some needed capabilities
• Auth: Manage identities, authentication, authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
Established: 12,000 endpoints, 100,000+ users
New: 100s of users
Experimental: 10s of users
globus.org — Ian Foster — foster@anl.gov

Editor's Notes

  • #3: Genome data increased 10,000x more than Moore’s law over the last six years
  • #4: For many researchers, projects, and institutions, large data volumes are not an opportunity but a fundamental challenge to their competitiveness as researchers. How can they keep up?