Research Automation
for Data-Driven Discovery
Ian Foster
Argonne National Laboratory &
The University of Chicago
foster@anl.gov
A productivity crisis in research
Data volumes are growing much faster than Moore’s law … (10,000x more over 6 years for genome data)
Kahn, Science, 331 (6018): 728-729
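To put the 10,000x figure in perspective, the short calculation below compares it with what Moore's law alone would deliver over the same six years. The doubling periods (24 and 18 months) are common readings of Moore's law and are assumptions here, not figures from the slide.

```python
import math

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Growth multiple after `years` if a quantity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

YEARS = 6
GENOME_GROWTH = 10_000  # "10,000x more over 6 years" from the slide

# Assumed doubling periods for Moore's law: 24 months and 18 months.
moore_24 = growth_factor(YEARS, 2.0)
moore_18 = growth_factor(YEARS, 1.5)

# Doubling period (in months) that would produce 10,000x growth in six years.
implied_doubling_months = 12 * YEARS / math.log2(GENOME_GROWTH)

print(f"Moore's law, 24-month doubling: {moore_24:.0f}x")
print(f"Moore's law, 18-month doubling: {moore_18:.0f}x")
print(f"Genome data doubling period: {implied_doubling_months:.1f} months")
```

Under either assumption, hardware improvements cover only a small fraction of the growth in data volume, which is the gap the talk is concerned with.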
But most labs have extremely limited resources
Heidorn: NSF grants in 2007 < $350,000: 80% of awards, 50% of grant $$
"Well, in our country," said Alice …
"you'd generally get to somewhere else
— if you run very fast for a long time,
as we've been doing."
"A slow sort of country!" said the
Queen. "Now, here, you see, it
takes all the running you can do,
to keep in the same place. If you
want to get somewhere else, you
must run at least twice as fast as that!"
The challenge of staying competitive
https://bit.ly/2l4gfgu
How industry handles complexity
cloud4scieng.org
Industry software builds on powerful platform services
Cloud platforms have transformed how software is
developed and delivered
Can we do the same for science?
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission: thus
reduce costs, increase quality, promote interoperability
What capabilities?
• Auth: Manage identities, authentication, and authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
globus.org
Science services
operated by UChicago for researchers worldwide
Monitor transfer
Monitor activities
Manage data
Automate and outsource with
REST APIs and Python SDK
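As a rough illustration of what automating a task through a REST API can look like, the sketch below builds (but does not send) an authenticated HTTP request that asks a transfer service to copy a file. The base URL, token, endpoint names, and payload fields are all hypothetical placeholders chosen for this example; they are not the actual Globus API or SDK.

```python
import json
import urllib.request

# Hypothetical service URL and token: placeholders, not a real endpoint or credential.
BASE_URL = "https://transfer.example.org/v1"
TOKEN = "REPLACE_WITH_ACCESS_TOKEN"

def build_transfer_request(source, destination, paths):
    """Build (without sending) a POST request describing files to copy from A to B."""
    body = {
        "source": source,
        "destination": destination,
        "items": [
            {"source_path": src, "destination_path": dst} for src, dst in paths
        ],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/transfers",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )

request = build_transfer_request(
    "lab-instrument",
    "campus-archive",
    [("/scan/run42.h5", "/archive/run42.h5")],
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

The point of the sketch is the division of labor: the script only describes what should happen, and a hosted service would be responsible for carrying it out reliably.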
UK, NIST, NSF, NSF, NSF, DOE, NSF, Canada
Automate and outsource:
Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
Data Publication
Indexing
materialsdatafacility.org
2 petabytes
100 Gbps
Globus APIs
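The five publication steps listed on this slide can be sketched as a single local function. This is a minimal illustration using only the Python standard library, under stated assumptions: a local directory stands in for the permanent location, a random UUID stands in for a persistent identifier, and a word-to-identifier dictionary stands in for a search index. The `publish` function and its parameters are hypothetical names for this example and do not reflect how the Materials Data Facility or Globus implement these steps.

```python
import hashlib
import json
import shutil
import tempfile
import uuid
from pathlib import Path

def publish(source: Path, archive_dir: Path, metadata: dict, index: dict) -> dict:
    """Run the five publication steps on one file and return its record."""
    # 1. Move to permanent location (copied here, so the original stays in place).
    archive_dir.mkdir(parents=True, exist_ok=True)
    permanent = archive_dir / source.name
    shutil.copy2(source, permanent)

    # 2. Compute and record a checksum.
    checksum = hashlib.sha256(permanent.read_bytes()).hexdigest()

    # 3. Obtain and record metadata (caller-supplied fields plus file facts).
    record = {
        **metadata,
        "path": str(permanent),
        "size_bytes": permanent.stat().st_size,
        "sha256": checksum,
    }

    # 4. Assign an identifier (a random UUID as a stand-in for a DOI or handle).
    record["identifier"] = f"urn:uuid:{uuid.uuid4()}"

    # 5. Index for discovery: map each word of the title to the identifier.
    for word in record.get("title", "").lower().split():
        index.setdefault(word, []).append(record["identifier"])
    return record

payload = b"temperature,hardness\n300,5.1\n"
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.csv"
    src.write_bytes(payload)
    index = {}
    record = publish(
        src, Path(tmp) / "archive", {"title": "Alloy hardness measurements"}, index
    )
    print(json.dumps(record, indent=2))
    print("search 'hardness':", index["hardness"])
```

Each step is small on its own; the argument of the talk is that doing all of them consistently, at scale, is what individual labs struggle to sustain.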
Automate and outsource:
Publication and discovery
Programmatic access (Python, Jupyter)
Web browse and search
Example: NCAR’s Research Data Archive
Globus used for
• Single sign on via
streamlined account
provisioning
• Data sharing
• Data downloads
Beyond transfer
(Experimental)
Cloud platforms have transformed how software is
developed and delivered
We can do the same for science
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission, to
reduce costs, increase quality, promote interoperability
We have identified some needed capabilities
• Auth: Manage identities, authentication, authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
Established: 12,000 endpoints, 100,000+ users
New: 100s of users
Experimental: 10s of users
globus.org — Ian Foster — foster@anl.gov

Editor's Notes

  • #3: Genome data have increased 10,000x more than Moore’s law over the last six years
  • #4: For many researchers, projects, and institutions, large data volumes are not an opportunity but a fundamental challenge to their competitiveness as researchers. How can they keep up?