Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all available GPUs in the Cloud
Frank Würthwein
OSG Executive Director
UCSD/SDSC
Jensen Huang keynote yesterday
The Largest Cloud Simulation in History
50k NVIDIA GPUs in the Cloud
350 Petaflops for 2 hours
Distributed across US, Europe & Asia
On Saturday morning we bought all GPU capacity that was for sale in
Amazon Web Services, Microsoft Azure, and Google Cloud Platform worldwide
The Science Case
IceCube
A cubic kilometer of ice at the South Pole is instrumented with 5160 optical sensors.
Astrophysics:
• Discovery of astrophysical neutrinos
• First evidence of neutrino point source (TXS)
• Cosmic rays with surface detector
Particle Physics:
• Atmospheric neutrino oscillation
• Neutrino cross sections at TeV scale
• New physics searches at highest energies
Earth Science:
• Glaciology
• Earth tomography
A facility with very
diverse science goals
Restricting this talk to high-energy astrophysics.
High Energy Astrophysics
Science case for IceCube
The universe is opaque to light at the highest energies and distances.
Only gravitational waves and neutrinos can pinpoint the most violent events in the universe.
Fortunately, the highest energy neutrinos are of cosmic origin: effectively “background free” as long as the energy is measured correctly.
High energy neutrinos from
outside the solar system
First 28 very high energy neutrinos from outside the solar system
The red curve is the photon flux spectrum measured with the Fermi satellite.
The black points show the corresponding high-energy neutrino flux spectrum measured by IceCube.
This demonstrates both the opaqueness of the universe to high-energy photons, and the ability of IceCube to detect neutrinos above the maximum energy at which we can see light, due to this opaqueness.
Science 342 (2013). DOI: 10.1126/science.1242856
Understanding the Origin
We now know high energy events happen in the universe. What are they?
p + γ → Δ⁺ → p + π⁰, π⁰ → γ + γ
p + γ → Δ⁺ → n + π⁺, π⁺ → μ⁺ + ν_μ
(Credit: Aya Ishihara)
The hypothesis:
The same cosmic events produce
neutrinos and photons
We detect the electrons or muons from neutrinos that interact in the ice.
Neutrinos interact very weakly => we need a very large array of instrumented ice to maximize the chance that a cosmic neutrino interacts inside the detector.
We need pointing accuracy to point back to the origin of the neutrino.
Telescopes the world over then try to identify the source in the direction IceCube is pointing to for that neutrino.
Multi-messenger Astrophysics
The ν detection challenge
[Figure: optical properties of the South Pole ice; credit: Aya Ishihara.]
Ice properties change with
depth and wavelength
Observed pointing resolution at high energies is systematics-limited.
Central value moves
for different ice models
Improved e and τ reconstruction
⇒ increased neutrino flux detection
⇒ more observations
Photon propagation through the ice runs efficiently on single-precision GPUs.
Detailed simulation campaigns to improve pointing resolution by improving the ice model.
Improvement in reconstruction with a better ice model near the detectors.
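To make the compute pattern concrete, here is a deliberately minimal, purely illustrative single-precision photon random walk in Python/NumPy. It is not IceCube's production code; the homogeneous ice model, the scattering and absorption lengths, and the function name are all assumptions of this sketch. It only shows why each photon can be simulated independently in float32, which is what makes the workload such a good fit for many GPUs at once.

```python
# Toy single-precision photon-propagation Monte Carlo (illustrative only;
# not IceCube's production code). All arrays are float32, mirroring the
# observation that the workload runs efficiently in single precision.
import numpy as np

rng = np.random.default_rng(42)

def propagate(n_photons=1_000_000, scat_len=25.0, abs_len=100.0, n_steps=100):
    """Random-walk photons through a homogeneous ice model (hypothetical
    scattering/absorption lengths in meters); return final positions of survivors."""
    pos = np.zeros((n_photons, 3), dtype=np.float32)
    # Start with isotropic directions.
    dirs = rng.normal(size=(n_photons, 3)).astype(np.float32)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    alive = np.ones(n_photons, dtype=bool)

    for _ in range(n_steps):
        # Exponentially distributed distance to the next scattering point.
        step = rng.exponential(scat_len, size=n_photons).astype(np.float32)
        pos += dirs * step[:, None] * alive[:, None]
        # Absorption: survival probability falls exponentially with path length.
        alive &= rng.random(n_photons) < np.exp(-step / abs_len)
        # Surviving photons scatter into a new (isotropic) direction.
        new_dirs = rng.normal(size=(n_photons, 3)).astype(np.float32)
        new_dirs /= np.linalg.norm(new_dirs, axis=1, keepdims=True)
        dirs = np.where(alive[:, None], new_dirs, dirs)

    return pos[alive]

if __name__ == "__main__":
    hits = propagate()
    print(f"{len(hits)} photons survived; mean |r| = {np.linalg.norm(hits, axis=1).mean():.1f} m")
```

Every photon is independent, so the real simulation can be split into arbitrarily many jobs, one GPU each, which is exactly the shape of workload the cloud burst exploited.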
First evidence of an origin
First location of a source of very high energy neutrinos.
The neutrino produced a high-energy muon near IceCube. The muon produced light as it traversed the IceCube volume. The light was detected by IceCube's array of phototubes.
IceCube alerted the astronomy community to the observation of a single high-energy neutrino on September 22, 2017.
A blazar designated by astronomers as TXS 0506+056 was subsequently identified as the most likely source in the direction IceCube was pointing. Multiple telescopes saw light from TXS at the same time IceCube saw the neutrino.
Science 361, 147-151 (2018). DOI: 10.1126/science.aat2890
IceCube’s Future Plans
[Figure: preliminary timeline of the IceCube-Gen2 facility, covering MeV- to EeV-scale physics with IC86, the IceCube Upgrade, PINGU, a surface air-shower array, a high-energy array, and a radio array; R&D, design & approval, construction, and deployment phases span 2016 to ~2032. Credit: "IceCube Upgrade and Gen2", Summer Blot, TeVPA 2018.]
Near term:
add more phototubes to the deep core to increase the granularity of measurements.
Longer term:
• Extend the instrumented volume at smaller granularity.
• Extend the even smaller granularity deep-core volume.
• Add a surface array.
Improve the detector for low- & high-energy neutrinos
Details on the Cloud Burst
The Idea
• Integrate all GPUs available for sale
worldwide into a single HTCondor pool.
- use 28 regions across AWS, Azure, and Google Cloud for a burst of a couple of hours or so.
• IceCube submits their photon propagation workflow to this HTCondor pool (a submission sketch follows below).
- we handle the input, the jobs on the GPUs, and the output as a single globally distributed system.
Run a GPU burst relevant in scale
for future Exascale HPC systems.
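As a rough sketch of what such a submission can look like with the HTCondor Python bindings (the executable name, resource requests, and input URLs below are hypothetical placeholders, not IceCube's actual workflow):

```python
# Hedged sketch: queueing GPU photon-propagation jobs with the HTCondor Python
# bindings. The executable, resource requests, and input URLs are placeholders,
# not IceCube's actual configuration.
import htcondor

submit = htcondor.Submit({
    "executable": "run_photon_prop.sh",   # hypothetical wrapper around the GPU simulation
    "arguments": "$(input_url)",
    "request_gpus": "1",                  # one GPU per job
    "request_cpus": "1",
    "request_memory": "4GB",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "photon_prop_$(Cluster)_$(Process).out",
    "error": "photon_prop_$(Cluster)_$(Process).err",
    "log": "photon_prop_$(Cluster).log",
})

schedd = htcondor.Schedd()  # one of the ~10 schedds used for submission
# One job per input file; the bucket path below is a made-up placeholder.
inputs = [{"input_url": f"gs://some-bucket/photon_input_{i:05d}.dat"} for i in range(100)]
result = schedd.submit(submit, itemdata=iter(inputs))
print("Submitted cluster", result.cluster())
```

Because each job requests exactly one GPU, HTCondor can match it to any GPU slot in the global pool, regardless of cloud provider or GPU generation.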
A global HTCondor pool
• IceCube, like all OSG user communities, relies on
HTCondor for resource orchestration
- This demo used the standard tools
• Dedicated HW setup
- Avoid disruption of OSG production system
- Optimize HTCondor setup for the spiky nature of the demo
§ multiple schedds for IceCube to submit to
§ collecting resources in each cloud region, then collecting from all
regions into global pool
HTCondor Distributed CI
[Diagram: IceCube submits through 10 schedds to the distributed HTCondor infrastructure; a collector in each cloud region reports into a top-level collector and negotiator, and the cloud VMs from all regions form one global resource pool.]
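A hedged sketch of how the size of such a global pool could be monitored from the top-level collector; the collector hostname and the "CloudRegion"/"TotalGPUs" attribute names are assumptions for illustration, not necessarily what the demo used:

```python
# Hedged sketch: tally GPU slots per cloud region from the top-level collector.
# The collector hostname and the "CloudRegion"/"TotalGPUs" attribute names are
# assumptions for illustration; the demo's real monitoring may have differed.
from collections import Counter

import htcondor

collector = htcondor.Collector("global-pool.example.org")  # hypothetical central collector
ads = collector.query(
    htcondor.AdTypes.Startd,
    constraint="TotalGPUs > 0",
    projection=["Machine", "CloudRegion", "TotalGPUs"],
)

gpus_per_region = Counter()
for ad in ads:
    gpus_per_region[str(ad.get("CloudRegion", "unknown"))] += int(ad.get("TotalGPUs", 0))

for region, n in gpus_per_region.most_common():
    print(f"{region:25s} {n:7d} GPUs")
print("Total GPUs in the pool:", sum(gpus_per_region.values()))
```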
Using native Cloud storage
• Input data pre-staged into native Cloud storage
- Each file in one-to-few Cloud regions
§ some replication to deal with limited predictability of resources per region
- Local to Compute for large regions for maximum throughput
- Reading from “close” region for smaller ones to minimize ops
• Output staged back to region-local Cloud storage
• Deployed simple wrappers around Cloud-native file transfer tools (a sketch follows below)
- IceCube jobs do not need to customize for different Clouds
- They just need to know where input data is available
(pretty standard OSG operation mode)
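A minimal sketch of what such a wrapper could look like, dispatching on the URL scheme to each cloud's native transfer CLI. The slides do not specify the actual tools or flags, so aws, gsutil, and azcopy are shown here only as plausible choices:

```python
# Sketch of a thin wrapper around cloud-native transfer tools, so jobs only
# need to know the URL of their input data. The tool chosen per URL scheme is
# an assumption; the demo's actual wrappers are not described in the slides.
import subprocess
import sys

def fetch(url: str, dest: str = ".") -> None:
    """Copy a single object from cloud storage to a local path."""
    if url.startswith("s3://"):
        cmd = ["aws", "s3", "cp", url, dest]          # AWS
    elif url.startswith("gs://"):
        cmd = ["gsutil", "cp", url, dest]             # Google Cloud
    elif url.startswith("https://") and ".blob.core.windows.net" in url:
        cmd = ["azcopy", "copy", url, dest]           # Azure Blob Storage
    else:
        raise ValueError(f"Unrecognized storage URL: {url}")
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    fetch(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else ".")
```

In practice each job would call such a wrapper to fetch its input from the region-local bucket and to stage its output back, as described above.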
The Testing Ahead of Time
~250,000 single-threaded jobs run across 28 cloud regions during 80 minutes.
Peak at 90,000 jobs running.
Up to 60k jobs started in ~10 min.
Regions across US, EU, and Asia were used in this test.
Demonstrated burst capability of our infrastructure on CPUs.
Want scale of GPU burst to be limited only by # of GPUs available for sale.
Science with 51,000 GPUs
achieved as peak performance
[Plot: number of GPUs in use vs. time in minutes. Each color is a different cloud region in the US, EU, or Asia; 28 regions in total.]
Peaked at 51,500 GPUs
~350 Petaflops of fp32
8 generations of NVIDIA GPUs used.
A Heterogeneous Resource Pool
28 cloud regions across 4 world regions, providing us with 8 GPU generations.
No one region or GPU type dominates!
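As a back-of-the-envelope illustration of how a heterogeneous mix of GPU generations adds up to a few hundred fp32 petaflops, the sketch below multiplies hypothetical per-generation GPU counts by approximate vendor peak fp32 ratings. The model names, the counts (chosen only to total 51,500), and the ratings are assumptions for the arithmetic, not the run's actual composition or measured performance.

```python
# Back-of-the-envelope aggregation of peak fp32 throughput for a heterogeneous
# GPU pool. The per-generation counts are hypothetical placeholders (chosen only
# to total 51,500); the TFLOPS values are approximate vendor peak fp32 ratings,
# not measured application performance.
peak_fp32_tflops = {
    "V100": 15.7, "P100": 9.3, "T4": 8.1, "P40": 11.8,
    "P4": 5.5, "M60": 4.8, "K80": 4.4, "K520": 2.4,
}
gpu_counts = {
    "V100": 4000, "P100": 6000, "T4": 12000, "P40": 2000,
    "P4": 3500, "M60": 8000, "K80": 13000, "K520": 3000,
}

total_gpus = sum(gpu_counts.values())
total_pflops = sum(gpu_counts[g] * peak_fp32_tflops[g] for g in gpu_counts) / 1000.0
print(f"{total_gpus} GPUs -> ~{total_pflops:.0f} PFLOPS peak fp32")
```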
Science Produced
The Distributed High-Throughput Computing (dHTC) paradigm, implemented via HTCondor, provides global resource aggregation.
The largest cloud region provided 10.8% of the total.
The dHTC paradigm can aggregate on-prem resources anywhere, HPC at any scale, and multiple clouds.
IceCube is ready for Exascale
• Humanity has built extraordinary instruments by pooling
human and financial resources globally.
• The computing for these large collaborations fits perfectly into the cloud, or into scheduling holes in Exascale HPC systems, due to its “ingeniously parallel” nature. => dHTC
• The dHTC computing paradigm applies to a wide range of
problems across all of open science.
- We are happy to repeat this with anybody willing to spend $50-200k in
the clouds.
Contact us at: help@opensciencegrid.org
Or me personally at: fkw@ucsd.edu
Demonstrated elastic burst at 51,500 GPUs
IceCube is ready for Exascale
Acknowledgements
• This work was partially sponsored by
NSF grants OAC-1941481,
MPS-1148698, OAC-1841530 and
OAC-1826967.
