Challenges of deploying your HPC application to the cloud
November 12, 2016
Mulyanto Poort, VP Engineering
mulyanto@rescale.com
Overview
•  Overview of Rescale
•  Challenges of deploying software on Rescale
•  How we install and deploy software
•  Examples
•  Future developments: ScaleX Developer
•  Conclusions
Global Growth
Technology
Customers
Investors
San Francisco HQ, Tokyo and Munich offices, further EMEA
expansion
300%+ annual growth
Global cloud-based HPC platform
57+ data centers, 180+ simulation software packages
100+ leading Global 2000 enterprises
Rescale - Company Profile Overview
Jeff Bezos Richard Branson Peter Thiel
Software on Rescale
180 Applications: 26 on demand, 80 bring your own license, 74 free
Rescale’s Cloud Infrastructure
9 Infrastructure Providers
58+ Datacenters - 5 regional platforms
Hardware on Rescale
33 Hardware Configurations
Intel Xeon: Sandy Bridge, Ivy Bridge, Haswell, Broadwell
Intel Xeon Phi
Up to 64 cores per node
Up to 2TB of RAM per node
Up to 100 Gbps EDR InfiniBand
The Challenge: Multiple providers
Each provider brings its own quirks:
●  RDMA works only with Intel MPI
●  RDMA hardware is not available in all regions
●  Not all hardware is supported in all regions
●  Amazon Linux is the OS of choice
●  Hyperthreads are hard to distinguish from physical cores
●  MVAPICH is the MPI flavor of choice
●  No root access to compute nodes
●  Uses rsh instead of ssh
●  Bare metal EDR InfiniBand
The Challenge: Virtual vs Bare metal systems
Virtual
Pros:
●  Abstraction of resources
●  Configurable environment
●  Better user isolation
●  Faster hardware refresh cycles
Cons:
●  Harder to tune hardware/software
●  Provisioning time may be slow

Bare Metal
Pros:
●  Performance
●  More familiar environment for traditional HPC users
Cons:
●  Queued systems
●  No root access to compute nodes
The Challenge: Multiple regions
Rescale Platforms
•  platform.rescale.com
•  eu.rescale.com
•  kr.rescale.com
•  itar.rescale.com
•  platform.rescale.jp
Provider regions and clusters
•  azure: westus, westeurope, … 38 regions
•  aws: us-east, ap-northeast, … 18 regions
•  osc: Owens, Ruby, Oakley
The Challenge: Multiple OSes and software types
•  Linux vs Windows
•  Batch vs Virtual Desktop
•  Bash vs PowerShell
•  Intel MPI, Open MPI, MPICH, Platform MPI, MVAPICH, Microsoft MPI,
Microsoft HPC Pack, Charm++
•  Support workflows with multiple applications / types of applications
(e.g. co-simulation, MDO, etc.)
Software installation principles on Rescale
Customers should not have to worry about where their software runs: execution
should be identical whether a job lands on AWS, Azure, or any other provider.
Abstraction of MPI
•  Provide common command interface - independent of hardware
•  Automatically select optimal MPI for hardware
•  Automatically set MPI options like affinity, binding, distribution and
interconnect options
Maximize performance
Maximize compatibility with hardware and optimally utilize hardware
•  Co-processors, GPUs, AVX2
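As a rough illustration of the MPI abstraction above, flavor selection can be sketched as a small rule table. This is a minimal sketch only: the function and rules below are drawn from the provider quirks listed earlier in the deck, not from Rescale's actual code.

```python
# Hedged sketch: pick an MPI flavor and extra mpirun flags from hardware traits.
# The rules mirror the deck (Azure RDMA -> Intel MPI with DAPL fabrics,
# bare-metal OSC -> MVAPICH); names are illustrative, not Rescale's API.

def select_mpi(provider, interconnect):
    """Return (mpi_flavor, extra_mpirun_flags) for a node type."""
    if provider == "azure" and interconnect == "RDMA":
        # Azure RDMA works only with Intel MPI (per the deck).
        return "intelmpi", ["-genv", "I_MPI_FABRICS", "shm:dapl"]
    if provider == "osc":
        # Bare-metal cluster: MVAPICH is the flavor of choice.
        return "mvapich", []
    # Default when no hardware-specific constraint applies.
    return "openmpi", []
```

The point of the lookup is that user-facing commands never mention the flavor; the platform injects it per hardware configuration.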
Global install process
Initial Install (~16 hours)
●  Install by hand
●  Create an automated installation script
●  Run the automated install
●  Create a JSON definition for the installation
●  Deploy to the provider
●  Create a regression test

New Provider (~4 hours)
●  Put the base install in the provider repository
●  Create a provider-specific environment
●  Pull the install down from the repository and install it on the new provider

New Version (~1 hour)
●  Add the version to the install script (versions=[10.0, 10.1])
●  Run the automated install
●  Deploy to the provider
Defining the installation: <software>.json

<software>.json describes the software: list of versions, environment, license information, etc.

{
  "software": "ansys",
  "description": "Ansys Software",
  "versions": [
    {
      "version": "17.0",
      "environment_variables": [...],
      "installations": [...],
      ...
    },
    {
      "version": "16.2",
      "environment_variables": [...],
      "installations": [...],
      ...
    }
  ],
  ...
}
Defining the installation: <software>.json > environment_variables

The environment_variables section defines the environment the software runs in:

{
  "software": "ansys",
  "versions": [
    {
      "version": "17.0",
      "environment_variables": [
        {
          "name": "VERSION",
          "value": "17.0",
          "sort_order": 1
        },
        {
          "name": "PATH",
          "value": "$INSTALL_ROOT/$VERSION/bin:$INSTALL_ROOT/$VERSION/mpi/bin",
          "sort_order": 2
        }
      ]
    }
  ]
}
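A minimal sketch of how such a definition could be turned into a runtime environment, expanding variables in sort order. `build_environment` is a hypothetical helper introduced here for illustration, not part of the platform.

```python
from string import Template  # $NAME-style substitution


def build_environment(version_def, install_root):
    """Expand a version's environment_variables into a concrete env dict.

    Variables are applied in sort_order, so later values (like PATH) can
    reference earlier ones (like VERSION) and the seeded INSTALL_ROOT.
    """
    env = {"INSTALL_ROOT": install_root, "VERSION": version_def["version"]}
    ordered = sorted(version_def["environment_variables"],
                     key=lambda v: v["sort_order"])
    for var in ordered:
        # safe_substitute leaves unknown $refs intact instead of raising.
        env[var["name"]] = Template(var["value"]).safe_substitute(env)
    return env
```

Applied to the ANSYS example above, PATH expands to the version-specific bin directories under /rescale/ansys.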
Defining the installation: <software>.json > installations

The installations section references the locations of the installations per provider:

{
  "software": "ansys",
  "versions": [
    {
      "version": "17.0",
      "installations": [
        {
          "provider": "azure",
          "install_root": "/rescale/ansys"
        },
        {
          "provider": "osc",
          "install_root": "/shared/rescale/ansys"
        }
      ],
      ...
    }
  ]
}
Defining the installation: rescale-<software>.json

rescale-<software>.json defines the resources related to an install root:

{
  "install_root": "/rescale/ansys",
  "providers": [
    {
      "provider": "aws",
      "resources": [
        {
          "region": "us-east-1",
          "resource": "snap-0123456789abcdef"
        },
        {
          "region": "us-gov-west-1",
          "resource": "snap-00aa11bb22cc33ee"
        }
      ]
    },
    ...
  ]
}
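Resolving an install root to the concrete resource for one provider region could look like the lookup below. This is a hypothetical helper for illustration; the real resolution lives in the deployment tooling.

```python
def find_resource(defn, provider, region):
    """Return the snapshot/archive ID backing an install_root in one region,
    given a parsed rescale-<software>.json definition."""
    for p in defn["providers"]:
        if p["provider"] == provider:
            for r in p["resources"]:
                if r["region"] == region:
                    return r["resource"]
    # No staged resource: the install has not been replicated here yet.
    raise KeyError(f"no resource for {provider}/{region}")
```

A missing entry is exactly the signal the staging step (next slides) acts on: copy the install, then record the new resource ID in the definition.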
The install process
Install: create the base install
●  Silently install using a shell or PowerShell script
●  Install to a snapshot, VHD, or shared storage location

Stage + Test: replicate the install
●  Copy the install to different regions, storage accounts, and clusters
●  Use the provider API or shell commands
●  Update the JSON definition files to reflect new resources
●  Regression testing

Deploy: sync the JSON definitions with production databases
●  Make sure that running jobs are not affected by changes in the database
The install process: Install

Common interface for all providers:

rescale-install --install-root /program/ansys_fluids --provider azure

Provision resources
●  For AWS, Azure, etc.: provision a VM/instance, attach a clean volume, and mount it at <install-root>
●  For bare metal providers: open an ssh connection to a login account on the cluster

Run install
●  Execute the install script: PowerShell (Windows) or Bash (Linux)
●  Pull the bits down from blob storage
●  Run a pre-generated script to install the software to <install-root>

Capture resource
●  For AWS, Azure, etc.: a snapshot or VHD is generated; its resource ID is saved to the JSON definition
●  For bare metal providers: the install is archived and stored in a repository
The install process: Install > code

Python code:
•  AWS Python SDK (Boto, https://github.com/boto/boto)
•  Azure Python SDK (https://github.com/Azure/azure-sdk-for-python)
•  Google Python SDK (https://github.com/GoogleCloudPlatform/google-cloud-python)
•  Fabric (for ssh)

Common interface for provisioning install resources and executing commands on those resources:

provider = 'azure'
os = 'linux'
provision_resource = ProvisionResource(install_settings, provider=provider, os=os)
install_action = InstallAction(install_settings, os=os)
provision_resource.with_action(install_action)
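The internals of ProvisionResource and InstallAction aren't shown in the deck. A toy sketch of the pattern, recording commands instead of calling any cloud SDK, might look like this (class bodies are illustrative assumptions, not the real SDK-backed implementation):

```python
class InstallAction:
    """An action to run on a provisioned resource."""

    def __init__(self, settings, os):
        self.settings = settings
        self.os = os

    def commands(self):
        # Windows installs run PowerShell; Linux installs run Bash.
        runner = "powershell" if self.os == "windows" else "bash"
        return [f"{runner} {self.settings['script']}"]


class ProvisionResource:
    """Provision a resource, then apply actions to it."""

    def __init__(self, settings, provider, os):
        self.settings = settings
        self.provider = provider
        self.os = os
        self.log = []  # stands in for real provisioning + remote execution

    def with_action(self, action):
        # A real implementation would provision via the provider SDK and
        # execute over ssh; here we only record the command stream.
        self.log.append(f"provision {self.provider} instance")
        self.log.extend(action.commands())
        return self
```

The value of the pattern is that the provider- and OS-specific details stay inside the two classes, so the top-level install flow reads the same for every provider.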
The install process: Stage

Common interface:

rescale-copy --from us-east-1 --to ap-northeast-1 --provider aws

Features:
•  Use the provider API when possible, otherwise rsync between regions
•  For bare metal, pull the installation from a repository
•  Save state: make sure you don't copy when it isn't necessary
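The "save state" bullet can be sketched as a simple guard that skips a copy when the destination has already been staged. Names are hypothetical; in practice the state would be persisted, e.g. in the JSON definition files.

```python
def stage(copied, provider, src, dst, copy_fn):
    """Copy an install between regions only if not already staged.

    `copied` is a set of (provider, dst_region) pairs persisted between runs;
    `copy_fn` is the provider-API or rsync copy step.
    """
    key = (provider, dst)
    if key in copied:
        return False  # already staged; nothing to do
    copy_fn(src, dst)
    copied.add(key)
    return True
```

Idempotence matters here: staging runs repeatedly from CI, and snapshot copies across regions are slow and billed.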
The install process: Testing

Regression testing in the testing environment
•  Use Rescale's API to re-run baseline jobs and compare against expected results
•  Integrated with Jenkins build server
•  Example definition:

{
  "environment": "testing",
  "baseline_job_id": "aBcDeF",
  "name": "Ansys MPI Test",
  "tags": [...],
  "tests": [
    {
      "type": "file_content",
      "file_name": "output.log",
      "parameters": [
        {"type": "contains_regex", "value": "^.*[1-2][0-9] seconds.*$"},
        {"type": "line_count", "value": ">256"}
      ]
    }
  ]
}
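A sketch of how the file_content checks in such a definition could be evaluated against a log file. The runner below is hypothetical; the real tests run through Rescale's API and Jenkins.

```python
import re


def check_file(text, parameters):
    """Apply file_content checks from a test definition to a log's text."""
    for p in parameters:
        if p["type"] == "contains_regex":
            # The pattern is anchored per line, so search in MULTILINE mode.
            if not re.search(p["value"], text, re.MULTILINE):
                return False
        elif p["type"] == "line_count":
            # Values like ">256" compare against the number of lines.
            # Only ">" is handled in this sketch.
            op, n = p["value"][0], int(p["value"][1:])
            if op == ">" and not len(text.splitlines()) > n:
                return False
    return True
```

The regex check doubles as a crude performance regression gate: "[1-2][0-9] seconds" fails the test if the solve time drifts out of the 10-29 second band.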
How software is installed on Rescale > Deploy

Definitions are synced to production databases
•  Integrated with the Jenkins build server
•  Definitions are pulled from source control and synced with the production database
Example: LS-DYNA

(Diagram: ls-dyna and MPI builds flow from source control and blob storage into /rescale/ls-dyna)

Environment:
VERSION=9.0.0
MPI=intelmpi
CORES=32
NODES=2
INSTALL_ROOT=/rescale/ls-dyna
INTERCONNECT=RDMA
PROVIDER=azure
REMSH=ssh

User command:
ls-dyna -i input.k -p double

Execution command:
/rescale/ls-dyna/mpi/intelmpi/5.0/bin64/mpirun -np 32 -machinefile /home/user/machinefile /rescale/ls-dyna/builds/lsdyna-9.0.0-intelmpi-mpp-double i=input.k
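The translation from the user command to the execution command can be sketched as a small wrapper. This is a sketch only: the MPI version (5.0), the machinefile path, and the build-name scheme are taken from this one example, and `build_exec_command` is a hypothetical name.

```python
def build_exec_command(env, user_args):
    """Expand `ls-dyna -i input.k -p double` into the full mpirun invocation
    using the environment the platform sets up (VERSION, MPI, CORES, ...)."""
    # Pair up flag/value arguments: ["-i", "input.k", "-p", "double"].
    opts = dict(zip(user_args[::2], user_args[1::2]))
    # Build name encodes version, MPI flavor, and precision, as in the slide.
    build = f"lsdyna-{env['VERSION']}-{env['MPI']}-mpp-{opts['-p']}"
    return (f"{env['INSTALL_ROOT']}/mpi/{env['MPI']}/5.0/bin64/mpirun "
            f"-np {env['CORES']} -machinefile /home/user/machinefile "
            f"{env['INSTALL_ROOT']}/builds/{build} i={opts['-i']}")
```

This is the common-command-interface principle from slide 11 in action: the user never sees mpirun, the machinefile, or the build naming scheme.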
Example: Abaqus

(Diagram: Abaqus version directories flow from source control and blob storage into /rescale/abaqus)

#!/bin/bash
# Abaqus wrapper for Azure RDMA
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0
export I_MPI_DYNAMIC_CONNECTION=0

echo "mp_mpi_implementation=IMPI" >> ~/abaqus_v6.env
echo "mp_environment_export+=('I_MPI_FABRICS', 'I_MPI_DAPL_PROVIDER', 'I_MPI_DYNAMIC_CONNECTION')" >> ~/abaqus_v6.env

${ABAQUS_BIN}/${ABAQUS_EXE} "$@"

Environment:
VERSION=2016
INSTALL_ROOT=/rescale/abaqus
ABAQUS_BIN=${INSTALL_ROOT}/${VERSION}/code/bin
ABAQUS_EXE=abaqus
MP_HOSTLIST=['node1',16,'node2',16]
REMSH=ssh

abaqus_v6.env:
mp_hostlist=...
mp_rsh_command=...

User command:
abaqus job=job cpus=$cpus mp_mode=mpi interactive
Future plans
ScaleX Developer
•  Provide a GUI to our tools to allow ISVs to deploy their software directly to Rescale
•  Integrate with Rescale’s ISV portal to manage installations and version access
•  Integrate with continuous integration systems for testing dev builds and QA testing
ScaleX Open Source
•  Integrate with version control systems (GitHub, Bitbucket, SourceForge) to allow users
to build and deploy their own open source builds at any time
•  Create a community for users to share their open source builds with each other
Use these products internally to build, install and deploy software
Conclusions
Lessons learned
•  Keep things as abstract as possible to ease integration with new cloud providers
•  Understand software limitations and use cases before integrating in the cloud
Unsolved challenges
•  A good process for customer-provided software
•  Continuous integration
•  Automatically deploying software when a vendor releases a new version
Conclusions
Advice for HPC developers to successfully transition to the cloud
•  Make your software relocatable (export SOFTWARE_ROOT=/rescale/software)
•  Simple installation process (tar -xzf install.tar.gz)
•  Consistent installation process
•  Simple batch execution of your software
•  Minimize dependencies on user-provided libraries (bundle dependencies)
•  Have a clear cloud licensing strategy
•  Clear separation between solver and GUI executables
Rescale Confidential
Become a Rescale Software Partner
Onboarding ISV Package for Intel HPC Dev Con Attendees

What's included?
●  Hosted webinar at launch
●  Rescale test credits
●  Benchmarking on 3 core types
●  Logo on partner page
●  Guest blog post
●  Beta access to ScaleX Developer
●  Case study on Rescale.com
●  Dedicated ISV portal*

Email partners@rescale.com
Subject: SW Partner - Intel HPC Dev Con
*For ISV partners with on-demand licensing
Thank You
Questions?
mulyanto@rescale.com - rescale.com