High Performance Computing
          Cloud point of view


                     Alexey Ragozin
         alexey.ragozin@gmail.com
                          Sep 2012
Massively parallel computing

 ▪ I/O bound workload
  • Data mining / machine learning / indexing
  • Focus: Do not move data, process in place
 ▪ CPU bound workload
  • Complex simulations / complex math models
  • Focus: Keep all cores busy
 ▪ Latency bound workload
  • Physical process simulations
    (e.g. weather forecast)
  • Focus: Minimize communication latencies
CPU bound tasks

 ▪ Stream of independent tasks
  • Tasks are independent of each other
  • Random continuous stream of tasks
  • E.g. video conversion, crawling
 ▪ Structured batch jobs
  • Single batch is split into subtasks for parallel execution
  • Tasks may have data dependencies on each other
  • Tasks may be generated during batch execution
  • E.g. portfolio risk calculation
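
A structured batch with data dependencies can be pictured as a dependency graph that is executed in "waves" of mutually independent subtasks. The sketch below is only an illustration: it assumes Python 3.9+ (for the standard `graphlib` module), and the task names are hypothetical, not taken from a real risk system.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Group tasks into waves; every task in a wave can run in parallel.

    `dependencies` maps a task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # tasks whose inputs are all done
        waves.append(ready)
        sorter.done(*ready)                  # pretend the whole wave finished
    return waves


# Hypothetical end-of-day batch: two pricing subtasks share one input.
batch = {
    "aggregate_risk": {"price_book_a", "price_book_b"},
    "price_book_a": {"load_market_data"},
    "price_book_b": {"load_market_data"},
}
waves = parallel_waves(batch)
```

A real scheduler would release each task as soon as its own inputs are ready rather than waiting for a whole wave, but the waves make the available parallelism easy to see.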
Handling task stream in Cloud

  incoming tasks --> Task queue <-- polling -- Worker pool
                         |
                   queue metrics
                         v
                     Controller
                  adjusts pool size
               based on queue metrics

  Simple pattern. Exploits the “elasticity” of the cloud. Cost effective.
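
One possible shape for the controller's sizing rule is sketched below. The metric names, the five-minute drain target and the pool limits are assumptions made for illustration; the slides do not prescribe a formula.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueMetrics:
    backlog: int             # tasks currently waiting in the queue
    arrival_rate: float      # tasks arriving per second
    avg_task_seconds: float  # mean processing time of one task


def desired_pool_size(m, drain_seconds=300.0, min_workers=1, max_workers=100):
    """Return how many workers the pool should have (hypothetical policy)."""
    # Workers needed just to keep up with newly arriving tasks.
    steady = m.arrival_rate * m.avg_task_seconds
    # Extra workers needed to drain the backlog within the target window.
    catch_up = m.backlog * m.avg_task_seconds / drain_seconds
    wanted = math.ceil(steady + catch_up)
    return max(min_workers, min(max_workers, wanted))


busy = desired_pool_size(QueueMetrics(backlog=600, arrival_rate=2.0, avg_task_seconds=10.0))
idle = desired_pool_size(QueueMetrics(backlog=0, arrival_rate=0.0, avg_task_seconds=10.0))
```

The controller would call such a function periodically and start or stop instances to match the result; the clamp keeps both cost and cold-start risk bounded.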
Structured batch jobs in cloud

Batches are usually more sporadic
 ▪ e.g. end of day risk calculations
Tasks may have cross dependencies
 ▪ scheduler should be “cloud-aware”
Supplying tasks with data
 ▪ data delivery delay is critical
 ▪ worker pool is generally very large
 ▪ data sets can also be very large
Data delivery strategy
Push approach
 ▪ scheduler controls data delivery
 ▪ worker expects data to be available locally
 ▪ more opportunities for optimization
 ▪ complex
Pull approach
 ▪ worker pulls required data from a central service
 ▪ scheduler is unaware of data sets
 ▪ requires a scalable data service
 ▪ much simpler
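
To make the "more opportunities for optimization" point concrete, here is a minimal sketch of a data-aware placement rule that a push-style scheduler might use. The function, its arguments and the tie-breaking order are all hypothetical.

```python
def assign(task_inputs, worker_data, load):
    """Pick a worker for a task in a push-style, data-aware scheduler.

    task_inputs: set of data set names the task needs
    worker_data: worker name -> set of data sets already held locally
    load:        worker name -> number of tasks currently queued there
    """
    def score(worker):
        local = len(task_inputs & worker_data[worker])
        # Prefer more local data, then lower load, then a stable name order.
        return (-local, load[worker], worker)

    return min(worker_data, key=score)


worker_data = {"w1": {"trades", "curves"}, "w2": {"fx"}, "w3": set()}
load = {"w1": 5, "w2": 0, "w3": 0}

with_locality = assign({"trades", "curves"}, worker_data, load)
no_locality = assign({"vol_surface"}, worker_data, load)
```

In the pull approach none of this bookkeeping exists: the scheduler would simply pick the least loaded worker and leave data access to the worker itself.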
What kind of data do we have?

 ▪ Working set
 • the working set is divided between jobs
 • each portion of the working set is processed by a single job
 • jobs often produce the working set for the next
   computation stage
 ▪ Reference data
 • exactly the same data is shared by multiple/all jobs
 • usually a static data set
Data distribution problem

Working set
• Spiky workload – especially at the start
• Hard to predict where a piece of data will be required
• Caching is ineffective
Reference data set
• Naïve approach will produce a huge volume of
  redundant transfers – smart caching required
• Spiky workload
Private grid practice

 [Diagram: HPC Grid, Data grid, RDBMS or Data Warehouse]
Data grid, what is it?

• Key/Value storage
• Data is distributed across a cluster of servers
• RAM is usually used as storage
• Redundant copies provide a level of fault tolerance /
  durability
• No single point of failure
• Automatic rebalancing of data when servers are
  added to or removed from the grid
• Capacity and throughput scale linearly
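
Data grid products differ in how they partition data. The sketch below uses consistent hashing as one well-known way to get distribution, redundant copies and cheap rebalancing. It is an illustrative assumption, not a description of any particular product, and the server and key names are made up.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, vnodes=64):
        # Each server owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owners(self, key, copies=2):
        """Return up to `copies` distinct servers: the primary, then backups."""
        start = bisect.bisect(self._points, _hash(key))
        found = []
        for offset in range(len(self._ring)):
            server = self._ring[(start + offset) % len(self._ring)][1]
            if server not in found:
                found.append(server)
                if len(found) == copies:
                    break
        return found


keys = [f"trade:{n}" for n in range(1000)]
small = HashRing(["s1", "s2", "s3"])
large = HashRing(["s1", "s2", "s3", "s4"])   # one server added to the grid
moved = [k for k in keys if small.owners(k)[0] != large.owners(k)[0]]
```

When a server joins, only the keys that land on the new server change their primary owner, which is what makes automatic rebalancing affordable.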
Data service for cloud HPC

• Block storage service
  Azure drive / Amazon EBS
  – Lack of shared access to data
• Key / Value storage
  Azure Tables / Amazon SimpleDB
  – Pricing: volume + usage
• Blob store
  Azure Blob storage / Amazon S3
  – Pricing: volume + transactions
  – Good read scalability
Use cases for caching

 ▪ Avoid storing data in the cloud
  • Upload data once per batch and cache it in the cloud
 ▪ Reduce storage cost by reducing the number of
  operations
 ▪ Save IO bandwidth for shared data
  • Edge caching
  • Routing overlays
Distribution tree /
  Routing overlays




 [Diagram: distribution tree with three tiers (Storage, Proxy, Clients);
  clients are attached to three switches]
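
The effect of a proxy tier can be sketched with two small classes. Everything here is a simplified assumption: one caching proxy per switch, an in-memory stand-in for the storage service, and a single shared reference object.

```python
class Storage:
    """Stand-in for a cloud blob store that counts the reads it serves."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.blobs[key]


class Proxy:
    """Caching node in the tree; upstream may be Storage or another Proxy."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.upstream.get(key)   # fetched once per proxy
        return self.cache[key]


storage = Storage({"yield_curve": b"reference data"})
proxies = [Proxy(storage) for _ in range(3)]           # one proxy per switch

# 300 clients, each reading the shared object through its local proxy.
for client in range(300):
    proxies[client % 3].get("yield_curve")
```

With this layout the storage service sees one read per proxy instead of one per client, which matters when pricing is per transaction.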
Task stealing

Task stealing – an alternative scheduling approach
Task stealing is widely used for in-process multi-core concurrency

Why use it for cluster task scheduling?
• Stochastic and adaptive
• Can use cost models that account for internal cloud
  topology
• Decently solves the problem of data delivery,
  without additional caching
• Unproven for cluster computation, so far
Task stealing

 ▪ Work backlog is organized in the form of a stack
 ▪ Tasks are generated recursively
 ▪ Top of stack – fine grained tasks
 ▪ Bottom of stack – coarse grained tasks
 ▪ Execution – from top of stack
 ▪ Stealing – from bottom of stack

 [Diagram: Worker 1 forks tasks onto its stack and processes the top one]
Task stealing

 [Diagram: Worker 2 steals a task from Worker 1, then forks and processes it;
  Worker 1 keeps forking and processing its own tasks until done]
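
The stack discipline described on these slides can be tried out in a small single-threaded simulation. This is a sketch under stated assumptions: workers take turns instead of running concurrently, a task is a half-open range of integers to sum, and the split threshold is arbitrary.

```python
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.stack = deque()   # left end = bottom (coarse), right end = top (fine)
        self.result = 0
        self.executed = 0

    def step(self, others, threshold=4):
        """Do one unit of work; return False if there is nothing to do."""
        if not self.stack:
            victims = [w for w in others if w.stack]
            if not victims:
                return False
            victim = max(victims, key=lambda w: len(w.stack))
            # Steal the coarsest task from the bottom of the victim's stack.
            self.stack.append(victim.stack.popleft())
        lo, hi = self.stack.pop()            # execute from the top of the stack
        if hi - lo > threshold:
            mid = (lo + hi) // 2
            self.stack.append((mid, hi))     # fork: push both halves back
            self.stack.append((lo, mid))
        else:
            self.result += sum(range(lo, hi))
            self.executed += 1
        return True


def run(total, workers):
    workers[0].stack.append((0, total))      # the whole batch starts on one worker
    active = True
    while active:
        # A list (not a generator) so every worker takes a step each round.
        active = any([w.step([o for o in workers if o is not w]) for w in workers])
    return sum(w.result for w in workers)


team = [Worker("worker-1"), Worker("worker-2")]
answer = run(1000, team)
```

Because a thief takes from the bottom, it receives a large chunk of work and rarely needs to steal again, which keeps scheduling traffic low.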
IO bound workload in cloud

Dawn of Map/Reduce
- high bandwidth interconnects are expensive
- network storage is expensive (due to network cost)
- cheap servers and local processing keep costs low
- the price – a very complex computation model
“Cloud” reality
- network bandwidth is cheap
- disks are already “networked”
- RAM is abundant
Hadoop is cloud unfriendly

Assume I have a 50-node Hadoop cluster in the cloud
What will I gain by adding another 50 nodes?
- Not much, until they are populated with data.
What if I shut these 50 nodes down afterward?
- The effort to populate them with data will be wasted.

Hadoop couples execution and storage services
together – you have to pay for both even if you use only one.
How should cloud M/R look?

• Use cloud storage service and persistent storage
• Streaming M/R processing
• Aggressive use of memory for intermediate data

Peregrine – storeless M/R framework
  http://peregrine_mapreduce.bitbucket.org/
Spark – in-memory M/R framework
  http://www.spark-project.org/
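
As a toy illustration of streaming processing with intermediate data kept in memory, the sketch below folds mapped values into a per-key state as records arrive. It is not the API of Peregrine or Spark; the function and parameter names are hypothetical.

```python
import operator


def streaming_map_reduce(records, mapper, combine):
    """Run map and reduce in one pass over a (possibly lazy) record stream.

    mapper:  record -> iterable of (key, value) pairs
    combine: (accumulated, value) -> new accumulated value for the same key
    """
    state = {}                      # intermediate data lives in RAM only
    for record in records:
        for key, value in mapper(record):
            state[key] = combine(state[key], value) if key in state else value
    return state


def word_pairs(line):
    for word in line.split():
        yield word, 1


# Records could just as well be streamed from a cloud blob store.
lines = iter(["cloud hpc", "cloud grid"])
counts = streaming_map_reduce(lines, word_pairs, operator.add)
```

Nothing is written to local disk between the map and reduce phases, so a worker started a minute ago is as useful as one that has been running for days.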
Looking into the future

Highly anticipated features
 ▪ Scheduler as a Service
  Azure HPC / Amazon SWF
 ▪ Simple middleware for organizing caches and
  routing overlays
  Existing solutions are far from simple
 ▪ Cloud friendly map/reduce frameworks
  Cloud providers work hard to offer effective Hadoop
Thank you
http://blog.ragozin.info
- my articles


                                 Alexey Ragozin
                     alexey.ragozin@gmail.com
