CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
Ankur Dave*, Wei Lu†, Jared Jackson†, Roger Barga†
*UC Berkeley, †Microsoft Research
Background: MapReduce and its successors reimplement communication and storage. But cloud services like EC2 and Azure natively provide these utilities. Can we build an efficient distributed application using only the primitives of the cloud?
Design Goals: Efficient; fault tolerant; cloud-native
Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
Windows Azure: VM instances; blob storage; queue storage
K-Means algorithm: Iteratively groups 𝑛 points into 𝑘 clusters. Initialize -> Assign points to centroids -> Recalculate centroids -> Points moved? [Diagram label: Centroids]
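The loop on this slide can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the authors' code: the function names `closest` and `kmeans`, the `max_iters` cap, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, max_iters=100):
    """Repeat assign/recalculate until no point changes cluster."""
    centroids = [tuple(c) for c in centroids]   # work on a copy
    assignment = [None] * len(points)
    for _ in range(max_iters):
        new_assignment = [closest(p, centroids) for p in points]
        if new_assignment == assignment:        # "Points moved?" -> no, so stop
            break
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:                         # assumption: empty cluster keeps its centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment
```

For example, `kmeans([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 0), (10, 0)])` settles after one recalculation, with the two centroids at the midpoints of the left and right pairs.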
K-Means algorithm (parallel): Initialize -> Partition work -> Assign points to centroids -> Compute partial sums -> Recalculate centroids -> Points moved?
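The partial-sums step is what makes the algorithm easy to parallelize: each worker only has to report, per centroid, the coordinate sum and the count of its own points. A hedged sketch follows; `partial_sums` and `recalculate` are hypothetical names, and the convergence check is omitted for brevity.

```python
import math

def partial_sums(partition, centroids):
    """Worker step: per-centroid coordinate sums and counts for one partition."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        counts[i] += 1
        for d in range(dims):
            sums[i][d] += p[d]
    return sums, counts

def recalculate(results, centroids):
    """Master step: merge every worker's (sums, counts) into new centroids."""
    new = []
    for i, old in enumerate(centroids):
        total = sum(counts[i] for _, counts in results)
        if total == 0:
            new.append(tuple(old))   # assumption: empty cluster keeps its centroid
            continue
        new.append(tuple(sum(sums[i][d] for sums, _ in results) / total
                         for d in range(len(old))))
    return new
```

Because sums and counts are additive, merging the partial results gives the same centroids as averaging all points in one place, regardless of how the points were partitioned.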
Architecture: Initialize -> Partition work -> Assign points to centroids -> Compute partial sums -> Recalculate centroids -> Points moved? [Diagram: multiple workers (⋮); task message carries Centroids: {𝑐1, …, 𝑐𝑛}]
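One iteration of this master/worker flow can be simulated in a single process. The sketch below is illustrative only: `blob_storage`, `task_queue`, and `result_queue` are in-memory stand-ins for Azure blob and queue storage (no Azure SDK calls are used), the message layout is an assumption, and workers are run inline rather than on separate VM instances.

```python
import math
import queue

blob_storage = {}              # stand-in for blob storage: name -> list of points
task_queue = queue.Queue()     # stand-in for a cloud queue carrying small task messages
result_queue = queue.Queue()   # workers report partial sums here

def worker_step():
    """Take one task, fetch its partition by name, and report partial sums."""
    task = task_queue.get()
    centroids = task["centroids"]
    sums = [[0.0] * len(c) for c in centroids]
    counts = [0] * len(centroids)
    for p in blob_storage[task["partition"]]:
        i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        counts[i] += 1
        sums[i] = [s + x for s, x in zip(sums[i], p)]
    result_queue.put((sums, counts))

def master_iteration(partition_names, centroids):
    """Send one task per partition, then merge the returned partial sums."""
    for name in partition_names:
        # The message holds a pointer to the partition, not the points themselves.
        task_queue.put({"partition": name, "centroids": centroids})
    for _ in partition_names:          # in a real deployment, workers run elsewhere
        worker_step()
    results = [result_queue.get() for _ in partition_names]
    new_centroids = []
    for i, old in enumerate(centroids):
        n = sum(counts[i] for _, counts in results)
        new_centroids.append(
            tuple(sum(sums[i][d] for sums, _ in results) / n for d in range(len(old)))
            if n else tuple(old))      # assumption: empty cluster keeps its centroid
    return new_centroids
```

The point of the design is that task messages stay small: only a partition name and the current centroids travel through the queue, while the bulk data stays in blob storage.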
Data locality: Single-queue pattern is naturally fault-tolerant. Problem: Not suitable for iterative computation. Multiple-queue pattern unlocks data locality. Tradeoff: Complicates fault tolerance.
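To see why one queue per worker helps an iterative job, consider a worker that caches each partition after its first download. The following sketch is a toy model under stated assumptions: the `Worker` class, the `downloads` counter, and the fixed mapping of partition `i` to queue `i % len(workers)` are all invented for illustration.

```python
from collections import deque

class Worker:
    def __init__(self):
        self.cache = {}        # partition name -> points kept in local memory
        self.downloads = 0     # how many times this worker read from blob storage

    def load(self, name, blob_storage):
        if name not in self.cache:
            self.cache[name] = blob_storage[name]   # simulated blob download
            self.downloads += 1
        return self.cache[name]

def run_multiple_queues(workers, partition_names, blob_storage, iterations):
    """One queue per worker: a given partition always goes to the same queue."""
    queues = [deque() for _ in workers]
    for _ in range(iterations):
        for i, name in enumerate(partition_names):
            queues[i % len(workers)].append(name)
        for w, q in zip(workers, queues):
            while q:
                w.load(q.popleft(), blob_storage)
    return sum(w.downloads for w in workers)
```

With this routing, every partition is downloaded once no matter how many iterations run. A single shared queue gives no such guarantee, since any worker may pick up any partition in any iteration.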
Handling failure: Initialize -> Partition work -> Assign points to centroids -> Compute partial sums -> Stall? (-> Buddy system) -> Recalculate centroids -> Points moved?
Buddy system: Buddy system provides distributed fault detection and recovery
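The detection idea can be sketched as follows: a task that has sat in a worker's queue longer than a timeout is treated as evidence that the worker has failed, and a buddy takes it over. This is a simplified in-memory model; `WorkerQueue`, `poll_buddies`, and the `TASK_TIMEOUT` value are hypothetical and not taken from the slides.

```python
import time

TASK_TIMEOUT = 30.0   # seconds; hypothetical value chosen for the example

class WorkerQueue:
    """Stand-in for one worker's task queue; remembers when each task arrived."""
    def __init__(self):
        self.tasks = []   # list of (enqueued_at, task)

    def put(self, task, now=None):
        self.tasks.append((time.time() if now is None else now, task))

    def take_stale(self, now, timeout=TASK_TIMEOUT):
        """Remove and return tasks that have waited longer than timeout."""
        stale = [t for at, t in self.tasks if now - at > timeout]
        self.tasks = [(at, t) for at, t in self.tasks if now - at <= timeout]
        return stale

def poll_buddies(buddy_queues, now, run_task):
    """Called by a live worker: take over stalled tasks from its buddies' queues."""
    recovered = []
    for q in buddy_queues:
        for task in q.take_stale(now):
            run_task(task)
            recovered.append(task)
    return recovered
```

No heartbeat channel is needed in this scheme: the queue itself is the record of what each worker was supposed to be doing.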
Buddy system: Spreading buddies across Azure fault domains provides increased resilience to simultaneous failure. [Diagram: fault domains 1, 2, 3]
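The slides do not spell out the allocation algorithm, so the following is only one plausible heuristic: interleave workers by fault domain, then cut the interleaved list into groups. The function name `buddy_groups` and the input mapping are assumptions for illustration.

```python
from collections import defaultdict
from itertools import zip_longest

def buddy_groups(fault_domain_of, group_size):
    """Group workers so that buddies come from different fault domains where possible."""
    by_domain = defaultdict(list)
    for worker, domain in sorted(fault_domain_of.items()):
        by_domain[domain].append(worker)
    # Interleave: take one worker from each domain in turn.
    interleaved = [w for row in zip_longest(*by_domain.values())
                   for w in row if w is not None]
    return [interleaved[i:i + group_size]
            for i in range(0, len(interleaved), group_size)]
```

When the group size does not exceed the number of fault domains and the domains are evenly populated, each group ends up with one member per domain, so a single domain failure removes at most one buddy from any group.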
Buddy system vs. cascaded failure detection: Cascaded failure detection reduces communication and improves resilience
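The communication claim can be checked by counting polling edges in the two topologies. In this sketch an edge `(a, b)` means "worker a polls worker b's queue"; the helper names are hypothetical, and only the traffic side of the comparison is modeled, not the resilience side.

```python
def clique_edges(groups):
    """Buddy groups: every worker polls every other worker in its own group."""
    return [(a, b) for g in groups for a in g for b in g if a != b]

def ring_edges(workers):
    """Cascaded detection: workers are totally ordered and each polls the next, circularly."""
    n = len(workers)
    if n < 2:
        return []          # a lone worker has nobody to poll
    return [(workers[i], workers[(i + 1) % n]) for i in range(n)]
```

For six workers in two groups of three, the cliques need 12 polling edges while the ring needs 6. In general the clique count grows with group size, whereas the ring always uses one edge per worker.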
Evaluation: Linear speedup with instance count
Evaluation: Sublinear speedup with instance size. Reason: I/O bandwidth doesn’t scale
Related work: Existing frameworks (MapReduce, Dryad, Spark, MPI, …) all support K-Means, but reimplement reliable communication. AzureBlast is built directly on cloud services, but its algorithm is not iterative.
Conclusions: CloudClustering shows that it's possible to build efficient, resilient applications using only the common cloud services. Multiple-queue pattern unlocks data locality. Buddy system provides fault tolerance.


Editor's Notes

  • #2: Thanks for the introduction. Undergrad at UC Berkeley. Presenting CloudClustering: data-intensive app on cloud. Questions inline.
  • #3: Joint work with Wei Lu, Jared Jackson, and Roger Barga of MSR. Glad to have him in the audience.
  • #4: MapReduce, Dryad, Twister, HaLoop, Spark: useful abstraction for programming the cloud. MapReduce, Dryad: acyclic data flows. Successors: iterative. Communication and storage: sockets, replication. EC2, Azure provide this. Can use only “primitives of the cloud”? -> Simpler.
  • #5: Fast: data locality and caching. Resilient to failure. Use only existing cloud services -- reliable building blocks.
  • #6: What we will do: Used Azure to build CloudClustering. Implements K-Means. CloudClustering architecture: building blocks -> efficient, fault-tolerant. Benchmarks of CloudClustering running on Azure. Related work.
  • #7: Windows/.NET based cloud offering. Run user code on VM instances. Instance fails -> automatically recovered. -> Central blob storage, 3x replicated. Limited by network. -> Queue storage: small messages.
  • #8: K-Means: widely used, well-understood
  • #9: Iterative clustering algorithm. Groups n points into k clusters, n >> k. -> Initial set of cluster centers -> centroids. -> Points assigned to closest centroids. -> Centroids move to average of their points. -> Repeats until convergence.
  • #10: Easy to parallelize. Master: initial centroids. -> Partition points, split across workers. -> On workers, same as before for partition. -> All returned: averages, move centroids. -> Iterate until convergence.
  • #11: How it’s implemented
  • #12: Master supervises pool of workers. -> Coordinate using queues. -> First, master loads input into blob storage. -> Generates initial centroids. -> Sends command to workers: ptr to partition, list of centroids. -> Workers do computation. -> Generate partial sums. -> All returned: master recalculates centroids. -> Next iteration.
  • #13: 2 modifications for efficiency and fault tolerance. First: data locality.
  • #14: Conventional: single queue. Fault tolerance easy. All state in messages. Failure -> throughput reduced. No good for iterative: no data locality. -> Multiple queues: unlocks data locality. But complicates fault tolerance.
  • #15: Efficiency with fault tolerance: buddy system
  • #16: Why is failure a problem? -> When worker fails -> Can’t report to master -> Stall -> Buddy system.
  • #17: Conventional: heartbeat over sockets. Design goals: use cloud services. Insight: worker is working on a task when it fails. Queue reliable -> have a record. Peer-to-peer: master assigns workers into buddy groups. -> Each polls queues of others. -> When one fails -> Tasks in queue for longer than timeout. -> Buddies dequeue tasks, complete them. Size: tradeoff between fault tolerance and network traffic. Large group -> resilient to simultaneous failure, but more polling traffic (charge per transaction).
  • #18: Fault domains: reduce likelihood of simultaneous failure. Fault domains: prob. of simultaneous failure. Allocation algorithm spreads across fault domains.
  • #19: Workers -> nodes in graph. Edges -> polling queue of another instance. Buddy system: disconnected cliques. -> Alternative: cascaded failure detection. Total ordering of workers. Each polls the next, circular. Less traffic, better resilience. Future work.
  • #20: Performance
  • #21: Linear speedup as increase number of instances. 16 instances and 1 GB data.
  • #22: Sublinear speedup with faster machines. Because I/O bandwidth doesn’t always improve. Use multiple threads, but I/O is bottleneck.
  • #23: Related work
  • #24: All frameworks support K-Means. MapReduce, Dryad: no iteration, launch from driver. Implement communication using sockets. AzureBlast: NCBI-BLAST genetic tool on cloud services. Not iterative -> different challenges.
  • #25: Can build efficient, resilient with only cloud services. Helpful patterns: multiple queues, buddy system.