On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
Marat Zhanikeev
maratishe@gmail.com
maratishe.github.io
Tokyo Univ. of Science
WebDB Forum 2017 @ Ochanomizu University
PDF → bit.do/170920
Background on Hadoop
• Hadoop performance measurement
◦ creators on performance limits [09]
◦ superlinear effect [08]
◦ various benchmarks on Hadoop vs Spark [07]
◦ inconsistencies in measurements [11]
• Hadoop/MapReduce optimization in [14] and many other papers
• the "Do We (actually) Need Hadoop?" argument in [10] and a few recent papers
[09] K.Shvachko+0 "HDFS scalability: the limits to growth" Usenix Login (2010)
[08] N.Gunther+2 "Hadoop Superlinear Scalability" ACM Queue (2015)
[07] J.Shi+6 "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics" Very Large Data Bases (2015)
[11] M.Xia+3 "Performance Inconsistency in Large Scale Data Processing Clusters" 10th USENIX ICAC (2013)
[14] A.Rasooli+1 "COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop Systems" Future Gen.Comp.Sys. (2014)
[10] A.Rowstron+1 "Nobody ever got fired for using Hadoop on a cluster" 1st HotCDP (2012)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 2/12
Modeling Hadoop Bottlenecks
[Figure: bottleneck model for Big Data Processing versus HPC, Simulators, Modeling (small data). Components: Network (NW), Bulk Storage (BS), Shared Memory (SM), either on-chip (hSM) or RAM-based (sSM), Core, Output. Axes: number of parallel accesses versus ability to isolate; the bottleneck is drawn as pipe width.]
Hadoop’s Answer: Rack Awareness
[Figure: rack topology. Physical view: Datanodes behind Rack Switches, joined by a Core Switch, with Clients attached. Logical view: a Client, its Own Rack Switch, Other Rack Switches, and the Datanodes.]
• official Hadoop feature (not a bug) [12]
• somewhat dynamic: goes off-rack when local nodes have too many jobs
• sadly, rack affiliation is configured manually (much potential here for research on virtual network coordinates such as Meridian, Vivaldi...)
[12] "Hadoop: Rack Awareness" https://hadoop.apache.org (2017)
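The manual configuration mentioned above is usually done with a topology script: Hadoop can be pointed at an executable through the `net.topology.script.file.name` property, passes it host names or IP addresses as arguments, and reads back one rack path per argument. The following is a minimal sketch of such a script, not the configuration used in this talk; the addresses, the rack paths, and the `resolve` helper are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical rack-topology script for Hadoop rack awareness."""
import sys

# Hypothetical, hand-maintained mapping: this table is the manual step.
RACK_OF = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Fallback for hosts that are missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per input host, in the same order."""
    return [RACK_OF.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per line, one line per argument.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

Replacing the hand-written table with coordinates estimated from latency measurements is where the research potential noted above would come in.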
Hadoop vs Bigdata Replay Method
• basic idea similar to [10], but uses circuits [02] to transfer shards and multicore [01] to process them in parallel
[Figure: Hadoop versus Replay architecture. Hadoop side: Name Node / Name Server(s), many Storage Nodes (shards) holding file A, file B, file C, …, a Hadoop Space with a Manager and many Hadoop/MapReduce jobs (your code), and a Client Machine with the Hadoop Client and Your Code; actions: Start, Use, Deploy, Find, Read/parse. Replay side: many Storage Nodes (shards) with Time-Aware Sub-Store(s), a Manager, a Client Machine with the Client and Your Sketcher, and a Replay Node running Multicore Replay; actions: Start, Use, Schedule.]
[10] A.Rowstron+1 "Nobody ever got fired for using Hadoop on a cluster" 1st HotCDP (2012)
[02] myself+0 "Circuit Emulation for Big Data Transfers in Clouds" Networking for Big Data, CRC (2015)
[01] myself+0 "Streaming Algorithms for Big Data Processing on Multicore" Big Data: Algorithms, Analytics, and Applications, CRC (2015)
Replay Environment is Highly Flexible!
• replay is time-aligned, so jobs can pick any spot on the timeline
• similar to Spark in going beyond the key-value datatype, but goes further: the full scope of streaming algorithms [01]
• massively multicore environments [04] with 100+ cores, dynamic re-packing of job batches, etc.
[Figure: replay at scale. (1) A Replay Manager moves a cursor (Now (replay)) along time-aligned big data in the time direction; each sketch has its own Start and End on the timeline and runs on one of Core 1 ... Core X. (2) One replay batch: a Manager reads/prepares data into a shared-memory buffer (head = Now, tail), each Job keeps its own position (pos) in the buffer, and a Controller manages jobs in realtime (Kill, Report).]
[01] myself+0 "Streaming Algorithms for Big Data Processing on Multicore" Big Data: Algorithms, Analytics, and Applications, CRC (2015)
[04] myself+0 "Volume and Irregularity Effects on Massively Multicore Packet Processors" APNOMS (2016)
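To make the time-aligned replay idea concrete, here is a minimal single-threaded sketch, not the platform's implementation. One cursor passes over time-ordered records once, and every job only sees the records inside its own start/end window. The record layout, the `Job` class, and the `count_items` sketch function are all assumptions made for illustration; the multicore and shared-memory parts are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# Hypothetical record layout: (timestamp, item).
Record = Tuple[int, str]


@dataclass
class Job:
    """One streaming job (sketch) with its own window on the timeline."""
    name: str
    start: int  # first timestamp the job wants (inclusive)
    end: int    # timestamp at which the job stops (exclusive)
    update: Callable[[dict, str], None]
    state: dict = field(default_factory=dict)


def count_items(state: dict, item: str) -> None:
    """A trivial stand-in for a streaming sketch: exact item counts."""
    state[item] = state.get(item, 0) + 1


def replay(records: Iterable[Record], jobs: List[Job]) -> None:
    """Single pass over time-ordered records; each job sees only its window."""
    for timestamp, item in records:
        for job in jobs:
            if job.start <= timestamp < job.end:
                job.update(job.state, item)


records = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a")]
jobs = [Job("early", 0, 3, count_items), Job("late", 2, 5, count_items)]
replay(records, jobs)
for job in jobs:
    print(job.name, job.state)
```

The point of the design is that the data is read once per replay pass regardless of how many jobs are attached, so adding a job does not add another read of the shard.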
Performance under hotspots
The Hotspot Distribution
[Figure: hotspot distributions for Class A to Class E. Horizontal axis: items in decreasing order (0 to 100); vertical axis: log(value) (0 to 2.8).]
• models Flash/Hotspot/Killerapp/Blackswan events using extreme variance in popularity
• generation method: stick-breaking process, Dirichlet distribution with parallel beta sources [05]
• final step: classify based on the number of hot/flash items
[05] myself+1 "Popularity-Based Modeling of Flash Events in Synthetic Packet Traces" CQ研 (2012)
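The generation step can be illustrated with a plain stick-breaking process. This is a sketch under stated assumptions and does not reproduce the method of [05]: it uses a single Beta(1, alpha) source, not parallel beta sources, and the `count_hot` rule (items above a multiple of the mean) is a hypothetical stand-in for the classification step.

```python
import random
from typing import List


def stick_breaking(n: int, alpha: float, rng: random.Random) -> List[float]:
    """Break a unit stick n-1 times; the remainder becomes the last piece."""
    remaining = 1.0
    weights = []
    for _ in range(n - 1):
        fraction = rng.betavariate(1.0, alpha)  # share of what is left
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    # Decreasing order, as on the horizontal axis of the figure.
    return sorted(weights, reverse=True)


def count_hot(weights: List[float], factor: float = 10.0) -> int:
    """Hypothetical hotness rule: items above `factor` times the mean weight."""
    mean = sum(weights) / len(weights)
    return sum(1 for w in weights if w > factor * mean)


rng = random.Random(42)
popularity = stick_breaking(100, alpha=5.0, rng=rng)
print(len(popularity), count_hot(popularity))
```

Smaller values of `alpha` concentrate more of the stick in the first few breaks, which gives fewer but hotter items.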
The Binary ”Till Contention” Metric
• not common, but a very realistic way to model performance under load
• note: even more applicable under hotspot-y input
[Figure: a Client and Racks of Data Shards separated by a rack Border (switch); a Volume axis marks the threshold between contention-free and contention-ful operation.]
• example: server response time as a function of load can be expressed as:
\[ T = \frac{1}{2}\left[ (L - n) + \sqrt{(L - n)^2 + \frac{k}{1 - L}} \right] \]
• ...where T is response time, L is load, and k is the knee = contention point!
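A binary metric of this kind can be sketched as a simple threshold crossing. The definition used here is an assumption for illustration, not the talk's exact procedure: given the volume observed at each time step, report the first step at which it exceeds the threshold. The `time_till_contention` name and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def time_till_contention(volumes: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first step whose volume exceeds the threshold.

    Returns None when the run stays contention-free throughout.
    """
    for step, volume in enumerate(volumes):
        if volume > threshold:
            return step
    return None


per_step_volume = [1.0, 2.0, 5.0, 3.0]
print(time_till_contention(per_step_volume, threshold=4.0))
print(time_till_contention(per_step_volume, threshold=10.0))
```

Everything before the crossing counts as contention-free and everything after as contention-ful, which is what makes the metric binary.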
Performance Models
• shard size is S, and r is the ratio of in-job traffic to shard size
◦ so, Hadoop jobs generate rS versus always strictly S under Replay
• contention threshold is C (covers contention and/or capacity)
• list of shard hotness (popularity) \(\{h_1, h_2, h_3, \ldots, h_n\}\) and sizes \(\{S_1, S_2, S_3, \ldots, S_n\}\)
• then we have (job/traffic) volume for Hadoop:
\[ V_{\mathrm{hadoop}} = \sum_{i=1}^{n} r\, h_i S_i \]
• ... and for Replay method:
\[ V_{\mathrm{replay}} = \sum_{i=1}^{n} S_i \qquad (1) \]
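The two volume formulas can be evaluated directly. The sketch below is a worked example with made-up numbers: the values of r, the hotness list, the shard sizes, and the threshold C are hypothetical, and the function names are introduced here for illustration only.

```python
from typing import Sequence


def hadoop_volume(r: float, hotness: Sequence[float], sizes: Sequence[float]) -> float:
    """V_hadoop = sum over shards of r * h_i * S_i."""
    if len(hotness) != len(sizes):
        raise ValueError("hotness and sizes must have the same length")
    return sum(r * h * s for h, s in zip(hotness, sizes))


def replay_volume(sizes: Sequence[float]) -> float:
    """V_replay = sum over shards of S_i (independent of hotness)."""
    return sum(sizes)


# Hypothetical inputs: one hot shard (h = 10) and two cold ones.
r = 0.5
hotness = [10, 1, 1]
sizes = [100, 100, 100]
C = 400  # hypothetical contention threshold

v_hadoop = hadoop_volume(r, hotness, sizes)  # 0.5 * (1000 + 100 + 100) = 600.0
v_replay = replay_volume(sizes)              # 100 + 100 + 100 = 300
print(v_hadoop > C, v_replay > C)
```

With these numbers the single hot shard alone pushes the Hadoop volume past C, while the Replay volume stays below it because it does not depend on hotness.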
Results
[Figure: three panels, for replay period (step) 10, 50, and 200. Each panel compares hadoop and replay for classes A to E at values 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, and 0.5, over a scale from 10 to 10000. Vertical axis: log(1 + time till contention).]
That’s all, thank you ...