Replayable BigData for Multicore Processing and Statistically Rigid Sketching
Hadoop/MapReduce Problems 
M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 2/23 
Hadoop/MapReduce Problems
- upper limit on throughput, both in HDFS and MapReduce jobs [06] [09]
- one machine is not so bad anymore: good RAM, multicore [09]
- MapReduce is key-value hashes only, very restrictive
- MapReduce performs badly under heterogeneous workloads
- Small file problem is hard to solve in Hadoop [10]
- ...
[06] K.Shvachko, HDFS Scalability: the Limits to Growth, the Magazine of USENIX, vol.35, no.2 (2012)
[09] A.Rowstron+4, Nobody ever got fired for using Hadoop on a cluster, 1st HTCDP (2012)
Hadoop/MapReduce Upgrades
- solutions for heterogeneous jobs [17]
- R for statistical processing on top of HDFS shards [18]
- searchable shards -- HBase + Lucene [19]
- ... not enough
[17] A.Rasooli+1, COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop..., McMaster (2013)
[18] S.Das+5, Ricardo: Integrating R and Hadoop, SIGMOD (2010)
[19] X.Gao+2, Experimenting with Lucene Index on HBase in an HPC Environment, 1st HPCDB (2012)
Example : Superspreaders 
Example Problem : Superspreaders
A Superspreader...
... is a one-to-many (o2m) traffic artifact where one host tries to contact many other hosts within a short timespan
- the reverse of a Superspreader is a Flash Crowd -- same algorithm
- many-to-many (m2m, groups) is even more complex
- a known problem [16]
QUIZ
How to detect superspreaders using MapReduce? Any ideas?
[16] S.Venkataraman+3, New Streaming Algorithms for Fast Detection of Superspreaders, NDSS (2005)
Superspreaders: (1) raw packets
- time, sip, sport, dip, dport, psize, protocol -- the usual packet tuple
- convert into text for MapReduce jobs
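As a rough illustration of the text conversion, the sketch below parses one text line of the packet tuple into a typed record. The comma-separated layout, the field types and units, and the names `Packet` and `parse_packet` are assumptions made for this example; the slides only list the field names.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """One record of the packet tuple (field order as listed on the slide)."""
    time: float    # capture timestamp (assumed to be in seconds)
    sip: str       # source IP address
    sport: int     # source port
    dip: str       # destination IP address
    dport: int     # destination port
    psize: int     # packet size (assumed to be in bytes)
    protocol: str  # e.g. "tcp" or "udp"


def parse_packet(line: str) -> Packet:
    """Parse one comma-separated text line (assumed layout) into a Packet."""
    time, sip, sport, dip, dport, psize, protocol = line.strip().split(",")
    return Packet(float(time), sip, int(sport), dip, int(dport), int(psize), protocol)


pkt = parse_packet("1415180000.25,10.0.0.1,5050,10.0.0.9,80,1500,tcp")
```

Only `sip`, `dip`, and `time` matter for superspreader detection; the remaining fields are carried along unchanged.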
Superspreaders: (2) 1st MapReduce
Step 1
... is to extract unique sip-dip pairs
- 3rd column is the count (word counting was taken as the basis)
- ordered by sip
- takes time because it needs to process all the data
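Step 1 can be sketched in plain Python as a map phase and a reduce phase. This is an in-memory stand-in for a Hadoop job, not the code used in the demo; the comma-separated input layout and the names `map_pairs` and `reduce_pairs` are assumptions for illustration.

```python
from collections import Counter


def map_pairs(lines):
    """Map: emit ((sip, dip), 1) for every packet line."""
    for line in lines:
        # Assumed layout: time,sip,sport,dip,dport,psize,protocol
        fields = line.strip().split(",")
        yield (fields[1], fields[3]), 1


def reduce_pairs(mapped):
    """Reduce: sum the ones per (sip, dip) key, as in word counting."""
    counts = Counter()
    for key, one in mapped:
        counts[key] += one
    # Rows are (sip, dip, count), ordered by sip (then by dip).
    return [(sip, dip, n) for (sip, dip), n in sorted(counts.items())]


lines = [
    "1.0,10.0.0.1,5050,10.0.0.9,80,1500,tcp",
    "1.1,10.0.0.1,5051,10.0.0.9,80,400,tcp",
    "1.2,10.0.0.1,5052,10.0.0.8,80,60,tcp",
    "1.3,10.0.0.2,6000,10.0.0.9,53,80,udp",
]
pair_rows = reduce_pairs(map_pairs(lines))
```

Every input line passes through the map phase, which is why this job dominates the running time.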
Superspreaders: (3) 2nd MapReduce
Step 2
... is to count how often each sip occurs among the unique sip-dip pairs, i.e. the number of distinct dips per sip
- based on the output of the 1st job
- faster because the data is relatively small
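Step 2 can be sketched the same way. The input is assumed to be rows of (sip, dip, count) with one row per unique pair, which is the output format described for the 1st job; the function name `count_fanout` is hypothetical.

```python
from collections import Counter


def count_fanout(pair_rows):
    """Count how many unique sip-dip pairs each sip appears in."""
    fanout = Counter()
    for sip, _dip, _count in pair_rows:
        fanout[sip] += 1  # one row per unique pair, so +1 per distinct dip
    return fanout.most_common()  # largest fan-out first


pair_rows = [
    ("10.0.0.1", "10.0.0.7", 3),
    ("10.0.0.1", "10.0.0.8", 1),
    ("10.0.0.1", "10.0.0.9", 2),
    ("10.0.0.2", "10.0.0.9", 1),
]
ranking = count_fanout(pair_rows)
```

Hosts at the top of `ranking` are the superspreader candidates, but note that the per-pair packet count and all timing information are ignored at this point.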
Superspreaders: Problem NOT Solved
- no time/sequence in MapReduce, so no way to process data in a time window
- no way to know what to discard (tail, small counts) midway, until all data is aggregated -- inefficient for large datasets
- ...
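For contrast, the sketch below shows what time-window processing looks like once packets arrive in time order. It is an exact sliding-window counter written for illustration only, not a space-efficient sketch and not the algorithm of [16]; the function name, the `(time, sip, dip)` input shape, and the `window` and `threshold` parameters are assumptions.

```python
from collections import defaultdict, deque


def superspreaders(packets, window, threshold):
    """Yield (time, sip, fanout) whenever sip has contacted at least
    `threshold` distinct dips within the last `window` time units.

    `packets` must be an iterable of (time, sip, dip) in non-decreasing
    time order, and `window` must be positive.
    """
    recent = deque()  # packets currently inside the window, oldest first
    refs = defaultdict(lambda: defaultdict(int))  # sip -> dip -> packets in window
    for time, sip, dip in packets:
        recent.append((time, sip, dip))
        refs[sip][dip] += 1
        # Expire packets that have fallen out of the window.
        while recent[0][0] <= time - window:
            _, old_sip, old_dip = recent.popleft()
            refs[old_sip][old_dip] -= 1
            if refs[old_sip][old_dip] == 0:
                del refs[old_sip][old_dip]
            if not refs[old_sip]:
                del refs[old_sip]
        if len(refs[sip]) >= threshold:
            yield time, sip, len(refs[sip])


packets = [(0, "a", "x"), (1, "a", "y"), (2, "a", "z"), (20, "a", "x"), (21, "b", "x")]
alerts = list(superspreaders(packets, window=10, threshold=3))
```

Old packets are discarded as soon as they leave the window, which is exactly the decision that cannot be made midway in the two-job MapReduce version.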
Proposal: TABID + Sketches 
TABID: Time-Aware BIg Data
- TABID: acronym of Time Aware BIg Data
- the grain is larger than a key-value store, but better than HDFS shards [19]
[Figure: positioning of TABID among storage options. Labels: KV Store; Hadoop (HDFS) and MapReduce; TABID Time-Aware Big Data (this demo); HDFS + Lucene Index.]
[19] X.Gao+2, Experimenting with Lucene Index on HBase in an HPC Environment, 1st HPCDB (2012)
Sketches (Data Streaming)
Data Streaming
... is a new concept where data is processed in realtime without storing the data
- a relatively recent but well-defined field, formulated mathematically/statistically [11]
- many known interesting algorithms [12]
- algorithms for specific problems [14] [15] [16]
- main features: space efficiency, statistical rigidity (information theory), speed
[11] S.Muthukrishnan, Data Streams: Algorithms and Applications, Foundations and Trends... (2005)
[12] M.Sung+3, Scalable and Efficient Data Streaming Algorithms for Detecting ...., ICDEW (2006)
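To make "space efficiency" concrete, here is a minimal Count-Min sketch, a well-known streaming structure that estimates per-item counts in fixed memory. It is not taken from the slides or from the cited papers; the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the hash family are choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator: never underestimates a count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]  # depth * width counters

    def _columns(self, item: str):
        """Yield (row, column) for the item, one independent hash per row."""
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._columns(item))


cms = CountMinSketch()
for _ in range(5):
    cms.add("10.0.0.1")
cms.add("10.0.0.2")
```

Memory stays at `width * depth` counters no matter how many items pass through, at the price of a bounded overestimate.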
Hadoop/MapReduce
Hadoop is...
...a platform where your software meets with data shards
[Figure: Hadoop architecture. Labels: Client Machine (You, Your Code, Hadoop Client; Deploy, Start, Use); Manager; Name Server(s); many instances of One Physical Machine (1 shard) with Hadoop Space (file A, file B, file C, …; Read/parse, Find) and Hadoop Job (your code).]
- shards are distributed across the cluster
- nameserver is the bottleneck
- fairness problems arise when heavy and light jobs run together on the same shard
TABID/Sketches
TABID is...
...an API that helps your jobs get access to data in its natural time order
[Figure: TABID architecture. Labels: One Physical Machine (1 shard); Timeline; Sub-Store; Store; TABID Node; TABID Manager; Schedule; Multicore Replay; Client Machine (You, Your Sketcher, TABID Client; Start, Use).]
- data shards are downloaded and replayed
- jobs run on multicore
- jobs can use any data structure (well beyond key-value)
- jobs are data streaming sketches -- can be selected from a library
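The replay idea can be sketched as follows: several time-sorted shards are merged into one time-ordered stream and every record is handed to every sketcher. This is not the TABID API, which the slides do not spell out; `replay`, `FanoutSketcher`, and the `(time, sip, dip)` record shape are hypothetical names and shapes used for illustration, and the loop is single-threaded rather than multicore.

```python
import heapq


def replay(shards, sketchers):
    """Merge time-sorted shards and feed each record to each sketcher."""
    for record in heapq.merge(*shards, key=lambda r: r[0]):
        for sketcher in sketchers:
            sketcher(record)


class FanoutSketcher:
    """A job with its own data structure: a set of dips per sip."""

    def __init__(self):
        self.dests = {}

    def __call__(self, record):
        _time, sip, dip = record
        self.dests.setdefault(sip, set()).add(dip)


shard_a = [(1, "10.0.0.1", "10.0.0.7"), (4, "10.0.0.1", "10.0.0.8")]
shard_b = [(2, "10.0.0.2", "10.0.0.9"), (3, "10.0.0.1", "10.0.0.9")]
fanout = FanoutSketcher()
seen = []
replay([shard_a, shard_b], [fanout, seen.append])
```

Each sketcher sees the records in time order even though the shards were stored separately, so window-based logic becomes possible.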
TABID: BigData Replay 
[Figure: BigData replay diagram. Labels: Core 1, ..., Core X; One Sketch (x3), each with Start and End; TABID Manager; Read/prepare; Now (replay); Time-Aligned Big Data; Cursor; Time Direction; Shared Memory.]
TABID: Lockfree Shared Memory
- lockfree shared memory is a new branch of parallel processing
- locks are bad for multicore (memory fences [20])
- a generic lockfree design in [05] and a software implementation in [23]
- some attempts to apply multicore to MapReduce [21] [22]
[20] M.Aldinucci+2, FastFlow: Efficient Parallel Streaming Applications on Multi-core, Universita di Pisa (2009)
[23] current, MCoreMemory project page, https://github.com/maratishe/mcorememory (myself)
Wrapup (Problem→Solution)
1. MapReduce is about counting, no rich datatypes
   - SOLVED → any datatype, stored as JSON
2. MapReduce has no solution for heterogeneous jobs
   - SOLVED → TABID optimizes jobs-to-core mapping (bin packing)
3. MapReduce has an accountability problem because clients make their own jobs
   - SOLVED → a library of sketches based on known data streaming algorithms
4. ... many smaller solutions
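The jobs-to-core mapping in item 2 can be illustrated with a simple greedy heuristic: sort jobs by cost and always give the next job to the least-loaded core. The slides only say "bin packing" and do not name a heuristic, so this longest-first greedy rule, the function name `map_jobs_to_cores`, and the per-job cost numbers are assumptions.

```python
import heapq


def map_jobs_to_cores(job_costs, cores):
    """Greedy longest-first assignment of jobs to the least-loaded core.

    Returns (assignment, loads): jobs per core and total cost per core.
    """
    heap = [(0, core) for core in range(cores)]  # (current load, core id)
    assignment = {core: [] for core in range(cores)}
    for job, cost in sorted(job_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, core = heapq.heappop(heap)  # least-loaded core so far
        assignment[core].append(job)
        heapq.heappush(heap, (load + cost, core))
    loads = {core: load for load, core in heap}
    return assignment, loads


job_costs = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment, loads = map_jobs_to_cores(job_costs, cores=2)
```

The heuristic is fast but not optimal: here it ends with loads of 17 and 13, while a perfectly balanced split (15 and 15) exists.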
Hadoop vs TABID DEMO (today)
- working Hadoop and TABID platforms
- can play with various configurations on the spot
. Replayable BigData for Multicore 
Processing and Stat... Rigid Sketching 
.Marat Zhanikeev – maratishe@gmail.com – maratishe.github.io – http://bit.do/marat141105 1/1 
That’s all, thank you ... 
Performance under Distributions
- comparative efficiency (right) and job-to-core mapping efficiency (left)
- all under heterogeneous (variance) job distributions
[Figure, panel 1: Log of Sketchbyte Ratio (HADOOP/TABID) vs. Number of Sketches (0 to 1000). Curves are labelled by tuples (min lifespan, max lifespan, exponent): 10,1000,0; 10,1000,0.01; 10,1000,0.05; 10,1000,0.1; 10,1000,0.3; 10,1000,0.7. Annotation: "More longer lifespans".]
[Figure, panel 2: Log of Max TABID Overhead (per core) vs. Number of Sketches (0 to 1000). Curves are labelled by tuples (min overhead, max overhead, exponent): 100,10000,0; 100,10000,0.01; 100,10000,0.05; 100,10000,0.1; 100,10000,0.3; 100,10000,0.7. Annotation: "Mostly large overhead".]
Shared Memory : Logic
[Figure: state diagram of the shared-memory logic. Components: Manager, Sketch, Shared Memory (Ring buffer, Data Stream, Cursor Time, Sketch time). Labels include: IDLE; Global start; Add data; Cursor moved; Sketch started; Set Sketch Time; Process Data; Each item; Monitor cursor; Cursor = sketch time; All sketches caught up; Wait for all sketches; End of data; End of life.]
- cursor written by manager and read by jobs on cores
- only manager writes, cores only read -- lockfree design
- zero collisions guaranteed by API
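The single-writer idea can be illustrated with a small ring buffer in which the cursor is advanced only by the manager and each reader advances only its own position. This is a single-threaded model of the logic, not the MCoreMemory implementation: the class and method names are hypothetical, the per-reader positions are an assumption based on the "Set Sketch Time" label in the figure, and a real multicore version would need actual shared memory and memory-ordering guarantees.

```python
class ReplayRing:
    """Ring buffer with one writer per field, so no field is ever contended."""

    def __init__(self, size, readers):
        self.size = size
        self.slots = [None] * size
        self.cursor = 0                 # items published; written by the manager only
        self.positions = [0] * readers  # items consumed; entry i written by reader i only

    def publish(self, item):
        """Manager side: store the next item, or return False if the slowest
        reader has not yet freed the slot (the manager then waits and retries)."""
        if self.cursor - min(self.positions) >= self.size:
            return False
        self.slots[self.cursor % self.size] = item
        self.cursor += 1  # advance only after the slot has been written
        return True

    def poll(self, reader):
        """Reader side: return the next item, or None if caught up with the cursor."""
        pos = self.positions[reader]
        if pos == self.cursor:
            return None
        item = self.slots[pos % self.size]
        self.positions[reader] = pos + 1
        return item


ring = ReplayRing(size=2, readers=2)
ring.publish("p1")
ring.publish("p2")
full = ring.publish("p3")          # both slots still unread by the readers
first = (ring.poll(0), ring.poll(0))
still_full = ring.publish("p3")    # reader 1 has not caught up yet
second = ring.poll(1)
accepted = ring.publish("p3")      # reader 1 freed the oldest slot
```

Because no two parties ever write the same field, no lock is needed; the cost is that the manager must wait for the slowest sketch.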
Related Subjects
- rigid statistics in traffic analysis
  - QoS of user communities [01], dynamic network management, etc.
- isn't BigData replay a bottleneck?
  - not really, circuits for bulk transfer [02] or multisource aggregation [03] can support very high throughputs
  - MapReduce has several bottlenecks: namespace lookup, HDFS, etc. -- see [06] for details
- Why BigData Replay? -- it's a new ecosystem
  - bigdata hoarders announce replays, openly collect public jobs until a deadline, open outcome
  - one good solution for bigdata → opendata initiative
[01] myself+0, A holistic community-based architecture for measuring end-to-end QoS at data centres, IJCSE (2014)
[02] myself+0, Circuit Emulation for Big Data Transfers in Clouds, Networking for Big Data, Wiley (in print) (2015)
[03] myself+0, Multi-Source Stream Aggregation in the Cloud, Wiley (2014)
[06] K.Shvachko, HDFS Scalability: the Limits to Growth, the Magazine of USENIX, vol.35, no.2 (2012)
[01] myself+0 (2014), A holistic community-based architecture for measuring end-to-end QoS at data centres, IJCSE
[02] myself+0 (2015), Circuit Emulation for Big Data Transfers in Clouds, Networking for Big Data, Wiley (in print)
[03] myself+0 (2014), Multi-Source Stream Aggregation in the Cloud, Wiley
[04] myself+0 (2014), Optimizing Virtual Machine Migration for Energy-Efficient Clouds, IEICEJ
[05] myself+0 (2014), A lock-free shared memory design for high-throughput multicore packet traffic capture, IJNM
[06] K.Shvachko (2012), HDFS Scalability: the Limits to Growth, the Magazine of USENIX, vol.35, no.2
[07] Y.Chen+3 (2011), The Case for Evaluating MapReduce Performance Using Workload Suites, MASCOTS
[08] Z.Ren+4 (2012), Workload Characterization on a Production Hadoop Cluster..., IEEE Workload...
[09] A.Rowstron+4 (2012), Nobody ever got fired for using Hadoop on a cluster, 1st HTCDP
[10] (current), Small File Problem in Hadoop (blog), http://amilaparanawithana.blogspot.jp/2012
[11] S.Muthukrishnan (2005), Data Streams: Algorithms and Applications, Foundations and Trends...
[12] M.Sung+3 (2006), Scalable and Efficient Data Streaming Algorithms for Detecting ...., ICDEW
[13] Z.Bar-Yossef+2 (2002), ...streaming algorithms, with an application to counting triangles in graphs, ACM SODA
[14] M.Charikar+2 (2002), Finding frequent items in data streams, 29th International Colloquium on Automata...
[15] M.Datar+3 (2002), Maintaining stream statistics over sliding windows, SIAM
[16] S.Venkataraman+3 (2005), New Streaming Algorithms for Fast Detection of Superspreaders, NDSS
[17] A.Rasooli+1 (2013), COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop..., McMaster
[18] S.Das+5 (2010), Ricardo: Integrating R and Hadoop, SIGMOD
[19] X.Gao+2 (2012), Experimenting with Lucene Index on HBase in an HPC Environment, 1st HPCDB
[20] M.Aldinucci+2 (2009), FastFlow: Efficient Parallel Streaming Applications on Multi-core, Universita di Pisa
[21] R.Brightwell (2008), Workshop on Managed Many-Core Systems, 1st Workshop ...Many-Core Systems
[22] R.Chen+2 (2010), Tiled-MapReduce: Optimizing Resource Usages of Data-parallel... with Tiling, 19th PACT
[23] current (myself), MCoreMemory project page, https://github.com/maratishe/mcorememory

More Related Content

PDF
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
PDF
Parallel Implementation of K Means Clustering on CUDA
PDF
MapReduce and Hadoop
PPTX
MapReduce: A useful parallel tool that still has room for improvement
PPTX
Parallel K means clustering using CUDA
PDF
Solving Endgames in Large Imperfect-Information Games such as Poker
PPT
Building your own NSQL store
ODP
Big data nyu
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
Parallel Implementation of K Means Clustering on CUDA
MapReduce and Hadoop
MapReduce: A useful parallel tool that still has room for improvement
Parallel K means clustering using CUDA
Solving Endgames in Large Imperfect-Information Games such as Poker
Building your own NSQL store
Big data nyu

Viewers also liked (6)

PPTX
20160309 AzureDay 2016 - Azure Stream Analytics & Azure Machine Learning
PPTX
Microsoft Machine Learning Smackdown
PPTX
The Microsoft BigData Story
PPTX
AWS for the SQL Server Pro
PPTX
Machine Learning on the Microsoft Stack
PPTX
Personalized and Adaptive Math Learning: Recent Research and What It Means fo...
20160309 AzureDay 2016 - Azure Stream Analytics & Azure Machine Learning
Microsoft Machine Learning Smackdown
The Microsoft BigData Story
AWS for the SQL Server Pro
Machine Learning on the Microsoft Stack
Personalized and Adaptive Math Learning: Recent Research and What It Means fo...
Ad

Similar to Replayable BigData for Multicore Processing and Statistically Rigid Sketching (20)

PDF
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
PDF
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
PDF
Hadoop trainting in hyderabad@kelly technologies
PDF
Map Reduce along with Amazon EMR
PDF
Report Hadoop Map Reduce
PPT
Hadoop trainting-in-hyderabad@kelly technologies
PDF
What is Distributed Computing, Why we use Apache Spark
PPTX
Introduction to map reduce
PPT
Hadoop institutes-in-bangalore
PDF
Big Data & Hadoop. Simone Leo (CRS4)
ODP
Hadoop demo ppt
PDF
How Apache Spark fits into the Big Data landscape
PDF
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
PPT
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
PDF
A Survey on Big Data Analysis Techniques
PDF
Hadoop interview questions
PDF
Hadoop Mapreduce
PPTX
Introduction to Map Reduce
PPTX
Hadoop Interview Questions and Answers
PDF
E031201032036
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
Hadoop trainting in hyderabad@kelly technologies
Map Reduce along with Amazon EMR
Report Hadoop Map Reduce
Hadoop trainting-in-hyderabad@kelly technologies
What is Distributed Computing, Why we use Apache Spark
Introduction to map reduce
Hadoop institutes-in-bangalore
Big Data & Hadoop. Simone Leo (CRS4)
Hadoop demo ppt
How Apache Spark fits into the Big Data landscape
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
A Survey on Big Data Analysis Techniques
Hadoop interview questions
Hadoop Mapreduce
Introduction to Map Reduce
Hadoop Interview Questions and Answers
E031201032036
Ad

More from Tokyo University of Science (20)

PDF
A Method for Cloud-Assisted Secure Wireless Grouping of Client Devices at Net...
PDF
Ultrasound Relative Positioning for IoT Devices in Dense Wireless Spaces
PDF
Towards a Packet Traffic Genome Project as a Method for Realtime Sub-Flow Tra...
PDF
What if We Atomize Student Data and Apps and Put Them on Docker Containers?
PDF
Large-Scale Crowdsourcing by Vehicular Data Packets in a Sparse Roadside Infr...
PDF
Taking the Step from Software to Product Development \\ when teaching PBL at ...
PDF
Design and Implementation of a 3-Party Cloud-Backed Handshake for Secure Grou...
PDF
The Switchboard Optimization Problem and Heuristics for Cut-Through Networking
PDF
The Switchboard Traffic Engineering Problem for Mixed Contention/Cut-Through ...
PDF
Bulk-n-Pick Method for One-to-Many Data Transfer in Dense Wireless Spaces
PDF
Fog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
PDF
On a Hybrid Packets-and-Circuits Switching Logic
PDF
Image-Related Uses for Roadside Infrastructure \\ based on Wireless Beacons
PDF
Complexity Resolution Control for Context Based on Metromaps
PDF
The Declarative-Coordinated Model for Self-Optimization of Service Networks
PDF
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
PDF
3-Way Scripts as a Base Unit for Flexible Scale-Out Code
PDF
Towards Social Robotics on Smartphones with Simple XYZV Sensor Feedback
PDF
Back to Rings but not Tokens: Physical and Logical Designs for Distributed Fi...
PDF
Browser Visualization using PNGs Generated by HTML5 Workers on Multicore
A Method for Cloud-Assisted Secure Wireless Grouping of Client Devices at Net...
Ultrasound Relative Positioning for IoT Devices in Dense Wireless Spaces
Towards a Packet Traffic Genome Project as a Method for Realtime Sub-Flow Tra...
What if We Atomize Student Data and Apps and Put Them on Docker Containers?
Large-Scale Crowdsourcing by Vehicular Data Packets in a Sparse Roadside Infr...
Taking the Step from Software to Product Development \\ when teaching PBL at ...
Design and Implementation of a 3-Party Cloud-Backed Handshake for Secure Grou...
The Switchboard Optimization Problem and Heuristics for Cut-Through Networking
The Switchboard Traffic Engineering Problem for Mixed Contention/Cut-Through ...
Bulk-n-Pick Method for One-to-Many Data Transfer in Dense Wireless Spaces
Fog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
On a Hybrid Packets-and-Circuits Switching Logic
Image-Related Uses for Roadside Infrastructure \\ based on Wireless Beacons
Complexity Resolution Control for Context Based on Metromaps
The Declarative-Coordinated Model for Self-Optimization of Service Networks
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Base Unit for Flexible Scale-Out Code
Towards Social Robotics on Smartphones with Simple XYZV Sensor Feedback
Back to Rings but not Tokens: Physical and Logical Designs for Distributed Fi...
Browser Visualization using PNGs Generated by HTML5 Workers on Multicore

Recently uploaded (20)

PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
KodekX | Application Modernization Development
PPT
Teaching material agriculture food technology
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Spectroscopy.pptx food analysis technology
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Cloud computing and distributed systems.
PDF
Review of recent advances in non-invasive hemoglobin estimation
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
KodekX | Application Modernization Development
Teaching material agriculture food technology
Encapsulation_ Review paper, used for researhc scholars
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Programs and apps: productivity, graphics, security and other tools
Understanding_Digital_Forensics_Presentation.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Spectral efficient network and resource selection model in 5G networks
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
The Rise and Fall of 3GPP – Time for a Sabbatical?
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Spectroscopy.pptx food analysis technology
Unlocking AI with Model Context Protocol (MCP)
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Cloud computing and distributed systems.
Review of recent advances in non-invasive hemoglobin estimation

Replayable BigData for Multicore Processing and Statistically Rigid Sketching

  • 2. . Hadoop/MapReduce Problems M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 2/23 . 2/23
  • 3. . Hadoop/MapReduce Problems upper limit on throughput, both in HDFS and MapReduce jobs 06 09 one machine is not so bad anymore: good RAM, multicore 09 MapReduce is key-value hashes only, very restrictive MapReduce performs badly under heterogeneous workloads Small file problem is hard to solve in Hadoop 10 ... 06 K.Shvachko HDFS Scalability: the Limits to Growth the Magazine of USENIX, vol.35, no.2 (2012) 09 A.Rowstron+4 Nobody ever got fired for using Hadoop on a cluster 1st HTCDP (2012) M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 3/23 . 3/23
  • 4. . Hadoop/MapReduce Upgrades solutions for geterogeneous jobs 17 R for statistical processing on top of HDFS shards 18 searchable shards -- HBase + Lucene 19 .... not enough 17 A.Rasooli+1 COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop... McMaster (2013) 18 S.Das+5 Ricardo: Integrating R and Hadoop SIGMOD (2010) 19 X.Gao+2 Experimenting with Lucene Index on HBase in an HPC Environment 1st HPCDB (2012) M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 4/23 . 4/23
  • 5. . Example : Superspreaders M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 5/23 . 5/23
  • 6. . Example Problem : Superspreaders . A Superspreader... . ... is a one-to-many (o2m) traffic artifact where one host tries to contact many other hosts within short timespan . reverse of Superspreader is a Flash Crowd -- same algorithm many-to-many (m2m, groups) is even more complex a known problem 16 . QUIZ . .How to detect superspreaders using MapReduce? Any ideas? 16 S.Venkataraman+3 New Streaming Algorithms for Fast Detection of Superspreaders NDSS (2005) M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 6/23 . 6/23
  • 7. . Superspreaders: (1) raw packets time, sip, sport, dip, dport, psize, protocol -- the usual packet tuple convert into text for MapReduce jobs M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 7/23 . 7/23
  • 8. . Superspreaders: (2) 1st MapReduce . Step 1 . ....is to extract unique sip-dip pairs 3rd column is count (took word counting for basis) ordered by sip takes time because needs to process all the data M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 8/23 . 8/23
  • 9. . Superspreaders: (3) 2nd MapReduce . Step 2 . ... is to count unique sips in sip-dip pairs . based on the output of the 1st job faster because data is relatively small M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 9/23 . 9/23
  • 10. . Superspreaders: Problem NOT Solved no time/sequence in MapReduce, no way to process data in a time window do not know what to discard (tail, small counts) midway until all data is aggregated -- ineffective for large datasets ... M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 10/23 . 10/23
  • 11. . Proposal: TABID + Sketches M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 11/23 . 11/23
  • 12. . TABID: Time-Aware BIg Data TABID: acronym of Time Aware BIg Data the grain is larger then key-value store, but better than HDFS shards 19 KV Store Hadoop (HDFS) and MapReduce TABID Time-Aware Big Data (this demo) HDFS + Lucene Index 19 X.Gao+2 Experimenting with Lucene Index on HBase in an HPC Environment 1st HPCDB (2012) M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 12/23 . 12/23
  • 13. . Sketches (Data Streaming) . Data Streaming . ... is a new concept where data is processed in realtime without using any storage . a relatively recent but well defined and mathematically/statistically formulated 11 many known interesting algorithms 12 algorithms for specific problems 14 15 16 main features: space efficiency, statistical rigidity (information theory), speed 11 S.Muthukrishnan Data Streams: Algorithms and Applications Foundations and Trends... (2005) 12 M.Sung+3 Scalable and Efficient Data Streaming Algorithms for Detecting .... ICDEW (2006) M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 13/23 . 13/23
  • 14. . Hadoop/MapReduce . Hadoop is... . ...a platform where your software meets with data shards . One Physical Machine (1 shard) file A file B file C … Hadoop Space Read/parse Find Hadoop Job (your code) Hadoop Job (your code) Hadoop Job (your code) Hadoop Job (your code) many many Manager Name Server(s) Client Machine Hadoop Client Start Use Your Code You Deploy shards are distributed across the cluster nameserver is the bottleneck fairness problems are when heavy and light jobs run together at the same shard M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 14/23 . 14/23
  • 15. . TABID/Sketches . TABID is... . ...an API that helps your jobs get access to data in its natural time order . One Physical Machine (1 shard) STuimb-eSltinoere Store TABID Node TABID Manager Schedule Client Machine TABID Client Start Use Your Sketcher You Multicore Replay data shards are downloaded and replayed jobs run on multicore jobs can use any data structure (well beyond key-value) jobs are data streaming sketches -- can be selected from a library M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 15/23 . 15/23
  • 16. . TABID: BigData Replay Core X Core 1 Core 1 Read/prepare One SketcOhne Sketch One Sketch Start End End End TABID Manager Now(replay) …. Time-Aligned Big Data Cursor Time Direction Shared Memory Start M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 16/23 . 16/23
  • 17. . TABID: Lockfree Shared Memory lockfree shared memory is a new branch of parallel processing locks are bad for multicore (memory fences 20) a generic lockfree design in 05 and software implementation in 23 some attempts to apply multicore to MapReduce 21 22 20 M.Aldinucci+2 FastFlow: Efficient Parallel Streaming Applications on Multi-core Universita di Pisa (2009) 23 current MCoreMemory project page https://github.com/maratishe/mcorememory (myself) M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 17/23 . 17/23
  • 18. . Wrapup (Problem→Solution) 1. MapReduce is about counting, no rich datatypes SOLVED → any datatype, stored as JSON 2. MapReduce has no solution for heterogeneous jobs SOLVED → TABID optimizes jobs-to-core mapping (bin packing) 3. MapReduce has accountability problem because clients make their own jobs SOLVED → a library of sketches based on known data streaming algorithms 4. ... many smaller solutions M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 18/23 . 18/23
  • 19. . Hadoop vs Tabid DEMO (today) working Hadoop and TABID platforms can play with various configurations on the spot . Replayable BigData for Multicore Processing and Stat... Rigid Sketching .Marat Zhanikeev – maratishe@gmail.com – maratishe.github.io – http://bit.do/marat141105 1/1 M.Zhanikeev -- maratishe@gmail.com -- Replayable BigData for Multicore Processing and ... Sketching -- http://bit.do/marat141105 -- 19/23 . 19/23
  • 20. That’s all, thank you ...
  • 21. Performance under Distributions: comparative efficiency (right) and job-to-core mapping efficiency (left), all under heterogeneous (variance) job distributions. [Figure, two panels. One panel: x-axis Number of Sketches (0 to 1000), y-axis Log of Sketchbyte Ratio (HADOOP/TABID); curves labelled by tuples (min lifespan, max lifespan, exponent) from 10,1000,0 to 10,1000,0.7; annotation "More longer lifespans". Other panel: x-axis Number of Sketches (0 to 1000), y-axis Log of Max TABID Overhead (per core); curves labelled by tuples (min overhead, max overhead, exponent) from 100,10000,0 to 100,10000,0.7; annotation "Mostly large overhead".]
  • 22. Shared Memory: Logic. [Diagram; recoverable labels: Data Stream; Ring buffer; Shared Memory; Manager; Sketch; Monitor cursor; Process Data; Set Sketch Time; Add data; Sketch started; Cursor moved; Each item; Cursor Time; Sketch time; All sketches caught up; Wait for all sketches; IDLE; Global start; End of data; Cursor = sketch End time (end of life)] The cursor is written by the manager and read by jobs on cores; only the manager writes, cores only read: a lockfree design; zero collisions are guaranteed by the API
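The logic on this slide can be sketched as a single-threaded simulation. This is not the MCoreMemory API: the class names `ReplayBuffer` and `SketchReader`, the record layout `(timestamp, value)`, and the rule that the manager polls each reader's private position are all assumptions made for illustration. The point it shows is that every shared value has exactly one writer: the manager alone writes the slots and the cursor, and each sketch only advances its own read position.

```python
class ReplayBuffer:
    """Ring buffer written only by the manager; sketches only read from it."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.cursor = 0  # number of items published so far; manager-only writes

    def publish(self, item, reader_positions):
        """Add one item, or return False if the slowest reader would be overwritten."""
        if reader_positions and self.cursor - min(reader_positions) >= self.size:
            return False  # "wait for all sketches" before reusing a slot
        self.slots[self.cursor % self.size] = item
        self.cursor += 1  # advance the cursor only after the slot is filled
        return True


class SketchReader:
    """One sketch: consumes items whose timestamp falls in [start, end)."""

    def __init__(self, buffer, start, end):
        self.buffer = buffer
        self.start, self.end = start, end
        self.position = 0  # private read position; written by this reader only
        self.seen = []

    def catch_up(self):
        """Read everything published so far, keeping items within the lifetime."""
        while self.position < self.buffer.cursor:
            timestamp, value = self.buffer.slots[self.position % self.buffer.size]
            if self.start <= timestamp < self.end:
                self.seen.append(value)
            self.position += 1


def replay(records, buffer, readers):
    """Manager loop: publish time-ordered records, letting readers catch up when full."""
    for record in records:
        while not buffer.publish(record, [r.position for r in readers]):
            for reader in readers:
                reader.catch_up()
    for reader in readers:
        reader.catch_up()  # drain what is left at the end of data


if __name__ == "__main__":
    buf = ReplayBuffer(size=4)
    readers = [SketchReader(buf, 0, 5), SketchReader(buf, 3, 8)]
    replay([(t, f"x{t}") for t in range(10)], buf, readers)
    for reader in readers:
        print(reader.start, reader.end, reader.seen)
```

In a real multicore setting the readers would run on separate cores against actual shared memory and the writes would need the appropriate memory ordering; the simulation only demonstrates the ownership rule and the wait-for-all-sketches condition.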
  • 23. Related Subjects. Rigid statistics in traffic analysis: QoS of user communities [01], dynamic network management, etc. Isn't BigData replay a bottleneck? Not really: circuits for bulk transfer [02] or multisource aggregation [03] can support very high throughputs, whereas MapReduce has several bottlenecks (namespace lookup, HDFS, etc.; see [06] for details). Why BigData Replay? It's a new ecosystem: bigdata hoarders announce replays, openly collect public jobs until a deadline, open outcome; one good solution for the bigdata → opendata initiative. Footnotes: [01] myself+0, A holistic community-based architecture for measuring end-to-end QoS at data centres, IJCSE (2014); [02] myself+0, Circuit Emulation for Big Data Transfers in Clouds, Networking for Big Data, Wiley (in print) (2015); [03] myself+0, Multi-Source Stream Aggregation in the Cloud, Wiley (2014); [06] K.Shvachko, HDFS Scalability: the Limits to Growth, the Magazine of USENIX, vol.35, no.2 (2012)
  • 24. References: [01] myself+0 (2014) A holistic community-based architecture for measuring end-to-end QoS at data centres, IJCSE; [02] myself+0 (2015) Circuit Emulation for Big Data Transfers in Clouds, Networking for Big Data, Wiley (in print); [03] myself+0 (2014) Multi-Source Stream Aggregation in the Cloud, Wiley; [04] myself+0 (2014) Optimizing Virtual Machine Migration for Energy-Efficient Clouds, IEICEJ; [05] myself+0 (2014) A lock-free shared memory design for high-throughput multicore packet traffic capture, IJNM
  • 25. References: [06] K.Shvachko (2012) HDFS Scalability: the Limits to Growth, the Magazine of USENIX, vol.35, no.2; [07] Y.Chen+3 (2011) The Case for Evaluating MapReduce Performance Using Workload Suites, MASCOTS; [08] Z.Ren+4 (2012) Workload Characterization on a Production Hadoop Cluster..., IEEE Workload...; [09] A.Rowstron+4 (2012) Nobody ever got fired for using Hadoop on a cluster, 1st HTCDP; [10] (current) Small File Problem in Hadoop (blog) http://amilaparanawithana.blogspot.jp/2012
  • 26. References: [11] S.Muthukrishnan (2005) Data Streams: Algorithms and Applications, Foundations and Trends...; [12] M.Sung+3 (2006) Scalable and Efficient Data Streaming Algorithms for Detecting ...., ICDEW; [13] Z.Bar-Yossef+2 (2002) ...streaming algorithms, with an application to counting triangles in graphs, ACM SODA; [14] M.Charikar+2 (2002) Finding frequent items in data streams, 29th International Colloquium on Automata...; [15] M.Datar+3 (2002) Maintaining stream statistics over sliding windows, SIAM; [16] S.Venkataraman+3 (2005) New Streaming Algorithms for Fast Detection of Superspreaders, NDSS
  • 27. References: [17] A.Rasooli+1 (2013) COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop..., McMaster; [18] S.Das+5 (2010) Ricardo: Integrating R and Hadoop, SIGMOD; [19] X.Gao+2 (2012) Experimenting with Lucene Index on HBase in an HPC Environment, 1st HPCDB; [20] M.Aldinucci+2 (2009) FastFlow: Efficient Parallel Streaming Applications on Multi-core, Universita di Pisa; [21] R.Brightwell (2008) Workshop on Managed Many-Core Systems, 1st Workshop ...Many-Core Systems
  • 28. References: [22] R.Chen+2 (2010) Tiled-MapReduce: Optimizing Resource Usages of Data-parallel... with Tiling, 19th PACT; [23] current (myself) MCoreMemory project page https://github.com/maratishe/mcorememory