© 2017 MapR Technologies 1
Update on t-digest
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email: tdunning@mapr.com, tdunning@apache.org
Twitter @ted_dunning
Who We Are
• MapR Technologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!
Basic Outline
• Why we should measure distributions
• Basic Ideas
• How t-digest works
• Recent results
• Applications
Why Is This Practically Important
• The novice came to the master and said “something is broken”
• The master replied “What has changed?”
• And the student was enlightened
Finding change is key
but what kind?
Last Night’s Latencies
• These are ping latencies from my hotel
• Looks pretty good, right?
• But what about longer term?
208.302
198.571
185.099
191.258
201.392
214.738
197.389
187.749
201.693
186.762
185.296
186.390
183.960
188.060
190.763
> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965
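As a quick check, here is a small sketch that recomputes the summary statistics for only the fifteen values listed above. Their mean comes out near 193.8 and their spread is far below the reported standard deviation of 71.4, which suggests (an assumption, since the slide does not say so) that `y$t[i]` covers a longer sample that includes some much slower pings. The units are assumed to be milliseconds.

```python
import statistics

# The fifteen ping latencies listed on the slide (units assumed to be ms).
latencies = [
    208.302, 198.571, 185.099, 191.258, 201.392,
    214.738, 197.389, 187.749, 201.693, 186.762,
    185.296, 186.390, 183.960, 188.060, 190.763,
]

mean_shown = statistics.mean(latencies)
sd_shown = statistics.stdev(latencies)  # sample standard deviation, like R's sd()

print(f"mean of listed values: {mean_shown:.2f}")
print(f"sd of listed values:   {sd_shown:.2f}")
```

The mean barely moves between the short and the long sample, while the standard deviation changes drastically. That is the first hint that a single summary number hides the tail.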
Not So Fast …
This is long-tailed land
You have to know the distribution of values
A single number is simply not enough
What We Really Need Here
• I want to be able to compute the distribution from any time period
• From any subset of measurements
• With lots of keys and filters
• And not a lot of space
• Basically, any OLAP kind of query
select distribution(x) from … where … group by y,z
Idea 0 – Pre-defined bins
• So let’s assume we have bins
– Upper, lower bound, constant width
• Get a measurement, pick a bin, increment count
• Works great if you know the data
– And you have limited dynamic range (too many bins)
– And the distribution is fixed
• Useful, but not general enough
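A minimal sketch of pre-defined bins, to make the limitation concrete. The function name `fixed_bin_counts` and the chosen range and bin count are illustrative assumptions, not part of any library.

```python
def fixed_bin_counts(values, lower, upper, n_bins):
    """Count values into n_bins equal-width bins covering [lower, upper)."""
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    out_of_range = 0
    for x in values:
        if lower <= x < upper:
            # Guard against rounding pushing the index to n_bins.
            i = min(int((x - lower) / width), n_bins - 1)
            counts[i] += 1
        else:
            out_of_range += 1  # the pre-defined bounds cannot hold this value
    return counts, out_of_range


# Ten bins of width 100 on [0, 1000).
counts, missed = fixed_bin_counts([1.0, 1.1, 250.0, 1000.0, 1100.0], 0.0, 1000.0, 10)
print(counts, missed)
```

With these bounds, 1.0 and 1.1 fall into the same bin and cannot be told apart, while 1000.0 and 1100.0 fall outside the range entirely. Covering both ends with enough resolution would need far more bins.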
Idea 1 – Exponential Bins
• Suppose we want relative accuracy in measurement space
• Latencies are positive and only matter within a few percent
– 1.1 ms versus 1.0 ms
– 1100 ms versus 1000 ms
• We can cheat by using floating point representations
– Compute bin using magic
– Count
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
– is typical
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
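The floating-point trick can be sketched as follows. This is an illustration of the general idea under stated assumptions, not the library's `FloatHistogram` code: it assumes IEEE 754 doubles and positive inputs, and the parameter `bits_kept` (how many leading mantissa bits to keep) is a hypothetical name. Keeping the exponent plus `bits_kept` mantissa bits splits every power-of-2 range into `2**bits_kept` equal steps.

```python
import struct
from collections import Counter


def float_bin_index(x, bits_kept=5):
    """Bin index taken from the bit pattern of a positive IEEE 754 double.

    The exponent selects the power-of-2 sub-range; the top `bits_kept`
    mantissa bits select the evenly spaced step inside that sub-range.
    """
    if x <= 0.0:
        raise ValueError("this sketch only handles positive values")
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    # A double has 52 mantissa bits; drop all but the leading bits_kept.
    return raw >> (52 - bits_kept)


def float_histogram(values, bits_kept=5):
    """Count values per bin; bins are created lazily in a Counter."""
    return Counter(float_bin_index(v, bits_kept) for v in values)


hist = float_histogram([1.0, 1.1, 1000.0, 1100.0])
print(len(hist), "distinct bins")
```

Because bin width scales with the value, 1.0 and 1.1 land a few bins apart, and so do 1000.0 and 1100.0. The index costs one reinterpretation and one shift, with no logarithm.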
Fixed Size Bins
Approximate Exponential Bins
Non-linear bins are better
(sometimes)
Still not general enough
Idea 2 – Fully Adaptive Bins
• First intuition – in general, we want accuracy in terms of percentile
• Second intuition – we want better accuracy at extreme quantiles
– 50%-ile versus 50.1%-ile?
– What does 0.1% error even mean for the 99.99th percentile?
• We need bins with small counts near the edges
First 1% of data shown.
Left graph has 100 x 100 sample bins.
Right graph has ~130 bins, variable size
The Basic t-digest
• Take a bunch of data
• Sort it
• Group into bins
– But make the bins smaller at the beginning and end
• Remember the centroid and count of each bin
• That’s a t-digest
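The recipe above can be sketched in a few lines. The slides do not state how "smaller at the beginning and end" is enforced, so this sketch assumes the arcsine scale function k(q) = δ/(2π)·asin(2q − 1) and closes a bin once its size in k would exceed 1. The names `k_scale`, `build_digest` and the default `delta` are illustrative, and this is not the library implementation.

```python
import math


def k_scale(q, delta):
    """Assumed scale function; it is steep near q = 0 and q = 1."""
    return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)


def build_digest(data, delta=20.0):
    """Sort data and group it into (centroid, count) bins of k-size <= 1."""
    points = sorted(data)
    n = len(points)
    digest = []
    if n == 0:
        return digest
    bin_sum, bin_count = points[0], 1
    seen = 0                      # points in bins that are already closed
    k_left = k_scale(0.0, delta)  # k at the left edge of the open bin
    for x in points[1:]:
        q_right = (seen + bin_count + 1) / n
        if k_scale(q_right, delta) - k_left <= 1.0:
            bin_sum += x          # still small enough, absorb the point
            bin_count += 1
        else:
            digest.append((bin_sum / bin_count, bin_count))
            seen += bin_count
            k_left = k_scale(seen / n, delta)
            bin_sum, bin_count = x, 1
    digest.append((bin_sum / bin_count, bin_count))
    return digest


digest = build_digest([float(i) for i in range(1000)])
for centroid, count in digest:
    print(f"{centroid:8.2f}  {count}")
```

Because the scale function is steep near the ends, a k-size of 1 covers only a small slice of q there, so the first and last bins hold few points while the middle bins hold many.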
But Wait, You Need a Bit More
• Take a bunch of new data, old t-digest
• Sort the data and the old bins together
• Group into bins
– Note that existing bins have bigger weights
– So they might survive … or might clump
• Remember the centroid and count of each new bin
• That’s an updated t-digest
Oh … and Merging
• Take a bunch of old t-digests
• Sort the bins
• Group into mega-bins
– Respect the size constraint
• Remember the centroid and count of each new bin
• That’s a merged t-digest
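Updating and merging are the same regrouping step applied to weighted bins, which the following sketch tries to show. It assumes a digest is a list of `(centroid, count)` pairs and again assumes the arcsine scale function with a k-size limit of 1 as the size constraint. The names `regroup`, `update` and `merge` are hypothetical helpers, not the library API.

```python
import math


def k_scale(q, delta):
    """Assumed scale function: k(q) = delta / (2*pi) * asin(2q - 1)."""
    return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)


def regroup(bins, delta=20.0):
    """Sort weighted (centroid, count) bins and regroup them under the limit."""
    items = sorted(bins)
    merged = []
    if not items:
        return merged
    total = sum(w for _, w in items)
    cur_mean, cur_w = items[0]
    seen = 0
    k_left = k_scale(0.0, delta)
    for mean, w in items[1:]:
        q_right = min(1.0, (seen + cur_w + w) / total)
        if k_scale(q_right, delta) - k_left <= 1.0:
            # Clump: weighted average of the two centroids.
            cur_mean = (cur_mean * cur_w + mean * w) / (cur_w + w)
            cur_w += w
        else:
            # The open bin survives as it is; start a new one.
            merged.append((cur_mean, cur_w))
            seen += cur_w
            k_left = k_scale(min(1.0, seen / total), delta)
            cur_mean, cur_w = mean, w
    merged.append((cur_mean, cur_w))
    return merged


def update(digest, new_points, delta=20.0):
    """New data enters as bins of count 1 next to the old, heavier bins."""
    return regroup(list(digest) + [(x, 1) for x in new_points], delta)


def merge(digests, delta=20.0):
    """Pool the bins of several digests and regroup them."""
    return regroup([b for d in digests for b in d], delta)


a = update([], [float(i) for i in range(500)])
b = update([], [float(i) for i in range(500, 1000)])
combined = merge([a, b])
print(len(a), len(b), len(combined))
```

The merged digest keeps the total count and stays about as small as each input, which is what makes it practical to store one digest per time window and combine them at query time.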
Adaptive non-linear bins are good
and general
And can be grouped
and regrouped
Results
Status
• Current release (3.x)
– Small accuracy bugs in corner cases
– Best overall is still AVLTreeDigest
• Upcoming release (4.0)
– Better accuracy in pathological cases
– Strictly bounded size
– No dynamic allocation (with MergingDigest)
– Good speed (100 ns for MergingDigest, 5 ns for FloatHistogram)
– Real Soon Now
Example Application
• The data:
– ~ 1 million machines
– Even more services
– Each producing thousands of measurements per second
• Store a t-digest for each 5-minute period for each measurement
• Want to query any combination of keys and produce a t-digest as the result
“what was the distribution of launch times yesterday?”
“what about last month?”
“in Europe versus in North America versus in Asia?”
Collect Data
[Diagram labels: data center, web servers, web-server logs, log-stash, log consolidator, log_events]
And Transport to Global Analytics
[Diagram labels: the same data center components as on the previous slide, plus GHQ with log_events, Elaborate (log-stash), events, Aggregate, Signal detection]
With Many Sources
[Diagram labels: GHQ with log_events, Elaborate (log-stash), events, Aggregate, Signal detection; the data center block (log consolidator, web servers, web-server logs, log-stash, log_events) appears once, then twice, then three times across successive builds of the slide]
What about visualization?
Can’t see small count bars
Good Results
Bad Results – 1% of measurements are 3x bigger
With Better Vertical Scaling
Uniform Bins
FloatHistogram Bins
With FloatHistogram
Original Ping Latency Data
Summary
• Single measurements insufficient, need distributions
• Uniform binned histograms not good
• FloatHistogram for some cases
• T-digest for general cases
• Upcoming release has super-fast and accurate versions
• Good visualization also key
[Plot: k versus q, with q from 0.0 to 1.0 and k from 0 to 10]
Q & A
T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation
[Plot: k versus q, with q from 0.0 to 1.0 and k from 0 to 10]
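The "interpolate using centroids in x" step can be sketched as below. This is a simplified illustration under assumptions: a digest is taken to be a list of `(centroid, count)` pairs sorted by centroid, each centroid is placed at the middle of the ranks it covers, and queries beyond the first or last midpoint are clamped to the end centroids. The function name `quantile` is illustrative and does not describe how the library treats the extremes.

```python
def quantile(digest, q):
    """Estimate the q-quantile by linear interpolation between centroids."""
    if not digest:
        raise ValueError("empty digest")
    total = sum(count for _, count in digest)
    target = q * total
    # Place each centroid at the midpoint of the ranks it covers.
    mids = []
    seen = 0.0
    for mean, count in digest:
        mids.append((seen + count / 2.0, mean))
        seen += count
    if target <= mids[0][0]:
        return mids[0][1]
    if target >= mids[-1][0]:
        return mids[-1][1]
    for (r0, x0), (r1, x1) in zip(mids, mids[1:]):
        if r0 <= target <= r1:
            t = (target - r0) / (r1 - r0)
            return x0 + t * (x1 - x0)
    return mids[-1][1]


example = [(1.0, 2), (5.0, 4), (9.0, 2)]
print(quantile(example, 0.5))
```

Since the bins near the ends hold few samples, neighbouring midpoints there are close together in rank, so the interpolation error in q is smallest exactly where extreme quantiles are read off.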
