Common Machine Learning Solutions Everyone Needs to Know

© 2019 SPLUNK INC.
Eurus Kim | Amir Malekpour
Wednesday, October 23, 2019

Eurus Kim, Staff ML Architect | Splunk
Amir Malekpour, Principal Software Engineer | Splunk

Forward-Looking Statements
During the course of this presentation, we may make forward-looking statements
regarding future events or plans of the company. We caution you that such statements
reflect our current expectations and estimates based on factors currently known to us
and that actual events or results may differ materially. The forward-looking statements
made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, it may not contain current or
accurate information. We do not assume any obligation to update
any forward-looking statements made herein.
In addition, any information about our roadmap outlines our general product direction
and is subject to change at any time without notice. It is for informational purposes only,
and shall not be incorporated into any contract or other commitment. Splunk undertakes
no obligation either to develop the features or functionalities described or to include any
such feature or functionality in a future release.
Splunk, Splunk>, Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud,
Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the
United States and other countries. All other brand names, product names, or
trademarks belong to their respective owners. © 2019 Splunk Inc. All rights reserved.

Agenda
Understanding Machine Learning with Splunk
Solution 1 – Outlier Detection using DensityFunction
• Use cases covered
• What is a density function?
• How to use DensityFunction
Solution 2 – Forecasting using StateSpaceForecast
• Understanding forecasting
• How to use StateSpaceForecast
• Caveats and considerations

Understanding Machine Learning with Splunk

What is Machine Learning?
• Use mathematical models to learn patterns in information
• Catalog the patterns (and in some cases, iterate them as new data is received)
• Use learned patterns to understand and interpret new data or make predictions

Splunk Customers Want Answers from their Data
Anomaly detection
• Deviation from past behavior
• Deviation from peers (aka Multivariate AD or Cohesive AD)
• Unusual change in features
Predictive Analytics
• Predict Service Health Score/Churn
• Predicting Events
• Trend Forecasting
• Detecting influencing entities
• Early warning of failure
Clustering
• Identify peer groups
• Event Correlation
• Reduce alert noise
• Behavioral Analytics

Splunk Customers Want Answers from their Data
Solution #1: Anomaly detection
Solution #2: Predictive Analytics

Overview of ML at Splunk
• CORE PLATFORM SEARCH + Smarter Splunk
• PACKAGED PREMIUM SOLUTIONS
• MACHINE LEARNING TOOLKIT
Platform for Operational Intelligence

Splunk Machine Learning Toolkit (MLTK)
Build custom analytics for any use case
• Experiments and Assistants: Guided model building, testing, and deployment for common objectives
• Showcases: Interactive examples for typical IT, security, business, and IoT use cases
• Algorithms: 80+ standard algorithms (supervised & unsupervised)
• ML Commands: New SPL commands to fit, test, score and operationalize models
• ML-SPL API: Extensibility to easily import any algorithm (proprietary / open source)
• Python for Scientific Computing Library: Access to 300+ open source algorithms
• Apache Spark MLlib: Support large scale model training via Spark Add-on for MLTK (LAR)
• TensorFlow Container: Supports NN and GPU accelerated machine learning

Solution #1: Outlier Detection Using DensityFunction

Solution 1 – Using DensityFunction
What type of use cases are we talking about?
Outlier in some numerical value
• Number of transactions
• Transaction latency
• System utilization (CPU/memory)
• Number of logins
• Amount of data transfer
• Time between actions
• Sensor measurement
Detect Numeric Outliers Assistant in MLTK

We can do this today with the MLTK
Using the Detect Numeric Outliers assistant

But it can be hard to figure out how to use
Which method works best for my data?
• Using Standard Deviation with no sliding window
• Using Standard Deviation with a sliding window
• Using Median Absolute Deviation with a sliding window
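
To make the choice between these methods concrete, here is a minimal Python sketch (not the assistant's implementation) that computes outlier bounds from the standard deviation and from the median absolute deviation over the whole series, with no sliding window. The `latencies` values, the function names, and the multiplier `k=3` are assumptions chosen for illustration.

```python
import statistics

def sd_bounds(values, k=3.0):
    """Bounds at mean +/- k standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return mean - k * sd, mean + k * sd

def mad_bounds(values, k=3.0):
    """Bounds at median +/- k median absolute deviations."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med - k * mad, med + k * mad

# Hypothetical transaction latencies with one extreme value.
latencies = [10, 11, 12, 11, 10, 12, 11, 13, 10, 11, 12, 95]

sd_low, sd_high = sd_bounds(latencies)
mad_low, mad_high = mad_bounds(latencies)
print(f"SD bounds:  {sd_low:.1f} .. {sd_high:.1f}")
print(f"MAD bounds: {mad_low:.1f} .. {mad_high:.1f}")
```

On this sample the single extreme value inflates the mean and standard deviation, so the SD bounds end up far wider than the MAD bounds. That sensitivity is one reason the "best" method depends on the data.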

And there is no model created
You have to run your search on all your data every time
Where’s the fit command?

Why is it so hard?
Your data may not be so “Normal”
When viewing our data as a histogram, the average may not be so “average”
[Figure: histogram marked with the average and bands at 1 and 2 SD below and 1, 2, and 3 SD above]

What if we could follow the shape of our data?
We can with the DensityFunction algorithm!

What is a Density Function Anyway?

What is a Density Function Anyway?
A mathematical function that maps outcomes to their relative likelihood
[Figure: density curve over a parameter, with “Likely” and “Not Likely” regions labeled]

Fitting with DensityFunction
With a set of values, we’d like to know their distribution type and parameters

Fitting with DensityFunction
DensityFunction fits your data over a set of distributions and picks the best fit
[Figure: Data → DensityFunction → Model (Type: Gaussian KDE; Parameters: (x1, x2, ...))]
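
The idea of fitting several candidate distributions and keeping the best one can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The two candidates (normal and exponential), the log-likelihood selection criterion, and all names below are chosen for the example and are not taken from the MLTK implementation, which may select differently.

```python
import math
import statistics

def fit_normal(values):
    """Maximum-likelihood normal fit and its log-likelihood."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    loglik = sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in values
    )
    return {"type": "normal", "params": (mu, sigma), "loglik": loglik}

def fit_exponential(values):
    """Maximum-likelihood exponential fit (values must be non-negative)."""
    rate = 1.0 / statistics.fmean(values)
    loglik = sum(math.log(rate) - rate * v for v in values)
    return {"type": "exponential", "params": (rate,), "loglik": loglik}

def pick_best_fit(values):
    """Keep the candidate with the highest log-likelihood (assumed criterion)."""
    candidates = (fit_normal(values), fit_exponential(values))
    return max(candidates, key=lambda model: model["loglik"])

# Hypothetical samples: one roughly symmetric, one heavily right-skewed.
symmetric = [48, 49, 50, 50, 51, 52, 50, 49, 51, 50]
skewed = [0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 1.9, 3.0, 5.0, 9.0]

print(pick_best_fit(symmetric)["type"])
print(pick_best_fit(skewed)["type"])
```

The returned dictionary plays the role of the "model" on the slide: a distribution type plus its fitted parameters.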

Outlier Detection with DensityFunction
When new data comes in, we use our density function to determine its likelihood
[Figure: density curve with a low-likelihood point labeled “Outlier”]
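
A rough Python sketch of this step, assuming a normal distribution was the fitted model: values that fall in the low-density tails are flagged as outliers. The `threshold` argument here is an assumed cutoff meaning "total tail area treated as outlying", split evenly between both tails. The function names and sample data are hypothetical.

```python
from statistics import NormalDist

def fit_density(values):
    """Fit a normal density to historical values (assumed distribution type)."""
    return NormalDist.from_samples(values)

def boundary_ranges(model, threshold=0.01):
    """Values below `lower` or above `upper` lie in the outlier tails."""
    lower = model.inv_cdf(threshold / 2)
    upper = model.inv_cdf(1 - threshold / 2)
    return lower, upper

def is_outlier(model, value, threshold=0.01):
    """True when the value falls outside the boundary ranges."""
    lower, upper = boundary_ranges(model, threshold)
    return value < lower or value > upper

# Hypothetical history of a numeric measurement.
history = [48, 49, 50, 50, 51, 52, 50, 49, 51, 50]
model = fit_density(history)

for new_value in (50, 53.5, 60):
    print(new_value, is_outlier(model, new_value))
```

Because the model is fitted once and reused, scoring a new value only needs the stored parameters rather than a pass over all historical data.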

Caveats with DensityFunction
Don’t fit on noise!
If you have only a few data points it’s likely you’re fitting on noise

Caveats with DensityFunction
Beware of shifting mean!
If your measure is cumulative, your distribution mean shifts
[Figure panels: Total number of logins/month: Day 5 | Total number of logins/month: Day 25]
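
A small Python illustration of the shifting mean, using made-up numbers: three hypothetical users log in at a steady daily rate, yet the month-to-date totals on day 5 and day 25 have very different means. A model fitted on the cumulative measure early in the month would therefore treat ordinary late-month values as extreme.

```python
from itertools import accumulate
from statistics import fmean

# Hypothetical steady daily login counts for three users over 30 days.
daily_logins_by_user = {
    "alice": [4] * 30,
    "bob": [6] * 30,
    "carol": [5] * 30,
}

def month_to_date_totals(day):
    """Cumulative login total per user at the end of the given day (1-based)."""
    return [
        list(accumulate(daily))[day - 1]
        for daily in daily_logins_by_user.values()
    ]

mean_day5 = fmean(month_to_date_totals(5))
mean_day25 = fmean(month_to_date_totals(25))
mean_daily = fmean(fmean(daily) for daily in daily_logins_by_user.values())

print("mean of month-to-date totals on day 5: ", mean_day5)
print("mean of month-to-date totals on day 25:", mean_day25)
print("mean of daily counts (does not drift):  ", mean_daily)
```

One way around the drift, under these assumptions, is to model the per-day count rather than the running total.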

How do you use DensityFunction?

Using DensityFunction
First you should understand the shape (distribution) of your data
index=your-index field=value
| stats count as my_field by dim1 dim2
...
| bin my_field bins=1000
| stats count by my_field
| makecontinuous my_field
| fillnull
| sort my_field
Or use the `histogram` macro in MLTK!
...
| `histogram(my_field,1000)`
Use the Column Chart or Histogram Chart (in MLTK) visualization

Possibly also understand the shape of your data over time
index=your-index other search terms
...
| timechart span=5m avg(my_field) as my_field

Create a DensityFunction model
...
| fit DensityFunction my_field by "dim1,dim2" into MyDFModel as IsOutlier
threshold=0.01 dist=auto

Applying your DensityFunction model
...
| apply MyDFModel threshold=0.005
| search "IsOutlier(my_field)"=1

You can change your threshold at apply
The BoundaryRanges field designates the ranges where values are outliers

Visualizing the Probability Density Estimate
Visualize as a Bar Chart, and put the pd field on a separate axis
...
| fit DensityFunction my_field show_density=true
| bin my_field bins=100
| stats count avg("ProbabilityDensity(my_field)") as pd by my_field
| makecontinuous my_field
| sort my_field

Get more advanced and Create an Anomaly Score
Apply different pivots of your data with different models
...
| fit DensityFunction my_field as IsOutlierOverall
| fit DensityFunction my_field by "dim1" as IsOutlierByDim1
| fit DensityFunction my_field by "dim2" as IsOutlierByDim2
| eval AnomalyScore=0
| foreach IsOutlier* [eval AnomalyScore=AnomalyScore+<<FIELD>>]

Using the Smart Outlier Detection Assistant
Putting it all together with an “easier” button

Solution #2: Forecasting using StateSpaceForecast

Forecast vs Prediction
What is the difference?
Forecast
• Given the past values of a metric, tell me what the value will look like X time periods from now (e.g. tomorrow, next week, etc.).
• Forecasting relies on time and the historical values of the measurement in question as its inputs.
Prediction
• Given the past values of a set of fields, estimate (or predict) what the value of one of those fields will be, given the other fields as inputs.
• Prediction relies on many other inputs to try to explain the relationship between those inputs and the measurement you are trying to predict.
But both of these fall under the category of “Predictive Analytics”

Of the two, forecasting is what we are covering.

Why would I use forecasting?
Typically used for planning
• Based on past trends, what do we expect next week/month/quarter/year to look like?
• Capacity planning (hard drive, operating temperature)
Forecasting is not a crystal ball, but it gives you a quantitative estimate of future values
• Getting a picture of what the future might look like

Using the old way for forecasting
There’s nothing wrong with the old way, it’s just often improperly used
You have to be an expert at the math
• You have to specify the algorithm to use for the predict command
• You have to know how to optimize on P, D, and Q parameters for ARIMA
There is no model file created, which means you can’t “apply” your model to future data
Doesn’t consider special days (holidays)

Using StateSpaceForecast
Applying more real-time operational use cases
• Uses basically the same math (Kalman filter) as the predict command, but it will try to figure out the parameters and mode (algorithm in predict)
• You can “apply” your model to future data
• You can account for special days
• You can use incremental fit (continuously update your model with new data)
• You can do multivariate analysis
• It will automatically impute the missing values (null values)

Confidence Level and Confidence Interval
What’s the difference?
• Confidence level is how confident we are that our confidence interval includes the real value
• Confidence interval and confidence level need to be interpreted together
• 95% confidence level means we are 95% confident that the confidence interval includes the true value

Confidence Level and Confidence Interval
Interpreting the data further into the future
The confidence interval widens the further out we forecast because the algorithm needs more “leeway” to fulfill its promise of a 95% confidence level
The confidence interval is not about whether the prediction is an outlier. It is about the accuracy of the prediction.
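
The widening interval can be illustrated with a deliberately simple stand-in model. The sketch below assumes a random walk without drift, where the forecast is the last observed value and the forecast variance grows linearly with the horizon. It is not the Kalman filter used by StateSpaceForecast, and the series and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, stdev

def forecast_with_interval(series, horizon, conf_level=0.95):
    """Random-walk forecast: (point, lower, upper) for each step ahead."""
    steps = [later - earlier for earlier, later in zip(series, series[1:])]
    sigma = stdev(steps)                            # spread of one-step changes
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)  # about 1.96 for 95%
    last = series[-1]
    return [
        (last, last - z * sigma * sqrt(h), last + z * sigma * sqrt(h))
        for h in range(1, horizon + 1)
    ]

# Hypothetical metric history.
series = [10, 12, 11, 13, 12, 14, 13, 15]

for h, (point, lower, upper) in enumerate(forecast_with_interval(series, 5), start=1):
    print(f"t+{h}: {point} (95% interval {lower:.1f} .. {upper:.1f})")
```

In this model the half-width grows with the square root of the horizon, and asking for a higher confidence level widens every interval.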

Caveats with StateSpaceForecast
• Don’t project too far into the future
• Choose a large confidence level (e.g., 95%)
• If the confidence interval is too wide, be careful about the reliability of the forecast

Forecasting is Sensitive to Outliers
Make sure you do some data cleansing first
[Figure panels: Raw data without cleaning up outliers | After cleaning up some of the outliers]
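
A hedged Python sketch of why cleansing matters: a least-squares linear trend stands in for the forecasting model, and a median absolute deviation rule stands in for the cleansing step. Both are simplifications chosen for illustration, as are the data, the multiplier `k=3`, and the crude choice to replace outliers with the median.

```python
from statistics import fmean, median

def trend_forecast(values, steps_ahead):
    """Fit a least-squares line over the index and extrapolate it."""
    xs = range(len(values))
    x_mean, y_mean = fmean(xs), fmean(values)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * (len(values) - 1 + steps_ahead)

def replace_outliers(values, k=3.0):
    """Replace points more than k MADs from the median with the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [med if abs(v - med) > k * mad else v for v in values]

# Hypothetical metric with a single spike at the end.
raw = [100, 102, 101, 103, 102, 104, 103, 105, 104, 180]
cleaned = replace_outliers(raw)

print(f"forecast from raw data:     {trend_forecast(raw, 5):.1f}")
print(f"forecast from cleaned data: {trend_forecast(cleaned, 5):.1f}")
```

A single spike is enough to drag the fitted trend upward, so the two forecasts diverge noticeably even over a short horizon.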

Key Takeaways
1. Use DensityFunction for finding outliers
• Visually inspect fit and tune threshold
• Don’t fit over noise
2. Use StateSpaceForecast for projection and planning
• Remove outliers before fitting
• Pay attention to confidence interval
