© 2019 SPLUNK INC.
Common Machine Learning Solutions Everyone Needs to Know
Eurus Kim | Amir Malekpour
Wednesday, October 23, 2019
Eurus Kim | Staff ML Architect | Splunk
Amir Malekpour | Principal Software Engineer | Splunk
During the course of this presentation, we may make forward‐looking statements
regarding future events or plans of the company. We caution you that such statements
reflect our current expectations and estimates based on factors currently known to us
and that actual events or results may differ materially. The forward-looking statements
made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, it may not contain current or
accurate information. We do not assume any obligation to update
any forward‐looking statements made herein.
In addition, any information about our roadmap outlines our general product direction
and is subject to change at any time without notice. It is for informational purposes only,
and shall not be incorporated into any contract or other commitment. Splunk undertakes
no obligation either to develop the features or functionalities described or to include any
such feature or functionality in a future release.
Splunk, Splunk>, Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud,
Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the
United States and other countries. All other brand names, product names, or
trademarks belong to their respective owners. © 2019 Splunk Inc. All rights reserved.
Forward-Looking Statements
Agenda
Understanding Machine Learning with Splunk
Solution 1 – Outlier Detection using DensityFunction
• Use cases covered
• What is a Density Function?
• How to use DensityFunction
Solution 2 – Forecasting using StateSpaceForecast
• Understanding forecasting
• How to use StateSpaceForecast
• Caveats and considerations
Understanding Machine Learning with Splunk
What is Machine Learning?
• Use mathematical models to learn patterns in information
• Catalog the patterns (and in some cases, iterate them as new data is received)
• Use learned patterns to understand and interpret new data or make predictions
Splunk Customers Want Answers from their Data
Anomaly detection
• Deviation from past behavior
• Deviation from peers (aka Multivariate AD or Cohesive AD)
• Unusual change in features
Predictive Analytics
• Predict Service Health Score/Churn
• Predicting Events
• Trend Forecasting
• Detecting influencing entities
• Early warning of failure
Clustering
• Identify peer groups
• Event Correlation
• Reduce alert noise
• Behavioral Analytics
Solution #1 targets the Anomaly detection use cases.
Solution #2 targets the Predictive Analytics use cases.
Overview of ML at Splunk
CORE PLATFORM SEARCH + Smarter Splunk
PACKAGED PREMIUM SOLUTIONS
MACHINE LEARNING TOOLKIT
Platform for Operational Intelligence
Splunk Machine Learning Toolkit (MLTK)
Build custom analytics for any use case
• Experiments and Assistants: Guided model building, testing, and deployment for common objectives
• Showcases: Interactive examples for typical IT, security, business, and IoT use cases
• Algorithms: 80+ standard algorithms (supervised & unsupervised)
• ML Commands: New SPL commands to fit, test, score, and operationalize models
• ML-SPL API: Extensibility to easily import any algorithm (proprietary / open source)
• Python for Scientific Computing Library: Access to 300+ open source algorithms
• Apache Spark MLlib: Support for large-scale model training via Spark Add-on for MLTK (LAR)
• TensorFlow Container: Supports NN and GPU-accelerated machine learning
Solution #1: Outlier Detection Using DensityFunction
Solution 1 – Using DensityFunction
What type of use cases are we talking about?
Outlier in some numerical value
• Number of transactions
• Transaction latency
• System utilization (CPU/memory)
• Number of logins
• Amount of data transfer
• Time between actions
• Sensor measurement
Detect Numeric Outliers Assistant in MLTK
We can do this today with the MLTK
Using the Detect Numeric Outliers assistant
But it can be hard to figure out how to use it
Which method works best for my data?
• Using Standard Deviation with no sliding window
• Using Standard Deviation with a sliding window
• Using Median Absolute Deviation with a sliding window
And there is no model created
You have to run your search on all your data every time
Where’s the fit command?
Why is it so hard?
Your data may not be so “Normal”
[Figure: histogram annotated with the Average and with markers at 1 SD and 2 SD below and at 1 SD, 2 SD, and 3 SD above]
When viewing our data as a histogram, the average may not be so “average”
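To make the point concrete, here is a minimal sketch in plain Python (not MLTK code). The latency values are hypothetical, chosen only to be right-skewed; they are not taken from the presentation.

```python
import statistics

# Hypothetical, right-skewed latency samples in milliseconds (assumed data).
latencies = [12, 13, 13, 14, 14, 15, 15, 16, 18, 20, 25, 40, 90, 250]

mean = statistics.mean(latencies)
median = statistics.median(latencies)
sd = statistics.pstdev(latencies)

# Classic "mean +/- 3 SD" bounds, which implicitly assume a normal shape.
lower_3sd = mean - 3 * sd
upper_3sd = mean + 3 * sd

# How many samples sit below the so-called "average"?
below_mean = sum(1 for x in latencies if x < mean)

print(f"mean={mean:.1f} median={median} sd={sd:.1f}")
print(f"3 SD bounds: [{lower_3sd:.1f}, {upper_3sd:.1f}]")
print(f"{below_mean} of {len(latencies)} samples are below the mean")
```

With skewed data like this, most samples fall below the mean, and the lower 3 SD bound goes negative, so it can never fire for a latency. That is the sense in which the average is not very "average".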
What if we could follow the shape of our data?
We can with the DensityFunction algorithm!
What is a Density Function Anyway?
A mathematical function that maps outcomes to their relative likelihood
[Figure: density curve plotted over a parameter, with the high region labeled Likely and the low region labeled Not Likely]
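As a small illustration of "maps outcomes to their relative likelihood", the sketch below evaluates a normal density at two points using Python's standard library. The mean of 100 and SD of 15 are assumed values for illustration only.

```python
from statistics import NormalDist

# Assumed example density: normal with mean 100 and standard deviation 15.
density = NormalDist(mu=100, sigma=15)

likely = density.pdf(100)      # at the peak of the curve
not_likely = density.pdf(160)  # four standard deviations out in the tail

# The density values are relative likelihoods, so their ratio is what matters.
ratio = likely / not_likely
print(f"pdf(100)={likely:.5f} pdf(160)={not_likely:.2e} ratio={ratio:.0f}")
```

The absolute density values are not probabilities. Only the comparison between them (here, a value at the peak is thousands of times more likely than one in the tail) carries meaning.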
Fitting with DensityFunction
With a set of values, we’d like to know their distribution type and parameters
[Figure: Data goes into DensityFunction, which produces a Model (Type: Gaussian KDE; Parameters: (x1, x2, ...))]
DensityFunction fits your data over a set of distributions and picks the best fit
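The idea of fitting several candidate distributions and keeping the best one can be sketched as follows. This is not MLTK's implementation: the two candidates (normal and exponential), the use of log-likelihood as the score, and the synthetic data are all assumptions made for illustration, and the selection criterion DensityFunction actually uses may differ.

```python
import math
from statistics import NormalDist, mean, pstdev

def normal_loglik(data):
    """Log-likelihood of data under a normal fitted by maximum likelihood."""
    mu, sigma = mean(data), pstdev(data)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def expon_loglik(data):
    """Log-likelihood of data under an exponential fitted by maximum likelihood."""
    rate = 1 / mean(data)
    return sum(math.log(rate) - rate * x for x in data)

def pick_best(data):
    """Score every candidate and return the name of the highest-scoring one."""
    scores = {"norm": normal_loglik(data), "expon": expon_loglik(data)}
    return max(scores, key=scores.get), scores

# Synthetic data (assumed): evenly spaced quantiles of each distribution.
skewed = [-math.log(1 - (i + 0.5) / 50) for i in range(50)]
symmetric = [NormalDist(50, 5).inv_cdf((i + 0.5) / 50) for i in range(50)]

best_skewed, _ = pick_best(skewed)
best_symmetric, _ = pick_best(symmetric)
print(best_skewed, best_symmetric)
```

A Gaussian KDE candidate, as shown on the slide, is omitted here to keep the sketch short.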
Outlier Detection with DensityFunction
When new data comes in, we use our density function to determine its likelihood
[Figure: density curve with a low-likelihood point labeled Outlier]
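A minimal sketch of that scoring step, assuming the fitted model is a plain normal distribution and that "unlikely" means a two-sided tail probability below a threshold. Both are simplifying assumptions; the function names and training values are hypothetical.

```python
from statistics import NormalDist

def fit_normal(values):
    """Fit a normal model to the training values (sample mean and SD)."""
    return NormalDist.from_samples(values)

def is_outlier(model, x, threshold=0.01):
    """Flag x when the two-sided tail probability at x is below threshold."""
    cdf = model.cdf(x)
    tail = 2 * min(cdf, 1 - cdf)
    return tail < threshold

# Hypothetical training data: transactions per minute during normal operation.
training = [48, 52, 50, 49, 51, 50, 47, 53, 50, 50]
model = fit_normal(training)

typical_flag = is_outlier(model, 51)
extreme_flag = is_outlier(model, 60)
print(typical_flag, extreme_flag)
```

The model is fitted once and then reused to score each new value, which is the property the earlier Detect Numeric Outliers approach lacked.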
Caveats with DensityFunction
Don’t fit on noise!
If you have only a few data points, it’s likely you’re fitting on noise
Caveats with DensityFunction
Beware of shifting mean!
If your measure is cumulative, your distribution mean shifts
[Figure: two panels, "Total number of logins/month: Day 5" and "Total number of logins/month: Day 25"]
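One way to sidestep the shifting mean is to model per-period increments instead of the running total. The sketch below shows the conversion in plain Python; the login totals are hypothetical and the helper name is illustrative.

```python
from statistics import mean

def daily_increments(cumulative):
    """Convert a running total into per-day increments."""
    previous = [0] + cumulative[:-1]
    return [today - before for before, today in zip(previous, cumulative)]

# Hypothetical month-to-date login totals for five consecutive days.
month_to_date = [120, 250, 370, 500, 630]
per_day = daily_increments(month_to_date)

# The mean of the running total depends on how far into the month we are...
mean_after_2_days = mean(month_to_date[:2])
mean_after_5_days = mean(month_to_date)
# ...while the per-day increments stay in a stable range.
print(per_day, mean_after_2_days, mean_after_5_days)
```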
How do you use DensityFunction?
Using DensityFunction
First you should understand the shape (distribution) of your data
index=your-index field=value
| stats count as my_field by dim1 dim2
...
| bin my_field bins=1000
| stats count by my_field
| makecontinuous my_field
| fillnull
| sort my_field
Or use the `histogram` macro in MLTK!
...
| `histogram(my_field,1000)`
Use the Column Chart or Histogram Chart (in MLTK) visualization
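For readers who want to see what the binning search computes, here is a rough Python equivalent. It is a sketch only: it uses a fixed bin width on integer values, whereas `bin ... bins=1000` in SPL sets a maximum number of bins, and the sample values are hypothetical.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count integer values per fixed-width bin, including empty bins."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lowest, highest = min(counts), max(counts)
    # Emit every bin between the lowest and highest with a count of 0 where
    # nothing landed, similar in spirit to makecontinuous followed by fillnull.
    return [(start, counts.get(start, 0))
            for start in range(lowest, highest + bin_width, bin_width)]

# Hypothetical per-entity event counts.
bins = histogram([1, 2, 2, 3, 11, 12, 35], bin_width=10)
print(bins)
```

Keeping the empty bins matters for the chart: without them, gaps in the distribution would be hidden and the shape would look more compact than it is.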
Using DensityFunction
Possibly also understand the shape of your data over time
index=your-index other search terms
...
| timechart span=5m avg(my_field) as my_field
Using DensityFunction
Create a DensityFunction model
index=your-index other search terms
| stats count as my_field by dim1 dim2
...
| fit DensityFunction my_field by "dim1,dim2" into MyDFModel as IsOutlier threshold=0.01 dist=auto
Using DensityFunction
Applying your DensityFunction model
index=your-index other search terms
| stats count as my_field by dim1 dim2
...
| apply MyDFModel threshold=0.005
| search "IsOutlier(my_field)"=1
You can change your threshold at apply
The BoundaryRanges field designates the ranges where values count as outliers
| apply MyDFModel threshold=0.01
| apply MyDFModel threshold=0.001
| apply MyDFModel threshold=0.0001
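To see why a smaller threshold flags fewer points, the sketch below computes where the outlier boundaries would sit for each of the three thresholds, assuming the fitted model is a normal distribution with mean 50 and SD 5 and that the threshold is split evenly between the two tails. These are illustrative assumptions, not a description of how MLTK derives BoundaryRanges.

```python
from statistics import NormalDist

# Assumed fitted model, for illustration only.
model = NormalDist(mu=50, sigma=5)

def boundary_range(model, threshold):
    """Return (lower, upper); values outside this range count as outliers."""
    lower = model.inv_cdf(threshold / 2)
    upper = model.inv_cdf(1 - threshold / 2)
    return lower, upper

ranges = {t: boundary_range(model, t) for t in (0.01, 0.001, 0.0001)}
for threshold, (lower, upper) in ranges.items():
    print(f"threshold={threshold}: outlier if value < {lower:.2f} or > {upper:.2f}")
```

Each tenfold reduction in the threshold pushes the boundaries further from the center, so the same model can be made stricter or looser at apply time without refitting.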
Visualizing the Probability Density Estimate
...
| fit DensityFunction my_field show_density=true
| bin my_field bins=100
| stats count avg("ProbabilityDensity(my_field)") as pd by my_field
| makecontinuous my_field
| sort my_field
Visualize as a Bar Chart, and put the pd field on a separate axis
Get more advanced and create an Anomaly Score
Apply different pivots of your data with different models
...
| fit DensityFunction my_field as IsOutlierOverall
| fit DensityFunction my_field by "dim1" as IsOutlierByDim1
| fit DensityFunction my_field by "dim2" as IsOutlierByDim2
| eval AnomalyScore=0
| foreach IsOutlier* [eval AnomalyScore=AnomalyScore+<<FIELD>>]
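The `foreach` step simply adds up the 0/1 outlier flags produced by each model. A Python rendering of the same arithmetic, using a hypothetical result row, looks like this:

```python
def anomaly_score(row):
    """Sum every 0/1 flag whose field name starts with "IsOutlier"."""
    return sum(value for name, value in row.items()
               if name.startswith("IsOutlier"))

# Hypothetical result row after the three fit commands above.
row = {
    "my_field": 930,
    "IsOutlierOverall": 1,
    "IsOutlierByDim1": 0,
    "IsOutlierByDim2": 1,
}
score = anomaly_score(row)
print(score)
```

A higher score means the value looked unusual from more pivots of the data, which gives a simple way to rank results instead of treating every flag equally.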
Using the Smart Outlier Detection Assistant
Putting it all together with an “easier” button
Solution #2: Forecasting Using StateSpaceForecast
Let’s clarify some nomenclature
Forecast ≠ Prediction
Forecast vs Prediction
What is the difference?
Forecast
• Given the past values of a metric, tell me what the value will look like X time periods from now (e.g., tomorrow, next week).
• Forecasting relies on time and the historical values of the measurement in question as its inputs.
Prediction
• Given the past values of a set of fields, estimate (or predict) what the value of one of those fields will be, given the other fields as inputs.
• Prediction relies on many other inputs to try to explain the relationship between those inputs and the measurement you are trying to predict.
But both of these fall under the category of “Predictive Analytics”
This session covers forecasting, not prediction.
Why would I use forecasting?
Typically used for planning
• Based on past trends, what do we expect next week/month/quarter/year to look like?
• Capacity planning (hard drive, operating temperature)
Forecasting is not a crystal ball, but it gives you a quantitative estimate of future values
• Getting a picture of what the future might look like
The old way of forecasting in MLTK
| predict my_field algorithm=LLP5 holdback=112 future_timespan=224
Using the old way for forecasting
There’s nothing wrong with the old way, it’s just often improperly used
You have to be an expert at the math
• You have to specify the algorithm to use for the predict command
• You have to know how to optimize the P, D, and Q parameters for ARIMA
There is no model file created, which means you can’t “apply” your model to future data
It doesn’t consider special days (holidays)
The new way of forecasting in MLTK
| fit StateSpaceForecast my_field holdback=112 forecast_k=224
Using StateSpaceForecast
Applying more real-time operational use cases
• Uses basically the same math (Kalman filter) as the predict command, but it will try to figure out the parameters and mode (algorithm in predict)
• You can “apply” your model to future data
• You can account for special days
• You can use incremental fit (continuously update your model with new data)
• You can do multivariate analysis
• It will automatically impute the missing values (null values)
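For intuition about the Kalman filter math mentioned above, here is a minimal local-level filter in plain Python. It is a sketch under stated assumptions, not the StateSpaceForecast implementation: the noise variances `q` and `r` are fixed by hand (the algorithm is described as estimating its own parameters), only the simplest mode is shown, and the input series is hypothetical.

```python
def local_level_filter(series, q=1.0, r=4.0):
    """Kalman filter for the model level[t] = level[t-1] + w, y[t] = level[t] + v.

    q is the assumed variance of w, r the assumed variance of v.
    The first element of series must be an observed value, not None.
    """
    level, p = series[0], r           # start at the first observation
    for y in series[1:]:
        p = p + q                     # predict: uncertainty grows by q
        if y is None:                 # missing value: keep the prediction
            continue
        gain = p / (p + r)            # Kalman gain
        level = level + gain * (y - level)
        p = (1 - gain) * p            # update: uncertainty shrinks
    return level, p

def forecast(level, p, steps, q=1.0, r=4.0):
    """Return (point forecast, forecast variance) for each step ahead."""
    return [(level, p + h * q + r) for h in range(1, steps + 1)]

# Hypothetical series with one missing value.
level, p = local_level_filter([10, 10, None, 10, 10])
future = forecast(level, p, steps=3)
print(level, future)
```

Two of the listed behaviors show up even in this toy version: a missing value is bridged by the model's own prediction rather than breaking the fit, and the filter state (`level`, `p`) can be carried forward and updated as new data arrives.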
StateSpaceForecast Caveats and Considerations
Confidence Level and Confidence Interval
What’s the difference?
• Confidence level is how confident we are that our confidence interval includes the real value
• Confidence interval and confidence level need to be interpreted together
• A 95% confidence level means we are 95% confident that the confidence interval includes the true value
Confidence Level and Confidence Interval
Interpreting the data further into the future
The confidence interval widens the further the forecast extends into the future because the algorithm needs more “leeway” to fulfill its promise of a 95% confidence level
The confidence interval is not about whether the prediction is an outlier. It is about the accuracy of the prediction.
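The widening can be shown with a few lines of arithmetic. The sketch assumes a simple local-level state space model in which the forecast variance grows by a fixed amount `q` per step; the values of `p`, `q`, and `r` are illustrative assumptions, not values StateSpaceForecast would necessarily choose.

```python
from statistics import NormalDist

def interval_half_width(steps_ahead, p=2.0, q=1.0, r=4.0, conf_level=0.95):
    """Half-width of the forecast interval for a local-level model.

    p: variance of the current level estimate (assumed)
    q: variance added to the level at every step (assumed)
    r: observation noise variance (assumed)
    """
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)   # about 1.96 for 95%
    variance = p + steps_ahead * q + r
    return z * variance ** 0.5

widths = [interval_half_width(h) for h in (1, 5, 10)]
wider_at_99 = interval_half_width(10, conf_level=0.99)
print([round(w, 2) for w in widths], round(wider_at_99, 2))
```

Both knobs move the interval: forecasting further ahead adds variance, and asking for a higher confidence level raises the multiplier. That is why the level and the interval have to be read together.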
Caveats with StateSpaceForecast
• Don’t project too far into the future
• Choose a large confidence level (e.g., 95%)
• If the confidence interval is too wide, be careful about the reliability of the forecast
Forecasting is Sensitive to Outliers
Make sure you do some data cleansing first
[Figure: two forecast panels, "Raw data without cleaning up outliers" and "After cleaning up some of the outliers"]
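A minimal sketch of one possible cleansing step. It swaps in a median absolute deviation (MAD) rule as a stand-in for the DensityFunction-based detection the session recommends, and replaces flagged points with the median. The cutoff `k=5` and the series are assumptions for illustration.

```python
from statistics import mean, median

def clean_outliers(series, k=5.0):
    """Replace points more than k MADs from the median with the median."""
    med = median(series)
    mad = median(abs(x - med) for x in series)
    if mad == 0:                      # no spread: nothing to flag reliably
        return list(series)
    return [med if abs(x - med) > k * mad else x for x in series]

# Hypothetical series with one spike.
raw = [10, 11, 9, 10, 12, 10, 95, 11, 10, 9]
cleaned = clean_outliers(raw)

print(cleaned)
print(f"mean before={mean(raw):.1f} after={mean(cleaned):.1f}")
```

A single spike nearly doubles the mean of this short series, and a forecasting model fitted on the raw data would inherit that distortion in both its level and its confidence interval.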
Key Takeaways
1. Use DensityFunction for finding outliers
• Visually inspect fit and tune threshold
• Don’t fit over noise
2. Use StateSpaceForecast for projection and planning
• Remove outliers before fitting
• Pay attention to confidence interval
Thank You!
Q&A
Eurus Kim | Staff ML Architect
Amir Malekpour | Principal Software Engineer
