DevOps tutorial:
How to set up intelligent
machine learning alerts
Sébastien Léger, founder and CEO, Loud ML,
loudml.io
16 January 2019
In this session, you will learn about:
• the TICK stack;
• Loud ML, a popular extension for live
anomaly detection in time-series data;
• how to set up alerting via Kapacitor and
integrate with PagerDuty, and other
alerting tools;
• Donut: the neural net architecture in Loud ML;
• applicable use-cases.
10K+ cumulative downloads in 2018,
lots of feedback, and more!
Joined the Rockstart AI accelerator program
in November 2018.
Thank You!
Agenda
• User stories, DevOps and
IoT
• Typical requirements for
alerting and ML
• Data pipeline
• Live streaming data demo
• Low false positives
(ie, low noise)
• Neural nets, deep-learning,
and Donut
• Loud ML 1.5 – Join the beta
• FAQ
User stories
User story 1: uptime
• My e-commerce site is global.
• Different users from different countries connect every day and
make purchases.
• Planned updates, or DevOps (CI/CD).
• Will it break anything, or cause downtime?
• How to spot whether the number of transactions is correct, or how much
time is spent in the conversion funnel?
User story 2: security
• My e-commerce site is global.
• Different users from different countries connect every day and
make purchases.
• Different volumes from different sources.
User story 3: utilization
• The load changes during the day, during the night, and
during the weekend.
• Cloud or private DC resource utilization versus costs.
User story 4: Internet transit
• Running data center operations across multiple regions.
• Dynamic changes in traffic volume at the network edges.
• Get the right capacity, for the right cost (price per Mb/s).
User story 5: IoT
• PV: voltages, internal temperature, charge cycles, quantity
of electricity produced.
• Remote maintenance: spot damaged batteries.
• Physical infrastructure.
• Patterns in structural frequencies.
• Remote maintenance: spot when significant changes occur.
• Digital clone, industrial IoT (IIoT).
• Normal versus abnormal.
Typical requirements for
alerting and ML
Outliers, and then...
Performance
• Near real time, low alert delay.
• Running at scale, 24/7,
10,000+ users, hosts,
applications, devices.
• Low false positives; ie, low noise.
• Developer friendly: fast to
validate and deploy.
Functionality
• Can understand seasonality in the
data; eg, weekend vs daily
patterns, or across regions.
• Can learn and reinforce
continuously using live data.
• Can understand business rules.
• Works with third-party integrations.
Applicable to logs, metrics, events: page views, clicks, online users, orders,
response times, active IPs, syslogs, temperature, acceleration data & more.
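To make the seasonality requirement concrete, here is a minimal sketch of a baseline keyed by hour of the week, so that weekend traffic is only compared with past weekend traffic. It is a plain statistical illustration, not the method used by Loud ML. The function names, the synthetic data and the threshold of 3 standard deviations are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def hour_of_week(ts: datetime) -> int:
    """Return a bucket from 0 to 167: Monday 00:00 is 0, Sunday 23:00 is 167."""
    return ts.weekday() * 24 + ts.hour


def fit_baseline(points):
    """Group historical (timestamp, value) pairs by hour of week and
    return {bucket: (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[hour_of_week(ts)].append(value)
    return {b: (mean(v), pstdev(v)) for b, v in buckets.items()}


def is_anomaly(baseline, ts, value, n_sigmas=3.0):
    """Flag a value that is more than n_sigmas deviations away from the
    baseline of its own hour-of-week bucket (hypothetical threshold)."""
    bucket = hour_of_week(ts)
    if bucket not in baseline:
        return False  # no history for this bucket, so no verdict
    mu, sigma = baseline[bucket]
    return abs(value - mu) > n_sigmas * max(sigma, 1e-9)


# Synthetic history: 4 weeks, busy on weekdays (about 100), quiet at weekends (about 20).
start = datetime(2019, 1, 7)  # a Monday
history = []
for h in range(4 * 7 * 24):
    ts = start + timedelta(hours=h)
    level = 100.0 if ts.weekday() < 5 else 20.0
    history.append((ts, level + (h % 5)))  # small deterministic wiggle

baseline = fit_baseline(history)
saturday_noon = datetime(2019, 2, 9, 12)
monday_noon = datetime(2019, 2, 4, 12)
print(is_anomaly(baseline, saturday_noon, 21.0))   # quiet weekend value
print(is_anomaly(baseline, saturday_noon, 100.0))  # weekday-sized load on a Saturday
print(is_anomaly(baseline, monday_noon, 100.0))    # the same load on a Monday
```

The same value of 100 is judged differently on a Saturday and on a Monday, which is what a single fixed threshold cannot do.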
Live TICK-L demo
Telegraf + InfluxDB + Chronograf + Kapacitor + Loud ML (AutoML)
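In this stack, Telegraf normally collects the metrics and writes them to InfluxDB. As a rough illustration of what one data point looks like on the wire, the sketch below formats a point in InfluxDB line protocol. The measurement, tag and field names are hypothetical, and the sketch handles float fields only.

```python
def escape_tag(text: str) -> str:
    """Escape commas, equals signs and spaces, as line protocol requires
    for tag keys, tag values and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_measurement(text: str) -> str:
    """Measurement names only need commas and spaces escaped."""
    return text.replace(",", "\\,").replace(" ", "\\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp.
    This sketch supports float fields only."""
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_tag(k)}={float(v)}" for k, v in fields.items()
    )
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "http_requests",                      # hypothetical measurement name
    {"host": "web 01", "region": "eu"},   # hypothetical tags
    {"count": 42, "latency_ms": 12.5},    # hypothetical fields
    1547596800000000000,                  # 16 January 2019 00:00:00 UTC, in nanoseconds
)
print(line)
```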
Useful resources
• Website: loudml.io
• Blog: medium.com/loud-ml
• GitHub: github.com/regel/loudml
Data pipeline
• Metrics and logs collection
• Pre-processing, feature engineering
• Database storage
• Machine Learning
• Automation and alerting
• DataViz
Diagram labels: T, (K), (C), (I), (L), (K)
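For the automation and alerting stage, the deck relies on Kapacitor to forward alerts to PagerDuty. The sketch below only illustrates what a trigger event for the PagerDuty Events API v2 looks like once an anomaly score is available. The routing key is a placeholder, the source name is hypothetical, and the mapping from score to severity is an assumption for this example. Nothing is sent over the network.

```python
import json


def build_trigger_event(routing_key, summary, source, score):
    """Build a PagerDuty Events API v2 'trigger' event as a dict.
    The score-to-severity mapping is a hypothetical choice for this sketch."""
    if score >= 0.9:
        severity = "critical"
    elif score >= 0.7:
        severity = "error"
    else:
        severity = "warning"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            "custom_details": {"anomaly_score": score},
        },
    }


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",   # placeholder, not a real key
    summary="Abnormal number of transactions",
    source="shop-frontend",               # hypothetical host or service name
    score=0.93,
)
body = json.dumps(event, indent=2)
print(body)
```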
Low false positives (ie, low noise)
How to evaluate ML fitness in a given application
Credits: dataschool.io
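Precision: P = TP/(TP+FP)
Recall: R = TP/(TP+FN)
F1-score: F1 = 2/(1/P + 1/R)
TP, FP and FN are the counts of true positives, false positives and false negatives.
[Figure labels: Recall, Precision]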
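The three formulas above can be checked on a small worked example. The alert counts below are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when nothing was flagged."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there was nothing to find."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of P and R: 2 / (1/P + 1/R), or 0.0 if either is zero."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 2.0 / (1.0 / p + 1.0 / r)


# Hypothetical alert counts for one week of monitoring.
tp, fp, fn = 8, 2, 2
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r, f1_score(p, r))
```

A detector that raises many false alerts lowers P, one that misses incidents lowers R, and the F1-score drops if either is poor.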
Donut
arXiv:1802.03903
April 23-27, 2018, Lyon
Neural nets
[Diagram: Encode, Decode]
Donut neural nets
[Diagram: Encode, Sampling, Decode; Baseline; Reconstruction probability]
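The encode, sampling and decode steps can be sketched as a toy example. Donut (arXiv:1802.03903) is based on a variational auto-encoder; the code below is not that network. The encoder and decoder are hand-written stand-ins with fixed, untrained parameters, and every name and constant is an assumption made for illustration. It only shows how a reconstruction probability can be estimated by sampling.

```python
import math
import random


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def encode(window):
    """Toy encoder: maps a window to the mean and std of a 1-D latent z.
    A real model would learn this mapping; here it is the window average."""
    z_mean = sum(window) / len(window)
    return z_mean, 0.1


def decode(z, size):
    """Toy decoder: reconstructs a flat window at level z with fixed noise."""
    return [z] * size, [0.5] * size


def reconstruction_log_prob(window, n_samples=100, rng=None):
    """Monte Carlo estimate of E_q(z|x)[log p(x|z)]: encode, sample z,
    decode, then average the log-likelihood of the observed window."""
    rng = rng or random.Random(0)
    z_mean, z_std = encode(window)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(z_mean, z_std)
        x_mean, x_std = decode(z, len(window))
        total += sum(
            gaussian_log_pdf(x, m, s) for x, m, s in zip(window, x_mean, x_std)
        )
    return total / n_samples


normal_window = [1.0, 1.1, 0.9, 1.0, 1.05]
spiky_window = [1.0, 1.1, 0.9, 1.0, 6.0]
print(reconstruction_log_prob(normal_window))
print(reconstruction_log_prob(spiky_window))
```

The window with a spike gets a lower score because the decoder cannot reconstruct it well, and a low score is what would be flagged as abnormal.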
Donut is cool
• Donut has interesting properties.
• Low false positives (ie, low noise).
• It is as good as it gets.
• F1-score = 0.7 to 0.9, from arXiv:1802.03903.
• It can understand seasonality in the data.
• It can learn from labels.
Why Loud ML
Loud ML 1.5 beta
• Donut, plus more:
• Near real time, low alert delay.
• Running at scale, 24/7, 10,000+ users, hosts, applications.
• Can learn and reinforce continuously using live data.
• Developer friendly: fast to deploy, runs on CPUs or GPUs.
Why Loud ML
• Fast ML deployment for time series data:
• the goal is to remove all the hurdles in AI.
• Explainable: gives the probability (%) of observing specific values, which is easy to interpret.
• Agnostic of the underlying database.
• Accessible: the best ML, at a fraction of the cost.
Thank You
Interested in joining the beta?
loudml.io/contact
@loud_ml
FAQ
How well does it run?
Will the model continue to learn?
What are the options for
feature engineering?
