What DevOps Need to Know About AIOps
Bala Venkatrao
VP, Products at Unravel Data
January 17, 2019
Polling questions
(1) What are your most common big data application challenges?
(select all that apply)
- Performance tuning for slow applications
- Root cause analysis for application failures
- Establishing meaningful SLAs
- Detecting runaway queries
(2) What are your most common operational challenges for big data clusters?
(select all that apply)
- Chargeback/Showback for multi-tenant clusters
- Visibility into resource utilization
- Job and workload management
- Cluster tuning
- Capacity Planning
(3) What tools do you use today for troubleshooting and tuning your big data applications?
(select all that apply)
- Cloudera Manager/Apache Ambari
- YARN/Spark WebUI
- Dr. Elephant or other open source tools
- Log management tools: Splunk, ELK, etc.
Challenge: Operationalizing Modern Data Applications
What does it mean for new
data-driven applications
and analytics to be
enterprise grade?
Impact of poor performance and failures in the data pipeline
Low
Productivity
Sub-Optimized
Resources
Lack of
Reliability
Cascading Problems Impact Applications and Operations
LACK OF SINGLE,
CORRELATED VIEW
OUT OF CONTROL COSTS.
POORLY UTILIZED INFRA.
REACTIVE & SLOW
TO FIX PROBLEMS
APPLICATIONS
OPERATIONS
Examples of Big Data Apps: ETL, Analytics, Machine Learning, IoT, etc.
Typical Big Data Architecture
RDBMS
SOCIAL
SENSOR
MACHINE
ERP
MOBILE
Data Sources
Modern Data Pipelines are Varied and Complex
ETL DATA
PIPELINE
STREAM
PROCESSING
RAW
DATA
COMPUTED
DATA
MESSAGING
RESULT STORE
QUERY DATA
REPORT
ML APPS
IoT APPS
ANALYTICS
B.I.
ALERTS
SERVICES
Data Collection
Real-time / Batch Process
Result Store
Data Consumer
Tackling Complexity in Big Data Applications
DevOps and AIOps
As big data adoption grows, manually intervening in
hundreds of jobs running on thousands of nodes becomes impractical
AI Powers Application Performance Management (APM) for Big Data
Essential Elements of an AIOps Solution for Big Data
APM
• Data Collection and Correlation
  - Observe and collect all relevant data
• Operational Data Model
  - AI-assisted monitoring, troubleshooting, tuning, and managing requires a data model
• Analytics
  - Statistical analysis – correlate, classify, extrapolate from operational data
  - Predictive/Prescriptive analytics – forecasting and recommendations for capacity
  - Pattern and anomaly detection, root-cause analysis
  - Context, topology and coded expertise
• Automation
  - Auto-tuning of applications and resources
  - Cluster load balancing and job scheduling
  - Autonomous response to alerts and failures
Data Collection
and Correlation
Modern Data
Apps and Stack Data Model Analytics Automation
Statistical
Predictive/
Prescriptive
Anomaly
Detection
Context/
Topology
Auto-tuning
Cluster
Operations
Resource
Management
Autonomous
Remediation
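To make the "pattern and anomaly detection" element of this slide concrete, here is a minimal sketch in Python. It is an illustration only and not how Unravel or any particular AIOps product works: the function name `find_anomalies`, the sample durations, and the threshold of 3 standard deviations are all assumptions chosen for the example. Each run is compared against the mean and standard deviation of the other runs, so a single extreme value does not hide itself by inflating the statistics.

```python
from statistics import mean, stdev


def find_anomalies(durations, threshold=3.0):
    """Return the indices of runs whose duration deviates strongly from the rest.

    Each run is scored against the mean and sample standard deviation of all
    *other* runs (leave-one-out), which keeps one outlier from masking itself.
    `threshold` is a hypothetical cut-off expressed in standard deviations.
    """
    anomalies = []
    for i, value in enumerate(durations):
        others = durations[:i] + durations[i + 1:]
        if len(others) < 2:
            continue  # not enough history to estimate spread
        mu = mean(others)
        sigma = stdev(others)
        if sigma == 0:
            # All other runs are identical; any different value stands out.
            if value != mu:
                anomalies.append(i)
            continue
        if abs(value - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical durations (seconds) of the same nightly job over seven runs.
runs = [310, 298, 305, 320, 301, 915, 307]
print(find_anomalies(runs))
```

A production system would likely use more robust statistics and account for seasonality, but the sketch shows the basic idea of turning collected operational data into a flag that can drive alerting or remediation.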
Without AI, Big Data APM is a manual, logistical challenge
One complete correlated view
with built-in AI and ML.
Multiple tools, no complete
view, no intelligence.
Big Data APM
Without AI
AI-Powered
Big Data APM
Unravel: First AIOps Solution for Big Data APM
Full-stack, Intelligent, Autonomous
AIOps Use Cases for Unravel
Automated Cloud Cost Management
• Optimize cost by right-sizing cloud
images
• Optimize cost by choosing the optimal
price plan
Automated Workload Management
• Eliminate CPU, Memory, Network I/O and
Disk I/O contention
• Correctly size VMs and Cloud Images
• Place VMs in the best Hosts and Clusters
Automated Event Management
• De-duplicate events
• Support a collaborative (DevOps)
problem resolution process
Automated Performance Optimization
and Remediation
• Automatically learn the performance
characteristics of apps and the supporting stack
• Automatically optimize for a chosen KPI
(performance, efficiency)
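The "de-duplicate events" use case above can be sketched in a few lines of Python. This is a hedged illustration, not Unravel's event pipeline: the event fields (`ts`, `host`, `check`), the function name `deduplicate`, and the 300-second window are hypothetical. Events sharing the same host and check are collapsed when they arrive within the window of the previous event for that key.

```python
def deduplicate(events, window_seconds=300):
    """Collapse repeated events for the same (host, check) pair.

    An event is kept only if no event with the same key was seen within
    `window_seconds` before it. The window slides: every event, kept or
    suppressed, refreshes the last-seen time for its key, so a continuous
    storm of repeats yields a single kept event.
    """
    last_seen = {}
    kept = []
    for event in sorted(events, key=lambda e: e["ts"]):
        key = (event["host"], event["check"])
        previous = last_seen.get(key)
        if previous is None or event["ts"] - previous > window_seconds:
            kept.append(event)
        last_seen[key] = event["ts"]
    return kept


# Hypothetical alert stream; "ts" is seconds since an arbitrary start.
stream = [
    {"ts": 0, "host": "node-1", "check": "disk_io"},
    {"ts": 60, "host": "node-1", "check": "disk_io"},
    {"ts": 100, "host": "node-2", "check": "memory"},
    {"ts": 120, "host": "node-1", "check": "disk_io"},
    {"ts": 1000, "host": "node-1", "check": "disk_io"},
]
for event in deduplicate(stream):
    print(event["ts"], event["host"], event["check"])
```

Keying on host and check is a deliberate simplification; a real system might also group by application, cluster, or inferred root cause.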
Unravel Applies Machine Learning (ML) at various
levels
Error Views &
Analysis
Tuning Recommendation
Application
Management
Automated Tuning
Cluster
Optimization
Capacity Planning &
Forecasting
Operations
Management
Unravel Applies Machine Learning (ML) at various levels
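As a rough illustration of what "capacity planning and forecasting" can mean in practice, the sketch below fits a straight line to daily storage usage and extrapolates to the day the cluster would fill up. It is an assumption-laden example, not the method the slides describe: the function names `fit_line` and `days_until_full`, the sample figures, and the choice of a simple least-squares trend are all hypothetical.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def days_until_full(usage_tb, capacity_tb):
    """Estimate days from the last observation until usage reaches capacity.

    Returns None when the fitted trend is flat or shrinking, because a
    linear extrapolation then never reaches the capacity.
    """
    slope, intercept = fit_line(usage_tb)
    if slope <= 0:
        return None
    x_full = (capacity_tb - intercept) / slope
    return max(0.0, x_full - (len(usage_tb) - 1))


# Hypothetical daily usage in TB on a cluster with 200 TB of capacity.
daily_usage = [100, 110, 120, 130]
print(days_until_full(daily_usage, 200))
```

Real workloads rarely grow linearly, so a forecast like this is best treated as a first approximation to be refined with seasonality and per-tenant trends.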
A single pane of glass
for application &
operations management
Anomaly detection
to rapidly detect &
diagnose unpredictable
behavior
Proactive alerting
& remediation
of cluster/SLA problems
caused by applications
Automatic root
cause analysis
of workflows that missed
their SLA
Intelligent tuning
to make YARN (Spark, Hive)
applications faster &
more resource efficient
Unravel AIOps Demo
Before Unravel: Global 200 Financial Services Company
Complex Infrastructure
Landscape
Debugging Performance
Problems is a Challenge
Out of Control Costs
Sub-optimal Capacity
Management
Missing Insights on Data
Operations
Ineffective Alerting and
Automatic Actions
100+ projects
5,000+ jobs/day
600+ users globally
3PB+ of data
>$1M spent on unutilized storage
>5 different interfaces for job monitoring
>10 different logs for debugging a single
workflow
1-2 weeks to determine root
cause for performance issues
80% of the datasets can be candidates for
lower cost storage
99% of current alerts cannot be
correlated with performance issues
Customer Case Study
Complex Infrastructure
Landscape
Debugging Performance
Problems is a Challenge
Out of Control Costs
Sub-optimal Capacity
Management
Missing Insights on Data
Operations
Ineffective Alerting and
Automatic Actions
After Unravel: Global 200 Financial Services Company
Customer Case Study
Scale to unlimited # of users, apps,
data, projects
1 interface for job or workflow
monitoring
Reduce troubleshooting time by
98%
Maximize resource utilization
Save 60% on resource cost
70% reduction in support tickets
Live Q&A questions
1. Does Unravel support big data workloads in the cloud?
2. I am planning to migrate from an on-premises installation to the
cloud. Can Unravel help with that?
3. Does Unravel do more than monitoring?
The benefits of Unravel’s AI-powered APM Solution
Thank You
Free Full Feature Trial on Amazon EMR, Microsoft Azure
https://unraveldata.com/free-trial/
https://unraveldata.com/
