Timothy Chen
Co-founder / CTO
Why AIOps Matters for Kubernetes
About me
Apache PMC member (Mesos, Drill)
Container engineer lead at Mesosphere
Helped start the community effort for Spark on Kubernetes (Google, Red Hat, Palantir, Salesforce, etc.)
@tnachen
Agenda
Operating Kubernetes in production at scale
What is AIOps?
Operating Kubernetes in production at scale w/ AIOps!
Adoption
Infrastructure Maturity
Microservices
Microservices at scale
Open source software
Container Production Stack
Orchestration
Security Network Storage
Runtime
OS
Image
Infrastructure Monitoring
Orchestration
Network
Storage
Runtime
OS
Image
Users:
SRE
System Engineers
Ops
DevOps
Application Monitoring
Kafka
Microservice #1
Spark
Microservice #2
Microservice #3
Users:
SRE
DevOps
Developers
Metrics Explosion
New challenges for reactive monitoring
Containers != VMs
Cannot focus on host metrics anymore, because containers change dynamically
Which metric(s) to track?
Tens of thousands of metrics, massive numbers of alerts
How to identify or isolate the problem (correlate metrics)?
HW/Container/Host/Neighbor/Cluster/Dependency...
How to resolve these performance problems? (RCA)
Unmanaged resources -> Manual tuning
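One way to approach the "correlate metrics" question is to rank candidate metrics by how strongly they move with the symptom. The sketch below is only an illustration under assumptions: the metric names and sample values are hypothetical, and Pearson correlation is one simple choice among many, not a method the slides prescribe.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(var_x * var_y)


def rank_suspects(symptom, candidates):
    """Sort candidate metrics by absolute correlation with the symptom."""
    scored = [(name, pearson(symptom, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples, one value per scrape interval.
p99_latency_ms = [20, 22, 21, 35, 50, 48, 23, 21]
metrics = {
    "container_cpu_throttled": [0, 1, 0, 6, 12, 11, 1, 0],
    "host_disk_io_wait": [3, 2, 3, 3, 2, 3, 2, 3],
    "neighbor_network_tx": [5, 9, 4, 6, 5, 8, 7, 4],
}

ranking = rank_suspects(p99_latency_ms, metrics)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A high score only narrows down where to look; it does not establish the root cause, so the top suspects still need to be checked against the HW/container/host/neighbor/cluster/dependency layers.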
Proactive challenges
Goal: Maximize performance with minimum cost!
Capacity Planning / What-If Analysis
Cluster scheduling (interference, heterogeneous, etc.)
Optimal configuration
Orchestration
Network Storage
Runtime
OS
Application
VM
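The capacity planning / what-if item can be made concrete with a small back-of-the-envelope calculation. This is a minimal sketch under assumptions: the request rates, per-replica throughput, headroom, and node shape are all hypothetical numbers, and the node count is a lower bound that ignores fragmentation and system overhead.

```python
from math import ceil


def replicas_needed(peak_rps, rps_per_replica, headroom=0.2):
    """Replicas required to serve peak_rps while keeping spare headroom."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_rps = rps_per_replica * (1.0 - headroom)
    return ceil(peak_rps / usable_rps)


def nodes_needed(replicas, replica_cpu, replica_mem_gib, node_cpu, node_mem_gib):
    """Lower bound on node count, limited by the scarcer of CPU and memory."""
    per_node = min(node_cpu // replica_cpu, node_mem_gib // replica_mem_gib)
    if per_node < 1:
        raise ValueError("a replica does not fit on this node size")
    return ceil(replicas / per_node)


def what_if(peak_rps, growth_factors, rps_per_replica,
            replica_cpu, replica_mem_gib, node_cpu, node_mem_gib):
    """Map each hypothetical load growth factor to the nodes it would need."""
    plan = {}
    for growth in growth_factors:
        replicas = replicas_needed(peak_rps * growth, rps_per_replica)
        plan[growth] = nodes_needed(
            replicas, replica_cpu, replica_mem_gib, node_cpu, node_mem_gib
        )
    return plan


# Hypothetical service: 4100 req/s at peak, 250 req/s per replica,
# 2 CPU / 4 GiB per replica, 16 CPU / 64 GiB per node.
plan = what_if(
    peak_rps=4100,
    growth_factors=[1.0, 1.5, 2.0],
    rps_per_replica=250,
    replica_cpu=2,
    replica_mem_gib=4,
    node_cpu=16,
    node_mem_gib=64,
)
print(plan)
```

In practice the per-replica throughput is the hard part to know, which is why it has to be measured or learned from production data rather than assumed.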
https://netman.cs.tsinghua.edu.cn/contacts/projects/
AIOps is not a new concept
Decades of academic research around managing performance in computing...
- Cluster Scheduling
- SLA-driven interference, storage, network, caching...
Now is the time
- Workloads are more complex than before
- Infrastructure interfaces and data are becoming more standardized (Kubernetes, Prometheus, CTE, etc.)
- All metrics centrally collected
https://www.youtube.com/watch?v=dJxGtfTPVCg
The Difficulty of using the Public Cloud
Many instance types
Reserved, on-demand, spot instances
Many instance sizes (tens)
Application churn
Variability in load
Can we guarantee SLOs & minimize cost automatically?
Choosing the right VM/HW
Performance tuning
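The question of guaranteeing SLOs while minimizing cost can be framed as a small selection problem once each option has been measured. The sketch below assumes that measurement already exists: the instance labels, prices, instance counts, and latencies are hypothetical placeholders, not real cloud offerings or benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical instance type label
    hourly_cost: float     # price per instance-hour
    instances: int         # instances needed to carry the target load
    p99_latency_ms: float  # latency measured under that load

    @property
    def total_cost(self):
        return self.hourly_cost * self.instances


def cheapest_meeting_slo(candidates, slo_p99_ms):
    """Return the lowest-cost candidate whose measured p99 meets the SLO."""
    feasible = [c for c in candidates if c.p99_latency_ms <= slo_p99_ms]
    if not feasible:
        return None  # nothing measured so far satisfies the SLO
    return min(feasible, key=lambda c: c.total_cost)


# Hypothetical measurements for one service at its target load.
options = [
    Candidate("small-on-demand", 0.10, 12, 180.0),
    Candidate("medium-on-demand", 0.20, 5, 95.0),
    Candidate("large-on-demand", 0.40, 3, 60.0),
    Candidate("medium-spot", 0.07, 5, 140.0),
]

choice = cheapest_meeting_slo(options, slo_p99_ms=100.0)
print(choice.name if choice else "no option meets the SLO")
```

With tens of instance sizes, several purchase models, and load that varies over time, filling in this table by hand does not scale, which is the gap automated profiling is meant to close.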
Live traffic load testing
https://engineering.linkedin.com/blog/2017/02/redliner--how-linkedin-determines-the-capacity-limits-of-its-ser
Live traffic bottleneck
https://research.fb.com/publications/kraken-leveraging-live-traffic-tests-to-identify-and-resolve-resource-utilization-bottlenecks-in-large-scale-web-services/
Chaos Engineering
Batch scheduling
Improve deadlines by 5x-13x, increase utilization by 14-28%
Placement
Variability in user load
Bloated reservations due to performance variability
Twitter, Google
Oversubscription
Google search + Google brain (deep neural nets)
>90% hardware utilization
No latency violations for search
[Lo et al’15]
>90% HW utilization, no tail latency problems
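The oversubscription idea above, running best-effort work next to a latency-critical service without latency violations, can be sketched as a feedback loop on latency slack. This is a simplified illustration and not a reproduction of the cited work: the thresholds, the one-core step size, and the latency samples are assumptions chosen for the example.

```python
def adjust_best_effort_cores(current, measured_p99_ms, slo_p99_ms, max_cores,
                             grow_slack=0.2, shrink_slack=0.05):
    """Return the next core allocation for best-effort (batch) work.

    slack is the fraction of the latency SLO still unused by the
    latency-critical service sharing the machine.
    """
    slack = (slo_p99_ms - measured_p99_ms) / slo_p99_ms
    if slack < 0:
        return 0  # SLO violated: take every core back immediately
    if slack < shrink_slack:
        return max(current - 1, 0)  # too close to the SLO: back off
    if slack > grow_slack:
        return min(current + 1, max_cores)  # plenty of room: grow
    return current  # in between: hold steady


# Hypothetical p99 latency samples (ms) for the latency-critical service.
samples = [50, 60, 70, 85, 97, 110, 60]
cores = 0
history = []
for p99 in samples:
    cores = adjust_best_effort_cores(cores, p99, slo_p99_ms=100, max_cores=4)
    history.append(cores)
print(history)
```

Growing slowly and shrinking quickly is the design choice that matters here: a violation costs more than an idle core, so the controller gives up batch capacity as soon as the slack disappears.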
Hyperpilot
Help enterprises operate containers in production.
Using AI to learn how applications are behaving with your infrastructure and actively manage these resources.
Processors, Memory, Flash, Disk, NIC
Operating system
Cluster resource manager
Distributed analytics: Hadoop, Spark, …
IaaS
Online, data-intensive (OLDI): search, social nets
SaaS
Oversubscription in Action
Hiring Engineers in Silicon Valley / Taiwan!
tim@hyperpilot.io
Christos Kozyrakis (Stanford)
Michael Huang (ex-CoreOS)
Timothy Chen (ex-Mesosphere)
