How to Apply Machine Learning into Your CI/CD Pipeline
Alon Weiss / SeaLights
The Digital Transformation and DevOps
● Complexity
○ Architecture
○ Deployment & infrastructure
○ Technologies
○ Product → Service
○ Visibility
● Load
○ Automation → more tests, longer CI cycles
○ Shift left → more tests and build steps
○ Support more devices & platforms → More tests
○ Lack of resources = bottleneck
● Human resources
○ Hiring
○ Lack of time and expertise to research, plan & execute strategic engineering tasks
AI-Ops for CI/CD | Business Impact
“We see continued evidence that software speed, stability, and availability contribute to organizational
performance (including profitability, productivity, and customer satisfaction). Our highest performers
are twice as likely to meet or exceed their organizational performance goals.” - DORA 2019
AI-Ops for CI/CD | Business Impact
Alon Weiss
Chief Architect @SeaLights
alonweiss
alonw@sealights.io | www.sealights.io
The world’s #1 Software Quality Intelligence Platform that
accelerates execution without sacrificing quality.
Research & Inspiration
Our daily needs - Releasing Faster and with Higher Quality
Continuous Delivery: Reliable Software Releases Through Build, Test, and
Deployment Automation / Jez Humble, David Farley
Accelerate / Nicole Forsgren, Jez Humble and Gene Kim
“Market Guide for AIOps Platforms”, “Artificial Intelligence for IT
Operations Delivers Improved Business Outcomes” by Gartner
“Take The Mystery Out Of AI for IT Operations (AIOps)” by Forrester
AI-Ops | Definition / Gartner
AIOps platforms combine big data and machine
learning functionality to support all primary IT
operations functions through the scalable ingestion and
analysis of the ever-increasing volume, variety and
velocity of data generated by IT.
The goal of the analytics effort is the discovery of
patterns — novel elements used to look forward in time
to predict possible incidents and emerging usage
profiles — and to look backward in time to determine
the root causes of current system behaviors.
AI-Ops | Definition / Forrester
Software that applies AI/ML or other advanced
analytics to business and operations data to make
correlations and provide prescriptive and
predictive answers in real time. These insights
produce real-time business performance KPIs,
allow teams to resolve incidents faster, and help
avoid incidents altogether.
AI-Ops | Use cases
● IT groups
○ Monitoring (IT Infrastructure, SREs)
■ excessive data usage
■ communication patterns
■ intrusion detection
○ Security
○ Release pipeline (the majority of this talk)
■ Release faster and with greater confidence in quality
● Non-IT
○ Demand / Order processing / Customer Satisfaction
○ Business Health
○ Marketing
AI-Ops | Usage Patterns
● Noise reduction (e.g. Alert Consolidation)
● Root Cause Analysis (e.g. during/after
incidents)
● Incident prevention (extrapolate future
events to prevent breakdowns)
● Anomaly detection beyond thresholds and
rule-based systems
● Initiating action using automation or
escalation
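As an illustration of the noise-reduction pattern above, here is a minimal Python sketch of alert consolidation. The `Alert` fields, the `(service, check)` fingerprint and the five-minute window are assumptions made for this example; they are not taken from the talk or from any specific AIOps product, which would typically learn the correlations rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since epoch
    service: str      # hypothetical field: emitting service
    check: str        # hypothetical field: name of the failing check
    message: str


def consolidate(alerts, window_seconds=300):
    """Group alerts that share a (service, check) fingerprint and arrive
    within `window_seconds` of the previous alert in the same group."""
    incidents = []      # each incident is a list of related alerts
    open_incident = {}  # fingerprint -> index of its latest incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        idx = open_incident.get(key)
        if idx is not None and alert.timestamp - incidents[idx][-1].timestamp <= window_seconds:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            open_incident[key] = len(incidents) - 1
    return incidents
```

Five raw alerts from two checks would collapse into a handful of incidents, so the on-call engineer is paged once per incident instead of once per alert.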
AI-Ops | Existing Tools
● AI-Ops platforms
○ BigPanda - “Intelligent Automation for IT Incident Management”
○ Moogsoft - “Purpose-Built AIOps Platform for IT. Less Noise. Faster Fixes. Shorter Outages.”
● APMs
○ New Relic AI (NRAI) - launched the week before this talk
○ AppDynamics - “Central Nervous System”
○ Dynatrace - “Davis”
○ Splunk
Trends
“Current tools and processes aren’t
up to the task of monitoring today’s
apps and their underpinnings”
- Forrester
“AIOps tools show a ‘right-shift’
across the four stages of
monitoring — data acquisition,
aggregation, analysis and action —
with their core capabilities at data
aggregation and analysis. As the
technology matures further, users
will be able to leverage proactive
advice from the platform, enabling
the action stage.”
- Gartner
Applying AI-Ops to the CI/CD pipeline
AI-Ops for CI/CD | Data Sources
● GitHub / GitLab / Bitbucket / Azure DevOps
● JIRA / ServiceNow
● Jenkins / *Pipelines / others
● Test Stages - coverage per test, timing, pass/fail
● Static scanners - code quality, dependencies, automated code review
● APMs
● Logs - ELK, Splunk
● Calendars / IM status
● Provisioning - Terraform, Ansible, Puppet, Chef
● Salesforce
Release Pipeline Components (pre-production)
● Build Queues
○ Pain: Important jobs wait in the queue
○ Solution: Prioritize or parallelize
● Build + Package
○ Pain: Time
○ Solution: Smaller components, parallelize
● Tests (unit, integration, Selenium, e2e)
○ Pain: Time, time-to-failure, test failure RCA
○ Solution: Test Impact Analysis and Test Prioritization; pinpoint root cause to developers
● Infrastructure & Provisioning
○ Pain: Limited resources, cold starts
○ Solution: Provision ahead of time
● Risk management
○ Pain: Manual and mostly gut-based
○ Solution: AI-assisted (anomaly detection)
● Monitoring after deployment
○ Pain: Engineers are rarely involved
○ Solution: Notify stakeholders and facilitate RCA
AI-Ops for CI/CD | Optimized Build Queues
● Goal:
○ Some jobs are more important than others. Prioritize the queues.
● Data Sources:
○ JIRA - issue types, relations, priority, severity, custom fields
○ VCS - commit history (also available on Jenkins) and change area/scope
○ Jenkins - Historic build graph and timing
○ Salesforce - customer account importance
● Machine Learning algorithm family:
○ Graph Neural Network to determine priority
○ Regression to determine build length
● Usage:
○ A CI plugin to determine and assign the priority, then sort the queue
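To make the "assign a priority, then sort the queue" step concrete, here is a small Python sketch. It deliberately replaces the Graph Neural Network and the regression model with a hand-written scoring function; the `Job` fields, the weights and the idea that a lower number means higher priority are all assumptions for illustration, not part of the talk.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    issue_priority: int       # assumed: 1 = highest, as in many JIRA setups
    customer_tier: int        # assumed: 1 = most important account
    predicted_minutes: float  # assumed: output of a build-length regression


def priority_score(job):
    """Lower score = dispatch sooner. The weights are illustrative only;
    a trained model would replace this function."""
    return 10 * job.issue_priority + 5 * job.customer_tier + 0.1 * job.predicted_minutes


def order_queue(jobs):
    """Return jobs in the order a CI plugin would dispatch them."""
    # The enumerate index breaks ties, so Job objects are never compared.
    heap = [(priority_score(job), i, job) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Including the predicted build length as a small term means that, between two jobs of similar importance, the shorter one tends to run first, which keeps the average wait time down.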
AI-Ops for CI/CD | Smart Testing
● Goal:
○ Use Test Impact Analysis to run the minimal set of tests that are necessary
○ Fail fast
○ Eliminate Overlapping tests
● Data Sources:
○ Git/Build tool - build content and changes
○ Deep Coverage tools - per-test coverage
● Machine Learning algorithm family:
○ Classification
○ Statistical models
● Usage:
○ Find impacted tests by cross-referencing the changes and past test history
○ Deep integration with test runners so they run only those that are needed
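The cross-referencing step can be sketched in a few lines of Python. This is a deterministic simplification, assuming per-test coverage is available as a mapping from test name to the set of files it executed; the function names and data shapes are hypothetical, and the classification or statistical models mentioned above would refine the selection rather than be replaced by it.

```python
def impacted_tests(coverage, changed_files):
    """Select tests whose recorded coverage touches any changed file.

    coverage: dict mapping test name -> set of source files it executed.
    """
    changed = set(changed_files)
    return sorted(test for test, files in coverage.items() if files & changed)


def prioritize(tests, failure_rate):
    """Order tests so historically failure-prone ones run first (fail fast).

    failure_rate: dict mapping test name -> fraction of past runs that failed.
    Unknown tests are treated as never having failed.
    """
    return sorted(tests, key=lambda t: (-failure_rate.get(t, 0.0), t))
```

A test runner integration would call `impacted_tests` with the files changed in the build and hand the prioritized result to the runner as its test filter.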
AI-Ops for CI/CD | Flaky test detection
● Goal:
○ Isolate and weed out flaky tests
● Data Sources:
○ VCS - commit history (also available on Jenkins) and change area/scope
○ Deep Coverage tools - per-test coverage
○ Jenkins / Test Runners - Test results and history
● Machine Learning algorithm family:
○ Regression
○ Statistical models
● Usage:
○ If a test flips between passing/failing status without a detected change to explain it, it
may be flaky
○ Automatically quarantine tests, notify author
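The flip rule above can be expressed directly in Python. In this sketch, a "code fingerprint" stands for the state of the code a test covers (for example, a hash of the covered files at a given commit); that representation and the function name are assumptions for the example, and a real detector would add the statistical weighting the slide mentions.

```python
from collections import defaultdict


def find_flaky(runs):
    """Return tests that both passed and failed against identical code.

    runs: iterable of (test_name, code_fingerprint, passed) tuples, where
    code_fingerprint is a hypothetical hash of the code the test covers.
    """
    outcomes = defaultdict(set)
    for test, fingerprint, passed in runs:
        outcomes[(test, fingerprint)].add(passed)
    # Two distinct outcomes on one fingerprint = a flip with no change to explain it.
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})
```

Tests returned by `find_flaky` are candidates for automatic quarantine; a test that fails only after its covered code changed is not reported.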
AI-Ops for CI/CD | Infrastructure provisioning
● Goal:
○ Prevent resource contention in CI/CD
○ Minimize wait time for resource provisioning
● Data Sources:
○ Jenkins / * Pipelines - job history and graph
○ Infrastructure - historic demand & usage, real-time capacity
○ IM / Calendar - Engineers availability
● Machine Learning algorithm family:
○ Predictive analytics (Regression)
○ Statistical models
● Usage:
○ Update autoscaler targets continuously based on real-time and historic capacity
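A minimal Python sketch of the autoscaler update follows. It uses a per-hour average of past demand as a stand-in for the regression model, and every parameter (`jobs_per_agent`, `headroom`, `min_agents`) is an assumed tuning knob rather than something specified in the talk.

```python
import math
from collections import defaultdict
from statistics import mean


def hourly_profile(history):
    """Average historic demand per hour of day.

    history: iterable of (hour_of_day, concurrent_jobs) observations.
    """
    by_hour = defaultdict(list)
    for hour, jobs in history:
        by_hour[hour].append(jobs)
    return {hour: mean(values) for hour, values in by_hour.items()}


def target_agents(profile, hour, current_queue, jobs_per_agent=2, headroom=1.2, min_agents=1):
    """Combine the historic profile with real-time demand into a target size."""
    expected = max(profile.get(hour, 0.0), current_queue)
    return max(min_agents, math.ceil(expected * headroom / jobs_per_agent))
```

Calling `target_agents` shortly before each hour lets agents be provisioned ahead of the expected load, and the same number scales the pool down during quiet hours.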
AI-Ops for CI/CD | Smart Risk Management
● Goal:
○ Formally introduce the concept of Risk Management to the semi-automatic review
process
○ Find common risks
■ Untested code and configuration changes
■ Anomalies
● Test time
● Code paths
● Network usage pattern
● Git: big changes & unusual churn, new contributors, self-merging PRs, long-running PRs
● Data Sources:
○ VCS
○ APM and NPMD tools
AI-Ops for CI/CD | Smart Risk Management
● Machine Learning algorithm family: Anomaly Detection
● Usage:
○ Evaluate risk using Anomaly Detection, 3rd party tools (e.g. GitPrime)
○ Put smart quality gates in place
○ Require manual approval only when risks are too high
○ Determine APM thresholds and rollout configuration according to risk
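One way to picture a "smart quality gate" is the Python sketch below. It scores each metric of a change against its own history using the modified z-score (median and median absolute deviation, with the conventional 0.6745 scaling and 3.5 cut-off) and asks for manual approval only when something stands out. The metric names and the choice of this particular detector are assumptions for the example; the talk names only the Anomaly Detection family.

```python
from statistics import median


def robust_z(value, history):
    """Modified z-score: distance from the historic median in units of the
    median absolute deviation (MAD)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # No historic spread: any deviation at all is treated as anomalous.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


def needs_manual_approval(change, history, threshold=3.5):
    """Return (gate_triggered, flagged_metrics).

    change:  dict of metric name -> value for this change.
    history: dict of metric name -> list of past values.
    """
    flagged = [m for m, v in change.items() if abs(robust_z(v, history[m])) > threshold]
    return bool(flagged), flagged
```

Median-based statistics are used here because a few huge historic changes would inflate a mean and standard deviation and hide the next outlier.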
AI-Ops for CI/CD | Proactive Root Cause Analysis
● Goal:
○ Facilitate root cause analysis for production and test failures
● Data Sources:
○ VCS - commit history (also available on Jenkins) and change area/scope
○ ALMs - incidents, stack frames
○ Log collectors - capture messages, function names, stack frames
● Machine Learning algorithm family:
○ None! Good old text indexing
● Usage:
○ Cross reference the suspected code areas and logs with the commit history and
escalate to the contributors
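Since this use case needs only text indexing, a short Python sketch can show the whole idea. It assumes commits are available oldest-first as `(sha, author, touched_function_names)` tuples and that stack traces use the Python traceback layout; both are assumptions for the example, and other languages would need a different frame pattern.

```python
import re
from collections import defaultdict

# Matches the function name at the end of a Python traceback "File ..." line.
FRAME = re.compile(r", in (\w+)")


def build_index(commits):
    """Inverted index: function name -> list of (sha, author), oldest first.

    commits: iterable of (sha, author, touched_function_names), oldest first.
    """
    index = defaultdict(list)
    for sha, author, functions in commits:
        for name in functions:
            index[name].append((sha, author))
    return index


def suspects(stack_trace, index):
    """For each function in the trace, return (function, sha, author) of the
    most recent commit that touched it."""
    seen, result = set(), []
    for name in FRAME.findall(stack_trace):
        if name in index and name not in seen:
            seen.add(name)
            sha, author = index[name][-1]
            result.append((name, sha, author))
    return result
```

The returned authors are the contributors to escalate to; the same index can be queried with function names extracted from log messages.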
AI-Ops Market | Market Direction and Forecast
DevOps adoption is accelerating:
“The proportion of our highest performers has almost tripled, now comprising 20% of all teams.
This shows that excellence is possible - those that execute on key capabilities see the benefits.”
AIOps adoption is increasing, and platforms are the next big thing:
“By 2020, approximately 50% of enterprises will actively use AIOps technologies ... up from 10%
today” - Gartner
“Over the next 5 years, wide-scope AIOps platforms will become the de facto form-factor for the
delivery of AIOps functionality as opposed to AIOps functionality embedded in a monitoring tool
like APM” - Gartner
Thank You!
Of course, we’re hiring! :-)
Questions, anyone?

Editor's Notes

  • #3: Architecture - Microservices, Serverless technologies, FaaS/Lambda.
    Deployment - Monoliths → Microservices; the value of K8s comes at a cost.
    Technologies - Teams often choose their own mix, so it is common to see many technologies (Java, Node, Python, .NET, etc.).
    Product - A product is now a service and needs an ecosystem of monitoring and support.
    Visibility - Multiple tools (APMs, network monitoring, logs).
    Load - More services, more pipelines.
    This is the first slide. Need to polish this to be perfect. Hook. Tell them what you’re going to tell them.
    TODO: Insert quotes from research papers regarding the increasing requirements from CI/CD and DevOps.
    TODO: Insert stats from DORA 2019 (pg 5, 18) on speed, stability and availability contributing to organizational performance, and from there to profitability, productivity and customer satisfaction.
  • #8: IT-generated data grows by 2-3x per annum
  • #9: AIOps is described as extending monitoring from Application, through Infrastructure, to Business and Integrations. It’s tough to keep up with the demands from business stakeholders. I&O teams have become too siloed by discipline. 1st gen - observation, 2nd gen - diagnosis. The problem of visibility is getting worse and will continue to do so.
  • #10: IT - monitoring too many systems and too much data, using rules and thresholds instead of anomaly detection algorithms
  • #12: NR AI - noise reduction, improved correlations, augmented intelligence
  • #13: The central concept: The Change. Get it to production as soon as possible, with minimal human involvement and without sacrificing quality or security. Shifting left = “CI before the CI”, so we find and fix problems at the right time, before they affect others. Key principles of CI: build quality in (automation & fail fast); work in small batches; computers perform repetitive tasks, people solve problems; continuous improvement; everyone is responsible.
  • #16: Note: In a cloud-native, unconstrained datacenter, build queues are replaced with scheduling algorithms (e.g. kubernetes-native solutions like Jenkins X, Tekton, Gitlab)
  • #17: Fail fast - Shift left using Pull Requests. Run as many test stages as possible before changes are merged.
  • #18: Can be further correlated with ALM and Log collectors to find the difference between the “passing” and “failing” profiles
  • #19: This also minimizes Infrastructure cost by scaling down during off-peak hours
  • #20: Every change imposes a risk, but proper risk management is not a common practice. Change-advisory boards are manual and error-prone. Shoutout to GitPrime and their “20 patterns to watch for in your engineering team”. Add New Relic, Dynatrace, AppD logos.
  • #21: Risk management should be an optional final step before accepting the change
  • #22: The same data can flow backwards: give hints to developers when they are touching code that’s known to be sensitive or error-prone. This “risky code” can be shown during code review to ensure that future modifications are well tested.
  • #23: Shoutout to GitPrime and their “20 patterns to watch for in your engineering team”