Monitoring a cloud CI system,
using Jenkins as an example
Alexander Akbashev
HERE Technologies
HERE Technologies
HERE Technologies, the Open Location Platform company, enables
people, enterprises and cities to harness the power of location. By
making sense of the world through the lens of location we empower
our customers to achieve better outcomes – from helping a city
manage its infrastructure or an enterprise optimize its assets to
guiding drivers to their destination safely.
To learn more about HERE, including our new generation of cloud-based
location platform services, visit http://360.here.com and www.here.com
Context
• Every change goes through pre-submit validation
• Feedback time is 15-40 minutes
• A lot of products and platforms
• 6 Jenkins masters
• Up to 185k runs per day on the biggest one
• 20k runs per day on average
If something goes wrong…
What can go wrong?
Compilation is broken
Tests are broken
Network issues
Jenkins master crashed
EC2 plugin does not raise new nodes
No connection to labs
Cannot clean up workspace
AWS S3 is down
Git master dies
Git replica is broken
Compiler cache was invalidated
Hit the limit of API calls to AWS
Job was deleted
UI is blocked
Queue is too big
System.exit(1)
NFS stuck
Deadlock in Jenkins
Staging started to give feedback
Restarted the wrong server
Cloud CI Monitoring
Monitoring Jenkins
Out of the box
Monitoring Jenkins
© http://www.jenkinselectric.com/monitoring
https://jenkins.io/doc/book/system-administration/monitoring/
https://wiki.jenkins.io/display/JENKINS/Monitoring
Monitoring Plugin (March 2016)
+ Easy to install
+ Nothing to maintain
- When Jenkins is slow, there is no monitoring
- Monitors mainly JVM stats
- Covers only one instance
- Not scalable
Monitoring Plugin (nowadays)
+ Easy to install
+ Nothing to maintain
- When Jenkins is slow, there is no monitoring
- Monitors mainly JVM stats
- Covers only one instance
- Not scalable
+ Exports to InfluxDB/CloudWatch/Graphite
Let’s craft our own monitoring!
Design our own monitoring (March 2016)
Jenkins --(API)--> Python --(API)--> InfluxDB
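import jenkins
from influxdb import InfluxDBClient

# Poll the Jenkins queue and record why each item is still waiting.
# Connection settings and the "queue" measurement name are illustrative.
j = jenkins.Jenkins("http://jenkins.host")
influx_server = InfluxDBClient(host="influxdb.host", port=8086, database="stats")
for q in j.get_queue_info():
    influx_server.write_points([{
        "measurement": "queue",
        "tags": {"name": q["task"]["name"]},
        "fields": {"reason": q["why"]},
    }])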
Design our own monitoring (March 2016)
+ Simple
+ Worked for 18 months
- Relies on polling
- Common code has to be maintained
- Not all data is accessible
- Puts extra load on Jenkins
Let’s do event-based monitoring!
Jenkins Core
public abstract class RunListener<R extends Run> implements ExtensionPoint {
    public void onCompleted(R r, TaskListener listener) {}
    public void onFinalized(R r) {}
    public void onStarted(R r, TaskListener listener) {}
    public void onDeleted(R r) {}
}
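The listener idea can be illustrated outside Jenkins. The Python sketch below is not Jenkins code: the names Run, RunListener, RecordingListener and Dispatcher are hypothetical stand-ins. It only shows why an extension point of this shape removes the need for polling: the core calls every registered listener at the moment a run changes state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """Hypothetical stand-in for a Jenkins run."""
    job: str
    number: int


class RunListener:
    """Base class: listeners override only the callbacks they care about."""

    def on_started(self, run: Run) -> None:
        pass

    def on_completed(self, run: Run) -> None:
        pass

    def on_finalized(self, run: Run) -> None:
        pass


@dataclass
class RecordingListener(RunListener):
    """Collects an event per callback instead of polling for state."""
    events: List[str] = field(default_factory=list)

    def on_started(self, run: Run) -> None:
        self.events.append(f"started {run.job}#{run.number}")

    def on_finalized(self, run: Run) -> None:
        self.events.append(f"finalized {run.job}#{run.number}")


class Dispatcher:
    """Hypothetical core: notifies every registered listener."""

    def __init__(self) -> None:
        self.listeners: List[RunListener] = []

    def register(self, listener: RunListener) -> None:
        self.listeners.append(listener)

    def fire(self, event: str, run: Run) -> None:
        # Look up the matching callback (for example on_started) and call it.
        for listener in self.listeners:
            getattr(listener, f"on_{event}")(run)
```

A listener that ignores an event simply inherits the empty callback, which is the same design the abstract class above uses.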
Groovy Event Listener Plugin (April 2016)
• Runs custom Groovy code for every event
• Supports RunListener
Groovy Event Listener Plugin (nowadays)
• Runs custom Groovy code for every event
• Supports RunListener, ComputerListener, ItemListener, QueueListener
• Works at scale
• Allows a custom classpath
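Groovy Event Listener Plugin
import jenkins.metrics.impl.TimeInQueueAction  // provided by the Metrics plugin

if (event == 'RunListener.onFinalized') {
    def build = Thread.currentThread().executable
    // time the build spent waiting in the queue before it started
    def queueAction = build.getAction(TimeInQueueAction.class)
    def queuing = queueAction.getQueuingDurationMillis()
    log.info "number=${build.number}, queue_duration=${queuing}"
}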
OK, we have events, but how
do we fill the database?
FluentD
• Processes 13,000 events/second/core
• Retry/buffer/routing
• Easy to extend
• Simple
• Reliable
• Memory footprint is 30-40 MB
• Written in Ruby
FluentD
Jenkins --(JSON)--> FluentD --(JSON)--> InfluxDB
                    FluentD --(SQL)--> Postgres
                    FluentD --> Logs
FluentD. Config.
<match **.influx.**>
  type influxdb
  host influxdb.host
  port 8086
  dbname stats
  auto_tags "true"
  timestamp_tag timestamp
  time_precision s
</match>
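To make the effect of this rule more concrete, here is a Python sketch of how a flat JSON event could end up as an InfluxDB line-protocol point. It is not the plugin's code. It assumes that with auto_tags enabled string values become tags and numeric values become fields, and that the key named by timestamp_tag carries the point time. The function name to_line_protocol and the measurement name "bfa" are made up for illustration.

```python
from typing import Any, Dict


def _escape(text: str, special: str) -> str:
    """Backslash-escape every character of `special` found in `text`."""
    for ch in special:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement: str, event: Dict[str, Any],
                     timestamp_key: str = "timestamp") -> str:
    """Render one flat event as an InfluxDB line-protocol point."""
    tags: Dict[str, str] = {}
    fields: Dict[str, str] = {}
    timestamp = None
    for key, value in event.items():
        if key == timestamp_key:
            timestamp = int(value)           # taken as the point time
        elif isinstance(value, bool):
            fields[key] = "true" if value else "false"
        elif isinstance(value, int):
            fields[key] = f"{value}i"        # integers carry an "i" suffix
        elif isinstance(value, float):
            fields[key] = repr(value)
        else:
            tags[key] = str(value)           # assumption: strings become tags
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={v}" for k, v in sorted(fields.items())
    )
    line = f"{_escape(measurement, ', ')}{tag_part} {field_part}"
    return line if timestamp is None else f"{line} {timestamp}"
```

Calling to_line_protocol("bfa", {"cause": "Jar Hell", "number": 42, "timestamp": 1500000000}) yields a single line with the cause as a tag, the build number as an integer field and the timestamp at the end.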
OK, we have events and we have FluentD,
but how do we pass events to it?
FluentD Plugin for Jenkins
• Developed at HERE Technologies
• Very simple
• Supports JSON
• Runs as a post-build step
https://github.com/jenkinsci/fluentd-plugin
Great! Let’s do something with
this data!
Infra issues
Build Failure Analyzer (config)
Build Failure Analyzer (code)
import com.sonyericsson.jenkins.plugins.bfa.model.FailureCauseBuildAction

// jobName and node are assumed to be set earlier in the listener script
def bfa = build.getAction(FailureCauseBuildAction.class)
def causes = bfa.getFailureCauseDisplayData().getFoundFailureCauses()
for (def cause : causes) {
    final Map<String, Object> data = new HashMap<>()
    data.put("name", jobName)
    data.put("number", build.number)
    data.put("cause", cause.getName())
    data.put("categories", cause.getCategories().join(','))
    data.put("timestamp", build.timestamp.timeInMillis)
    data.put("node", node)
    // send one event per failure cause to FluentD under the influx.bfa tag
    context.logger.log("influx.bfa", data)
}
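Once such events are stored, a dashboard panel for infra issues comes down to counting failures per category. A minimal Python sketch of that aggregation, assuming each event is a dict with the same keys as the Groovy map above. The sample events, their category names and the function name count_by_category are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_category(events: Iterable[Dict[str, object]]) -> Counter:
    """Count failure causes per category; categories arrive comma-joined."""
    counts: Counter = Counter()
    for event in events:
        categories = str(event.get("categories", ""))
        # Split the joined string back into names and skip empty entries.
        for category in filter(None, (c.strip() for c in categories.split(","))):
            counts[category] += 1
    return counts


# Hypothetical events in the shape produced by the listener script.
sample_events = [
    {"name": "job-a", "number": 1, "cause": "Jar Hell", "categories": "infra"},
    {"name": "job-b", "number": 5, "cause": "Compilation error", "categories": "code"},
    {"name": "job-a", "number": 2, "cause": "No space left", "categories": "infra,disk"},
]
print(count_by_category(sample_events).most_common())
```

In practice the same grouping would more likely be done by a query in the time-series database; the sketch only shows the shape of the computation.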
Build Failure Analyzer (result)
Speed up compilation
CCache (problem)
CCache
• New node: empty local cache
• Old local cache: a lot of misses
+ A distributed cache solves both of these problems
- Once a year it spreads a problem across the whole cluster
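The cache misses mentioned here are easiest to follow as a hit rate per node. A small Python sketch, assuming the three counters come from ccache statistics (direct hits, preprocessed hits, misses). The function name and the sample numbers are made up.

```python
def ccache_hit_rate(direct_hits: int, preprocessed_hits: int, misses: int) -> float:
    """Return the share of compilations served from the cache (0.0 to 1.0)."""
    hits = direct_hits + preprocessed_hits
    total = hits + misses
    return hits / total if total else 0.0


# A freshly started node has an empty local cache, so almost everything misses.
fresh_node = ccache_hit_rate(direct_hits=5, preprocessed_hits=0, misses=95)
# A node backed by a warm shared cache serves most compilations from it.
warm_node = ccache_hit_rate(direct_hits=850, preprocessed_hits=50, misses=100)
print(f"fresh: {fresh_node:.0%}, warm: {warm_node:.0%}")
```

Sending this value as a numeric field with every build makes a cold or stale cache visible on the dashboard.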
CCache (result)
Improve node utilization
LoadBalancer (problem)
LoadBalancer (solution)
• Default balancer is optimized for cache
• Cron jobs are pinned to different hosts
• Nothing to terminate or stop: no idle nodes
+ Saturate Node Load Balancer: always puts all load on the oldest node
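The saturation strategy can be sketched in a few lines of Python. This is not the plugin's implementation: the Node class, its fields and the functions pick_node and assign are hypothetical, and the sketch only illustrates the rule "fill the oldest node that still has a free executor".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    launched_at: int   # launch time, seconds since the epoch
    executors: int
    busy: int = 0

    @property
    def has_capacity(self) -> bool:
        return self.busy < self.executors


def pick_node(nodes: List[Node]) -> Optional[Node]:
    """Return the oldest node that still has a free executor."""
    candidates = [n for n in nodes if n.has_capacity]
    return min(candidates, key=lambda n: n.launched_at, default=None)


def assign(nodes: List[Node], builds: int) -> List[str]:
    """Place builds one by one and return the chosen node names."""
    placement = []
    for _ in range(builds):
        node = pick_node(nodes)
        if node is None:
            break  # every node is saturated; the build stays queued
        node.busy += 1
        placement.append(node.name)
    return placement
```

With three two-executor nodes and three builds, the oldest node takes two builds and the next oldest takes one, so the youngest node stays idle and can be terminated.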
LoadBalancer (result)
Minimize impact
Jar Hell (problem)
java.io.InvalidClassException: hudson.util.StreamTaskListener;
local class incompatible: stream classdesc serialVersionUID = 1,
local class serialVersionUID = 294073340889094580
Jar Hell (explanation)
• Bug in the Jenkins remoting layer
• If the first run that uses a class is aborted, that class is “lost”
• Does not recover
• Huge impact
Jar Hell (“solution”)
if (cause.getName().equals("Jar Hell")) {
    Node node = build.getBuiltOn()
    // never relabel the master itself
    if (node != Jenkins.getInstance()) {
        // replacing the labels keeps new builds off the broken node
        node.setLabelString("disabled_jar_hell")
    }
}
Our daily dashboard
Resources
• FluentD
• InfluxDB plugin for FluentD
• JavaGC plugin for FluentD
• FluentD Plugin
• Groovy Event Listener Plugin
• Build Failure Analyzer Plugin
• Saturate Node Load Balancer Plugin
• CCache with memcache
• InfluxDB
Q/A?
alexander.akbashev@here.com
Github: Jimilian
