Docker, Monitoring and SLURM Specific Visualisations
QNIBTerminal @ work
Agenda
• Docker in a Nutshell
• QNIBTerminal, QNIBMonitoring, QNIBInventory
• SLURM Autogenerated Dashboards
About Me
• Christian Kniep
@CQnib, christian@qnib.org
• >10y Iteration
SysAdmin, SysOps, SysEngineer, R&D Engineer
DevOps @Locafox (hyper-scale web-service)
• Founder of QNIB Solutions
Holistic System Management
Containerization of SysOps and Workload
Consultancy / Software Design & Development
Docker in a Nutshell
Multiple Guests
[Figure: two layer stacks, built up step by step.
Traditional Virtualisation: SERVER, HOST KERNEL, host Userland, HYPERVISOR (Type II), then three guests, each with its own KERNEL, Userland and SERVICE.
Containerisation (Docker): SERVER, HOST KERNEL, host Userland, then Userland (#1) and Userland (#2), each running a SERVICE.]
Docker Internal View
• Containers are ‘grouped processes’
isolated by Kernel Namespaces (PID, network, mount, …)
resource restrictions applicable through CGroups
[Figure: HOST with container1 (bash, ls -l), container2 (apache), container3 (mysqld) and container4 (slurmd); an ssh process is also shown]
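The 'grouped processes' idea can be inspected from the outside on a Linux host. The following is a minimal sketch, not taken from the talk: it parses the text of /proc/<pid>/cgroup (lines of the form "hierarchy-id:controllers:path") and reads the namespace links under /proc/<pid>/ns. The function names and the sample path /docker/abc123 are assumptions made for illustration.

```python
import os


def parse_cgroup_file(text):
    """Parse the contents of /proc/<pid>/cgroup into {controller: path}."""
    result = {}
    for line in text.strip().splitlines():
        # Each line has the form "hierarchy-id:controllers:path".
        _, controllers, path = line.split(":", 2)
        # cgroup v2 leaves the controller field empty; label it "unified" here.
        names = controllers.split(",") if controllers else ["unified"]
        for name in names:
            result[name] = path
    return result


def namespace_ids(pid="self"):
    """Return {namespace: link target} from /proc/<pid>/ns (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in os.listdir(ns_dir)}


# Hypothetical sample, shaped like the cgroup file of a containerised process.
sample = "4:memory:/docker/abc123\n3:cpu,cpuacct:/docker/abc123\n"
print(parse_cgroup_file(sample))
```

Two processes whose namespace_ids() values match for a given namespace share that namespace; processes in different containers would normally show different values.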
Docker Workshop
• 1/2 Day, July 16th @ISC High Performance
Deep dive into the talking points
How Docker might impact System Operations & HPC Applications
Further discussion beyond what I am talking about today
Docker Workshop #2
• Full Day, September 28th @ISC Cloud&BigData
QNIBTerminal
• Framework of system containers to spin up stacks
[Figure: example stack, labelled SLURM]
QNIBMonitoring
• Current monitoring systems do not connect
overlaying metrics with log events
use/build inventory system to provide connections usually hidden
user's perspective and scope/context/background
• QNIBMonitoring provides
open metrics system (system / application metrics, log aggregates)
log event framework for consuming/processing/visualising events
auto discovery / configuration through Consul
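As a rough illustration of an 'open metrics system': the stack shown later in the deck includes carbon and graphite-api, and Carbon accepts a plaintext line of the form "path value timestamp". The sketch below assumes that protocol is used; the hostname carbon, the port 2003 (Carbon's usual plaintext port) and the metric name are assumptions, not details from the slides.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(path, value, host="carbon", port=2003):
    """Send a single metric to a Carbon listener (host and port are assumed)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(carbon_line(path, value).encode("ascii"))


# Hypothetical metric name; formatting only, nothing is sent here.
print(carbon_line("compute0.cpu.load1", 0.42, 1436000000), end="")
```

Any application or system collector that can open a TCP socket could feed such a backend this way, which is one reading of 'open'.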
• Logstash (Log/Event Monitoring)
• Grafana (Performance Monitoring)
• Overlay Metrics w/ Events
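One way to think about overlaying metrics with events is to attach each log event to the metric sample that was current when it happened. This is a sketch under that assumption only; the data shapes, function name and sample values are hypothetical and say nothing about how the dashboards in the talk do it.

```python
from bisect import bisect_right


def overlay(samples, events):
    """Attach each event to the latest metric sample at or before it.

    samples: list of (timestamp, value), sorted by timestamp
    events:  list of (timestamp, message), in any order
    Events older than the first sample are dropped.
    """
    times = [t for t, _ in samples]
    annotated = []
    for ts, message in sorted(events):
        idx = bisect_right(times, ts) - 1
        if idx >= 0:
            annotated.append((ts, message, samples[idx][1]))
    return annotated


# Hypothetical data: a load metric and two log events.
samples = [(100, 0.2), (110, 0.9), (120, 0.3)]
events = [(112, "job 1001 started"), (95, "node booted")]
print(overlay(samples, events))
```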
QNIBInventory
• Network Topology
• Installed Software
• SLURM Cluster
• Enrich Log/Events
• Help visualise connections
• Build up history
Cluster Use-Case
Context Sensitive Dashboards
• Multiple backgrounds have to be considered
Enduser (Engineer, Software Developer, Scientist)
Operations Personnel
Management
• Psychology plays an important role
Local rationality / context
10,000 ft overview vs. verifying hypotheses vs. reporting
Empower users to extend their domain knowledge by providing a toolset
Cluster Usecase
• Small SLURM cluster
couple of nodes, two user groups, couple of users
script & MPI workload
[Figure: container stack, listed per group as container name (service)]
srv backend consul
Compute: slurmctld (slurmctld), compute0 (slurmd), compute<N> (slurmd)
Performance: carbon (carbon), graphite-api (graphite-api), grafana (grafana)
Log/Events: elasticsearch (elasticsearch), logger (logstash), kibana (kibana), kopf (es-kopf)
Inventory: neo4j (neo4j), inventory (QNIBInv)
galaxy: postgres (postgres), galaxy (galaxy)
Management Context
• Live cluster Status
Utilisation per cluster / user / user-group
SLA met by SysOps
Most common jobs, misbehaving endusers
• Reports
per day / user / job-type / …
• Capacity Planning
utilisation over time, comparison of HW generations, global FS capacity
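Utilisation per cluster, user or user-group boils down to summing CPU time over job records. The sketch below assumes a hypothetical record layout (user, group, cpus, elapsed_s); the field names and figures are illustrative and are not SLURM's accounting schema.

```python
from collections import defaultdict


def usage_by(jobs, key):
    """Sum CPU-seconds per value of `key` (for example "user" or "group")."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["cpus"] * job["elapsed_s"]
    return dict(totals)


def utilisation(jobs, total_cpus, window_s):
    """Fraction of the available CPU-seconds in the window that jobs used."""
    used = sum(job["cpus"] * job["elapsed_s"] for job in jobs)
    return used / (total_cpus * window_s)


# Hypothetical job records for a one-hour window on an 8-CPU cluster.
jobs = [
    {"user": "alice", "group": "chem", "cpus": 4, "elapsed_s": 3600},
    {"user": "bob", "group": "phys", "cpus": 2, "elapsed_s": 1800},
    {"user": "alice", "group": "chem", "cpus": 1, "elapsed_s": 600},
]
print(usage_by(jobs, "user"))
print(usage_by(jobs, "group"))
print(round(utilisation(jobs, total_cpus=8, window_s=3600), 3))
```

The same aggregation, keyed by day or job type, would cover the report dimensions listed above.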
SLURM Dashboard
SLURM Inventory
• Nodes are connected to Partitions
• Jobs are connected to both
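The relationships above form a small graph. The deck's stack lists neo4j for the inventory; the sketch below deliberately swaps in a plain in-memory Python structure to show the shape of the model. The class, the node, partition and job names are all hypothetical.

```python
from collections import defaultdict


class Inventory:
    """Undirected graph of (kind, name) vertices, e.g. ("node", "compute0")."""

    def __init__(self):
        self.edges = defaultdict(set)

    def connect(self, a, b):
        """Record that two inventory items are connected."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbours(self, item, kind):
        """Return the sorted neighbours of `item` that are of the given kind."""
        return sorted(n for n in self.edges[item] if n[0] == kind)


inv = Inventory()
# Nodes are connected to partitions ...
inv.connect(("node", "compute0"), ("partition", "batch"))
inv.connect(("node", "compute1"), ("partition", "batch"))
# ... and jobs are connected to both.
inv.connect(("job", "1001"), ("partition", "batch"))
inv.connect(("job", "1001"), ("node", "compute0"))

print(inv.neighbours(("job", "1001"), "node"))
print(inv.neighbours(("partition", "batch"), "node"))
```

With these edges in place, a log event that mentions a job can be enriched with the nodes and partition it touched.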
SLURM Dashboard
Enduser Context
• Live progress of SLURM job
Monitor iteration speed to estimate workload behaviour
Get to know job while it’s running (instead of postmortem)
Introduce application profiling / log events (enhance feedback)
• Post Mortem
Get detailed report after job has finished
• MDO jobs
depending on outcome and progression submit next iteration(s)
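Monitoring iteration speed can be as simple as differencing timestamped iteration counters emitted by the job. A minimal sketch, assuming the application logs (unix_time, iteration) pairs; the function names and numbers are hypothetical.

```python
def iteration_rate(samples):
    """Average iterations per second between the first and last sample.

    samples: list of (unix_time, iteration) pairs in chronological order.
    """
    (t0, i0), (t1, i1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing time")
    return (i1 - i0) / (t1 - t0)


def eta_seconds(samples, total_iterations):
    """Estimate the remaining runtime, or None if the job is not progressing."""
    rate = iteration_rate(samples)
    if rate <= 0:
        return None
    return (total_iterations - samples[-1][1]) / rate


# Hypothetical progress events: 120 iterations per minute.
samples = [(0, 0), (60, 120), (120, 240)]
print(iteration_rate(samples))
print(eta_seconds(samples, 1000))
```

A rate that drops while the job is still running is the kind of signal that would otherwise only show up post mortem.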
SLURM Dashboard
SysOps Context
• Live cluster Status
USE method overviews (Utilisation/Saturation/Errors)
Anomaly detection (w/ and w/o humans)
Spotting abnormal behaviour
• Drill into monitoring
verify hypotheses about incidents/problems
correlate events, metrics and inventory
• Guide through ‘known problems’
closed feedback loops provide confidence
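A USE overview reduces each resource to three numbers: utilisation, saturation and errors. The sketch below does this for the CPU of one node; the field names, the 0.9 utilisation limit and the run-queue-based saturation measure are assumptions chosen for illustration, not the dashboards' actual definitions.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    node: str
    cpu_busy: float   # fraction of CPU time busy, 0.0 to 1.0
    run_queue: int    # runnable tasks
    cpus: int         # CPU count of the node
    errors: int       # error events seen in the interval


def use_summary(sample, util_limit=0.9):
    """Return utilisation, saturation and errors plus a simple attention flag."""
    saturation = max(0, sample.run_queue - sample.cpus)
    return {
        "node": sample.node,
        "utilisation": sample.cpu_busy,
        "saturation": saturation,
        "errors": sample.errors,
        "flag": sample.cpu_busy > util_limit or saturation > 0 or sample.errors > 0,
    }


# Hypothetical samples: one healthy node, one overloaded node.
print(use_summary(NodeSample("compute0", 0.35, 2, 4, 0)))
print(use_summary(NodeSample("compute1", 0.97, 9, 4, 0)))
```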
Central Logging
Galaxy
Galaxy Use-Cases
[Figure: a Galaxy WORKFLOW connected to SLURM, Log/Events, Metrics and Inventory]
• Model Assess Workflow in Galaxy
Easy to grasp (in contrast to Hadoop, Spark, …)
Event triggered, Cronjob?
Using idle compute resources
Thank you!
• Contact
christian@qnib.org
@CQnib, @_qnib
• Web
www.qnib.org (blog)
doc.qnib.org (Paper)
• Feel free…
…ask questions (now / later)
…ask for a Demo
