Monitoring Alerts and Metrics on
Large Power Systems Clusters
Marcelo Perazolo
Cognitive Systems Architect
IBM Systems
mperazolo@us.ibm.com
Nuremberg, Nov 4-7, 2019
http://osmc.de
• Introduction
• CORAL & Summit Supercomputer case
• Power Firmware Monitoring – The CRASSD open source project
• Power-Ops open source project – an open source collaboration
• Demo
• Conclusion
Agenda
Why Power/OpenPOWER is popular for certain Workloads
• Open Hardware Architecture
• Multiple vendors
• OpenPOWER Foundation
• CORAL: Collaboration of Oak Ridge, Argonne and Lawrence Livermore
• Summit is located at the Oak Ridge Laboratory, used for civilian research
• Sister project: Sierra supercomputer at Lawrence Livermore (nuclear weapons research)
• First supercomputer to reach exaOps performance
• Interconnected by ~185 miles of fiber optic cables
• ~ 5,600 sqft of data center floor space
• ~ 340 tons of hardware and overhead infrastructure
• ~ 13MW power consumption
• 4,608 Power9 AC922 22-core systems
• 27,648 NVIDIA GPUs (6 per node)
• 250 petabytes of storage
• 200Gbps InfiniBand bandwidth between nodes
• Delivers up to 200 petaFLOPS / 3 exaOps
• Helps researchers with AI / Big Data / Analytics and HPC capabilities
Case Study: The Summit Supercomputer
Summit: The Most Energy-Efficient Supercomputer
“The world’s smartest supercomputer is sharing data with its cooling
plant, reducing energy consumption and cost”
• “Summit is also the most energy-efficient supercomputer in
its Green500 class—based on gigaflops per watt—outranking systems a 10th as
fast.”
• “We wanted to couple Summit’s mechanical cooling system with its
computational workload to optimize efficiency, which can translate to significant
cost savings for a system of this size.”
• “We’ve developed the infrastructure architecture to scale to millions of events
per second using containerized microservices and popular enterprise open-
source software.”
• “On each Summit node OpenBMC provides real-time data readings from dozens
of sensors totaling more than 460,000 metrics per second that describe power
consumption, temperature, and performance for the entire supercomputer.”
• “Facility staff can now visualize Summit behavior across all 4,608 nodes with a
temperature heat map, a power consumption map, and power and consumption
data broken down by CPUs and GPUs.”
• “Capturing all possible data in real time allows operators and researchers to
gain powerful insights into job behavior, machine performance, and cooling
response.”
*** Quoted from: https://www.hpcwire.com/off-the-wire/olcf-and-providentia-worldwide-build-intelligence-system-for-supercomputer-cooling-plant/
Summit: High Level Hardware/Architecture View
CRASSD
Firmware Alerts & Telemetry from Power nodes flow to Crassd servers and then to open tools for
visualization such as Grafana and the Elastic Stack. Data includes power consumption, frequencies, cooling, etc.
CRASSD: Open tooling for Power Firmware Monitoring
CRASSD Facts
▪ CORAL required telemetry data for all
nodes/layers in the Power Cluster
▪ The proposed RAS architecture had flaws:
▪ No method existed to route errors from the BMC
▪ Built CRASSD as an open tool:
– To collect error events and sort them using policy tables
– Extended the daemon to gather sensor readings to fulfill ORNL telemetry requirements
– Provides an API that makes it easy to develop plug-ins for various open source monitoring tools
▪ The results have been impressive, and many
more use cases are being developed
▪ CRASSD is currently being incorporated into other solutions with the same requirements, e.g. the Power-Ops stack.
Available at: https://github.com/open-power-ref-design-toolkit/ibm-crassd
Motivations
• Replace legacy tools and solutions with modern/open alternatives for Power clusters
• Monitoring for x86 is feature-rich and commoditized with extensive support
• Not so for Power, e.g. Elastic on Power is still on v5.x; the new v7.x ships binaries for x86 only
• Power users often need to port / build / configure these tools from scratch!
➔ May influence maintenance cost, and thus the decision to use Power at all
• Automate a complete ecosystem of tools that fit all needs of a modern Ops stack
• types of data: logs/alerts vs. telemetry
• analysis: historical vs. real-time
• multi-layer aggregation: firmware, OS, services, etc.
• single system or cluster-wide
➔ Popular stacks use Grafana & Prometheus, ELK, Nagios / Icinga / Zabbix, Netdata, etc.
and are deployed/configured by tools such as Ansible, Terraform, Salt, Puppet, etc.
Proposal: Build & curate a key set of modern open tools for Power systems; engage Power systems users and the open source monitoring/ops community
Value 1: reduce cost of modernizing Operations for existing Power clusters (legacy → open)
Value 2: enable adding Power nodes easily into data centers that already use modern Ops tooling
Value 3: reduce the entry cost of operations for new solutions interested in Power's advantages
Beyond Power Firmware Monitoring: Power-Ops project
Power-Ops: Open tooling for Power Cluster Operations
Power-Ops Facts
▪ Management stack runs on Power LE architecture
▪ Supported managed endpoints are Power Linux (could also easily be used on x86):
▪ RedHat family of OSs
▪ Debian/Ubuntu family of OSs
▪ AIX (limited, starting to be supported as endpoints)
▪ Composed of automation components using
Ansible playbooks
▪ 3 Main goals:
▪ Bring-up and pre-configure target platforms
(Bare-Metal, Virtual Machines, Containers*)
▪ Build components not currently available on the
Power platform
▪ Deploy and Configure tooling and start-up dashboards that
work off-the-shelf with Power
▪ Growing community of interested end-users
Power-Ops: Bring-Up
The Bring-Up Process
▪ DevOps professional triggers process on
CI/CD platform
▪ CI/CD tools invoke Ansible
▪ Ansible Playbooks interact with IaaS of choice
▪ Nodes are brought up targeted for different roles:
– Builders
– Controllers
– Endpoints
▪ Bring-up includes powering up (if needed) and laying down prerequisites for building or deployment:
– OS
– Packages & Libraries
– Access configuration
– Software configuration
[Diagram: devops → CI/CD → Ansible → IaaS, bringing up builder, controller, and endpoint nodes; a minimal driver sketch follows below]
The IaaS layer could be one of several choices, e.g.:
- Bare-Metal
- Hypervisors on Power
- Power Hyperconverged Infrastructure
- Containers on OpenShift, etc.
(integrations are easy: just drop in a playbook)
Power-Ops: Build
The Build Process
▪ Many components are already available on Power,
but there are exceptions
▪ CRASSD: source on GitHub
▪ Build process generates packages for Debian, RedHat
▪ Go
▪ The Go daemon binary must be recompiled on Power
▪ Elastic Stack
– Up to v5.x, the code is implemented in Java
– Newer releases ship binaries (x86 only, not yet supported)
– Beats must be re-packaged for Debian, RedHat
▪ All relevant packages are then stored in a local repository
▪ Doesn't have to run frequently
– DevOps orgs could automate upstream integration
[Diagram: CI/CD drives builder nodes, which publish libs and packages to a local repo]
The build generates binaries/packages for Power that are not yet widely available on public repos. The long-term goal is to integrate Power packages into upstream repositories.
Power-Ops: Deploy
The Deploy Process
▪ Choose deployment topology
▪ Where each component is deployed to
▪ How they interconnect with each other
▪ Deploy tooling to nodes
▪ Elastic Stack, Netdata, Crassd go to Controller nodes
▪ Beats (FileBeat, MetricBeat) go to Endpoint nodes
▪ Deploy configuration & visualizations/dashboards
▪ Crassd is configured to collect firmware data:
– Telemetry data goes to Netdata
– Alerting data goes to Logstash
▪ FileBeat collects logs and sends to Logstash
▪ MetricBeat collects telemetry and sends to Elasticsearch
▪ Visualizations/Dashboards are deployed to Netdata and
Kibana
▪ Operators can then access User Interfaces from
Kibana and Netdata
[Diagram: devops → CI/CD → repo → tooling deployed to nodes; CRASSD supports flexible deployment to both controllers and endpoints]
Demo Overview
[Diagram: demo topology]
- Controllers: wmdepos (P8 bare metal) and launchgr01 (P9 bare metal), each running crassd
- Endpoints: pops-ubuntu-ept (VM), pops-redhat-ept (VM), pops-aix-ept (VM), and bos-1 (endpoint / P9)
- Marcelo's laptop runs the deployment playbooks, pulling from the github repos
- Firmware (192.168.10.25): alerts and telemetry collected over IPMI / OpenBMC (*)
- Telemetry + logs flow from endpoints to the controllers
- Dashboards: F/W Alerts (Kibana), Logs/Infrastructure (Kibana), Cluster Metrics (Kibana), OS & F/W Metrics (Netdata)
(*) F/W data supported on Power9 systems
DEMO / Walk-through
Next Steps
Grow the community
1. Engage with traditional Power systems users (e.g. AIX, legacy Power) promoting modernization
2. Engage with the Power Linux community, foster sharing of solutions for everybody's benefit
3. Engage with Open Source communities, promote support of Power out of the box (where it doesn't yet exist)
4. Use as a catalyst for monitoring of new large Power clusters (taking advantage of lower cost of entry on Power)
Enhance the Operational Stack
• Add Call Home support to CRASSD
• Support more deployment use cases, such as:
• Containers (development under way)
• Broader integration targeting other IaaS/PaaS solutions (e.g. OpenShift clusters)
• Support additional tools, such as:
• Prometheus / Grafana (development planned)
• Zabbix and/or Nagios / Icinga, others… (feel free to suggest / collaborate !!!)
• Support additional hardware, such as:
• Support other/newer BMC Firmware interfaces such as Redfish
• Monitor GPUs, Networking & Storage equipment
• More Power / OpenPOWER system models
• Currency work to keep up with newer releases of tooling, e.g.
• Migrate to Elastic Stack v7.x (needs automation)
• Add support for more Beats
• More AIX support
Q&As
Backup
Kibana: Dashboard for Power Firmware events (fed from CRASSD Alerts)
Kibana: Dashboard for Power Infrastructure logs (fed from FileBeat)
Kibana: Multiple Dashboards for Long-Term Power metrics
(fed from MetricBeat and kept on Elasticsearch)
+ more
Netdata: Dashboards for Real-Time Power Firmware metrics (fed from CRASSD)
and Power Infrastructure metrics (fed from other Netdata plugins)