Server RAS and UEFI CPER
Spring 2017 UEFI Seminar and Plugfest
March 27 - 31, 2017
Presented by
Lucia, Mao (Intel)
Spike Yuan (Intel)
UEFI Plugfest – March 2017 www.uefi.org 1
Updated 2011-06-01
Agenda
• RAS Basic
• Server RAS Challenges
• SW Building Blocks
• Possible Solutions
• Call to Action
RAS Basic
RAS101: Key Definitions
• Reliability
– System capability to detect errors, correct errors, and flag errors.
– Measured in FITs (Failures in Time).
– 1 FIT = 1 failure per 1 billion hours, i.e., an MTBF (Mean Time Between Failures) of about 114K years.
• Availability
– System capability to stay operational even when errors occur.
– Measured in terms of downtime within a time interval.
– Five 9’s (99.999% uptime) => about 26 seconds of downtime in a 30-day month.
• Serviceability
– System capability to report failures for “FRU isolation” and ease of repair.
– FRU: Field Replaceable Unit, e.g., DIMM, PCI Express* device.
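The FIT and availability figures above are plain arithmetic, so a small sketch may help when checking them for other targets. The function names are hypothetical, and the sketch assumes a 365-day year and a 30-day month.

```python
HOURS_PER_YEAR = 24 * 365            # assumption: leap years ignored
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 3600


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a FIT rate (failures per 1e9 hours) to MTBF in years."""
    return 1e9 / fit / HOURS_PER_YEAR


def downtime_seconds(availability: float, interval_seconds: float) -> float:
    """Downtime budget for a given availability over an interval."""
    return (1.0 - availability) * interval_seconds


if __name__ == "__main__":
    # 1 FIT corresponds to roughly 114K years between failures.
    print(f"1 FIT -> {fit_to_mtbf_years(1):,.0f} years MTBF")
    # Five 9's over a 30-day month allows roughly 26 seconds of downtime.
    budget = downtime_seconds(0.99999, SECONDS_PER_30_DAY_MONTH)
    print(f"99.999% -> {budget:.2f} s per 30-day month")
```

The same helper gives the yearly five-nines budget when called with the number of seconds in a year.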
Fault, Error, and Failure
• Sources of Fault: Operator Mistake, Unstable Environment, Marginal Hardware, Incorrect Design, Physical Defect
• Error (diagram labels): Unobservable State, Observable State, Detected, Fatal
• Service Failure: when the delivered service deviates from the specified service
Sources of Fault/Error
• 1. Transient Error
– Definition: Electrical-noise-induced faults, mainly affecting links such as the DDR bus or PCI Express links.
– Example: Transient errors on links may alter the data, command, or address bits during a read/write operation. Reads won’t alter the value stored in DRAM, but writes may.
• 2. Soft Error
– Definition: Errors due to an external high-energy particle strike, e.g., alpha particles, neutrons. Soft errors can occur in any known-good system.
– Example: Affects storage structures such as DRAM cells (SBE or MBE) and L1/L2/L3 caches.
• 3. Hard (Device) Failure
– Definition: Device failure due to marginality of the device or degradation over time.
– Example: Failure of an entire device such as a DRAM, memory buffer chip, or CPU chip.
SBE: Single-bit Error. MBE: Multi-bit Error
An Example - Intel® Xeon® Processor Fault Classification
• Faults
– Detected (e.g., MCA)
  • Corrected
  • Uncorrected
    – Catastrophic (DUE)
    – Fatal (DUE)
    – Recoverable (UCR)
      • Uncorrected No Action (UCNA)
      • SW Recoverable Action Optional (SRAO)
      • SW Recoverable Action Required (SRAR)
– Undetected
  • Benign
  • Critical
MCA: Machine Check Architecture
DUE: Detectable but Uncorrected Error
UCR: Uncorrected Recoverable
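As a rough illustration of how software might sort a logged machine check into the detected-fault classes above, here is a minimal sketch. The slides do not give the register encoding; the bit positions below are an assumption taken from the IA32_MCi_STATUS layout in the Intel SDM (valid on processors that support software error recovery), and the names `ErrorClass` and `classify` are hypothetical. The sketch does not distinguish Catastrophic from Fatal.

```python
from enum import Enum

# Assumption: IA32_MCi_STATUS bit positions per the Intel SDM, not the slides.
MCI_STATUS_VAL = 1 << 63   # register holds a valid error
MCI_STATUS_UC = 1 << 61    # error was not corrected
MCI_STATUS_PCC = 1 << 57   # processor context corrupt
MCI_STATUS_S = 1 << 56     # error was signaled via machine check exception
MCI_STATUS_AR = 1 << 55    # software recovery action required


class ErrorClass(Enum):
    NO_ERROR = "no valid error logged"
    CORRECTED = "corrected"
    FATAL = "fatal (DUE)"
    UCNA = "uncorrected no action"
    SRAO = "SW recoverable action optional"
    SRAR = "SW recoverable action required"


def classify(status: int) -> ErrorClass:
    """Map a raw MCi_STATUS value to one of the classes in the tree above."""
    if not status & MCI_STATUS_VAL:
        return ErrorClass.NO_ERROR
    if not status & MCI_STATUS_UC:
        return ErrorClass.CORRECTED
    if status & MCI_STATUS_PCC:
        return ErrorClass.FATAL
    signaled = bool(status & MCI_STATUS_S)
    action_required = bool(status & MCI_STATUS_AR)
    if signaled and action_required:
        return ErrorClass.SRAR
    if signaled:
        return ErrorClass.SRAO
    if not action_required:
        return ErrorClass.UCNA
    # S=0 with AR=1 is not a defined combination; treating it as fatal is
    # a conservative choice made for this sketch only.
    return ErrorClass.FATAL


if __name__ == "__main__":
    sample = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_S
    print(classify(sample).value)
```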
Server RAS Challenges
Life of a Fault – Pillars of RAS
[Figure: the life of a fault across the system stack. Stack layers: HW, FW, OS, Apps. Pillars: Fault Avoidance, Fault Detection, Fault Correction (HW/FW/SW), Apps/Service Reconfigured. Flow labels: Fault (HW), Fault (SW), Avoidance, Detection, Correct in HW, Correct in FW, Handle in OS, Apps Restored, Apps Degradation, Apps Failure. Outcome labels: System Reliability, System Availability, Serviceability.]
Fault (SW) Examples:
1. Programming bug
2. Configuration Error
3. Operator Error
Fault (HW) Examples:
1. Marginal Design
2. Unstable environment
3. High energy particle strike
4. Component failure or degradation
Fault Avoidance
Hidden: Apps, SW, FW, HW
Visible: None
Feature Example:
1. Micro-circuit level capabilities
Fault Detection in HW
Hidden: Apps, SW, FW
Visible: HW
Feature Examples:
1. CRC Check
2. Parity Detection
3. Patrol Scrub
Fault Correction in HW
Hidden: Apps, SW, FW
Visible: HW
Examples:
1. Cache and Buffer ECC
2. Memory SDDC, Mirroring
3. Links CRC Retry
Fault Correction in FW
Hidden: Apps, SW
Visible: HW, FW
Feature Example:
1. Memory Sparing
Fault Handling in SW
Hidden: Apps
Visible: HW, FW, SW
Feature Examples:
1. PCI Express* Link Retry using AER (Advanced Error Reporting)
2. CMCI (Corrected Machine Check Interrupt) based Predictive Failure Analysis
System Availability Delivered Through the Stack (HW, FW, SW)
RAS Enabling Framework
• Fault Handling = RAS (Four Pillars)
– Fault Tolerance: 1. Avoidance, 2. Detection, 3. Correction (e.g., Memory ECC)
– Fault Management: 4. Reconfiguration (e.g., Failed DIMM Isolation)
• System Reliability: Extending the Uptime
• Diagnosability/Serviceability/Manageability: Minimizing Downtime
• System Stack: Silicon, FW (Silicon RefCode), FW (OEM/IBV/BMC), OS/VMM, Application
• Error Signaling/Polling, Error Logging, FW-based Fault Management, OS-based Fault Management
• Fault Avoidance, Detection, and Correction in HW
• Fault Correction Through FW
• Fault Correction Through OS
• Fault Correction at App Layer
• RAS Enabling Requires HW, FW/BIOS, OS/SW
RAS Needs of Cloud and HPC Clusters
Which of these is a Cloud Cluster and which is an HPC Cluster?
• Cloud Infrastructure
– Needs fault handling
– Check-pointing is not used
– Applications can tolerate a single machine failure
– Fault management capabilities for improving TCO
– Example: automated techniques for identifying a failed component
• HPC Infrastructure
– Needs fault handling
– Check-pointing is actively used
– Applications cannot tolerate a single machine failure
– Fault tolerance capabilities for extending uptime
– Example: HW/FW based self-healing techniques
RAS Needs of Mission Critical Segment
• Prevent/minimize unplanned downtime
• What would an outage cost?
• Intel® Xeon® RAS features directly impact end-user’s bottom line!
[Figure labels: System UP Time; Correctable Fault; Fatal; Unplanned System DOWN (Undesirable)]
Source: Trend in IT Value Report, Standish Group International, 2008
FW/SW Building Blocks
Error Reporting Basics
• Error Reporting includes two functions:
– Logging
– Signaling
• Logging
– Through MCA Banks, PCIe AER Registers, and Memory Corrected Error Registers
• eMCA2 Mode – Enhanced error reporting to support Firmware-First mode
• Signaling of Corrected Errors
– CMCI (Corrected Machine Check Interrupt)
• Threshold based
• Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– CSMI (Corrected SMI) for core/uncore (Part of the eMCA2 new Feature)
• Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
• No Threshold
– SMI (System Management Interrupt) for Memory errors
– MSI (Message Signaled Interrupt) or external signaling for PCI Express* errors
Error Reporting Basics
(Continued)
• Signaling of Uncorrected Recoverable Errors (e.g., UCNA)
– CMCI for core/uncore errors at the source
• Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– MSMI (Machine Check SMI) for core/uncore errors at the source (Part of the eMCA2)
• Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
• MSMI trigger (same as SMI).
– MSI or external signaling for Severity1 PCIe AER nonfatal errors
• Signaling of Uncorrected Recoverable Errors (e.g., SRAO and SRAR)
– MCERR (Machine Check Error) for core/uncore errors
• External signaling – via CATERR_N pin (16 BCLK pulse). Allows propagation to other sockets.
• In-band signaling – MCE trigger (vector 18, i.e., 12h). In-band SMI trigger if configured.
• Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode
– MSMI (Machine Check SMI) for core/uncore errors at the source(Part of the eMCA2)
• Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
• External signaling – via MSMI_N pin. Allows propagation to other sockets.
• In-band signaling – MSMI trigger (same as SMI).
UCNA: Uncorrected No Action
SRAO: SW Recoverable Action Optional
SRAR: SW Recoverable Action Required
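The mode-dependent signaling rules on the two Error Reporting Basics slides can be condensed into a lookup table. This is a minimal sketch covering core/uncore errors only; the mode strings, the `SIGNALING` table, and `signal_for` are hypothetical names chosen for illustration, and memory (SMI) and PCI Express (MSI) signaling is left out.

```python
# (error class, reporting mode) -> signal used for core/uncore errors,
# as listed on the slides. Keys and names are illustrative only.
SIGNALING = {
    ("corrected", "legacy-mca"): "CMCI",   # threshold based
    ("corrected", "emca2"): "CSMI",        # no threshold
    ("UCNA", "legacy-mca"): "CMCI",
    ("UCNA", "emca2"): "MSMI",
    ("SRAO", "legacy-mca"): "MCERR",
    ("SRAO", "emca2"): "MSMI",
    ("SRAR", "legacy-mca"): "MCERR",
    ("SRAR", "emca2"): "MSMI",
}


def signal_for(error_class: str, mode: str) -> str:
    """Return the signal the slides associate with an error class and mode."""
    try:
        return SIGNALING[(error_class, mode)]
    except KeyError:
        raise ValueError(f"unknown combination: {error_class!r}, {mode!r}")


if __name__ == "__main__":
    for mode in ("legacy-mca", "emca2"):
        for error_class in ("corrected", "UCNA", "SRAO", "SRAR"):
            print(f"{mode:10} {error_class:9} -> {signal_for(error_class, mode)}")
```

Reading the table by column makes the Firmware-First intent visible: in eMCA2 mode every core/uncore class is routed through an SMI variant, so firmware sees the error before the OS does.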
Linux RAS Blocks
Linux CMCI/MCA Handler with eMCA2
[Figure: two flow diagrams, both with eMCA enabled, with steps numbered 1 through 5a/5b.]
• Corrected error panel: Every Corrected Error Occurred in Hardware (Machine Check Banks, Other HW Registers) -> SMI -> BIOS SMI Handler (Read Error Log) -> Main Memory (Enhance Error Log) -> CMCI (Optional) -> OS CMCI Handler (Read Error Log, Read Enhance Error Log)
• Uncorrected error panel: Uncorrected Error Occurred in Hardware (Machine Check Banks, Other HW Registers) -> SMI -> BIOS SMI Handler (Read Error Log) -> Main Memory (Enhance Error Log) -> INT18 -> OS MCA Handler (Read Error Log, Read Enhance Error Log)
APEI Overview
[Figure: software stack. Application Software, e.g., APEI based Tool; Kernel: OSPM System Code, Device Driver, ACPI Driver; ACPI Tables; BIOS; Hardware Platform including Processors, Memory, and I/O. Legend: OS Independent code, OS Specific code. Interfaces: ACPI Table Interface; existing industry standard register interface, e.g., MSR Rd/Wr.]
Boot-time:
Step 1a: BIOS/SMM presents the APEI tables to the OS. BIOS checks whether the target processor supports the APEI feature before presenting them to the OS.
Run-time:
Step 2a: The APEI based Error Injection Tool requests the OS to inject an EINJ based error in a selected machine-check bank.
Step 2b: OS requests BIOS/SMM to enable APEI, i.e., to enable the capability to write non-zero values to machine-check banks.
Step 2c: OS writes the machine-check bank values requested in …
Step 2d: OS requests BIOS/SMM to inject either MCERR or CMCI. Note that the actual trigger event occurs within OS context.
Step 2e: BIOS discovers the banks updated by the OS and programs the SMI triggering source registers.
Step 2f: Once testing is completed, OS requests disabling of MCERR/CMCI signaling.
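For the tool side of the injection flow, Linux exposes EINJ through debugfs. The slides do not describe this interface; the sketch below assumes the file names documented for the Linux kernel EINJ driver (`error_type`, `param1`, `param2`, `error_inject` under `/sys/kernel/debug/apei/einj`) and the value 0x8 for a correctable memory error. `inject_error` is a hypothetical helper, and the demo writes into a temporary directory so it can run without root or EINJ-capable firmware.

```python
import os
import tempfile

# Assumption: location of the EINJ debugfs files on Linux (debugfs mounted,
# einj module loaded). Not taken from the slides.
EINJ_DIR = "/sys/kernel/debug/apei/einj"
MEM_CORRECTABLE = 0x8  # assumption: EINJ error type for a correctable memory error


def inject_error(error_type: int, address: int,
                 mask: int = 0xFFFFFFFFFFFFF000,
                 einj_dir: str = EINJ_DIR) -> None:
    """Write the EINJ parameters, then trigger the injection."""
    def write(name: str, value: str) -> None:
        with open(os.path.join(einj_dir, name), "w") as f:
            f.write(value)

    write("error_type", f"0x{error_type:x}")
    write("param1", f"0x{address:x}")   # physical address for memory errors
    write("param2", f"0x{mask:x}")      # address mask
    write("error_inject", "1")          # written last: this starts the injection


if __name__ == "__main__":
    # Dry run against a scratch directory instead of the real debugfs path.
    with tempfile.TemporaryDirectory() as scratch:
        inject_error(MEM_CORRECTABLE, 0x12345000, einj_dir=scratch)
        for name in sorted(os.listdir(scratch)):
            with open(os.path.join(scratch, name)) as f:
                print(name, "=", f.read())
```

Pointing `einj_dir` at the real debugfs path on a test machine requires root and a platform whose firmware provides the EINJ table.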
UEFI CPER Overview
• CPER: Common Platform Error Record
• CPER is also the format used to describe platform hardware errors by various APEI tables, such as ERST, BERT, and HEST.
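To make the record format a little more concrete, here is a minimal sketch that decodes the first fields of a CPER record header. The field order, sizes, and severity codes are an assumption based on the CPER appendix of the UEFI specification as mirrored in the Linux kernel's `cper.h`; they should be checked against the specification before being relied on. The sample bytes are made up for the demo.

```python
import struct

# Assumed layout of the start of a CPER record header (little-endian, packed):
# signature[4], revision u16, signature_end u32, section_count u16,
# error_severity u32, validation_bits u32, record_length u32.
HEADER_PREFIX = struct.Struct("<4sHIHIII")

SEVERITY_NAMES = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "informational"}


def parse_cper_prefix(buf: bytes) -> dict:
    """Decode the leading fixed fields of a CPER record header."""
    (signature, revision, signature_end, section_count,
     severity, validation_bits, record_length) = HEADER_PREFIX.unpack_from(buf)
    if signature != b"CPER" or signature_end != 0xFFFFFFFF:
        raise ValueError("not a CPER record")
    return {
        "revision": revision,
        "section_count": section_count,
        "severity": SEVERITY_NAMES.get(severity, f"unknown ({severity})"),
        "validation_bits": validation_bits,
        "record_length": record_length,
    }


if __name__ == "__main__":
    # Made-up sample: one section, corrected severity, 200-byte record.
    sample = HEADER_PREFIX.pack(b"CPER", 0x0100, 0xFFFFFFFF, 1, 2, 0, 200)
    print(parse_cper_prefix(sample))
```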
Linux CPER implementation
• Legacy MCA way:
In arch/x86/kernel/cpu/mcheck/mce-apei.c:
this code takes an error report given to the kernel by APEI and passes it into the normal Linux error logging code, the same code that is invoked after finding an error in a machine check bank. This path did not provide address information in the machine check bank.
• eMCA2 Mode:
In drivers/acpi/acpi_extlog.c:
on recent-generation servers with eMCA, this code picks up additional information that the BIOS provides for each error logged. All Linux really looks for in the CPER record is the “handle” into the SMBIOS/DMI table so that, for example, it can report which DIMM is associated with the error.
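The handle lookup described for eMCA2 mode amounts to a table join between the error record and the SMBIOS memory device entries. The sketch below illustrates the idea only; it is not the kernel's code, and the table contents, handle values, and the `dimm_for_handle` helper are all made up for the example.

```python
from typing import Optional

# Hypothetical SMBIOS/DMI memory device entries keyed by handle.
# On a real system these would come from the SMBIOS table.
DMI_MEMORY_DEVICES = {
    0x0040: {"locator": "CPU1_DIMM_A1", "bank": "NODE 1"},
    0x0041: {"locator": "CPU1_DIMM_B1", "bank": "NODE 1"},
    0x0050: {"locator": "CPU2_DIMM_A1", "bank": "NODE 2"},
}


def dimm_for_handle(handle: int,
                    table: dict = DMI_MEMORY_DEVICES) -> Optional[str]:
    """Translate a handle carried in an error record into a DIMM label."""
    entry = table.get(handle)
    if entry is None:
        return None  # unknown handle: the error cannot be tied to a DIMM
    return f"{entry['bank']} {entry['locator']}"


if __name__ == "__main__":
    # Handle as it might appear in a memory error record (made-up value).
    print(dimm_for_handle(0x0041))
```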
Precisely map each error to a specific component
Solution
Platform UEFI Boot Service
1. Platform Error logging
2. Platform Error report
3. Platform Specific algorithms
UEFI Runtime Service using CPER
1. Classify error severity
2. Consolidate error sources
3. Standardize RAS policy
OS Module
1. OS Error log
2. OS/App Error Recovery
3. Component offline or replacement
CPER w/ eMCA2
[Figure: uncorrected error flow with eMCA enabled, steps numbered 1 through 5a/5b. Uncorrected Error Occurred in Hardware (Machine Check Banks, Other HW Registers) -> SMI -> BIOS SMI Handler (Read Error Log) -> Main Memory (Enhance Error Log) -> INT18 -> OS UEFI CPER (Read Error Log, Read Enhance Error Log). Additional label: Server RAS Policy.]
Call to Action
• Standardize a UEFI Error Reporting/Error Handling/RAS Policy protocol.
• Centralize more error sources, such as PCIe AER and MCA.
• Connect APEI/CPER with OS-based policy, such as CPU/memory/PCIe device hotplug.
• Feed OS-guided RAS policy, such as page offlining, back to FW and HW.
Thanks for attending the Spring 2017 UEFI Seminar and Plugfest
For more information on the Unified EFI Forum and UEFI Specifications, visit http://www.uefi.org
