Spike yuan server ras and uefi cper final

Server RAS and UEFI CPER
Spring 2017 UEFI Seminar and Plugfest
March 27 - 31, 2017
Presented by
Lucia, Mao (Intel)
Spike Yuan (Intel)
UEFI Plugfest – March 2017 www.uefi.org
Updated 2011-06-01

Agenda
• RAS Basic
• Server RAS Challenges
• SW Building Blocks
• Possible Solutions
• Call to Action

RAS Basic

RAS101: Key Definitions
• Reliability
– System capability to detect errors, correct errors, and flag errors.
– Measured in FITs (Failures in Time).
– 1 FIT = 1 failure per 1 billion hours, i.e., an MTBF (Mean Time Between Failures) of about 114K years.
• Availability
– System capability to stay operational even when errors occur.
– Measured in terms of "down time within a time interval".
– Five 9's (99.999% up time) => about 26 seconds of down time in a 30-day month.
• Serviceability
– System capability to report failures for "FRU isolation" and ease of repair.
– FRU: Field Replaceable Unit, e.g., DIMM, PCI Express* device.
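The FIT and availability figures in these definitions are plain arithmetic, and a short sketch makes them easy to re-derive. The function names below are illustrative only, and the month length (30 days) and year length (365 days) are assumptions; the exact down-time budget shifts slightly with the interval chosen.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (assumption)


def mtbf_years(fit: float) -> float:
    """MTBF in years for a failure rate given in FIT (failures per 1e9 hours)."""
    return 1e9 / fit / HOURS_PER_YEAR


def downtime_seconds(availability: float, days: float) -> float:
    """Down-time budget in seconds over `days` for an availability in [0, 1]."""
    return (1.0 - availability) * days * 24 * 3600


print(round(mtbf_years(1)))                            # 114155 years for 1 FIT
print(round(downtime_seconds(0.99999, 30), 1))         # 25.9 seconds per 30-day month
print(round(downtime_seconds(0.99999, 365) / 60, 2))   # 5.26 minutes per year
```

A part rated at 100 FIT would therefore have an MTBF of roughly 1,142 years, which is why FIT budgets are usually summed across all components in a system before being compared against an availability target.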

Fault, Error, and Failure
• Sources of Fault: Operator Mistake, Unstable Environment, Marginal Hardware, Incorrect Design, Physical Defect
• Error states shown in the figure: Unobservable State, Observable State, Detected, Fatal
• Service Failure: when the delivered service deviates from the specified service

Sources of Fault/Error
1. Transient Error
– Definition: Electrical-noise-induced faults, mainly affecting links such as the DDR bus or PCI Express links.
– Example: Transient errors on links may alter the data, command, or address bits during a read/write operation. Reads won't alter the value stored in DRAM, but writes may.
2. Soft Error
– Definition: Errors due to external high-energy particle strikes, e.g., alpha particles, neutrons. Soft errors can occur in any known-good system.
– Example: Affects storage structures such as DRAM cells (SBE or MBE) and L1/L2/L3 caches.
3. Hard (Device) Failure
– Definition: Device failure due to marginality of the device or degradation over time.
– Example: Failure of an entire device such as a DRAM, memory buffer chip, or CPU chip.
SBE: Single-bit Error. MBE: Multi-bit Error.

An Example - Intel® Xeon® Processor Fault Classification
• Faults
– Detected (e.g., MCA)
  • Corrected
  • Uncorrected
    – Catastrophic (DUE)
    – Fatal (DUE)
    – Recoverable (UCR)
      • Uncorrected No Action (UCNA)
      • SW Recoverable Action Optional (SRAO)
      • SW Recoverable Action Required (SRAR)
– Undetected
  • Benign
  • Critical
MCA: Machine Check Architecture
DUE: Detectable but Uncorrected Error
UCR: Uncorrected Recoverable
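The detected-fault classes above can be told apart in software from the flag bits of a machine check bank status register. The sketch below assumes the IA32_MCi_STATUS bit layout documented in the Intel SDM (VAL, UC, PCC, S, AR), which is not spelled out in the slides, and it assumes a processor that supports software error recovery so that the S and AR bits are meaningful. The enum and function names are illustrative only, and the sketch does not try to separate Catastrophic from Fatal.

```python
from enum import Enum

# Bit positions in IA32_MCi_STATUS (assumed from the Intel SDM, not the slides).
MCI_STATUS_VAL = 1 << 63  # bank holds a valid error record
MCI_STATUS_UC = 1 << 61   # error was not corrected by hardware
MCI_STATUS_PCC = 1 << 57  # processor context corrupt
MCI_STATUS_S = 1 << 56    # error was signaled with a machine check exception
MCI_STATUS_AR = 1 << 55   # software recovery action is required


class FaultClass(Enum):
    NONE = "no valid error logged"
    CORRECTED = "Corrected"
    UCNA = "Uncorrected No Action"
    SRAO = "SW Recoverable Action Optional"
    SRAR = "SW Recoverable Action Required"
    FATAL = "Fatal (DUE)"


def classify(status: int) -> FaultClass:
    """Map a raw status value onto the fault classes named in the slide."""
    if not status & MCI_STATUS_VAL:
        return FaultClass.NONE
    if not status & MCI_STATUS_UC:
        return FaultClass.CORRECTED
    if status & MCI_STATUS_PCC:
        return FaultClass.FATAL
    if not status & MCI_STATUS_S:
        # UC=1, PCC=0, S=0: logged without a machine check exception.
        return FaultClass.UCNA
    return FaultClass.SRAR if status & MCI_STATUS_AR else FaultClass.SRAO


sample = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR
print(classify(sample).value)  # SW Recoverable Action Required
```

Checking PCC before S and AR matters: once the processor context is corrupt, the recoverable sub-classes no longer apply regardless of the other bits.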

Server RAS Challenges

Life of a Fault – Pillars of RAS
Figure: a fault (HW or SW) moving through the system stack (HW, FW, OS, Apps). Stages shown: Avoidance, Detection, Correct in HW, Correct in FW, Handle in OS. Outcomes shown: Apps Restored, Apps/Service Reconfigured, Apps Degradation, Apps Failure. Also labeled: Fault Avoidance, Fault Detection, Fault Correction (HW/FW/SW), System Reliability, Serviceability, System Availability.
Fault (SW) Examples:
1. Programming bug
2. Configuration Error
3. Operator Error
Fault (HW) Examples:
1. Marginal Design
2. Unstable environment
3. High energy particle strike
4. Component failure or degradation
Fault Avoidance
Hidden: Apps, SW, FW, HW
Visible: None
Feature Example:
1. Micro-circuit level capabilities
Fault Detection in HW
Hidden: Apps, SW, FW
Visible: HW
Feature Examples:
1. CRC Check
2. Parity Detection
3. Patrol Scrub
Fault Correction in HW
Hidden: Apps, SW, FW
Visible: HW
Examples:
1. Cache and Buffer ECC
2. Memory SDDC, Mirroring
3. Links CRC Retry
Fault Correction in FW
Hidden: Apps, SW
Visible: HW, FW
Feature Example:
1. Memory Sparing
Fault Handling in SW
Hidden: Apps
Visible: HW, FW, SW
Feature Examples:
1. PCI Express* Link Retry using AER (Advanced Error Reporting)
2. CMCI (Corrected Machine Check Interrupt) based Predictive Failure Analysis
System Availability Delivered Through the Stack (HW, FW, SW)
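CMCI-based Predictive Failure Analysis is named above without detail. One common way to build it, offered here purely as an assumption and not as the method the presenters describe, is a leaky-bucket counter per component: each corrected error adds one to the bucket, the bucket drains at a fixed rate, and a component is flagged when the count reaches a threshold. The class name, threshold, and interval below are hypothetical.

```python
from typing import Optional


class LeakyBucket:
    """Per-component corrected-error counter that drains over time."""

    def __init__(self, threshold: int, leak_interval_s: float) -> None:
        self.threshold = threshold              # errors needed to flag the part
        self.leak_interval_s = leak_interval_s  # one error is forgiven per interval
        self.count = 0
        self.last_leak: Optional[float] = None

    def record_error(self, now_s: float) -> bool:
        """Record one corrected error; return True if failure is predicted."""
        if self.last_leak is None:
            self.last_leak = now_s
        # Drain one count for every full interval elapsed since the last drain.
        leaked = int((now_s - self.last_leak) // self.leak_interval_s)
        if leaked > 0:
            self.count = max(0, self.count - leaked)
            self.last_leak += leaked * self.leak_interval_s
        self.count += 1
        return self.count >= self.threshold


# A burst of three errors within 20 seconds trips a 3-per-hour bucket.
dimm = LeakyBucket(threshold=3, leak_interval_s=3600)
print([dimm.record_error(t) for t in (0, 10, 20)])  # [False, False, True]
```

The drain step is what separates a degrading part from background soft errors: isolated errors spread over a long period never accumulate, while a cluster of errors on one component does.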

RAS Enabling Framework
RAS Enabling Requires HW, FW/BIOS, OS/SW
• Fault Handling = RAS (Four Pillars)
– Fault Tolerance (System Reliability, Extending the Uptime): 1. Avoidance, 2. Detection, 3. Correction. E.g., Memory ECC
– Fault Management (Diagnosability/Serviceability/Manageability, Minimizing Downtime): 4. Reconfiguration. E.g., Failed DIMM Isolation
• Supporting functions: Error Logging, Error Signaling/Polling, FW-based Fault Management, OS-based Fault Management
• System Stack: Silicon, FW (Silicon RefCode), FW (OEM/IBV/BMC), OS/VMM, Application
• Fault handling across the stack: Fault Avoidance, Detection, and Correction in HW; Fault Correction Through FW; Fault Correction Through OS; Fault Correction at App Layer

RAS Needs of Cloud and HPC Clusters
Which of these is Cloud Cluster and HPC Cluster?
• Cloud Infrastructure
– Needs fault handling
– Check-pointing is not used
– Applications can tolerate a single machine failure
– Fault management capabilities for improving TCO
– Example: Automated techniques for identifying a failed component
• HPC Infrastructure
– Needs fault handling
– Check-pointing is actively used
– Applications cannot tolerate a single machine failure
– Fault tolerance capabilities for extending uptime
– Example: HW/FW based self-healing techniques

RAS Needs of Mission Critical Segment
• Prevent/minimize unplanned downtime
• What would an outage cost?
– Source: Trend in IT Value Report, Standish Group International, 2008
• Intel® Xeon® RAS features directly impact the end user's bottom line!
Figure labels: System UP Time, Correctable Fault, Fatal, Unplanned System DOWN, Undesirable.

FW/SW Building Blocks

Error Reporting Basics
• Error Reporting includes two functions:
– Logging
– Signaling
• Logging
– Through MCA Banks, PCIe AER Registers, and Memory Corrected Error Registers
– eMCA2 mode: enhanced error reporting to support Firmware-First mode
• Signaling of Corrected Errors
– CMCI (Corrected Machine Check Interrupt)
  • Threshold based
  • Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– CSMI (Corrected SMI) for core/uncore (part of the new eMCA2 feature)
  • Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
  • No threshold
– SMI (System Management Interrupt) for memory errors
– MSI (Message Signaled Interrupt) or external signaling for PCI Express* errors

Error Reporting Basics (Continued)
• Signaling of Uncorrected Recoverable Errors (e.g., UCNA)
– CMCI for core/uncore errors at the source
  • Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– MSMI (Machine Check SMI) for core/uncore errors at the source (part of eMCA2)
  • MSMI trigger (same as SMI).
– MSI or external signaling for Severity 1 PCIe AER non-fatal errors
• Signaling of Uncorrected Recoverable Errors (e.g., SRAO and SRAR)
– MCERR (Machine Check Error) for core/uncore errors
  • External signaling: via the CATERR_N pin (16 BCLK pulse). Allows propagation to other sockets.
  • In-band signaling: MCE trigger (vector 18, i.e., 12h). In-band SMI trigger if configured.
  • Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– MSMI (Machine Check SMI) for core/uncore errors at the source (part of eMCA2)
  • External signaling: via the MSMI_N pin. Allows propagation to other sockets.
  • In-band signaling: MSMI trigger (same as SMI).
UCNA: Uncorrected No Action
SRAO: SW Recoverable Action Optional
SRAR: SW Recoverable Action Required
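The signaling rules for core/uncore errors listed on these two slides reduce to a lookup on error class and MCA mode. The sketch below only restates what the slides say; the mode labels "legacy" and "emca2", the table, and the function names are illustrative, and memory and PCI Express* signaling (SMI, MSI) are left out because the slides do not tie them to a mode.

```python
# (error class, MCA mode) -> signal used for core/uncore errors, per the slides.
CORE_UNCORE_SIGNAL = {
    ("corrected", "legacy"): "CMCI",  # threshold based
    ("corrected", "emca2"): "CSMI",   # no threshold
    ("UCNA", "legacy"): "CMCI",
    ("UCNA", "emca2"): "MSMI",
    ("SRAO", "legacy"): "MCERR",
    ("SRAO", "emca2"): "MSMI",
    ("SRAR", "legacy"): "MCERR",
    ("SRAR", "emca2"): "MSMI",
}

# Signals delivered as an SMI, so platform firmware handles the event first.
SMI_BASED = {"CSMI", "MSMI"}


def signal_for(error_class: str, mode: str) -> str:
    """Return the signal raised for a core/uncore error in the given mode."""
    try:
        return CORE_UNCORE_SIGNAL[(error_class, mode)]
    except KeyError:
        raise ValueError(f"unknown combination: {error_class!r}, {mode!r}") from None


def firmware_sees_first(error_class: str, mode: str) -> bool:
    """True when the event reaches the SMI handler before the OS."""
    return signal_for(error_class, mode) in SMI_BASED


print(signal_for("SRAR", "legacy"))          # MCERR
print(firmware_sees_first("UCNA", "emca2"))  # True
```

Laid out this way, the effect of eMCA2 is visible at a glance: every core/uncore class moves from an OS-visible interrupt to an SMI-based one, which is what enables the Firmware-First model.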

Linux RAS Blocks

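Linux CMCI/MCA Handler with eMCA2
Figure, two panels, both with eMCA enabled and with steps numbered 1, 2, 3, 4, 5a, and 5b:
• Corrected errors: every corrected error logged in the Machine Check Banks and other HW registers raises an SMI. The BIOS SMI handler reads the error log and writes an enhanced error log to main memory. An optional CMCI then invokes the OS CMCI handler, which reads the error log and the enhanced error log.
• Uncorrected errors: an uncorrected error logged in the Machine Check Banks and other HW registers raises an SMI. The BIOS SMI handler reads the error log and writes an enhanced error log to main memory. INT18 then invokes the OS MCA handler, which reads the error log and the enhanced error log.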

APEI Overview
• Layers shown in the figure, top to bottom:
– Application Software, e.g., APEI based Tool
– Kernel, OSPM System Code
– Device Driver, ACPI Driver
– ACPI Tables, BIOS
– Hardware Platform including Processors, Memory, and I/O
• Other figure labels: OS Specific code, OS Independent code, ACPI Table Interface, and the existing industry standard register interface, e.g., MSR Rd/Wr
• Boot-time:
– Step 1a: BIOS/SMM presents the APEI tables to the OS. The BIOS checks that the target processor supports the APEI feature before presenting them.
• Run-time:
– Step 2a: The APEI based Error Injection Tool requests the OS to inject an EINJ based error in the selected Machine-check bank.
– Step 2b: The OS requests BIOS/SMM to enable APEI, i.e., to enable the capability to write non-zero values to the machine-check banks.
– Step 2c: The OS writes the requested Machine-check bank values.
– Step 2d: The OS requests BIOS/SMM to inject either MCERR or CMCI. Note that the actual trigger event occurs within the OS context.
– Step 2e: The BIOS discovers the banks updated by the OS and programs the SMI triggering source registers.
– Step 2f: Once testing is completed, the OS requests disabling of MCERR/CMCI signaling.

UEFI CPER Overview
• CPER: Common Platform Error Record
• CPER is also the format used to describe platform hardware errors by various APEI tables, such as ERST, BERT, and HEST.
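Every CPER record starts with a fixed-size record header, so a consumer can check the signature and read the severity before looking at any section. The sketch below decodes that header with Python's struct module. The field order and the 128-byte size are taken from the record header definition in the UEFI specification as best understood here and should be checked against the specification revision in use; the function name and the sample record are hypothetical.

```python
import struct
import uuid

# Little-endian, unpadded CPER record header (128 bytes, assumed per UEFI spec).
CPER_HEADER = struct.Struct("<4sHIHIIIQ16s16s16s16sQIQ12s")
SEVERITY = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}


def parse_cper_header(buf: bytes) -> dict:
    """Decode the leading record header of a CPER blob into a small dict."""
    if len(buf) < CPER_HEADER.size:
        raise ValueError("buffer shorter than a CPER record header")
    (sig, revision, sig_end, section_count, severity, _validation_bits,
     record_length, _timestamp, _platform_id, _partition_id, _creator_id,
     notification_type, record_id, _flags, _persistence, _reserved,
     ) = CPER_HEADER.unpack_from(buf)
    if sig != b"CPER" or sig_end != 0xFFFFFFFF:
        raise ValueError("not a CPER record")
    return {
        "revision": revision,
        "section_count": section_count,
        "severity": SEVERITY.get(severity, f"unknown ({severity})"),
        "record_length": record_length,
        # EFI GUIDs are stored mixed-endian, which is what bytes_le expects.
        "notification_type": uuid.UUID(bytes_le=notification_type),
        "record_id": record_id,
    }


# Synthetic header with arbitrary values, for illustration only.
zero_guid = bytes(16)
sample = CPER_HEADER.pack(b"CPER", 0x0101, 0xFFFFFFFF, 1, 2, 0, 128, 0,
                          zero_guid, zero_guid, zero_guid, zero_guid,
                          42, 0, 0, bytes(12))
print(parse_cper_header(sample)["severity"])  # Corrected
```

A real record continues after the header with one section descriptor per section, followed by the section bodies such as a memory or processor error section. Those are left out to keep the sketch short.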

Linux CPER implementation
• Legacy MCA way:
In arch/x86/kernel/cpu/mcheck/mce-apei.c:
this code takes an error report given to the kernel by APEI and passes it into the normal Linux error logging code, the same code that is invoked after finding an error in a machine check bank. This code did not provide the address information in the machine check bank.
• eMCA2 Mode:
In drivers/acpi/acpi_extlog.c:
on recent-generation servers with eMCA, this code picks up the additional information that the BIOS provides with each logged error. All Linux really looks for in the CPER record is the "handle" into the SMBIOS/DMI table so that, for example, it can report which DIMM is associated with the error.

Precisely map each error to a specific component

Solution
• Platform UEFI Boot Service
1. Platform Error logging
2. Platform Error report
3. Platform Specific algorithms
• UEFI Runtime Service using CPER
1. Classify error severity
2. Consolidate error sources
3. Standardize RAS policy
• OS Module
1. OS Error log
2. OS/App Error Recovery
3. Component offline or replacement
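To make the split between the CPER-based runtime service and the OS module concrete, the sketch below shows one way a standardized policy could map a classified, consolidated error onto an OS action. Everything here is a hypothetical illustration: the slides do not define a policy table, and the source names, severity names, and actions are assumptions chosen to echo the items listed above (error log, recovery, component offline).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformError:
    source: str     # consolidated error source, e.g. "memory", "pcie", "cpu"
    severity: str   # classified severity: "corrected", "recoverable", "fatal"
    component: str  # component the error maps to, e.g. a DIMM locator


# Hypothetical policy table: (source, severity) -> OS action.
RAS_POLICY = {
    ("memory", "corrected"): "log",
    ("memory", "recoverable"): "offline_page",
    ("pcie", "corrected"): "log",
    ("pcie", "recoverable"): "offline_device",
    ("cpu", "corrected"): "log",
    ("cpu", "recoverable"): "recover_task",
}


def decide(err: PlatformError) -> str:
    """Pick the OS action for an error that is already classified."""
    if err.severity == "fatal":
        return "reset"  # nothing can be recovered in place
    # Unknown combinations fall back to logging only.
    return RAS_POLICY.get((err.source, err.severity), "log")


event = PlatformError("memory", "recoverable", "DIMM_A1")
print(decide(event), event.component)  # offline_page DIMM_A1
```

Keeping the policy in a table rather than in code paths is the point of the sketch: the same table could be shared or negotiated between firmware and the OS once the error format and severity classes are standardized.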

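CPER w/ eMCA2
Figure (eMCA enabled, steps numbered 1, 2, 3, 4, 5a, and 5b): an uncorrected error logged in the Machine Check Banks and other HW registers raises an SMI. The BIOS SMI handler reads the error log and writes an enhanced error log to main memory. INT18 then hands the error to the OS, where UEFI CPER reads the error log and the enhanced error log. The figure also shows the Server RAS Policy.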

Call to Action

• Standardize a UEFI Error Reporting/Error Handling/RAS Policy Protocol.
• Centralize more error sources, such as PCIe AER and MCA.
• Connect APEI/CPER with OS-based policy, such as hotplug of CPU, memory, and PCIe devices.
• Feed OS-guided RAS policy, such as page offlining, back to FW and HW.

Thanks for attending the Spring 2017 UEFI Seminar and Plugfest
For more information on the Unified EFI Forum and UEFI Specifications, visit http://www.uefi.org
