HP 3PAR StoreServ 7000 Storage
Troubleshooting Guide
Service Edition
Abstract
This guide is intended for experienced users and system administrators who troubleshoot HP 3PAR StoreServ 7000 Storage
systems and have a firm understanding of RAID schemes.
HP Part Number: QR482-96619
Published: March 2014
© Copyright 2014 Hewlett-Packard Development Company, L.P.
Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial
Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under
vendor's standard commercial license.
The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express
warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall
not be liable for technical or editorial errors or omissions contained herein.
Acknowledgments
Microsoft® and Windows® are U.S. registered trademarks of Microsoft Corporation.
Warranty
To obtain a copy of the warranty for this product, see the warranty information website:
http://www.hp.com/go/storagewarranty
Contents
1 Identifying Storage System Components........................................................7
Understanding Component Numbering.......................................................................................7
Drive Enclosures...................................................................................................................7
Controller Nodes.................................................................................................................8
PCIe Slots and Ports.............................................................................................................9
I/O Modules ....................................................................................................................10
Power Cooling Modules......................................................................................................10
Power Distribution Units......................................................................................................11
Service Processor...............................................................................................................11
2 Understanding LED Indicator Status.............................................................12
Enclosure LEDs.......................................................................................................................12
Bezel LEDs........................................................................................................................12
Disk Drive LEDs..................................................................................................................13
Storage System Component LEDs..............................................................................................13
PCM LEDs.........................................................................................................................13
Drive PCM LEDs.................................................................................................................15
I/O Module LEDs..............................................................................................................16
External Port Activity LEDs...................................................................................................17
Controller Node and Internal Component LEDs...........................................................................18
Ethernet LEDs....................................................................................................................18
FC Port LEDs......................................................................................................................19
SAS Port LEDs....................................................................................................................20
Interconnect Port LEDs.........................................................................................................20
Fibre Channel Adapter Port LEDs..........................................................................................21
Converged Network Adapter Port LEDs.................................................................................21
Service Processor LEDs............................................................................................................22
3 Powering Off/On the Storage System..........................................................24
Powering Off the Storage System..............................................................................................24
Powering On the Storage System..............................................................................................24
4 Alerts......................................................................................................26
Getting Recommended Actions.................................................................................................26
5 Troubleshooting........................................................................................28
checkhealth Command............................................................................................................28
Using the checkhealth Command.........................................................................................28
Troubleshooting Storage System Components.............................................................................31
Alert................................................................................................................................32
Format of Possible Alert Exception Messages.....................................................................32
Alert Example...............................................................................................................32
Alert Suggested Action..................................................................................................33
Cabling............................................................................................................................33
Format of Possible Cabling Exception Messages................................................................33
Cabling Example 1.......................................................................................................34
Cabling Suggested Action 1...........................................................................................34
Cabling Example 2.......................................................................................................34
Cabling Suggested Action 2...........................................................................................34
Cage...............................................................................................................................35
Format of Possible Cage Exception Messages...................................................................35
Cage Example 1...........................................................................................................35
Cage Suggested Action 1..............................................................................................35
Cage Example 2...........................................................................................................37
Cage Suggested Action 2..............................................................................................37
Cage Example 3...........................................................................................................37
Cage Suggested Action 3..............................................................................................37
Cage Example 4...........................................................................................................38
Cage Suggested Action 4..............................................................................................38
Cage Example 5...........................................................................................................39
Cage Suggested Action 5..............................................................................................39
Consistency.......................................................................................................................40
Format of Possible Consistency Exception Messages...........................................................40
Consistency Example.....................................................................................................40
Consistency Suggested Action.........................................................................................40
Data Encryption (DAR)........................................................................................................40
Format of Possible DAR Exception Messages.....................................................................41
DAR Suggested Action...................................................................................................41
DAR Example 2............................................................................................................41
DAR Suggested Action 2................................................................................................41
Date................................................................................................................................41
Format of Possible Date Exception Messages.....................................................................41
Date Example...............................................................................................................41
Date Suggested Action..................................................................................................41
File..................................................................................................................................42
File Format of Possible Exception Messages......................................................................42
File Example 1..............................................................................................................42
File Suggested Action 1.................................................................................................42
File Example 2..............................................................................................................42
File Suggested Action 2.................................................................................................42
File Example 3..............................................................................................................43
File Suggested Action 3.................................................................................................43
LD....................................................................................................................................43
Format of Possible LD Exception Messages........................................................................43
LD Example 1...............................................................................................................44
LD Suggested Action 1...................................................................................................44
LD Example 2...............................................................................................................44
LD Suggested Action 2...................................................................................................44
LD Example 3...............................................................................................................45
LD Suggested Action 3...................................................................................................45
LD Example 4...............................................................................................................45
LD Suggested Action 4...................................................................................................45
License.............................................................................................................................46
Format of Possible License Exception Messages.................................................................46
License Example............................................................................................................46
License Suggested Action...............................................................................................46
Network...........................................................................................................................46
Format of Possible Network Exception Messages...............................................................46
Network Example 1......................................................................................................46
Network Suggested Action 1..........................................................................................47
Network Example 2......................................................................................................47
Network Suggested Action 2..........................................................................................47
Node...............................................................................................................................47
Format of Possible Node Exception Messages...................................................................48
Node Suggested Action.................................................................................................48
Node Example 1..........................................................................................................48
Node Suggested Action 1..............................................................................................48
Node Example 2..........................................................................................................49
Node Suggested Action 2..............................................................................................49
Node Example 3..........................................................................................................50
Node Suggested Action 3..............................................................................................50
Example Node 4..........................................................................................................50
Suggested Action Node 4..............................................................................................50
PD...................................................................................................................................51
Format of Possible PD Exception Messages.......................................................................51
PD Example 1...............................................................................................................51
PD Suggested Action 1..................................................................................................51
PD Example 2...............................................................................................................52
PD Suggested Action 2..................................................................................................52
PD Example 3...............................................................................................................54
PD Suggested Action 3..................................................................................................54
PD Example 4...............................................................................................................54
PD Suggested Action 4..................................................................................................54
PD Example 5...............................................................................................................55
PD Suggested Action 5..................................................................................................55
PD Example 6...............................................................................................................55
PD Suggested Action 6..................................................................................................55
PDCH...............................................................................................................................55
Format of Possible PDCH Exception Messages...................................................................55
PDCH Example 1..........................................................................................................55
Suggested PDCH Action 1..............................................................................................56
PDCH Example 2..........................................................................................................56
PDCH Suggested Action 2..............................................................................................56
Port..................................................................................................................................57
Format of Possible Port Exception Messages......................................................................57
Port Suggested Actions...................................................................................................57
Port Example 1.............................................................................................................57
Port Suggested Action 1.................................................................................................58
Port Example 2.............................................................................................................59
Port Suggested Action 2.................................................................................................59
Port Example 3.............................................................................................................59
Port Suggested Action 3.................................................................................................59
Port Example 4.............................................................................................................59
Port Suggested Action 4.................................................................................................59
Port Example 5.............................................................................................................60
Port Suggested Action 5.................................................................................................60
Port Example 6.............................................................................................................60
Port Suggested Action 6.................................................................................................60
Port Example 7.............................................................................................................60
Port Suggested Action 7.................................................................................................61
Port CRC...........................................................................................................................61
Format of Possible Port CRC Exception Messages...............................................................61
Port CRC Example.........................................................................................................61
Port PELCRC......................................................................................................................61
Format of Possible PELCRC Exception Messages................................................................61
Port PELCRC Example.....................................................................................................61
RC...................................................................................................................................62
Format of Possible RC Exception Messages.......................................................................62
RC Example.................................................................................................................62
RC Suggested Action.....................................................................................................62
SNMP..............................................................................................................................62
Format of Possible SNMP Exception Messages..................................................................62
SNMP Example............................................................................................................62
SNMP Suggested Action................................................................................................62
SP....................................................................................................................................62
Format of Possible SP Exception Messages........................................................................63
SP Example..................................................................................................................63
SP Suggested Action......................................................................................................63
Task.................................................................................................................................63
Format of Possible Task Exception Messages.....................................................................63
Task Example...............................................................................................................63
Task Suggested Action...................................................................................................63
VLUN...............................................................................................................................64
Format of Possible VLUN Exception Messages...................................................................64
VLUN Example.............................................................................................................64
VLUN Suggested Action.................................................................................................64
VV...................................................................................................................................64
Format of Possible VV Exception Messages.......................................................................65
VV Suggested Action.....................................................................................................65
Troubleshooting Storage System Setup.......................................................................................65
Storage System Setup Wizard Errors.....................................................................................65
Collecting SmartStart Log Files.............................................................................................71
Collecting Service Processor Log Files...................................................................................71
Contacting HP Support about System Setup...........................................................................72
6 CBIOS Error Codes...................................................................................73
LED Blink Codes.....................................................................................................................73
InForm OS Failed Error Codes and Resolution.............................................................................73
Failed Alerts......................................................................................................................73
CBIOS Degraded Error Codes and Resolution..........................................................................118
Degraded Alerts..............................................................................................................118
7 Support and Other Resources...................................................................133
Contacting HP......................................................................................................................133
HP 3PAR documentation........................................................................................................133
Typographic conventions.......................................................................................................136
HP 3PAR branding information...............................................................................................136
8 Documentation feedback.........................................................................137
1 Identifying Storage System Components
NOTE: The illustrations in this chapter are examples only and may not reflect your storage
system configuration.
Understanding Component Numbering
Due to the large number of possible configurations, component placement and internal cabling are
standardized to simplify installation and maintenance. System components are placed in the rack
according to the principles outlined in this chapter, and are numbered according to their order
and location in the cabinet.
The storage system includes the following types of drive and node enclosures:
• The HP M6710 Drive Enclosure (2U24) holds up to 24, 2.5 inch small form factor (SFF) Serial
Attached SCSI (SAS) disk drives arranged vertically in a single row on the front of the enclosure.
Two 580 W power cooling modules (PCMs) and two I/O modules are located at the rear of
the enclosure.
• The HP M6720 Drive Enclosure (4U24) holds up to 24, 3.5 inch large form factor (LFF) SAS
disk drives, arranged horizontally with four columns of six disk drives located on the front of
the enclosure. Two 580 W PCMs and two I/O modules are located at the rear of the enclosure.
• The HP 3PAR StoreServ 7200 and 7400 (two-node configuration) storage enclosures hold up
to 24, 2.5 inch SFF SAS disk drives arranged horizontally in a single row located on the front
of the enclosure. Two 764 W PCMs and two controller nodes are located at the rear of the
enclosure.
NOTE: In the HP 3PAR Management Console or CLI, the enclosures are displayed as DCS2 for
2U24 (M6710), DCS1 for 4U24 (M6720), and DCN1 for a node enclosure.
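The enclosure type codes in the note above can be kept in a small lookup table when reading Management Console or CLI output. The following Python sketch is illustrative only: the names `EnclosureInfo`, `ENCLOSURE_TYPES`, and `describe_enclosure` are hypothetical and not part of any HP tool, and the values are copied from the enclosure descriptions in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnclosureInfo:
    description: str      # enclosure name as used in this chapter
    drive_bays: int       # maximum number of disk drives
    drive_size_in: float  # drive form factor in inches
    pcm_watts: int        # rating of each of the two PCMs


# Keys are the type codes displayed in the Management Console or CLI.
# The table name and structure are hypothetical, for illustration only.
ENCLOSURE_TYPES = {
    "DCS2": EnclosureInfo("HP M6710 Drive Enclosure (2U24)", 24, 2.5, 580),
    "DCS1": EnclosureInfo("HP M6720 Drive Enclosure (4U24)", 24, 3.5, 580),
    "DCN1": EnclosureInfo("Node enclosure", 24, 2.5, 764),
}


def describe_enclosure(type_code: str) -> str:
    """Return a one-line description for a displayed enclosure type code."""
    code = type_code.strip().upper()
    info = ENCLOSURE_TYPES.get(code)
    if info is None:
        return f"{code}: unknown enclosure type"
    return (f"{code}: {info.description}, up to {info.drive_bays} x "
            f"{info.drive_size_in} inch drives, two {info.pcm_watts} W PCMs")


if __name__ == "__main__":
    for code in ("DCS2", "DCS1", "DCN1"):
        print(describe_enclosure(code))
```

A table like this keeps the code-to-model mapping in one place, so a script that parses enclosure listings does not need to hard-code the 2U24 and 4U24 distinctions.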
Drive Enclosures
The maximum number of supported drive enclosures depends on the model and the number of
nodes.
Disk Drive Numbering
The disk drives are mounted on a drive carrier and reside at the front of the enclosures. There are
two types of disk drives for specific drive carriers:
• Vertical, 2.5 inch SFF disks. The 2U24 enclosure numbering starts with 0 on the left and ends
with 23 on the right. See Figure 1 (page 8).
• Horizontal, 3.5 inch LFF disks. The 4U24 enclosure bays are numbered from 0 on the lower left to
23 on the upper right, in six rows of four. See Figure 2 (page 8).
Figure 1 HP M6710 Drive Enclosure (2U24)
Figure 2 HP M6720 Drive Enclosure (4U24)
Controller Nodes
The controller node caches and manages data in the system, providing a comprehensive, virtualized
view of the system. The controller nodes are located at the rear of the node enclosure.
The HP 3PAR StoreServ 7200 Storage system contains two nodes numbered 0 and 1 (see
Figure 3 (page 8)). The HP 3PAR StoreServ 7400 Storage system has either two nodes or four
nodes. The four-node configuration is numbered 0 and 1 on the bottom, and 2 and 3 on the top
(see Figure 4 (page 9)).
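The node numbering rules above can be expressed as a small helper. This is a minimal sketch, assuming that "bottom" and "top" refer to the lower and upper node enclosures of a four-node system. The function name `node_layout` and the label strings are hypothetical.

```python
# Supported controller node counts per model, as stated in this section.
SUPPORTED_NODE_COUNTS = {"7200": (2,), "7400": (2, 4)}


def node_layout(model: str, node_count: int) -> dict[int, str]:
    """Map each controller node number to the position of its node enclosure."""
    allowed = SUPPORTED_NODE_COUNTS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model}")
    if node_count not in allowed:
        raise ValueError(f"model {model} does not support {node_count} nodes")
    if node_count == 2:
        # A two-node system has a single node enclosure holding nodes 0 and 1.
        return {0: "node enclosure", 1: "node enclosure"}
    # Four-node 7400: nodes 0 and 1 on the bottom, nodes 2 and 3 on the top.
    return {0: "bottom", 1: "bottom", 2: "top", 3: "top"}


if __name__ == "__main__":
    print(node_layout("7200", 2))
    print(node_layout("7400", 4))
```

Raising `ValueError` for unsupported combinations (for example, a four-node 7200) makes a mistyped model or node count fail early instead of producing a misleading layout.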
Figure 3 HP 3PAR StoreServ 7200 Storage Numbering
Figure 4 HP 3PAR StoreServ Four-node Configuration Storage Numbering
PCIe Slots and Ports
Table 1 (page 9) describes the default port configurations for the HP 3PAR StoreServ 7000 Storage
systems.
Table 1 Storage System Expansion Cards
Expansion cards            | Nodes 0 and 1      | Nodes 2 and 3
---------------------------|--------------------|--------------------
2 FC HBAs only             | 1 FC HBA each      | No expansion card
2 10 Gb/s (CNA) only       | 1 10 Gb/s CNA each | No expansion card
2 FC HBAs + 2 10 Gb/s CNAs | 1 FC HBA each      | 1 10 Gb/s CNA each
You can have either a 10 Gb/s Converged Network Adapter (CNA) or a Fibre Channel (FC) card
in the expansion slots of all nodes, or a combination of the two in a four-node system (for example,
two 10 Gb/s CNAs and two FC HBAs).
Each node enclosure must have matching PCIe cards. Figure 5 (page 9) shows the location of
the controller node ports.
NOTE: If you are upgrading from a two-node to a four-node configuration, you can have CNAs
installed in node 0 and node 1, and FC HBAs installed in node 2 and node 3.
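The matching-card rule can be checked mechanically before an upgrade. The sketch below is a hypothetical helper, not an HP utility. It assumes that nodes 0 and 1 share one node enclosure and nodes 2 and 3 share the other, and it uses the illustrative labels "FC" and "CNA" for the two card types, with `None` for an empty expansion slot.

```python
from typing import Optional

# Assumption: nodes 0 and 1 share one node enclosure, nodes 2 and 3 the other.
ENCLOSURE_PAIRS = ((0, 1), (2, 3))
# Illustrative labels: "FC" = FC HBA, "CNA" = 10 Gb/s CNA, None = no card.
VALID_CARDS = {"FC", "CNA", None}


def check_expansion_cards(cards: dict[int, Optional[str]]) -> list[str]:
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    for node in sorted(cards):
        if cards[node] not in VALID_CARDS:
            problems.append(f"node {node}: unsupported card {cards[node]!r}")
    for a, b in ENCLOSURE_PAIRS:
        if a not in cards and b not in cards:
            continue  # enclosure not present (two-node system)
        if a not in cards or b not in cards:
            problems.append(f"nodes {a} and {b}: only one node of the pair is listed")
        elif cards[a] != cards[b]:
            problems.append(
                f"nodes {a} and {b}: mismatched cards {cards[a]!r} and {cards[b]!r}")
    return problems


if __name__ == "__main__":
    # CNAs in nodes 0 and 1, FC HBAs in nodes 2 and 3 (allowed after an upgrade).
    print(check_expansion_cards({0: "CNA", 1: "CNA", 2: "FC", 3: "FC"}))
    # Mismatch within one node enclosure.
    print(check_expansion_cards({0: "FC", 1: "CNA"}))
```

The check compares cards only within each pair, so a mixed system with CNAs in one enclosure and FC HBAs in the other passes, as the note above allows.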
Figure 5 Location of Controller Node Ports
Table 2 Description of Controller Node Ports
Item | Port
-----|-----
1    | 2 Ethernet: MGMT--Connects to the storage array management interfaces; RC--Connects to Remote Copy
2    | Fibre Channel (FC-1 and FC-2)--Connects to host systems
3    | SAS (DP-2 and DP-1)--Connects the drive enclosures and I/O modules using SAS cables
4    | Node Interconnect--Connects four directional interconnect cables that connect the controller nodes (four-node 7400 only)
5    | PCIe slot for optional four-port 8 Gb/s FC HBA or two-port 10 Gb/s CNA
NOTE: The MFG port is not used.
I/O Modules
The I/O modules connect the controller nodes to the hard drives using a SAS cable and enable
data transfer between the nodes, hard drives, PCMs, and enclosures. There are two I/O modules
per drive enclosure, located at the rear and numbered 0 and 1 from bottom to top.
See Figure 6 (page 10).
Figure 6 I/O Module Numbering for HP M6710 (2U) and HP M6720 (4U) Drive Enclosures
NOTE: The I/O modules are located in slots 0 and 1 of the HP M6710 and M6720 drive
enclosures.
Power Cooling Modules
The PCM is an integrated power supply and cooling fan; the node enclosure version also includes a battery. There are two types of PCMs:
• The 580 W is used in drive enclosures and does not include a battery.
• The 764 W is used in node enclosures and includes a replaceable battery.
The PCMs are located at the rear of the storage system, on the sides of the enclosure. There
are two PCMs per enclosure. The PCMs are numbered 0 and 1 from left to right.
Figure 7 PCM Numbering
In the HP M6720 Drive Enclosure, the two PCMs are located diagonally from one another. The
remaining PCM slots are blank. See Figure 8 (page 11).
Figure 8 PCMs in a HP M6710 (2U) and HP M6720 (4U) Drive Enclosures
Power Distribution Units
Two power distribution units (PDUs) are mounted horizontally at the bottom of the rack. The PDUs
are numbered 0 and 1 from bottom to top. The default configuration for the HP Intelligent Series
Racks is two PDUs mounted vertically at the bottom of the rack to provide a front-mounting unit
space.
NOTE: Depending on configuration, PDUs can also be mounted vertically.
Service Processor
The HP 3PAR StoreServ 7000 Storage system uses either a physical service processor (SP) or virtual
service processor (VSP). If your configuration includes an SP, the SP rests at the bottom of the rack
under the enclosures and above the PDUs.
Figure 9 HP 3PAR Service Processor DL 320e
2 Understanding LED Indicator Status
Storage system components have LEDs that indicate the status of the hardware. Use the LED indicators
to help diagnose basic hardware problems. This chapter provides tables and illustrations of
component LEDs.
Enclosure LEDs
Bezel LEDs
The bezel LEDs are located at the front of the system on each side of the drive enclosure. The bezels
have three LED indicators. See Figure 10 (page 12).
Figure 10 Location of Bezel LEDs
Table 3 Description of Bezel LEDs
Callout  LED                LED Appearance  Indicates
1        System Power       Green           On – System power is available.
                            Amber           On – System is running on battery power.
2        Module Fault       Amber           On – System hardware fault to I/O modules or PCMs within
                                            the enclosure. At the rear of the enclosure, identify if
                                            the PCM or I/O module LED is also Amber.
3        Disk Drive Status  Amber           On – There is a disk fault on the system.
NOTE: Prior to running installation scripts, the numeric display located under the Disk Drive Status
LED may not display the proper numeric order in relation to their physical locations. The correct
sequence will be displayed after the installation script is completed.
Disk Drive LEDs
Disk Drive LEDs are located on the front of the disk drives. Disk drives have two LED indicators.
Figure 11 Location of Disk Drive LEDs
Table 4 Description of Disk Drive LEDs
Callout  LED       LED Appearance  Indicates
1        Activity  Green           On – Normal operation
                                   Flashing – Activity
2        Fault     Amber           On – Disk failed and is ready to be replaced.
                                   Flashing – The locatecage command is issued, which blinks all
                                   drive fault LEDs for up to 15 minutes. The I/O module Fault
                                   LEDs at the rear of the enclosure also blink. Fault LEDs for
                                   failed disk drives do not blink.
Storage System Component LEDs
PCM LEDs
The 764 W PCMs are used in controller node enclosures and include six LEDs. The 580 W PCMs
are used in drive enclosures and include four LEDs. The LEDs are located in the corner
of the module.
See Table 5 (page 14) for details of PCM LEDs.
Figure 12 Location of Controller Node PCM LEDs
Table 5 Description of Controller Node PCM LEDs
Description     Appearance  State     Indicates
AC input fail   Amber       On        No AC power or PCM fault
                            Flashing  Firmware download
PCM OK          Green       On        AC present and PCM On / OK
                            Flashing  Standby mode
Fan Fail        Amber       On        PCM fail or PCM fault
                            Flashing  Firmware download
DC Output Fail  Amber       On        No AC power, PCM fault or out of tolerance
                            Flashing  Firmware download
Battery Fail    Amber       On        Hard fault (not recoverable)
                            Flashing  Soft fault (recoverable)
Battery Good    Green       On        Present and charged
                            Flashing  Charging or disarmed
Drive PCM LEDs
The following figure shows the location of drive 580 W PCM LEDs.
See Table 6 (page 15) for details of PCM LEDs.
Figure 13 Location of Drive PCM LEDs
Table 6 Description of Drive PCM LEDs
Description     LED Appearance  State     Indicates
AC input fail   Amber           On        No AC power or PCM fault
                                Flashing  Firmware download
PCM OK          Green           On        AC present and PCM On / OK
                                Flashing  Standby mode
Fan Fail        Amber           On        PCM fail or PCM fault
                                Flashing  Firmware download
DC Output Fail  Amber           On        No AC power, PCM fault or out of tolerance
                                Flashing  Firmware download
I/O Module LEDs
I/O modules are located on the back of the system. I/O modules have two mini-SAS universal
ports, which can be connected to HBAs or other ports. Each port includes External Port Activity
LEDs, labeled 0 to 3. The I/O module also includes a Power and Fault LED.
Figure 14 Location of HP M6710/M6720 I/O Module LEDs
Figure 15 I/O Module Power and Fault LEDs
Table 7 Description of I/O module Power and Fault LEDs
Function  Appearance  State     Indicates
Power     Green       On        Power is on
                      Off       Power is off
Fault     Amber       On        Fault
                      Off       Normal operation
                      Flashing  Locate command issued
External Port Activity LEDs
Figure 16 Location of External Port Activity LEDs
Function                            Appearance  State     Indicates
External Port Activity; 4 LEDs for  Green       On        Ready, no activity
Data Ports 0 through 3                          Off       Not ready or no power
                                                Flashing  Activity
Controller Node and Internal Component LEDs
NOTE: Enter the locatenode command to flash the hotplug LED blue.
Figure 17 Location of Controller Node LEDs
Table 8 Description of Controller Node LEDs
Callout  LED      Appearance  Indicates
1        Status   Green       Node status Good
                              • On – No cluster
                              • Quick Flashing – Boot
                              • Slow Flashing – Cluster
2        Hotplug  Blue        Node FRU Indicator
                              • On – OK to remove
                              • Off – Not OK to remove
                              • Flashing – locatenode command has been issued
3        Fault    Amber       Node status Fault
                              • On – Fault
                              • Off – No fault
                              • Flashing – Node in cluster and there is a fault
Ethernet LEDs
The controller node has two built-in Ethernet ports. Each built-in Ethernet port has two LEDs.
Figure 18 Location of Ethernet LEDs
Table 9 Description of Ethernet LEDs
Callout  LED            Appearance  Indicates
1        Link Up Speed  Green       On – 1 GbE Link
                        Amber       On – 100 Mb Link
                                    Off – No link established or 10 Mb Link
2        Activity       Green       On – No Link activity
                                    Off – No link established
                                    Flashing – Link activity
FC Port LEDs
The controller node has two FC ports. Each FC port has two LEDs. The arrow-head shaped LEDs
point to the associated port.
Figure 19 Location of FC Port LEDs
Table 10 Description of FC Port LEDs
Port       LED       LED Appearance  Indicates
All ports  No light  Off             Wake up failure (dead device) or power is not applied
FC-1       Amber     Off             Not connected
                     3 fast blinks   Connected at 4 Gb/s
                     4 fast blinks   Connected at 8 Gb/s
FC-2       Green     On              Normal/Connected – link up
                     Flashing        Link down or not connected
SAS Port LEDs
The controller node has two SAS ports. Each SAS port has four LEDs, numbered 0 to 3.
Figure 20 Location of SAS Port LEDs
Table 11 Description of SAS port LEDs
Callout  LED   Appearance  Indicates
1        DP-1  Green       Off – SAS link is present or not, this LED does not remain lit
                           Flashing – Activity on port
2        DP-2  Green       Off – SAS link is present or not, this LED does not remain lit
                           Flashing – Activity on port
Interconnect Port LEDs
The controller node has two interconnect ports. Each interconnect port includes two LEDs.
Figure 21 Location of Interconnect Port LEDs
Table 12 Description of Interconnect Port LEDs
Callout  LED     Appearance  Indicates
1        Status  Green       On – Link established
                             Off – Link not yet established
2        Fault   Amber       On – Failed to establish link connection
                             Off – No errors currently on link
                             Flashing – Cluster link cabling error, controller node in wrong
                             slot, or serial number mismatch between controller nodes.
Fibre Channel Adapter Port LEDs
The Fibre Channel adapter in the controller node includes Fibre Channel port LEDs:
Figure 22 Location of Fibre Channel Adapter Port LEDs
Table 13 Description of Fibre Channel Adapter Port LEDs
Callout  LED          Appearance  Indicates
--       All ports    No light    Off – Wake up failure (dead device) or power is not applied
1        Port speed   Amber       Off – Not connected
                                  3 fast blinks – Connected at 4 Gb/s.
                                  4 fast blinks – Connected at 8 Gb/s.
2        Link status  Green       On – Normal/Connected - link up
                                  Flashing – Link down or not connected
Converged Network Adapter Port LEDs
The CNA in the controller node includes two ports. Each port has a Link and Activity LED.
Figure 23 Location of CNA Port LEDs
Table 14 Description of CNA Port LEDs
Callout  LED             Appearance  Indicates
1        Link            Green       Off – Link down
                                     On – Link up
2        ACT (Activity)  Green       Off – No activity
                                     On – Activity
Service Processor LEDs
The HP 3PAR SP (ProLiant DL320e) LEDs are located at the front and rear of the SP.
Figure 24 Front Panel LEDs
Table 15 Front panel LEDs
Item  LED                          Appearance      Description
1     UID LED/button               Blue            Active
                                   Flashing Blue   System is being managed remotely
                                   Off             Deactivated
2     Power On/Standby button and  Green           System is on
      system power                 Flashing Green  Waiting for power
                                   Amber           System is on standby, power still on
                                   Off             Power cord is not attached or power supply
                                                   has failed
3     Health                       Green           System is on and system health is normal
                                   Flashing Amber  System health is degraded
                                   Flashing Red    System health is critical
                                   Off             System power is off
4     NIC status                   Green           Linked to network
                                   Flashing Green  Network activity
                                   Off             No network link
Figure 25 Rear Panel LEDs
Table 16 Rear panel LEDs
Item  LED             Appearance               Description
1     NIC link        Green                    Link
                      Off                      No link
2     NIC status      Green or Flashing Green  Activity
                      Off                      No activity
3     UID LED/button  Blue                     Active
                      Flashing Blue            System is being managed remotely
                      Off                      Deactivated
4     Power supply    Green                    Normal
                      Off                      One or more of the following conditions:
                                               • Power is unavailable
                                               • Power supply has failed
                                               • Power supply is in standby mode
                                               • Power supply error
NOTE: The power supply LED may not be applicable to your system (for hot-plug HP CS
power supplies ONLY).
3 Powering Off/On the Storage System
This chapter describes how to power the storage system on and off.
Powering Off the Storage System
NOTE: Power distribution units (PDU) in any expansion cabinets connected to the storage system
may need to be shut off. Use the locatesys command to identify all connected cabinets before
shutting down the system. The command blinks all node and drive enclosure LEDs.
Before you power off, use either SPmaint or SPOCC to shut down the system (see Service Processor
Onsite Customer Care in the HP 3PAR StoreServ 7000 Storage Service Guide).
The system must be shut down before powering off by using any of the following three methods:
Using SPOCC
1. Select InServ Product Maintenance.
2. Select Halt an InServ cluster/node.
3. Follow the prompts to shut down a cluster. Do not shut down individual nodes.
4. Turn off power to the node PCMs.
5. Turn off power to the drive enclosure PCMs.
6. Turn off all PDUs in the rack.
Using SPmaint
1. Select option 4 (InServ Product Maintenance).
2. Select Halt an InServ cluster/node.
3. Follow the prompts to shut down a cluster. Do not shut down individual nodes.
4. Turn off power to the node PCMs.
5. Turn off power to the drive enclosure PCMs.
6. Turn off all PDUs in the rack.
Using CLI Directly on the Controller Node if the SP is Inaccessible
1. Enter the CLI command shutdownsys halt. Confirm all prompts.
2. Allow 2 to 3 minutes for the node to halt, then verify that the node Status LED is flashing green
and the node hotplug LED is blue, indicating that the node has been halted. For information
about LED status, see “Understanding LED Indicator Status” (page 12).
CAUTION: Failure to wait until all controller nodes are in a halted state could cause the
system to view the shutdown as uncontrolled and place the system in a checkld state upon
power up. This can seriously impact host access to data.
3. Turn off power to the node PCMs.
4. Turn off power to the drive enclosure PCMs.
5. Turn off power to all PDUs in the rack.
Powering On the Storage System
1. Set the circuit breakers on the PDUs to the ON position.
2. Set the switches on the power strips to the ON position.
3. Power on the drive enclosure PCMs.
NOTE: To avoid any cabling errors, all drive enclosures must have at least one hard drive
installed before powering on the enclosure.
4. Power on the node enclosure PCMs.
5. Verify the status of the LEDs. See “Understanding LED Indicator Status” (page 12).
4 Alerts
Alerts are triggered by events that require system administrator intervention. This chapter provides
a list of alerts identified by message code, the messages, and what action should be taken for
each alert. To learn more about alerts, see the HP 3PAR StoreServ Storage Concepts Guide.
For information about system alerts, go to HP Guided Troubleshooting at
http://www.hp.com/support/hpgt/3par and select your server platform.
To view the alerts, use the showalert command. Alert message codes have seven digits in the
schema AAABBBB, where:
• AAA is a 3-digit major code
• BBBB is a 4-digit sub-code
• 0x precedes the code to indicate hexadecimal notation
NOTE: Message codes ending in de indicate a degraded state alert. Message codes ending in
fa indicate a failed state alert.
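The code layout described above can be illustrated with a short Python sketch. This is not part of the HP 3PAR tooling: the function name parse_alert_code and the sample value used to exercise it are hypothetical, and the sketch assumes the code is written as seven hexadecimal digits with an optional 0x prefix.

```python
from typing import NamedTuple


class AlertCode(NamedTuple):
    major: str  # 3-digit major code (AAA)
    sub: str    # 4-digit sub-code (BBBB)
    state: str  # "degraded", "failed", or "other"


def parse_alert_code(code: str) -> AlertCode:
    """Split a message code of the form 0xAAABBBB into its parts.

    Assumption: the code is seven hexadecimal digits, optionally
    preceded by 0x. Anything else is rejected.
    """
    digits = code.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 7:
        raise ValueError(f"expected 7 hex digits, got {code!r}")
    int(digits, 16)  # raises ValueError if the text is not hexadecimal

    # Per the NOTE above: codes ending in "de" are degraded-state alerts,
    # codes ending in "fa" are failed-state alerts.
    if digits.endswith("de"):
        state = "degraded"
    elif digits.endswith("fa"):
        state = "failed"
    else:
        state = "other"
    return AlertCode(major=digits[:3], sub=digits[3:], state=state)
```

For example, a made-up code such as 0x01e00de would be split into major code 01e and sub-code 00de, and reported as a degraded state alert.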
See the HP 3PAR OS Command Line Interface Reference for complete information on the display
options on the event logs.
Table 17 Alert Severity Levels
Severity       Description
Fatal          A fatal event has occurred. It is no longer possible to take remedial action.
Critical       The event is critical and requires immediate action.
Major          The event requires immediate action.
Minor          An event has occurred that requires action, but the situation is not yet serious.
Degraded       An aspect of performance or availability may have become degraded. You must
               determine whether action is necessary.
Informational  The event is informational. No action is required other than to acknowledge or
               remove the alert.
Getting Recommended Actions
For disk drive alerts, the component line in the right column lists the cage number, magazine
number, and drive number (cage:magazine:disk). The first and second numbers are sufficient to
identify the exact disk in an HP 3PAR StoreServ 7000 Storage system, since there is always only
a single disk (disk 0) in a single magazine.
1. Follow the link to alert actions under Recommended Actions.
2. At the HP Storage Systems Guided Troubleshooting website, follow the link for your product.
3. At the bottom of the HP 3PAR product page, click the link for HP 3PAR Alert Messages.
4. At the bottom of the Alert Messages page, choose the correct message code series based on
the first four characters of the alert.
5. Choose the link that matches the first five characters of the message code.
6. On the next page, select the message code that matches the code in the alert.
The next page shows the message type based on the message code selected and provides a
link to the suggested action.
7. Follow the link.
8. On the suggested actions page, scroll through the list to find the message state listed in the
alert message. The recommended action is listed next to the message state.
5 Troubleshooting
The HP 3PAR OS CLI checkhealth command checks and displays the status of storage system
hardware and software components. For example, the checkhealth command can check for
unresolved system alerts, display issues with hardware components, or display information about
virtual volumes that are not optimal.
By default the checkhealth command checks most storage system components, but you can also
check the status of specific components. For a complete list of storage system components analyzed
by the checkhealth command, see “checkhealth Command” (page 28).
The checkhealth -svc option is available only to users with Super CLI accounts. The -svc
option provides a summary of service related issues by default. If you use the -detail option,
both a summary and a detailed list of service issues are displayed. The service information displayed
is for service providers only, because it may produce cryptic output that only a service provider
understands, or it displays issues that only a service provider can resolve. The -svc option displays
the service related information in addition to the customer related information.
Alerts are processed by the SP. The HP Business Support Center (BSC) takes action on alerts that
are not customer administration alerts. Customer administration alerts are managed by customers.
The SP also runs the checkhealth command once an hour and sends the information to the BSC
where the information is monitored periodically for unusual system conditions.
checkhealth Command
The checkhealth command checks and displays the status of system hardware and software
components.
Command syntax is: checkhealth [<options> | <component>...]
Command authority is Super, Service
Command options are listed:
• -list, lists all components that checkhealth can analyze
• -quiet, suppresses the display of the item currently being checked
• -detail, displays detailed information regarding the status of the system
• -svc, performs service related checks on the system and reports the status. This is a hidden
option and does not appear in the CLI Help. This option is not intended for customers and is
available only to Service users
• -full, displays information about the status of the full system. This is a hidden option and it
does not appear in the CLI Help. This option has no effect if the -svc option is omitted. Some
of the additional components evaluated take longer to run than other components
The <component> is the command specifier, which indicates the component to check. Use the
-list option to view the list of components.
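For scripted use, the option rules above could be captured in a small helper that assembles the command line before it is sent to the CLI. The following Python sketch is illustrative only: build_checkhealth_args is a hypothetical name, and whether options and component names may be combined in a single invocation is an assumption, since the syntax line above can be read either way.

```python
from typing import Iterable, List


def build_checkhealth_args(
    components: Iterable[str] = (),
    svc: bool = False,
    full: bool = False,
    detail: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Assemble a checkhealth command line as a list of arguments.

    Hypothetical helper: it only builds the argument list and does not
    contact a storage system.
    """
    # The guide states that -full has no effect when -svc is omitted,
    # so treat that combination as a caller mistake.
    if full and not svc:
        raise ValueError("-full has no effect unless -svc is also given")

    args = ["checkhealth"]
    if svc:
        args.append("-svc")
    if full:
        args.append("-full")
    if detail:
        args.append("-detail")
    if quiet:
        args.append("-quiet")
    # Assumption: component specifiers follow the options.
    args.extend(components)
    return args
```

Called with components=["node", "pd"], the helper yields the arguments for checkhealth node pd; called with svc=True and detail=True, it yields checkhealth -svc -detail.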
Using the checkhealth Command
Use the checkhealth command without any specifiers to check the health of all the components
that can be analyzed by the checkhealth command.
The following example lists both summary and detailed information about the hardware and
software components:
cli% checkhealth -detail
Checking alert
Checking cage
Checking dar
Checking date
Checking ld
Checking license
Checking network
Checking node
Checking pd
Checking port
Checking rc
Checking snmp
Checking task
Checking vlun
Checking vv
Component -----------Description----------- Qty
Alert New alerts 4
Date Date is not the same on all nodes 1
LD LDs not mapped to a volume 2
License Golden License. 1
vlun Hosts not connected to a port 5
The following information is reported with the -detail option:
Component ----Identifier---- -----------Description-------
Alert sw_port:1:3:1 Port 1:3:1 Degraded (Target Mode Port Went Offline)
Alert sw_port:0:3:1 Port 0:3:1 Degraded (Target Mode Port Went Offline)
Alert sw_sysmgr Total available FC raw space has reached threshold of 800G
(2G remaining out of 544G total)
Alert sw_sysmgr Total FC raw space usage at 307G (above 50% of total 544G)
Date -- Date is not the same on all nodes
LD ld:name.usr.0 LD is not mapped to a volume
LD ld:name.usr.1 LD is not mapped to a volume
vlun host:group01 Host wwn:2000000087041F72 is not connected to a port
vlun host:group02 Host wwn:2000000087041F71 is not connected to a port
vlun host:group03 Host iscsi_name:2000000087041F71 is not connected to a port
vlun host:group04 Host wwn:210100E08B24C750 is not connected to a port
vlun host:Host_name Host wwn:210000E08B000000 is not connected to a port
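When the summary portion of the output above is captured as text, it can be reduced to (component, description, quantity) records for reporting. The Python sketch below is a minimal example under stated assumptions: parse_summary is a hypothetical name, and the parser assumes each summary row starts with a one-word component name and ends with an integer quantity, as in the sample above.

```python
from typing import Iterable, List, Tuple


def parse_summary(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Extract (component, description, qty) rows from checkhealth summary text.

    Assumption: a summary row is "<component> <description...> <qty>".
    Progress lines ("Checking ..."), the header row, and detail rows that
    do not end in an integer are skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # e.g. "Checking alert"
        if parts[0] == "Component" or not parts[-1].isdigit():
            continue  # header row or a row without a trailing quantity
        rows.append((parts[0], " ".join(parts[1:-1]), int(parts[-1])))
    return rows


# Sample text copied from the summary shown above.
SAMPLE_SUMMARY = [
    "Checking alert",
    "Component -----------Description----------- Qty",
    "Alert New alerts 4",
    "Date Date is not the same on all nodes 1",
    "LD LDs not mapped to a volume 2",
    "License Golden License. 1",
    "vlun Hosts not connected to a port 5",
]
```

Calling parse_summary(SAMPLE_SUMMARY) returns one tuple per summary row, which can then be sorted or filtered by component.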
If there are no faults or exception conditions, the checkhealth command indicates the system
is healthy:
cli% checkhealth
Checking alert
Checking cage
…
Checking vlun
Checking vv
System is healthy
Use the <component> specifier to check the status of one or more specific storage system
components. For example:
cli% checkhealth node pd
Checking node
Checking pd
The following components are healthy: node, pd
The -svc option provides a summary of service related issues by default. If you use the -detail
option, both a summary and a detailed list of service issues are displayed. The -svc option displays
the service related information in addition to the customer related information.
The following example displays information intended only for service users:
cli% checkhealth -svc
Checking alert
Checking cabling
Checking cage
...
Checking vlun
Checking vv
Component -------------------Description------------------- Qty
Alert New alerts 2
File Nodes with Dump or HBA core files 1
PD There is an imbalance of active pd ports 1
PD PDs that are degraded or failed 2
pdch LDs with chunklets on a remote disk 2
pdch LDs with connection path different than ownership 2
Port Missing SFPs 6
The following information is included with the -detail option. The detailed output can be very
long if a node or cage is down.
cli% checkhealth -svc -detail
Checking alert
Checking cabling
Checking cage
...
Checking vlun
Checking vv
Component -------------------Description------------------- Qty
Alert New alerts 2
File Nodes with Dump or HBA core files 1
PD There is an imbalance of active pd ports 1
PD PDs that are degraded or failed 2
pdch LDs with chunklets on a remote disk 2
pdch LDs with connection path different than ownership 2
Port Missing SFPs 6
Component --------Identifier--------- ----------------Description---------------------
Alert hw_cage_sled:3:8:3,sw_pd:91 Magazine 3:8:3, Physical Disk 91 Degraded (Prolonged Missing B Port)
Alert hw_cage_sled:N/A,sw_pd:54 Magazine N/A, Physical Disk 54 Failed (Prolonged Missing, Missing A Port, Missing B Port)
File node:0 Dump or HBA core files found
PD disk:54 Detailed State: prolonged_missing
PD disk:91 Detailed State: prolonged_missing_B_port
PD -- There is an imbalance of active pd ports
pdch LD:35 Connection path is not the same as LD ownership
pdch LD:54 Connection path is not the same as LD ownership
pdch ld:35 LD has 1 remote chunklets
pdch ld:54 LD has 10 remote chunklets
Port port:2:2:3 Port or devices attached to port have experienced
within the last day
To check for inconsistencies between the System Manager and kernel states and CRC errors for
FC and SAS ports, use the -full option:
checkhealth -svc -full
checkhealth -list -svc -full
Component -----------------------------------Description------------------------------
alert Displays any non-resolved alerts.
cabling Displays any cabling errors.*
cage Displays non-optimal drive cage conditions.
consistency Displays inconsistencies between sysmgr and kernel**
dar Displays Data Encryption issues.
date Displays if nodes have different dates.
file Displays non-optimal file system conditions.*
host Checks for FC host ports that are not configured for virtual port support.*
ld Displays non-optimal LDs.
license Displays license violations.
network Displays ethernet issues.
node Displays non-optimal node conditions.
pd Displays PDs with non-optimal states or conditions.
pdch Displays chunklets with non-optimal states.*
port Displays port connection issues.
portcrc Checks for increasing port CRC errors.**
portpelcrc Checks for increasing SAS port CRC errors.**
rc Displays Remote Copy issues.
snmp Displays issues with SNMP.
sp Checks the status of connection between sp and nodes.*
task Displays failed tasks.
vlun Displays inactive VLUNs and those which have not been reported by the host
agent.
vv Displays non-optimal VVs.
NOTE:
• One asterisk (*) at the end of the output indicates that it is checked only if -svc is part of the
command.
• Two asterisks (**) at the end of the output indicate that it is checked only if -svc -full is
part of the command.
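The asterisk convention in the listing above lends itself to a simple lookup. The following Python sketch shows one way to derive, from a single line of the listing, which options a component requires. The function name required_options is hypothetical, and the sketch assumes the asterisks are the last characters on the line.

```python
def required_options(listing_line: str) -> str:
    """Return the options a component needs, based on trailing asterisks.

    Assumption: one trailing asterisk means the component is checked only
    with -svc, two mean it is checked only with -svc -full, and none means
    no extra option is needed.
    """
    text = listing_line.rstrip()
    if text.endswith("**"):
        return "-svc -full"
    if text.endswith("*"):
        return "-svc"
    return ""
```

For the line "portcrc Checks for increasing port CRC errors.**" the function returns "-svc -full"; for "cage Displays non-optimal drive cage conditions." it returns an empty string.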
Troubleshooting Storage System Components
Use the checkhealth -list command to list all components that can be analyzed by the
checkhealth command.
For detailed troubleshooting information about specific components, including examples and
suggested actions for correcting issues with components, see the component names in Table 18 (page 32).
Table 18 Component Functions
Component    Function
Alert        Displays unresolved alerts
Cabling      Displays any cabling errors
Cage         Displays drive cage conditions that are not optimal
Consistency  Displays inconsistencies between sysmgr and the kernel
Dar          Displays data encryption issues
Date         Displays if nodes have different dates
File         Displays file system conditions that are not optimal
Host         Checks for FC host ports that are not configured for virtual port support
LD           Displays LDs that are not optimal
License      Displays license violations
Network      Displays Ethernet issues
Node         Displays node conditions that are not optimal
PD           Displays PDs with states or conditions that are not optimal
PDCH         Displays chunklets with states that are not optimal
Port         Displays port connection issues
Portcrc      Checks for increasing port CRC errors
Portpelcrc   Checks for increasing SAS port CRC errors
RC           Displays Remote Copy issues
SNMP         Displays issues with SNMP
SP           Checks the status of Ethernet connections between the Service Processor and
             nodes, when run from the SP
Task         Displays failed tasks
VLUN         Displays inactive VLUNs and VLUNs that have not been reported by the host agent
VV           Displays VVs that are not optimal
Alert
Displays unresolved alerts and shows any alerts generated by showalert -n.
Format of Possible Alert Exception Messages
Alert <component> <alert_text>
Alert Example
Component -Identifier- --------Description--------------------
Alert hw_cage:1 Cage 1 Degraded (Loop Offline)
Alert sw_cli 11 authentication failures in 120 secs
Alert Suggested Action
View the full Alert output using the MC (GUI) or the showalert -d CLI command.
Cabling
Displays any cabling errors.
Checks for compliance of standard cabling rules between nodes and drive cages (same slot and
port numbers on two different nodes to a cage).
NOTE: To avoid any cabling errors, all drive enclosures must have at least one hard drive
installed before powering on the enclosure.
Format of Possible Cabling Exception Messages
Cabling Bad SAS connection 20
Cabling cage1 Check connections or replace cable from (cage0, I/O 0, DP-1)
to (cage1, I/O 0, DP-1)
Cabling Unexpected cage found 24
Cabling -- Unexpected cage found on node3 DP-2
Cabling Wrong I/O or port 1
Cabling cage8 Cable in (cage8, I/O 1, Mfg) should be in (cage8, I/O 1, DP-2)
Cabling SAS cabling check incomplete 1
Cabling cage8 All three SAS ports of I/O 1 used, cabling check incomplete
Cabling Incorrect drive cage chaining 1
Cabling cage5 Cable in (cage5, I/O 1, DP-2) should be in (cage9, I/O 1, DP-1)
Cabling Mismatched cage order 1
Cabling cage0 node1 DP-2 should be cabled in the order: cage9 cage8 cage7
cage6 cage5
Cabling Cable chains are unbalanced 1
Cabling cage0 node0 DP-2 has 5 cages, node1 DP-2 has 4 cages
Cabling Missing I/O module 1
Cabling cage5 I/O 1 missing. Check status and cabling to cage5 I/O 1
Cabling Cable chain too long 1
Cabling cage0 node1 DP-2 has 6 cages connected, Maximum is 5 (cage9 cage8
cage7 cage6 cage5 cage11)
Cabling Cages not connected to paired nodes 1
Cabling cage11 Cage connected to non-paired nodes node1 DP-2 and node2 DP-2
Cabling Cages cabled to too many nodes 6
Cabling cage5 Cabled to node0 DP-2 and node1 DP-2 and node3 DP-2, remove a
cable from node3
Cabling Multiple node ports on a single cable chain 1
Cabling cage11 Cage is connected to too many node ports (node2 DP-1 & DP-2
and node3 DP-1 & DP-2)
Cabling Cages cabled to nodes twice 1
Cabling cage11 Cabled to node2 DP-2 and node3 DP-1 & DP-2, remove a cable from
node3
Cabling Cages with multiple paths to node ports 1
Cabling cage11 Cage has multiple paths to node2 DP-2 and node3 DP-2, correct
cabling
Cabling Cages with multiple paths to node ports 2
Cabling cage5 Cage has multiple paths to node0 DP-2, correct cabling
Cabling Cages cabled to a single node 1
Cabling cage11 Cage not connected to node2, move one connection from node3 to
node2
Cabling Cages not connected to same slot & port 1
Cabling cage11 Cage connected to different ports node2 DP-1 and node3 DP-2
Cabling Example 1
Component -Identifier- ---Description--
Cabling cage:3 Missing Port
Cabling Suggested Action 1
Check the status of the nodes, FC ports, cage, and paths to the drive cage using CLI commands
such as showcage, showcage -d, showpd, and shownode. If a node is offline, multiple
cages are affected.
cli% showcage cage3
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
3 cage3 --- 0 3:0:4 0 28 27-34 2.33 2.33 DC2 n/a
cli% showpd -p -cg 3
---Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
48 3:0:0 FC 10 degraded 139520 120320 ----- 3:0:4*
49 3:0:1 FC 10 degraded 139520 126464 ----- 3:0:4*
50 3:0:2 FC 10 degraded 139520 120320 ----- 3:0:4*
51 3:0:3 FC 10 degraded 139520 126464 ----- 3:0:4*
cli% showpd -p -cg 3 -path
-------Paths-------
Id CagePos Type -State-- A B Order
48 3:0:0 FC degraded 2:0:4missing 3:0:4 3/-
49 3:0:1 FC degraded 2:0:4missing 3:0:4 3/-
50 3:0:2 FC degraded 2:0:4missing 3:0:4 3/-
51 3:0:3 FC degraded 2:0:4missing 3:0:4 3/-
cli% showport
2:0:4 initiator offline 2FF70002AC00054C 22040002AC00054C free
or
2:0:4 initiator loss_sync 2FF70002AC00054C 22040002AC00054C free
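When many drives are affected, the path listing above can be scanned programmatically for missing paths. The Python sketch below is illustrative: missing_paths is a hypothetical name, and it assumes a missing path appears as a port position immediately followed by the word missing (for example 2:0:4missing), as in the sample output.

```python
import re
from typing import Dict, Iterable, List

# A path column such as "2:0:4missing" in showpd -path output.
PATH_MISSING = re.compile(r"^(\d+:\d+:\d+)missing$")


def missing_paths(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Map each physical disk Id to the port positions reported as missing.

    Assumption: data rows start with a numeric disk Id; header and
    separator rows do not.
    """
    result: Dict[int, List[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header, separator, or blank line
        for field in fields[1:]:
            match = PATH_MISSING.match(field)
            if match:
                result.setdefault(int(fields[0]), []).append(match.group(1))
    return result
```

Applied to the four data rows shown above, the function reports port 2:0:4 as missing for disks 48 through 51, which points at a single node port rather than at the individual drives.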
Cabling Example 2
Component -Identifier- --------Description------------------------
Cabling cage:0 Not connected to the same slot & port
Cabling Suggested Action 2
The recommended and factory-default node-to-drive-chassis (cage) cabling configuration is to
connect a drive cage to a node pair (two nodes), generally nodes 0/1, 2/3, 4/5, or 6/7, and to achieve
symmetry between slots and ports (use the same slot and port on each node to a cage). In the next
example, cage0 is incorrectly connected to either slot-0 of node-0 or slot-1 of node-1.
cli% showcage cage0
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
0 cage0 0:0:1 0 1:1:1 0 24 28-38 2.37 2.37 DC2 n/a
After determining the desired cabling and reconnecting correctly to slot-0 and port-1 of nodes 0
& 1, the output should look like this:
cli% showcage cage0
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
0 cage0 0:0:1 0 1:0:1 0 24 28-38 2.37 2.37 DC2 n/a
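The symmetry rule described above can be checked mechanically from the LoopA and LoopB columns of showcage. The Python sketch below is one possible approach, not the check that checkhealth itself performs. The names parse_port and check_cage_cabling are hypothetical, and the sketch assumes each loop value is a node:slot:port string or --- when the loop is absent.

```python
from typing import List, Tuple


def parse_port(position: str) -> Tuple[int, int, int]:
    """Split a node:slot:port string such as '1:0:1' into three integers."""
    node, slot, port = (int(part) for part in position.split(":"))
    return node, slot, port


def check_cage_cabling(loop_a: str, loop_b: str) -> List[str]:
    """Return a list of problems with one cage's two node connections.

    Assumptions: node pairs are 0/1, 2/3, 4/5, and 6/7, and a correctly
    cabled cage uses the same slot and port on both nodes of the pair.
    """
    if "---" in (loop_a, loop_b):
        return ["a loop is missing"]

    node_a, slot_a, port_a = parse_port(loop_a)
    node_b, slot_b, port_b = parse_port(loop_b)
    problems: List[str] = []

    low, high = sorted((node_a, node_b))
    if low % 2 != 0 or high != low + 1:
        problems.append("nodes are not a pair")
    if (slot_a, port_a) != (slot_b, port_b):
        problems.append("slot and port differ between nodes")
    return problems
```

With the values from the first showcage output above (0:0:1 and 1:1:1) the function reports that the slot and port differ between nodes; with the corrected values (0:0:1 and 1:0:1) it returns an empty list.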
Cage
Displays drive cage conditions that are not optimal and reports exceptions if any of the following
do not have normal states:
• Ports
• Drive magazine states (DC1, DC2, & DC4)
• Small form-factor pluggable (SFP) voltages (DC2 and DC4)
• SFP signal levels (RX power low and TX failure)
• Power supplies
• Cage firmware (is not current)
Reports if a servicecage operation has been started and has not ended.
Format of Possible Cage Exception Messages
Cage cage:<cageid> "Missing A loop" (or "Missing B loop")
Cage cage:<cageid> "Interface Card <STATE>, SFP <SFPSTATE>" (is unqualified, is
disabled, Receiver Power Low: Check FC Cable, Transmit Power Low: Check FC Cable, has
RX loss, has TX fault)
Cage cage:<cageid>,mag:<magpos> "Magazine is <MAGSTATE>"
Cage cage:<cageid> "Power supply <X> fan is <FANSTATE>"
Cage cage:<cageid> "Power supply <X> is <PSSTATE>" (Degraded, Failed, Not_Present)
Cage cage:<cageid> "Power supply <X> AC state is <PSSTATE>"
Cage cage:<cageid> "Cage is in 'servicing' mode (Hot-Plug LED may be illuminated)"
Cage cage:<cageid> "Firmware is not current"
Cage Example 1
Component -------------Description-------------- Qty
Cage Cages missing A loop 1
Cage SFPs with low receiver power 1
Component -Identifier- --------Description------------------------
Cage cage:4 Missing A loop
Cage cage:4 Interface Card 0, SFP 0: Receiver Power Low: Check FC Cable
Cage Suggested Action 1
Check the connection/path to the SFP in the cage and the level of signal the SFP is receiving. An
RX Power reading below 100 µW signals the RX Power Low condition; typical readings are between
300 and 400 µW. Useful CLI commands are showcage -d and showcage -sfp -ddm.
At least two connections are expected for drive cages, and this exception is flagged if that is not
the case.
cli% showcage -d cage4
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
4 cage4 --- 0 3:2:1 0 8 28-36 2.37 2.37 DC4 n/a
-----------Cage detail info for cage4 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed 0Gbps -- -- 4Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0 0 OK FINISAR CORP. 4.1 No No Yes Yes
1 1 OK FINISAR CORP. 4.1 No No No Yes
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Off Off
Link A TXLEDs Green Off
Link B RXLEDs Off Green
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Green,Off Green,Off
-----------Midplane Info-----------
Firmware_status Current
Product_Rev 2.37
State Normal Op
Loop_Split 0
VendorId,ProductId 3PARdata,DC4
Unique_ID 1062030000098E00
...
-------------Drive Info------------- ----LoopA----- ----LoopB-----
Drive NodeWWN LED Temp(C) ALPA LoopState ALPA LoopState
0:0 2000001d38c0c613 Green 33 0xe1 Loop fail 0xe1 OK
0:1 2000001862953510 Green 35 0xe0 Loop fail 0xe0 OK
0:2 2000001862953303 Green 35 0xdc Loop fail 0xdc OK
0:3 2000001862953888 Green 31 0xda Loop fail 0xda OK
cli% showcage -sfp cage4
Cage FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
4 0 0 OK FINISAR CORP. 4.1 No No Yes Yes
4 1 1 OK FINISAR CORP. 4.1 No No No Yes
cli% showcage -sfp -ddm cage4
---------Cage 4 Fcal 0 SFP 0 DDM----------
-Warning- --Alarm--
--Type-- Units Reading Low High Low High
Temp C 33 -20 90 -25 95
Voltage mV 3147 2900 3700 2700 3900
TX Bias mA 7 2 14 1 17
TX Power uW 394 79 631 67 631
RX Power uW 0 15 794 10* 1259
---------Cage 4 Fcal 1 SFP 1 DDM----------
-Warning- --Alarm--
--Type-- Units Reading Low High Low High
Temp C 31 -20 90 -25 95
Voltage mV 3140 2900 3700 2700 3900
TX Bias mA 8 2 14 1 17
TX Power uW 404 79 631 67 631
RX Power uW 402 15 794 10 1259
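The RX Power Low rule can be applied to captured showcage -sfp -ddm output with a short script. The following Python sketch is illustrative only and is not part of the HP 3PAR CLI; the function name rx_power_low and the parsing rules are assumptions based on the sample output above.

```python
import re

RX_POWER_LOW_UW = 100  # threshold stated in this guide


def rx_power_low(ddm_text):
    """Return (cage, fcal, sfp, reading_uW) for each RX Power reading below 100 uW."""
    flagged = []
    current = None
    for line in ddm_text.splitlines():
        # Section headers look like: ---------Cage 4 Fcal 0 SFP 0 DDM----------
        header = re.search(r"Cage (\d+) Fcal (\d+) SFP (\d+) DDM", line)
        if header:
            current = tuple(int(group) for group in header.groups())
            continue
        fields = line.split()
        # Data rows look like: RX Power uW <reading> <warn low> <warn high> ...
        if current and fields[:3] == ["RX", "Power", "uW"]:
            reading = int(fields[3])
            if reading < RX_POWER_LOW_UW:
                flagged.append(current + (reading,))
    return flagged


# Rows copied from the cage4 example in this section.
SAMPLE = """\
---------Cage 4 Fcal 0 SFP 0 DDM----------
RX Power uW 0 15 794 10* 1259
---------Cage 4 Fcal 1 SFP 1 DDM----------
RX Power uW 402 15 794 10 1259
"""

print(rx_power_low(SAMPLE))
```

With the sample rows, only Cage 4, FCAL 0, SFP 0 is reported, which matches the checkhealth exception in this example.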
Cage Example 2
Component -------------Description-------------- Qty
Cage Degraded or failed cage power supplies 2
Cage Degraded or failed cage AC power 1
Component -Identifier- ------------Description------------
Cage cage:1 Power supply 0 is Failed
Cage cage:1 Power supply 0's AC state is Failed
Cage cage:1 Power supply 2 is Off
Cage Suggested Action 2
A cage power supply or power supply fan has failed, is missing input AC power, or the switch is
turned OFF. The showcage -d cageX and showalert commands provide more detail.
cli% showcage -d cage1
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
1 cage1 0:0:2 0 1:0:2 0 24 27-39 2.37 2.37 DC2 n/a
-----------Cage detail info for cage1 ---------
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Green
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Amber,Off Amber,Off
-----------Midplane Info-----------
Firmware_status Current
Product_Rev 2.37
State Normal Op
Loop_Split 0
VendorId,ProductId 3PARdata,DC2
Unique_ID 10320300000AD000
Power Supply Info State Fan State AC Model
ps0 Failed OK Failed POI <AC input is missing
ps1 OK OK OK POI
ps2 Off OK OK POI <PS switch is turned off
ps3 OK OK OK POI
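When many cages must be reviewed, the Power Supply Info rows can be screened with a script. The following Python sketch is a hypothetical helper, not an HP tool; it assumes the column order State, Fan State, AC shown in the output above and treats any value other than OK as a problem.

```python
def unhealthy_power_supplies(ps_text):
    """Return (name, {column: value}) for each power supply with a non-OK column."""
    problems = []
    for line in ps_text.splitlines():
        fields = line.split()
        # Data rows start with ps<number>; the header row is skipped.
        if len(fields) >= 4 and fields[0].startswith("ps") and fields[0][2:].isdigit():
            name, state, fan_state, ac_state = fields[:4]
            columns = (("State", state), ("Fan State", fan_state), ("AC", ac_state))
            bad = {column: value for column, value in columns if value != "OK"}
            if bad:
                problems.append((name, bad))
    return problems


# Rows copied from the cage1 example in this section.
SAMPLE = """\
Power Supply Info State Fan State AC Model
ps0 Failed OK Failed POI <AC input is missing
ps1 OK OK OK POI
ps2 Off OK OK POI <PS switch is turned off
ps3 OK OK OK POI
"""

for name, bad in unhealthy_power_supplies(SAMPLE):
    print(name, bad)
```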
Cage Example 3
Component -Identifier- --------------Description----------------
Cage cage:1 Cage has a hotplug enabled interface card
Cage Suggested Action 3
When a servicecage operation is started, the targeted cage goes into servicing mode,
illuminating the hot plug LED on the FCAL module (DC1, DC2, DC4), and routing I/O through
another path. When the service action is finished, enter the servicecage endfc command to
return the cage to normal status. The checkhealth exception is reported if the FCAL module's
hot plug LED is illuminated or if the cage is in servicing mode. If a maintenance activity is currently
occurring on the drive cage, this condition may be ignored.
NOTE: The primary path is marked with an asterisk (*) in the Ports columns of the showpd output.
cli% showcage -d cage1
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
1 cage1 0:0:2 0 1:0:2 0 24 28-40 2.37 2.37 DC2 n/a
-----------Cage detail info for cage1 ---------
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Green
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Green,Off Green,Amber
-----------Midplane Info-----------
Firmware_status Current
Product_Rev 2.37
State Normal Op
Loop_Split 0
VendorId,ProductId 3PARdata,DC2
Unique_ID 10320300000AD000
cli% showpd -s
Id CagePos Type -State-- -----Detailed_State------
20 1:0:0 FC degraded disabled_B_port,servicing
21 1:0:1 FC degraded disabled_B_port,servicing
22 1:0:2 FC degraded disabled_B_port,servicing
23 1:0:3 FC degraded disabled_B_port,servicing
cli% showpd -p -cg 1
---Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
20 1:0:0 FC 10 degraded 139520 119808 0:0:2* 1:0:2-
21 1:0:1 FC 10 degraded 139520 122112 0:0:2* 1:0:2-
22 1:0:2 FC 10 degraded 139520 119552 0:0:2* 1:0:2-
23 1:0:3 FC 10 degraded 139520 122368 0:0:2* 1:0:2-
Cage Example 4
Component ---------Description--------- Qty
Cage Cages not on current firmware 1
Component -Identifier- ------Description------
Cage cage:3 Firmware is not current
Cage Suggested Action 4
Check the drive cage firmware revision using the commands showcage and showcage -d
cageX. The showfirmwaredb command displays the current firmware level required for the specific
drive cage type.
NOTE: The DC1 and DC3 cages have firmware in the FCAL modules. The DC2 and DC4 cages
have firmware on the cage mid-plane. Use the upgradecage command to upgrade the firmware.
cli% showcage
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
2 cage2 2:0:3 0 3:0:3 0 24 29-43 2.37 2.37 DC2 n/a
3 cage3 2:0:4 0 3:0:4 0 32 29-41 2.36 2.36 DC2 n/a
cli% showcage -d cage3
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
3 cage3 2:0:4 0 3:0:4 0 32 29-41 2.36 2.36 DC2 n/a
-----------Cage detail info for cage3 ---------
.
.
.
-----------Midplane Info-----------
Firmware_status Old
Product_Rev 2.36
State Normal Op
Loop_Split 0
VendorId,ProductId 3PARdata,DC2
Unique_ID 10320300000AD100
cli% showfirmwaredb
Vendor Prod_rev Dev_Id Fw_status Cage_type Firmware_File
...
3PARDATA [2.37] DC2 Current DC2 /opt...dc2/lbod_fw.bin-2.37
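The comparison between the cage revisions (RevA, RevB) and the Prod_rev listed by showfirmwaredb can be expressed in a few lines. The following Python sketch is illustrative only; the helper names are hypothetical, and it assumes revisions are numeric major.minor strings such as 2.36 and 2.37.

```python
def parse_rev(rev):
    """Turn a revision string such as '2.37' or '[2.37]' into the tuple (2, 37)."""
    return tuple(int(part) for part in rev.strip("[]").split("."))


def cages_needing_upgrade(showcage_rows, current_rev):
    """showcage_rows holds (name, revA, revB); return names below current_rev."""
    target = parse_rev(current_rev)
    return [
        name
        for name, rev_a, rev_b in showcage_rows
        if parse_rev(rev_a) < target or parse_rev(rev_b) < target
    ]


# Values copied from the showcage and showfirmwaredb output in this example.
rows = [("cage2", "2.37", "2.37"), ("cage3", "2.36", "2.36")]
print(cages_needing_upgrade(rows, "[2.37]"))
```

Only cage3 is listed, which agrees with the Firmware_status of Old reported for that cage.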
Cage Example 5
Component -Identifier- ------------Description------------
Cage cage:4 Interface Card 0, SFP 0 is unqualified
Cage Suggested Action 5
In this example, a 2 Gb/s SFP was installed in a 4 Gb/s drive cage (DC4), and the 2 Gb/s SFP
is not qualified for use in this drive cage. For cage problems, the following CLI commands are
useful: showcage -d, showcage -sfp, showcage -sfp -ddm, showcage -sfp -d, and
showpd -state.
cli% showcage -d cage4
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
4 cage4 2:2:1 0 3:2:1 0 8 30-37 2.37 2.37 DC4 n/a
-----------Cage detail info for cage4 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed 2Gbps -- -- 4Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0 0 OK SIGMA-LINKS 2.1 No No No Yes
1 1 OK FINISAR CORP. 4.1 No No No Yes
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Green
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Amber,Off Green,Off
...
cli% showcage -sfp -d cage4
--------Cage 4 FCAL 0 SFP 0--------
Cage ID : 4
Fcal ID : 0
SFP ID : 0
State : OK
Manufacturer : SIGMA-LINKS
Part Number : SL5114A-2208
Serial Number : U260651461
Revision : 1.4
MaxSpeed(Gbps) : 2.1
Qualified : No <<< Unqualified SFP
TX Disable : No
TX Fault : No
RX Loss : No
RX Power Low : No
DDM Support : Yes
--------Cage 4 FCAL 1 SFP 1--------
Cage ID : 4
Fcal ID : 1
SFP ID : 1
State : OK
Manufacturer : FINISAR CORP.
Part Number : FTLF8524P2BNV
Serial Number : PF52GRF
Revision : A
MaxSpeed(Gbps) : 4.1
Qualified : Yes
TX Disable : No
TX Fault : No
RX Loss : No
RX Power Low : No
DDM Support : Yes
Consistency
Displays inconsistencies between sysmgr and the kernel.
This check finds inconsistent and unusual conditions between the system manager (sysmgr) and
the node kernel. It requires the hidden -svc -full parameter because the check
can take 20 minutes or longer on a large system.
Format of Possible Consistency Exception Messages
Consistency --<err>
Consistency Example
Component -Identifier- --------Description------------------------
Consistency -- Region Mover Consistency Check Failed
Consistency -- CH/LD/VV Consistency Check Failed
Consistency Suggested Action
Gather InSplore data and escalate to HP BSC.
Data Encryption (DAR)
Checks for issues with data encryption. If the system is not licensed for HP 3PAR Data Encryption, no
checks are made.
Format of Possible DAR Exception Messages
Dar -- "There are 5 disks that are not self-encrypting"
DAR Suggested Action
Remove the drives that are not self-encrypting from the system because the non-encrypted drives
cannot be admitted into a system that is running with data encryption. Also, if the system is not yet
enabled for data encryption, the presence of these disks prevents data encryption from being
enabled.
DAR Example 2
Dar -- "DAR Encryption key needs backup"
DAR Suggested Action 2
Issue the controlencryption backup command to generate a password-enabled backup file.
Date
Checks the date and time on all nodes.
Format of Possible Date Exception Messages
Date -- "Date is not the same on all nodes"
Date Example
Component -Identifier- -----------Description-----------
Date -- Date is not the same on all nodes
Date Suggested Action
The time on the nodes should stay synchronized whether there is an NTP server or not. Use
showdate to see if a node is out of sync. Use shownet and shownet -d commands to view
network and NTP information.
cli% showdate
Node Date
0 2010-09-08 10:56:41 PDT (America/Los_Angeles)
1 2010-09-08 10:56:39 PDT (America/Los_Angeles)
cli% shownet
IP Address Netmask/PrefixLen Nodes Active Speed Duplex AutoNeg Status
192.168.56.209 255.255.255.0 0123 0 100 Full Yes Active
Default route: 192.168.56.1
NTP server : 192.168.56.109
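The size of the difference between node clocks can be computed from captured showdate output. The following Python sketch is illustrative only; the function name is hypothetical, and it assumes every node reports the same time zone, as in the example above.

```python
from datetime import datetime


def node_clock_spread(showdate_text):
    """Return the difference in seconds between the earliest and latest node time."""
    times = []
    for line in showdate_text.splitlines():
        fields = line.split()
        # Data rows look like: <node> <YYYY-MM-DD> <HH:MM:SS> <zone> ...
        if len(fields) >= 3 and fields[0].isdigit():
            stamp = fields[1] + " " + fields[2]
            times.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    if len(times) < 2:
        return 0.0
    return (max(times) - min(times)).total_seconds()


# Rows copied from the showdate output in this example.
SAMPLE = """\
Node Date
0 2010-09-08 10:56:41 PDT (America/Los_Angeles)
1 2010-09-08 10:56:39 PDT (America/Los_Angeles)
"""

print(node_clock_spread(SAMPLE))
```

For the sample rows the spread is 2 seconds. This guide does not state how large a difference triggers the exception, so no threshold is applied here.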
File
Displays file system conditions that are not optimal.
Checks for the following:
• The presence of special files on each node, for example:
touch
manualstartup
• That the persistent repository (Admin VV) is mounted
• Whether the file-systems on any node disk are close to full
• The presence of any HBA core files or user process dumps
• Whether the amount of free node memory is sufficient
Format of Possible File Exception Messages
File node:<node> "Behavior altering file "<file> " exists created on <filetime>"
File node:master "Admin Volume is not mounted"
File node:<node> "Filesystem <filesys> mounted on "<mounted_on>" is over xx% full"
(Warnings are given at 80 and 90%.)
File node:<node> "Dump or HBA core files found"
File -- "An online upgrade is in progress"
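The two file system warning levels mentioned above (80 and 90%) can be modeled as follows. This Python sketch is an assumption about how the thresholds apply, not the checkhealth implementation; in particular, it assumes that "over xx% full" means strictly greater than the threshold.

```python
def fs_warning_level(percent_full):
    """Return the highest warning threshold exceeded (90 or 80), or None."""
    for threshold in (90, 80):
        if percent_full > threshold:
            return threshold
    return None


for usage in (50, 85, 95):
    print(usage, fs_warning_level(usage))
```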
File Example 1
File node:2 Behavior altering file "manualstartup" exists created on Oct 7 14:16
File Suggested Action 1
After understanding the condition of the file, remove the file to prevent unwanted behavior. As root
on a node, use the UNIX rm command to remove the file.
A known issue is that some undesirable touch files are not detected (bug 45661).
File Example 2
Component -----------Description----------- Qty
File Admin Volume is not mounted 1
File Suggested Action 2
Each node has a file system link so the admin volume can be mounted if the node is the master
node. This exception is reported if the link is missing or if the System Manager (sysmgr) is not
running at the time. For example, sysmgr may have been restarted manually, because of an error, or during a
change of master node. If sysmgr is restarted, it tries to remount the admin volume every
few minutes.
Every node should have the following file system link so that the admin volume can be mounted,
if the node becomes the master node:
root@1001356-1~# onallnodes ls -l /dev/tpd_vvadmin
Node 0:
lrwxrwxrwx 1 root root 12 Oct 23 09:53 /dev/tpd_vvadmin -> tpddev/vvb/0
Node 1:
ls: /dev/tpd_vvadmin: No such file or directory
The corresponding alert when the admin volume is not properly mounted is as follows:
Message Code: 0xd0002
Severity : Minor
Type : PR transition
Message : The PR is currently getting data from the internal drive on node 1, not
the admin volume. Previously recorded alerts will not be visible until the PR
transitions to the admin volume.
If a link for the admin volume is not present, it can be recreated by rebooting the node.
File Example 3
Component -----------Description----------- Qty
File Nodes with Dump or HBA core files 1
Component ----Identifier----- ----Description------
File node:1 Dump or HBA core files found
File Suggested Action 3
This condition may be transient because the Service Processor retrieves the files and cleans up the
dump directory. If the SP is not gathering the dump files, check the condition and state of the SP.
LD
Checks the following and displays logical disks (LD) that are not optimal:
• Preserved LDs
• Verifies that current and created availability are the same
• Owner and backup
• Verifies preserved data space (pdsld) is the same as total data cache
• Size and number of logging LDs
Format of Possible LD Exception Messages
LD ld:<ldname> "LD is not mapped to a volume"
LD ld:<ldname> "LD is in write-through mode"
LD ld:<ldname> "LD has <X> preserved RAID sets and <Y> preserved chunklets"
LD ld:<ldname> "LD has reduced availability. Current: <cavail>, Configured: <avail>"
LD ld:<ldname> "LD does not have a backup"
LD ld:<ldname> "LD does not have owner and backup"
LD ld:<ldname> "Logical Disk is owned by <owner>, but preferred owner is <powner>"
LD ld:<ldname> "Logical Disk is backed by <backup>, but preferred backup is <pbackup>"
LD ld:<ldname> "A logging LD is smaller than 20G in size"
LD ld:<ldname> "Detailed State:<ldstate>" (degraded or failed)
LD -- "Number of logging LD's does not match number of nodes in the cluster"
LD -- "Preserved data storage space does not equal total node's Data memory"
LD Example 1
Component -------Description-------- Qty
LD LDs not mapped to a volume 10
Component -Identifier-- --------Description---------
LD ld:Ten.usr.0 LD is not mapped to a volume
LD Suggested Action 1
Examine the identified LDs using the following CLI commands: showld, showld -d, showldmap,
and showvvmap.
LDs are normally mapped to (used by) VVs, but an LD can become disassociated from its VV if the VV is
deleted without the underlying LDs being deleted, or if a tune operation is aborted. Normally, you
would remove the unmapped LD to return its chunklets to the free pool.
cli% showld Ten.usr.0
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
88 Ten.usr.0 0 normal 0/1/2/3 8704 0 V 0 --- N N
cli% showldmap Ten.usr.0
Ld space not used by any vv
LD Example 2
Component -------Description-------- Qty
LD LDs in write through mode 3
Component -Identifier-- --------Description---------
LD ld:Ten.usr.12 LD is in write-through mode
LD Suggested Action 2
Examine the identified LDs for failed or missing disks by using the following CLI commands: showld,
showld -d, showldch, and showpd. Write-through mode (WThru) indicates that host I/O
operations must be written through to the disk before the host I/O command is acknowledged.
This is usually due to a node-down condition, when node batteries are not working, or where disk
redundancy is not optimal.
cli% showld Ten*
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
91 Ten.usr.3 0 normal 1/0/3/2 13824 0 V 0 --- N N
92 Ten.usr.12 0 normal 2/3/0/1 28672 0 V 0 --- Y N
cli% showldch Ten.usr.12
Ldch Row Set PdPos Pdid Pdch State Usage Media Sp From To
0 0 0 3:3:0 108 6 normal ld valid N --- ---
11 0 11 --- 104 74 normal ld valid N --- ---
cli% showpd 104
-Size(MB)-- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
104 4:9:0? FC 15 failed 428800 0 ----- -----
LD Example 3
Component ---------Description--------- Qty
LD LDs with reduced availability 1
Component --Identifier-- ------------Description---------------
LD ld:R1.usr.0 LD has reduced availability. Current: ch, Configured: cage
LD Suggested Action 3
LDs are created with certain high-availability characteristics, such as ha-cage. Reduced availability
occurs when chunklets in an LD are moved to a location where the current availability (CAvail) is
below the configured level of availability (Avail). Chunklets may have been moved manually with
movech, moved as part of a tune operation, or relocated during failure conditions such as node,
path, or cage failures. The HA levels from highest to lowest are port, cage, mag, and ch (disk).
Examine the identified LDs for failed or missing disks by using the following CLI commands: showld,
showld -d, showldch, and showpd. In the example below, the LD should have cage-level
availability, but it currently has chunklet (disk) level availability (the chunklets are on the same disk).
cli% showld -d R1.usr.0
Id Name CPG RAID Own SizeMB RSizeMB RowSz StepKB SetSz Refcnt Avail CAvail
32 R1.usr.0 --- 1 0/1/3/2 256 512 1 256 2 0 cage ch
cli% showldch R1.usr.0
Ldch Row Set PdPos Pdid Pdch State Usage Media Sp From To
0 0 0 0:1:0 4 0 normal ld valid N --- ---
1 0 0 0:1:0 4 55 normal ld valid N --- ---
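The availability comparison described above can be illustrated with a short script. The following Python sketch is hypothetical and is not how checkhealth is implemented; it ranks the HA levels in the order given in the text (port, cage, mag, ch) and reports RAID sets whose chunklets share a physical disk, using values from the showld -d and showldch output in this example.

```python
HA_LEVELS = ("port", "cage", "mag", "ch")  # highest to lowest, per this guide


def has_reduced_availability(avail, cavail):
    """True when the current level (CAvail) ranks below the configured level (Avail)."""
    return HA_LEVELS.index(cavail) > HA_LEVELS.index(avail)


def sets_sharing_a_disk(ldch_rows):
    """ldch_rows holds (ldch, row, set, pdid); return (row, set) pairs that reuse a PD."""
    seen = {}
    shared = set()
    for _ldch, row, set_id, pdid in ldch_rows:
        key = (row, set_id)
        if pdid in seen.setdefault(key, set()):
            shared.add(key)
        seen[key].add(pdid)
    return sorted(shared)


# R1.usr.0: Avail is cage, CAvail is ch; both chunklets are on PD 4.
print(has_reduced_availability("cage", "ch"))
print(sets_sharing_a_disk([(0, 0, 0, 4), (1, 0, 0, 4)]))
```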
LD Example 4
Component -Identifier-- -----Description-------------
LD -- Preserved data storage space does not equal total node's Data memory
LD Suggested Action 4
Preserved data LDs (pdsld) are created during system initialization (Out-of-the-Box, OOTB) and after
some hardware upgrades (through the admithw command). The total size of the pdsld LDs should match
the total size of all data cache in the storage system (see below). This message appears when a node
is offline, because the LD size then no longer matches the data cache size; in that case the message
can be ignored. If all nodes are online and the error condition persists,
determine the cause of the failure. Use the admithw command to correct the condition.
cli% shownode
Control Data Cache
Node --Name--- -State- Master InCluster ---LED--- Mem(MB) Mem(MB) Available(%)
0 1001335-0 OK Yes Yes GreenBlnk 2048 4096 100
1 1001335-1 OK No Yes GreenBlnk 2048 4096 100
cli% showld pdsld*
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
19 pdsld0.0 1 normal 0/1 256 0 P,F 0 --- Y N
20 pdsld0.1 1 normal 0/1 7680 0 P 0 --- Y N
21 pdsld0.2 1 normal 0/1 256 0 P 0 --- Y N
----------------------------------------------------------------------------
3 8192 0
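The size comparison can be checked by hand or with a few lines of code. The following Python sketch is illustrative only; the function name is hypothetical, and the values are taken from the Data Mem(MB) column of shownode and the SizeMB column of showld pdsld* in this example.

```python
def pdsld_matches_data_cache(node_data_cache_mb, pdsld_sizes_mb):
    """True when the pdsld LD sizes add up to the total data cache of all nodes."""
    return sum(node_data_cache_mb) == sum(pdsld_sizes_mb)


node_data_cache_mb = [4096, 4096]  # nodes 0 and 1
pdsld_sizes_mb = [256, 7680, 256]  # pdsld0.0, pdsld0.1, pdsld0.2

print(sum(node_data_cache_mb), sum(pdsld_sizes_mb))
print(pdsld_matches_data_cache(node_data_cache_mb, pdsld_sizes_mb))
```

Both totals are 8192 MB, so this system would not report the exception. With one node offline, only 4096 MB of data cache would be counted and the totals would differ.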
License
Displays license violations.
Format of Possible License Exception Messages
License <feature_name> "License has expired"
License Example
Component -Identifier- --------Description-------------
License -- System Tuner License has expired
License Suggested Action
Request a new or updated license from your Sales Engineer.
Network
Displays Ethernet issues for the administrative and Remote Copy over IP (RCIP) networks that have
been logged in the previous 24 hours. Also reports if the storage system has fewer than two nodes
with working administrative Ethernet connections.
• Checks the number of collisions in the previous day's log. The number of collisions should be
less than 5% of the total packets for the day.
• Checks for Ethernet errors and transmit (TX) or receive (RX) errors in the previous day's log.
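The 5% collision guideline is a simple ratio. The following Python sketch is illustrative only; the function name and the sample counts are hypothetical, and it assumes that "total packets" means transmitted plus received packets for the day, which this guide does not spell out.

```python
def collisions_within_limit(collisions, total_packets, limit_percent=5.0):
    """True when collisions are below limit_percent of the day's total packets."""
    if total_packets == 0:
        return True  # no traffic logged, nothing to compare
    return 100.0 * collisions / total_packets < limit_percent


# Hypothetical daily counts, not taken from a real system.
print(collisions_within_limit(collisions=200, total_packets=20000))
print(collisions_within_limit(collisions=1200, total_packets=20000))
```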
Format of Possible Network Exception Messages
Network -- "IP address change has not been completed"
Network "Node<node>:<type>" "Errors detected on network"
Network "Node<node>:<type>" "There is less than one day of network history for this node"
Network -- "No nodes have working admin network connections"
Network -- "Node <node> has no admin network link detected"
Network -- "Nodes <nodelist> have no admin network link detected"
Network -- "checkhealth was unable to determine admin link status"
Network Example 1
Network -- "IP address change has not been completed"
Network Suggested Action 1
The setnet command was issued to change a network parameter, such as the IP address, but
the action has not completed. Use setnet finish to complete the change, or setnet abort
to cancel it. Use the shownet command to examine the current condition.
cli% shownet
IP Address Netmask/PrefixLen Nodes Active Speed Duplex AutoNeg Status
192.168.56.209 255.255.255.0 0123 0 100 Full Yes Changing
192.168.56.233 255.255.255.0 0123 0 100 Full Yes Unverified
Network Example 2
Component ---Identifier---- -----Description----------
Network Node0:Admin Errors detected on network
Network Suggested Action 2
Network errors have been detected on the specified node and network interface. Commands such
as shownet and shownet -d are useful for troubleshooting network problems. These commands
display the current network counters, whereas checkhealth shows errors from the last logging sample.
NOTE: The error counters shown by shownet and shownet -d cannot be cleared except by
rebooting a controller node. Because checkhealth is showing network counters from a history
log, checkhealth stops reporting the issue if there is no increase in error in the next log entry.
cli% shownet -d
IP Address: 192.168.56.209 Netmask 255.255.255.0
Assigned to nodes: 0123
Connected through node 0
Status: Active
Admin interface on node 0
MAC Address: 00:02:AC:25:04:03
RX Packets: 1225109 TX Packets: 550205
RX Bytes: 1089073679 TX Bytes: 568149943
RX Errors: 0 TX Errors: 0
RX Dropped: 0 TX Dropped: 0
RX FIFO Errors: 0 TX FIFO Errors: 0
RX Frame Errors: 60 TX Collisions: 0
RX Multicast: 0 TX Carrier Errors: 0
RX Compressed: 0 TX Compressed: 0
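To pick out the nonzero error counters from a long shownet -d listing, the counter lines can be parsed with a regular expression. The following Python sketch is a hypothetical helper, not an HP tool; the pattern is an assumption based on the "RX name: value TX name: value" layout shown above.

```python
import re

# Matches pairs such as "RX Frame Errors: 60" or "TX Collisions: 0".
COUNTER = re.compile(r"([RT]X [A-Za-z ]+?):\s*(\d+)")
ERROR_WORDS = ("Errors", "Dropped", "Collisions")


def nonzero_error_counters(shownet_text):
    """Return {counter name: value} for error-type counters greater than zero."""
    counters = {name: int(value) for name, value in COUNTER.findall(shownet_text)}
    return {
        name: value
        for name, value in counters.items()
        if value > 0 and any(word in name for word in ERROR_WORDS)
    }


# Counter lines copied from the shownet -d output in this example.
SAMPLE = """\
RX Packets: 1225109 TX Packets: 550205
RX Bytes: 1089073679 TX Bytes: 568149943
RX Errors: 0 TX Errors: 0
RX Dropped: 0 TX Dropped: 0
RX FIFO Errors: 0 TX FIFO Errors: 0
RX Frame Errors: 60 TX Collisions: 0
RX Multicast: 0 TX Carrier Errors: 0
RX Compressed: 0 TX Compressed: 0
"""

print(nonzero_error_counters(SAMPLE))
```

For the sample output, only RX Frame Errors (60) is reported. Because these counters are cumulative, a nonzero value does not by itself mean that errors are still occurring.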
Node
Checks the following node conditions and displays nodes that are not optimal:
• Verifies node batteries have been tested in the last 30 days
• Offline nodes
• Power supply and battery problems
The following checks are performed only if the -svc option is used.
• Checks for symmetry of components between nodes such as Control-Cache and Data-Cache
size, OS version, bus speed, and CPU speed
• Checks if diagnostics such as ioload are running on any of the nodes
• Checks for stuck-threads, such as I/O operations that cannot complete
Format of Possible Node Exception Messages
Node node:<nodeID> "Node is not online"
Node node:<nodeID> "Power supply <psID> detailed state is <status>"
Node node:<nodeID> "Power supply <psID> AC state is <acStatus>"
Node node:<nodeID> "Power supply <psID> DC state is <dcStatus>"
Node node:<nodeID> "Power supply <psID> battery is <batStatus>"
Node node:<nodeID> "Node <nodeID> battery is <batStatus>"
Node node:<priNodeID> "<bat> has not been tested within the last 30 days"
Node node:<nodeID> "Node <nodeID> battery is expired"
Node node:<nodeID> "Power supply <psID> is expired"
Node node:<nodeID> "Fan is <fanID> is <status>"
Node node:<nodeID> "Power supply <psID> fan module <fanID> is <status>"
Node node:<nodeID> "Fan module <fanID> is <status>"
Node node:<nodeID> "Detailed State <state>" (degraded or failed)
The following checks are performed when the -svc option is used:
Node -- "BIOS version is not the same on all nodes"
Node -- "Control memory is not the same on all nodes"
Node -- "Data memory is not the same on all nodes"
Node -- "CPU Speed is not the same on all nodes"
Node -- "CPU Bus Speed is not the same on all nodes"
Node -- "HP 3PAR OS version is not the same on all nodes"
Node node:<nodenum> "Flusher speed set incorrectly to: <speed>" (should be 0)
Node node:<nodenum> "Environmental factor <factor> is <state>" (DDR2, Node), (UNDER LIMIT, OVER LIMIT)
Node node:<node> "Ioload is running"
Node node:<node> "Node has less than 100MB of free memory"
Node node:<node> "BIOS skip mask is <skip_mask>"
Node node:<node> "quo_cex_flags are not set correctly"
Node node:<node> "clus_upgr_group state is not set correctly"
Node node:<node> "clus_upgr_state is not set correctly"
Node node:<node> "Process <processID> has reached 90% of maximum size"
Node node:<node> "VV <vvID> has outstanding <command> with a maximum wait time of <sleeptime>"
Node -- "There is at least one active servicenode operation in progress"
Node Suggested Action
For node error conditions, examine the node and node-component states by using the following
commands: shownode, shownode -s, shownode -d, showbattery, and showsys -d.
Node Example 1
Component -Identifier- ---------------Description----------------
Node node:0 Power supply 1 detailed state is DC Failed
Node node:0 Power supply 1 DC state is Failed
Node node:1 Power supply 0 detailed state is AC Failed
Node node:1 Power supply 0 AC state is Failed
Node node:1 Power supply 0 DC state is Failed
Node Suggested Action 1
Examine the states of the power supplies with commands such as shownode, shownode -s, and
shownode -ps. Turn on or replace the failed power supply.
NOTE: In the example below, the battery state is considered degraded because the power supply
is failed.
cli% shownode
Control Data Cache
Node --Name--- -State-- Master InCluster ---LED--- Mem(MB) Mem(MB) Available(%)
0 1001356-0 Degraded Yes Yes AmberBlnk 2048 8192 100
1 1001356-1 Degraded No Yes AmberBlnk 2048 8192 100
cli% shownode -s
Node -State-- -Detailed_State-
0 Degraded PS 1 Failed
1 Degraded PS 0 Failed
cli% shownode -ps
Node PS -Serial- -PSState- FanState ACState DCState -BatState- ChrgLvl(%)
0 0 FFFFFFFF OK OK OK OK OK 100
0 1 FFFFFFFF Failed -- OK Failed Degraded 100
1 0 FFFFFFFF Failed -- Failed Failed Degraded 100
1 1 FFFFFFFF OK OK OK OK OK 100
Node Example 2
Component -Identifier- ---------Description------------
Node node:3 Power supply 1 battery is Failed
Node Suggested Action 2
Examine the state of the battery and power supply by using the following commands: shownode,
shownode -s, shownode -ps, showbattery (and showbattery with -d, -s, -log). Turn
on, fix, or replace the battery backup unit.
NOTE: The degraded power supply state is caused by the failing battery. The degraded
PS state is not the expected behavior. This issue will be fixed in a future release (bug 46682).
cli% shownode
Control Data Cache
Node --Name--- -State-- Master InCluster ---LED--- Mem(MB) Mem(MB) Available(%)
2 1001356-2 OK No Yes GreenBlnk 2048 8192 100
3 1001356-3 Degraded No Yes AmberBlnk 2048 8192 100
cli% shownode -s
Node -State-- -Detailed_State-
2 OK OK
3 Degraded PS 1 Degraded
cli% shownode -ps
Node PS -Serial- -PSState- FanState ACState DCState -BatState- ChrgLvl(%)
2 0 FFFFFFFF OK OK OK OK OK 100
2 1 FFFFFFFF OK OK OK OK OK 100
3 0 FFFFFFFF OK OK OK OK OK 100
3 1 FFFFFFFF Degraded OK OK OK Failed 0
cli% showbattery
Node PS Bat Serial -State-- ChrgLvl(%) -ExpDate-- Expired Testing
3 0 0 100A300B OK 100 07/01/2011 No No
3 1 0 12345310 Failed 0 04/07/2011 No No
Node Example 3
Component -Identifier- --------------Description----------------
Node node:3 Node:3, Power Supply:1, Battery:0 has not been tested within the last 30 days
Node Suggested Action 3
The indicated battery has not been tested in 30 days. A node backup battery is tested every 14
days under normal conditions. If the main battery is missing, expired, or failed, the backup battery
connected to the same node is not tested, because testing it can
cause loss of power to the node. An untested battery has an unknown status in the showbattery
-s output. Use the following commands: showbattery, showbattery -s, and showbattery
-d.
cli% showbattery -s
Node PS Bat -State-- -Detailed_State-
0 0 0 OK normal
0 1 0 Degraded Unknown
Examine the date of the last successful test of that battery. Assuming the current date was
2009-10-14, the last battery test on Node 0, PS 1, Bat 0 was 2009-09-10, which is more
than 30 days ago.
cli% showbattery -log
Node PS Bat Test Result Dur(mins) ---------Time----------
0 0 0 0 Passed 1 2009-10-14 14:34:50 PDT
0 0 0 1 Passed 1 2009-10-28 14:36:57 PDT
0 1 0 0 Passed 1 2009-08-27 06:17:44 PDT
0 1 0 1 Passed 1 2009-09-10 06:19:34 PDT
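The age of the last battery test can be worked out from the log with simple date arithmetic. The following Python sketch is illustrative only; the function name is hypothetical, the row layout is assumed from the showbattery -log output above, and the reference date 2009-10-14 is the one assumed in the text.

```python
from datetime import date, datetime


def last_passed_test(log_text, node, ps, bat):
    """Return the date of the most recent Passed test for one battery, or None."""
    latest = None
    for line in log_text.splitlines():
        fields = line.split()
        # Data rows: node ps bat test result duration date time zone
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        if (int(fields[0]), int(fields[1]), int(fields[2])) != (node, ps, bat):
            continue
        if fields[4] != "Passed":
            continue
        tested = datetime.strptime(fields[6], "%Y-%m-%d").date()
        if latest is None or tested > latest:
            latest = tested
    return latest


# Rows copied from the showbattery -log output in this example.
SAMPLE = """\
Node PS Bat Test Result Dur(mins) ---------Time----------
0 1 0 0 Passed 1 2009-08-27 06:17:44 PDT
0 1 0 1 Passed 1 2009-09-10 06:19:34 PDT
"""

last = last_passed_test(SAMPLE, node=0, ps=1, bat=0)
age_days = (date(2009, 10, 14) - last).days
print(last, age_days, age_days > 30)
```

The last passed test for Node 0, PS 1, Bat 0 is 34 days before the reference date, which exceeds the 30-day limit.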
cli% showbattery
Node PS Bat Serial -State-- ChrgLvl(%) -ExpDate-- Expired Testing
0 0 0 83205243 OK 100 04/07/2011 No No
0 1 0 83202356 Degraded 100 04/07/2011 No No
Node Example 4
Component ---Identifier---- -----Description----------
Node node:0 Ioload is running
Node Suggested Action 4
This output appears only if the -svc option of checkhealth is used; it is not displayed
for the non-service check. The exception is reported when a disk diagnostic stress test is detected
running on the node, because the test can affect node performance. After the HP 3PAR Storage System
is installed, diagnostic stress tests exercise the disks for up to two hours following the initial
setup (OOTB). If the stress test is detected within three hours of the initial setup, disregard the
warning. If the test is detected later than that, it may have been started manually. Investigate the
operation and contact HP Support.
From a node's root login prompt, check the UNIX processes for ioload:
root@1001356-0 Tue Nov 03 13:37:31:~# onallnodes ps -ef |grep ioload
root 13384 1 2 13:36 ttyS0 00:00:01 ioload -n -c 2 -t 20000 -i 256 -o 4096 /dev/tpddev/pd/100
PD
Displays physical disks with states or conditions that are not optimal:
• Checks for failed and degraded PDs
• Checks for an imbalance of PD ports, for example, if Port-A is used on more disks than Port-B
• Checks for an Unknown sparing algorithm.
• Checks for disks experiencing a high number of IOPS
• Reports if a servicemag operation is outstanding (servicemag status)
• Reports if there are PDs that do not have entries in the firmware DB file
Format of Possible PD Exception Messages
PD disk:<pdid> "Degraded States: <showpd -s -degraded>"
PD disk:<pdid> "Failed States: <showpd -s -failed>"
PD -- "There is an imbalance of active PD ports"
PD -- "Sparing algorithm is not set"
PD disk:<pdid> "Disk is experiencing a high level of I/O per second: <iops>"
PD -- "There is at least one active servicemag operation in progress"
The following checks are performed when the -svc option is used, or on 7400/7200 hardware:
PD File: <filename> "Folder not found on all Nodes in <folder>"
PD File: <filename> "Folder not found on some Nodes in <folder>"
PD File: <filename> "File not found on all Nodes in <folder>"
PD File: <filename> "File not found on some Nodes in <folder>"
PD Disk:<pdID> "<pdmodel> PD for cage type <cagetype> in cage position <pos> is missing from firmware database"
PD Example 1
Component -------------------Description------------------- Qty
PD PDs that are degraded or failed 40
Component -Identifier- ---------------Description-----------------
PD disk:48 Detailed State: missing_B_port,loop_failure
PD disk:49 Detailed State: missing_B_port,loop_failure
...
PD disk:107 Detailed State: failed,notready,missing_A_port
PD Suggested Action 1
Both degraded and failed disks are reported. When an FC path to a drive cage is not working,
all disks in the cage have a degraded state due to the non-redundant condition. To further diagnose,
use the following commands: showpd, showpd -s, showcage, showcage -d, and showport
-sfp.
cli% showpd -degraded -failed
----Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
48 3:0:0 FC 10 degraded 139520 115200 2:0:4* -----
49 3:0:1 FC 10 degraded 139520 121344 2:0:4* -----
…
107 4:9:3 FC 15 failed 428800 0 ----- 3:2:1*
cli% showpd -s -degraded -failed
Id CagePos Type -State-- -----------------Detailed_State--------------
48 3:0:0 FC degraded missing_B_port,loop_failure
49 3:0:1 FC degraded missing_B_port,loop_failure
…
107 4:9:3 FC failed prolonged_not_ready,missing_A_port,relocating
cli% showcage -d cage3
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
3 cage3 2:0:4 0 --- 0 32 28-39 2.37 2.37 DC2 n/a
-----------Cage detail info for cage3 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed 2Gbps -- -- 0Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0 0 OK SIGMA-LINKS 2.1 No No No Yes
1 1 OK SIGMA-LINKS 2.1 No No Yes Yes
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Off
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Green,Off Green,Off
-------------Drive Info------------- ----LoopA----- ----LoopB-----
Drive NodeWWN LED Temp(C) ALPA LoopState ALPA LoopState
0:0 20000014c3b3eab9 Green 34 0xe1 OK 0xe1 Loop fail
0:1 20000014c3b3e708 Green 36 0xe0 OK 0xe0 Loop fail
PD Example 2
Component --Identifier-- --------------Description---------------
PD -- There is an imbalance of active pd ports
PD Suggested Action 2
The primary and secondary I/O paths for disks (PDs) are balanced between nodes. The primary
path is indicated in the showpd -path output and by an asterisk in the showpd output. An
imbalance of active ports is usually caused by a nonfunctional path/loop to a cage, or because
an odd number of drives is installed or detected. To further diagnose, use the following commands:
showpd, showpd -path, showcage, and showcage -d.
cli% showpd
----Size(MB)----- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
0 0:0:0 FC 10 normal 139520 119040 0:0:1* 1:0:1
1 0:0:1 FC 10 normal 139520 121600 0:0:1 1:0:1*
2 0:0:2 FC 10 normal 139520 119040 0:0:1* 1:0:1
3 0:0:3 FC 10 normal 139520 119552 0:0:1 1:0:1*
...
46 2:9:2 FC 10 normal 139520 112384 2:0:3* 3:0:3
47 2:9:3 FC 10 normal 139520 118528 2:0:3 3:0:3*
48 3:0:0 FC 10 degraded 139520 115200 2:0:4* -----
49 3:0:1 FC 10 degraded 139520 121344 2:0:4* -----
50 3:0:2 FC 10 degraded 139520 115200 2:0:4* -----
51 3:0:3 FC 10 degraded 139520 121344 2:0:4* -----
cli% showpd -path
-----------Paths-----------
Id CagePos Type -State-- A B Order
0 0:0:0 FC normal 0:0:1 1:0:1 0/1
1 0:0:1 FC normal 0:0:1 1:0:1 1/0
2 0:0:2 FC normal 0:0:1 1:0:1 0/1
3 0:0:3 FC normal 0:0:1 1:0:1 1/0
...
46 2:9:2 FC normal 2:0:3 3:0:3 2/3
47 2:9:3 FC normal 2:0:3 3:0:3 3/2
48 3:0:0 FC degraded 2:0:4 3:0:4missing 2/-
49 3:0:1 FC degraded 2:0:4 3:0:4missing 2/-
50 3:0:2 FC degraded 2:0:4 3:0:4missing 2/-
51 3:0:3 FC degraded 2:0:4 3:0:4missing 2/-
cli% showcage -d cage3
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
3 cage3 2:0:4 0 --- 0 32 29-41 2.37 2.37 DC2 n/a
-----------Cage detail info for cage3 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed 2Gbps -- -- 0Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0 0 OK SIGMA-LINKS 2.1 No No No Yes
1 1 OK SIGMA-LINKS 2.1 No No Yes Yes
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Off
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Green,Off Green,Off
...
-------------Drive Info------------- ----LoopA----- ----LoopB-----
Drive NodeWWN LED Temp(C) ALPA LoopState ALPA LoopState
0:0 20000014c3b3eab9 Green 35 0xe1 OK 0xe1 Loop fail
0:1 20000014c3b3e708 Green 38 0xe0 OK 0xe0 Loop fail
0:2 20000014c3b3ed17 Green 35 0xdc OK 0xdc Loop fail
0:3 20000014c3b3dabd Green 30 0xda OK 0xda Loop fail
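The port imbalance described in PD Suggested Action 2 can also be confirmed by counting how many disks use each node port as their primary path. The following is a minimal Python sketch, not an HP 3PAR tool. It assumes the showpd output has been captured as text in the column layout shown in the example, and the function name count_primary_ports is hypothetical.

```python
from collections import Counter

def count_primary_ports(showpd_text):
    """Count how many disks use each node port as their primary (asterisk) path.

    Assumes the column layout of the showpd example: the last two
    whitespace-separated fields of each disk row are Port A and Port B.
    """
    counts = Counter()
    for line in showpd_text.splitlines():
        fields = line.split()
        # Disk rows start with a numeric PD id; skip headers and "..." lines.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        for port in fields[-2:]:
            if port.endswith("*"):
                counts[port.rstrip("*")] += 1
    return counts

sample = """\
Id CagePos Type Speed(K) State Total Free A B
46 2:9:2 FC 10 normal 139520 112384 2:0:3* 3:0:3
47 2:9:3 FC 10 normal 139520 118528 2:0:3 3:0:3*
48 3:0:0 FC 10 degraded 139520 115200 2:0:4* -----
49 3:0:1 FC 10 degraded 139520 121344 2:0:4* -----
"""
print(count_primary_ports(sample))
```

In this sample, both cage 3 disks use port 2:0:4 as the primary path and none use a port on node 3, which is consistent with the missing loop B reported by showcage -d.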
PD Example 3
Component -------------------Description------------------- Qty
PD Disks experiencing a high level of I/O per second 93
Component --Identifier-- ---------Description----------
PD disk:100 Disk is experiencing a high level of I/O per second: 789.0
PD Suggested Action 3
This check samples the I/O per second (IOPS) information in statpd to see if any disks are being
overworked, and then it samples again after five seconds. This does not necessarily indicate a
problem, but it could negatively affect system performance. The IOPS thresholds currently set for
this condition are listed:
• NL disks < 75
• FC 10K RPM disks < 150
• FC 15K RPM disks < 200
• SSD < 1500
Operations such as servicemag and tunevv can cause this condition. If the IOPS rate is very
high and/or a large number of disks are experiencing very heavy I/O, examine the system further
using statistical monitoring commands/utilities such as statpd, the OS MC (GUI) and System
Reporter. The following example reports disks whose current total I/O rate is 150 IOPS or more.
cli% statpd -filt curs,t,iops,150
14:51:49 11/03/09 r/w I/O per second KBytes per sec ... Idle %
ID Port Cur Avg Max Cur Avg Max ... Cur Avg
100 3:2:1 t 658 664 666 172563 174007 174618 ... 6 6
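The thresholds listed in PD Suggested Action 3 can be expressed as a simple lookup. The following Python sketch is illustrative only. The drive-class keys and the function name is_overworked are hypothetical, and the rule that a disk must reach the threshold in both samples is an assumption, because this guide does not state how checkhealth combines the two readings.

```python
# Thresholds from PD Suggested Action 3, in I/O operations per second.
# The dictionary keys are hypothetical labels for the four drive classes.
IOPS_THRESHOLDS = {"NL": 75, "FC10": 150, "FC15": 200, "SSD": 1500}

def is_overworked(drive_class, first_sample, second_sample):
    """Return True if both IOPS samples reach the threshold for the class.

    Assumption: a disk is flagged only when it is busy in both samples
    (the second one taken five seconds after the first).
    """
    limit = IOPS_THRESHOLDS[drive_class]
    return first_sample >= limit and second_sample >= limit

print(is_overworked("FC15", 789.0, 658))
print(is_overworked("NL", 60, 90))
```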
PD Example 4
Component --Identifier-- -------Description----------
PD disk:3 Detailed State: old_firmware
PD Suggested Action 4
The identified disk does not have firmware that the storage system considers current. When a disk
is replaced, the servicemag operation should upgrade the disk's firmware. When disks are
installed or added to a system, the admithw command can perform the firmware upgrade. Check
the state of the disk by using CLI commands such as showpd -s, showpd -i, and
showfirmwaredb.
cli% showpd -s 3
Id CagePos Type -State-- -Detailed_State-
3 0:4:0 FC degraded old_firmware
cli% showpd -i 3
Id CagePos State ----Node_WWN---- --MFR-- ---Model--- -Serial- -FW_Rev-
3 0:4:0 degraded 200000186242DB35 SEAGATE ST3146356FC 3QN0290H XRHJ
cli% showfirmwaredb
Vendor Prod_rev Dev_Id Fw_status Cage_type
...
SEAGATE [XRHK] ST3146356FC Current DC2.DC3.DC4
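The comparison made in this example (FW_Rev XRHJ on the disk against the Current revision XRHK in the firmware database) can be sketched in Python as follows. This is a minimal illustration that assumes the showfirmwaredb and showpd -i output formats shown in the example; the function names are hypothetical and are not part of the HP 3PAR CLI.

```python
def parse_current_firmware(showfirmwaredb_text):
    """Map disk model (Dev_Id) to the revision marked Current.

    Assumes rows look like the showfirmwaredb example, with the revision
    in square brackets as the second field.
    """
    current = {}
    for line in showfirmwaredb_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "Current":
            current[fields[2]] = fields[1].strip("[]")
    return current

def firmware_is_current(model, fw_rev, current):
    """Return True or False, or None when the model is not in the database."""
    if model not in current:
        return None
    return fw_rev == current[model]

db = parse_current_firmware("SEAGATE [XRHK] ST3146356FC Current DC2.DC3.DC4")
print(firmware_is_current("ST3146356FC", "XRHJ", db))
```

A return value of None corresponds to a disk model that is absent from the firmware database.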
PD Example 5
Component --Identifier-- -------Description----------
PD -- Sparing Algorithm is not set
PD Suggested Action 5
Check the system’s Sparing Algorithm value using the CLI command showsys -param. The value
is normally set during the initial installation (OOTB). If it must be set later, use the command setsys
SparingAlgorithm; valid values are Default, Minimal, Maximal, and Custom. After setting the
parameter, use the admithw command to programmatically create and distribute the spare
chunklets.
cli% showsys -param
System parameters from configured settings
----Parameter----- --Value--
RawSpaceAlertFC : 0
RawSpaceAlertNL : 0
RemoteSyslog : 0
RemoteSyslogHost : 0.0.0.0
SparingAlgorithm : Unknown
PD Example 6
Component --Identifier-- -------Description----------
PD Disk:32 ST3400755FC PD for cage type DC3 in cage position 2:0:0 is missing from
the firmware database
PD Suggested Action 6
Check the release notes for mandatory updates and patches. Install updates and patches to HP
3PAR OS as needed to support the PD in the cage.
PDCH
Checks for Physical Disk Chunklets (PDCH) with states that are not optimal.
• Chunklets are not used by multiple LDs.
• Media failed chunklets.
• Verifies if LD ownership is the same as the physical connection.
Format of Possible PDCH Exception Messages
pdch ch:<pdid> "Chunklet is on a remote disk"
pdch LD:<ldid> "LD has <count> remote chunklets"
pdch LD:<ldid> "Connection path is not the same as LD ownership"
pdch ch:<initpdid>:<initpdpos> "Chunklet used previously by multiple LDs"
pdch ch:<initpdid>:<initpdpos> "Chunklet used previously by LD <ldname> (ch: <id>:
<pdch>), currently used by LD <ldname>"
PDCH Example 1
Component ------------Description------------ Qty
pdch LDs with chunklets on a remote disk 3
Component -Identifier- -------Description--------
pdch ld:19 LD has 3 remote chunklets
pdch ld:20 LD has 90 remote chunklets
pdch ld:21 LD has 3 remote chunklets
PDCH Suggested Action 1
If the message LD has remote chunklets is for Preserved Data LDs (pdslds), those warnings
can be ignored. See KB Solution 14550 for details. From the example above, LDs 19, 20, and
21 are pdslds and can be seen from the showld command:
cli% showld
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
19 pdsld0.0 1 normal 1/0 256 0 P,F 0 --- Y N
20 pdsld0.1 1 normal 1/0 7680 0 P 0 --- Y N
21 pdsld0.2 1 normal 1/0 256 0 P 0 --- Y N
PDCH Example 2
Component -------------------Description------------------- Qty
pdch LDs with connection path different than ownership 23
pdch LDs with chunklets on a remote disk 18
Component -Identifier- ---------------Description--------------
pdch LD:35 Connection path is not the same as LD ownership
pdch ld:35 LD has 1 remote chunklet
PDCH Suggested Action 2
The primary I/O paths for disks are balanced between the two nodes that are physically connected
to the drive cage. The node with the primary path to a disk is considered as the owning node. If
the path of the secondary node needs to be used for I/O to the disk, the secondary node is
considered remote I/O.
These messages usually indicate a node-to-cage FC path problem because the disks (chunklets)
are being accessed through their secondary path. The messages are usually a byproduct of other
conditions such as drive-cage/node-port/FC-loop problems and need to be investigated. If a node
is offline due to a service action, such as hardware or software upgrades, these exceptions can
be ignored until the action is finished and the node is back online.
In this example, LD 35, with a name of R1.usr.3, is owned (Own) by nodes 3/2/0/1, and the
primary/secondary physical paths to the disks (chunklets) in this LD are from nodes 3 and 2.
However, because the FC path (Port B) from node 3 to PD 91 is failed/missing, node 2 is performing the
I/O to PD 91. When the path from node 3 to cage 3 is fixed (N:S:P 3:0:4 in this example), the
condition should disappear.
cli% showld
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
35 R1.usr.3 1 normal 3/2/0/1 256 256 V 0 --- N Y
cli% showldch R1.usr.3
Ldch Row Set PdPos Pdid Pdch State Usage Media Sp From To
0 0 0 2:2:3 63 0 normal ld valid N --- ---
1 0 0 3:8:3 91 0 normal ld valid N --- ---
cli% showpd 91 63
----Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
63 2:2:3 FC 10 normal 139520 124416 2:0:3* 3:0:3
91 3:8:3 FC 10 degraded 139520 124416 2:0:4* -----
cli% showpd -s -failed -degraded
Id CagePos Type -State-- ---------------Detailed_State----------
91 3:8:3 FC degraded missing_B_port,loop_failure
cli% showcage
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
2 cage2 2:0:3 0 3:0:3 0 24 29-42 2.37 2.37 DC2 n/a
3 cage3 2:0:4 0 ----- 0 32 28-40 2.37 2.37 DC2 n/a
Normal condition (after fixing):
cli% showpd 91 63
----Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
63 2:2:3 FC 10 normal 139520 124416 2:0:3* 3:0:3
91 3:8:3 FC 10 normal 139520 124416 2:0:4 3:0:4*
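One way to reason about PDCH Example 2 is to check whether the primary owner node of the LD still has a port connected to each disk in the LD. The following Python sketch shows that check under stated assumptions. It treats the first entry of the Own column as the primary owner and reads the ports from showpd output. The function name owner_has_path is hypothetical, and this is an interpretation consistent with the example, not the documented checkhealth logic.

```python
def owner_has_path(ld_owner, port_a, port_b):
    """Return True if the LD's primary owner node has a port to the disk.

    ld_owner is the Own column from showld (for example "3/2/0/1").
    port_a and port_b are the Ports A and B columns from showpd; a
    missing path appears as dashes.
    """
    primary_owner = ld_owner.split("/")[0]
    nodes = {
        port.rstrip("*").split(":")[0]
        for port in (port_a, port_b)
        if ":" in port
    }
    return primary_owner in nodes

print(owner_has_path("3/2/0/1", "2:0:4*", "-----"))   # PD 91, path missing
print(owner_has_path("3/2/0/1", "2:0:3*", "3:0:3"))   # PD 63
print(owner_has_path("3/2/0/1", "2:0:4", "3:0:4*"))   # PD 91, after the fix
```

With the values from the example, only PD 91 in its degraded state lacks a path from node 3, which agrees with the single remote chunklet reported for LD 35.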
Port
Checks for the following port connection issues:
• Ports in unacceptable states
• Mismatches in type and mode, such as hosts connected to initiator ports, or host and Remote
Copy over Fibre Channel (RCFC) ports configured on the same FC adapter
• Degraded SFPs and those with low power; perform this check only if this FC Adapter type
uses SFPs
Format of Possible Port Exception Messages
Port port:<nsp> "Port mode is in <mode> state"
Port port:<nsp> "is offline"
Port port:<nsp> "Mismatched mode and type"
Port port:<nsp> "Port is <state>"
Port port:<nsp> "SFP is missing"
Port port:<nsp> "SFP is <state>" (degraded or failed)
Port port:<nsp> "SFP is disabled"
Port port:<nsp> "Receiver Power Low: Check FC Cable"
Port port:<nsp> "Transmit Power Low: Check FC Cable"
Port port:<nsp> "SFP has TX fault"
Port Suggested Actions
Some specific examples are displayed below, but in general, use the following CLI commands to
check for port SFP errors: showport, showport -sfp, showport -sfp -ddm, showcage,
showcage -sfp, and showcage -sfp -ddm.
Port Example 1
Component ------Description------ Qty
Port Degraded or failed SFPs 1
Component -Identifier- --Description--
Port port:0:0:2 SFP is Degraded
Port Suggested Action 1
An SFP in a node-port is reporting a degraded condition. This is most often caused by the SFP
receiver circuit detecting a low signal level (RX Power Low), which is usually the result of a
poor or contaminated FC cable connection. An alert can identify the following condition:
Port 0:0:2, SFP Degraded (Receiver Power Low: Check FC Cable)
Check SFP statistics using CLI commands such as showport -sfp, showport -sfp -ddm,
showcage.
cli% showport -sfp
N:S:P -State-- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0:0:1 OK FINISAR_CORP. 2.1 No No No Yes
0:0:2 Degraded FINISAR_CORP. 2.1 No No No Yes
In the following example, an RX power level of 361 microwatts (uW) for Port 0:0:1 is a good
reading, and 98 uW for Port 0:0:2 is a weak reading (< 100 uW). Normal RX power level
readings are 200-400 uW.
cli% showport -sfp -ddm
--------------Port 0:0:1 DDM--------------
-Warning- --Alarm--
--Type-- Units Reading Low High Low High
Temp C 41 -20 90 -25 95
Voltage mV 3217 2900 3700 2700 3900
TX Bias mA 7 2 14 1 17
TX Power uW 330 79 631 67 631
RX Power uW 361 15 794 10 1259
--------------Port 0:0:2 DDM--------------
-Warning- --Alarm--
--Type-- Units Reading Low High Low High
Temp C 40 -20 90 -25 95
Voltage mV 3216 2900 3700 2700 3900
TX Bias mA 7 2 14 1 17
TX Power uW 335 79 631 67 631
RX Power uW 98 15 794 10 1259
cli% showcage
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
0 cage0 0:0:1 0 1:0:1 0 15 33-38 08 08 DC3 n/a
1 cage1 --- 0 1:0:2 0 15 30-38 08 08 DC3 n/a
cli% showpd -s
Id CagePos Type -State-- -Detailed_State-
1 0:2:0 FC normal normal
...
13 1:1:0 NL degraded missing_A_port
14 1:2:0 FC degraded missing_A_port
cli% showpd -path
---------Paths---------
Id CagePos Type -State-- A B Order
1 0:2:0 FC normal 0:0:1 1:0:1 0/1
...
13 1:1:0 NL degraded 0:0:2missing 1:0:2 1/-
14 1:2:0 FC degraded 0:0:2missing 1:0:2 1/-
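The RX power figures quoted in Port Suggested Action 1 (weak below 100 uW, normal at 200-400 uW) can be applied to captured showport -sfp -ddm output with a short script. The following Python sketch assumes the output layout shown in the example. The function names and the "review" label for readings that this guide does not rate are hypothetical.

```python
import re

def rx_power_readings(ddm_text):
    """Return {port: RX power reading in uW} from showport -sfp -ddm text."""
    readings = {}
    port = None
    for line in ddm_text.splitlines():
        header = re.search(r"Port (\d+:\d+:\d+) DDM", line)
        if header:
            port = header.group(1)
            continue
        fields = line.split()
        # Row layout: "RX Power uW <reading> <warn low> <warn high> ..."
        if port and fields[:2] == ["RX", "Power"]:
            readings[port] = int(fields[3])
    return readings

def classify_rx(microwatts):
    """Classify a reading: below 100 uW is weak, 200-400 uW is normal."""
    if microwatts < 100:
        return "weak"
    if 200 <= microwatts <= 400:
        return "normal"
    return "review"  # hypothetical label; the guide does not rate this range

sample = """\
--------------Port 0:0:1 DDM--------------
RX Power uW 361 15 794 10 1259
--------------Port 0:0:2 DDM--------------
RX Power uW 98 15 794 10 1259
"""
for port, reading in rx_power_readings(sample).items():
    print(port, reading, classify_rx(reading))
```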
Port Example 2
Component -Description- Qty
Port Missing SFPs 1
Component -Identifier- -Description--
Port port:0:3:1 SFP is missing
Port Suggested Action 2
FC node-ports that normally contain SFPs will report an error if the SFP has been removed. The
condition can be checked using the showport -sfp command. In this example, the SFP in 0:3:1
has been removed from the adapter:
cli% showport -sfp
N:S:P -State- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0:0:1 OK FINISAR_CORP. 2.1 No No No Yes
0:0:2 OK FINISAR_CORP. 2.1 No No No Yes
0:3:1 - - - - - - -
0:3:2 OK FINISAR_CORP. 2.1 No No No Yes
Port Example 3
Component -Description- Qty
Port Disabled SFPs 1
Component -Identifier- --Description--
Port port:3:5:1 SFP is disabled
Port Suggested Action 3
A node-port SFP will be disabled if the port has been placed offline using the controlport
offline command. See Example 4.
cli% showport -sfp
N:S:P -State- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
3:5:1 OK FINISAR_CORP. 4.1 Yes No No Yes
3:5:2 OK FINISAR_CORP. 4.1 No No No Yes
Port Example 4
Component -Description- Qty
Port Offline ports 1
Component -Identifier- --Description--
Port port:3:5:1 is offline
Port Suggested Action 4
Check the state of the port with showport. If a port is offline, it was deliberately placed in that
state by using the controlport offline command. Offline ports can be restored using
controlport rst.
cli% showport
N:S:P Mode State ----Node_WWN---- -Port_WWN/HW_Addr- Type
3:5:1 target offline 2FF70002AC00054C 23510002AC00054C free
Port Example 5
Component ------------Description------------ Qty
Port Ports with mismatched mode and type 1
Component -Identifier- ------Description-------
Port port:2:0:3 Mismatched mode and type
Port Suggested Action 5
The output indicates that the port's mode, such as an initiator or target, is not correct for the
connection type, such as disk, host, iSCSI or RCFC. Useful CLI commands include: showport,
showport -c, showport -par, showport -rcfc, showcage.
cli% showport
N:S:P Mode State ----Node_WWN---- -Port_WWN/HW_Addr- Type
2:0:1 initiator ready 2FF70002AC000591 22010002AC000591 disk
2:0:2 initiator ready 2FF70002AC000591 22020002AC000591 disk
2:0:3 target ready 2FF70002AC000591 22030002AC000591 disk
2:0:4 target loss_sync 2FF70002AC000591 22040002AC000591 free
Component -Identifier- ------Description-------
Port port:0:1:1 Mismatched mode and type
cli% showport
N:S:P Mode State ----Node_WWN---- -Port_WWN/HW_Addr- Type
0:1:1 initiator ready 2FF70002AC000190 20110002AC000190 rcfc
0:1:2 initiator loss_sync 2FF70002AC000190 20120002AC000190 free
0:1:3 initiator loss_sync 2FF70002AC000190 20130002AC000190 free
0:1:4 initiator loss_sync 2FF70002AC000190 20140002AC000190 free
Port Example 6
Component -----------Description------------------------- Qty
Port Ports with increasing CRC error counts 2
Component -Identifier- ------Description-----------
Port port:3:2:1 Port or devices attached to port have experienced CRC
errors within the last day
Port Suggested Action 6
Check the fibre channel error counters for the port using the CLI commands showportlesb
single and showportlesb hist. Devices with high InvCRC values are receiving bad packets
from an upstream device (disk, HBA, SFP, or cable).
cli% showportlesb single 3:2:1
ID ALPA ----Port_WWN---- LinkFail LossSync LossSig InvWord InvCRC
<3:2:1> 0x1 23210002AC00054C 20697 2655432 20700 37943749 1756
pd107 0xa3 2200001D38C28AA3 0 157 0 1129 0
pd106 0xa5 2200001D38C0D01E 0 279 0 1551 0
Port Example 7
Component -----------Description------------------------- Qty
Port Ports with increasing CRC error counts 2
Component -Identifier- ------Description-----------
Port port:2:2:1 CRC errors have been increasing by more than one per day
over the past week
Port Suggested Action 7
Check the fibre channel error counters for the port using the CLI commands showportlesb
single and showportlesb hist.
The message "CRC errors have been increasing … over the past week" comes from a check of
the daily port-LESB history as seen in showportlesb hist. If the error condition is corrected,
checkhealth port may continue to report the error until the next daily update is stored. The
checkhealth port should stop reporting within 24 hours after the CRC counter stops counting.
cli% showportlesb single 3:2:1
ID ALPA ----Port_WWN---- LinkFail LossSync LossSig InvWord InvCRC
<3:2:1> 0x1 23210002AC00054C 20697 2655432 20700 37943749 1756
pd107 0xa3 2200001D38C28AA3 0 157 0 1129 0
pd106 0xa5 2200001D38C0D01E 0 279 0 1551 0
Port CRC
Checks for increasing FC port CRC errors.
Format of Possible Port CRC Exception Messages
Port port:<nsp> "There is less than one week of LESB history for this port"
Port port:<nsp> "Port or devices attached to port have experienced CRC errors within
the last day"
Port port:<nsp> "CRC errors have been increasing by more than one per day over the
past week"
Port CRC Example
FC port CRC errors are detected on the specified port. The command showportlesb hist 1:5:1
is useful for troubleshooting FC port CRC problems. This command will display current network
counters and the recent log entries that checkhealth uses to evaluate and report errors.
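The three Port CRC messages can be related to a stored history of daily InvCRC counter values. The following Python sketch illustrates one possible evaluation. It assumes one stored sample per day, that eight samples span one week, and that "increasing by more than one per day" means every daily increase exceeds the limit. The actual checkhealth logic is not documented in this guide, and the function name crc_findings is hypothetical.

```python
def crc_findings(daily_counts, max_per_day=1):
    """Evaluate a list of daily InvCRC counter values, oldest first.

    Assumption: one sample per day; eight samples cover one week.
    """
    if len(daily_counts) < 8:
        return ["less than one week of history"]
    week = daily_counts[-8:]
    # Day-over-day increase of the cumulative counter.
    deltas = [later - earlier for earlier, later in zip(week, week[1:])]
    findings = []
    if deltas[-1] > 0:
        findings.append("CRC errors within the last day")
    if all(delta > max_per_day for delta in deltas):
        findings.append("CRC errors increasing over the past week")
    return findings

print(crc_findings([1700, 1710, 1720, 1730, 1740, 1745, 1750, 1756]))
```

Because the evaluation depends on stored daily samples, a corrected fault continues to appear until a new sample without an increase is recorded, which matches the 24-hour delay described in Port Suggested Action 7.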
Port PELCRC
Checks for increasing SAS port CRC errors.
Format of Possible PELCRC Exception Messages
Portcrc:<nsp> "There is less than one week of PEL history for this port"
Port port:<nsp> "Port or devices attached to port have experienced CRC errors within
the last day"
Port port:<nsp> "CRC errors have been increasing by more than <maxCRC> per day over
the last two days"
Port PELCRC Example
SAS port CRC errors are detected on the specified port. The command showportpel hist 1:5:1
is useful for troubleshooting SAS port CRC problems. This command will display current network
counters and the recent log entries that checkhealth uses to evaluate and report errors.
RC
Checks for the following Remote Copy issues:
• Remote Copy targets
• Remote Copy links
• Remote Copy Groups and VVs
Format of Possible RC Exception Messages
RC rc:<name> "All links for target <name> are down but target not yet marked failed."
RC rc:<name> "Target <name> has failed."
RC rc:<name> "Link <name> of target <target> is down."
RC rc:<name> "Group <name> is not started to target <target>."
RC rc:<vvname> "VV <vvname> of group <name> is stale on target <target>."
RC rc:<vvname> "VV <vvname> of group <name> is not synced on target <target>."
RC Example
Component -Description- Qty
RC Stale volumes 1
Component --Identifier--- ---------Description---------------
RC rc:yush_tpvv.rc VV yush_tpvv.rc of group yush_group.r1127
is stale on target S400_Async_Primary.
RC Suggested Action
Perform remote copy troubleshooting such as checking the physical links between the storage
systems. Useful CLI commands are showrcopy, showrcopy -d, showport -rcip, showport
-rcfc, shownet -d, controlport rcip ping.
SNMP
Displays issues with SNMP. Attempts the showsnmpmgr command and reports errors if the CLI
returns an error.
Format of Possible SNMP Exception Messages
SNMP -- <err>
SNMP Example
Component -Identifier- ----------Description---------------
SNMP -- Could not obtain snmp agent handle. Could be
misconfigured.
SNMP Suggested Action
Any error message that showsnmpmgr can produce might be displayed.
SP
Checks the status of the Ethernet connection between the SP and nodes.
The Ethernet connection can only be checked from the SP because it performs a short Ethernet
transfer check between the SP and the storage system.
Format of Possible SP Exception Messages
Network SP->InServ "SP ethernet Stat <stat> has increased too quickly check SP network
settings"
SP Example
Component -Identifier- --------Description------------------------
SP ethernet "State rx_errs has increased too quickly check SP network
settings"
SP Suggested Action
The <stat> variable can be any of the following: rx_errs, rx_dropped, rx_fifo, rx_frame,
tx_errs, tx_dropped, tx_fifo.
This message is usually caused by customer network issues, but may be caused by conflicting or
mismatching network settings between the SP, customer switch(es), and the storage system. Check
the SP network interface settings using SPmaint or SPOCC. Check the storage system settings by
using commands such as shownet and shownet -d.
Task
Displays failed tasks. Checks for any tasks that have failed within the past 24 hours. This is the
default time frame for the showtask -failed command.
Format of Possible Task Exception Messages
Task Task:<Taskid> "Failed Task"
Task Example
Component --Identifier--- -------Description--------
Task Task:6313 Failed Task
In this example, checkhealth also showed an alert. The task failed because the command was
entered with a syntax error:
Alert sw_task:6313 Task 6313 (type 'background_command', name 'upgradecage -a
-f') has failed (Task Failed). Please see task status for details.
Task Suggested Action
The CLI command showtask -d Task_id displays detailed information about the task. To clean
up the alerts and the reporting of checkhealth, you can delete the failed-task alerts. The alerts
are not auto-resolved and remain until they are manually removed with the MC (GUI) or CLI with
removealert or setalert ack. To display system-initiated tasks, use showtask -all.
cli% showtask -d 6313
Id Type Name Status Phase Step
6313 background_command upgradecage -a -f failed --- ---
Detailed status is as follows:
2010-10-22 10:35:36 PDT Created task.
2010-10-22 10:35:36 PDT Updated Executing "upgradecage -a -f" as 0:12109
2010-10-22 10:35:36 PDT Errored upgradecage: Invalid option: -f
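When the detailed status of a task is long, the failing step can be extracted from captured showtask -d output. The following Python sketch assumes each status line has the form date, time, time zone, event, and message, as in the example. The function name errored_steps is hypothetical.

```python
def errored_steps(task_detail_text):
    """Return the messages of all 'Errored' entries in showtask -d output.

    Assumes status lines of the form
    '<date> <time> <timezone> <Event> <message>'.
    """
    messages = []
    for line in task_detail_text.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[3] == "Errored":
            messages.append(fields[4].strip())
    return messages

detail = """\
2010-10-22 10:35:36 PDT Created task.
2010-10-22 10:35:36 PDT Updated Executing "upgradecage -a -f" as 0:12109
2010-10-22 10:35:36 PDT Errored upgradecage: Invalid option: -f
"""
print(errored_steps(detail))
```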
VLUN
Displays host agent inactive and non-reported virtual LUNs (VLUNs). Also reports VLUNs that have
been configured but are not currently being exported to hosts or host-ports.
Format of Possible VLUN Exception Messages
vlun vlun:(<vvID>, <lunID>, <hostname>)"Path to <wwn> is not reported by host agent"
vlun vlun:(<vvID>, <lunID>, <hostname>)"Path to <wwn> is not seen by host"
vlun vlun:(<vvID>, <lunID>, <hostname>) "Path to <wwn> is failed"
vlun host:<hostname> "Host <ident>(<type>):<connection> is not connected to a port"
VLUN Example
Component ---------Description--------- Qty
vlun Hosts not connected to a port 1
Component -----Identifier----- ---------Description--------
vlun host:cs-wintec-test1 Host wwn:10000000C964121D is not connected to a port
VLUN Suggested Action
Check the export status and port status for the VLUN and HOST by using CLI commands: showvlun,
showvlun -pathsum, showhost, showhost -pathsum, showport, servicehost list.
cli% showvlun -host cs-wintec-test1
Active VLUNs
Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type
2 BigVV cs-wintec-test1 10000000C964121C 2:5:1 host
-----------------------------------------------------------
1 total
VLUN Templates
Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type
2 BigVV cs-wintec-test1 ---------------- --- host
cli% showhost cs-wintec-test1
Id Name Persona -WWN/iSCSI_Name- Port
0 cs-wintec-test1 Generic 10000000C964121D ---
10000000C964121C 2:5:1
cli% servicehost list
HostName -WWN/iSCSI_Name- Port
host0 10000000C98EC67A 1:1:2
host1 210100E08B289350 0:5:2
Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type
2 BigVV cs-wintec-test1 10000000C964121D 3:5:1 unknown
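The host WWN reported by checkhealth in this example is the one whose Port column in the showhost output is empty (---). The following Python sketch lists such entries from captured showhost text. It assumes the layout shown in the example, where the last two fields of each row are the WWN or iSCSI name and the port. The function name unconnected_wwns is hypothetical.

```python
def unconnected_wwns(showhost_text):
    """Return host WWNs (or iSCSI names) whose Port column is '---'."""
    missing = []
    for line in showhost_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "Id":
            continue
        wwn, port = fields[-2], fields[-1]
        if port == "---":
            missing.append(wwn)
    return missing

showhost_output = """\
Id Name Persona -WWN/iSCSI_Name- Port
0 cs-wintec-test1 Generic 10000000C964121D ---
10000000C964121C 2:5:1
"""
print(unconnected_wwns(showhost_output))
```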
VV
Displays Virtual Volumes (VV) that are not optimal. Checks for abnormal state of VVs and Common
Provisioning Groups (CPG).
Format of Possible VV Exception Messages
VV vv:<vvname> "IO to this volume will fail due to no_stale_ss policy"
VV vv:<vvname> "Volume has reached snapshot space allocation limit"
VV vv:<vvname> "Volume has reached user space allocation limit"
VV vv:<vvname> "VV has expired"
VV vv:<vvname> "Detailed State: <state>" (failed or degraded)
VV cpg:<cpg> "CPG is unable to grow SA (or SD) space"
VV Suggested Action
Check status by using CLI commands such as showvv, showvv -d, and showvv -cpg.
Troubleshooting Storage System Setup
If you are unable to access the SP setup wizard, SP, or the Storage System Setup wizard:
1. Collect the SmartStart log files. See “Collecting SmartStart Log Files” (page 71).
2. Collect the SP log files. See “Collecting Service Processor Log Files” (page 71).
3. Contact HP support and request support for your HP 3PAR StoreServ 7000 Storage system.
See “Contacting HP Support about System Setup” (page 72).
Storage System Setup Wizard Errors
This section describes possible error messages that may display while using the Storage System
Setup Wizard.
Common error strings that appear in multiple places
• The specified system is currently in the storage system
initialization process. Only one initialization process can run at
one time.
This message displays when the wizards of two users try to initialize the same storage system
on the same SP. Only one wizard can initialize a storage system.
Two options are available when this error displays in a dialog box; you can click Retry or
Cancel. When the error does not display in a dialog box, look for another SP by serial number
or wait a while and try again later.
• Could not communicate with the server. Make sure you are currently
connected to the network.
This message displays when the client computer that is running the wizard cannot communicate
with the SP, such as when network connectivity is lost.
The error can occur for one of the following reasons:
◦ Network connectivity is lost.
◦ The SP is no longer running.
◦ The SP is not plugged into the network.
◦ The SP IP address has been changed.
• Could not communicate with the storage system. Make sure it is
running and connected to the network.
This message can display if the HP 3PAR OS loses network connectivity, either by becoming
unplugged or by going down for some other reason.
This message displays either in a dialog box or inline. If the message displays in a dialog
box, you can click Retry or Cancel in the wizard. If the message appears inline, you can only
click Next in the wizard.
• Setup encountered an unknown error ({0}). Contact HP support for
help.
This message displays in a dialog box with Retry and Cancel buttons, where {0} is the error
number.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
Errors that appear on the Enter System to Setup page
• Unable to execute the command. All required data was not sent to
the SP server. Contact HP support for help.
This message displays as an inline error on the bottom of the wizard page.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• No uninitialized storage system with the specified serial number
could be found. Make sure the SP is on the same network as the
specified storage system.
This message displays as an inline error on the bottom of the wizard page. In order for the
Storage System Setup Wizard to work, the storage system must be on the same network as
the SP, and you must type in the serial number of the storage system in order for the SP to find
it. If either of these conditions is not met, this error message displays.
Verify that the serial number you entered for the SP is correct, and then do one of the following:
◦ Move the SP or storage system so that they are on the same network.
◦ Use a different SP to set up the storage system.
• Unable to gather the storage system information. Make sure the
specified storage system is running HP 3PAR OS 3.1.2 or later. For
more help, contact HP support.
This message displays as an inline error on the bottom of the wizard page. The error might
be caused by a defect in the Storage System Setup Wizard code or by unexpected information
being returned in the CLI.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The SP encountered an unknown error while finding the specified
storage system. Contact HP support for help.
This message displays as an inline error on the bottom of the wizard page.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The SP does not have a suitable HP 3PAR OS version installed for
the specified storage system. Use SPOCC to install HP 3PAR OS version
{0}.
This message displays as an inline error on the bottom of the wizard page. The SP needs to
have the same Major.Minor.Patch TPD package as the storage system’s HP 3PAR OS. If
the package is not the same, then the SP cannot communicate with the HP 3PAR OS.
{0} will be the version of the TPD package that the user must install so that the SP will work
with the storage system.
• The SP does not have an HP 3PAR OS version installed. Use SPOCC to
install an HP 3PAR OS package.
This message displays as an inline error on the bottom of the wizard page when no TPD
package is installed. The SP needs a TPD package installed in order to communicate with an
HP 3PAR StoreServ Storage system.
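The requirement that the TPD package on the SP has the same Major.Minor.Patch version as the HP 3PAR OS on the storage system can be illustrated with a simple comparison. The following Python sketch assumes both versions are available as purely numeric dotted strings and ignores any fields after the third, which is an assumption. The function name same_release is hypothetical.

```python
def same_release(sp_package_version, storage_os_version):
    """Compare the Major.Minor.Patch part of two dotted version strings.

    Fields after the third (for example a build number) are ignored.
    """
    def key(version):
        return tuple(int(part) for part in version.split(".")[:3])
    return key(sp_package_version) == key(storage_os_version)

print(same_release("3.1.2", "3.1.2"))
print(same_release("3.1.1", "3.1.2"))
```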
Error strings specific to the prepare storage system progress step
The following errors occur during the Progress and Results page:
• The storage system has not yet discovered all the drive types. Make
sure there are no cage problems.
This error message displays in a dialog box with Retry and Cancel buttons. It occurs when the
HP 3PAR StoreServ Storage is unable to determine all the drive types that are connected to
the cage. Wait for about 5 minutes for drive discovery to complete. If the error persists, contact
HP Support.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The storage system has not yet discovered all the drive positions.
Make sure there are no cage problems.
Wait for about 5 minutes for drive position discovery to complete. If the error persists, contact
HP Support.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
Error strings specific to the check hardware health progress step
• The storage system found an error while checking node health. Details
are listed below. {0} appears to be offline. Make sure the node is
plugged in all the way and powered on.
This error message displays in a dialog box with Retry and Cancel buttons. {0} is the name
of the node that appears to be offline. Turn the storage system on and make sure the node is
plugged into the backplane.
• The storage system found an error while checking node health. Details
are listed below.
This error message displays in a dialog box with Retry and Cancel buttons. {0} is the port
location with the problem. Make sure the port is plugged into the node.
• The storage system found an error while checking port health. Details
are listed below. Port {0} appears to be offline.
This error message displays in a dialog box with Retry and Cancel buttons. Information listed
below the message is the CLI output for the checkhwconfig command, which occurs when
the SP does not recognize the command, allowing you to see the output.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The storage system found an error while checking port health. Details
are listed below.
This error message displays in a dialog box with Retry and Cancel buttons. {0} is the location
of the port with the problem.
• The storage system found an error while checking cabling health.
Details are listed below.
This error message displays in a dialog box with Retry and Cancel buttons. The message is
followed by a list of errors. The errors may include:
◦ Cage {0} is connected to the same node twice through ports {1}
and {2}. Re-cable this cage.
This error displays if a cage is connected to the same node twice. {0} will be the name
of the cage and {1} and {2} will be the port locations where the cage is connected.
Re-cable the cage using best practices.
◦ Cage {0} appears to be missing a connection to a node. It does
have a connection on port {1}. Connect the loop pair.
This message displays if a cage is connected to only one node. {0} will be the name of
the cage, and {1} will be the single port to which the cage is connected. Re-cable the
cage using best practices.
◦ Cage {0} is not connected to the same slot and port on the nodes
it is connected to. Re-cable this cage.
This message displays if a cage is connected to a different slot or port on each of the
nodes it is connected to. {0} will be the name of the cage with the problem. Re-cable the
cage using best practices.
• The storage system found an error while checking cabling health.
Details are listed below.
This error message displays in a dialog box with Retry and Cancel buttons. The information
listed below the message is the CLI output of the checkhwconfig command. The raw output is
shown, so that you can review it, when the SP does not recognize the command.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The storage system found an error while checking cage health. The
firmware upgrade succeeded, but cage {0} has not come back. Contact
HP support for help.
This error message displays in a dialog box with Retry and Cancel buttons. This error might
occur after the drive cages have had a firmware upgrade. {0} will be the name of the cage
with the problem. Although the firmware upgrade may have succeeded, this error might occur
if the cage does not boot back up. Contact HP Support.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The storage system found an error while checking cage health. Details
are listed below.
This error message displays in a dialog box with Retry and Cancel buttons. The information
listed below the message is the CLI output of the checkhwconfig command. The raw output is
shown, so that you can review it, when the SP does not recognize the command.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The storage system found an error while checking cage health. There
is a problem with a drive cage that has had a firmware upgrade. Cage
{0} did not come back after the firmware upgrade. Contact HP support
for help.
This error message displays in a dialog box with Retry and Cancel buttons. This error might
occur after the drive cages have had a firmware upgrade. {0} will be the name of the cage
with the problem. Contact HP Support.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• The storage system found an error while checking disk health. Details
are listed below.
This error message displays in a dialog box with Retry and Cancel buttons. The information
listed below the message is the CLI output of the checkhwconfig command. The raw output is
shown, so that you can review it, when the SP does not recognize the command.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
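The three cabling errors listed for the cabling health check can be illustrated with a short Python sketch. It is not the wizard's implementation. It assumes, for illustration only, that each port location is written as node:slot:port and that the cabling is available as a mapping from cage name to the ports the cage is connected to. The function name check_cabling is hypothetical.

```python
from collections import defaultdict


def check_cabling(connections):
    """Return cabling problems for a {cage name: [port locations]} mapping.

    Port locations are assumed to look like "node:slot:port" (an assumption
    made for this sketch, not something the setup wizard is known to use).
    """
    errors = []
    for cage, ports in sorted(connections.items()):
        parsed = [tuple(int(part) for part in location.split(":")) for location in ports]

        # Group the port locations by the node they belong to.
        by_node = defaultdict(list)
        for location, (node, _slot, _port) in zip(ports, parsed):
            by_node[node].append(location)

        # Case 1: the cage is connected to the same node twice.
        for node, locations in sorted(by_node.items()):
            if len(locations) > 1:
                errors.append(
                    f"Cage {cage} is connected to the same node twice "
                    f"through ports {locations[0]} and {locations[1]}."
                )

        if len(ports) == 1:
            # Case 2: the cage is connected to only one node.
            errors.append(
                f"Cage {cage} appears to be missing a connection to a node. "
                f"It does have a connection on port {ports[0]}."
            )
        elif len(by_node) > 1 and len({(slot, port) for _node, slot, port in parsed}) > 1:
            # Case 3: the slot and port differ between the nodes.
            errors.append(
                f"Cage {cage} is not connected to the same slot and port "
                f"on the nodes it is connected to."
            )
    return errors
```

For example, check_cabling({"cage0": ["0:0:1", "1:0:1"]}) returns an empty list, because the cage uses the same slot and port on two different nodes.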
Error strings specific to the network progress step
• Unable to set the storage system network configuration. The storage
system's admin volume has not been created. This must be created
before any networking information is set. Contact HP support for
help.
This message displays in a dialog box with Retry and Cancel buttons. This error occurs if a
previous command failed and the wizard did not detect the error, or if the system is rebooted
for any reason during the installation. Click Cancel to close the wizard, and then begin the
setup process again.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• Unable to set the storage system network configuration. An invalid
name was specified. A storage system name must start with an
alphanumeric character followed by any combination of the following
characters: a-z, A-Z, 0-9, period (.), hyphen (-), or underscore
(_).
This message displays in a dialog box with Retry and Cancel buttons. A storage system name
must contain at least 6 characters, must begin with an alphanumeric character, and must
include at least one of each of the following characters: lowercase letters (a-z); uppercase
letters (A-Z); numbers (0-9); and a period (.), a hyphen (-), or an underscore (_).
Click Cancel to close the wizard, and then begin the setup process again.
• Unable to set the storage system network configuration. An invalid
IPv4 address was specified.
This message displays in a dialog box. The error occurs if the storage system detects that the
defined IPv4 address is invalid.
Click Back and specify a valid IPv4 address.
• Unable to set the storage system network configuration. An invalid
subnet was specified.
This message displays in a dialog box. The error occurs if the storage system detects that the
defined subnet address is invalid.
Click Back and specify a valid subnet address.
• Unable to set the storage system network configuration. An invalid
IPv4 gateway was specified.
This message displays in a dialog box. The error occurs if the storage system detects that the
defined IPv4 gateway address is invalid.
Click Back and specify a valid IPv4 gateway address.
• Unable to set the storage system network configuration. The specified
IPv4 gateway address is not reachable by using the specified storage
system IPv4 address.
This message displays in a dialog box. The error occurs if the storage system detects that the
defined IPv4 gateway address could not be reached.
Click Back and specify a valid IPv4 gateway address. If the error persists, contact HP Support.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• Unable to set the storage system network configuration. The storage
system IPv4 address cannot be the same as the IPv4 gateway.
This message displays in a dialog box. The error occurs if the storage system detects that the
defined IPv4 gateway address is the same as the configured IPv4 address.
Click Back and specify a different address for the IPv4 gateway address.
• Unable to set the storage system network configuration. The specified
address is already in use by another machine.
This message displays in a dialog box. The error occurs if the storage system detects that the
defined IPv4 address is already in use by another machine.
Click Back and specify a different IPv4 address.
• Unable to set the storage system network configuration. The storage
system could not be reached at the new IP address. Make sure your
network settings are configured correctly.
This error message displays in a dialog box with Retry and Cancel buttons. This error displays
when the SP is unable to reach the storage system at the new IP address.
Click Cancel to close the wizard, and then begin the setup process again.
• Unable to set the storage system network configuration. The storage
system did not recognize its new IP address as being validated.
This error message displays in a dialog box with Retry and Cancel buttons. This error displays
when the SP reaches the storage system at the new IP but fails to recognize that the SP was
able to do this.
Click Back and specify a valid IP address. If the error persists, contact HP Support.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
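Several of the network errors above can be caught before the values are entered in the wizard. The following Python sketch is an offline pre-check, not the wizard's own validation. The name pattern follows the wording of the invalid-name error message only. The subnet membership test is only an approximation of the "gateway is not reachable" error, which the storage system determines on the live network. The function and problem strings are illustrative.

```python
import ipaddress
import re

# Name rule taken from the error message: an alphanumeric character followed
# by any combination of letters, digits, period, hyphen, or underscore.
NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def _parse_ipv4(text):
    """Return an IPv4Address, or None if the text is not a valid address."""
    try:
        return ipaddress.IPv4Address(text)
    except ValueError:
        return None


def check_network_settings(name, address, netmask, gateway):
    """Return a list of problems found in the proposed settings."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("invalid name")

    addr = _parse_ipv4(address)
    if addr is None:
        problems.append("invalid IPv4 address")

    # Validate the netmask on its own by attaching it to 0.0.0.0.
    try:
        ipaddress.IPv4Network(f"0.0.0.0/{netmask}")
        mask_ok = True
    except ValueError:
        mask_ok = False
        problems.append("invalid subnet")

    gw = _parse_ipv4(gateway)
    if gw is None:
        problems.append("invalid IPv4 gateway")

    if addr is not None and gw is not None:
        if addr == gw:
            problems.append("address same as gateway")
        elif mask_ok:
            # strict=False lets a host address stand in for its network.
            network = ipaddress.IPv4Network(f"{addr}/{netmask}", strict=False)
            if gw not in network:
                problems.append("gateway outside subnet")
    return problems
```

A call such as check_network_settings("array01", "10.0.0.5", "255.255.255.0", "10.0.0.1") returns an empty list. Detecting an address that is already in use by another machine requires access to the network and is outside the scope of this sketch.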
Error strings specific to the time setup progress step
• Unable to set the storage system NTP server. An invalid address was
specified.
This error message displays in a dialog box. This error displays if the storage system detects
that the NTP address is invalid.
Click Cancel to close the wizard, and then begin the setup process again.
• Unable to set the storage system NTP server. The storage system's
admin volume has not been created. This must be created before any
networking information is created. Contact HP support for help.
This error message displays in a dialog box with Retry and Cancel buttons. This error occurs
if a previous command failed and the wizard did not detect the error, or if the system was
rebooted for any reason during installation.
Click Cancel to close the wizard, and then begin the setup process again.
For information about contacting HP Support, see “Contacting HP Support about System Setup”
(page 72).
• Unable to set the storage system time zone. An invalid time zone
was specified.
This error message displays in a dialog box. This error occurs if the storage system detects
that an unfamiliar time zone was selected.
Click Back and specify a valid time zone.
• Unable to set the storage system time zone. The storage system saw
the time zone as invalid.
This error message displays in a dialog box. This error occurs if the storage system detects
that an unfamiliar time zone was selected.
Click Back and specify a valid time zone.
• Unable to set the storage system time. An invalid time was specified.
This error message displays in a dialog box. This error occurs if the storage system detects
that an invalid time was specified.
Click Back and specify a valid time.
• Unable to set the storage system time. The storage system saw the
time as invalid.
This error message displays in a dialog box. This error occurs if the storage system detects
that an invalid time was specified.
Click Back and specify a valid time.
Collecting SmartStart Log Files
To collect the SmartStart log files for HP support, zip all the files in this folder:
C:\Users\<username>\SmartStart\log.
NOTE: You can continue to access the SmartStart log files in the Users folder after you have
removed SmartStart from your system.
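Zipping the SmartStart log folder can also be scripted. The following is a minimal Python sketch that uses only the standard library. The function name and the archive name are illustrative, and the log folder is passed in as an argument rather than hard-coded.

```python
import zipfile
from pathlib import Path


def zip_log_folder(log_dir, archive_path):
    """Zip every file under log_dir into archive_path and return the file count."""
    log_dir = Path(log_dir)
    archive_path = Path(archive_path)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(log_dir.rglob("*")):
            # Skip directories, and skip the archive itself if it lives in log_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                # Store paths relative to the log folder so the archive is portable.
                archive.write(path, arcname=path.relative_to(log_dir).as_posix())
                count += 1
    return count

# Example call (both paths are placeholders to be replaced):
#   zip_log_folder(r"<SmartStart log folder>", "smartstart_logs.zip")
```

The function returns the number of files added, which gives a quick check that the folder was not empty before the archive is sent to HP Support.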
Collecting Service Processor Log Files
To collect the SP log files for HP support:
1. Connect to SPOCC.
2. Type the SP IP address in a browser.
3. From the navigation pane, click Files.
4. Click the folder icons for files > syslog > apilogs.
5. In the Action column, click Download for each log file:
Service Processor setup log: SPSETLOG.log
Storage System setup log: ARSETLOG.system_serial_number.log
General errors: errorLog.log
6. Zip the downloaded log files.
Contacting HP Support about System Setup
For worldwide technical support information, see the HP support website:
http://www.hp.com/support
Before contacting HP about accessing the SP Setup wizard or the Storage System Setup Wizard,
collect the following information:
• SmartStart log files
• SP log files
• Product model names and numbers
• Technical support registration number (if applicable)
• Product serial numbers
• Error messages
• Operating system type and revision level
• Detailed questions
When contacting HP, specify that you are requesting support for your HP 3PAR StoreServ 7000
Storage system.
6 CBIOS Error Codes
LED Blink Codes
The two stacked LEDs located on the face of each node in the chassis are used by the CBIOS to
communicate status and error conditions. The following are normal conditions (in sequence):
Table 19 LED Blink Codes
Status                                             Bottom    Top
Low level CPU initialization complete              off       amber
High level CPU and SRAM initialization complete    amber     off
Main memory scrubbed                               green1    off
Low 1 MB tested                                    green2    off
If a critical failure (fatal condition) is detected during CBIOS initialization or during system operation,
CBIOS stops the node at a fatal error:
Status                                             Bottom    Top
Fatal error (node may be hot-plugged)              Amber     Amber
In the table above, if a number follows a color, this indicates the flash rate as follows:
• 1 - slow flash, 1 Hz (1 flash per second)
• 2 - normal flash, 2 Hz (2 flashes per second)
• 3 - fast flash, 3 Hz (3 flashes per second)
• 3p - fast flash, 3 Hz (3 flashes per second, then pause for 1 second)
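The color-plus-suffix notation used in the table (for example, green1) can be decoded mechanically. The following Python sketch is illustrative only. It assumes that a color with no suffix means the LED is not flashing, which the list above implies but does not state, and the function name is hypothetical.

```python
import re

# Flash-rate suffixes from the list above:
# suffix -> (flashes per second, pause for 1 second after the flashes)
FLASH_RATES = {"1": (1, False), "2": (2, False), "3": (3, False), "3p": (3, True)}

# A color, optionally followed by a flash-rate suffix.
_LED_STATE = re.compile(r"(off|amber|green)(3p|[123])?")


def decode_led_state(state):
    """Split a table entry such as "green1" into (color, hz, pause)."""
    match = _LED_STATE.fullmatch(state.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized LED state: {state!r}")
    color, suffix = match.groups()
    if suffix is None:
        # Assumption: no suffix means the LED is not flashing.
        return (color, None, False)
    hz, pause = FLASH_RATES[suffix]
    return (color, hz, pause)
```

With this sketch, "green2" decodes to green flashing at 2 Hz, and "Amber" decodes to amber with no flash rate.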
A critical failure can be remedied. Enter Ctrl+W (^W) on the node's serial console, and then
reboot. Although discouraged, if ^C is entered during a fatal error, the BIOS will attempt to
resume after the point of the error.
WARNING! Entering ^W on the serial console of a node that is part of a cluster will abruptly
remove the node from the cluster.
NOTE: The most likely cause for a critical failure is failed or incompatible hardware in the user’s
system.
InForm OS Failed Error Codes and Resolution
These hardware error codes occur while the InForm OS is running, not in CBIOS mode. Each error
will present the following data when triggered: code, sub-code and a message. The table below
includes this information, as well as background information on the problem’s most likely cause.
Also included within the Description column are the steps to take in order to remedy the malfunction,
in the event that anything can be done.
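When many console logs must be searched, the code and sub-code can be extracted with a short script. The following Python sketch assumes that the fatal error line has the form shown in the example later in this chapter ("*** Fatal error: Code 4, sub-code 0x0 (2)."). The value in parentheses is kept as text because its number base is not stated. The names FatalError and parse_fatal_error are illustrative.

```python
import re
from typing import NamedTuple, Optional


class FatalError(NamedTuple):
    code: int       # major code, for example 4
    sub_code: int   # sub-code, parsed from hexadecimal
    data: str       # value in parentheses, kept as text


# Assumed line format: "Fatal error: Code <n>, sub-code 0x<hex> (<data>)"
_FATAL = re.compile(r"Fatal error: Code (\d+), sub-code (0x[0-9a-fA-F]+) \(([^)]*)\)")


def parse_fatal_error(line: str) -> Optional[FatalError]:
    """Return the parsed fatal error, or None if the line does not match."""
    match = _FATAL.search(line)
    if match is None:
        return None
    code, sub_code, data = match.groups()
    return FatalError(int(code), int(sub_code, 16), data)
```

The parsed code and sub-code can then be looked up in the table that follows.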
Failed Alerts
Subcode and Description
Subcode: 0 - 0x0 (0)
This is actually not a node hardware or software initialization or test failure. This code
should never occur, and suggests corruption of the PROM log if it is seen.
Resolution: Contact 3PAR technical support.
Subcode: 1 - 0x1 (0)
Bad or unknown CPU ID (non-Intel). The BIOS is unable to fully identify the processor.
This sub-code indicates the CPUID string is not "GenuineIntel".
Resolution:
A) Replace the processor.
B) Try moving the processor to the other CPU socket. It could be a single socket problem.
C) Try moving the processor to another system. It could be node hardware or software.
D) Replace the node motherboard.
Subcode: 1 - 0x2 (0)
Each class of CPU has a list of technology features it supports. If this error occurs, it is
because the CPU is severely downrev, the CPU is bad, or the motherboard is bad.
Resolution:
A) Replace the processor.
B) Try moving the processor to the other CPU socket. It could be a single socket problem.
C) Try moving the processor to another system. It could be node hardware or software.
Subcode: 1 - 0x3 (0)
Each run of CPU has a major revision and a minor stepping number. If you receive this
message, the processor has not yet been verified by 3PAR for reliable operation. If this
is a new processor, it may be acceptable to press ^C to resume after this error. If you
are testing a new stepping of the processor and need to use it, use the following Whack
command to ignore an unknown CPUID:
Whack> set perm cpu_unqual_ok
Resolution:
A) Upgrade to the latest CBIOS to ensure newer certified processors are acceptable.
B) Replace the processor with one certified by 3PAR for use with the board.
Subcode: 1 - 0x4 (0)
If more than one processor is installed, both CPUs must be certified to operate in
multiprocessor mode. This error indicates that the bootstrap processor was found to not be
certified to run in a multiprocessor mode.
See Code 1, sub-code 0x3 for resolution information.
Subcode: 1 - 0x5 (0)
See Code 1, sub-code 0x3 for resolution information.
Subcode: 1 - 0x6 (0)
This is an internal CBIOS consistency check error. If you see this error, most likely
processor execution out of flash is not stable. The CPU identification is performed after
the flash is fully CRC verified, so this error is likely the result of a failing CPU or transient
bus operation.
Resolution:
A) Replace the processor.
B) Re-flash the CBIOS (no need to upgrade).
Subcode: 1 - 0x7 (0)
This is another internal CBIOS consistency check error. Before each block of update
microcode is uploaded to the Pentium, a checksum on it is first verified. If this checksum
is not valid, the block will be rejected with this error.
See Code 1, sub-code 0x3 for resolution information.
Subcode: 1 - 0x8 (0)
The processor has rejected the microcode update. This could be any number of things,
but is likely due to a failing processor. At this point a strong 64-bit CRC has been run
successfully across the BIOS and a checksum for each update line has also passed.
See Code 1, sub-code 0x4 for resolution information.
Subcode: 1 - 0x9 (0)
The BIOS was not able to locate a microcode update for this particular processor, yet
it is listed as a CPU which requires a microcode update. This is likely due to use of an
unqualified processor.
See Code 1, sub-code 0x4 for resolution information.
Subcode: 1 - 0xa (0)
The processor has failed its own built-in self test. This strongly indicates that the
processor is at fault.
Resolution:
A) Replace the processor.
B) Replace both processor VRM modules.
C) Replace the node motherboard.
Subcode: 1 - 0xb (0)
The two processors in the system board do not have the same bus clock multiplier. The
likely cause is that the processors are of different clock speeds (or, less likely, minor
steppings). The "First CPU" as written above is the bootstrap CPU. On a PIII board, the
bootstrap CPU (CPU3) is to the right, nearest the PromJet interface.
Resolution:
A) Remove both heatsinks and verify the processors are rated for the same clock speed
and bus multiplier.
B) Replace each processor individually.
C) Replace the node motherboard.
Subcode: 1 - 0xc (0)
This CPU does not support clock multiplier changes
In the supported configuration, the two CPUs present in the node must run at the same
clock speed. If the BIOS detects CPUs which have different clock multipliers, it will
automatically configure all CPUs to use the highest common clock multiplier. If a CPU's
multiplier cannot be changed, then this fatal error will result.
See Code 1, sub-code 0x4 for resolution information.
Subcode: 1 - 0xd (0)
Desired clock multiplier xx is too high for this CPU
This error indicates the CPU does not support a clock multiplier the BIOS is attempting
to set.
See Code 1, sub-code 0xc for resolution information.
Subcode: 1 - 0xe (0)
Desired clock multiplier xx is illegal for this CPU
See Code 1, sub-code 0xd for information on this error.
Subcode: 3 - 0x0 (0)
During initialization, memory areas are tested before they are used. SRAM is used by
the processor for persistent storage during early initialization and the CPU memory
tests.
This sub-code indicates that the SRAM walking bits test has failed and that the onboard
SRAM may not be reliable.
Resolution:
A) Power down, wait 30 seconds, power up. This problem is unlikely to be a one-time
occurrence, so it is likely to recur.
B) Replace the node motherboard.
Subcode: 3 - 0x1 (0)
After SRAM contents have been updated with the BIOS static data, a test is performed
to ensure the data arrived intact. If it did not, this error is generated. The error could
indicate an SRAM failure with the same conditions as above.
See Code 3, sub-code 0x0 for resolution information.
Subcode: 4 - 0x1 (0)
The SDRAM DIMMs located on the motherboard are used for main CPU memory and
are critical to the proper operation of a node. Even before the memory is thoroughly
tested for proper operation, it must be configured to appear in CPU-addressable space.
Each DIMM has a small embedded serial EEPROM which holds DIMM configuration
information such as the number of rows, columns, and banks, as well as memory timing.
If this serial EEPROM becomes corrupt, data stored in it regarding the DIMM
configuration cannot be trusted. So, this EEPROM also contains a checksum which the
BIOS verifies is correct before configuring the DIMM. If this checksum does not match
the checksum the BIOS computes across the DIMM, this error will result.
The minor code reported is the total count of errors for the DIMM.
Resolution:
A) Replace the defective CPU DIMM with an identical one.
B) If an identical one is not available, replace the CPU DIMM pair.
See Code 15 for more resolution information.
Subcode: 4 - 0x2 (0)
This error indicates that a CPU memory DIMM was detected but that the EEPROM
present on the DIMM could not be reliably read. The read operation is done through
I2C.
See Code 4 above for resolution information.
Subcode: 4 - 0x4 (0)
This error indicates the BIOS detected the CPU SDRAM DIMMs in the bank pair are of
a different type.
Resolution: Ensure both DIMMs in the pair are identical. Note that two DIMMs may
have the same capacity but have a different number of rows, columns, or banks. The
DIMM configuration must exactly match. If the DIMMs have the same manufacturer,
markings and capacity, they are probably identical.
See Code 15 for more resolution information.
Subcode: 4 - 0x8 (0)
This error indicates the value the DIMM reports for refresh is not valid (greater than the
maximum refresh counter).
See Code 4 above for resolution information.
Subcode: 4 - 0x10 (0)
This error indicates the values the DIMM reports for rows, columns, and banks do not
correspond to any known configuration for a valid DIMM. It is possible the DIMM
EEPROM data has become corrupt or that the DIMM is a higher capacity than what is
currently supported.
See Code 4 above for resolution information.
Subcode: 4 - 0x20 (0)
This applies to P4 only. This error indicates that the BIOS failed to find a set of acceptable
DQS values for one or every nibble of the DIMMs.
See Code 4 above for resolution information.
Subcode: 4 - 0x100 (0)
This error indicates the DIMM pair requires a memory controller setting which is outside
tolerance for the chipset's memory controller. This DIMM pair would likely not function
correctly if it were allowed to be used.
Resolution:
A) Replace CPU DIMMs with 3PAR-certified products.
B) Replace the node motherboard.
C) If there is no other choice, override this error with a BIOS variable, setting
"mem_margin" to the percentage outside margin. Example:
*** Fatal error: Code 4, sub-code 0x0 (2).
Whack> set perm mem_margin=2
Whack> reboot
Subcode: 4 - 0x200 (0)
This error indicates the DIMM pair requires a memory controller setting which is outside
tolerance for the chipset's memory controller. This DIMM pair would likely not function
correctly if it were allowed to be used.
See Code 4, sub-code 0x100 for resolution information.
Subcode: 4 - 0x400 (0)
This error indicates the DIMM pair requires a memory controller setting which is outside
tolerance for the chipset's memory controller. This DIMM pair would likely not function
correctly if it were allowed to be used.
See Code 4, sub-code 0x100 for resolution information.
Subcode: 4 - 0x800 (0)
This error indicates the DIMM pair requires a memory controller setting which is outside
tolerance for the chipset's memory controller. This DIMM pair would likely not function
correctly if it were allowed to be used.
See Code 4, sub-code 0x100 for resolution information.
Subcode: 4 - 0x1000 (0)
This error indicates the DIMM pair requires a memory controller setting which is outside
tolerance for the chipset's memory controller. This DIMM pair would likely not function
correctly if it were allowed to be used.
See Code 4, sub-code 0x100 for resolution information.
Subcode: 4 - 0x2000 (0)
This error indicates the DIMM pair requires a memory controller setting which is outside
tolerance for the chipset's memory controller. This DIMM pair would likely not function
correctly if it were allowed to be used.
See Code 4, sub-code 0x100 for resolution information.
Subcode: 5 - 0x1 (0)
This exception should never happen unless an earlier exception was ignored by pressing
^C. This exception occurs only if the main initialization, diagnostic test, and boot
sequence fails to complete a boot and the user then chooses to ignore the error.
A further explanation is necessary. There are two halves to system initialization. The
first half relies on only SRAM being available and so stack and runtime variables are
stored there. Once main CPU memory has been tested, initialization switches to the
second half which relies on the tested SDRAM for all data structures. This second half
completes initialization and testing of all other node board devices and executes the
boot process. For this last step to fail, the IDE disk must either not be present or contain
an invalid boot. At that point a fatal error is generated.
Do not ignore this condition. It is a final recourse and an abort will reboot or hang the
node board. It is safer at this stage to press ^W and enter Whack. From Whack, you
can reboot with the "reboot" command.
Resolution:
A) Check that the control cache (CPU) DIMMs are installed and pass initialization.
B) Verify the node boot drive is present and node software has been installed.
C) Replace the node, including CPU DIMMs and boot drive.
Subcode: 6 - 0x1 (0)
*** SRAM failure: address xxxxxxxx wrote yy but read zz
This failure indicates an early SRAM verification test revealed a problem with the SRAM.
This is an unrecoverable error which likely requires hardware diagnostics. This error is
displayed by low level init code. It will never be written to the PROM log because
hardware which writes to the PROM relies on correctly functioning SRAM.
Resolution:
A) Cycle power on the node.
B) Replace the bootstrap CPU.
C) Replace the node motherboard.
Subcode: 7 - xxxx (yyyy)
This error indicates the BIOS has detected that the front side bus speed exceeds the
expected speed (133 MHz on PIII, 533 MHz on P4, 1333 MHz on 5000P). The system
may not perform reliably.
Resolution:
A) Cycle power on the node.
B) Replace the bootstrap CPU.
C) Replace the node motherboard.
Subcode: 8 - xxxxxxxx (yyyyyyyy)
Machine check:
MCG_STATUS == xxxxxxxx yyyyyyyy
During BIOS initialization and testing, the processor must execute instructions. If this
error results at any point, it is likely due to failing hardware related to the CPU's
instruction execution path.
Resolution:
A) Cycle power on the node.
B) Update the node firmware to the latest version.
C) Replace CPU SDRAM in pairs.
D) Replace the node motherboard.
Subcode: 9 - 0x0 (0)
*** Entering memory segment test: Stack is in xxx ***
One of the first memory tests performed in diagnostic mode is a sequential address or
random data test. If there is no memory in the system, or the memory DIMMs are
mismatched, or there is a memory subsystem problem, this error may result.
Resolution:
A) Verify memory is installed and in matched pairs (same manufacturer, exact same
memory configuration and speed).
B) Replace CPU DIMMs with a set of known good ones.
C) Replace the node motherboard.
Subcode: 9 - 0x1 (0)
Insufficient memory: BSS end == xxxx, stack limit == yyyy
During the first part of initialization, the system stack comes from SRAM. During the second
part of initialization, the system stack comes from CPU memory. If there is insufficient SDRAM
(such as no DIMMs installed) this error may result.
Do not ignore this error with ^C, because the system stack will fall past the available
memory and will probably hard hang the initialization.
See Code 9, sub-code 0x0 for resolution information.
Subcode: 9 - 0x2 (0)
Expected sdram_init_test to be xxxx, but it was yyyy.
After SDRAM has been initialized and scrubbed, the BIOS copies runtime variables
from Flash to CPU memory. The fact this data is copied to SDRAM is later verified. This
fatal error may be caused by a software error in the BIOS, a hardware error
(such as flaky CPU memory), or user intervention such as modifying the memory
containing the SDRAM copy of the runtime variables.
Resolution:
A) Reboot. If the problem is caused by flaky hardware, a prior memory test should catch
this condition.
B) Upgrade BIOS version. Not a likely solution since this code path is well tested every
time the system is booted.
C) Replace CPU DIMMs with a set of known good ones.
D) Replace the node motherboard.
Subcode: 9 - 0x3 (0)
Low 1M test: Test completed: x iterations, y probes, z errors found
The low 1 MB of memory is thoroughly tested to ensure reliable operation as this is the
memory area that the BIOS and Whack use during further initialization and testing. If
this test fails, it should not be ignored with ^C as having reliable system memory is
critical to proper operation.
Resolution:
A) Cycle power on the node. Occasionally, memory will fail during a memory test due
to metallic dust.
B) Reseat CPU memory DIMMs.
C) Pull CPU DIMMs, blow dust from sockets, reseat.
D) Replace CPU memory DIMMs in pairs to ensure replacement parts are matched.
PIII nodes: Non-paired DIMMs are proximally closest. Paired DIMMs are the
leftmost-leftmost and rightmost-rightmost of each two which are proximally closest.
P4 nodes: Paired DIMMs are proximally closest. DIMM0 and DIMM1 are a pair. DIMM2
and DIMM3 are a pair.
E200, Ironman, and Tinman nodes: There is only a single pair of CPU memory DIMMs.
E) Replace the node motherboard.
Subcode: 9 - 0x4 (0)
High 64K test: Test completed: x iterations, y probes, z errors found
In addition to the low 1 MB of memory, older BIOS versions also thoroughly tested the
high 64 KB of memory. This is because the operational stack for the CBIOS and Whack
used to reside at this address, which made the memory critical for proper initialization
and testing. The current BIOS now uses memory below 1 MB for stack space, so this
failure code is deprecated.
See Code 9, sub-code 0x3 for resolution information.
Subcode: 9 - 0x5 (0)
SDRAM walk: Test completed: xx iterations, yy probes, zz errors found
During initialization (prior to a thorough test of the low 1 MB of memory), a quick walk
through all CPU memory is performed. If an error is found, this fatal error is displayed.
See Code 9, sub-code 0x3 for resolution information.
Subcode: 9 - 0x6 (0)
Full SDRAM test: Test completed: xx iterations, yy probes, zz errors found
During later testing, a full SDRAM test is performed which more completely verifies
proper memory operation than the cursory SDRAM walk. This test is very similar to the
initial thorough 1 MB test done during initialization.
See Code 9, sub-code 0x3 for resolution information.
Subcode: 9 - 0x7 (0)
Pairwwww DIMMxxxx: Illegal SPD value <name of value> <value>
This error indicates that a CPU DIMM was detected but that the EEPROM present on
the DIMM reported an illegal or unsupported value for our memory controller.
Example:
Density (SPD byte 31) has more than 1 bit set (i.e., 0x30), which indicates a non-standard
part.
See Code 9, sub-code 0x3 for resolution information. Most likely, the DIMM is not
qualified for use in our node board. The DIMM number is logged in the Data field of
the Fatal Error.
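The density example above amounts to checking that exactly one bit is set in a byte. A minimal Python sketch of that check follows. The function name is illustrative and is not part of CBIOS, and how CBIOS performs the check internally is not documented here.

```python
def has_single_bit_set(value: int) -> bool:
    """Return True if exactly one bit is set in value."""
    # Clearing the lowest set bit of a one-bit value leaves zero.
    return value != 0 and (value & (value - 1)) == 0
```

Under this check, 0x30 (binary 00110000) has two bits set and is rejected, while 0x20 (binary 00100000) has one bit set and is accepted.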
Cannot allocate xx bytes for PCI bus yy scan9 - 0x10 (0)
or
Cannot allocate xx bytes for PCI device on bus yy
This error indicates there was not enough memory or a memory error occurred while
attempting to allocate heap space during the PCI device probe. SDRAM is needed
because the BIOS maintains a list of PCI devices present in the system.
Resolution:
A) Cycle power on the node.
B) Remove all PCI cards.
C) Replace CPU DIMMs.
D) Replace the node motherboard.
Cannot find bus xx in scanned PCI busses9 - 0x11 (0)
During the PCI bus scan, a list of PCI devices present is recorded in SDRAM. For each
device present, a block of memory is allocated and initialized. This error indicates that
a data value indicating bus number could not be found in the list of devices previously
scanned. This is probably due to an SDRAM or CPU failure.
Resolution:
A) Cycle power on the node.
B) Remove all PCI cards.
C) Replace CPU DIMMs.
D) Replace bootstrap CPU.
E) Replace the node motherboard.
InForm OS Failed Error Codes and Resolution 79
DescriptionSubcode
No memory installed.9 - 0x12 (0)
This error indicates that the CPU memory scan failed to locate any usable memory for
the system. There must be at least one bank of SDRAM configured for the node to
operate correctly.
Resolution:
A) Cycle power on the node.
B) Verify CPU DIMM scan output shows DIMMs.
C) Replace CPU DIMMs.
D) Replace the node motherboard.
9 - 0x13 (xxxx)
Unknown DDR2 frequency (xxxx)
This error indicates that the CPU memory installed is of an unrecognized and thus
unsupported memory speed. Supported speeds include 533, 667 and 800 MHz.
Resolution: Replace CPU DIMMs with 533, 667 or 800 MHz modules.
9 - 0x14 (0)
FB-DIMM Initialization Failure
This error indicates that CBIOS was unable to initialize the CPU memory installed.
Resolution:
A) Cycle power on the node.
B) Replace CPU DIMMs.
C) Replace the node motherboard.
9 - 0x15 (data)
This error indicates that an uncorrectable ECC error was detected on a DIMM. The data
value is a bitmask that may be decoded to determine which DIMM had the error. A
value of 1 indicates DIMM 0, 2 indicates DIMM 1, 4 -> DIMM 2, etc. More than one
bit may be set if CBIOS is unable to isolate the error down to a single DIMM.
Resolution:
A) Cycle power on the node.
B) Replace FB-DIMM(s).
C) Replace the node motherboard.
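The bitmask decoding described for sub-code 0x15 can be sketched as follows. This is a minimal illustration in Python, not part of CBIOS. The function name decode_dimm_mask is hypothetical, and the only assumption is the mapping stated in the description (bit 0 is DIMM 0, bit 1 is DIMM 1, and so on).

```python
def decode_dimm_mask(data):
    """Return the DIMM numbers whose bits are set in the fatal error data value."""
    dimms = []
    bit = 0
    while data:
        if data & 1:
            dimms.append(bit)  # bit N set means DIMM N is implicated
        data >>= 1
        bit += 1
    return dimms


print(decode_dimm_mask(0x1))  # [0]
print(decode_dimm_mask(0x4))  # [2]
print(decode_dimm_mask(0x6))  # [1, 2]
```

A result with more than one entry corresponds to the case where CBIOS could not isolate the error to a single DIMM.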
10 - 0x1 (0)
During the PCI scan, many devices which were programmed by previous PCI scan steps
are examined again to verify the programming was successful. This error indicates that
a bridge failed to record the PCI bus number of bridges below it.
Resolution:
A) Cycle power on the node.
B) Remove all PCI cards.
C) Replace the node motherboard.
10 - 0x2 (0)
Several devices on the PCI bus of a node board are known by the CBIOS to have
specific sizes. As a hardware consistency check, the BIOS verifies that
these devices are not only present, but also have appropriate memory and I/O space
requirements. If any device is found outside of expected requirements, it will cause this
error.
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Swap out the PCI card for another qualified card (if it's a card).
D) Pull all PCI cards to see if the problem persists. If so, replace any defective cards.
E) Replace the node motherboard.
10 - 0x3 (0)
This error indicates that the system has run out of available mapping area while
attempting to map this device into the CPU's I/O address range (0x0000 - 0xfe00).
The likely cause of this error is that a prior PCI device is consuming too much I/O space.
Since most device I/O ranges are extremely small, the likely cause is a defective PCI
card or a PCI bus problem.
Resolution:
A) Reseat all PCI cards.
B) Swap out individual PCI cards.
C) Replace the node motherboard.
10 - 0x4 (0)
Many PCI devices (and software drivers) require DMA addressable memory within the
32 bit address space (less than 4 GB). For this reason, all 32 bit PCI devices are required
to be mapped within this space. Currently, all CPU memory is also forced to be mapped
within this space, limiting the maximum 32-bit CPU memory to about 3 GB.
Resolution:
A) Swap out individual PCI cards.
B) Replace the node motherboard.
See Code 10, sub-code 0x3 for diagnostic information.
10 - 0x5 (0)
The non-prefetchable memory has the same 32 bit limitations as prefetchable memory
does.
See Code 10, sub-code 0x4 for resolution information.
10 - 0x6 (0)
64 bit PCI devices are not limited to a 32 bit address space. The CPU, however, can
only access a 36 bit space (when virtual memory is enabled). Because most drivers
need direct access to the memory a device provides on the bus, the device must be
addressable by the Pentium and so the maximum 64 bit address allowed is 0xf:ffffffff.
This is 64 GB.
See Code 10, sub-code 0x4 for resolution information.
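The address limits quoted for sub-codes 0x4 through 0x6 follow directly from the number of address bits. The short Python sketch below works through the arithmetic. The helper names max_address and fits are hypothetical, and the window values in the usage lines are made-up examples rather than values taken from a real node.

```python
GIB = 1 << 30  # bytes in one gigabyte (binary)


def max_address(bits):
    """Highest byte address reachable with the given number of address bits."""
    return (1 << bits) - 1


def fits(base, size, bits):
    """True if the window [base, base + size) lies entirely within the address space."""
    return size > 0 and base >= 0 and base + size - 1 <= max_address(bits)


print((1 << 32) // GIB)            # 4  (32 bit space: less than 4 GB)
print(hex(max_address(36)))        # 0xfffffffff, written 0xf:ffffffff in the text
print((1 << 36) // GIB)            # 64 (36 bit space: 64 GB)
print(fits(0xF_0000_0000, 0x1000_0000, 36))  # True
print(fits(0x10_0000_0000, 0x1000, 36))      # False: starts above the 36 bit limit
```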
10 - 0x7 (0)
Testing CM PCI 64-bit data lines: FAIL
The Cluster Manager (Eagle / Osprey) is used to perform a walking bit test on both
PCI0 and PCI1 data paths to CPU memory. If a problem is found, with either path, this
error will be displayed. The error will be further qualified by one of the following prior
lines:
PCIxxxx all data bits stuck high
PCIxxxx found data bits stuck high: BitWW, BitXX, BitYY, BitZZ
PCIxxxx all data bits stuck low
PCIxxxx found data bits stuck low: BitWW, BitXX, BitYY, BitZZ
PCIxxxx data bits possibly floating: BitWW, BitXX, BitYY, BitZZ
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Pull all PCI cards to see if the problem persists. If so, replace any defective cards.
D) Replace the node motherboard.
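The idea behind a walking bit test on a 64-bit data path can be illustrated with the Python sketch below. It is not the CBIOS implementation. The transfer callback stands in for a write through the data path followed by a read back, and faulty_path is a hypothetical simulation of a path with one bit stuck high and one bit stuck low. Floating bits, which read back inconsistently, are not modeled here.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def walking_bit_test(transfer, width=WIDTH):
    """Drive walking-one and walking-zero patterns through transfer().

    Returns (stuck_high, stuck_low) as lists of bit numbers.
    """
    mask = (1 << width) - 1
    stuck_high, stuck_low = [], []
    for bit in range(width):
        one = 1 << bit
        # Walking one: only this bit is driven high. Reading it back low means stuck low.
        if not transfer(one) & one:
            stuck_low.append(bit)
        # Walking zero: only this bit is driven low. Reading it back high means stuck high.
        if transfer(mask ^ one) & one:
            stuck_high.append(bit)
    return stuck_high, stuck_low


def faulty_path(word):
    """Hypothetical data path with bit 3 stuck high and bit 17 stuck low."""
    return (word | (1 << 3)) & ~(1 << 17) & MASK


print(walking_bit_test(lambda word: word))  # ([], [])
print(walking_bit_test(faulty_path))        # ([3], [17])
```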
10 - 0x8 (0)
CBIOS runs simple CM PCI Tests as part of POST in both normal operation and
manufacturing test. The tests use XCBs to transfer data over both CM PCI interfaces from
Cluster Memory to CPU Memory and back. If any test fails due to a data miscompare,
the test will generate this fatal error code with sub-code '0x4'.
These tests are similar to the Cluster Memory Tests and may fail due to Cluster Memory
SDRAM hardware or CPU SDRAM hardware failures. Any test failure will result in a
fatal error.
Resolution:
A) Cycle power on the node.
B) Reseat CM memory riser card.
C) Reseat the failing Cluster memory DIMM.
D) Replace the failing Cluster memory DIMM.
E) Replace the node motherboard.
10 - 0x9 (0)
This error indicates one of the PCI bridges on the board has a bad clock value and is
refusing to accept programming of a good clock.
Resolution:
A) Cycle power on the node. The problem may occur on power cycle (only) with random
chance on a bad board.
B) Pull all PCI cards which have integrated bridges (QLogic quad port cards are a good
example of this). You should power cycle several times to determine it is not an
intermittent problem with the motherboard.
C) Replace the node motherboard.
10 - 0xa (0)
This error indicates one of the PCI bridges on the board has a bad GPIO input which
selects bridge clock sources on a power on condition.
Resolution:
A) Cycle power on the node. The problem may occur on power cycle (only) with random
chance on a bad board.
B) Replace the node motherboard.
10 - 0xb (0)
Warning: This node has xx PCI cards present, but yy is the required minimum. Please
verify your node is properly configured. You may adjust the required minimum with the
"set pci_min" command.
This error indicates this node has detected fewer PCI cards than the recommended 3PAR
minimum. In a system configuration where there are fewer than the minimum active PCI
cards, inactive load cards should be used to reach the required minimum.
Resolution:
A) Verify the minimum required number of PCI cards are inserted in the node. Install
dummy load cards to reach the required minimum.
B) Verify all PCI cards in the system have been identified. Replace any missing card.
C) Replace the node motherboard.
10 - 0xc (0)
Testing CM PCI 64-bit address lines: FAIL
CM XCB TEST miscompare at offset, uuuu
Expected (vvvvvvvv)
Actual (wwwwwwww)
CM DIMMxx (Jyyyy): Address (zz:zzzzzzzz)
The Cluster Manager is used to perform a walking bit test on both PCI0 and PCI1
address line paths from CPU memory into cluster memory. If a problem is found (with
either path), this error will be displayed. The particular memory address which caused
this error will be indicated.
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Pull all PCI cards to see if the problem persists. If so, replace any defective cards.
D) Replace the node motherboard.
10 - 0xd (zz)
*** Vendor xxxx device yyyy on motherboard not yet qualified.
*** Vendor xxxx device yyyy in slot zz not yet qualified.
This is an error indicating that the device found is not recognized by the BIOS as a
3PAR-qualified device. This may be because the board is a new generation or that
there was a PCI error in communicating with the device. In the former case, it is probably
safe to press ^C to ignore this error. In the latter case, it is possible that part of the board
has become non-functional to the point where the BIOS may not be able to determine if the rest
of the board will continue to function. If you need to override this feature, enter Whack
at this point by pressing ^W. Enter the following command:
Whack> set perm pci_unqual_ok
If the data field is non-zero, it indicates the BIOS discovered the problem is a card in
a particular PCI slot.
The specific codes are as follows:
* 30 is PCI Slot 0
* 31 is PCI Slot 1
* 32 is PCI Slot 2
* 33 is PCI Slot 3
* 34 is PCI Slot 4
* 35 is PCI Slot 5
Resolution:
A) Swap out the PCI card for a qualified card.
B) Replace the node motherboard.
10 - 0xe (0)
This error indicates the PCI scanning code was unable to lay out a valid PCI address
table mapping within 21 passes. The cause of this error is possibly due to either defective
hardware or BIOS firmware.
Resolution:
A) Remove all PCI cards. If the error goes away, attempt to find failed card by process
of elimination (put back half of the cards and try to boot again).
B) Replace the node motherboard.
10 - 0x10 (0)
This error indicates a possible hardware failure on the board. The bus which connects
the CMIC (P4 North Bridge) to CIOB A failed to initialize properly.
Resolution:
A) Cycle power on the node. The problem may occur with random chance on a bad
board.
B) Replace the node motherboard.
10 - 0x11 (0)
The BIOS checks for specific onboard PCI devices (such as bridges) which are known
to be on a particular node board. If a device listed in the BIOS table is not found on
the board, then this error will result.
Resolution:
A) Cycle power on the node.
B) Remove PCI cards and see if error disappears.
C) Replace the node motherboard.
10 - 0x12 (0)
Onboard PCI devices (such as bridges) are well known by the BIOS to appear at specific
bus addresses. If this device is not known by the BIOS, but it is configured on a bus
which is not externally exposed (PCI slot), then you will see this error. Since the node
board is a closed solution, this error might occur if an on board device is failing and
does not report a correct device vendor/ID, or corrupts the device vendor/ID reported
by another device on the bus.
See Code 10, sub-code 0x11 for resolution information.
10 - 0x13 (0)
The PCI header is re-read on multiple passes of the PCI initialization. If a mismatch is
found with a previous read of the PCI bus, then this error will result. This is a strong
indicator of a flaky device or bus. If the BIOS is in Diagnostic mode (press ESC at the
initial memory test), at this point, the following will also be displayed:
Starting infinite PCI read loop...
In Diagnostic mode, once a failure is detected, this test is then repeated until manual
intervention.
See Code 10, sub-code 0x3 for resolution information.
10 - 0x14 (0)
During PCI initialization, a 64 bit window was found on the PCI bus which is outside
the 36 bit range imposed by the CPU.
See Code 10, sub-code 0x3 for resolution information.
10 - 0x15 (0)
During PCI initialization, a window was found on the PCI device with a size of zero.
This fatal error may indicate that the BIOS is not able to properly communicate with the
PCI device.
See Code 10, sub-code 0x3 for resolution information.
10 - 0x16 (slot)
During PCI initialization, each memory or I/O window present on each device found
on the bus is programmed with a CPU memory bus address so that it may be accessed
by further BIOS initialization, tests and of course the main operating system. The BIOS
verifies the address it programs for each window was correctly programmed (by reading
back the value just written). If they do not match, this error is generated.
The slot number is an ASCII value represented as Hexadecimal. If the slot value is 0,
then the failure occurred on a node motherboard device. If PCI Slot 0 was involved,
then slot is 30. PCI Slot 1 is 31; PCI Slot 2 is 32; PCI Slot 6 is 36, etc.
See Code 10, sub-code 0x3 for resolution information.
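The slot encoding described for sub-code 0x16 (an ASCII digit reported as a hexadecimal value) can be decoded as in the Python sketch below. The function name decode_slot and the returned strings are hypothetical. The mapping itself is the one given in the description: 0 means a node motherboard device, 0x30 means PCI Slot 0, 0x31 means PCI Slot 1, and so on.

```python
def decode_slot(data):
    """Translate the slot data field of a fatal error into a readable location."""
    if data == 0:
        return "node motherboard device"
    if 0x30 <= data <= 0x39:  # ASCII codes for the characters '0' through '9'
        return "PCI Slot " + chr(data)
    return "unknown (0x%x)" % data


print(decode_slot(0x00))  # node motherboard device
print(decode_slot(0x30))  # PCI Slot 0
print(decode_slot(0x36))  # PCI Slot 6
```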
10 - 0x17 (0)
See Code 10, sub-code 0x16 for information on this error.
10 - 0x18 (0)
See Code 10, sub-code 0x16 for information on this error.
10 - 0x19 (0)
During PCI initialization, each memory or I/O window present on each device found
on the bus is programmed with a CPU memory bus address. The size of the window
required is provided by the specific PCI device. It is required that this window is a power
of 2 in size (1 KB, 2 KB, 4 KB, ... 32 MB, 64 MB, etc). This is a consistency check the
BIOS performs to ensure it is properly communicating with the PCI device.
See Code 10, sub-code 0x3 for resolution information.
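A power-of-2 consistency check of the kind described for sub-code 0x19 is commonly written with a single bit operation. The Python sketch below is only an illustration of that check. The function name is_power_of_two is hypothetical and the code is not taken from the BIOS.

```python
KB = 1024
MB = 1024 * KB


def is_power_of_two(size):
    """True if size is 1, 2, 4, 8, ... A power of 2 has exactly one bit set."""
    return size > 0 and (size & (size - 1)) == 0


print(is_power_of_two(4 * KB))   # True
print(is_power_of_two(64 * MB))  # True
print(is_power_of_two(3 * KB))   # False
print(is_power_of_two(0))        # False (a zero-size window is reported separately as sub-code 0x15)
```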
10 - 0x1a (0)
During PCI initialization, the entire PCI bus is walked as a tree and device registers
are initialized and mapped into processor address space using this tree. The bus structure
is then ordered and summarized into a table so that software can later find specific
devices for high level initialization. This specific error indicates the PCI scan attempted
to map a PCI device into the CPU's 32-bit address space, but failed due to no more
available space. Verify that NVRAM flags such as "pci_base" and "mem_max" are not
set to unusual values.
See Code 10, sub-code 0x3 for resolution information.
10 - 0x1b (0)
This error indicates a possible hardware failure on the board. The bus which connects
the CMIC (P4 North Bridge) to CIOB B failed to initialize properly.
See Code 10, sub-code 0x10 for resolution information.
10 - 0x1c (data)
This error indicates a possible hardware failure on the board. The CIOB (which connects
the North Bridge to the I/O system) has an incorrect clock speed.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
10 - 0x1d (0)
This error indicates one of the PCI bridges on the board has a bad speed selection set,
which could indicate an incorrect type of PCI card has been installed or that bridge
mode select strappings are bad.
Resolution:
A) Pull all PCI cards one at a time to determine failed card.
B) Replace the node motherboard.
10 - 0x24 (data)
This error indicates that during a previous PCI scan, the CPU hung repeatedly. Other
than this being a fatal error, this code is identical to that of sub-code 0x23. Note that
if this fatal error is seen without a preceding non-fatal sub-code 0x23, then the failure
is likely to be the node motherboard.
If the non-fatal error is not logged, then a PCI scan hung earlier in the PCI tree than a previous
hang. Unless both hangs happened on the same HBA, the cause is likely a shared
device on the node motherboard.
See Code 10, sub-code 0x23 for resolution information.
10 - 0x25 (0)
This error indicates that the PCI device does not have an EEPROM attached.
Resolution: Replace node motherboard.
10 - 0x26 (0)
This error indicates that the EEPROM failed to be programmed.
Resolution: Replace node motherboard.
10 - 0x27 (0)
This error indicates that BIOS was unable to verify the EEPROM contents after
programming or that the data was successfully written but did not persist.
Resolution: Replace node motherboard.
11 - yyyy (0)
The BIOS installs an interrupt handler to catch spurious (unexpected) interrupts and
exceptions during initialization and testing of the node hardware. During initialization,
the BIOS even tests to verify a generated interrupt is delivered correctly. This is a serious
condition and should not be ignored by pressing ^C. The specific interrupt received is
the sub-code displayed. The interrupt number will be less than 0x20.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
12 - 0x0 (0)
PIII or P4 node:
--- SMI: No known cause (# zz)
GPE status: yyyyyy, GPE input: zzzzzz
An SMI is a System Management Interrupt, an interrupt generated by the node
hardware for the BIOS to service a particular failure. This error indicates the BIOS was
unable to determine the cause of the SMI delivered by hardware.
See Code 11 for resolution information.
12 - 0x0 (0)
Ironman, Tinman, Titan, or Atlas nodes:
CPU0 SMI: Bootstrap
CPU0 SMI: Updating
CPU0 SMI: Updated
--- SMI: No known cause (# 1) on CPU6
SMSCS[0] = 0x00000000
...
ALT_GP_SMI_EN = 0xbfbf
ALT_GP_SMI_STS = 0x0000
TMP_STS = 0x00000000:88380000
TMP_INT = 0x00000000:00000001
This fatal error indicates the BIOS received an SMI, but wasn't able to determine which
device caused the interrupt. In this example, the "Bootstrap," "Updating," and "Updated"
messages suggest the BIOS firmware was updated.
Resolution:
A) Reboot the node.
B) Replace the node motherboard.
12 - 0x1 (yyyy)
During initialization, the BIOS installs an interrupt handler to verify interrupts are
delivered reliably. It then generates an expected interrupt. If an interrupt is delivered
which is not the same as the one expected, this error is displayed. The interrupt number,
yyyy, represents which interrupt occurred.
See Code 11 for resolution information.
13 - 0x0 (yyyy)
During initialization, the BIOS installs an interrupt handler to verify interrupts are
delivered reliably. It then generates a few expected interrupts. If the specific interrupt
is not delivered, this error is displayed. The interrupt number, yyyy, represents which
interrupt should have been generated.
See Code 11 for resolution information.
14 - 0x0 (0)
The Whack "mem test ecc" command performs an ECC test over the main memory to
ensure ECC memory error correction is functioning. If this test fails, this message is
displayed, together with other messages giving details.
Note:
Running the "mem test ecc" command destroys some memory locations in the range of
[0 .. 512 KB] and [1 MB .. just below the top of SDRAM]. Hence, executing this once
Linux has booted will cause it to fail if it is reentered.
If you see this failure often during BIOS initialization, then the cause is likely a hardware
problem. Specifically, the error tells you that the hardware ECC error mechanism is not
working correctly. Changing CPU memory DIMMs may solve the problem, but it's more
likely a board failure.
Resolution:
A) Ensure the North Bridge heatsink is firmly attached.
B) Replace CPU DIMMs.
C) Replace bootstrap CPU.
D) Replace the node motherboard.
14 - 0x1 (1)
00 - 0f: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 40 0c 00
10 - 1f: 01 ff 00 00 00 00 00 ff | ff ff ff ff ff ff ff ff
20 - 2f: 04 09 08 09 20 09 10 09 | 18 09 00 09 00 00 59 8e
30 - 3f: aa aa 0a 02 a8 00 00 00 | 00 00 00 c0 7b df ff ff
This error indicates the BIOS ECC hardware test could not get the hardware to generate
an ECC SMI in response to a corrupted memory address. It possibly indicates a failing
DIMM or memory controller, or that memory timings are too fast for the DIMMs present
in the node.
See Code 14, sub-code 0x0 for resolution information.
15 - 0x0 (slot)
mailbox register xxxx changed inappropriately
(yyyy) != expected (zzzz)
register test: FAIL
(slot) = PCI slot number
There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The
slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and
Ironman nodes. The slots are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top
three will depend on which slot the node is in. During POST, all present FCAL adapters
are tested for functionality. The HBA cards sometimes require a firmware download for
full capability. POST does not have access to this firmware and will only test basic
register access and functionality. If the Register Test fails, POST will indicate this error.
If the user continues past this error (^C), software will log the error and continue testing
the other PCI cards (if present).
Resolution:
A) Reseat the failing PCI Fibre Adapter.
B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI
Fibre Adapter.
C) Replace the node motherboard.
15 - 0x1 (slot)
controller memory xxxx value (yyyy) != expected (zzzz)
memory test: FAIL
(slot) = PCI slot number
There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The
slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and
Ironman nodes. The slots are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top
three will depend on which slot the node is in. During POST, all present FCAL adapters
are tested for functionality. The HBA cards sometimes require a firmware download for
full capability. POST does not have access to this firmware and will only test basic
functionality. If the Onboard Memory Test fails, POST will indicate this error.
If the user continues past this error (^C), software will log the error and continue testing
the other PCI cards (if present).
Resolution:
A) Reseat the failing PCI Fibre Adapter.
B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI
Fibre Adapter.
C) Replace the node motherboard.
15 - 0x2 (slot)
data bits possibly float: Bitxxxx-Bityyyy.
PCI walking bits: FAIL
(slot) = PCI slot number
There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The
slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and
Ironman nodes. The slots are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top
three will depend on which slot the node is in. During POST, all present FCAL adapters
are tested for functionality. The HBA cards sometimes require a firmware download for
full capability. POST does not have access to this firmware and will only test basic
functionality. If the PCI Fibre Card Bus Test fails, POST will indicate this error.
If the user continues past this error (^C), software will log the error and continue testing
the other PCI cards (if present).
Resolution:
A) Reseat the failing PCI Fibre Adapter.
B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI
Fibre Adapter.
C) Replace the node motherboard.
15 - 0x3 (slot)
data bits possibly float: Bitxxxx-Bityyyy.
CM0 walking bits: FAIL
(slot) = PCI slot number
There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The
slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and
Ironman nodes. The slots are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top
three will depend on which slot the node is in.
This test indicates a problem was observed with the fibre channel card talking with the
Cluster Manager.
If the "fibre test pci" test passed, then this problem is likely in the interface to the CM
or CM memory.
Resolution:
A) Reseat the failing PCI Fibre Adapter.
B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI
Fibre Adapter.
C) Replace the node motherboard.
15 - 0x4 (slot)
PCIe EYE test: FAIL
(slot) = PCI slot number
There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The
slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and
Ironman nodes. The slots are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top
three will depend on which slot the node is in.
If the "fibre test cm" test passed, then this problem is likely in the PCIe-to-PCIe link
between the card and the switch.
Resolution:
A) Reseat the failing PCI Fibre Adapter.
B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI
Fibre Adapter.
C) Replace the node motherboard.
15 - 0x10 (slot)
BIOS can not make LSI card go into Operational state.
Resolution: Replace card. Send failed card back for FA.
15 - 0x11 (slot)
HBA card register test failure
Resolution: Replace card. Send failed card back for FA.
15 - 0x13 (slot)
LSI card register memory copy test failure.
Resolution: Replace card. Send failed card back for FA.
15 - 0x14 (slot)
LSI card register memory copy test failure.
Resolution: Replace card. Send failed card back for FA.
15 - 0x15 (slot)
Firmware rev xxxx not supported. Upgrade to yyyy
LSI card does not contain 3PAR-approved firmware. If you need to run with an LSI card
which has an older firmware (engineering only), you can set the "lsi_downrev" flag in
the BIOS.
Example:
Whack> set perm lsi_downrev
Resolution: Replace card. Send failed card back for upgrade.
15 - 0x16 (slot)
Unable to get firmware rev
Attempting to get the firmware version from the LSI card failed.
Resolution:
A) Cycle power on the node.
B) Replace card. Send failed card back for FA.
15 - 0x17 (slot)
Manufacturing test for E200 node only.
This error occurs when the onboard LSI chips are not found. They are expected to be
in slot 0 and 3, with two devices on each slot.
Resolution:
A) Cycle power on the node.
B) Replace motherboard.
17 - 0x0 (0)
The IDE controller failed its internal self test.
Resolution:
A) Replace the IDE or SATA boot drive.
B) Replace the IDE or SATA cable.
C) Replace the node motherboard.
17 - 0x1 (0)
The IDE controller failed to perform a self test.
See Code 17, sub-code 0x0 for resolution information.
17 - 0x2 (0)
IDE register xx value (yyyy) != expected (zzzz)
The IDE register test failed during a pattern test.
See Code 17, sub-code 0x0 for resolution information.
17 - 0x3 (0)
IDE register xx value (yyyy) != expected (zzzz)
The IDE register test failed during a walking bit test.
See Code 17, sub-code 0x0 for resolution information.
17 - 0x4 (0)
There was an IDE failure in data requested by the operating system bootstrap. It is
possible that data on the disk has become corrupt to the point the operating system will
not successfully load.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x5 (0)
Communication with the IDE interface timed out. This error indicates the drive is not
responding to commands within an acceptable amount of time.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x6 (0)
IDE reported a failure in read verify command.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x7 (0)
A timeout (10 seconds) was detected while performing DMA operation.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x8 (0)
An error condition was detected while performing DMA operation.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x9 (xx)
IDE power up: Unknown error
ERROR : 80
SECCNT: 80
SECNUM: 80
CYLLOW: 80
CYHIGH: 80
DEVSEL: 80
ALT_STATUS: 80
Drive: BUSY
The IDE drive had a failure at power on reset which prevents it from communicating
with the chipset IDE controller.
Resolution:
A) Cycle power on the node.
B) Reseat drive cable on both node and drive.
C) Replace the IDE or SATA boot drive.
D) Replace the node motherboard.
17 - 0x11 (0)
IDE SMART self-test failed. The drive failed to finish a built-in self-test.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x12 (0)
Drive failed to collect SMART data. The data is vital for the drive to determine SMART
trigger.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x13 (0)
Drive refused to accept SMART commands.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x14 (0)
The SMART command issued to drive has incorrect syntax.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x15 (0)
The SMART commands failed to write or read attributes.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x18 (0)
The IDE controller failed the BIOS interrupt test, possibly due to a bad drive.
See Code 17, sub-code 0x0 for resolution information.
17 - 0x20 (0)
Drive did not return status to host after a command within a reasonable amount of time.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x21 (rpm)
This error occurs when the disk drive for a harrier system is not an SSD drive.
Resolution: Replace the SATA drive with an SSD drive.
17 - 0x22 (disk size)
This error occurs when we have 32 GB or less of cluster memory and the disk drive is
less than 128 GB. This is because the disk is not large enough for the memory dumps
if the node panics.
Resolution: Replace the SSD drive with a drive of at least 128 GB.
17 - 0x23 (disk size)
This error occurs when we have more than 32 GB of cluster memory and the disk drive
is less than 256 GB. This is because the disk is not large enough for the memory dumps
if the node panics.
Resolution:
A) Replace the SSD drive with a drive of at least 256 GB.
B) Reduce cluster memory to 32 GB or less.
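The sizing rule behind sub-codes 0x22 and 0x23 can be summarized in a few lines of Python. This is a sketch of the rule as stated in the two descriptions, not BIOS code. The function name check_boot_disk and the returned strings are hypothetical.

```python
def check_boot_disk(cluster_memory_gb, disk_gb):
    """Return the sub-code that would apply, or None if the disk is large enough."""
    if cluster_memory_gb <= 32:
        # 32 GB or less of cluster memory needs a disk of at least 128 GB.
        return "17 - 0x22" if disk_gb < 128 else None
    # More than 32 GB of cluster memory needs a disk of at least 256 GB.
    return "17 - 0x23" if disk_gb < 256 else None


print(check_boot_disk(32, 128))  # None
print(check_boot_disk(16, 64))   # 17 - 0x22
print(check_boot_disk(64, 128))  # 17 - 0x23
print(check_boot_disk(64, 256))  # None
```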
17 - 0x30 (0)
Drive returned an error status after command execution.
Resolution: Replace the IDE or SATA boot drive.
19 - 0x0 (0)
Booting from SATA IDE...
No IDE or USB drives present or boot sector is invalid.
or
Booting from SATA IDE (bootdev)...
No IDE drive present or boot sector is invalid.
or
Booting from PATA IDE...
No IDE drive present or boot sector is invalid.
or
Booting from USB...
No USB drive present or boot sector is invalid.
The IDE (PATA or SATA) or USB Flash disk is used for booting the operating system.
This error indicates that either no drive was found during a hardware probe, or a drive
was found but is not bootable.
Resolution:
A) Cycle power on the node.
B) Verify disk power and data cables are connected to both the drive and the
motherboard. The red stripe on the IDE data cable must be oriented closest to the power
connector on the drive.
C) Replace the disk power cable and/or data cable.
D) Replace the drive.
E) Replace the node motherboard.
19 - 0x1 (0)
IDE TIMEOUT waiting for DRDY
The IDE disk is used for booting the operating system.
This error indicates there was a problem communicating with the IDE controller, most
likely due to a missing IDE hard drive, a disconnected cable, or a failed IDE hard drive.
See Code 19, sub-code 0x0 for resolution information.
19 - 0x2 (0)
IDE TIMEOUT waiting for DRQ
The IDE disk is used for booting the operating system.
This error indicates that a command was issued to the IDE disk (read sectors) but the
drive controller did not report back with the data within a reasonable amount of time.
This may be caused by a failed sector or IDE controller failure.
See Code 19, sub-code 0x0 for resolution information.
19 - 0x3 (0)
IDE ERROR reading sector xxxx
The IDE disk is used for booting the operating system.
This error indicates that a command was issued to the IDE disk (read sectors) but the
drive controller reported that there was an error in reliably retrieving the requested sectors.
This error may be caused by a failed sector or IDE controller failure.
See Code 19, sub-code 0x0 for resolution information.
20 - 0x0 (0)
If a board has more than a single CPU, only one CPU comes out of power-on executing
code. The other waits in a halted state for an AP message from the bootstrap processor.
Every MP-capable Pentium processor has an onboard Advanced Programmable Interrupt
Controller called the Local APIC (there is a complementary component called the IOAPIC
located on the motherboard).
Once the bootstrap processor has completed all node board initialization and testing,
it starts up each application processor (which in Intel terms is defined as any processor
other than the initial bootstrap processor). Each AP then does a brief identify, verify,
and microcode update. In the above case, if the local APIC fails to deliver an AP startup
to the other processor within a reasonable amount of time, this error will result. In a
single CPU system this error should not occur because an earlier probe should identify
no AP processor is present. If the Local APIC cannot reliably deliver a message over
the IOAPIC, then it is probably not safe to ignore this error by pressing ^C.
Resolution:
A) Reseat both processors in their sockets.
B) Replace each processor individually. Do not bother with downgrading to a single
processor system since this is a multiprocessor startup issue. The problem processor will
not be apparent with a single processor configuration.
C) Replace the node motherboard.
20 - 0x1 (0)
After an AP startup message has been delivered to the application processor through
the IOAPIC, the bootstrap processor waits for an indication the AP has started. If the
indication is not received before a reasonable timeout, this error is given. It should be
ok to ignore this message by pressing ^C and continue with further BIOS diagnostics.
See Code 20, sub-code 0x0 for resolution information.
20 - 0x2 (0)
Once the application processor (AP) has started initialization, it sets a flag that the
bootstrap processor can use to determine when the AP has completed.
If the AP remains in the AP_INIT_START state too long, this fatal error is displayed. It is
probably not safe to resume after this error since the AP may be off executing errant
code or interfering with bootstrap processor bus cycles.
See Code 20, sub-code 0x0 for resolution information.
20 - 0x3 (0)
The application processor (AP) previously failed to complete a Built In Self Test (BIST).
This is likely due to a bad processor.
Resolution: Replace the application processor.
20 - 0x4 (0)
During application processor (AP) initialization, a check verifies that the CPU model,
stepping, and clock multiplier of the processor being initialized match those of the
bootstrap processor. If they do not match, this error will result.
Resolution: Since the processors are possibly mismatched, remove the heatsink on both
and verify that the CPU model and stepping are identical.
See Code 20, sub-code 0x0 for more resolution information.
20 - 0x5 (0)
The currently supported node board hardware configuration is a maximum of two
physical processors. The BIOS uses this knowledge to limit the possibility of repeat
initialization of the application processor (AP). If this message occurs, it may be due to
a variety of hardware problems, but the most likely suspect is the application processor.
See Code 20, sub-code 0x0 for resolution information.
21 - 0x0 (0)
*** SMI setup error: Not expecting to install a vector on CPU xxxx
Intel processors support an interrupt level called SMI (System Management Interrupt)
which is used for hardware management (usually by the BIOS). Events such as power
management and hardware errors usually trigger an SMI. When an SMI is triggered,
the system enters SMM (system management mode). In a multiprocessor system, both
processors are usually triggered by an SMI at the same time. Since both processors
may attempt to service an SMI at the same time, each processor must have a unique
stack area in which to dump processor context. SMI setup configures each processor
individually with a unique stack address for SMI handling.
This particular error indicates that the SMI setup handler has detected a stack setup
SMI, yet one was not expected (because one had already been set up or CPU
initialization had not yet reached the point of SMI setup). The bootstrap CPU delivers
the setup SMI to itself and to the application processor. This error could be caused by
a faulty CPU or motherboard. The CPU which reports the setup error may not be the
one at fault.
Resolution:
A) Pull one processor at a time to determine if the problem is reproducible with a single
CPU.
B) Swap CPUs to see if the exact problem moves with CPU. If not, it may be the
motherboard.
C) Individually replace both CPUs.
D) Replace the node motherboard.
21 - 0x1 (0)
*** SMI setup error: CPU xxxx not found in CPU table
During SMI setup, each processor in turn receives an SMI and then performs stack
initialization. Prior to the SMI setup, all application processors wait in a halted state
for an APIC message to identify and download microcode. If the processor performing
an SMI setup detects that it had not previously executed and added its CPU ID to the
system table, then this fatal error will be displayed.
See Code 20, sub-code 0x1 for resolution information.
21 - 0x2 (0)
*** SMI setup error: CPU xxxx did not respond
During SMI setup, each processor in turn receives an SMI and then performs stack
initialization. This error indicates that the bootstrap processor issued an SMI through
the APIC and it was not processed by the targeted processor. This indicates that either
SMIs are not being delivered properly, or that the targeted processor may be defective.
See Code 20, sub-code 0x1 for resolution information.
22 - 0x0 (0)
CBIOS provides service to the 3PAR kernel through a special command queue. Responses
are returned to the OS through another queue, which is tested during BIOS initialization.
Sub-code 0x0 indicates that the CBIOS to OS queue did not pass the built-in test.
Resolution:
A) Pull one processor at a time to determine if the problem is reproducible with a single
CPU.
B) Swap SDRAM with good SDRAM.
C) Update CBIOS to the latest version.
D) Replace the node motherboard.
22 - 0x1 (0)
This error indicates that the CBIOS to OS queue test failed to acquire a message it
previously sent.
See Code 20, sub-code 0x0 for resolution information.
22 - 0x2 (0)
This error indicates that the CBIOS to OS queue test failed because the message received
did not match the message sent.
See Code 20, sub-code 0x0 for resolution information.
22 - 0x3 (0)
This error indicates that the CBIOS to OS queue test failed because there were more
items in the queue than those sent.
See Code 20, sub-code 0x0 for resolution information.
22 - 0x4 (0)
This error indicates that the OS to CBIOS queue test failed. The minor code will indicate
to an engineer what went wrong.
See Code 20, sub-code 0x0 for resolution information.
22 - 0x5 (0)
This error indicates that the CBIOS to OS queue test failed because the queue pointers
became corrupt.
See Code 20, sub-code 0x0 for resolution information.
23 - 0x2 (0)
Invalid magic for full CBIOS
Prior to starting up the non-failsafe (full diagnostic) CBIOS image, the failsafe CBIOS
performs some consistency checks over the image. This error indicates the failsafe BIOS
could not find a proper header record for the full CBIOS.
See Code 23, sub-code 0x1 for resolution information.
23 - 0x3 (0)
CRC mismatch for full CBIOS
Prior to starting up the non-failsafe (full diagnostic) CBIOS image, the failsafe CBIOS
performs a strong CRC over the full CBIOS image to verify the image's integrity. This
error indicates the full CBIOS had a CRC failure.
See Code 23, sub-code 0x1 for resolution information.
23 - 0x4 (0)
Failsafe CBIOS is now enabling the full CBIOS ...
The full CBIOS either detected an error or user input (the 'f' key) which forced it to return
to the failsafe BIOS. If the user did press the 'f' key, then press ^C to resume startup
under the failsafe BIOS. If the user did not press the 'f' key, browse prior messages to
learn of a failure which may have caused this error.
Resolution: If the error was not the result of a keystroke, try pressing the 'n' key at BIOS
startup to clear any initialization skips. A setting recorded in NVRAM may be causing the
BIOS to skip the full version and always execute the failsafe.
See Code 23, sub-code 0x1 for more resolution information.
24 - 0x0 (ptr)
The BIOS presents to the operating system a set of tables which describe the hardware
present in the system. These tables have a rigid structure for each type of device. If the
CBIOS configuration structure becomes corrupt, this error may result when the TURD
structures are initialized for the operating system. A consistency check ensures the TURD
area does not go beyond 1 MB (which is the base address where the operating system
normally begins using main memory). The data for this error, ptr, is the pointer address
reached, and will be greater than 0x100000.
Resolution:
A) Remove cards from all PCI slots. If the error no longer occurs, it may be a hardware
failure on one of the cards.
B) Replace the node motherboard.
24 - 0x1 (0)
The BIOS presents to the operating system a set of tables which describe the hardware
present in the system. In this case, the BIOS detected that one of the tables had a bad
checksum.
Resolution:
A) Remove cards from all PCI slots. If the error no longer occurs, it may be a hardware
failure on one of the cards.
B) Replace the node motherboard.
24 - 0x2 (0)
The BIOS presents to the operating system a set of tables which describe the hardware
present in the system. In this case, the BIOS detected that it had added too many entries
to the table, likely because too many PCI devices are present in the system. This error
is likely due to an earlier PCI failure.
Resolution:
A) Remove cards from all PCI slots. If the error no longer occurs, it may be a hardware
failure on one of the cards.
B) Replace the node motherboard.
25 - 0x0 (0)
The node board has two different Serial EEPROM devices used for storing persistent
board information. One PROM device is located on the I2C bus. It stores node board
manufacturing, assembly, serial number, and error message log information. The second
PROM device is connected through the Intel 82559ER Ethernet controller. It stores
Ethernet controller information such as initialization state and the hardware MAC
address.
PROM checksum: FAIL
The PROM which stores node board manufacturing, assembly, serial number, and error
message log information does not have a valid checksum. If the PROM has not yet been
initialized or if it has become corrupt, you may see this error.
Resolution:
A) Press ^W to enter Whack and use either "prom init" or "prom edit" to correct this
error.
B) If the information looks correct with "prom id" then try using "prom checksum" to
rewrite the checksum.
C) Replace the node motherboard.
25 - 0x1 (0)
Ethernet 0 PROM checksum: FAIL
A) Press ^W to enter Whack and use "prom id" to verify the other PROM is valid. If
not, first use "prom init" or "prom edit" to set the PROM information. If the PROM
information appears valid, use "prom mac" to reprogram the Ethernet MAC address
and checksum.
B) Try flushing out a correct checksum. Note: You must first select the device with an
error using the "eth dev" command.
Example:
Whack> eth dev 1
Whack> eth checksum
C) Replace the node motherboard.
27 - 0x4 (0)
This sub-code indicates that a CPU has asserted its THERMTRIP_N signal. This could
mean that it has reached its case temperature, that a VRM has failed, or there is a
problem with the FPGA.
Resolution:
A) Check the environmentals.
B) Replace the node.
28 - 0x0 (0)
The Eagle/Osprey/Harrier ASICs are the Cluster Managers which are used for high
speed communication between nodes of a cluster. These devices are critical for the
correct operation of the node software, and hence for operation of the whole cluster.
The CM exists on all PCI buses in the node. If the CM cannot be found on any of the
required PCI buses, this is a serious problem. Sub-code 0x0 indicates the PCI bus scan did
not locate the Cluster Manager.
Resolution:
A) Cycle power on the node.
B) Pull all PCI cards and cycle power on the node.
C) Replace the node motherboard.
28 - 0x1 (0)
Pairwwww DIMMxxxx: Bad checksum. Got yyyy, SPD said zzzz
The memory DIMMs located on the CM riser are called cluster memory. This memory
is used to store data destined for the disks (dirty data) as well as data previously read
from the disks (cache data). It is also used for communication among the nodes in the
cluster. This memory is not required to boot the operating system, but is required for
the node to participate in the cluster. Even before the memory is thoroughly tested for
proper operation, it must be configured to appear in CM addressable space.
Each memory DIMM has a small embedded serial EEPROM which holds DIMM
configuration information such as the number of rows, columns, and banks, as well as
memory timing. If this serial EEPROM becomes corrupt, data stored in it regarding the
DIMM configuration cannot be trusted. So, this EEPROM also contains a checksum
which the BIOS verifies is correct before configuring the DIMM. If this checksum does
not match the checksum the BIOS computes across the DIMM, this error will result. You
should look at prior output to determine if there were I2C errors. These errors suggest
a problem with riser installation. The DIMM number is logged in the Data field of the
Fatal Error.
Resolution:
A) Reseat Cluster Memory riser card(s).
B) Reseat Cluster Memory DIMMs.
C) Replace Cluster Memory DIMMs in pairs to ensure replacement parts are matched.
P4-Eagle and PIII-Eagle DIMM pairs are always located four riser positions apart.
For example, if you number the slots from the top,
Pair 0 is at positions 3 and 7 (top).
Pair 1 is at positions 0 (bottom) and 4.
Pair 2 is at positions 2 and 6.
Pair 3 is at positions 1 and 5.
Ironman (Tclass) and Tinman (Fclass) DIMMs are always in sets of three. The DIMMs are
labeled "DIMM C.S", that is, channel then set. There are two riser cards, one for channel
0 and one for channels 1 and 2.
Set 0 is DIMM 0.0, 1.0, 2.0
Set 1 is DIMM 0.1, 1.1, 2.1
Set 2 is DIMM 0.2, 1.2, 2.2
Titan and Atlas have 4 DIMM sets on the motherboard.
Set 0: DIMM 0.0 and 1.0
Set 1: DIMM 0.1 and 1.1
Set 2: DIMM 2.0 and 3.0
Set 3: DIMM 2.1 and 3.1
D) Replace the Cluster memory riser(s).
E) Replace the node motherboard.
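The checksum comparison behind this error can be illustrated with a short sketch. It assumes the JEDEC SPD layout used by SDR and DDR SDRAM modules, in which byte 63 holds the low 8 bits of the sum of bytes 0 through 62. This guide does not state which layout CBIOS uses, so treat the layout, the function names, and the synthetic SPD image as assumptions.

```python
def spd_checksum(spd: bytes) -> int:
    """Return the low 8 bits of the sum of SPD bytes 0..62 (assumed layout)."""
    return sum(spd[:63]) & 0xFF


def spd_checksum_ok(spd: bytes) -> bool:
    """Compare the computed checksum with the value stored in byte 63."""
    if len(spd) < 64:
        raise ValueError("need at least 64 SPD bytes")
    return spd_checksum(spd) == spd[63]


# Build a synthetic 64-byte SPD image with a valid checksum, then corrupt it.
image = bytearray(range(63)) + bytearray(1)
image[63] = spd_checksum(bytes(image))
print(spd_checksum_ok(bytes(image)))  # image as built
image[5] ^= 0x01                      # flip one bit in the SPD data
print(spd_checksum_ok(bytes(image)))  # image after corruption
```

A mismatch of this kind corresponds to the "Got yyyy, SPD said zzzz" pair in the message: one value is computed, the other is read from the EEPROM.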
28 - 0x2 (mm)
Pairww DIMMxx (yyyy): 'zzzz' read failed
Where zzzz is one of:
row address, column address, module rows, cas latency3, refresh, banks, cas latency2,
cas latency1, ras precharge, act_to_rw, act_to_deact, ras cycle, write_to_deact, density,
frequency, DIMM type
This error indicates that a Cluster Memory DIMM was detected but that the Serial
EEPROM present on the DIMM could not be reliably read. The DIMM number is logged
in the Data field of the Fatal Error.
See Code 28, sub-code 0x1 for resolution information.
28 - 0x4 (mm)
This error indicates the Cluster Memory DIMMs reported an odd (and unsupported)
number of rows. Usually the number of rows reported by a DIMM corresponds to the
number of sides of the DIMM which are populated by memory. One DIMM number of
the failing pair will be logged in the Data field of the Fatal Error.
See Code 28, sub-code 0x3 for resolution information.
28 - 0x5 (0)
No Cluster Memory Installed
This error indicates that no memory was found in the Cluster memory riser. Since cluster
memory is needed for proper node operation within the cluster, this is a condition which
must be resolved for proper operation. You should look at prior output to determine if
there were I2C errors. These errors suggest a problem with riser or DIMM installation.
See Code 28, sub-code 0x1 for resolution information.
28 - 0x6 (mm)
This error indicates the Serial EEPROM on the DIMM reports a value which is outside
tolerance for the memory controller. One DIMM number of the failing pair will be logged
in the Data field of the Fatal Error.
See Code 28, sub-code 0x1 for resolution information.
28 - 0x7 (mm)
Before ECC initialization of Cluster memory (scrub), a small region must be tested and
configured by the CPU to set up the ECC scrub of the remainder. If an error occurs
during this test (such as a memory read that does not match the value just written), then
this error will be reported. The DIMM number is logged in the Data field of the Fatal Error.
See Code 28, sub-code 0x1 for resolution information.
28 - 0x8 (0)
During the ECC initialization of Cluster memory, the Cluster Manager records any
memory errors it encounters. If any were recorded, this error will be displayed.
See Code 28, sub-code 0x1 for resolution information.
28 - 0x9 (0)
For each Cluster memory DIMM, there is a register in the Eagle / Osprey memory
controller which specifies where the DIMM maps into CM physical memory. These
mapping registers are configured during the Cluster memory probe and should not
change under normal circumstances. Since this is an internal CM register, it is unlikely
that reseating memory will correct this problem.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Replace the node motherboard.
28 - 0xa (mm)
The Cluster memory controller detected an Uncorrectable ECC error. Eagle / Osprey
identifies the failing bank and address with the error as well as the error syndrome.
The BIOS will convert the information into the failing DIMM and Riser Slot numbers.
There may be multiple Uncorrectable errors. In this case, the CM will save the
address/syndrome for the most recent error.
The DIMM number is logged in the Data field of the Fatal Error.
Eagle nodes (S-Series and E-Series):
There are 8 DIMMs maximum on the S-Series Cluster Memory Riser Card. If the DIMM
number is not between 0-7 (inclusive), then the failing DIMM cannot be identified.
Osprey nodes (T-Series and F-Series):
There are 6 DIMMs on T-Series and 3 DIMMs on F-Series. The data field encodes the
DIMM number in the lower 4 bits of the field and the channel number in the upper 4
bits. So a data value of 12 indicates DIMM 1.2 is at fault.
Harrier nodes (V-Series, Atlas, Minime1 & 2):
There are 8 DIMMs on V-Series between two different Harrier ASICs; two memory
controllers with 2 DIMMs each. The data field encodes which memory channel
encountered the uncorrectable error. A data value of 10 means channel one is at fault,
a value of 0 means channel zero is at fault.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat the failing Cluster Memory DIMM(s).
D) Replace the failing Cluster Memory DIMM(s).
E) Replace the node motherboard.
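The Osprey Data field encoding described above can be sketched as follows. This is an illustration only, not CBIOS code: the function name is hypothetical, and it assumes the logged value is read as hexadecimal, so that the documented value 12 is 0x12 (consistent with the Harrier example, where 10 means channel one).

```python
from typing import Tuple


def decode_osprey_dimm(data: int) -> Tuple[int, int]:
    """Split an Osprey Data field into (channel, dimm).

    Hypothetical helper: assumes an 8-bit value with the channel number in
    the upper 4 bits and the DIMM number in the lower 4 bits.
    """
    channel = (data >> 4) & 0xF
    dimm = data & 0xF
    return channel, dimm


channel, dimm = decode_osprey_dimm(0x12)
print(f"DIMM {channel}.{dimm}")
```

The printed label follows the "DIMM C.S" convention (channel first) used earlier in this table.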
28 - 0xb (mm)
The Cluster memory controller detected a correctable ECC error.
The CM identifies the failing bank and address with the error as well as the error
syndrome. The BIOS will convert the information into the failing DIMM and Riser Slot
numbers.
The DIMM number is logged in the Data field of the Fatal Error.
Eagle nodes (S-Series and E-Series):
There are 8 DIMMs maximum on the Cluster Memory Riser Card. If the DIMM number
is not between 0-7 (inclusive), then the failing DIMM cannot be identified.
Osprey nodes (T-Series and F-Series):
There are 6 DIMMs on T-Series and 3 DIMMs on F-Series. The data field encodes the
DIMM number in the lower 4 bits of the field and the channel number in the upper 4
bits. So a data value of 12 indicates DIMM 1.2 is at fault.
Harrier nodes (V-Series, Atlas, Minime1 & 2):
This should not occur on Harrier.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat the failing Cluster Memory DIMM.
D) Replace the failing Cluster Memory DIMM.
E) Replace the node motherboard.
28 - 0xc (mm)
The CBIOS runs Cluster Memory Tests as part of POST in both normal operation and
manufacturing test. If any test fails due to a data miscompare, the test will generate this
fatal error code with sub-code '0xc'. CBIOS runs the following tests:
Walking 1/0 across data
Walking 1/0 across address (512 MB Small Memory Window)
Walking 1/0 using XCB (64 bytes) across segment boundaries
Any test failure will result in a fatal error.
The DIMM number is logged in the Data field of the Fatal Error.
Eagle nodes (S-Series and E-Series):
There are 8 DIMMs maximum on the Cluster Memory Riser Card. If the DIMM number
is not between 0-7 (inclusive), then the failing DIMM cannot be identified.
Osprey nodes (T-Series and F-Series):
There are 6 DIMMs on T-Series and 3 DIMMs on F-Series. The data field encodes the
DIMM number in the lower 4 bits of the field and the channel number in the upper 4
bits. So a data value of 12 indicates DIMM 1.2 is at fault.
Harrier nodes (V-Series, Atlas, Minime1 & 2):
This should not occur in Harrier.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat the failing Cluster Memory DIMM.
D) Replace the failing Cluster Memory DIMM.
E) Replace the node motherboard.
28 - 0xd (mm)
Pairwwww DIMMxxxx: Illegal SPD value <name of value> <value>
This error indicates that a Cluster Memory DIMM was detected but that the Serial
EEPROM present on the DIMM reported an illegal or unsupported value for our memory
controller.
The DIMM number is logged in the Data field of the Fatal Error.
Example:
Density (SPD byte 31) has more than 1 bit set (for example, 0x30), which indicates a
non-standard part.
See Code 28, sub-code 0x1 for resolution information. Most likely, the DIMM is not
qualified for use in our node board.
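The density example above amounts to testing whether more than one bit is set in a byte. A minimal sketch of that test, with a hypothetical function name and no claim about how CBIOS implements it:

```python
def density_has_multiple_bits(density: int) -> bool:
    """Return True when more than one bit is set in the SPD density byte."""
    density &= 0xFF
    # Clearing the lowest set bit leaves a nonzero value only when at least
    # one other bit was also set.
    return (density & (density - 1)) != 0


for value in (0x30, 0x20):
    print(hex(value), density_has_multiple_bits(value))
```

Here 0x30 has two bits set and is flagged, while 0x20 has a single bit set and is not.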
28 - 0xe (mm)
If there was a problem mapping the CM Small Cluster window into CPU 32-bit space,
this error may result when attempting to initialize Cluster memory. The initialization
problem could be due either to hardware failure or to setting a special NVRAM variable
that eliminates the address space normally reserved for CM memory windows. An
example is setting "mem_max" to a value above 2496. Another example would
be setting "pci_base" above 0xa0000000.
Resolution: Contact 3PAR technical support.
28 - 0xf (mm)
The Cluster memory controller detected a memory error in a specific DIMM bank. The
CM memory error status register is logged in the Data field of the Fatal Error.
See Code 28, sub-code 0xb for resolution information.
28 - 0x10 (mm)
H1 LPC0 HW ERR ST [00000004]: dataq_parity
H1 LPC0 ERR Stat [00000006]: EP-Error-Rpt Fatal-Error
H1 LPC0 ERR ID [80000000]: HW-Err
The Cluster memory controller detected a hardware error. This error is printed as shown
above. mm is decoded as follows: bits 31-28 represent the LPC number and bits 27-0 are
the error bits as set in the hardware error status register. The hardware error means that
the Harrier ASIC is nonfunctional.
Resolution:
A) Cycle power on the node.
B) Replace the node.
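The bit layout of mm given for sub-code 0x10 can be decoded with a few shifts and masks. The sketch below is illustrative: the function name and the sample value 0x10000004 are hypothetical and do not come from a real fatal error log.

```python
def decode_lpc_error(mm: int):
    """Split mm into (lpc_number, error_bits).

    Bits 31-28 carry the LPC number; bits 27-0 carry the error bits from
    the hardware error status register.
    """
    lpc_number = (mm >> 28) & 0xF
    error_bits = mm & 0x0FFFFFFF
    return lpc_number, error_bits


lpc, bits = decode_lpc_error(0x10000004)  # hypothetical sample value
print(f"LPC{lpc} error bits 0x{bits:07x}")
```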
28 - 0x20 (mm)
Testing CM data lines with walking 1
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 1 bits test verifies that the processor may directly access CM cluster
memory by performing a walking 1's test on all data lines. If any fails, this error will
result.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat Cluster Memory DIMMs.
D) Replace the node motherboard.
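For readers unfamiliar with walking-bit tests, the following sketch shows the general technique on a simulated memory. It is not the CBIOS implementation: the function names, the 8-bit width, and the stuck-line fault are assumptions chosen to keep the example small. The returned tuple mirrors the Addr/Wrote/Read fields of the message.

```python
def walking_ones(width: int):
    """Yield patterns with a single 1 bit moving across `width` data lines."""
    for bit in range(width):
        yield 1 << bit


def walking_zeros(width: int):
    """Yield patterns with a single 0 bit moving across `width` data lines."""
    mask = (1 << width) - 1
    for bit in range(width):
        yield mask & ~(1 << bit)


def check_data_lines(write, read, addr: int, width: int):
    """Return (addr, wrote, read_back) for the first miscompare, else None."""
    for pattern in list(walking_ones(width)) + list(walking_zeros(width)):
        write(addr, pattern)
        got = read(addr)
        if got != pattern:
            return addr, pattern, got
    return None


# Simulated memory whose data line 3 is stuck at 0 (hypothetical fault).
memory = {}
stuck_mask = ~(1 << 3)
result = check_data_lines(
    write=lambda a, v: memory.__setitem__(a, v & stuck_mask),
    read=lambda a: memory[a],
    addr=0x1000,
    width=8,
)
print(result)
```

Because each pattern differs from the others in a single bit position, the first miscompare points directly at the faulty data line.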
28 - 0x21 (mm)
Testing CM data lines with walking 0
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 0 bits test verifies that the processor may directly access cluster memory
by performing a walking 0's test on all data lines. If any fails, this error will result.
See Code 28, sub-code 0x20 for resolution information.
28 - 0x22 (mm)
ZERO CM problem at addr xxxx
Between PCI bus tests, a small portion of cluster memory is cleared. If errors in clearing
the memory are detected, this error will result.
See Code 28, sub-code 0x20 for resolution information.
28 - 0x23 (mm)
Testing CM address lines with walking 1 (first 512 MB only)
The CM walking 1 address bits test verifies that the processor may directly access cluster
memory by performing a walking 1's test on all address lines. If any fails, this error will
result.
See Code 28, sub-code 0x20 for resolution information.
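An address-line test differs from a data-line test in that the single 1 bit walks across the address rather than the data. The simplified sketch below detects only address lines stuck low, on a simulated memory. The function name, patterns, and fault model are assumptions, and the real CBIOS test may check more fault types.

```python
def stuck_low_address_lines(write, read, addr_bits: int):
    """Return address bits whose power-of-two offset aliases offset 0."""
    pattern, antipattern = 0xAA, 0x55
    # Walk a single 1 across the address lines, writing the same pattern.
    for bit in range(addr_bits):
        write(1 << bit, pattern)
    # Overwrite offset 0. If an address line is stuck low, the offset that
    # uses only that line lands on the same cell and now reads back wrong.
    write(0, antipattern)
    return [bit for bit in range(addr_bits) if read(1 << bit) != pattern]


# Simulated memory in which address line 2 is stuck low (hypothetical fault).
cells = {}
fault = ~(1 << 2)
bad = stuck_low_address_lines(
    write=lambda a, v: cells.__setitem__(a & fault, v),
    read=lambda a: cells[a & fault],
    addr_bits=4,
)
print(bad)
```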
28 - 0x24 (mm)
Testing CM address lines with walking 0 (first 512 MB only)
The CM walking 0 address bits test verifies that the processor may directly access cluster
memory by performing a walking 0's test on all address lines. If any fails, this error will
result.
See Code 28, sub-code 0x20 for resolution information.
28 - 0x25 (mm)
Testing CM segment decode boundaries
This test verifies that memory decoding at all CM DIMM pairs is working correctly. It
does so by writing a unique 128 bytes at each memory decode boundary location. It
then verifies the values were written correctly and looks for corruption of other addresses.
See Code 28, sub-code 0x20 for resolution information.
28 - 0x26 (eecd)
Testing CM with random XOR (all Cluster Memory)
ee = number of XOR errors.
c = Channel Number where the error took place.
d = DIMM number where the error took place.
This function performs a random data test on all cluster memory attached to the CM to
verify memory under stress with random patterns. This test also exercises the CM XOR
engine as several sources are used simultaneously throughout the cluster memory test.
See Code 28, sub-code 0x20 for resolution information.
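The eecd data value can be split into its fields as sketched below. This assumes each letter stands for one hexadecimal digit of a 16-bit value, which the guide implies but does not state. The function name and the sample value are hypothetical.

```python
def decode_xor_test_data(data: int):
    """Split an 'eecd' value into (error_count, channel, dimm).

    Assumed layout: ee in bits 15-8, c in bits 7-4, d in bits 3-0.
    """
    error_count = (data >> 8) & 0xFF
    channel = (data >> 4) & 0xF
    dimm = data & 0xF
    return error_count, channel, dimm


errors, channel, dimm = decode_xor_test_data(0x0312)  # hypothetical sample
print(f"{errors} XOR errors, channel {channel}, DIMM {dimm}")
```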
28 - 0x27 (0)
This error occurs when the DQS training fails to find working values for the DQS enable,
DQS out skew, and DQS in skew.
See Code 28, sub-code 0x20 for resolution information.
28 - 0x30 (mm)
Testing CM ECC lines with walking 1
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 1 bits test verifies that the processor may directly access CM cluster
memory by performing a walking 1's test on all ECC lines. If any fails, this error will
result.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat Cluster Memory DIMMs.
D) Replace the node motherboard.
28 - 0x31 (mm)
Testing CM ECC lines with walking 0
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 0 bits test verifies that the processor may directly access cluster memory
by performing a walking 0's test on all ECC lines. If any fails, this error will result.
See Code 28, sub-code 0x30 for resolution information.
28 - 0x32 (mm)
Testing CM Op Codes
The CM Op Code test verifies that the processor may execute one of the available
operations for this cluster manager ASIC. This error means that a particular op code is
not supported. If any op code fails, this error will result.
Resolution: A) Replace the node motherboard.
28 - 0x33 (data)
Testing CM Source Interrupts
The CM Source Interrupts test verifies that an interrupt is generated for each CMA data
path, from processor, CMA, or companion CMA to either processor memory or local
CMA. On systems with only one CMA, the companion tests are not done.
Resolution: A) Replace the node motherboard.
28 - 0x34 (data)
Testing CM I2C communication test
The CM I2C communication test will read and write to various safe CMA registers or
CMA memory and verify that the expected values are read. A failure means either a bad
DIMM or bad CMA.
See Code 28, sub-code 0x30 for resolution information.
28 - 0x35 (data)
Stopped on an Uncorrectable Error
The scan for errors found an uncorrectable error in one of the CMAs. The system stopped
during a BIOS test when this error was discovered.
See Code 28, sub-code 0x30 for resolution information.
28 - 0x36 (data)
Stopped on a Correctable Error
The scan for errors found a correctable error in one of the CMAs. The system stopped
during a BIOS test when this error was discovered.
See Code 28, sub-code 0x30 for resolution information.
28 - 0x40 (mm)
Testing CM MMW data lines with walking 1
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 1 bits test verifies that the processor may directly access CM cluster
memory by performing a walking 1's test on all data lines. This test uses the Medium
Memory Window (MMW). If any fails, this error will result.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat Cluster Memory DIMMs.
D) Replace the node motherboard.
28 - 0x41 (mm)
Testing CM MMW data lines with walking 0
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 0 bits test verifies that the processor may directly access cluster memory
by performing a walking 0's test on all data lines. This test uses the Medium Memory
Window (MMW). If any fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x42 (mm)
ZERO CM problem at addr xxxx
Between PCI bus MMW tests, a small portion of cluster memory is cleared. If errors in
clearing the memory are detected, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x43 (mm)
Testing CM address lines with walking 1 (MMW)
The CM walking 1 address bits test verifies that the processor may directly access cluster
memory by performing a walking 1's test on all address lines using the medium
memory window. If any fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x44 (mm)
Testing CM address lines with walking 0 (MMW)
The CM walking 0 address bits test verifies that the processor may directly access cluster
memory by performing a walking 0's test on all address lines using the medium memory
window. If any fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x45 (mm)
Testing CM address lines with walking 1 (RMW)
The CM walking 1 address bits test verifies that the processor may directly access cluster
memory by performing a walking 1's test on all address lines using the remote memory
window. If any fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x46 (mm)
Testing CM address lines with walking 0 (RMW)
The CM walking 0 address bits test verifies that the processor may directly access cluster
memory by performing a walking 0's test on all address lines using the remote
memory window. If any fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.
29 - 0x0 (data)
Link 0 did not come up (0xac000000) error = (0x002022ff)
(data = link number)
CM Links are high speed connections between all of the node boards in a cluster via
the center panel. During Manufacturing test, nodes are connected to a special
Manufacturing Center panel that connects the link transmitter to its own receivers (external
loopback). When the node senses that it is in this special Center Panel, it will initialize
all of the links and perform loopback tests. If any link fails to initialize, this sub-code
will be reported.
Resolution:
A) Cycle power on the node.
B) Verify that the node is securely mated with the Center Panel.
C) Turn off power, re-seat the node into the center panel, and turn power back on.
D) Replace the node motherboard.
29 - 0x1 (data)
CM Link Initialization failed
(data = LLRR) where
LL is the link bit pattern. 01 is link 0, 02 is link 1, 04 is
link 2, and 08 is link 3.
RR is the failure reason. E4 is Hardware error, F0 is user abort.
CM Links are high speed connections between all of the node boards via the center
panel. During Manufacturing test, nodes are connected to a special Manufacturing
Center panel that connects each link's transmitter to its own receiver (external loopback).
When the node senses that it is in this special Center Panel, it will initialize the links
and run a special test to verify the operation of the transmitter/receivers of each link.
If any link fails, the test will report this sub-code.
See Code 29, sub-code 0x0 for resolution information.
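The LLRR data value for sub-code 0x1 combines a link bit pattern with a failure reason. The sketch below decodes it under the assumption that LL and RR are each one byte of a 16-bit hexadecimal value. The function name and the sample value are hypothetical; only the E4 and F0 reason codes come from this guide.

```python
FAILURE_REASONS = {0xE4: "Hardware error", 0xF0: "user abort"}


def decode_link_init_data(data: int):
    """Split an 'LLRR' value into (failed_links, reason_text)."""
    link_bits = (data >> 8) & 0xFF
    reason = data & 0xFF
    # Bit n of the link pattern corresponds to link n (01 is link 0,
    # 02 is link 1, 04 is link 2, 08 is link 3).
    links = [n for n in range(4) if link_bits & (1 << n)]
    text = FAILURE_REASONS.get(reason, f"unknown (0x{reason:02X})")
    return links, text


print(decode_link_init_data(0x05E4))  # hypothetical sample value
```

Reason codes other than the two documented here are reported as unknown rather than guessed.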
29 - 0x2 (data)
CM# Link XOR test: Link [0]..[FAIL] (1)
(data = the link bit pattern. bit 0 is link 0, bit 1 is link 1, bit
2 is link 2, and bit 3 is link 3.)
CM Links are high speed connections between all of the node boards via the center
panel. During Manufacturing test, nodes are connected to a special Manufacturing
Center panel that connects each link's transmitter to its own receiver (external loopback).
When the node senses that it is in this special Center Panel, it will initialize the links
and run a special test to verify the operation of the transmitter/receivers of each link.
If any link fails, the test will report this sub-code.
See Code 29, sub-code 0x0 for resolution information.
29 - 0x3 (data)
CM# Link INT??? test: Link [0]..[FAIL] (1)
(data = the link bit pattern. bit 0 is link 0, bit 1 is link 1, bit
2 is link 2, and bit 3 is link 3.)
The CM Link INT test verifies that setting either of the two interrupt flags (DEST, SRC) in
the XCB does actually generate an interrupt to the processor.
See Code 29, sub-code 0x0 for resolution information.
29 - 0x4 (data)
(data = link number)
The CM Link Round Trip Test failed due to an XCB failure.
CM XCB failed during link DMA. Use the "eagle status" command for more information
on the type of error. This test checks the CM link status at multiple times during the test.
The "(Send)" part of the message indicates which stage failed. Another possible value
is "(Receive)".
See Code 29, sub-code 0x0 for resolution information.
29 - 0x5 (data)
(data = link number)
The CM Link Round Trip Test failed due to data miscompare. All packets have a length
check and timestamp check. Payload compare is optional. Use the "eagle status"
command to check for Uncorrectable ECC errors.
See Code 29, sub-code 0x0 for resolution information.
29 - 0x6 (data)
(data = link number)
The CM Link Round Trip Test failed due to packet timeout.
A packet was sent and not received in a reasonable timeout period. The Round Trip
Test may not have been started on a remote node. Use the "eagle status" command to
check for Uncorrectable ECC errors.
Resolution:
A) Start CM Link Round Trip Test on remote node.
B) Cycle power on the node.
C) Verify that the node is securely mated with the Center Panel.
D) Turn off power, re-seat the node into the center panel, and turn power back on.
E) Replace the node motherboard.
29 - 0x10 (0)
REC_EN went low. Test failed for link [x](yyyyyyyy)
The "cma link init" command is used to initialize and bring up the CM links to nodes
which indicate a "Power Ok" state. If this error occurs, it is possible the remote node
was transmitting BIST, but then later stopped (such as from a reset or power off).
Resolution:
A) Perform the same test again.
B) Replace the node motherboard.
29 - 0x11 (0)
The CM has XCB engines which transfer data. Software manages the producer register
and the CM hardware follows with the consumer register. If these two do not agree
and CM should be idle, then it's possible the CM has halted due to failure of some
operation. This problem is likely caused by a cluster memory or link failure.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
C) Replace the link partner node.
30 - 0x0 (0)
The Exar and Oxford serial chips are used for a secondary low speed link which directly
connects all nodes in the cluster. They are primarily used in the event of a link failure
to verify whether another node in the cluster has actually gone down. Since the part is
integrated onto the motherboard and is on a PCI bus, a failure to locate the internal
serial chips may indicate other PCI problems as well.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
30 - 0x1 (0)
When the node board is inserted into a Manufacturing Test Centerpanel, the internal
Serial Port Manufacturing test will automatically run. This error indicates failures on all
ports tested.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
30 - 0x2 (0)
Port (4): Processed 109 bytes [FAIL]
All cluster internal serial ports go through a quick internal loopback test immediately
after initialization to do a short test of proper operation. This test will run regardless of
the type of centerplane in which the node is connected. This error indicates failures on
all ports tested.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
30 - 0x3 (0)
Internal UART is not functioning properly.
Most likely this is due to a hardware failure related to the SuperIO.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
31 - 0x2 (0)
FPGA Scratchpad registers failed, meaning bad FPGA hardware.
Resolution:
A) Cycle power on the node.
B) Replace the node.
32 - 0x0 (0)
Xor Engine Status: P0_XERR
Error Status : XOR_ERR
PCI0 Error Status:
PCI1 Error Status:
The Eagle ASIC and Osprey ASIC contain a DMA engine capable of XOR operations.
This DMA engine is commonly referred to as the XCB engine. The XCB engine can DMA
data between 14 different modules within the ASIC, each module capable of sinking
or sourcing data. The XCB engine will stop all DMA if it encounters an error while
transferring data. The XCB error status indicates the module that produced the error.
Further details of the error can be gathered by inspecting the error registers of that
module. Use the whack command "cma status" to get further diagnostic information.
If the user continues past this error, software will attempt to reset the error and continue.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
33 - 0x0 (0)
This error indicates that an SDRAM DIMM for which information was requested is no
longer available. This may be due to an intermittent I2C bus, or a hardware failure.
Resolution:
A) Cycle power on the node.
B) Replace the failing DIMM's pair.
C) Replace the node motherboard.
34 - 0x1 (0xff)
This error indicates an uncorrectable error occurred on the PCI bus. In the future, the
data field may indicate the PCI slot number for the device which failed. In order to
determine the cause of this error, it may be useful to review either console messages
or the IDE disk log. Typical messages preceding this error are likely difficult to read,
but may indicate the exact cause.
Example:
--- SMI: smm_inb(0x3a) == 0x86
GPE 9 triggered
Error in PCI device 02.02.00 (PCI/PCI Bridge #0 (controls slot 1)):
PCI status register (0x06) [62b0]: Signaled system error (SERR#),
Received master abort
Secondary PCI status register (0x1e) [0aa0]: Signaled target abort
Bridge P_SERR (0x6a) [80]: Delayed transaction master initiator timeout
Error in PCI device 03.01.00 (PCI Slot 1):
PCI status register (0x06) [1290]: Received target abort
Secondary PCI status register (0x1e) [0a80]: Signaled target abort
Error in PCI device 04.06.00 (inside PCI Slot 1):
PCI status register (0x06) [1230]: Received target abort
Error in PCI device 04.06.01 (inside PCI Slot 1):
PCI status register (0x06) [1230]: Received target abort
(PCI errors not cleared)
*** Fatal error: Code 34, sub-code 0x1 (ff).
In the above case, a card in PCI Slot 1 was transferring data up to a device, likely the
cluster manager, when it didn't get a response. The bridge above the card received a
master abort, which it then relayed to its secondary side as signaled target abort. The
bridge on the card in PCI Slot 1 then received the target abort and signaled a target
abort on its secondary side. Both PCI devices then indicated they received target aborts.
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Replace the suspected PCI card.
D) Remove PCI cards one at a time.
E) Replace the node motherboard.
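The bracketed hexadecimal values in the example output for Code 34 are raw PCI status register contents, and the text next to them is the decoded error bits. As a rough aid when only the raw value is available, the sketch below decodes the error bits of the standard PCI status register (offset 0x06). The bit positions follow the PCI Local Bus Specification; the function and table names are hypothetical and are not part of the BIOS. The secondary status register of a bridge (offset 0x1e) uses the same positions for the abort bits.

```python
# Error-related bits of the standard PCI status register (offset 0x06).
PCI_STATUS_ERROR_BITS = {
    15: "Detected parity error",
    14: "Signaled system error (SERR#)",
    13: "Received master abort",
    12: "Received target abort",
    11: "Signaled target abort",
    8: "Master data parity error",
}


def decode_pci_status(value: int) -> list[str]:
    """Return the names of the error bits set in a PCI status value."""
    return [name for bit, name in PCI_STATUS_ERROR_BITS.items()
            if value & (1 << bit)]


# Values taken from the example output above.
print(decode_pci_status(0x62B0))  # SERR# signaled, master abort received
print(decode_pci_status(0x1290))  # target abort received
```

The remaining set bits in these values (for example DEVSEL timing and capability flags) are not errors and are ignored by the sketch.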
35 - 0x0 (data)
One or both DIMMs in a DIMM pair have failed. Bits 4-7 of the data value indicate the
DIMM pair. Data values are shown in hexadecimal.
If data is 0, then DIMM pair 0 has failed.
If data is 10, then DIMM pair 1 has failed.
Example:
--- SMI: TEMPCAUT (SMALERT): 0x01 (bits reset)
Uncorrectable ECC error 0x9279a103 recorded in reg 0x98
Pair1, either DIMM1 or DIMM3 contains the error
Error in locations [0x382cd818 .. 0x382cd81f]
Uncorrectable ECC error 0x9279a101 recorded in reg 0x94
Syndrome/bit number information might not be accurate,
as more than 1 error happened
Pair1, either DIMM1 or DIMM3 contains the error
Error in locations [0x382cd808 .. 0x382cd80f]
(Clearing cache line at 0x382cd800)
(Clearing cache line at 0x382cd800)
ESR == 0x0003 (expected low bit == 0)
*** Fatal error: Code 35, sub-code 0x0 (10).
Resolution:
A) Cycle power on the node.
B) Clear dust and debris from the node.
C) Remove and reseat the specified CPU DIMM pair.
D) Replace the failed CPU DIMM pair.
E) Replace the node motherboard.
35 - 0x1 (data)
A single DIMM of a DIMM pair has failed. The data value indicates which DIMM. Bits
4-7 of the data value indicate which DIMM pair. Bits 0-3 of the data value indicate
which DIMM within that pair.
If data is 0, then DIMM 0 of pair 0 has failed.
If data is 1, then DIMM 1 of pair 0 has failed.
If data is 10, then DIMM 0 of pair 1 has failed.
If data is 11, then DIMM 1 of pair 1 has failed.
Resolution:
A) Cycle power on the node.
B) Clear dust and debris from the node.
C) Remove and reseat the specified CPU DIMM.
D) Replace the failed CPU DIMM.
E) Replace the node motherboard.
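The data value for Code 35 packs the DIMM pair into bits 4-7 and, for sub-code 0x1, the DIMM within the pair into bits 0-3. The sketch below shows one way to split the value. It assumes the data value is hexadecimal, as printed in the fatal error line (for example "(10)"); the function name is hypothetical.

```python
def decode_dimm_data(data: int) -> tuple[int, int]:
    """Split a Code 35 data value into (DIMM pair, DIMM within pair).

    Bits 4-7 select the pair; bits 0-3 select the DIMM within the pair.
    For sub-code 0x0 only the pair number is meaningful.
    """
    pair = (data >> 4) & 0xF
    dimm = data & 0xF
    return pair, dimm


# "Code 35, sub-code 0x1 (11)" -> parse the printed value as hexadecimal.
print(decode_dimm_data(int("11", 16)))  # (1, 1)
```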
35 - 0x2 (data)
This code means an ECC error was detected, but the BIOS did not completely decode
the error.
See Code 35, sub-code 0x0 for resolution information.
36 - 0x0 (0)
In the event of a hardware failure, it is normal to trigger a processor System Management
Interrupt (SMI). If the SMI gets cleared before the BIOS has a chance to observe it (which
should not happen), then this error will result.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
36 - 0x1 (0)
In normal operation the operating system should not write to the ACPI PM register. If
the BIOS detects a write took place, it will flag this as an error caused by a failing
operating system or other node hardware.
Resolution:
A) Cycle power on the node.
B) Reinstall the operating system.
C) Replace the node motherboard.
36 - 0x2 (0)
The BIOS was not able to determine the actual cause of the triggered SMI.
Resolution:
A) Cycle power on the node.
B) Reinstall the operating system.
C) Replace the node motherboard.
36 - 0x3 (0)
This error may result if there is an unknown hardware device triggering SMIs in the
system and those SMIs are happening too frequently. Most likely the device continues
to trigger an SMI because its problem has not been serviced, and no real work is
possible at this point because immediately after returning from the SMI, another is
triggered. The BIOS attempts to recognize this condition and stop with a fatal error
rather than just continuing to display errors.
Resolution:
A) Remote reset or cycle power on the node.
B) Reinstall the operating system.
C) Replace the node motherboard.
36 - 0x4 (0)
This error may result if a known SMI cause is happening too frequently. In a normally
functioning node, SMIs should occur infrequently, as there is a performance impact
associated with handling each SMI. The BIOS will first attempt to disable known SMIs
in order to mask this problem. If that is insufficient, the BIOS will stop with this fatal
error.
Resolution:
A) Check for CPU memory DIMM correctable errors in the event log. Replace DIMMs if they
are suspect.
B) Check for hardware oscillating events in the event log (such as PS status). On some
node types, board GPIO changes are reported through SMI. You may need to replace
power supplies or another FRU.
C) Replace the node motherboard.
36 - 0x5 (0)
This error will result if the BIOS inadvertently changes the contents of CR2 while
processing an SMI. This should not happen in normal operation, but might happen as
the result of a "whack" command. As returning from this SMI could easily cause corruption
of the OS or of a user-level program, this fatal error is flagged instead.
Resolution: Cycle power on the node.
37 - zz (0)
Code 37 sub-codes are a bitmask of error values.
This means you may find an error which will simultaneously trigger multiple GEVENTs.
This event is probably one of the hardest to interpret as it often will indicate multiple
board devices have detected a fatal error condition. In general, it's much more
convenient to look up the decoded error in the BIOS output of the idelog rather than
manually decoding this event back to indicators.
Resolution: Look up each individual documented sub-code below which when OR'd
together form the sub-code observed.
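Because a Code 37 sub-code is a bitwise OR of the individual documented sub-codes, an observed value can be split back into its single-bit parts before looking each one up. The following is a minimal sketch; the function name is hypothetical and the example value 0x240 is illustrative only, not taken from a real log.

```python
def split_subcode(subcode: int) -> list[int]:
    """Return the single-bit sub-codes that OR together to form subcode."""
    return [1 << bit for bit in range(subcode.bit_length())
            if subcode & (1 << bit)]


# An observed sub-code of 0x240 would combine sub-codes 0x40 and 0x200.
print([hex(part) for part in split_subcode(0x240)])  # ['0x40', '0x200']
```

Each value returned corresponds to one of the Code 37 entries that follow (0x1, 0x2, 0x4, and so on).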
37 - 0x1 (0)
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x01
CMIC_FATAL (GEVENT0)
This error indicates the CMIC (North Bridge) had a fatal error.
Resolution:
A) Cycle power on the node.
B) Verify the system is getting adequate ventilation.
C) Remove any recently installed PCI cards.
D) Remove all PCI cards.
E) Replace the node motherboard.
37 - 0x2 (0)
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x02
ALERT (GEVENT1)
Error in PCI device 00.00.00 (CMIC-LE Memory Controller/Thin IMB):
ESR (0x4c) [0004]: IMBus error
(PCI errors not cleared)
The output above can be considered "typical" but really may contain any of the possible
CMIC (North Bridge) Memory Controller or other PCI bus errors. An IMBus error indicates
a communication problem between the North Bridge and one of the South Bridge or
CIOBX2. This would likely indicate a node motherboard failure. It has been observed
in the field that a flaky or bad PCI socket may also cause this.
Resolution:
A) Cycle power on the node.
B) Verify the system is getting adequate ventilation.
C) Remove any recently installed PCI cards.
D) Remove all PCI cards.
E) Replace the node motherboard.
This error indicates the MCH (North Bridge) has detected a fatal condition. Most likely
there are other error messages present in the idelog to help pinpoint the issue. Since
the MCH is the top of the root complex, it's very common to see the MCH indicating
Fatal error on nearly all failures.
Resolution:
A) Cycle power on the node.
B) Replace CPU DIMMs if no other error is indicated.
C) Replace the node motherboard.
37 - 0x4 (0)
S-Series (PIII) nodes:
--- SMI: smm_inb(0x39) == 0x04
GPE 2 triggered
THERMT_L0_OSB (GEVENT2)
This indicates a thermal event triggered a GPIO interrupt. It is a fatal condition on
Pentium III nodes, and the node will be immediately taken out of the cluster with this
fatal error.
Resolution:
A) Cycle power on the node. If it is a temperature related problem, verify the system is
getting adequate ventilation.
B) Replace the node motherboard.
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x04
GPE 2 triggered
P0_PROC_HOT (GEVENT2)
The Pentium 4 CPU supports clock modulation which reduces the core frequency when
the core temperature is too high. The BIOS enables this support when starting the OS,
so after the node has joined the cluster, the BIOS will asynchronously notify the OS if
this event occurs but not take it out of the cluster. At the same time, the Pentium 4
processor will automatically reduce its clock speed so as to generate less heat and not
reach a shutdown temperature. This message is therefore not fatal on P4 CPUs.
Resolution:
A) Cycle power on the node. If it is a temperature related problem, verify the system is
getting adequate ventilation.
B) Replace the node motherboard.
This error indicates either the PLX #0 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX
bridge #0 detected a parity error. These components manage PCI slots 4 and 5 on
T-Series and Slot 2 on F-Series.
See Code 37, sub-code 0x1 for resolution information.
This error indicates that the PLX #0 PCIe-PCIe bridge detected a fatal error. These
components manage PCI slots 6, 7, and 8; Harrier 1 and 2 LPC2.
See Code 37, sub-code 0x1 for resolution information.
37 - 0x8 (0)
S-Series (PIII) nodes:
--- SMI: smm_inb(0x39) == 0x08
GPE 3 triggered
THERMT_L1_OSB (GEVENT2)
This indicates a thermal event triggered a GPIO interrupt.
See Code 37, sub-code 0x2 for resolution information.
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x08
GPE 3 triggered
P1_PROC_HOT (GEVENT2)
This indicates a thermal event triggered a GPIO interrupt.
See Code 37, sub-code 0x2 for resolution information.
This error indicates either the PLX #0 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX
bridge #0 detected a fatal error (SERR). These components manage PCI slots 4 and 5
on T-Series and Slot 2 on F-Series.
See Code 37, sub-code 0x1 for resolution information.
This error indicates that the PLX #1 PCIe-PCIe bridge detected a fatal error. These
components manage PCI slots 3, 4, and 5; Harrier 1 and 2 LPC1.
See Code 37, sub-code 0x1 for resolution information.
37 - 0x10 (0)
S-Series (PIII) nodes:
GPE 4 triggered
MIRQ (GEVENT4)
This error indicates the memory controller (CNB20HE) triggered an interrupt. The
CNB20HE documentation lists possible sources as correctable ECC error on Memory
data bus and Processor data bus.
See below (P4) for resolution information.
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x10
GPE 4 triggered
P0_IERR (GEVENT4)
This error indicates that P4 CPU 0 has asserted IERR#, which is used to indicate a
processor internal error event occurred. The Intel documentation indicates one cause
of this error is a machine check exception when exceptions have not yet been enabled.
From our experience in the field, the problem is possibly a CPU or node motherboard
failure.
This error indicates either the PLX #1 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX
bridge #1 detected a parity error. These components manage PCI slots 2 and 3 on
T-Series and Slot 1 on F-Series.
See Code 37, sub-code 0x1 for resolution information.
This error indicates an internal error. This should not occur in a V-Series system.
See Code 37, sub-code 0x1 for resolution information.
37 - 0x20 (0)
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x20
GPE 5 triggered
P1_IERR (GEVENT5)
This error indicates that P4 CPU 1 has asserted IERR#.
See Code 37, sub-code 0x10 (P4) for resolution information.
This error indicates either the PLX #1 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX
bridge #1 detected a fatal error (SERR). These components manage PCI slots 2 and 3
on T-Series and Slot 1 on F-Series.
See Code 37, sub-code 0x1 for resolution information.
This error indicates that NEMOE raised the FPGA SMI interrupt and it was not handled
properly.
See Code 37, sub-code 0x1 for resolution information.
37 - 0x40 (0)
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x40
GPE 6 triggered
P_SERR (GEVENT6)
This error indicates one or more of the system's chipset devices is asserting P_SERR
(primary side system error). Output is usually followed by outstanding PCI errors as
indicated by chipset devices.
Resolution:
A) Identify and replace failing PCI card based on error output. It may be necessary to
contact hardware engineering with BIOS output to determine which PCI slot is at fault.
B) Remove all PCI cards.
C) Replace the node motherboard.
This error indicates the MCH (North Bridge) has detected an uncorrectable error. Most
likely there are other error messages present in the idelog to help pinpoint the issue.
Since the MCH is the top of the root complex, it's very common to see the MCH
indicating Uncorrectable error on nearly all failures.
Resolution:
A) Cycle power on the node.
B) Replace CPU DIMMs if no other error is indicated.
C) Replace the node motherboard.
37 - 0x80 (0)
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x80
GPE 7 triggered
P_PERR (GEVENT7)
This error indicates one or more of the system's chipset devices is asserting P_PERR
(primary side parity error).
See Code 37, sub-code 0x40 for resolution information.
This error indicates either the PLX #2 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX
bridge #2 detected a fatal error (SERR). These components manage PCI slots 0 and 1
on T-Series and Slot 0 on F-Series.
See Code 37, sub-code 0x1 for resolution information.
This error indicates an internal error. This should not occur in a V-Series system.
See Code 37, sub-code 0x1 for resolution information.
37 - 0x100 (0)
S-Series (PIII) nodes:
--- SMI: smm_inb(0x3a) == 0x01
GPE 8 triggered
CPU_TEMP_INTR (GEVENT8)
This indicates a CPU temperature event triggered a GPIO interrupt.
See Code 37, sub-code 0x2 for resolution information.
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x3a) == 0x01
GPE 8 triggered
S_SERR (GEVENT8)
This error indicates one or more of the system's chipset devices is asserting S_SERR
(secondary side system error).
See Code 37, sub-code 0x40 for resolution information.
T-Series and F-Series (5000P) nodes:
--- SMI request via EXT_SMI
This error indicates another node in the cluster has forced this node to handle an SMI.
Most likely the other node is attempting to force a panic dump because the local node
has stopped responding.
Resolution:
A) Inspect the core dump to determine if the cause was a software or hardware failure.
B) Replace the node motherboard if the issue recurs and cannot be identified as a
software failure.
This error indicates an internal error. This should not occur in a V-Series system.
See Code 37, sub-code 0x1 for resolution information.
37 - 0x200 (0)
S-Series (PIII) nodes:
This error indicates one or more of the system's chipset devices is asserting SERR
(system error). Output is followed by the PCI scan results, which display outstanding
PCI errors of all PCI bus devices.
See below (P4) for resolution information.
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x3a) == 0x02
GPE 9 triggered
S_PERR (GEVENT8)
This error indicates one or more of the system's chipset devices is asserting S_PERR
(secondary side parity error).
Resolution:
A) Identify and replace failing PCI card based on error output. It may be necessary to
contact hardware engineering with BIOS output to determine which PCI slot is at fault.
B) Remove all PCI cards.
C) Replace the node motherboard.
This error indicates that CPU 0 has asserted IERR#, which is used to indicate a processor
internal error event occurred. The Intel documentation indicates one cause of this error
is a machine check exception when exceptions have not yet been enabled. From our
experience in the field, the problem is possibly a CPU or node motherboard failure.
Resolution:
A) Cycle power on the node.
B) Verify the system is getting adequate ventilation.
C) Remove all PCI cards.
D) Replace the node motherboard.
37 - 0x400 (0)
This error indicates that CPU 1 has asserted IERR#, which is used to indicate a processor
internal error event occurred.
See Code 37, sub-code 0x200 for resolution information.
38 - 0x9 (data)
Both Power Supplies failed: DC Output Bad
This error indicates there is a hardware problem in one of the node power supplies. If
this failure is transient, it could also be caused by turning the power supply off and then
on or by a quick AC loss followed by AC being restored. If both power supplies fail
simultaneously (not likely), this is a fatal error.
The data value may be decoded to determine which power supply triggered this error.
The low 2 bits are a bitmask of the DC Output status for the two power supplies.
As a Fatal error, the value will be 3, indicating PS0 and PS1 both had a DC Output
Bad.
Resolution:
A) Ensure a service operation was not taking place
at the time, and that AC had not also failed.
B) Replace the power supply.
C) Replace the node motherboard.
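The low two bits of the Code 38, sub-code 0x9 data value form a bitmask of the DC Output status of the two power supplies. The sketch below decodes that mask. It assumes bit 0 corresponds to PS0 and bit 1 to PS1; the entry above only confirms that a value of 3 means both supplies reported DC Output Bad. The function name is hypothetical.

```python
def bad_dc_outputs(data: int) -> list[str]:
    """List the power supplies flagged in the low two bits of data.

    Assumption: bit 0 is PS0 and bit 1 is PS1.
    """
    return [f"PS{n}" for n in range(2) if data & (1 << n)]


# As a fatal error the value is 3: both supplies had a DC Output Bad.
print(bad_dc_outputs(3))  # ['PS0', 'PS1']
```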
38 - 0x11 (data)
PS x has down-rev firmware (x)
This failure code indicates the power supply firmware revision is not up-to-date and
therefore not supported.
Resolution: Replace power supply.
38 - 0x12 (data)
PS x Battery has down-rev firmware (rev)
This failure code indicates the battery attached to the power supply indicated has
firmware that is not up-to-date and therefore not supported.
Resolution: Replace battery.
39 - 0x1 (0)
Maximum count for no successful OS boot (xxxx) exceeded.
Type "unset cnt_no_os_boot" to clear this error.
This error indicates that the BIOS has detected that the node has not successfully booted
the OS and will now prohibit boots until operator intervention clears this error.
Resolution:
A) Clear this error as suggested in the error text. You may also turn off this checking
mechanism if it does not meet your application. To do this, type, "unset max_no_os_boot"
at a Whack prompt.
B) Verify that a valid operating system image is installed on the node's internal disk.
Reinstall the operating system if defective.
C) Replace the IDE drive.
39 - 0x2 (0)
Maximum count for OS boot with no cluster (xxxx) exceeded.
Type "unset cnt_no_cluster" to clear this error.
This error indicates that the BIOS has detected that the node has booted, but the cluster
has not successfully formed several times. The BIOS will prohibit boots until operator
intervention clears this error. This is to prevent cyclic node up/down caused by a
hardware or software failure. This increases the reliability of the cluster by preventing
the node from continuously attempting to join the cluster.
Resolution:
A) Clear this error as suggested in the error text. You may also turn off this checking
mechanism if it does not meet your application. To do this, type, "unset max_no_cluster"
at a Whack prompt.
B) Verify that a valid operating system image is installed on the node's internal disk.
Reinstall the operating system if defective.
C) Replace the IDE drive.
39 - 0x3 (0)
Maximum count for OS panic (xxxx) exceeded.
Type "unset cnt_os_panic" to clear this error.
This error indicates that the BIOS has detected that the node has booted and then caused
a panic several times. When the OS causes a panic, it notifies the BIOS of this event,
so the BIOS can track problems. Once a limit is exceeded, the BIOS will prohibit boots
until operator intervention clears this error. This is to prevent cyclic node up/down
caused by a hardware or software failure. This increases the reliability of the cluster by
preventing the node from continuously attempting to join the cluster.
Resolution:
A) Clear this error as suggested in the error text. You may also turn off this checking
mechanism if it does not meet your application. To do this, type, "unset max_os_panic"
at a Whack prompt.
B) Verify that a valid operating system image is installed on the node's internal disk.
Reinstall the operating system if defective.
C) Replace the IDE drive.
39 - 0x4 (0)
Maximum count for OS cluster without shutdown (xxxx) exceeded.
Type "unset cnt_no_shutdown" to clear this error.
This error indicates that the BIOS has detected that the node has booted, but has not
been shut down properly several times. The BIOS will prohibit boots until operator
intervention clears this error. This is to prevent cyclic node up/down caused by a
hardware or software failure. This increases the reliability of the cluster by preventing
the node from continuously attempting to join the cluster.
Resolution:
A) Clear this error as suggested in the error text. You may also turn off this checking
mechanism if it does not meet your application. To do this, type, "unset
max_no_shutdown" at a Whack prompt.
B) Verify that a valid operating system image is installed on the node's internal disk.
Reinstall the operating system if defective.
C) Replace the IDE drive.
39 - 0x5 (0)
Maximum count for same fatal error (xxxx) exceeded.
Type "unset cnt_same_fatal" to clear this error.
This error indicates that the BIOS has detected that the same fatal or non-fatal error has
occurred repeatedly. The BIOS will prohibit boots until operator intervention clears this
error. This is to prevent cyclic node up/down caused by a hardware or software failure.
This increases the reliability of the cluster by preventing the node from continuously
attempting to join the cluster.
Resolution:
A) Observe other errors present in the PROM log to determine the cause of this error.
B) Clear this error as suggested in the error text. You may also turn off this checking
mechanism if it does not meet your application. To do this, type, "unset max_same_fatal"
at a Whack prompt.
C) Verify that a valid operating system image is installed on the node's internal disk.
Reinstall the operating system if defective.
D) Replace the IDE drive.
39 - 0x6 (0)
Maximum count for errors logged (xxxx) exceeded.
Type "unset cnt_log_error" to clear this error.
This error indicates that the BIOS has detected that it has recorded too many fatal or
non-fatal errors in the board serial PROM and that it should prohibit further boots until
operator intervention clears this error. This is to prevent cyclic node up/down caused
by a hardware or software failure. This increases the reliability of the cluster by
preventing the node from continuously attempting to join the cluster.
Resolution:
A) Observe other errors present in the PROM log to determine the cause of this error.
B) Clear this error as suggested in the error text. You may also turn off this checking
mechanism if it does not meet your application. To do this, type, "unset max_log_error"
at a Whack prompt.
C) Verify that a valid operating system image is installed on the node's internal disk.
Reinstall the operating system if defective.
D) Replace the IDE drive.
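Code 39 sub-codes 0x1 through 0x6 each pair a counter variable (cleared to resume booting) with a limit variable (unset to turn the check off). The lookup below simply collects the variable names given in the entries above into one place; it is a convenience sketch, and the table and function names are hypothetical. The resulting strings are typed at a Whack prompt.

```python
# Code 39 sub-code -> (counter variable, limit variable), as named above.
BOOT_LIMIT_VARIABLES = {
    0x1: ("cnt_no_os_boot", "max_no_os_boot"),
    0x2: ("cnt_no_cluster", "max_no_cluster"),
    0x3: ("cnt_os_panic", "max_os_panic"),
    0x4: ("cnt_no_shutdown", "max_no_shutdown"),
    0x5: ("cnt_same_fatal", "max_same_fatal"),
    0x6: ("cnt_log_error", "max_log_error"),
}


def whack_commands(subcode: int) -> dict[str, str]:
    """Return the Whack commands that clear or disable a Code 39 check."""
    counter, limit = BOOT_LIMIT_VARIABLES[subcode]
    return {
        "clear": f"unset {counter}",        # clears the error
        "disable_check": f"unset {limit}",  # turns the mechanism off
    }


print(whack_commands(0x3))
```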
39 - 0x10 (0)
Invalid boot sector.
Use "boot net install" to correct this.
The IDE disk is used for booting the operating system. This error indicates the boot sector
which has been loaded from the disk does not have a valid signature. The most likely
cause of this error is that a fresh IDE drive has been installed in the node and it needs
to be field net installed.
Disk MBR does not have a valid partition table
You may also see the above line immediately following the fatal error. This message
indicates the partition table in the boot sector (Master Boot Record) was also invalid,
and that an "ide log" entry could not be written.
Resolution:
A) If no hardware has been replaced, first try cycling power on the node.
B) Perform a field IDE net install on the drive, or use "boot net install".
C) Use the "ide smart status" command to acquire the drive SMART status. Replace the IDE
drive if a failure is reported.
D) Replace the IDE cable.
E) Replace the IDE drive.
F) Replace the node motherboard.
41 - 0x0 (0)
The computed CPU speed is lower than the expected minimum supported in a 3PAR
node. Most likely this is due to a hardware failure. Since the CPU speed computation
depends upon access to the RTC, it is most likely there is a communication problem
with the SuperIO containing the RTC. If you need to run with a reduced CPU speed,
enter the following command on the node:
Whack> set perm cpu_slow_ok
See Code 41, sub-code 0x1 for resolution information.
41 - 0x1 (0)
After the CPU speed is computed, the memory bus (FSB) speed is computed. It is
computed based on the CPU speed, and bus speed multiplier as reported by the CPU.
If you need to run with a reduced Memory bus speed, enter the following command on
the node:
Whack> set perm mem_slow_ok
Resolution:
A) Cycle power on the node.
B) Replace the bootstrap CPU.
C) Replace the node motherboard.
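The bus speed check for Code 41, sub-code 0x1 derives the memory bus speed from the measured CPU speed and the multiplier reported by the CPU. The sketch below illustrates that relationship using the general rule that core clock equals bus clock times multiplier. It is an assumption that the BIOS uses exactly this arithmetic, and the function name and sample numbers are hypothetical.

```python
def bus_speed_mhz(cpu_mhz: float, multiplier: float) -> float:
    """Derive the bus clock from the CPU clock and its bus multiplier.

    Assumption: core clock = bus clock * multiplier.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return cpu_mhz / multiplier


# Hypothetical values: a 2800 MHz core with a multiplier of 21.
print(round(bus_speed_mhz(2800, 21), 1))  # 133.3
```

Under this relationship a CPU speed that is measured too low (sub-code 0x0) also yields a low computed bus speed.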
42 - 0x1 (0)
Failed CP PROM ww.xx.yy.zz read
Centerpanel access using Manufacturing PROM: FAILURE
The centerpanel is used by the 3PAR cluster for the nodes to communicate. The CM
links and backup serial links serve this purpose. There is also a diagnostic I2C bus
present in the centerpanel which is used by nodes to diagnose error conditions and
reset other nodes in the cluster.
As part of the manufacturing process, this bus is tested by accessing the serial PROM
which is present on a manufacturing centerpanel. If this test fails, it is likely the node
will have a problem accessing the centerpanel I2C bus.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
42 - 0x2 (0)
Failed CP PROM ww.xx.yy.zz write
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.
42 - 0x3 (0)
CP PROM node data does not match what is written: Addr xxxx
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.
42 - 0x4 (0)
CP PROM pattern data read is incorrect
Addr xx Expected yy Read zz
...
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.
42 - 0x5 (0)
Failed I2C access to board register x.y.z
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.
42 - 0x6 (0)
Failed I2C access to board register x.y.z
Centerpanel access using Manufacturing PROM: FAILURE
Titan specific. The test performs a read accessibility check for extra I2C addresses while
testing CP PROM 0.a0 and fails with a fatal error message if an address is not accessible.
Note that if the failure is not related to CP PROM 0.a0, it will not print the "CP PROM at
0.a0:" message, only "Failed I2C access to board register x.yy".
See Code 42, sub-code 0x1 for resolution information.
43 - 0x0 (data)
Voltage ID indicates CPUxx present but TEMP sensor disagrees.
This error indicates either a CPU failure or onboard sensors are reading incorrect values
for the specified CPU.
The VID (voltage ID sense) lines are attached to each physical CPU and used to indicate
to the VRMs (voltage regulator modules) the voltage level expected by the CPU. These
lines are also connected to the LM87 which use this to determine the correct voltage
which should be delivered to the CPU.
The TEMP (temperature) sensor is connected to an on-die CPU thermal diode. If its
reading is out of acceptable range, the BIOS determines the sensor is not reliably
connected to a CPU, or a CPU is not present.
Bits 0-1 of data indicate CPU non-presence as determined by the VID sense lines. Bits
8-9 of data indicate CPU non-presence as determined by connection to the thermal
diode.
Data Value Failure
---------- ----------------------------------------------------
1 CPU0 does not respond to startup
2 CPU1 does not respond to startup
10 CPU0 thermal sensor/voltage ID indicates not present
20 CPU1 thermal sensor/voltage ID indicates not present
Resolution:
A) Cycle power on the node.
B) Remove physical CPU from specific socket and test with no CPU present.
B1) If error persists, replace node motherboard.
B2) If error clears, replace CPU.
C) Replace the node motherboard.
43 - 0x1 (data)
Voltage ID indicates CPUxx not present but TEMP sensor disagrees.
See Code 43, sub-code 0x0 for resolution information.
43 - 0x2 (data)
Physical CPUxx active, but thermal sensor disagrees
Bits 0-1 of data indicate CPU non-presence as determined by the running CPU APIC
addresses. Bits 8-9 of data indicate CPU non-presence as determined by connection to
the thermal diode.
See Code 43, sub-code 0x0 for resolution information.
43 - 0x3 (data)
Physical CPUxx not active, but thermal sensor disagrees
Bits 0-1 of data indicate CPU non-presence as determined by the running CPU APIC
addresses. Bits 8-9 of data indicate CPU non-presence as determined by connection to
the thermal diode.
See Code 43, sub-code 0x0 for resolution information.
43 - 0x4 (data)
Not all hyper-threads started on physical CPUxx
Bits 0-1 of data indicate logical CPU non-presence in physical CPU0 as determined by
the running CPU APIC addresses.
Bits 2-3 of data indicate logical CPU non-presence in physical CPU1 as determined by
the running CPU APIC addresses.
See Code 43, sub-code 0x0 for resolution information.
43 - 0x5 (data)
Not all cores started on physical CPUxx
Bits 0-3 of data indicate logical CPU non-presence in physical CPU0 as determined by
the running CPU APIC addresses.
Bits 4-7 of data indicate logical CPU non-presence in physical CPU1 as determined by
the running CPU APIC addresses.
See Code 43, sub-code 0x0 for resolution information.
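The per-CPU bit fields described for sub-codes 0x4 and 0x5 can be unpacked with a few shifts and masks. The following is a minimal sketch, not part of CBIOS: the function name is hypothetical, and it assumes that a set bit marks a logical CPU that did not start.

```python
def missing_logical_cpus(data, bits_per_cpu):
    """Return the logical CPU indices flagged as not started, per physical CPU.

    bits_per_cpu is 2 for sub-code 0x4 (hyper-threads) and 4 for sub-code 0x5
    (cores). Assumption: a set bit means "did not start".
    """
    mask = (1 << bits_per_cpu) - 1
    result = {}
    for socket in (0, 1):
        # Physical CPU0 uses the low field, physical CPU1 the next field up.
        field = (data >> (socket * bits_per_cpu)) & mask
        result["CPU%d" % socket] = [
            bit for bit in range(bits_per_cpu) if field & (1 << bit)
        ]
    return result


# Sub-code 0x4 with data 0x4: bit 2 is set, which falls in the CPU1 field.
print(missing_logical_cpus(0x4, 2))
# Sub-code 0x5 with data 0x21: bit 0 (CPU0 field) and bit 5 (CPU1 field) are set.
print(missing_logical_cpus(0x21, 4))
```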
43 - 0x10 (xx)
CMIC heatsink disconnected: yy
The GPIOs reporting proper connection of the CMIC (North Bridge) heatsink report a
loss of connection. This is a board failure which requires a lab technician to reattach
the heatsink.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
44 - 0x01 (xx)
The VSC055 reports tachometer inputs for both node fans, 0 and 1. This is a dual node
fan failure which requires both of the fans to be replaced. The system may overheat.
Resolution:
A) Cycle power on both nodes.
B) Replace both node fans.
45 - 0x00 (data)
This error code indicates an error while running the QLogic iSCSI POST.
Failed Test (bits 8-15), Slot (bits 4-7) and Port (bits 0-3) are packed into data.
Failed Test is one of the following:
<QLogic internal card diagnostics>
2 Test Local RAM Size
3 Test Local RAM R/W
4 Test RISC RAM
5 Test NVRAM
6 Test Flash ROM
7 Test Network Internal Loopback
8 Test Network External Loopback
9 Test DMA Transfer
240 (0xf0) Test NOP
241 (0xf1) Test Registers
242 (0xf2) Test DMA Transfer to CPU memory
243 (0xf3) Test DMA Transfer to Cluster memory
244 (0xf4) Card Initialization
Resolution:
A) Cycle power on failing node.
B) Re-seat failing iSCSI card
C) Replace failing iSCSI card
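The packing of Failed Test, Slot, and Port into the data field can be unpacked as sketched below. The bit positions and test names come from the description above; the function name and the sample data values are illustrative assumptions only.

```python
# Failed Test values listed for Code 45, sub-code 0x00.
QLOGIC_TESTS = {
    2: "Test Local RAM Size",
    3: "Test Local RAM R/W",
    4: "Test RISC RAM",
    5: "Test NVRAM",
    6: "Test Flash ROM",
    7: "Test Network Internal Loopback",
    8: "Test Network External Loopback",
    9: "Test DMA Transfer",
    0xF0: "Test NOP",
    0xF1: "Test Registers",
    0xF2: "Test DMA Transfer to CPU memory",
    0xF3: "Test DMA Transfer to Cluster memory",
    0xF4: "Card Initialization",
}


def decode_iscsi_post(data):
    """Split the data field into failed test (bits 8-15), slot (4-7), port (0-3)."""
    test = (data >> 8) & 0xFF
    return {
        "test": QLOGIC_TESTS.get(test, "unknown (0x%02x)" % test),
        "slot": (data >> 4) & 0xF,
        "port": data & 0xF,
    }


# Hypothetical data value 0x0731: test 7, slot 3, port 1.
print(decode_iscsi_post(0x0731))
```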
46 - 0x1 (0)
This error code indicates CBIOS does not recognize the chipset installed on the node's
motherboard.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
47 - 0x00 (0)
No CPU SDRAM is available.
This error indicates that CBIOS has no working CPU memory available for it to continue
with POST and ultimately boot the node.
Resolution:
A) Cycle power on the node.
B) Replace CPU DIMMs.
C) Replace the node motherboard.
48 - 0x0 (XXXXXXXX)
This error code indicates CBIOS does not recognize the board type for the chipset
installed on the node's motherboard.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
49 - 0x1 (data)
Failed to find USB device handle
or
Inquiry Request Failed rc = xxxx
The USB controller failed to perform a self test. A data value of 0 indicates the BIOS
failed to find a USB handle.
Resolution:
A) If a USB Flash drive is not expected to be present, set the "usb_nodevice_ok" NVRAM
variable to override BIOS requiring a USB Flash drive be found.
B) Replace the USB Flash drive.
C) Replace the node motherboard.
49 - 0x4 (0)
There was a USB failure in data requested by the operating system bootstrap. It is
possible that data on the disk has become corrupt to the point the operating system will
not successfully load.
Resolution: Reinstall the operating system bootstrap with the "boot net install" command.
49 - 0x6 (0)
USB reported a failure in the read verify command.
See Code 49, sub-code 0x1 for resolution information.
49 - 0x7 (0)
USB reported a failure in the write verify command.
See Code 49, sub-code 0x1 for resolution information.
50 - 0x1 (0)
Invalid control cache setup.
Resolution: Contact 3PAR technical support.
50 - 0x2 (<DIMM>)
Incompatible FB-DIMM installed.
Resolution: Replace DIMM.
50 - 0x3 (<DIMM>)
Electrically isolated FB-DIMM.
Resolution:
A) Replace DIMM.
B) Replace node.
50 - 0x4 (<DIMM>)
Incompatible module installed.
Resolution: Replace DIMM.
50 - 0x5 (<DIMM>)
Mismatched DIMM pair.
Resolution: Replace DIMM.
50 - 0x6 (<DIMM>)
Odd rank disabled.
Resolution: Replace DIMM.
50 - 0x7 (0)
FB-DIMM branch failed to train and lockstep mode has been disabled.
Resolution:
A) Replace all DIMMs.
B) Replace node.
50 - 0x9 (<DIMM>)
FB-DIMM northbound merge has been disabled.
Resolution: Replace DIMM.
50 - 0xa (<DIMM>)
FB-DIMM disabled due to lockstep skew.
Resolution: Replace DIMM.
50 - 0xb (<DIMM>)
FB-DIMM rank disabled due to Built-in Self Test failure.
Resolution: Replace DIMM.
50 - 0xe (0)
Memory interleave range limit invalid.
Resolution: Contact 3PAR technical support.
50 - 0xf (0)
High temp disabled.
Resolution: Contact 3PAR technical support.
50 - 0x10 (<DIMM>)
Logical rank with CECC detected.
Resolution: Replace DIMM.
50 - 0x12 (0)
Sub-optimal FB-DIMM channel population detected.
Resolution: Contact 3PAR technical support.
50 - 0x13 (0)
Mismatched AMB pair.
Resolution: Replace all DIMMs.
50 - 0x14 (0)
FB-DIMM branch disabled.
Resolution:
A) Replace all DIMMs.
B) Replace node.
50 - 0x15 (0)
FB-DIMM thermal throttling has been disabled.
Resolution: Contact 3PAR technical support.
50 - 0x16 (0)
Last FB-DIMM AMB has been disabled.
Resolution: Contact 3PAR technical support.
50 - 0x17 (0)
The FB-DIMM memory branches do not match in size.
Resolution: Contact 3PAR technical support.
51 - 0x1 (Data)
The BIST (Built-in Self Test) in Harrier reported either a BAD value or a different value
from what was recorded in the node PROM during MFG board assembly. (Data =
Harrier BIST result)
Resolution: Replace the node. Note for OPS that Harrier BIST failed, and that the PROM
should not be wiped.
53 - 0x0 (xxxx)
The CPU was unable to communicate with the FPGA.
Resolution: Replace node motherboard.
54 - 0x0 (xxyy)
A CPU VRM is missing.
or
A CPU VRM is not providing power.
Resolution:
A) Replace CPU VRM yy.
B) Replace node motherboard.
55 - 0xzzzzzzzz (yyy)
UEFI failed to boot, failed during PEI due to assert.
Look-up zzzzzzzz in doc/udk_hash_index.csv of udk2010_up3 tree to determine
filename of assert. yyy specifies line number (in hex).
Resolution: Contact 3PAR technical support.
56 - 0xaabb (yyy)
UEFI failed to boot, failed during Intel MRC memory training code.
aa specifies the major code, bb specifies the minor code.
* Major Code Table
0xE8 ERR_NO_MEMORY
0xE9 ERR_LT_LOCK
0xEA ERR_DDR_INIT
0xEB ERR_MEM_TEST
0xEC ERR_VENDOR_SPECIFIC
0xED ERR_DIMM_COMPAT
0xEE ERR_MRC_COMPATIBILITY
0xEF ERR_MRC_STRUCT
Resolution: Contact 3PAR technical support.
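Before contacting support, the 0xaabb sub-code of Code 56 can be split into its major and minor parts. This is a minimal sketch using only the Major Code Table above; the function name and the sample value 0xEB02 are hypothetical.

```python
# Major codes from the Code 56 Major Code Table.
MRC_MAJOR_CODES = {
    0xE8: "ERR_NO_MEMORY",
    0xE9: "ERR_LT_LOCK",
    0xEA: "ERR_DDR_INIT",
    0xEB: "ERR_MEM_TEST",
    0xEC: "ERR_VENDOR_SPECIFIC",
    0xED: "ERR_DIMM_COMPAT",
    0xEE: "ERR_MRC_COMPATIBILITY",
    0xEF: "ERR_MRC_STRUCT",
}


def decode_mrc_subcode(subcode):
    """Return (major code name, minor code) for a Code 56 sub-code 0xaabb."""
    major = (subcode >> 8) & 0xFF   # aa
    minor = subcode & 0xFF          # bb
    return MRC_MAJOR_CODES.get(major, "unknown major 0x%02X" % major), minor


# Hypothetical sub-code 0xEB02: major 0xEB, minor 0x02.
print(decode_mrc_subcode(0xEB02))
```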
57 - 0xzzzzzzzz (yyy)
UEFI failed to boot, failed during DXE due to assert.
Look-up zzzzzzzz in doc/udk_hash_index.csv of udk2010_up3 tree to determine
filename of assert. yyy specifies line number (in hex).
Resolution: Contact 3PAR technical support.
CBIOS Degraded Error Codes and Resolution
This table explains the codes, sub-codes, error code descriptions, and problem resolutions for the
CBIOS error codes.
Degraded Alerts
Code Description
2 - 0x1 (0)
The Real-Time Clock (RTC) is a function of the SuperIO which provides a battery backed
system clock and a small quantity of battery backed Non-Volatile RAM for system
configuration flags. This error indicates the RTC memory has become corrupt, possibly
due to a dead battery or battery removal when no mainline power was available.
Resolution:
A) Power down, wait 30 seconds, power up. This error should self-correct (likely with
a loss of current date/time and other NVRAM contents). Set the date and
time using the Whack "rtc date" command.
B) Replace the RTC battery, located near the SuperIO ASIC.
C) Use the Whack command "rtc date" to set the RTC date and time.
D) Replace the node motherboard.
2 - 0x2 (0)
RTC_BATTERY_LOW
RTC / NVRAM Battery Failure - Replace battery.
The RTC / NVRAM battery was found to have a low voltage by the built-in monitoring
circuit of the Real Time Clock (RTC).
The RTC battery provides power to the RTC clock function of the SuperIO while the
board is not drawing mainline supply power. Over time, this battery's available power
will decay (rated for over five years normal operation).
Resolution:
A) Replace the RTC lithium cell battery on the node motherboard.
B) Replace the node motherboard.
2 - 0x3 (0)
RTC_INVALID_TIME
The current RTC date/time is invalid. Enter the correct date/time or press Tab to acquire
it from the network.
If the time has not yet been set, or becomes invalid due to loss of battery power, the
BIOS reports this error and waits for the user to update the time.
Resolution:
A) Enter the correct time.
B) Press TAB to acquire the time from the network.
C) Press ^C to abort prompt and resume boot.
2 - 0x4 (0)
RTC_BATTERY_LOW
RTC / NVRAM Battery Failure - Replace battery.
The RTC / NVRAM battery was found to have a low voltage by the built-in monitoring
circuit of the RTC (TOD clock).
Resolution:
A) Replace the lithium-ion cell battery on the node.
B) Replace the node motherboard.
10 - 0x1e (0)
The indicated PLX switch chip has incorrect hardware configuration strappings.
Resolution: Replace the node motherboard.
10 - 0x1f (YYYYYYxx)
This error indicates that the device found is not running at the correct PCIe link width.
If the "xx" portion of the data field is non-zero, it indicates a problem with a particular
PCI slot. The specific codes for "xx" are as follows:
30 is PCI Slot 0
31 is PCI Slot 1
32 is PCI Slot 2
33 is PCI Slot 3
34 is PCI Slot 4
35 is PCI Slot 5
36 is PCI Slot 6
37 is PCI Slot 7
38 is PCI Slot 8
To ignore this error, enter Whack by pressing ^W and entering:
Whack> set perm pci_speed_any
Resolution:
A) Replace indicated card (if "xx" is non-zero).
B) Replace node motherboard.
10 - 0x20 (YYYYYYxx)
This error indicates that the device found is not running at the correct PCIe link speed.
See Code 10, sub-code 0x1f for resolution information.
10 - 0x21 (xxx)
This error indicates that a PCI device was found in a slot which was expected to be
empty. The likely cause of this failure is an HBA which is not fully seated. If this is an
expected failure, you can set "pci_missing_ok" to override this check.
Resolution:
A) Reseat or replace the indicated HBA.
B) Replace node motherboard.
10 - 0x22 (xxx)
This error indicates that no PCI device was found in a slot which was expected to be
populated (HBA present). The likely cause of this failure is an HBA which has failed. If
this is an expected failure, you can set "pci_missing_ok" to override this check.
Resolution:
A) Reseat or replace the indicated HBA.
B) Replace node motherboard.
10 - 0x23 (data)
This error indicates that during a previous PCI scan, the CPU hung. The most probable
cause of this error is a defective HBA. The data field provides several details about the
suspect device. The low byte indicates which PCI slot, if known. Value 0x30 corresponds
to PCI Slot 0, 0x31 is PCI Slot 1, ..., 0x38 is PCI Slot 8. Byte 2 and byte 1 correspond
to the PCI bus.dev.func. Byte 3 indicates whether the failure occurred during a PCI error
scan, and whether this is a repeat failure. Decode table for data:
bits 0..7 PCI Slot (0x00=MB, 0x30..0x38=PCI Slot 0..8)
bits 8..10 PCI func
bits 11..15 PCI dev
bits 16..23 PCI bus
bits 24..28 Reserved (0)
bit 29 Repeat flag (1=repeat -- fatal error)
bit 30 Hang during (0=PCI scan, 1=PCI error scan)
bit 31 Reserved (1)
Example (data=c00a0a35):
The 0x35 value implicates PCI Slot 5.
The 0a0a value is bus.dev.func 0a.01.02.
The c0 value tells the hang occurred during a PCI error scan.
Example (data=a0090831):
The 0x31 value implicates PCI Slot 1.
The 0908 value is bus.dev.func 09.01.00.
The a0 value indicates a repeated hang during the PCI scan.
Resolution:
A) Replace HBA if PCI Slot is indicated.
B) Convert to PCI bus.dev.func and match with the suspect PCI device from previous
BIOS messages. If this is an onboard device, replace the node
motherboard.
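The decode for Code 10, sub-code 0x23 can be sketched as a small helper. The function name is hypothetical. The device number is taken from bits 11..15 (the usual PCI device/function split), because that is the layout that reproduces both worked examples above (0a.01.02 and 09.01.00).

```python
def decode_pci_hang(data):
    """Decode the data field of Code 10, sub-code 0x23."""
    slot_byte = data & 0xFF
    if 0x30 <= slot_byte <= 0x38:
        slot = "PCI Slot %d" % (slot_byte - 0x30)   # ASCII '0'..'8'
    elif slot_byte == 0x00:
        slot = "motherboard"
    else:
        slot = "unknown"
    bus = (data >> 16) & 0xFF
    dev = (data >> 11) & 0x1F    # assumed bits 11..15, see note above
    func = (data >> 8) & 0x07
    return {
        "slot": slot,
        "bus.dev.func": "%02x.%02x.%02x" % (bus, dev, func),
        "repeat": bool(data & (1 << 29)),
        "hang_during": "PCI error scan" if data & (1 << 30) else "PCI scan",
    }


# The two examples from the description above.
print(decode_pci_hang(0xC00A0A35))
print(decode_pci_hang(0xA0090831))
```

The bus.dev.func string returned here is the value to match against the suspect PCI device in earlier BIOS messages.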
17 - 0x10 (data)
A disk SMART threshold was triggered. This would indicate an imminent boot drive
failure.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x16 (0)
No IDE device was found.
Resolution:
A) Install or replace the IDE or SATA drive.
B) Replace the node motherboard.
17 - 0x19 (0)
DMA xfer error code xxxx
The drive DMA test failed due to a timeout. Although each sequential DMA read
operation is succeeding, the total test time was exceeded. The likely cause of this failure
is a drive which is having to perform a large number of relocations due to failed sectors,
or a drive interface failure which only shows up under stress.
Resolution: Replace the IDE or SATA boot drive.
17 - 0x40 (xxxxxxxx)
Drive returned an error status after command execution.
xxxxxxxx, AHCI Port Status register, for lab debug
Resolution: Replace the IDE or SATA boot drive.
17 - 0x41 (xxxxxxxx)
Drive returned an error status after command execution.
xxxxxxxx, AHCI Port Error register, for lab debug
Resolution: Replace the IDE or SATA boot drive.
17 - 0x42 (xxxxxxxx)
Drive returned an error status after command execution.
xxxxxxxx, AHCI Port TFD register, for lab debug
Resolution: Replace the IDE or SATA boot drive.
18 - zzzz (0)
*** Real-mode BIOS interrupt: xxxx (error: yyyy)
This error most commonly indicates a bad or missing boot area of the USB disk. Customer
Service node-disks or node spares (FRUs) might not be shipped with an operating system.
Attempting to boot from one of these disks without first installing the system software
might produce this error message. From the Whack prompt, use the "boot net install"
command to install the system software.
In order for Linux to boot, LILO must load the kernel image. It needs assistance from the
BIOS in order to perform this task. Linux also acquires some information from the BIOS
using 16 bit BIOS interrupts. CBIOS automatically accepts and emulates traditional 16
bit BIOS interrupts to support these methods.
If LILO or Linux triggers an interrupt which is not supported by CBIOS, this possibly fatal
error will result. There are many obsolete BIOS facilities which are not supported by
CBIOS. In some cases, the system boot may be able to continue after this error.
The sub-code and minor code indicate the specific BIOS interrupt called and the eax
register parameter value. This information may be useful to Engineering.
Resolution:
A) Reboot. Attempt to reproduce the problem.
B) Reinstall system software on the disk. This may require a "boot net install" in order
to reinstall the operating system.
C) There may be a bug in the OS you are using or it has been misconfigured. Confirm
this version of the OS has been verified to work on a 3PAR
node board. Or, temporarily swap system disks with a known good system disk.
D) Replace the boot drive and reinstall the system software.
E) Replace the node motherboard.
23 - 0x0 (0)
CRC mismatch for failsafe CBIOS
Upon startup, CBIOS computes a strong CRC over all executable code and data stored
in the flash. This is done to guard against flash corruption which also ensures reliable
system initialization and testing. This specific sub-code indicates that a CRC error was
detected in the failsafe component of CBIOS. The majority of the failsafe is only executed
if corruption is detected in the main CBIOS.
Resolution:
A) Try pressing ^C to resume. Perform a flash update as soon as possible. If flash
updating under Linux, make sure to specify the 'failsafe' option to update the failsafe
area as well.
B) If the flash update is successful, but you still get a CRC error, verify that your flash
image is intact. The Linux flash utility does this automatically using the same strong CRC
algorithm as the BIOS uses.
C) Replace the node motherboard.
23 - 0x1 (0)
Invalid entry point for full CBIOS
Boot with clustering disabled and update flash immediately!
Prior to starting up the non-failsafe (full diagnostic) CBIOS image, the failsafe CBIOS
performs some consistency checks over the image. This error indicates corruption was
detected in the entry point to the main routine of the full CBIOS. If you have recently
installed a new CBIOS which is larger than the previous one, it is possible to get this error
because the failsafe BIOS present cannot properly verify the larger BIOS.
Resolution:
A) Try pressing ^C to resume. Perform a flash update as soon as possible. Boot with
clustering disabled by typing "tpd nokmod" at the LILO prompt. Once the node has
booted, login as root and use the flash command. Example:
# flash /opt/tpd/bios/bios-1.9.4
Upon completion of the flash update, reboot and observe console messages to ensure
the CRC error no longer occurs.
B) If the flash update is successful, but you still get this error, verify that your flash image
is intact. The Linux flash utility does this automatically using the same strong CRC
algorithm as the BIOS uses.
C) Replace the node motherboard.
25 - 0x2 (0)
Ethernet MAC xx:xx:xx:xx:xx:xx mismatches PROM: yy:yy:yy:yy:yy:yy
The "prom mac" command may fix this.
This error indicates the MAC address stored in the onboard Ethernet controller's PROM
does not match that which can be computed from the board revision and serial number
stored in the node's PROM. This mismatch suggests that one or the other PROM may
contain corrupt contents.
If the Ethernet MAC address was purposely set to an address (see "prom mac"
command), then this check may be overridden by setting the NVRAM "oddmac" flag.
Example:
Whack> set perm oddmac
Resolution:
A) Look for a prior message indicating an invalid board type or check the banner to
ensure the board type and serial number are correct for this node. If either is not correct,
use the 'prom edit' command to repair the corruption.
B) Use the "prom mac" command to reprogram the MAC address in the Ethernet
controller's PROM.
C) Replace the node motherboard.
26 - 0x1 (ethdev)
eth0 device self test: FAIL All tests: xxxx (timeout)
During initialization, CBIOS has the Ethernet controller perform an internal test to verify
correct operation. If the Ethernet controller does not respond within a reasonable amount
of time, this error will be displayed.
"ethdev" indicates the PCI Slot in which the failed Ethernet device is located. This is an
ASCII value, so 0x30 indicates PCI slot 0. If the Ethernet device is located on the node
motherboard, then ethdev will have a value of 0x00.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
26 - 0x2 (ethdev)
eth0 device self test: FAIL xxxx yyyy
If the Ethernet controller fails its internal test, this error will be displayed. Since this is
an internal test, it is likely the Ethernet controller itself which has failed.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
26 - 0x3 (0)
No Ethernet devices available for loopback test
This error indicates that no Ethernet devices could be found or initialized on the node.
This is possibly the result of a hardware failure.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
26 - 0x4 (0)
No loopback connections were found. An external loopback plug is required if this
node has only one Ethernet port. A crossover cable is required if this node has more
than a single Ethernet port.
Resolution:
A) Make sure the Ethernet loopback plug is in the Ethernet connector (you should see
link status lights illuminated). In the case of a node having two Ethernet ports, make
sure a crossover cable is connected between the Ethernet ports.
B) Cycle power on the node.
C) Replace the node motherboard.
26 - 0x5 (slotid)
eth2 loopback PHY internal: FAIL
This error indicates that the internal loopback of the PHY did not correctly loop back
packets. If the device being tested is onboard the node (82559ER or 82551ER), then
this is a failure. Some plug-in PCI boards (such as 82557) do not fully support PHY
loopback. Those devices will cause the following warning:
eth2 loopback PHY internal: Unavailable
No error stop will occur in the case of a PHY not supporting internal loopback.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
26 - 0x6 (slotid)
eth0 sends to eth1 but cannot receive from it
This is an unusual error in that one Ethernet device is able to reliably receive packets
from the other, but the opposite is not true.
Resolution:
A) Run the test again. If the nodes are attached to a hub, the failure may be due to
another Ethernet node flooding the network.
B) Cycle power on the node.
C) Ensure that there is no switch between the Ethernet ports. A switch may prevent
the test from functioning properly if the MAC address of an interface is in use elsewhere
or the switch is really an IP router.
26 - 0x7 (slotid)
eth0 loopback wwwww: FAIL - receive timeout (xx seconds)
This error indicates the Ethernet device did not successfully receive the loopback pattern
sent to test the Ethernet device's transceiver. The failure to receive a loopback pattern
usually means the Ethernet device has failed.
"ethdev" indicates the PCI Slot in which the failed Ethernet device is located. This is an
ASCII value, so 0x30 indicates PCI slot 0. If the Ethernet device is located on the node
motherboard, then ethdev will have a value of 0x00.
The following are normal test results that you would expect to see. If this error occurs,
then one of the following has not happened:
eth0 loopback All zeros: PASS
eth0 loopback All ones: PASS
eth0 loopback Walking ones: PASS
eth0 loopback Walking zeros: PASS
eth0 loopback Random pattern: PASS
This error indicates that within 100 packets successfully transmitted, there were no
packets successfully received.
Resolution:
A) Cycle power on the node.
B) Unplug the network cable and run the test again. If the node is attached to a hub,
the failure may be due to another Ethernet node flooding the network. This is not very
likely.
C) If the Ethernet device is located in a PCI slot, replace the card.
D) Replace the node motherboard.
26 - 0x8 (slotid)
eth0 loopback wwwww: Packet transmit failed
This error indicates that the Ethernet device was not able to successfully transmit packets.
This is a serious failure, because the Ethernet code fails to transmit only when the
Ethernet device failed to initialize.
Resolution:
A) Use "eth reset" to reset the Ethernet device.
B) Cycle power on the node.
C) Replace the node motherboard if the failed Ethernet device is on the node.
26 - 0x9 (slotid)
eth0 loopback wwwww: FAIL - miscompare
stuck high=xxxx stuck low=yyyy toggle=zzzz
This error is displayed if one of the Ethernet tests detects a mismatch between the packet
sent and the data received. It also includes a diagnostic line which shows in
what way the data is different.
Resolution:
A) Use "eth reset" to reset the Ethernet device.
B) Cycle power on the node.
C) Replace the node motherboard if the failed Ethernet device is on the node.
26 - 0xa (slotid)
ethxxx device registers: FAIL
Onboard Ethernet device did not read valid config from EEPROM. A power cycle might
clear this failure if this is a new node.
This error indicates the Ethernet device failed to initialize properly, probably because
it read invalid content from the attached EEPROM device. If this is an onboard GigE on
the 5000P chipset (Tx00, Fx00, Vx00, Gx00), then it is likely this is the first time the
node has ever been powered on. Once the BIOS writes a configuration to the SPI
EEPROM attached to the GigE, it is necessary for the board to be power cycled before
the GigE device is usable. If the board is not new and you see this failure, then it's likely
a component on the node motherboard has failed.
Resolution:
A) Power cycle the node.
B) Replace the node motherboard if the failed Ethernet device is on the node.
27 - 0x0 (#)
Each node board has multiple temperature and voltage sensors and fan RPM sensors
which monitor the environment to ensure the temperature, voltage, and fan RPM are
within operating tolerances. This directly results in increased reliability of the product.
If a temperature or a voltage falls outside a programmed tolerance level, CBIOS will
alert the user to this condition. The sub-code displayed reflects the type of (the first) error
detected. The data value is a count of the number of temperature/voltage/fan problems
detected.
A sub-code value of 0x0 indicates a fan RPM problem.
A sub-code value of 0x1 indicates a temperature problem.
A sub-code value of 0x2 indicates a voltage problem.
This particular sub-code indicates a programmed temperature limit has been exceeded.
Resolution:
A) Cycle power on the node. If it is a temperature related problem, verify the system is
getting adequate ventilation.
B) Verify the limit settings are reasonable. Use the Whack "i2c env" command. The
Whack "i2c env defaults" command resets all defaults.
C) Verify both power supply fans are spinning freely and that the supply amber failure
light is not illuminated. If only a single supply is installed, make sure the second slot
either has a fan or is covered.
D) Replace the power supply.
E) If it's CPU temperature, verify the heatsink is conducting heat well.
F) If it's CPU voltage, try swapping out the CPU voltage regulators.
G) Replace the node motherboard.
27 - 0x1 (#)
This sub-code indicates a programmed temperature limit has been exceeded.
See Code 27, sub-code 0x0 for resolution information.
27 - 0x2 (#)
This sub-code indicates a programmed voltage limit has been exceeded.
See Code 27, sub-code 0x0 for resolution information.
27 - 0x3 (0)
This sub-code indicates a sensor interrupt test failed.
See Code 27, sub-code 0x0 for resolution information.
28 - 0x3 (mm)
This error indicates the BIOS detected the SDRAM DIMMs in the cluster memory bank
pair are of a different type. One DIMM number of the mismatched pair will be logged
in the data field of the Fatal Error.
Resolution:
Ensure both DIMMs in the pair are identical. Note that two DIMMs may have the same
capacity but have different number of rows, columns, or banks. The DIMM configuration
must exactly match. If the DIMMs have similar markings and capacity, they are probably
identical.
See Code 28, sub-code 0x1 for more resolution information.
31 - 0x0 (0)
FAIL (high)
Port (6) Bit (4) wrote 0 (0x1)
Port (7) Bit (4) read 1, expected 0 (0x3)
The Vitesse VSC055 2 Wire Backplane Controller chip controls interfaces to the
Centerplane, LEDs, Power Supplies, Nickel battery, and PCI slots. It is connected to the
I2C bus. In normal 2, 4, or 8 node centerplanes, the chip will get its ports initialized
as inputs or outputs and start monitoring peripheral systems. No tests available.
When connected to a Manufacturing Centerplane, it will have selected pins routed to
other pins for loopback testing. See the Manufacturing Centerplane Specification for
details. During this test, proper VSC operation will be confirmed.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
31 - 0x1 (0)
Failed I2C VSC055 1.ce.yy write zzzz
During initialization, the VSC055 registers are programmed for proper system operation.
This is done over the I2C bus. If an I2C operation fails during VSC055 initialization,
this error will result.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
31 - 0x3 (0)
FPGA Interrupt Test failed.
Resolution:
A) Cycle power on the node.
B) Replace the node.
31 - 0x4 (0)
NEMOE Loopback Test failed.
Resolution:
A) Cycle power on the node.
B) Replace the node.
31 - 0x5 (0)
During the "Board GPIO Test", the FPGA ID did not match the expected value.
Resolution:
A) Cycle power on the node.
B) Replace the node.
31 - 0x6 (0)
During the "Board GPIO Test", the FPGA Revision did not match the expected value.
Resolution:
A) Cycle power on the node.
B) Replace the node.
31 - 0x7 (0)
Titan specific. During the "Manufacturing Centerpanel GPIO Test", one or more tests
have failed depending upon the output.
A) If failed during 'Testing Expanders (o/p) <--> FPGA (i/p) connections:' For example,
FAIL (low)
Port (76) Bit (1) wrote (0x00)
Port (302) Bit (4) read 0xff, expected (0xef)
1) Program the I2C expander with the following command:
Whack> cb i2c 9.76.3 0
Here "76" is reported port number. "3" is config register offset for the expander. "0"
makes all expander bits as output.
2) Set the bit in I2C expander.
Whack> cb i2c 9.76.1 2
Here "1" is rdwr register offset for the expander. "2" is reported bit 1 (1 << "1") in
expander.
3) Read a byte from FPGA offset.
Whack> db fpga 302 1
Here "302" is reported FPGA offset.
Confirm if the bit "4" in read value is set.
Repeat step 2) and 3) by writing 0 to I2C expander 9.76.1 and checking if the bit "4"
in FPGA offset 0x302 is cleared.
B) If failed during 'Testing FPGA (o/p) <--> Expanders (i/p) connections:' For example,
FAIL (low)
Port (305) Bit (4) wrote (0x00)
Port (7e) Bit (7) read 0x86, expected (0x06)
1) Program the I2C expander with the following command:
Whack> cb i2c 9.7e.3 ff
Here "7e" is reported port number. "3" is config register offset for the expander. "ff"
makes all expander bits as input.
2) Write a byte to FPGA offset.
Whack> db fpga 305 10
Here "305" is reported FPGA offset. Writing 0x10 will set the bit "4" in that offset.
3) Read a byte from I2C expander.
Whack> db i2c 9.7e.0 1
Here "7e" is reported port number. "0" is read register offset for the expander. Confirm
if the bit "7" in read value is set.
Repeat step 2) and 3) by writing 0 to the FPGA offset and checking if the bit "7" in I2C
Expander 9.7e.0 is cleared.
C) For all other failure cases, refer to Section 18.2, "Manufacturing Centerplane GPIO
Test Diagnostics", of the CBIOS user guide at
http://engweb/twiki/bin/view/Main/TitanMfgCpFpgaGpioTestDiag
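The bit arithmetic used in steps A) and B) above can be sketched in Python. This is only an illustration of the masks involved, not a CBIOS or Whack tool; the helper names bit_mask and differing_bits are hypothetical and do not come from the guide.

```python
def bit_mask(bit: int) -> int:
    """Return the byte value with only the given bit (0-7) set."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in the range 0-7")
    return 1 << bit


def differing_bits(read: int, expected: int) -> list:
    """List the bit positions where the byte read differs from the expected byte."""
    diff = (read ^ expected) & 0xFF
    return [bit for bit in range(8) if diff & bit_mask(bit)]


# Example A: reported bit 1 is written as value 2; bit 4 of the FPGA byte is wrong.
print(hex(bit_mask(1)))            # 0x2
print(differing_bits(0xFF, 0xEF))  # [4]

# Example B: writing 0x10 sets bit 4; bit 7 of the expander byte is wrong.
print(hex(bit_mask(4)))            # 0x10
print(differing_bits(0x86, 0x06))  # [7]
```

The XOR of the read and expected values isolates the bit that the test output reports in its "Bit (n)" field.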
38 - 0x0 (data)
Power Supply xx indicates invalid battery configuration: y batteries
Verify battery connection and individual battery units.
The software supports a maximum of 3 batteries in a string. Any greater number results
in this non-fatal error.
The data value may be decoded to determine which power supply and the battery
count. The high 8 bits are a bitmask of the power supply. The lower 16 bits are the
number of batteries counted. Thus, a data value of 100000c indicates PS1 had a battery
count of 12. A data value of 4 indicates PS0 had a battery count of 4.
Resolution:
A) Verify no more than 3 batteries in a string are connected to any one power supply.
B) Cycle power on the node.
C) Remove batteries one at a time to determine if there is a faulty connection or battery.
Replace the faulty cable or battery.
D) Replace the power supply.
E) Replace the node motherboard.
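The data value for code 38, sub-code 0x0 can be split into its two fields with a few shifts and masks. The sketch below assumes the data value is a 32-bit hexadecimal number, which is consistent with the two examples in the description; the function name decode_battery_config is hypothetical.

```python
def decode_battery_config(data: int) -> tuple:
    """Split a 38 - 0x0 data value into (power supply field, battery count)."""
    power_supply = (data >> 24) & 0xFF  # high 8 bits of an assumed 32-bit value
    battery_count = data & 0xFFFF       # low 16 bits
    return power_supply, battery_count


print(decode_battery_config(0x100000C))  # (1, 12)
print(decode_battery_config(0x4))        # (0, 4)
```

In both documented examples the power supply field reads 0 for PS0 and 1 for PS1.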
38 - 0x1 (0)
RTC / NVRAM Battery Failure - Replace battery.
The RTC / NVRAM battery was found to have a low voltage by the built-in monitoring
circuit of the RTC (TOD clock).
Resolution:
A) Replace the lithium-ion cell battery on the node.
B) Replace the node motherboard.
38 - 0x3 (data)
No batteries present on power supply xx
This error indicates no batteries were found on a node power supply.
This warning may be enabled by setting "warn_nobat" in NVRAM.
The data value may be decoded to determine which power supply triggered this error.
The high 8 bits are a bitmask of the power supply. Thus, a data value of 0 indicates
PS0. A data value of 1000000 indicates PS1.
Resolution:
A) Verify there is at least one battery connected.
B) Cycle power on the node.
C) Exchange cables and batteries.
D) Replace the power supply.
E) Replace the node motherboard.
38 - 0x4 (data)
Power supply missing: node power configuration is not redundant
This error indicates one of the two power supplies for a node is not present. This warning
may be enabled by setting "warn_ps" in NVRAM.
The data value may be decoded to determine which power supply triggered this error.
The high 8 bits are a bitmask of the power supply. Thus, a data value of 0 indicates
PS0 is not present. A data value of 1000000 indicates PS1 is not present.
Resolution:
A) Verify both power supplies are present and powered on.
B) Power off the missing supply, remove it, and reinsert it in the chassis.
C) Replace the power supply.
D) Replace the node motherboard.
38 - 0x5 (0)
Battery failure on Power Supply
This error indicates that a battery on the power supply has reported a hardware error.
The status light on the back of the failed battery will be amber.
Resolution:
A) Verify both power supplies are present and powered on. Verify batteries are present
and powered on.
B) Power off the failed battery, remove the cable, and reinsert it in the Power Supply.
Turn it back on. If that does not reset the FAILED condition, replace the battery.
C) Replace the power supply.
D) Replace the node motherboard.
38 - 0x6 (data)
Powering off PSxx because it is on battery power. This will shut down the node until
AC is restored.
This message indicates that a power supply lost input AC power and that the BIOS
powered down the node to avoid draining the battery.
The data value may be decoded to determine which power supply triggered this error.
The low 2 bits are a bitmask of the DC power status. Bit 0 represents power supply 0
and bit 1 represents power supply 1. If a bit is 1, the DC output from the corresponding
power supply was good when the system shut down.
Resolution:
A) Apply AC power to the node.
B) Replace the power supply.
38 - 0x7 (data)
Power supply xx failure: Fan Bad
or
Power supply xx failure: Fan 0 Bad
or
Power supply xx failure: Fan 1 Bad
This error indicates there is a hardware problem in one of the node power supplies.
One or more of the fans may have failed.
The data value may be decoded to determine which power supply (and fan) triggered
this error. The low 2 bits are a bitmask of the fan status for Power Supply 0. The next
2 bits are a bitmask of the fan status for Power Supply 1. Thus:
1: PS0 had a Fan0 failure
2: PS0 had a Fan1 failure
3: PS0 had a double fan failure
4: PS1 had a Fan0 failure
8: PS1 had a Fan1 failure
c: PS1 had a double fan failure
Resolution:
A) Replace the power supply.
B) Replace the node motherboard.
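The fan bitmask described for code 38, sub-code 0x7 can be decoded as shown in this Python sketch. It assumes only the layout stated in the description (two fan bits per power supply, Power Supply 0 in the lowest two bits); the function name decode_fan_failures is hypothetical.

```python
def decode_fan_failures(data: int) -> list:
    """Return the failed fans encoded in a 38 - 0x7 data value."""
    failures = []
    for ps in (0, 1):
        fan_bits = (data >> (2 * ps)) & 0x3  # two status bits per power supply
        for fan in (0, 1):
            if fan_bits & (1 << fan):
                failures.append(f"PS{ps} Fan{fan}")
    return failures


print(decode_fan_failures(0x1))  # ['PS0 Fan0']
print(decode_fan_failures(0x8))  # ['PS1 Fan1']
print(decode_fan_failures(0xC))  # ['PS1 Fan0', 'PS1 Fan1']
```

A result listing both fans of one power supply corresponds to the "double fan failure" values 3 and c.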
38 - 0x8 (data)
Power supply xx failure: Charger Overload
This error indicates there is a hardware problem in one of the node power supplies,
specifically that the charger cannot handle the battery charge current draw. If you need
to override this error so the node continues, you can set "ignore_chargefail" in NVRAM.
The data value may be decoded to determine which power supply triggered this error.
The low 2 bits are a bitmask of the charger status for the two power supplies. Thus, a
value of 1 indicates PS0 had a charger overload. A value of 2 indicates PS1 had a
charger overload. A value of 3 indicates PS0 and PS1 both had a charger overload.
Resolution:
A) Check battery connection.
B) Exchange cables and batteries.
C) Replace the power supply.
D) Replace the node motherboard.
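Several sub-codes of code 38 (for example 0x8, 0x9, 0xa, and 0xf) describe the low 2 bits of the data value as a bitmask with one bit per power supply. A minimal Python sketch of that decoding follows; the function name decode_ps_mask is hypothetical.

```python
def decode_ps_mask(data: int) -> list:
    """Return the power supplies flagged in the low 2 bits of a data value."""
    return [f"PS{ps}" for ps in (0, 1) if data & (1 << ps)]


print(decode_ps_mask(0x1))  # ['PS0']
print(decode_ps_mask(0x2))  # ['PS1']
print(decode_ps_mask(0x3))  # ['PS0', 'PS1']
```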
38 - 0x9 (data)
Power supply xx failure: DC Output Bad
This error indicates there is a hardware problem in one of the node power supplies. If
this failure is transient, it could also be caused by turning the power supply off and then
on or by a quick AC loss followed by AC being restored. If both power supplies fail
simultaneously (not likely), this is a fatal error.
The data value may be decoded to determine which power supply triggered this error.
The low 2 bits are a bitmask of the DC Output status for the two power supplies.
A value of 1 indicates PS0 had a DC Output Bad.
A value of 2 indicates PS1 had a DC Output Bad.
Resolution:
A) Ensure a service operation was not taking place at the time, and that AC had not
also failed.
B) Replace the power supply.
C) Replace the node motherboard.
38 - 0xa (data)
Power supply xx failure: AC Input Bad
This error indicates that AC input power is not being supplied to one or more power
supplies. The likely cause is either a real AC Failure or that the power supply has been
switched to the off position. In the case of an AC Failure, the power supply will be
automatically shut down to preserve batteries (if "ignore_acfail" is set then the power
supply will not be shut down).
The lower 2 bits of the data value may be decoded to determine which power supply
lost AC power. A value of 1 indicates PS0. A value of 2 indicates PS1. A value of 3
indicates both power supplies lost AC power.
Resolution:
A) Verify AC power is present and the power supply switch is turned on.
B) Check the Power Distribution Unit (PDU) breaker.
C) Replace the power supply.
D) Replace the node motherboard.
38 - 0xb (0)
**** Power Supplies mismatch ****
Power Supply 0: I2C accessible
Power Supply 1: I2C inaccessible
This error indicates one of the power supplies is a new style (I2C interface) and the
other power supply is not responding using I2C, but has been detected as present. This
is not a supported configuration. If you need to override this error, set "ignore_psdiff"
in NVRAM.
Resolution:
A) Pull and reinsert the inaccessible power supply.
B) Check the Power Distribution Unit (PDU) breaker for the inaccessible power supply.
C) Replace the power supply.
D) Replace the node motherboard.
38 - 0xc (data)
This error indicates Power Supply 0 reported a limit was exceeded while performing
the power supply status test. Each power supply has integrated monitors for temperature,
voltage, and current draw. The BIOS reads these sensors as part of initialization to
determine if the power supply is operating within specifications.
The data value may be decoded to determine the particular cause of the limit failure.
Each bit represents a unique sensor. Data values may be decoded as follows:
00000001 - Temperature
00000004 - 3.3V
00000008 - 3.3V Current
00000010 - 5V
00000020 - 5V Current
00000040 - 12V
00000080 - 12V Current
00000100 - 24V
00000200 - 24V Current
00000400 - 48V
00000800 - 48V Current
00001000 - Bat0 48V
00002000 - Bat1 48V
00004000 - Bat2 48V
00008000 - Bat0 12V
00010000 - Undefined ... to ...
00400000 - Undefined
00800000 - Battery LED is Amber
01000000 - Battery Relay is Off
02000000 - PS LED is Amber
04000000 - Fan Fail
08000000 - DC Fail
10000000 - AC Fail
20000000 - Power Supply is Disabled
40000000 - Power Supply Switch is Off
80000000 - Low Limit exceeded (combined with bits above)
Resolution: Contact 3PAR technical support.
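The bit table for code 38, sub-codes 0xc and 0xd can be applied mechanically, as in the Python sketch below. The bit names are copied from the table above. The function name decode_ps_status and the example data value 14000000 are hypothetical and chosen only for illustration. Any bit that is not in the table is reported as undefined.

```python
# Bit assignments copied from the table for code 38, sub-codes 0xc and 0xd.
PS_STATUS_BITS = {
    0x00000001: "Temperature",
    0x00000004: "3.3V",
    0x00000008: "3.3V Current",
    0x00000010: "5V",
    0x00000020: "5V Current",
    0x00000040: "12V",
    0x00000080: "12V Current",
    0x00000100: "24V",
    0x00000200: "24V Current",
    0x00000400: "48V",
    0x00000800: "48V Current",
    0x00001000: "Bat0 48V",
    0x00002000: "Bat1 48V",
    0x00004000: "Bat2 48V",
    0x00008000: "Bat0 12V",
    0x00800000: "Battery LED is Amber",
    0x01000000: "Battery Relay is Off",
    0x02000000: "PS LED is Amber",
    0x04000000: "Fan Fail",
    0x08000000: "DC Fail",
    0x10000000: "AC Fail",
    0x20000000: "Power Supply is Disabled",
    0x40000000: "Power Supply Switch is Off",
    0x80000000: "Low Limit exceeded (combined with bits above)",
}


def decode_ps_status(data: int) -> list:
    """Return the names of all bits set in a 32-bit power supply status value."""
    names = []
    for bit in range(32):
        mask = 1 << bit
        if data & mask:
            names.append(PS_STATUS_BITS.get(mask, f"Undefined (0x{mask:08x})"))
    return names


# Hypothetical data value, used only to show the decoding.
print(decode_ps_status(0x14000000))  # ['Fan Fail', 'AC Fail']
```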
38 - 0xd (data)
This error indicates Power Supply 1 reported a limit was exceeded while performing
the power supply status test.
See Code 38, sub-code 0xc for resolution information.
38 - 0xe (data)
Each newer generation (Magnetek) power supply and battery has an I2C interface
which allows the node to acquire power supply internal temperature, voltages, and
current loads. The BIOS will verify these readings are within acceptable limits as part
of normal initialization.
This failure code indicates a limit has been exceeded on a battery attached to a power
supply on the node. The data value may be decoded to determine which power supply
and battery. The lower 2 bits are a bitmask of the power supply. The upper 16 bits are
a bitmask of the failing battery. Thus, a data value of 10002 indicates PS1 Bat0 has
exceeded a limit. A data value of 40001 indicates PS0 Bat2 has exceeded a limit.
Resolution:
A) Check battery expiration date and replace as necessary.
B) Power cycle the failing battery.
C) Replace battery cable.
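The data value for code 38, sub-code 0xe combines a power supply bitmask (low 2 bits) with a battery bitmask (upper 16 bits). The Python sketch below decodes both fields and reproduces the two examples from the description. It assumes a 32-bit data value; the function name decode_battery_limit is hypothetical.

```python
def decode_battery_limit(data: int) -> tuple:
    """Return (power supplies, batteries) flagged in a 38 - 0xe data value."""
    supplies = [f"PS{ps}" for ps in (0, 1) if data & (1 << ps)]
    battery_mask = (data >> 16) & 0xFFFF  # upper 16 bits of an assumed 32-bit value
    batteries = [f"Bat{bat}" for bat in range(16) if battery_mask & (1 << bat)]
    return supplies, batteries


print(decode_battery_limit(0x10002))  # (['PS1'], ['Bat0'])
print(decode_battery_limit(0x40001))  # (['PS0'], ['Bat2'])
```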
38 - 0xf (data)
I2C errors prevented completion of the power test.
Each newer generation (Magnetek) power supply and battery has an I2C interface
which allows the node to acquire power supply status. This failure code indicates the
BIOS was unable to read one of the Power Supply or battery status registers.
The lower 2 bits of the data value may be decoded to determine which power supply
failed. A value of 1 indicates PS0. A value of 2 indicates PS1. A value of 3 indicates
both power supplies failed.
Resolution:
A) Power cycle the indicated power supply.
B) Replace power supply.
C) Replace all batteries attached to the power supply.
D) Replace the node motherboard.
38 - 0x10 (data)
PSwwww Batxxxx Switch Off
This failure code indicates a battery has its power switch in the off position, and is thus
unable to supply backup power to the node in the case of AC Failure. The data value
may be decoded to determine which power supply and battery. See Code 38, sub-code
0xe for decoding information.
Resolution:
A) Turn battery on.
B) Power cycle the indicated battery.
C) Replace battery cable.
D) Replace power supply.
38 - 0x13 (data)
Turning off BBU because the node is on battery power. This will shut down the node
until AC is restored.
This message indicates that all power supplies lost input AC power and that the BIOS
powered down the node to avoid draining the battery.
The data value provides a mask of power supplies which have good AC input but failed
DC output.
Resolution:
A) Apply AC power to the node.
B) Replace the power supplies.
40 - 0x1 (0)
During CPU SMI initialization, the queue facility to send messages between the BIOS
and TPD is tested. If there is a problem triggering an SMI, or some other error which
causes message corruption, this error will result. This error is recoverable because the
OS can still come up and function at a degraded level even if the communication
between the OS and BIOS is not functioning.
Resolution:
A) View prom log to see if this is repeatable. If not, ignore a single occurrence.
B) Cycle power on the node.
C) Replace the bootstrap CPU.
D) Replace the node motherboard.
44 - 0x00 (xx)
The VSC055 reports tachometer inputs for both node fans, 0 and 1. This is a single
node fan failure which requires the fan to be replaced.
Resolution:
A) Cycle power on the node.
B) Replace the node fan.
49 - 0x17 (0)
No USB device was found.
Resolution: Install or replace the USB Flash drive.
51 - 0x2 (Data)
During Harrier initialization, the CMA BIST test failed for a reason other than the test
itself (for example, an I2C I/O error). This error code indicates that the BIST test itself
has not failed, but that an error occurred during book-keeping (PROM0 read/write) or
that the test was not performed at all because a Harrier register could not be read.
(Data = 0x2f)
Resolution: Monitor and replace the node if the issue recurs. If the node is replaced,
note for OPS that they should verify I2C to the node PROM is functional.
52 - 0x0
One or more bits in the CPU's two General Power Management registers were set due to
an abnormal power reset. The set bits are printed to describe the cause.
Resolution: Contact engineering with data.
58 - 0x0
CBIOS failed to obtain the ME firmware flash unlock code through the HECI interface.
This could prevent flash commands from functioning.
Resolution: Try rebooting the node.
7 Support and Other Resources
Contacting HP
For worldwide technical support information, see the HP support website:
http://www.hp.com/support
Before contacting HP, collect the following information:
• Product model names and numbers
• Technical support registration number (if applicable)
• Product serial numbers
• Error messages
• Operating system type and revision level
• Detailed questions
Specify the type of support you are requesting:
HP 3PAR storage system | Support request
HP 3PAR StoreServ 7200, 7400, and 7450 Storage systems | StoreServ 7000 Storage
HP 3PAR StoreServ 10000 Storage systems; HP 3PAR T-Class storage systems; HP 3PAR F-Class storage systems | 3PAR or 3PAR Storage
HP 3PAR documentation
For information about | See
Supported hardware and software platforms | The Single Point of Connectivity Knowledge for HP Storage Products (SPOCK) website: http://www.hp.com/storage/spock
Locating HP 3PAR documents | The HP 3PAR StoreServ Storage site: http://www.hp.com/go/3par. To access HP 3PAR documents, click the Support link for your product.
HP 3PAR storage system software
Storage concepts and terminology | HP 3PAR StoreServ Storage Concepts Guide
Using the HP 3PAR Management Console (GUI) to configure and administer HP 3PAR storage systems | HP 3PAR Management Console User's Guide
Using the HP 3PAR CLI to configure and administer storage systems | HP 3PAR Command Line Interface Administrator’s Manual
CLI commands | HP 3PAR Command Line Interface Reference
Analyzing system performance | HP 3PAR System Reporter Software User's Guide
Installing and maintaining the Host Explorer agent in order to manage host configuration and connectivity information | HP 3PAR Host Explorer User’s Guide
Creating applications compliant with the Common Information Model (CIM) to manage HP 3PAR storage systems | HP 3PAR CIM API Programming Reference
Migrating data from one HP 3PAR storage system to another | HP 3PAR-to-3PAR Storage Peer Motion Guide
Configuring the Secure Service Custodian server in order to monitor and control HP 3PAR storage systems | HP 3PAR Secure Service Custodian Configuration Utility Reference
Using the CLI to configure and manage HP 3PAR Remote Copy | HP 3PAR Remote Copy Software User’s Guide
Updating HP 3PAR operating systems | HP 3PAR Upgrade Pre-Planning Guide
Identifying storage system components, troubleshooting information, and detailed alert information | HP 3PAR F-Class, T-Class, and StoreServ 10000 Storage Troubleshooting Guide
Installing, configuring, and maintaining the HP 3PAR Policy Server | HP 3PAR Policy Server Installation and Setup Guide; HP 3PAR Policy Server Administration Guide
Planning for HP 3PAR storage system setup
Hardware specifications, installation considerations, power requirements, networking options, and cabling information
for HP 3PAR storage systems
HP 3PAR 7200, 7400, and 7450 storage systems | HP 3PAR StoreServ 7000 Storage Site Planning Manual; HP 3PAR StoreServ 7450 Storage Site Planning Manual
HP 3PAR 10000 storage systems | HP 3PAR StoreServ 10000 Storage Physical Planning Manual; HP 3PAR StoreServ 10000 Storage Third-Party Rack Physical Planning Manual
Installing and maintaining HP 3PAR 7200, 7400, and 7450 storage systems
Installing 7200, 7400, and 7450 storage systems and initializing the Service Processor | HP 3PAR StoreServ 7000 Storage Installation Guide; HP 3PAR StoreServ 7450 Storage Installation Guide; HP 3PAR StoreServ 7000 Storage SmartStart Software User’s Guide
Maintaining, servicing, and upgrading 7200, 7400, and 7450 storage systems | HP 3PAR StoreServ 7000 Storage Service Guide; HP 3PAR StoreServ 7450 Storage Service Guide
Troubleshooting 7200, 7400, and 7450 storage systems | HP 3PAR StoreServ 7000 Storage Troubleshooting Guide; HP 3PAR StoreServ 7450 Storage Troubleshooting Guide
Maintaining the Service Processor | HP 3PAR Service Processor Software User Guide; HP 3PAR Service Processor Onsite Customer Care (SPOCC) User's Guide
HP 3PAR host application solutions
Backing up Oracle databases and using backups for disaster recovery | HP 3PAR Recovery Manager Software for Oracle User's Guide
Backing up Exchange databases and using backups for disaster recovery | HP 3PAR Recovery Manager Software for Microsoft Exchange 2007 and 2010 User's Guide
Backing up SQL databases and using backups for disaster recovery | HP 3PAR Recovery Manager Software for Microsoft SQL Server User’s Guide
Backing up VMware databases and using backups for disaster recovery | HP 3PAR Management Plug-in and Recovery Manager Software for VMware vSphere User's Guide
Installing and using the HP 3PAR VSS (Volume Shadow Copy Service) Provider software for Microsoft Windows | HP 3PAR VSS Provider Software for Microsoft Windows User's Guide
Best practices for setting up the Storage Replication Adapter for VMware vCenter | HP 3PAR Storage Replication Adapter for VMware vCenter Site Recovery Manager Implementation Guide
Troubleshooting the Storage Replication Adapter for VMware vCenter Site Recovery Manager | HP 3PAR Storage Replication Adapter for VMware vCenter Site Recovery Manager Troubleshooting Guide
Installing and using vSphere Storage APIs for Array Integration (VAAI) plug-in software for VMware vSphere | HP 3PAR VAAI Plug-in Software for VMware vSphere User's Guide
Typographic conventions
Table 20 Document conventions
Convention | Element
Bold text | Keys that you press; text you typed into a GUI element, such as a text box; GUI elements that you click or select, such as menu items, buttons, and so on
Monospace text | File and directory names; system output; code; commands, their arguments, and argument values
<Monospace text in angle brackets> | Code variables; command variables
Bold monospace text | Commands you enter into a command line interface; system output emphasized for scannability
WARNING! Indicates that failure to follow directions could result in bodily harm or death, or in
irreversible damage to data or to the operating system.
CAUTION: Indicates that failure to follow directions could result in damage to equipment or data.
NOTE: Provides additional information.
Required
Indicates that a procedure must be followed as directed in order to achieve a functional and
supported implementation based on testing at HP.
HP 3PAR branding information
• The server previously referred to as the "InServ" is now referred to as the "HP 3PAR StoreServ
Storage system."
• The operating system previously referred to as the "InForm OS" is now referred to as the "HP
3PAR OS."
• The user interface previously referred to as the "InForm Management Console (IMC)" is now
referred to as the "HP 3PAR Management Console."
• All products previously referred to as “3PAR” products are now referred to as "HP 3PAR"
products.
8 Documentation feedback
HP is committed to providing documentation that meets your needs. To help us improve the
documentation, send any errors, suggestions, or comments to Documentation Feedback
(docsfeedback@hp.com). Include the document title and part number, version number, or the URL
when submitting your feedback.
  • 1. HP 3PAR StoreServ 7000 Storage Troubleshooting Guide Service Edition Abstract This guide is intended for experienced users and system administrators troubleshooting HP 3PAR StoreServ 7000 Storage systems and have a firm understanding of RAID schemes. HP Part Number: QR482-96619 Published: March 2014
  • 2. © Copyright 2014 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Acknowledgments Microsoft® and Windows® are U.S. registered trademarks of Microsoft Corporation. Warranty To obtain a copy of the warranty for this product, see the warranty information website: http://www.hp.com/go/storagewarranty
  • 3. Contents 1 Identifying Storage System Components........................................................7 Understanding Component Numbering.......................................................................................7 Drive Enclosures...................................................................................................................7 Controller Nodes.................................................................................................................8 PCIe Slots and Ports.............................................................................................................9 I/O Modules ....................................................................................................................10 Power Cooling Modules......................................................................................................10 Power Distribution Units......................................................................................................11 Service Processor...............................................................................................................11 2 Understanding LED Indicator Status.............................................................12 Enclosure LEDs.......................................................................................................................12 Bezel LEDs........................................................................................................................12 Disk Drive LEDs..................................................................................................................13 Storage System Component LEDs..............................................................................................13 PCM LEDs.........................................................................................................................13 Drive PCM 
LEDs.................................................................................................................15 I/O Module LEDs..............................................................................................................16 External Port Activity LEDs...................................................................................................17 Controller Node and Internal Component LEDs...........................................................................18 Ethernet LEDs....................................................................................................................18 FC Port LEDs......................................................................................................................19 SAS Port LEDs....................................................................................................................20 Interconnect Port LEDs.........................................................................................................20 Fibre Channel Adapter Port LEDs..........................................................................................21 Converged Network Adapter Port LEDs.................................................................................21 Service Processor LEDs............................................................................................................22 3 Powering Off/On the Storage System..........................................................24 Powering Off the Storage System..............................................................................................24 Powering On the Storage System..............................................................................................24 4 Alerts......................................................................................................26 Getting Recommended Actions.................................................................................................26 5 
Troubleshooting........................................................................................28 checkhealth Command............................................................................................................28 Using the checkhealth Command.........................................................................................28 Troubleshooting Storage System Components.............................................................................31 Alert................................................................................................................................32 Format of Possible Alert Exception Messages.....................................................................32 Alert Example...............................................................................................................32 Alert Suggested Action..................................................................................................33 Cabling............................................................................................................................33 Format of Possible Cabling Exception Messages................................................................33 Cabling Example 1.......................................................................................................34 Cabling Suggested Action 1...........................................................................................34 Cabling Example 2.......................................................................................................34 Cabling Suggested Action 2...........................................................................................34 Cage...............................................................................................................................35 Format of Possible Cage Exception Messages...................................................................35 Cage Example 
1...........................................................................................................35 Cage Suggested Action 1..............................................................................................35
  • 4. Cage Example 2...........................................................................................................37 Cage Suggested Action 2..............................................................................................37 Cage Example 3...........................................................................................................37 Cage Suggested Action 3..............................................................................................37 Cage Example 4...........................................................................................................38 Cage Suggested Action 4..............................................................................................38 Cage Example 5...........................................................................................................39 Cage Suggested Action 5..............................................................................................39 Consistency.......................................................................................................................40 Format of Possible Consistency Exception Messages...........................................................40 Consistency Example.....................................................................................................40 Consistency Suggested Action.........................................................................................40 Data Encryption (DAR)........................................................................................................40 Format of Possible DAR Exception Messages.....................................................................41 DAR Suggested Action...................................................................................................41 DAR Example 2............................................................................................................41 DAR Suggested Action 
2................................................................................................41 Date................................................................................................................................41 Format of Possible Date Exception Messages.....................................................................41 Date Example...............................................................................................................41 Date Suggested Action..................................................................................................41 File..................................................................................................................................42 File Format of Possible Exception Messages......................................................................42 File Example 1..............................................................................................................42 File Suggested Action 1.................................................................................................42 File Example 2..............................................................................................................42 File Suggested Action 2.................................................................................................42 File Example 3..............................................................................................................43 File Suggested Action 3.................................................................................................43 LD....................................................................................................................................43 Format of Possible LD Exception Messages........................................................................43 LD Example 1...............................................................................................................44 LD Suggested Action 
1...................................................................................................44 LD Example 2...............................................................................................................44 LD Suggested Action 2...................................................................................................44 LD Example 3...............................................................................................................45 LD Suggested Action 3...................................................................................................45 LD Example 4...............................................................................................................45 LD Suggested Action 4...................................................................................................45 License.............................................................................................................................46 Format of Possible License Exception Messages.................................................................46 License Example............................................................................................................46 License Suggested Action...............................................................................................46 Network...........................................................................................................................46 Format of Possible Network Exception Messages...............................................................46 Network Example 1......................................................................................................46 Network Suggested Action 1..........................................................................................47 Network Example 2......................................................................................................47 Network Suggested Action 
2..........................................................................................47 Node...............................................................................................................................47 Format of Possible Node Exception Messages...................................................................48 Node Suggested Action.................................................................................................48 Node Example 1..........................................................................................................48 Node Suggested Action 1..............................................................................................48 Node Example 2..........................................................................................................49 4 Contents
  • 5. Node Suggested Action 2..............................................................................................49 Node Example 3..........................................................................................................50 Node Suggested Action 3..............................................................................................50 Example Node 4..........................................................................................................50 Suggested Action Node 4..............................................................................................50 PD...................................................................................................................................51 Format of Possible PD Exception Messages.......................................................................51 PD Example 1...............................................................................................................51 PD Suggested Action 1..................................................................................................51 PD Example 2...............................................................................................................52 PD Suggested Action 2..................................................................................................52 PD Example 3...............................................................................................................54 PD Suggested Action 3..................................................................................................54 PD Example 4...............................................................................................................54 PD Suggested Action 4..................................................................................................54 PD Example 5...............................................................................................................55 PD Suggested 
Action 5..................................................................................................55 PD Example 6...............................................................................................................55 PD Suggested Action 6..................................................................................................55 PDCH...............................................................................................................................55 Format of Possible PDCH Exception Messages...................................................................55 PDCH Example 1..........................................................................................................55 Suggested PDCH Action 1..............................................................................................56 PDCH Example 2..........................................................................................................56 PDCH Suggested Action 2..............................................................................................56 Port..................................................................................................................................57 Format of Possible Port Exception Messages......................................................................57 Port Suggested Actions...................................................................................................57 Port Example 1.............................................................................................................57 Port Suggested Action 1.................................................................................................58 Port Example 2.............................................................................................................59 Port Suggested Action 2.................................................................................................59 Port Example 
3.............................................................................................................59 Port Suggested Action 3.................................................................................................59 Port Example 4.............................................................................................................59 Port Suggested Action 4.................................................................................................59 Port Example 5.............................................................................................................60 Port Suggested Action 5.................................................................................................60 Port Example 6.............................................................................................................60 Port Suggested Action 6.................................................................................................60 Port Example 7.............................................................................................................60 Port Suggested Action 7.................................................................................................61 Port CRC...........................................................................................................................61 Format of Possible Port CRC Exception Messages...............................................................61 Port CRC Example.........................................................................................................61 Port PELCRC......................................................................................................................61 Format of Possible PELCRC Exception Messages................................................................61 Port PELCRC Example.....................................................................................................61 
RC...................................................................................................................................62 Format of Possible RC Exception Messages.......................................................................62 RC Example.................................................................................................................62 RC Suggested Action.....................................................................................................62 SNMP..............................................................................................................................62 Format of Possible SNMP Exception Messages..................................................................62 SNMP Example............................................................................................................62 Contents 5
  • 6. SNMP Suggested Action................................................................................................62 SP....................................................................................................................................62 Format of Possible SP Exception Messages........................................................................63 SP Example..................................................................................................................63 SP Suggested Action......................................................................................................63 Task.................................................................................................................................63 Format of Possible Task Exception Messages.....................................................................63 Task Example...............................................................................................................63 Task Suggested Action...................................................................................................63 VLUN...............................................................................................................................64 Format of Possible VLUN Exception Messages...................................................................64 VLUN Example.............................................................................................................64 VLUN Suggested Action.................................................................................................64 VV...................................................................................................................................64 Format of Possible VV Exception Messages.......................................................................65 VV Suggested Action.....................................................................................................65 
Troubleshooting Storage System Setup.......................................................................................65 Storage System Setup Wizard Errors.....................................................................................65 Collecting SmartStart Log Files.............................................................................................71 Collecting Service Processor Log Files...................................................................................71 Contacting HP Support about System Setup...........................................................................72 6 CBIOS Error Codes...................................................................................73 LED Blink Codes.....................................................................................................................73 InForm OS Failed Error Codes and Resolution.............................................................................73 Failed Alerts......................................................................................................................73 CBIOS Degraded Error Codes and Resolution..........................................................................118 Degraded Alerts..............................................................................................................118 7 Support and Other Resources...................................................................133 Contacting HP......................................................................................................................133 HP 3PAR documentation........................................................................................................133 Typographic conventions.......................................................................................................136 HP 3PAR branding information...............................................................................................136 8 Documentation 
feedback.........................................................................137
  • 7. 1 Identifying Storage System Components
NOTE: The illustrations in this chapter are examples only and may not reflect your storage system configuration.
Understanding Component Numbering
Due to the large number of possible configurations, component placement and internal cabling are standardized to simplify installation and maintenance. System components are placed in the rack according to the principles outlined in this chapter, and are numbered according to their order and location in the cabinet.
The storage system includes the following types of drive and node enclosures:
• The HP M6710 Drive Enclosure (2U24) holds up to 24, 2.5 inch small form factor (SFF) Serial Attached SCSI (SAS) disk drives arranged vertically in a single row on the front of the enclosure. Two 580 W power cooling modules (PCMs) and two I/O modules are located at the rear of the enclosure.
• The HP M6720 Drive Enclosure (4U24) holds up to 24, 3.5 inch large form factor (LFF) SAS disk drives, arranged horizontally in four columns of six disk drives on the front of the enclosure. Two 580 W PCMs and two I/O modules are located at the rear of the enclosure.
• The HP 3PAR StoreServ 7200 and 7400 (two-node configuration) storage enclosures hold up to 24, 2.5 inch SFF SAS disk drives arranged horizontally in a single row on the front of the enclosure. Two 764 W PCMs and two controller nodes are located at the rear of the enclosure.
NOTE: In the HP 3PAR Management Console or CLI, the enclosures are displayed as DCS2 for 2U24 (M6710), DCS1 for 4U24 (M6720), and DCN1 for a node enclosure.
Drive Enclosures
The maximum number of supported drive enclosures depends on the model and the number of nodes.
Disk Drive Numbering
The disk drives are mounted on drive carriers and reside at the front of the enclosures. There are two types of disk drives, each with its own type of drive carrier:
• Vertical, 2.5 inch SFF disks. The 2U24 enclosure numbering starts with 0 on the left and ends with 23 on the right. See Figure 1 (page 8).
• Horizontal, 3.5 inch LFF disks. The 4U24 enclosure drives are numbered from 0 at the lower left to 23 at the upper right, in six rows of four. See Figure 2 (page 8).
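The enclosure codes and drive numbering described above can be captured in a small lookup. This is only an illustrative sketch: the helper names (describe_enclosure, is_valid_drive_number) are hypothetical and are not part of any HP tool. The codes and the 0 to 23 range come from the text above.

```python
# Enclosure codes as displayed in the HP 3PAR Management Console or CLI,
# taken from the NOTE in this chapter. The descriptions are paraphrased.
ENCLOSURE_TYPES = {
    "DCS2": "HP M6710 Drive Enclosure (2U24), 2.5 inch SFF drives",
    "DCS1": "HP M6720 Drive Enclosure (4U24), 3.5 inch LFF drives",
    "DCN1": "Node enclosure, 2.5 inch SFF drives",
}

# Every enclosure type above holds up to 24 drives, numbered 0 to 23.
DRIVES_PER_ENCLOSURE = 24


def describe_enclosure(code: str) -> str:
    """Return a description for an enclosure code (hypothetical helper)."""
    try:
        return ENCLOSURE_TYPES[code.upper()]
    except KeyError:
        raise ValueError(f"unknown enclosure code: {code!r}") from None


def is_valid_drive_number(number: int) -> bool:
    """Check that a drive number falls in the documented 0 to 23 range."""
    return 0 <= number < DRIVES_PER_ENCLOSURE


if __name__ == "__main__":
    print(describe_enclosure("DCS1"))
    print(is_valid_drive_number(23))
```

The sketch deliberately does not map a drive number to a physical row and column in the 4U24 enclosure, because the text only states the corner positions (0 at the lower left, 23 at the upper right); use Figure 2 for exact positions.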
  • 8. Figure 1 HP M6710 Drive Enclosure (2U24)
Figure 2 HP M6720 Drive Enclosure (4U24)
Controller Nodes
The controller node caches and manages data in a system, providing a comprehensive, virtualized view of the system. The controller nodes are located at the rear of the node enclosure.
The HP 3PAR StoreServ 7200 Storage system contains two nodes numbered 0 and 1 (see Figure 3 (page 8)). The HP 3PAR StoreServ 7400 Storage system has either two nodes or four nodes. The four-node configuration is numbered 0 and 1 on the bottom, and 2 and 3 on the top (see Figure 4 (page 9)).
Figure 3 HP 3PAR StoreServ 7200 Storage Numbering
  • 9. Figure 4 HP 3PAR StoreServ Four-node Configuration Storage Numbering
PCIe Slots and Ports
Table 1 (page 9) describes the default port configurations for the HP 3PAR StoreServ 7000 Storage systems.
Table 1 Storage System Expansion Cards
Expansion cards | Nodes 0 and 1 | Nodes 2 and 3
2 FC HBAs only | 1 FC HBA each | No expansion card
2 10 Gb/s CNAs only | 1 10 Gb/s CNA each | No expansion card
2 FC HBAs + 2 10 Gb/s CNAs | 1 FC HBA each | 1 10 Gb/s CNA each
You can have either a 10 Gb/s Converged Network Adapter (CNA) or a Fibre Channel (FC) card in the expansion slots of all nodes, or a combination of the two in a four-node system (for example, two 10 Gb/s CNAs and two FC HBAs). Each node enclosure must have matching PCIe cards. The following figure shows the location of the controller node ports (see Figure 5 (page 9)).
NOTE: If you are upgrading from a two-node to a four-node configuration, you can have CNAs installed in node 0 and node 1, and FC HBAs installed in node 2 and node 3.
Figure 5 Location of Controller Node Ports
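The rule that each node enclosure must have matching PCIe cards can be expressed as a simple check. The sketch below is illustrative only: the function name and the card labels "FC" and "CNA" are hypothetical, and it assumes that nodes 0 and 1 share one node enclosure and nodes 2 and 3 share the other, as the node numbering in this chapter suggests.

```python
from typing import Dict, List, Optional

# None means "no expansion card". The labels are assumptions for this sketch.
VALID_CARDS = {None, "FC", "CNA"}

# Assumed pairing: nodes 0/1 share one enclosure, nodes 2/3 share the other.
ENCLOSURE_PAIRS = ((0, 1), (2, 3))


def check_expansion_cards(cards: Dict[int, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means the layout passes.

    `cards` maps a node number to its expansion card type.
    """
    problems = []
    for node, card in sorted(cards.items()):
        if card not in VALID_CARDS:
            problems.append(f"node {node}: unknown card type {card!r}")
    for low, high in ENCLOSURE_PAIRS:
        # Only check an enclosure if at least one of its nodes is listed.
        if low in cards or high in cards:
            if cards.get(low) != cards.get(high):
                problems.append(
                    f"nodes {low} and {high} do not have matching cards"
                )
    return problems


if __name__ == "__main__":
    # Four-node system with FC HBAs in nodes 0/1 and CNAs in nodes 2/3.
    print(check_expansion_cards({0: "FC", 1: "FC", 2: "CNA", 3: "CNA"}))
    # Mismatched cards inside one enclosure.
    print(check_expansion_cards({0: "FC", 1: "CNA"}))
```

The check accepts both the default layout in Table 1 and the upgrade layout described in the NOTE, since both keep the cards matched within each enclosure.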
  • 10. Table 2 Description of Controller Node Ports
Item | Port
1 | 2 Ethernet. MGMT: Connects to the storage array management interfaces. RC: Connects to Remote Copy
2 | Fibre Channel (FC-1 and FC-2): Connects to host systems
3 | SAS (DP-2 and DP-1): Connects the drive enclosures and I/O modules using SAS cables
4 | Node Interconnect: Connects four directional interconnect cables that connect the controller nodes (four-node 7400 only)
5 | PCIe slot for optional four-port 8 Gb/s FC HBA or two-port 10 Gb/s CNA
NOTE: The MFG port is not used.
I/O Modules
The I/O modules connect the controller nodes to the hard drives using a SAS cable and enable data transfer between the nodes, hard drives, PCMs, and enclosures. Two I/O modules are located at the rear of each drive enclosure, numbered 0 and 1 from bottom to top. See Figure 6 (page 10).
Figure 6 I/O Module Numbering for HP M6710 (2U) and HP M6720 (4U) Drive Enclosures
NOTE: The I/O modules are located in slots 0 and 1 of the HP M6710 and M6720 drive enclosures.
Power Cooling Modules
The PCM is an integrated power supply, battery, and cooling fan. There are two types of PCMs:
• The 580 W PCM is used in drive enclosures and does not include a battery.
• The 764 W PCM is used in node enclosures and includes a replaceable battery.
The PCMs are located at the rear of the storage system, on the sides of the enclosure. There are two PCMs per enclosure, numbered 0 and 1 from left to right.
  • 11. Figure 7 PCM Numbering
In the HP M6720 Drive Enclosure, the two PCMs are located diagonally from one another. The remaining PCM slots are blank. See Figure 8 (page 11).
Figure 8 PCMs in HP M6710 (2U) and HP M6720 (4U) Drive Enclosures
Power Distribution Units
Two power distribution units (PDUs) are mounted horizontally at the bottom of the rack. The PDUs are numbered 0 and 1 from bottom to top. The default configuration for the HP Intelligent Series Racks is two PDUs mounted vertically at the bottom of the rack to provide a front-mounting unit space.
NOTE: Depending on configuration, PDUs can also be mounted vertically.
Service Processor
The HP 3PAR StoreServ 7000 Storage system uses either a physical service processor (SP) or a virtual service processor (VSP). If your configuration includes an SP, the SP rests at the bottom of the rack, under the enclosures and above the PDUs.
Figure 9 HP 3PAR Service Processor DL 320e
  • 12. 2 Understanding LED Indicator Status
Storage system components have LEDs indicating status of the hardware. Use the LED indicators to help diagnose basic hardware problems. This chapter provides tables and illustrations of component LEDs.
Enclosure LEDs
Bezel LEDs
The bezel LEDs are located at the front of the system on each side of the drive enclosure. The bezels have three LED indicators. See Figure 10 (page 12).
Figure 10 Location of Bezel LEDs
Table 3 Description of Bezel LEDs
Callout | LED | Appearance | Indicates
1 | System Power | Green | On – System power is available.
1 | System Power | Amber | On – System is running on battery power.
2 | Module Fault | Amber | On – System hardware fault to I/O modules or PCMs within the enclosure. At the rear of the enclosure, identify if the PCM or I/O module LED is also Amber.
3 | Disk Drive Status | Amber | On – There is a disk fault on the system.
NOTE: Prior to running installation scripts, the numeric display located under the Disk Drive Status LED may not display the proper numeric order in relation to their physical locations. The correct sequence will be displayed after the installation script is completed.
  • 13. Disk Drive LEDs
Disk Drive LEDs are located on the front of the disk drives. Disk drives have two LED indicators.
Figure 11 Location of Disk Drive LEDs
Table 4 Description of Disk Drive LEDs
Callout | LED | LED Appearance | Indicates
1 | Activity | Green | On – Normal operation
1 | Activity | Green | Flashing – Activity
2 | Fault | Amber | On – Disk failed and is ready to be replaced.
2 | Fault | Amber | Flashing – The locatecage command is issued, which blinks all drive fault LEDs for up to 15 minutes. (The I/O module Fault LEDs at the rear of the enclosure also blink.) Fault LEDs for failed disk drives do not blink.
Storage System Component LEDs
PCM LEDs
The 764 W PCMs are used in controller node enclosures and include six LEDs. The 580 W PCMs are used in drive enclosures and include four LEDs. The LEDs are located in the corner of the module. See Table 5 (page 14) for details of PCM LEDs.
  • 14. Figure 12 Location of Controller Node PCM LEDs
Table 5 Description of Controller Node PCM LEDs
Description | Appearance | Indicates
AC input fail | Amber, On | No AC power or PCM fault
AC input fail | Amber, Flashing | Firmware download
PCM OK | Green, On | AC present and PCM On / OK
PCM OK | Green, Flashing | Standby mode
Fan Fail | Amber, On | PCM fail or PCM fault
Fan Fail | Amber, Flashing | Firmware download
DC Output Fail | Amber, On | No AC power, PCM fault or out of tolerance
DC Output Fail | Amber, Flashing | Firmware download
Battery Fail | Amber, On | Hard fault (not recoverable)
Battery Fail | Amber, Flashing | Soft fault (recoverable)
  • 15. Table 5 Description of Controller Node PCM LEDs (continued)
Description | Appearance | Indicates
Battery Good | Green, On | Present and charged
Battery Good | Green, Flashing | Charging or disarmed
Drive PCM LEDs
The following figure shows the location of the 580 W drive enclosure PCM LEDs. See Table 6 (page 15) for details of PCM LEDs.
Figure 13 Location of Drive PCM LEDs
Table 6 Description of Drive PCM LEDs
Description | LED Appearance | Indicates
AC input fail | Amber, On | No AC power or PCM fault
AC input fail | Amber, Flashing | Firmware download
PCM OK | Green, On | AC present and PCM On / OK
PCM OK | Green, Flashing | Standby mode
Fan Fail | Amber, On | PCM fail or PCM fault
Fan Fail | Amber, Flashing | Firmware download
  • 16. Table 6 Description of Drive PCM LEDs (continued)
Description | LED Appearance | Indicates
DC Output Fail | Amber, On | No AC power, PCM fault or out of tolerance
DC Output Fail | Amber, Flashing | Firmware download
I/O Module LEDs
I/O modules are located on the back of the system. I/O modules have two mini-SAS universal ports, which can be connected to HBAs or other ports. Each port includes External Port Activity LEDs, labeled 0 to 3. The I/O module also includes a Power and Fault LED.
Figure 14 Location of HP M6710/M6720 I/O Module LEDs
Figure 15 I/O Module Power and Fault LEDs
Table 7 Description of I/O module Power and Fault LEDs
Function | Appearance | State | Indicates
Power | Green | On | Power is on
Power | Green | Off | Power is off
  • 17. Table 7 Description of I/O module Power and Fault LEDs (continued)
Function | Appearance | State | Indicates
Fault | Amber | On | Fault
Fault | Amber | Off | Normal operation
Fault | Amber | Flashing | Locate command issued
External Port Activity LEDs
Figure 16 Location of External Port Activity LEDs
Function | Appearance | State | Indicates
External Port Activity; 4 LEDs for Data Ports 0 through 3 | Green | On | Ready, no activity
External Port Activity; 4 LEDs for Data Ports 0 through 3 | Green | Off | Not ready or no power
External Port Activity; 4 LEDs for Data Ports 0 through 3 | Green | Flashing | Activity
  • 18. Controller Node and Internal Component LEDs
NOTE: Enter the locatenode command to flash the hotplug LED blue.
Figure 17 Location of Controller Node LEDs
Table 8 Description of Controller Node LEDs
Callout | LED | Appearance | Indicates
1 | Status | Green | Node status Good. On – No cluster; Quick Flashing – Boot; Slow Flashing – Cluster
2 | Hotplug | Blue | Node FRU Indicator. On – OK to remove; Off – Not OK to remove; Flashing – locatenode command has been issued
3 | Fault | Amber | Node status Fault. On – Fault; Off – No fault; Flashing – Node in cluster and there is a fault
Ethernet LEDs
The controller node has two built-in Ethernet ports. Each built-in Ethernet port has two LEDs.
  • 19. Figure 18 Location of Ethernet LEDs
Table 9 Description of Ethernet LEDs
Callout | LED | Appearance | Indicates
1 | Link Up Speed | Green | On – 1 GbE Link
1 | Link Up Speed | Amber | On – 100 Mb Link
1 | Link Up Speed | Off | No link established or 10 Mb Link
2 | Activity | Green | On – No link activity; Off – No link established; Flashing – Link activity
FC Port LEDs
The controller node has two FC ports. Each FC port has two LEDs. The arrowhead-shaped LEDs point to the associated port.
Figure 19 Location of FC Port LEDs
Table 10 Description of FC Port LEDs
Port | LED | LED Appearance | Indicates
All ports | No light | Off | Wake up failure (dead device) or power is not applied
FC-1 | Amber | Off | Not connected
FC-1 | Amber | 3 fast blinks | Connected at 4 Gb/s
FC-1 | Amber | 4 fast blinks | Connected at 8 Gb/s
FC-2 | Green | On | Normal/Connected, link up
FC-2 | Green | Flashing | Link down or not connected
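The Ethernet Link Up Speed LED and the amber FC port LED both encode link speed, which lends itself to a lookup table. The following is a minimal sketch of how Tables 9 and 10 might be encoded for a support script; the function names and the string keys are hypothetical, and only the meanings come from the tables.

```python
# Table 9, callout 1: Ethernet Link Up Speed LED.
ETHERNET_LINK_SPEED_LED = {
    "green": "1 GbE link",
    "amber": "100 Mb link",
    "off": "no link established or 10 Mb link",
}

# Table 10: number of fast blinks on the amber FC port LED.
FC_SPEED_BY_FAST_BLINKS = {
    3: "connected at 4 Gb/s",
    4: "connected at 8 Gb/s",
}


def ethernet_link_speed(led_color: str) -> str:
    """Decode the Ethernet Link Up Speed LED ("green", "amber" or "off")."""
    key = led_color.strip().lower()
    if key not in ETHERNET_LINK_SPEED_LED:
        raise ValueError(f"unexpected LED color: {led_color!r}")
    return ETHERNET_LINK_SPEED_LED[key]


def fc_port_speed(amber_fast_blinks: int) -> str:
    """Decode the amber FC port LED from its fast-blink count (0 means off)."""
    if amber_fast_blinks == 0:
        return "not connected"
    # Blink counts other than 3 or 4 are not described in Table 10.
    return FC_SPEED_BY_FAST_BLINKS.get(
        amber_fast_blinks, "pattern not listed in Table 10"
    )


if __name__ == "__main__":
    print(ethernet_link_speed("Amber"))
    print(fc_port_speed(4))
```

Note that an unlit Ethernet speed LED is ambiguous by design: the table lists both "no link" and "10 Mb link" for that state, so the Activity LED has to be read alongside it.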
  • 20. SAS Port LEDs
The controller node has two SAS ports. Each SAS port has four LEDs, numbered 0 to 3:
Figure 20 Location of SAS Port LEDs
Table 11 Description of SAS port LEDs
Callout | LED | Appearance | Indicates
1 | DP-1 | Green | Off – Whether or not a SAS link is present, this LED does not remain lit; Flashing – Activity on port
2 | DP-2 | Green | Off – Whether or not a SAS link is present, this LED does not remain lit; Flashing – Activity on port
Interconnect Port LEDs
The controller node has two interconnect ports. Each interconnect port includes two LEDs.
Figure 21 Location of Interconnect Port LEDs
Table 12 Description of Interconnect Port LEDs
Callout | LED | Appearance | Indicates
1 | Status | Green | On – Link established
  • 21. Table 12 Description of Interconnect Port LEDs (continued)
Callout | LED | Appearance | Indicates
1 | Status | Green | Off – Link not yet established
2 | Fault | Amber | On – Failed to establish link connection; Off – No errors currently on link; Flashing – Cluster link cabling error, controller node in wrong slot, or serial number mismatch between controller nodes.
Fibre Channel Adapter Port LEDs
The Fibre Channel adapter in the controller node includes Fibre Channel port LEDs:
Figure 22 Location of Fibre Channel Adapter Port LEDs
Table 13 Description of Fibre Channel Adapter Port LEDs
Callout | LED | Appearance | Indicates
(none) | All ports | No light | Off – Wake up failure (dead device) or power is not applied
1 | Port speed | Amber | Off – Not connected; 3 fast blinks – Connected at 4 Gb/s; 4 fast blinks – Connected at 8 Gb/s
2 | Link status | Green | On – Normal/Connected, link up; Flashing – Link down or not connected
Converged Network Adapter Port LEDs
The CNA in the controller node includes two ports. Each port has a Link and Activity LED.
Figure 23 Location of CNA Port LEDs
  • 22. Table 14 Description of CNA Port LEDs

Callout  LED             Appearance  Indicates
1        Link            Green       Off – Link down
                                     On – Link up
2        ACT (Activity)  Green       Off – No activity
                                     On – Activity

Service Processor LEDs
The HP 3PAR SP (ProLiant DL320e) LEDs are located at the front and rear of the SP.

Figure 24 Front Panel LEDs

Table 15 Front panel LEDs

Item  LED                          Appearance      Description
1     UID LED/button               Blue            Active
                                   Flashing Blue   System is being managed remotely
                                   Off             Deactivated
2     Power On/Standby button      Green           System is on
      and system power             Flashing Green  Waiting for power
                                   Amber           System is on standby, power still on
                                   Off             Power cord is not attached or power supply has failed
3     Health                       Green           System is on and system health is normal
                                   Flashing Amber  System health is degraded
                                   Flashing Red    System health is critical
                                   Off             System power is off
4     NIC status                   Green           Linked to network
                                   Flashing Green  Network activity
                                   Off             No network link
  • 23. Figure 25 Rear Panel LEDs

Table 16 Rear panel LEDs

Item  LED             Appearance               Description
1     NIC link        Green                    Link
                      Off                      No link
2     NIC status      Green or Flashing Green  Activity
                      Off                      No activity
3     UID LED/button  Blue                     Active
                      Flashing Blue            System is being managed remotely
                      Off                      Deactivated
4     Power supply    Green                    Normal
                      Off                      One or more of the following conditions:
                                               • Power is unavailable
                                               • Power supply has failed
                                               • Power supply is in standby mode
                                               • Power supply error

NOTE: The power supply LED (item 4) may not be applicable to your system (for hot-plug HP CS power supplies ONLY).
  • 24. 3 Powering Off/On the Storage System
This chapter describes how to power the storage system on and off.

Powering Off the Storage System
NOTE: Power distribution units (PDUs) in any expansion cabinets connected to the storage system may need to be shut off. Use the locatesys command to identify all connected cabinets before shutting down the system. The command blinks all node and drive enclosure LEDs.

Before you power off, use either SPmaint or SPOCC to shut down the system (see Service Processor Onsite Customer Care in the HP 3PAR StoreServ 7000 Storage Service Guide). The system must be shut down before it is powered off, using one of the following three methods.

Using SPOCC
1. Select InServ Product Maintenance.
2. Select Halt an InServ cluster/node.
3. Follow the prompts to shut down the cluster. Do not shut down individual nodes.
4. Turn off power to the node PCMs.
5. Turn off power to the drive enclosure PCMs.
6. Turn off all PDUs in the rack.

Using SPmaint
1. Select option 4 (InServ Product Maintenance).
2. Select Halt an InServ cluster/node.
3. Follow the prompts to shut down the cluster. Do not shut down individual nodes.
4. Turn off power to the node PCMs.
5. Turn off power to the drive enclosure PCMs.
6. Turn off all PDUs in the rack.

Using CLI Directly on the Controller Node if the SP is Inaccessible
1. Enter the CLI command shutdownsys halt. Confirm all prompts.
2. Allow 2 to 3 minutes for the node to halt, then verify that the node Status LED is flashing green and the node hotplug LED is blue, indicating that the node has been halted. For information about LED status, see “Understanding LED Indicator Status” (page 12).
   CAUTION: Failure to wait until all controller nodes are in a halted state could cause the system to view the shutdown as uncontrolled and place the system in a checkld state upon power up. This can seriously impact host access to data.
3. Turn off power to the node PCMs.
4. Turn off power to the drive enclosure PCMs.
5. Turn off power to all PDUs in the rack.

Powering On the Storage System
1. Set the circuit breakers on the PDUs to the ON position.
2. Set the switches on the power strips to the ON position.
3. Power on the drive enclosure PCMs.
  • 25. NOTE: To avoid cabling errors, every drive enclosure must have at least one hard drive installed before the enclosure is powered on.
4. Power on the node enclosure PCMs.
5. Verify the status of the LEDs. See “Understanding LED Indicator Status” (page 12).
  • 26. 4 Alerts
Alerts are triggered by events that require system administrator intervention. This chapter provides a list of alerts identified by message code, the messages, and what action should be taken for each alert. To learn more about alerts, see the HP 3PAR StoreServ Storage Concepts Guide.

For information about system alerts, go to HP Guided Troubleshooting at http://www.hp.com/support/hpgt/3par and select your server platform.

To view the alerts, use the showalert command. Alert message codes have seven digits in the schema AAABBBB, where:
• AAA is a 3-digit major code
• BBBB is a 4-digit sub-code
• 0x precedes the code to indicate hexadecimal notation

NOTE: Message codes ending in de indicate a degraded state alert. Message codes ending in fa indicate a failed state alert.

See the HP 3PAR OS Command Line Interface Reference for complete information on the display options on the event logs.

Table 17 Alert Severity Levels

Severity       Description
Fatal          A fatal event has occurred. It is no longer possible to take remedial action.
Critical       The event is critical and requires immediate action.
Major          The event requires immediate action.
Minor          An event has occurred that requires action, but the situation is not yet serious.
Degraded       An aspect of performance or availability may have become degraded. You must determine whether action is necessary.
Informational  The event is informational. No action is required other than to acknowledge or remove the alert.

Getting Recommended Actions
For disk drive alerts, the component line in the right column lists the cage number, magazine number, and drive number (cage:magazine:disk). The first and second numbers are sufficient to identify the exact disk in an HP 3PAR StoreServ 7000 Storage system, since there is always only a single disk (disk 0) in a single magazine.
1. Follow the link to alert actions under Recommended Actions.
2. At the HP Storage Systems Guided Troubleshooting website, follow the link for your product.
3. At the bottom of the HP 3PAR product page, click the link for HP 3PAR Alert Messages.
4. At the bottom of the Alert Messages page, choose the correct message code series based on the first four characters of the alert.
5. Choose the link that matches the first five characters of the message code.
6. On the next page, select the message code that matches the code in the alert. The next page shows the message type based on the message code selected and provides a link to the suggested action.
7. Follow the link.
  • 27. 8. On the suggested actions page, scroll through the list to find the message state listed in the alert message. The recommended action is listed next to the message state.
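The AAABBBB message code schema and the de/fa suffix note from the Alerts chapter can be sketched in Python as follows. This is only an illustration: the function names split_alert_code and state_hint, and the sample code value 0x0270001, are assumptions made for this example and do not come from the HP documentation.

```python
from typing import Optional, Tuple


def split_alert_code(code: str) -> Tuple[str, str]:
    """Split a seven-digit alert message code (schema AAABBBB) into
    its 3-digit major code and 4-digit sub-code.

    An optional leading "0x" (hexadecimal notation) is accepted.
    """
    digits = code.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    if len(digits) != 7:
        raise ValueError(f"expected 7 digits, got {len(digits)}: {code!r}")
    int(digits, 16)  # raises ValueError if the digits are not hexadecimal
    return digits[:3], digits[3:]


def state_hint(code: str) -> Optional[str]:
    """Return the alert state implied by the code suffix, if any."""
    lowered = code.strip().lower()
    if lowered.endswith("de"):
        return "degraded"
    if lowered.endswith("fa"):
        return "failed"
    return None


# Hypothetical code value, used only to show the split.
major, sub = split_alert_code("0x0270001")
print(major, sub)
```

The major code and sub-code are kept as strings so that leading zeros survive, which matters when matching the first four or five characters of a code against the Alert Messages pages.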
  • 28. 5 Troubleshooting
The HP 3PAR OS CLI checkhealth command checks and displays the status of storage system hardware and software components. For example, the checkhealth command can check for unresolved system alerts, display issues with hardware components, or display information about virtual volumes that are not optimal.

By default the checkhealth command checks most storage system components, but you can also check the status of specific components. For a complete list of storage system components analyzed by the checkhealth command, see “checkhealth Command” (page 28).

The checkhealth -svc option is available only to users with Super CLI accounts. The -svc option provides a summary of service-related issues by default. If you use the -detail option, both a summary and a detailed list of service issues are displayed. The service information displayed is for service providers only, because it may produce cryptic output that only a service provider understands, or it displays issues that only a service provider can resolve. The -svc option displays the service-related information in addition to the customer-related information.

Alerts are processed by the SP. The HP Business Support Center (BSC) takes action on alerts that are not customer administration alerts. Customer administration alerts are managed by customers. The SP also runs the checkhealth command once an hour and sends the information to the BSC, where the information is monitored periodically for unusual system conditions.

checkhealth Command
The checkhealth command checks and displays the status of system hardware and software components. Command syntax is:
checkhealth [<options> | <component>...]
Command authority is Super, Service.

Command options are:
• -list, lists all components that checkhealth can analyze
• -quiet, suppresses the display of the item currently being checked
• -detail, displays detailed information regarding the status of the system
• -svc, performs service-related checks on the system and reports the status. This is a hidden option and does not appear in the CLI Help. This option is not intended for customers and is available only to Service users
• -full, displays information about the status of the full system. This is a hidden option and does not appear in the CLI Help. This option has no effect if the -svc option is omitted. Some of the additional components evaluated take longer to run than other components

The <component> is the command specifier, which indicates the component to check. Use the -list option to view the list of components.

Using the checkhealth Command
Use the checkhealth command without any specifiers to check the health of all the components that can be analyzed by the checkhealth command.

The following example lists both summary and detailed information about the hardware and software components:

cli% checkhealth -detail
Checking alert
Checking cage
Checking dar
  • 29. Checking date
Checking ld
Checking license
Checking network
Checking node
Checking pd
Checking port
Checking rc
Checking snmp
Checking task
Checking vlun
Checking vv
Component -----------Description----------- Qty
Alert     New alerts                          4
Date      Date is not the same on all nodes   1
LD        LDs not mapped to a volume          2
License   Golden License.                     1
vlun      Hosts not connected to a port       5

The following information is reported with the -detail option:

Component ----Identifier---- -----------Description-------
Alert     sw_port:1:3:1      Port 1:3:1 Degraded (Target Mode Port Went Offline)
Alert     sw_port:0:3:1      Port 0:3:1 Degraded (Target Mode Port Went Offline)
Alert     sw_sysmgr          Total available FC raw space has reached threshold of 800G (2G remaining out of 544G total)
Alert     sw_sysmgr          Total FC raw space usage at 307G (above 50% of total 544G)
Date      --                 Date is not the same on all nodes
LD        ld:name.usr.0      LD is not mapped to a volume
LD        ld:name.usr.1      LD is not mapped to a volume
vlun      host:group01       Host wwn:2000000087041F72 is not connected to a port
vlun      host:group02       Host wwn:2000000087041F71 is not connected to a port
vlun      host:group03       Host iscsi_name:2000000087041F71 is not connected to a port
vlun      host:group04       Host wwn:210100E08B24C750 is not connected to a port
vlun      host:Host_name     Host wwn:210000E08B000000 is not connected to a port

If there are no faults or exception conditions, the checkhealth command indicates the system is healthy:

cli% checkhealth
Checking alert
Checking cage
…
Checking vlun
Checking vv
System is healthy

Use the <component> specifier to check the status of one or more specific storage system components. For example:

cli% checkhealth node pd
Checking node
Checking pd
The following components are healthy: node, pd

The -svc option provides a summary of service-related issues by default. If you use the -detail option, both a summary and a detailed list of service issues are displayed.
The -svc option displays the service-related information in addition to the customer-related information.
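For scripted monitoring, the checkhealth summary table shown in the examples above can be turned into structured rows. The following Python sketch assumes the output layout is exactly as in those examples (a header line of the form Component ---Description--- Qty, followed by rows that end in an integer quantity). The names SummaryRow and parse_summary are hypothetical helpers, not part of the HP 3PAR CLI.

```python
import re
from typing import List, NamedTuple


class SummaryRow(NamedTuple):
    component: str
    description: str
    qty: int


def parse_summary(output: str) -> List[SummaryRow]:
    """Extract the summary table rows from captured checkhealth output."""
    rows = []
    in_table = False
    for raw in output.splitlines():
        line = raw.strip()
        if re.match(r"^Component\s+-+Description-+\s+Qty$", line):
            in_table = True  # summary header found; rows follow
            continue
        if not in_table or not line:
            continue
        parts = line.split()
        # A summary row is: component, free-text description, integer quantity.
        if len(parts) < 3 or not parts[-1].isdigit():
            break  # end of the summary table (for example, the -detail table starts)
        rows.append(SummaryRow(parts[0], " ".join(parts[1:-1]), int(parts[-1])))
    return rows


# Sample text copied from the checkhealth -detail example in this chapter.
SAMPLE = """\
Checking alert
Checking vv
Component -----------Description----------- Qty
Alert     New alerts                          4
Date      Date is not the same on all nodes   1
LD        LDs not mapped to a volume          2
License   Golden License.                     1
vlun      Hosts not connected to a port       5
Component ----Identifier---- -----------Description-------
Date      --                 Date is not the same on all nodes
"""

for row in parse_summary(SAMPLE):
    print(row.component, row.qty, row.description)
```

Parsing stops at the first line that does not end in an integer, so the -detail table that follows the summary is ignored rather than misread.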
  • 30. The following example displays information intended only for service users:

cli% checkhealth -svc
Checking alert
Checking cabling
Checking cage
...
Checking vlun
Checking vv
Component -------------------Description------------------- Qty
Alert     New alerts                                          2
File      Nodes with Dump or HBA core files                   1
PD        There is an imbalance of active pd ports            1
PD        PDs that are degraded or failed                     2
pdch      LDs with chunklets on a remote disk                 2
pdch      LDs with connection path different than ownership   2
Port      Missing SFPs                                        6

The following information is included with the -detail option. The detailed output can be very long if a node or cage is down.

cli% checkhealth -svc -detail
Checking alert
Checking cabling
Checking cage
...
Checking vlun
Checking vv
Component -------------------Description------------------- Qty
Alert     New alerts                                          2
File      Nodes with Dump or HBA core files                   1
PD        There is an imbalance of active pd ports            1
PD        PDs that are degraded or failed                     2
pdch      LDs with chunklets on a remote disk                 2
pdch      LDs with connection path different than ownership   2
Port      Missing SFPs                                        6

Component --------Identifier--------- ----------------Description---------------------
Alert     hw_cage_sled:3:8:3,sw_pd:91 Magazine 3:8:3, Physical Disk 91 Degraded (Prolonged Missing B Port)
Alert     hw_cage_sled:N/A,sw_pd:54   Magazine N/A, Physical Disk 54 Failed (Prolonged Missing, Missing A Port, Missing B Port)
File      node:0                      Dump or HBA core files found
PD        disk:54                     Detailed State: prolonged_missing
PD        disk:91                     Detailed State: prolonged_missing_B_port
PD        --                          There is an imbalance of active pd ports
pdch      LD:35                       Connection path is not the same as LD ownership
pdch      LD:54                       Connection path is not the same as LD ownership
pdch      ld:35                       LD has 1 remote chunklets
pdch      ld:54                       LD has 10 remote chunklets
Port      port:2:2:3                  Port or devices attached to port have experienced within the last day

To check for inconsistencies between the System Manager and kernel states and CRC errors for FC and SAS ports, use the -full option:

checkhealth -svc -full

checkhealth -list -svc -full
  • 31. Component -----------------------------------Description------------------------------
alert       Displays any non-resolved alerts.
cabling     Displays any cabling errors.*
cage        Displays non-optimal drive cage conditions.
consistency Displays inconsistencies between sysmgr and kernel**
dar         Displays Data Encryption issues.
date        Displays if nodes have different dates.
file        Displays non-optimal file system conditions.*
host        Checks for FC host ports that are not configured for virtual port support.*
ld          Displays non-optimal LDs.
license     Displays license violations.
network     Displays ethernet issues.
node        Displays non-optimal node conditions.
pd          Displays PDs with non-optimal states or conditions.
pdch        Displays chunklets with non-optimal states.*
port        Displays port connection issues.
portcrc     Checks for increasing port CRC errors.**
portpelcrc  Checks for increasing SAS port CRC errors.**
rc          Displays Remote Copy issues.
snmp        Displays issues with SNMP.
sp          Checks the status of connection between sp and nodes.*
task        Displays failed tasks.
vlun        Displays inactive VLUNs and those which have not been reported by the host agent.
vv          Displays non-optimal VVs.

NOTE:
• One asterisk (*) at the end of the output indicates that the component is checked only if -svc is part of the command.
• Two asterisks (**) at the end of the output indicate that the component is checked only if -svc -full is part of the command.

Troubleshooting Storage System Components
Use the checkhealth -list command to list all components that can be analyzed by the checkhealth command.

For detailed troubleshooting information about specific components, including examples and suggested actions for correcting issues, see the component names in Table 18 (page 32).
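The asterisk markings in the checkhealth -list -svc -full output determine which components are evaluated for a given combination of options. The lookup below is a minimal Python sketch of that rule, built only from the component names and markings listed in that output; the function name components_checked is an assumption made for this illustration.

```python
from typing import List

# Components with no asterisk: checked by a plain checkhealth run.
DEFAULT = {
    "alert", "cage", "dar", "date", "ld", "license", "network", "node",
    "pd", "port", "rc", "snmp", "task", "vlun", "vv",
}
# One asterisk (*): checked only when -svc is given.
SVC_ONLY = {"cabling", "file", "host", "pdch", "sp"}
# Two asterisks (**): checked only when both -svc and -full are given.
SVC_FULL_ONLY = {"consistency", "portcrc", "portpelcrc"}


def components_checked(svc: bool = False, full: bool = False) -> List[str]:
    """Return the sorted component names evaluated for the given options."""
    selected = set(DEFAULT)
    if svc:
        selected |= SVC_ONLY
        if full:
            # -full has no effect unless -svc is also present.
            selected |= SVC_FULL_ONLY
    return sorted(selected)


print(len(components_checked()))
print(len(components_checked(svc=True)))
print(len(components_checked(svc=True, full=True)))
```

Nesting the -full test inside the -svc branch mirrors the documented behavior that -full has no effect when -svc is omitted.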
  • 32. Table 18 Component Functions

Component    Function
Alert        Displays unresolved alerts
Cabling      Displays any cabling errors
Cage         Displays drive cage conditions that are not optimal
Consistency  Displays inconsistencies between sysmgr and the kernel
Dar          Displays data encryption issues
Date         Displays if nodes have different dates
File         Displays file system conditions that are not optimal
Host         Checks for FC host ports that are not configured for virtual port support
LD           Displays LDs that are not optimal
License      Displays license violations
Network      Displays Ethernet issues
Node         Displays node conditions that are not optimal
PD           Displays PDs with states or conditions that are not optimal
PDCH         Displays chunklets with states that are not optimal
Port         Displays port connection issues
Portcrc      Checks for increasing port CRC errors
Portpelcrc   Checks for increasing SAS port CRC errors
RC           Displays Remote Copy issues
SNMP         Displays issues with SNMP
SP           Checks the status of Ethernet connections between the Service Processor and nodes, when run from the SP
Task         Displays failed tasks
VLUN         Displays inactive VLUNs and VLUNs that have not been reported by the host agent
VV           Displays VVs that are not optimal

Alert
Displays unresolved alerts and shows any alerts generated by showalert -n.

Format of Possible Alert Exception Messages
Alert <component> <alert_text>

Alert Example
Component -Identifier- --------Description--------------------
Alert     hw_cage:1    Cage 1 Degraded (Loop Offline)
Alert     sw_cli       11 authentication failures in 120 secs
  • 33. Alert Suggested Action
View the full Alert output using the MC (GUI) or the showalert -d CLI command.

Cabling
Displays any cabling errors. Checks for compliance with the standard cabling rules between nodes and drive cages (same slot and port numbers on two different nodes to a cage).

NOTE: To avoid cabling errors, every drive enclosure must have at least one hard drive installed before the enclosure is powered on.

Format of Possible Cabling Exception Messages
Each summary line is followed by the corresponding detail line.

Cabling Bad SAS connection 20
Cabling cage1 Check connections or replace cable from (cage0, I/O 0, DP-1) to (cage1, I/O 0, DP-1)

Cabling Unexpected cage found 24
Cabling -- Unexpected cage found on node3 DP-2

Cabling Wrong I/O or port 1
Cabling cage8 Cable in (cage8, I/O 1, Mfg) should be in (cage8, I/O 1, DP-2)

Cabling SAS cabling check incomplete 1
Cabling cage8 All three SAS ports of I/O 1 used, cabling check incomplete

Cabling Incorrect drive cage chaining 1
Cabling cage5 Cable in (cage5, I/O 1, DP-2) should be in (cage9, I/O 1, DP-1)

Cabling Mismatched cage order 1
Cabling cage0 node1 DP-2 should be cabled in the order: cage9 cage8 cage7 cage6 cage5

Cabling Cable chains are unbalanced 1
Cabling cage0 node0 DP-2 has 5 cages, node1 DP-2 has 4 cages

Cabling Missing I/O module 1
Cabling cage5 I/O 1 missing. Check status and cabling to cage5 I/O 1

Cabling Cable chain too long 1
Cabling cage0 node1 DP-2 has 6 cages connected, Maximum is 5 (cage9 cage8 cage7 cage6 cage5 cage11)

Cabling Cages not connected to paired nodes 1
Cabling cage11 Cage connected to non-paired nodes node1 DP-2 and node2 DP-2

Cabling Cages cabled to too many nodes 6
Cabling cage5 Cabled to node0 DP-2 and node1 DP-2 and node3 DP-2, remove a cable from node3

Cabling Multiple node ports on a single cable chain 1
Cabling cage11 Cage is connected to too many node ports (node2 DP-1 & DP-2 and node3 DP-1 & DP-2)

Cabling Cages cabled to nodes twice 1
Cabling cage11 Cabled to node2 DP-2 and node3 DP-1 & DP-2, remove a cable from node3

Cabling Cages with multiple paths to node ports 1
Cabling cage11 Cage has multiple paths to node2 DP-2 and node3 DP-2, correct cabling
  • 34. Cabling Cages with multiple paths to node ports 2
Cabling cage5 Cage has multiple paths to node0 DP-2, correct cabling

Cabling Cages cabled to a single node 1
Cabling cage11 Cage not connected to node2, move one connection from node3 to node2

Cabling Cages not connected to same slot & port 1
Cabling cage11 Cage connected to different ports node2 DP-1 and node3 DP-2

Cabling Example 1
Component -Identifier- ---Description--
Cabling   cage:3       Missing Port

Cabling Suggested Action 1
Check the status of the nodes, FC ports, cage, and paths to the drive cage using CLI commands such as showcage, showcage -d, showpd, and shownode. If a node is offline, multiple cages are affected.

cli% showcage cage3
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 3 cage3 ---       0 3:0:4     0     28 27-34 2.33 2.33 DC2   n/a

cli% showpd -p -cg 3
                                ---Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State    Total  Free   A     B
48 3:0:0   FC         10 degraded 139520 120320 ----- 3:0:4*
49 3:0:1   FC         10 degraded 139520 126464 ----- 3:0:4*
50 3:0:2   FC         10 degraded 139520 120320 ----- 3:0:4*
51 3:0:3   FC         10 degraded 139520 126464 ----- 3:0:4*

cli% showpd -p -cg 3 -path
                         -------Paths-------
Id CagePos Type -State-- A            B     Order
48 3:0:0   FC   degraded 2:0:4missing 3:0:4 3/-
49 3:0:1   FC   degraded 2:0:4missing 3:0:4 3/-
50 3:0:2   FC   degraded 2:0:4missing 3:0:4 3/-
51 3:0:3   FC   degraded 2:0:4missing 3:0:4 3/-

cli% showport
2:0:4 initiator offline 2FF70002AC00054C 22040002AC00054C free
or
2:0:4 initiator loss_sync 2FF70002AC00054C 22040002AC00054C free

Cabling Example 2
Component -Identifier- --------Description------------------------
Cabling   cage:0       Not connected to the same slot & port

Cabling Suggested Action 2
The recommended and factory-default node-to-drive-chassis (cage) cabling configuration is to connect a drive cage to a node pair (two nodes), generally nodes 0/1, 2/3, 4/5, or 6/7, and to achieve symmetry between slots and ports (use the same slot and port on each node to a cage).

  • 35. In the next example, cage0 is incorrectly connected to either slot-0 of node-0 or slot-1 of node-1.

cli% showcage cage0
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 0 cage0 0:0:1     0 1:1:1     0     24 28-38 2.37 2.37 DC2   n/a

After determining the desired cabling and reconnecting correctly to slot-0 and port-1 of nodes 0 & 1, the output should look like this:

cli% showcage cage0
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 0 cage0 0:0:1     0 1:0:1     0     24 28-38 2.37 2.37 DC2   n/a

Cage
Displays drive cage conditions that are not optimal and reports exceptions if any of the following do not have normal states:
• Ports
• Drive magazine states (DC1, DC2, & DC4)
• Small form-factor pluggable (SFP) voltages (DC2 and DC4)
• SFP signal levels (RX power low and TX failure)
• Power supplies
• Cage firmware (is not current)

Reports if a servicecage operation has been started and has not ended.
Format of Possible Cage Exception Messages
Cage cage:<cageid> "Missing A loop" (or "Missing B loop")
Cage cage:<cageid> "Interface Card <STATE>, SFP <SFPSTATE>" (is unqualified, is disabled, Receiver Power Low: Check FC Cable, Transmit Power Low: Check FC Cable, has RX loss, has TX fault)
Cage cage:<cageid>,mag:<magpos> "Magazine is <MAGSTATE>"
Cage cage:<cageid> "Power supply <X> fan is <FANSTATE>"
Cage cage:<cageid> "Power supply <X> is <PSSTATE>" (Degraded, Failed, Not_Present)
Cage cage:<cageid> "Power supply <X> AC state is <PSSTATE>"
Cage cage:<cageid> "Cage is in 'servicing' mode (Hot-Plug LED may be illuminated)"
Cage cage:<cageid> "Firmware is not current"

Cage Example 1
Component -------------Description-------------- Qty
Cage      Cages missing A loop                     1
Cage      SFPs with low receiver power             1

Component -Identifier- --------Description------------------------
Cage      cage:4       Missing A loop
Cage      cage:4       Interface Card 0, SFP 0: Receiver Power Low: Check FC Cable

Cage Suggested Action 1
Check the connection/path to the SFP in the cage and the level of signal the SFP is receiving. An RX Power reading below 100 µW signals the RX Power Low condition; typical readings are between 300 and 400 µW. Useful CLI commands are showcage -d and showcage -sfp -ddm.
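The RX Power guidance in Cage Suggested Action 1 can be expressed as a small classifier. This Python sketch uses only the two figures stated in the text (below 100 µW is low, 300 to 400 µW is typical); the function name classify_rx_power and the wording of the returned labels are assumptions made for this example, and the thresholds reported by an actual SFP may differ.

```python
RX_POWER_LOW_UW = 100          # below this, the text describes an RX Power Low condition
TYPICAL_RANGE_UW = (300, 400)  # typical readings according to the text


def classify_rx_power(reading_uw: float) -> str:
    """Classify an SFP RX Power reading given in microwatts."""
    if reading_uw < RX_POWER_LOW_UW:
        return "low"
    low, high = TYPICAL_RANGE_UW
    if low <= reading_uw <= high:
        return "typical"
    return "not low, outside typical range"


# Sample readings in microwatts, chosen only for illustration.
for reading in (0, 350, 402):
    print(reading, classify_rx_power(reading))
```

A reading that is neither low nor typical is reported separately rather than treated as a fault, because the text defines only the low condition as an exception.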
  • 36. At least two connections are expected for drive cages, and this exception is flagged if that is not the case.

cli% showcage -d cage4
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 4 cage4 ---       0 3:2:1     0      8 28-36 2.37 2.37 DC4   n/a
-----------Cage detail info for cage4 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed         0Gbps  --     --     4Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
   0   0 OK      FINISAR CORP.               4.1 No        No      Yes    Yes
   1   1 OK      FINISAR CORP.               4.1 No        No      No     Yes
Interface Board Info FCAL0     FCAL1
Link A RXLEDs        Off       Off
Link A TXLEDs        Green     Off
Link B RXLEDs        Off       Green
Link B TXLEDs        Off       Green
LED(Loop_Split)      Off       Off
LEDS(system,hotplug) Green,Off Green,Off
-----------Midplane Info-----------
Firmware_status    Current
Product_Rev        2.37
State              Normal Op
Loop_Split         0
VendorId,ProductId 3PARdata,DC4
Unique_ID          1062030000098E00
...
-------------Drive Info------------- ----LoopA----- ----LoopB-----
Drive NodeWWN          LED   Temp(C) ALPA LoopState ALPA LoopState
0:0   2000001d38c0c613 Green      33 0xe1 Loop fail 0xe1 OK
0:1   2000001862953510 Green      35 0xe0 Loop fail 0xe0 OK
0:2   2000001862953303 Green      35 0xdc Loop fail 0xdc OK
0:3   2000001862953888 Green      31 0xda Loop fail 0xda OK

cli% showcage -sfp cage4
Cage FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
   4    0   0 OK      FINISAR CORP.               4.1 No        No      Yes    Yes
   4    1   1 OK      FINISAR CORP.               4.1 No        No      No     Yes

cli% showcage -sfp -ddm cage4
---------Cage 4 Fcal 0 SFP 0 DDM----------
                       -Warning- --Alarm--
--Type-- Units Reading Low  High Low  High
Temp     C          33  -20   90  -25   95
Voltage  mV       3147 2900 3700 2700 3900
TX Bias  mA          7    2   14    1   17
TX Power uW        394   79  631   67  631
RX Power uW          0   15  794  10* 1259
---------Cage 4 Fcal 1 SFP 1 DDM----------
                       -Warning- --Alarm--
--Type-- Units Reading Low  High Low  High
Temp     C          31  -20   90  -25   95
Voltage  mV       3140 2900 3700 2700 3900
TX Bias  mA          8    2   14    1   17
  • 37. TX Power uW        404   79  631   67  631
RX Power uW        402   15  794   10 1259

Cage Example 2
Component -------------Description-------------- Qty
Cage      Degraded or failed cage power supplies   2
Cage      Degraded or failed cage AC power         1

Component -Identifier- ------------Description------------
Cage      cage:1       Power supply 0 is Failed
Cage      cage:1       Power supply 0's AC state is Failed
Cage      cage:1       Power supply 2 is Off

Cage Suggested Action 2
A cage power supply or power supply fan has failed, is missing input AC power, or the switch is turned OFF. The showcage -d cageX and showalert commands provide more detail.

cli% showcage -d cage1
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 1 cage1 0:0:2     0 1:0:2     0     24 27-39 2.37 2.37 DC2   n/a
-----------Cage detail info for cage1 ---------
Interface Board Info FCAL0     FCAL1
Link A RXLEDs        Green     Off
Link A TXLEDs        Green     Off
Link B RXLEDs        Off       Green
Link B TXLEDs        Off       Green
LED(Loop_Split)      Off       Off
LEDS(system,hotplug) Amber,Off Amber,Off
-----------Midplane Info-----------
Firmware_status    Current
Product_Rev        2.37
State              Normal Op
Loop_Split         0
VendorId,ProductId 3PARdata,DC2
Unique_ID          10320300000AD000
Power Supply Info State  Fan State AC     Model
ps0               Failed OK        Failed POI   <AC input is missing
ps1               OK     OK        OK     POI
ps2               Off    OK        OK     POI   <PS switch is turned off
ps3               OK     OK        OK     POI

Cage Example 3
Component -Identifier- --------------Description----------------
Cage      cage:1       Cage has a hotplug enabled interface card

Cage Suggested Action 3
When a servicecage operation is started, the targeted cage goes into servicing mode, illuminating the hot plug LED on the FCAL module (DC1, DC2, DC4) and routing I/O through another path. When the service action is finished, enter the servicecage endfc command to return the cage to normal status. The checkhealth exception is reported if the FCAL module's hot plug LED is illuminated or if the cage is in servicing mode. If a maintenance activity is currently occurring on the drive cage, this condition may be ignored.
  • 38. NOTE: The primary path is marked with an asterisk (*) in the Ports columns of the showpd output.

cli% showcage -d cage1
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 1 cage1 0:0:2     0 1:0:2     0     24 28-40 2.37 2.37 DC2   n/a
-----------Cage detail info for cage1 ---------
Interface Board Info FCAL0     FCAL1
Link A RXLEDs        Green     Off
Link A TXLEDs        Green     Off
Link B RXLEDs        Off       Green
Link B TXLEDs        Off       Green
LED(Loop_Split)      Off       Off
LEDS(system,hotplug) Green,Off Green,Amber
-----------Midplane Info-----------
Firmware_status    Current
Product_Rev        2.37
State              Normal Op
Loop_Split         0
VendorId,ProductId 3PARdata,DC2
Unique_ID          10320300000AD000

cli% showpd -s
Id CagePos Type -State-- -----Detailed_State------
20 1:0:0   FC   degraded disabled_B_port,servicing
21 1:0:1   FC   degraded disabled_B_port,servicing
22 1:0:2   FC   degraded disabled_B_port,servicing
23 1:0:3   FC   degraded disabled_B_port,servicing

cli% showpd -p -cg 1
                                ---Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State    Total  Free   A      B
20 1:0:0   FC         10 degraded 139520 119808 0:0:2* 1:0:2-
21 1:0:1   FC         10 degraded 139520 122112 0:0:2* 1:0:2-
22 1:0:2   FC         10 degraded 139520 119552 0:0:2* 1:0:2-
23 1:0:3   FC         10 degraded 139520 122368 0:0:2* 1:0:2-

Cage Example 4
Component ---------Description--------- Qty
Cage      Cages not on current firmware   1

Component -Identifier- ------Description------
Cage      cage:3       Firmware is not current

Cage Suggested Action 4
Check the drive cage firmware revision using the commands showcage and showcage -d cageX. The showfirmwaredb command displays the current firmware level required for the specific drive cage type.

NOTE: The DC1 and DC3 cages have firmware in the FCAL modules. The DC2 and DC4 cages have firmware on the cage mid-plane. Use the upgradecage command to upgrade the firmware.

cli% showcage
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 2 cage2 2:0:3     0 3:0:3     0     24 29-43 2.37 2.37 DC2   n/a
 3 cage3 2:0:4     0 3:0:4     0     32 29-41 2.36 2.36 DC2   n/a
  • 39. cli% showcage -d cage3
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 3 cage3 2:0:4     0 3:0:4     0     32 29-41 2.36 2.36 DC2   n/a
-----------Cage detail info for cage3 ---------
. . .
-----------Midplane Info-----------
Firmware_status    Old
Product_Rev        2.36
State              Normal Op
Loop_Split         0
VendorId,ProductId 3PARdata,DC2
Unique_ID          10320300000AD100

cli% showfirmwaredb
Vendor   Prod_rev Dev_Id Fw_status Cage_type Firmware_File
...
3PARDATA [2.37]   DC2    Current   DC2       /opt...dc2/lbod_fw.bin-2.37

Cage Example 5
Component -Identifier- ------------Description------------
Cage      cage:4       Interface Card 0, SFP 0 is unqualified

Cage Suggested Action 5
In this example, a 2 Gb/s SFP was installed in a 4 Gb/s drive cage (DC4), and the 2 Gb/s SFP is not qualified for use in this drive cage. For cage problems, the following CLI commands are useful: showcage -d, showcage -sfp, showcage -sfp -ddm, showcage -sfp -d, and showpd -state.

cli% showcage -d cage4
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 4 cage4 2:2:1     0 3:2:1     0      8 30-37 2.37 2.37 DC4   n/a
-----------Cage detail info for cage4 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed         2Gbps  --     --     4Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
   0   0 OK      SIGMA-LINKS                 2.1 No        No      No     Yes
   1   1 OK      FINISAR CORP.               4.1 No        No      No     Yes
Interface Board Info FCAL0     FCAL1
Link A RXLEDs        Green     Off
Link A TXLEDs        Green     Off
Link B RXLEDs        Off       Green
Link B TXLEDs        Off       Green
LED(Loop_Split)      Off       Off
LEDS(system,hotplug) Amber,Off Green,Off
...

cli% showcage -sfp -d cage4
--------Cage 4 FCAL 0 SFP 0--------
Cage ID        : 4
Fcal ID        : 0
SFP ID         : 0
  • 40. State          : OK
Manufacturer   : SIGMA-LINKS
Part Number    : SL5114A-2208
Serial Number  : U260651461
Revision       : 1.4
MaxSpeed(Gbps) : 2.1
Qualified      : No    <<< Unqualified SFP
TX Disable     : No
TX Fault       : No
RX Loss        : No
RX Power Low   : No
DDM Support    : Yes
--------Cage 4 FCAL 1 SFP 1--------
Cage ID        : 4
Fcal ID        : 1
SFP ID         : 1
State          : OK
Manufacturer   : FINISAR CORP.
Part Number    : FTLF8524P2BNV
Serial Number  : PF52GRF
Revision       : A
MaxSpeed(Gbps) : 4.1
Qualified      : Yes
TX Disable     : No
TX Fault       : No
RX Loss        : No
RX Power Low   : No
DDM Support    : Yes

Consistency
Displays inconsistencies between sysmgr and the kernel. The check finds inconsistent and unusual conditions between the system manager and the node kernel. The check requires the hidden -svc -full parameters because it can take 20 minutes or longer on a large system.

Format of Possible Consistency Exception Messages
Consistency --<err>

Consistency Example
Component   -Identifier- --------Description------------------------
Consistency --           Region Mover Consistency Check Failed
Consistency --           CH/LD/VV Consistency Check Failed

Consistency Suggested Action
Gather InSplore data and escalate to HP BSC.

Data Encryption (DAR)
Checks for issues with data encryption. If the system is not licensed for HP 3PAR Data Encryption, no checks are made.
Format of Possible DAR Exception Messages
Dar -- "There are 5 disks that are not self-encrypting"

DAR Suggested Action
Remove the drives that are not self-encrypting from the system because the non-encrypted drives cannot be admitted into a system that is running with data encryption. Also, if the system is not yet enabled for data encryption, the presence of these disks prevents data encryption from being enabled.

DAR Example 2
Dar -- "DAR Encryption key needs backup"

DAR Suggested Action 2
Issue the controlencryption backup command to generate a password-enabled backup file.

Date
Checks the date and time on all nodes.

Format of Possible Date Exception Messages
Date -- "Date is not the same on all nodes"

Date Example
Component -Identifier- -----------Description-----------
Date -- Date is not the same on all nodes

Date Suggested Action
The time on the nodes should stay synchronized whether there is an NTP server or not. Use showdate to see if a node is out of sync. Use the shownet and shownet -d commands to view network and NTP information.

cli% showdate
Node Date
0 2010-09-08 10:56:41 PDT (America/Los_Angeles)
1 2010-09-08 10:56:39 PDT (America/Los_Angeles)

cli% shownet
IP Address Netmask/PrefixLen Nodes Active Speed Duplex AutoNeg Status
192.168.56.209 255.255.255.0 0123 0 100 Full Yes Active
Default route: 192.168.56.1
NTP server : 192.168.56.109
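The comparison of node clocks in the showdate output can also be scripted. The following Python sketch is illustrative only: the function name node_clock_spread is hypothetical, it assumes that all nodes report the same time zone, and it does not reproduce whatever tolerance checkhealth applies internally, because the guide does not state one.

```python
from datetime import datetime


def node_clock_spread(showdate_output: str) -> float:
    """Return the largest difference, in seconds, between node clocks.

    Assumes every node reports the same time zone, so the zone
    abbreviation and region name on each row are ignored.
    """
    stamps = []
    for line in showdate_output.strip().splitlines():
        fields = line.split()
        # Data rows start with a numeric node ID; skip the "Node Date" header.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        stamps.append(
            datetime.strptime(f"{fields[1]} {fields[2]}", "%Y-%m-%d %H:%M:%S")
        )
    if len(stamps) < 2:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds()


# Sample copied from the showdate output above.
sample = """Node Date
0 2010-09-08 10:56:41 PDT (America/Los_Angeles)
1 2010-09-08 10:56:39 PDT (America/Los_Angeles)"""

print(node_clock_spread(sample))  # 2.0
```

A spread of a couple of seconds, as in this sample, can simply reflect the time taken to query each node, so any alerting threshold built on this value is a site-specific choice.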
File
Displays file system conditions that are not optimal. Checks for the following:
• The presence of special files on each node, for example: touch manualstartup
• That the persistent repository (Admin VV) is mounted
• Whether the file systems on any node disk are close to full
• The presence of any HBA core files or user process dumps
• Whether the amount of free node memory is sufficient

Format of Possible File Exception Messages
File node:<node> "Behavior altering file "<file>" exists created on <filetime>"
File node:master "Admin Volume is not mounted"
File node:<node> "Filesystem <filesys> mounted on "<mounted_on>" is over xx% full" (Warnings are given at 80 and 90%.)
File node:<node> "Dump or HBA core files found"
File -- "An online upgrade is in progress"

File Example 1
File node:2 Behavior altering file "manualstartup" exists created on Oct 7 14:16

File Suggested Action 1
After understanding why the file is present, remove the file to prevent unwanted behavior. As root on a node, use the UNIX rm command to remove the file. A known issue is that some undesirable touch files are not detected (bug 45661).

File Example 2
Component -----------Description----------- Qty
File Admin Volume is not mounted 1

File Suggested Action 2
Each node has a file system link so the admin volume can be mounted if the node is the master node. This exception is reported if the link is missing or if the System Manager (sysmgr) is not running at the time. For example, sysmgr may have been restarted manually, because of an error, or during a change of master node. If sysmgr is restarted, it tries to remount the admin volume every few minutes.
Every node should have the following file system link so that the admin volume can be mounted if the node becomes the master node:

root@1001356-1~# onallnodes ls -l /dev/tpd_vvadmin
Node 0:
lrwxrwxrwx 1 root root 12 Oct 23 09:53 /dev/tpd_vvadmin -> tpddev/vvb/0
Node 1:
ls: /dev/tpd_vvadmin: No such file or directory

The corresponding alert when the admin volume is not properly mounted is as follows:

Message Code: 0xd0002
Severity : Minor
Type : PR transition
Message : The PR is currently getting data from the internal drive on node 1, not the admin volume. Previously recorded alerts will not be visible until the PR transitions to the admin volume.

If a link for the admin volume is not present, it can be recreated by rebooting the node.

File Example 3
Component -----------Description----------- Qty
File Nodes with Dump or HBA core files 1

Component ----Identifier----- ----Description------
File node:1 Dump or HBA core files found

File Suggested Action 3
This condition may be transient because the Service Processor (SP) retrieves the files and cleans up the dump directory. If the SP is not gathering the dump files, check the condition and state of the SP.

LD
Checks the following and displays logical disks (LDs) that are not optimal:
• Preserved LDs
• Verifies that current and created availability are the same
• Owner and backup
• Verifies that preserved data space (pdsld) is the same as total data cache
• Size and number of logging LDs

Format of Possible LD Exception Messages
LD ld:<ldname> "LD is not mapped to a volume"
LD ld:<ldname> "LD is in write-through mode"
LD ld:<ldname> "LD has <X> preserved RAID sets and <Y> preserved chunklets"
LD ld:<ldname> "LD has reduced availability. Current: <cavail>, Configured: <avail>"
LD ld:<ldname> "LD does not have a backup"
LD ld:<ldname> "LD does not have owner and backup"
LD ld:<ldname> "Logical Disk is owned by <owner>, but preferred owner is <powner>"
LD ld:<ldname> "Logical Disk is backed by <backup>, but preferred backup is <pbackup>"
LD ld:<ldname> "A logging LD is smaller than 20G in size"
LD ld:<ldname> "Detailed State:<ldstate>" (degraded or failed)
LD -- "Number of logging LD's does not match number of nodes in the cluster"
LD -- "Preserved data storage space does not equal total node's Data memory"
LD Example 1
Component -------Description-------- Qty
LD LDs not mapped to a volume 10

Component -Identifier-- --------Description---------
LD ld:Ten.usr.0 LD is not mapped to a volume

LD Suggested Action 1
Examine the identified LDs using the following CLI commands: showld, showld -d, showldmap, and showvvmap. LDs are normally mapped to (used by) VVs, but they can become disassociated from a VV if the VV is deleted without the underlying LDs being deleted, or by an aborted tune operation. Normally, you would remove the unmapped LD to return its chunklets to the free pool.

cli% showld Ten.usr.0
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
88 Ten.usr.0 0 normal 0/1/2/3 8704 0 V 0 --- N N

cli% showldmap Ten.usr.0
Ld space not used by any vv

LD Example 2
Component -------Description-------- Qty
LD LDs in write through mode 3

Component -Identifier-- --------Description---------
LD ld:Ten.usr.12 LD is in write-through mode

LD Suggested Action 2
Examine the identified LDs for failed or missing disks by using the following CLI commands: showld, showld -d, showldch, and showpd. Write-through mode (WThru) indicates that host I/O operations must be written through to the disk before the host I/O command is acknowledged. This is usually due to a node-down condition, node batteries that are not working, or disk redundancy that is not optimal.

cli% showld Ten*
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
91 Ten.usr.3 0 normal 1/0/3/2 13824 0 V 0 --- N N
92 Ten.usr.12 0 normal 2/3/0/1 28672 0 V 0 --- Y N

cli% showldch Ten.usr.12
Ldch Row Set PdPos Pdid Pdch State Usage Media Sp From To
0 0 0 3:3:0 108 6 normal ld valid N --- ---
11 0 11 --- 104 74 normal ld valid N --- ---
cli% showpd 104
-Size(MB)-- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
104 4:9:0? FC 15 failed 428800 0 ----- -----

LD Example 3
Component ---------Description--------- Qty
LD LDs with reduced availability 1

Component --Identifier-- ------------Description---------------
LD ld:R1.usr.0 LD has reduced availability. Current: ch, Configured: cage

LD Suggested Action 3
LDs are created with certain high-availability characteristics, such as ha-cage. Reduced availability can occur if chunklets in an LD are moved to a location where the current availability (CAvail) is below the desired level of availability (Avail). Chunklets may have been moved manually with movech, moved as part of a tune operation, or relocated during failure conditions such as node, path, or cage failures. The HA levels from highest to lowest are port, cage, mag, and ch (disk).
Examine the identified LDs for failed or missing disks by using the following CLI commands: showld, showld -d, showldch, and showpd. In the example below, the LD should have cage-level availability, but it currently has chunklet (disk) level availability (the chunklets are on the same disk).

cli% showld -d R1.usr.0
Id Name CPG RAID Own SizeMB RSizeMB RowSz StepKB SetSz Refcnt Avail CAvail
32 R1.usr.0 --- 1 0/1/3/2 256 512 1 256 2 0 cage ch

cli% showldch R1.usr.0
Ldch Row Set PdPos Pdid Pdch State Usage Media Sp From To
0 0 0 0:1:0 4 0 normal ld valid N --- ---
1 0 0 0:1:0 4 55 normal ld valid N --- ---

LD Example 4
Component -Identifier-- -----Description-------------
LD -- Preserved data storage space does not equal total node's Data memory

LD Suggested Action 4
Preserved data LDs (pdsld) are created during Out-of-the-Box (OOTB) system initialization and after some hardware upgrades (through the admithw command). The total size of the pdsld should match the total size of all data cache in the storage system (see below).
This message appears if a node is offline, because the LD size and the data cache size then do not match. This message can be ignored unless all nodes are online. If all nodes are online and the error condition persists, determine the cause of the failure. Use the admithw command to correct the condition.

cli% shownode
Control Data Cache
Node --Name--- -State- Master InCluster ---LED--- Mem(MB) Mem(MB) Available(%)
0 1001335-0 OK Yes Yes GreenBlnk 2048 4096 100
1 1001335-1 OK No Yes GreenBlnk 2048 4096 100

cli% showld pdsld*
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
19 pdsld0.0 1 normal 0/1 256 0 P,F 0 --- Y N
20 pdsld0.1 1 normal 0/1 7680 0 P 0 --- Y N
21 pdsld0.2 1 normal 0/1 256 0 P 0 --- Y N
----------------------------------------------------------------------------
3 8192 0

License
Displays license violations.

Format of Possible License Exception Messages
License <feature_name> "License has expired"

License Example
Component -Identifier- --------Description-------------
License -- System Tuner License has expired

License Suggested Action
Request a new or updated license from your Sales Engineer.

Network
Displays Ethernet issues for the administrative and Remote Copy over IP (RCIP) networks that have been logged in the previous 24 hours. Also reports if the storage system has fewer than two nodes with working administrative Ethernet connections.
• Checks the number of collisions in the previous day's log. The number of collisions should be less than 5% of the total packets for the day.
• Checks for Ethernet errors and transmit (TX) or receive (RX) errors in the previous day's log.

Format of Possible Network Exception Messages
Network -- "IP address change has not been completed"
Network "Node<node>:<type>" "Errors detected on network"
Network "Node<node>:<type>" "There is less than one day of network history for this node"
Network -- "No nodes have working admin network connections"
Network -- "Node <node> has no admin network link detected"
Network -- "Nodes <nodelist> have no admin network link detected"
Network -- "checkhealth was unable to determine admin link status"

Network Example 1
Network -- "IP address change has not been completed"
Network Suggested Action 1
The setnet command was issued to change a network parameter, such as the IP address, but the action has not completed. Use setnet finish to complete the change, or setnet abort to cancel. Use the shownet command to examine the current condition.

cli% shownet
IP Address Netmask/PrefixLen Nodes Active Speed Duplex AutoNeg Status
192.168.56.209 255.255.255.0 0123 0 100 Full Yes Changing
192.168.56.233 255.255.255.0 0123 0 100 Full Yes Unverified

Network Example 2
Component ---Identifier---- -----Description----------
Network Node0:Admin Errors detected on network

Network Suggested Action 2
Network errors have been detected on the specified node and network interface. Commands such as shownet and shownet -d are useful for troubleshooting network problems. These commands display the current network counters, whereas checkhealth shows errors from the last logging sample.
NOTE: The error counters shown by shownet and shownet -d cannot be cleared except by rebooting a controller node. Because checkhealth shows network counters from a history log, checkhealth stops reporting the issue if there is no increase in errors in the next log entry.

cli% shownet -d
IP Address: 192.168.56.209 Netmask 255.255.255.0
Assigned to nodes: 0123
Connected through node 0
Status: Active
Admin interface on node 0
MAC Address: 00:02:AC:25:04:03
RX Packets: 1225109 TX Packets: 550205
RX Bytes: 1089073679 TX Bytes: 568149943
RX Errors: 0 TX Errors: 0
RX Dropped: 0 TX Dropped: 0
RX FIFO Errors: 0 TX FIFO Errors: 0
RX Frame Errors: 60 TX Collisions: 0
RX Multicast: 0 TX Carrier Errors: 0
RX Compressed: 0 TX Compressed: 0

Node
Checks the following node conditions and displays nodes that are not optimal:
• Verifies node batteries have been tested in the last 30 days
• Offline nodes
• Power supply and battery problems
The following checks are performed only if the -svc option is used.
• Checks for symmetry of components between nodes such as Control-Cache and Data-Cache size, OS version, bus speed, and CPU speed
• Checks if diagnostics such as ioload are running on any of the nodes
• Checks for stuck threads, such as I/O operations that cannot complete
Format of Possible Node Exception Messages
Node node:<nodeID> "Node is not online"
Node node:<nodeID> "Power supply <psID> detailed state is <status>"
Node node:<nodeID> "Power supply <psID> AC state is <acStatus>"
Node node:<nodeID> "Power supply <psID> DC state is <dcStatus>"
Node node:<nodeID> "Power supply <psID> battery is <batStatus>"
Node node:<nodeID> "Node <nodeID> battery is <batStatus>"
Node node:<priNodeID> "<bat> has not been tested within the last 30 days"
Node node:<nodeID> "Node <nodeID> battery is expired"
Node node:<nodeID> "Power supply <psID> is expired"
Node node:<nodeID> "Fan is <fanID> is <status>"
Node node:<nodeID> "Power supply <psID> fan module <fanID> is <status>"
Node node:<nodeID> "Fan module <fanID> is <status>"
Node node:<nodeID> "Detailed State <state>" (degraded or failed)

The following checks are performed when the -svc option is used:
Node -- "BIOS version is not the same on all nodes"
Node -- "Control memory is not the same on all nodes"
Node -- "Data memory is not the same on all nodes"
Node -- "CPU Speed is not the same on all nodes"
Node -- "CPU Bus Speed is not the same on all nodes"
Node -- "HP 3PAR OS version is not the same on all nodes"
Node node:<nodenum> "Flusher speed set incorrectly to: <speed>" (should be 0)
Node node:<nodenum> "Environmental factor <factor> is <state>" (DDR2, Node), (UNDER LIMIT, OVER LIMIT)
Node node:<node> "Ioload is running"
Node node:<node> "Node has less than 100MB of free memory"
Node node:<node> "BIOS skip mask is <skip_mask>"
Node node:<node> "quo_cex_flags are not set correctly"
Node node:<node> "clus_upgr_group state is not set correctly"
Node node:<node> "clus_upgr_state is not set correctly"
Node node:<node> "Process <processID> has reached 90% of maximum size"
Node node:<node> "VV <vvID> has outstanding <command> with a maximum wait time of <sleeptime>"
Node -- "There is at least one active servicenode operation in progress"

Node Suggested Action
For node error conditions, examine the node and node-component states by using the following commands: shownode, shownode -s, shownode -d, showbattery, and showsys -d.

Node Example 1
Component -Identifier- ---------------Description----------------
Node node:0 Power supply 1 detailed state is DC Failed
Node node:0 Power supply 1 DC state is Failed
Node node:1 Power supply 0 detailed state is AC Failed
Node node:1 Power supply 0 AC state is Failed
Node node:1 Power supply 0 DC state is Failed

Node Suggested Action 1
Examine the states of the power supplies with commands such as shownode, shownode -s, and shownode -ps. Turn on or replace the failed power supply.
NOTE: In the example below, the battery state is considered degraded because the power supply has failed.

cli% shownode
Control Data Cache
Node --Name--- -State-- Master InCluster ---LED--- Mem(MB) Mem(MB) Available(%)
0 1001356-0 Degraded Yes Yes AmberBlnk 2048 8192 100
1 1001356-1 Degraded No Yes AmberBlnk 2048 8192 100

cli% shownode -s
Node -State-- -Detailed_State-
0 Degraded PS 1 Failed
1 Degraded PS 0 Failed

cli% shownode -ps
Node PS -Serial- -PSState- FanState ACState DCState -BatState- ChrgLvl(%)
0 0 FFFFFFFF OK OK OK OK OK 100
0 1 FFFFFFFF Failed -- OK Failed Degraded 100
1 0 FFFFFFFF Failed -- Failed Failed Degraded 100
1 1 FFFFFFFF OK OK OK OK OK 100

Node Example 2
Component -Identifier- ---------Description------------
Node node:3 Power supply 1 battery is Failed

Node Suggested Action 2
Examine the state of the battery and power supply by using the following commands: shownode, shownode -s, shownode -ps, showbattery (and showbattery with -d, -s, -log). Turn on, fix, or replace the battery backup unit.
NOTE: The condition of the degraded power supply is caused by the failing battery. The degraded PS state is not the expected behavior. This issue will be fixed in a future release (bug 46682).

cli% shownode
Control Data Cache
Node --Name--- -State-- Master InCluster ---LED--- Mem(MB) Mem(MB) Available(%)
2 1001356-2 OK No Yes GreenBlnk 2048 8192 100
3 1001356-3 Degraded No Yes AmberBlnk 2048 8192 100

cli% shownode -s
Node -State-- -Detailed_State-
2 OK OK
3 Degraded PS 1 Degraded

cli% shownode -ps
Node PS -Serial- -PSState- FanState ACState DCState -BatState- ChrgLvl(%)
2 0 FFFFFFFF OK OK OK OK OK 100
2 1 FFFFFFFF OK OK OK OK OK 100
3 0 FFFFFFFF OK OK OK OK OK 100
3 1 FFFFFFFF Degraded OK OK OK Failed 0

cli% showbattery
Node PS Bat Serial -State-- ChrgLvl(%) -ExpDate-- Expired Testing
3 0 0 100A300B OK 100 07/01/2011 No No
3 1 0 12345310 Failed 0 04/07/2011 No No
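When many nodes are involved, the shownode -ps output shown in Node Examples 1 and 2 can be filtered by script. The following Python sketch is only an illustration: the function name non_ok_power_supplies is hypothetical, the column order is taken from the sample output above, and treating "--" as "not reported" rather than as a fault is an assumption.

```python
def non_ok_power_supplies(shownode_ps_output: str) -> list[tuple[int, int, list[str]]]:
    """Return (node, ps, [columns that are not OK]) for each problem row."""
    # Column order as it appears in the shownode -ps sample output.
    columns = ["PSState", "FanState", "ACState", "DCState", "BatState"]
    problems = []
    for line in shownode_ps_output.strip().splitlines():
        fields = line.split()
        # Data rows start with a numeric node ID; skip the header row.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        states = fields[3:8]
        # Assumption: "--" means the value is not reported, not a fault.
        bad = [name for name, state in zip(columns, states)
               if state not in ("OK", "--")]
        if bad:
            problems.append((int(fields[0]), int(fields[1]), bad))
    return problems


# Sample copied from Node Example 1.
sample = """Node PS -Serial- -PSState- FanState ACState DCState -BatState- ChrgLvl(%)
0 0 FFFFFFFF OK OK OK OK OK 100
0 1 FFFFFFFF Failed -- OK Failed Degraded 100
1 0 FFFFFFFF Failed -- Failed Failed Degraded 100
1 1 FFFFFFFF OK OK OK OK OK 100"""

for node, ps, bad in non_ok_power_supplies(sample):
    print(f"node {node} PS {ps}: {', '.join(bad)}")
```

For the Node Example 1 sample, this reports power supply 1 on node 0 and power supply 0 on node 1, which matches the exceptions that checkhealth lists for that example.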
Node Example 3
Component -Identifier- --------------Description----------------
Node node:3 Node:3, Power Supply:1, Battery:0 has not been tested within the last 30 days

Node Suggested Action 3
The indicated battery has not been tested in 30 days. A node backup battery is tested every 14 days under normal conditions. If the main battery is missing, expired, or failed, the backup battery is not tested. A backup battery connected to the same node is not tested because testing it can cause loss of power to the node. An untested battery has an unknown status in the showbattery -s output. Use the following commands: showbattery, showbattery -s, and showbattery -d.

cli% showbattery -s
Node PS Bat -State-- -Detailed_State-
0 0 0 OK normal
0 1 0 Degraded Unknown

Examine the date of the last successful test of that battery. Assuming the current date was 2009-10-14, the last battery test on Node 0, PS 1, Bat 0 was 2009-09-10, which is more than 30 days ago.

cli% showbattery -log
Node PS Bat Test Result Dur(mins) ---------Time----------
0 0 0 0 Passed 1 2009-10-14 14:34:50 PDT
0 0 0 1 Passed 1 2009-10-28 14:36:57 PDT
0 1 0 0 Passed 1 2009-08-27 06:17:44 PDT
0 1 0 1 Passed 1 2009-09-10 06:19:34 PDT

cli% showbattery
Node PS Bat Serial -State-- ChrgLvl(%) -ExpDate-- Expired Testing
0 0 0 83205243 OK 100 04/07/2011 No No
0 1 0 83202356 Degraded 100 04/07/2011 No No

Node Example 4
Component ---Identifier---- -----Description----------
Node node:0 Ioload is running

Node Suggested Action 4
This output appears only if the -svc option of checkhealth is used. The output is not displayed for the non-service check. It indicates that a disk diagnostic stress test is running on the node, and the test can affect node performance. After the HP 3PAR Storage System is installed, diagnostic stress tests exercise the disks for up to two hours following the initial setup (OOTB). If the stress test is detected within three hours of the initial setup, disregard the warning. If the test is detected later, it may have been started manually. Investigate the operation and contact HP Support.
From a node's root login prompt, check the UNIX processes for ioload:

root@1001356-0 Tue Nov 03 13:37:31:~# onallnodes ps -ef |grep ioload
root 13384 1 2 13:36 ttyS0 00:00:01 ioload -n -c 2 -t 20000 -i 256 -o 4096 /dev/tpddev/pd/100

PD
Displays physical disks with states or conditions that are not optimal:
• Checks for failed and degraded PDs
• Checks for an imbalance of PD ports, for example, if Port-A is used on more disks than Port-B
• Checks for an Unknown sparing algorithm
• Checks for disks experiencing a high number of IOPS
• Reports if a servicemag operation is outstanding (servicemag status)
• Reports if there are PDs that do not have entries in the firmware DB file

Format of Possible PD Exception Messages
PD disk:<pdid> "Degraded States: <showpd -s -degraded>"
PD disk:<pdid> "Failed States: <showpd -s -failed>"
PD -- "There is an imbalance of active PD ports"
PD -- "Sparing algorithm is not set"
PD disk:<pdid> "Disk is experiencing a high level of I/O per second: <iops>"
PD -- "There is at least one active servicemag operation in progress"

The following checks are performed when the -svc option is used, or on 7400/7200 hardware:
PD File: <filename> "Folder not found on all Nodes in <folder>"
PD File: <filename> "Folder not found on some Nodes in <folder>"
PD File: <filename> "File not found on all Nodes in <folder>"
PD File: <filename> "File not found on some Nodes in <folder>"
PD Disk:<pdID> "<pdmodel> PD for cage type <cagetype> in cage position <pos> is missing from firmware database"

PD Example 1
Component -------------------Description------------------- Qty
PD PDs that are degraded or failed 40

Component -Identifier- ---------------Description-----------------
PD disk:48 Detailed State: missing_B_port,loop_failure
PD disk:49 Detailed State: missing_B_port,loop_failure
...
PD disk:107 Detailed State: failed,notready,missing_A_port

PD Suggested Action 1
Both degraded and failed disks are reported.
When an FC path to a drive cage is not working, all disks in the cage have a degraded state due to the non-redundant condition. To further diagnose, use the following commands: showpd, showpd -s, showcage, showcage -d, and showport -sfp.

cli% showpd -degraded -failed
----Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
48 3:0:0 FC 10 degraded 139520 115200 2:0:4* -----
49 3:0:1 FC 10 degraded 139520 121344 2:0:4* -----
…
107 4:9:3 FC 15 failed 428800 0 ----- 3:2:1*

cli% showpd -s -degraded -failed
Id CagePos Type -State-- -----------------Detailed_State--------------
48 3:0:0 FC degraded missing_B_port,loop_failure
49 3:0:1 FC degraded missing_B_port,loop_failure
…
107 4:9:3 FC failed prolonged_not_ready,missing_A_port,relocating

cli% showcage -d cage3
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
3 cage3 2:0:4 0 --- 0 32 28-39 2.37 2.37 DC2 n/a
-----------Cage detail info for cage3 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed 2Gbps -- -- 0Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0 0 OK SIGMA-LINKS 2.1 No No No Yes
1 1 OK SIGMA-LINKS 2.1 No No Yes Yes
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Off
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Green,Off Green,Off
-------------Drive Info------------- ----LoopA----- ----LoopB-----
Drive NodeWWN LED Temp(C) ALPA LoopState ALPA LoopState
0:0 20000014c3b3eab9 Green 34 0xe1 OK 0xe1 Loop fail
0:1 20000014c3b3e708 Green 36 0xe0 OK 0xe0 Loop fail

PD Example 2
Component --Identifier-- --------------Description---------------
PD -- There is an imbalance of active pd ports

PD Suggested Action 2
The primary and secondary I/O paths for disks (PDs) are balanced between nodes. The primary path is indicated in the showpd -path output and by an asterisk in the showpd output. An imbalance of active ports is usually caused by a nonfunctional path/loop to a cage, or because
an odd number of drives is installed or detected. To further diagnose, use the following commands: showpd, showpd -path, showcage, and showcage -d.

cli% showpd
----Size(MB)----- ----Ports----
Id CagePos Type Speed(K) State Total Free A B
0 0:0:0 FC 10 normal 139520 119040 0:0:1* 1:0:1
1 0:0:1 FC 10 normal 139520 121600 0:0:1 1:0:1*
2 0:0:2 FC 10 normal 139520 119040 0:0:1* 1:0:1
3 0:0:3 FC 10 normal 139520 119552 0:0:1 1:0:1*
...
46 2:9:2 FC 10 normal 139520 112384 2:0:3* 3:0:3
47 2:9:3 FC 10 normal 139520 118528 2:0:3 3:0:3*
48 3:0:0 FC 10 degraded 139520 115200 2:0:4* -----
49 3:0:1 FC 10 degraded 139520 121344 2:0:4* -----
50 3:0:2 FC 10 degraded 139520 115200 2:0:4* -----
51 3:0:3 FC 10 degraded 139520 121344 2:0:4* -----

cli% showpd -path
-----------Paths-----------
Id CagePos Type -State-- A B Order
0 0:0:0 FC normal 0:0:1 1:0:1 0/1
1 0:0:1 FC normal 0:0:1 1:0:1 1/0
2 0:0:2 FC normal 0:0:1 1:0:1 0/1
3 0:0:3 FC normal 0:0:1 1:0:1 1/0
...
46 2:9:2 FC normal 2:0:3 3:0:3 2/3
47 2:9:3 FC normal 2:0:3 3:0:3 3/2
48 3:0:0 FC degraded 2:0:4 3:0:4missing 2/-
49 3:0:1 FC degraded 2:0:4 3:0:4missing 2/-
50 3:0:2 FC degraded 2:0:4 3:0:4missing 2/-
51 3:0:3 FC degraded 2:0:4 3:0:4missing 2/-

cli% showcage -d cage3
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
3 cage3 2:0:4 0 --- 0 32 29-41 2.37 2.37 DC2 n/a
-----------Cage detail info for cage3 ---------
Fibre Channel Info PortA0 PortB0 PortA1 PortB1
Link_Speed 2Gbps -- -- 0Gbps
----------------------------------SFP Info-----------------------------------
FCAL SFP -State- --Manufacturer-- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0 0 OK SIGMA-LINKS 2.1 No No No Yes
1 1 OK SIGMA-LINKS 2.1 No No Yes Yes
Interface Board Info FCAL0 FCAL1
Link A RXLEDs Green Off
Link A TXLEDs Green Off
Link B RXLEDs Off Off
Link B TXLEDs Off Green
LED(Loop_Split) Off Off
LEDS(system,hotplug) Green,Off Green,Off
...
-------------Drive Info------------- ----LoopA----- ----LoopB-----
Drive NodeWWN LED Temp(C) ALPA LoopState ALPA LoopState
0:0 20000014c3b3eab9 Green 35 0xe1 OK 0xe1 Loop fail
0:1 20000014c3b3e708 Green 38 0xe0 OK 0xe0 Loop fail
0:2 20000014c3b3ed17 Green 35 0xdc OK 0xdc Loop fail
0:3 20000014c3b3dabd Green 30 0xda OK 0xda Loop fail
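The port imbalance described in PD Suggested Action 2 can be seen by counting which port carries the asterisk (primary path) in the showpd output. The following Python sketch is illustrative: the function name count_primary_ports is hypothetical, the column positions are taken from the sample output above, and the guide does not state how large a difference checkhealth treats as an imbalance.

```python
def count_primary_ports(showpd_output: str) -> tuple[int, int]:
    """Count disks whose primary path (marked with '*') is port A or port B."""
    a_primary = b_primary = 0
    for line in showpd_output.strip().splitlines():
        fields = line.split()
        # Data rows start with a numeric PD ID and have nine columns;
        # this skips the header rows and the "..." elision line.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        port_a, port_b = fields[7], fields[8]
        if port_a.endswith("*"):
            a_primary += 1
        if port_b.endswith("*"):
            b_primary += 1
    return a_primary, b_primary


# A subset of the showpd rows from PD Example 2.
sample = """Id CagePos Type Speed(K) State Total Free A B
0 0:0:0 FC 10 normal 139520 119040 0:0:1* 1:0:1
1 0:0:1 FC 10 normal 139520 121600 0:0:1 1:0:1*
2 0:0:2 FC 10 normal 139520 119040 0:0:1* 1:0:1
3 0:0:3 FC 10 normal 139520 119552 0:0:1 1:0:1*
48 3:0:0 FC 10 degraded 139520 115200 2:0:4* -----
49 3:0:1 FC 10 degraded 139520 121344 2:0:4* -----"""

print(count_primary_ports(sample))  # (4, 2)
```

In this subset, the disks in cage 3 have lost their B path, so all of them use port A as the primary path and the counts drift apart.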
PD Example 3
Component -------------------Description------------------- Qty
PD Disks experiencing a high level of I/O per second 93

Component --Identifier-- ---------Description----------
PD disk:100 Disk is experiencing a high level of I/O per second: 789.0

PD Suggested Action 3
This check samples the I/O per second (IOPS) information in statpd to see if any disks are being overworked, and then it samples again after five seconds. This does not necessarily indicate a problem, but it could negatively affect system performance. The IOPS thresholds currently set for this condition are:
• NL disks < 75
• FC 10K RPM disks < 150
• FC 15K RPM disks < 200
• SSD < 1500
Operations such as servicemag and tunevv can cause this condition. If the IOPS rate is very high, or a large number of disks are experiencing very heavy I/O, examine the system further using statistical monitoring commands and utilities such as statpd, the OS MC (GUI), and System Reporter. The following example shows a report for a disk whose current total I/O rate is 150 I/Os per second or more.

cli% statpd -filt curs,t,iops,150
14:51:49 11/03/09 r/w I/O per second KBytes per sec ... Idle %
ID Port Cur Avg Max Cur Avg Max ... Cur Avg
100 3:2:1 t 658 664 666 172563 174007 174618 ... 6 6

PD Example 4
Component --Identifier-- -------Description----------
PD disk:3 Detailed State: old_firmware

PD Suggested Action 4
The identified disk does not have firmware that the storage system considers current. When a disk is replaced, the servicemag operation should upgrade the disk's firmware. When disks are installed or added to a system, the admithw command can perform the firmware upgrade. Check the state of the disk by using CLI commands such as showpd -s, showpd -i, and showfirmwaredb.
cli% showpd -s 3
Id CagePos Type -State-- -Detailed_State-
3 0:4:0 FC degraded old_firmware

cli% showpd -i 3
Id CagePos State ----Node_WWN---- --MFR-- ---Model--- -Serial- -FW_Rev-
3 0:4:0 degraded 200000186242DB35 SEAGATE ST3146356FC 3QN0290H XRHJ

cli% showfirmwaredb
Vendor Prod_rev Dev_Id Fw_status Cage_type
...
SEAGATE [XRHK] ST3146356FC Current DC2.DC3.DC4
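The old_firmware state in PD Example 4 follows from comparing the FW_Rev reported by showpd -i (XRHJ) with the Prod_rev that showfirmwaredb marks as Current for the same model (XRHK). The following Python sketch shows that comparison under stated assumptions: the function names are hypothetical, vendor names are assumed to be a single word, and plain string equality stands in for whatever comparison the storage system actually performs.

```python
def parse_firmware_db(showfirmwaredb_output: str) -> dict[str, str]:
    """Map Dev_Id to the Prod_rev marked Current, without the square brackets."""
    current = {}
    for line in showfirmwaredb_output.strip().splitlines():
        fields = line.split()
        # Keep rows of the form: Vendor [rev] Dev_Id Current ...
        # Assumption: the vendor name contains no spaces.
        if len(fields) >= 4 and fields[1].startswith("[") and fields[3] == "Current":
            current[fields[2]] = fields[1].strip("[]")
    return current


def has_current_firmware(model: str, fw_rev: str, firmware_db: dict[str, str]) -> bool:
    """True only if the model is in the database and the revisions match exactly."""
    return firmware_db.get(model) == fw_rev


# Sample copied from the showfirmwaredb output above.
db_sample = """Vendor Prod_rev Dev_Id Fw_status Cage_type
...
SEAGATE [XRHK] ST3146356FC Current DC2.DC3.DC4"""

firmware_db = parse_firmware_db(db_sample)
print(has_current_firmware("ST3146356FC", "XRHJ", firmware_db))  # False
```

A model that is absent from the parsed database also returns False here, which loosely corresponds to the "missing from firmware database" exception rather than to old_firmware, so a real script would want to distinguish the two cases.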
PD Example 5
Component --Identifier-- -------Description----------
PD -- Sparing Algorithm is not set

PD Suggested Action 5
Check the system's Sparing Algorithm value by using the CLI command showsys -param. The value is normally set during the initial installation (OOTB). If it must be set later, use the command setsys SparingAlgorithm; valid values are Default, Minimal, Maximal, and Custom. After setting the parameter, use the admithw command to programmatically create and distribute the spare chunklets.

cli% showsys -param
System parameters from configured settings
----Parameter----- --Value--
RawSpaceAlertFC : 0
RawSpaceAlertNL : 0
RemoteSyslog : 0
RemoteSyslogHost : 0.0.0.0
SparingAlgorithm : Unknown

PD Example 6
Component --Identifier-- -------Description----------
PD Disk:32 ST3400755FC PD for cage type DC3 in cage position 2:0:0 is missing from the firmware database

PD Suggested Action 6
Check the release notes for mandatory updates and patches. Install updates and patches to HP 3PAR OS as needed to support the PD in the cage.

PDCH
Checks for Physical Disk Chunklets (PDCH) with states that are not optimal:
• Verifies that chunklets are not used by multiple LDs
• Checks for media-failed chunklets
• Verifies that LD ownership is the same as the physical connection

Format of Possible PDCH Exception Messages
pdch ch:<pdid> "Chunklet is on a remote disk"
pdch LD:<ldid> "LD has <count> remote chunklets"
pdch LD:<ldid> "Connection path is not the same as LD ownership"
pdch ch:<initpdid>:<initpdpos> "Chunklet used previously by multiple LDs"
pdch ch:<initpdid>:<initpdpos> "Chunklet used previously by LD <ldname> (ch: <id>: <pdch>), currently used by LD <ldname>"

PDCH Example 1
Component ------------Description------------ Qty
pdch LDs with chunklets on a remote disk 3

Component -Identifier- -------Description--------
pdch ld:19 LD has 3 remote chunklets
pdch ld:20 LD has 90 remote chunklets
pdch ld:21 LD has 3 remote chunklets

PDCH Suggested Action 1
If the message LD has remote chunklets is for Preserved Data LDs (pdslds), the warnings can be ignored. See KB Solution 14550 for details. In the example above, LDs 19, 20, and 21 are pdslds, as the showld command shows:

cli% showld
Id Name RAID -Detailed_State- Own SizeMB UsedMB Use Lgct LgId WThru MapV
19 pdsld0.0 1 normal 1/0 256 0 P,F 0 --- Y N
20 pdsld0.1 1 normal 1/0 7680 0 P 0 --- Y N
21 pdsld0.2 1 normal 1/0 256 0 P 0 --- Y N

PDCH Example 2
Component -------------------Description------------------- Qty
pdch LDs with connection path different than ownership 23
pdch LDs with chunklets on a remote disk 18

Component -Identifier- ---------------Description--------------
pdch LD:35 Connection path is not the same as LD ownership
pdch ld:35 LD has 1 remote chunklet

PDCH Suggested Action 2
The primary I/O paths for disks are balanced between the two nodes that are physically connected to the drive cage. The node with the primary path to a disk is considered the owning node. If the path of the secondary node must be used for I/O to the disk, that I/O is considered remote I/O.
These messages usually indicate a node-to-cage FC path problem because the disks (chunklets) are being accessed through their secondary path. The messages are usually a byproduct of other conditions, such as drive-cage, node-port, or FC-loop problems, and need to be investigated.
If a node is offline due to a service action, such as a hardware or software upgrade, these exceptions can be ignored until the action is finished and the node is back online.
In this example, LD 35, with a name of R1.usr.3, is owned (Own) by nodes 3/2/0/1, and the primary/secondary physical paths to the disks (chunklets) in this LD are from nodes 3 and 2.
However, because the FC path (Port B) from node 3 to PD 91 is failed or missing, node 2 is performing the I/O to PD 91. When the path from node 3 to cage 3 is fixed (N:S:P 3:0:4 in this example), the condition should disappear.
cli% showld
Id Name     RAID -Detailed_State- Own     SizeMB UsedMB Use Lgct LgId WThru MapV
35 R1.usr.3    1 normal           3/2/0/1    256    256 V      0 ---  N     Y
cli% showldch R1.usr.3
Ldch Row Set PdPos Pdid Pdch State  Usage Media Sp From To
   0   0   0 2:2:3   63    0 normal ld    valid N  ---  ---
   1   0   0 3:8:3   91    0 normal ld    valid N  ---  ---
cli% showpd 91 63
                                  ----Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State     Total   Free   A      B
63 2:2:3   FC         10 normal    139520  124416 2:0:3* 3:0:3
91 3:8:3   FC         10 degraded  139520  124416 2:0:4* -----
cli% showpd -s -failed -degraded
Id CagePos Type -State-- ---------------Detailed_State----------
91 3:8:3   FC   degraded missing_B_port,loop_failure
cli% showcage
Id Name  LoopA Pos.A LoopB Pos.B Drives Temp  RevA RevB Model Side
 2 cage2 2:0:3     0 3:0:3     0     24 29-42 2.37 2.37 DC2   n/a
 3 cage3 2:0:4     0 -----     0     32 28-40 2.37 2.37 DC2   n/a
Normal condition (after fixing):
cli% showpd 91 63
                                  ----Size(MB)---- ----Ports----
Id CagePos Type Speed(K) State     Total   Free   A      B
63 2:2:3   FC         10 normal    139520  124416 2:0:3* 3:0:3
91 3:8:3   FC         10 normal    139520  124416 2:0:4  3:0:4*
Port
Checks for the following port connection issues:
• Ports in unacceptable states
• Mismatches in type and mode, such as hosts connected to initiator ports, or host and Remote Copy over Fibre Channel (RCFC) ports configured on the same FC adapter
• Degraded SFPs and those with low power; this check is performed only if the FC adapter type uses SFPs
Format of Possible Port Exception Messages
Port port:<nsp> "Port mode is in <mode> state"
Port port:<nsp> "is offline"
Port port:<nsp> "Mismatched mode and type"
Port port:<nsp> "Port is <state>"
Port port:<nsp> "SFP is missing"
Port port:<nsp> "SFP is <state>" (degraded or failed)
Port port:<nsp> "SFP is disabled"
Port port:<nsp> "Receiver Power Low: Check FC Cable"
Port port:<nsp> "Transmit Power Low: Check FC Cable"
Port port:<nsp> "SFP has TX fault"
Port Suggested Actions
Some specific examples are displayed below, but in general, use the following CLI commands to check for port SFP errors: showport, showport -sfp, showport -sfp -ddm, showcage, showcage -sfp, and showcage -sfp -ddm.
Port Example 1
Component ------Description------ Qty
Port      Degraded or failed SFPs   1
Component -Identifier- --Description--
Port      port:0:0:2   SFP is Degraded
  • 58. Port Suggested Action 1 An SFP in a node-port is reporting a degraded condition. This is most often caused by the SFP receiver circuit detecting a low signal level (RX Power Low), and usually caused by a cable with poor or contaminated FC connection. An alert can identify the following condition: Port 0:0:2, SFP Degraded (Receiver Power Low: Check FC Cable) Check SFP statistics using CLI commands such as showport -sfp, showport -sfp -ddm, showcage. cli% showport -sfp N:S:P -State-- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM 0:0:1 OK FINISAR_CORP. 2.1 No No No Yes 0:0:2 Degraded FINISAR_CORP. 2.1 No No No Yes In the following example an RX power level of 361 microwatts (uW) for Port 0:0:1 DDM is a good reading; and 98 uW for Port 0:0:2 is a weak reading (< 100 uW). Normal RX power level readings are 200-400 uW. cli% showport -sfp -ddm --------------Port 0:0:1 DDM-------------- -Warning- --Alarm-- --Type-- Units Reading Low High Low High Temp C 41 -20 90 -25 95 Voltage mV 3217 2900 3700 2700 3900 TX Bias mA 7 2 14 1 17 TX Power uW 330 79 631 67 631 RX Power uW 361 15 794 10 1259 --------------Port 0:0:2 DDM-------------- -Warning- --Alarm-- --Type-- Units Reading Low High Low High Temp C 40 -20 90 -25 95 Voltage mV 3216 2900 3700 2700 3900 TX Bias mA 7 2 14 1 17 TX Power uW 335 79 631 67 631 RX Power uW 98 15 794 10 1259 cli% showcage Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side 0 cage0 0:0:1 0 1:0:1 0 15 33-38 08 08 DC3 n/a 1 cage1 --- 0 1:0:2 0 15 30-38 08 08 DC3 n/a cli% showpd -s Id CagePos Type -State-- -Detailed_State- 1 0:2:0 FC normal normal ... 13 1:1:0 NL degraded missing_A_port 14 1:2:0 FC degraded missing_A_port cli% showpd -path ---------Paths--------- Id CagePos Type -State-- A B Order 1 0:2:0 FC normal 0:0:1 1:0:1 0/1 ... 13 1:1:0 NL degraded 0:0:2missing 1:0:2 1/- 14 1:2:0 FC degraded 0:0:2missing 1:0:2 1/- 58 Troubleshooting
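The RX power comparison described in Port Suggested Action 1 can be sketched as a small script. This is a minimal illustration, not an HP tool: the function name weak_rx_ports is hypothetical, the 100 uW threshold is the "weak reading" figure quoted above, and the parser assumes the output of showport -sfp -ddm is laid out exactly like the sample in this guide.

```python
import re

# Threshold taken from the guide: readings below 100 uW are called weak.
WEAK_RX_UW = 100


def weak_rx_ports(ddm_output: str, threshold_uw: int = WEAK_RX_UW) -> dict[str, int]:
    """Return {port: rx_power_uw} for every port whose RX Power is below the threshold.

    Hypothetical helper; assumes the text layout shown in the guide's sample.
    """
    weak = {}
    port = None
    for line in ddm_output.splitlines():
        header = re.search(r"Port (\d+:\d+:\d+) DDM", line)
        if header:
            port = header.group(1)  # remember which port the next rows belong to
            continue
        reading = re.match(r"\s*RX Power\s+uW\s+(\d+)", line)
        if reading and port is not None:
            rx = int(reading.group(1))
            if rx < threshold_uw:
                weak[port] = rx
    return weak


# Sample rows copied from the showport -sfp -ddm output in this guide.
SAMPLE = """\
--------------Port 0:0:1 DDM--------------
RX Power uW 361 15 794 10 1259
--------------Port 0:0:2 DDM--------------
RX Power uW 98 15 794 10 1259
"""

print(weak_rx_ports(SAMPLE))
```

A port that shows up in the result is a candidate for cleaning or replacing the FC cable, as described above.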
  • 59. Port Example 2 Component -Description- Qty Port Missing SFPs 1 Component -Identifier- -Description-- Port port:0:3:1 SFP is missing Port Suggested Action 2 FC node-ports that normally contain SFPs will report an error if the SFP has been removed. The condition can be checked using the showport -sfp command. In this example, the SFP in 0:3:1 has been removed from the adapter: cli% showport -sfp N:S:P -State- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM 0:0:1 OK FINISAR_CORP. 2.1 No No No Yes 0:0:2 OK FINISAR_CORP. 2.1 No No No Yes 0:3:1 - - - - - - - 0:3:2 OK FINISAR_CORP. 2.1 No No No Yes Port Example 3 Component -Description- Qty Port Disabled SFPs 1 Component -Identifier- --Description-- Port port:3:5:1 SFP is disabled Port Suggested Action 3 A node-port SFP will be disabled if the port has been placed offline using the controlport offline command. See Example 4. cli% showport -sfp N:S:P -State- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM 3:5:1 OK FINISAR_CORP. 4.1 Yes No No Yes 3:5:2 OK FINISAR_CORP. 4.1 No No No Yes Port Example 4 Component -Description- Qty Port Offline ports 1 Component -Identifier- --Description-- Port port:3:5:1 is offline Port Suggested Action 4 Check the state of the port with showport. If a port is offline, it is deliberately put in the particular state by using the controlport offline command. Offline ports can be restored using controlport rst. cli% showport N:S:P Mode State ----Node_WWN---- -Port_WWN/HW_Addr- Type 3:5:1 target offline 2FF70002AC00054C 23510002AC00054C free Troubleshooting Storage System Components 59
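The SFP conditions in Port Examples 1 through 3 can be read directly from the showport -sfp columns. The sketch below is an assumption-laden illustration: classify_sfps is a hypothetical name, the mapping from columns to messages is inferred from the examples in this guide rather than taken from checkhealth itself, and manufacturer names are assumed to contain no spaces.

```python
import re


def classify_sfps(showport_sfp_output: str) -> dict[str, str]:
    """Map each problem port (N:S:P) to the SFP condition suggested by its row.

    Hypothetical helper; the rules below are inferred from the guide's examples.
    """
    findings = {}
    for line in showport_sfp_output.splitlines():
        fields = line.split()
        # Data rows have eight columns and start with a numeric N:S:P position.
        if len(fields) != 8 or not re.fullmatch(r"\d+:\d+:\d+", fields[0]):
            continue
        nsp, state, _maker, _speed, tx_disable, tx_fault, _rx_loss, _ddm = fields
        if state == "-":
            findings[nsp] = "SFP is missing"      # all columns are dashes
        elif tx_disable == "Yes":
            findings[nsp] = "SFP is disabled"     # port placed offline
        elif tx_fault == "Yes":
            findings[nsp] = "SFP has TX fault"
        elif state != "OK":
            findings[nsp] = f"SFP is {state}"     # for example, Degraded
    return findings


# Rows combined from the showport -sfp samples in Port Examples 1 to 3.
SAMPLE = """\
N:S:P -State- -Manufacturer- MaxSpeed(Gbps) TXDisable TXFault RXLoss DDM
0:0:1 OK FINISAR_CORP. 2.1 No No No Yes
0:0:2 Degraded FINISAR_CORP. 2.1 No No No Yes
0:3:1 - - - - - - -
3:5:1 OK FINISAR_CORP. 4.1 Yes No No Yes
"""

print(classify_sfps(SAMPLE))
```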
  • 60. Port Example 5 Component ------------Description------------ Qty Port Ports with mismatched mode and type 1 Component -Identifier- ------Description------- Port port:2:0:3 Mismatched mode and type Port Suggested Action 5 The output indicates that the port's mode, such as an initiator or target, is not correct for the connection type, such as disk, host, iSCSI or RCFC. Useful CLI command include: showport, showport -c, showport -par, showport -rcfc, showcage. cli% showport N:S:P Mode State ----Node_WWN---- -Port_WWN/HW_Addr- Type 2:0:1 initiator ready 2FF70002AC000591 22010002AC000591 disk 2:0:2 initiator ready 2FF70002AC000591 22020002AC000591 disk 2:0:3 target ready 2FF70002AC000591 22030002AC000591 disk 2:0:4 target loss_sync 2FF70002AC000591 22040002AC000591 free Component -Identifier- ------Description------- Port port:0:1:1 Mismatched mode and type cli% showport N:S:P Mode State ----Node_WWN---- -Port_WWN/HW_Addr- Type 0:1:1 initiator ready 2FF70002AC000190 20110002AC000190 rcfc 0:1:2 initiator loss_sync 2FF70002AC000190 20120002AC000190 free 0:1:3 initiator loss_sync 2FF70002AC000190 20130002AC000190 free 0:1:4 initiator loss_sync 2FF70002AC000190 20140002AC000190 free Port Example 6 Component -----------Description------------------------- Qty Port Ports with increasing CRC error counts 2 Component -Identifier- ------Description----------- Port port:3:2:1 Port or devices attached to port have experienced CRC _______________________errors within the last day Port Suggested Action 6 Check the fibre channel error counters for the port using the CLI commands showportlesb single and showportlesb hist. Devices with high InvCRC values are receiving bad packets from an upstream device (disk, HBA, SFP, or cable). 
cli% showportlesb single 3:2:1
ID      ALPA ----Port_WWN---- LinkFail LossSync LossSig  InvWord InvCRC
<3:2:1> 0x1  23210002AC00054C    20697  2655432   20700 37943749   1756
pd107   0xa3 2200001D38C28AA3        0      157       0     1129      0
pd106   0xa5 2200001D38C0D01E        0      279       0     1551      0
Port Example 7
Component -----------Description------------------------- Qty
Port      Ports with increasing CRC error counts            2
Component -Identifier- ------Description-----------
Port      port:2:2:1   CRC errors have been increasing by more than one per day over the past week
Port Suggested Action 7
Check the Fibre Channel error counters for the port using the CLI commands showportlesb single and showportlesb hist. The message "CRC errors have been increasing … over the past week" comes from a check of the daily port-LESB history, as seen in showportlesb hist. After the error condition is corrected, checkhealth port may continue to report the error until the next daily update is stored. It should stop reporting the error within 24 hours after the CRC counter stops increasing.
cli% showportlesb single 3:2:1
ID      ALPA ----Port_WWN---- LinkFail LossSync LossSig  InvWord InvCRC
<3:2:1> 0x1  23210002AC00054C    20697  2655432   20700 37943749   1756
pd107   0xa3 2200001D38C28AA3        0      157       0     1129      0
pd106   0xa5 2200001D38C0D01E        0      279       0     1551      0
Port CRC
Checks for increasing FC port CRC errors.
Format of Possible Port CRC Exception Messages
Port port:<nsp> "There is less than one week of LESB history for this port"
Port port:<nsp> "Port or devices attached to port have experienced CRC errors within the last day"
Port port:<nsp> "CRC errors have been increasing by more than one per day over the past week"
Port CRC Example
FC port CRC errors are detected on the specified port. The command showportlesb hist 1:5:1 is useful for troubleshooting FC port CRC problems. This command displays the current network counters and the recent log entries that checkhealth uses to evaluate and report errors.
Port PELCRC
Checks for increasing SAS port CRC errors.
Format of Possible PELCRC Exception Messages
Portcrc:<nsp> "There is less than one week of PEL history for this port"
Port port:<nsp> "Port or devices attached to port have experienced CRC errors within the last day"
Port port:<nsp> "CRC errors have been increasing by more than <maxCRC> per day over the last two days"
Port PELCRC Example
SAS port CRC errors are detected on the specified port. The command showportpel hist 1:5:1 is useful for troubleshooting SAS port CRC problems. This command displays the current network counters and the recent log entries that checkhealth uses to evaluate and report errors.
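The CRC trend messages above ("more than one per day over the past week" for FC, "more than <maxCRC> per day over the last two days" for SAS) can be illustrated with a short sketch. The exact evaluation that checkhealth performs is not documented here, so this is one plausible reading: the hypothetical function crc_increasing takes one cumulative CRC counter sample per day and reports whether every day in the window grew by more than the allowed amount.

```python
from typing import Optional


def crc_increasing(daily_counts: list[int], days: int, max_per_day: int = 1) -> Optional[bool]:
    """Return True if the counter grew by more than max_per_day on each of the last `days` days.

    daily_counts holds cumulative CRC counter samples, oldest first, one per day.
    Returns None when there is not enough history for the requested window.
    This interpretation of the checkhealth messages is an assumption.
    """
    if len(daily_counts) < days + 1:
        return None  # comparable to "There is less than one week of ... history"
    window = daily_counts[-(days + 1):]
    deltas = [later - earlier for earlier, later in zip(window, window[1:])]
    return all(delta > max_per_day for delta in deltas)


# Illustrative counter values only; they are not taken from a real system.
fc_history = [0, 5, 9, 14, 20, 30, 33, 40]
print(crc_increasing(fc_history, days=7))                 # FC rule: one week, more than 1 per day
print(crc_increasing([1, 2, 3], days=2, max_per_day=1))   # SAS-style rule over two days
```

A counter that stops increasing produces deltas of zero, which matches the statement that the report clears once the next daily samples are stored.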
  • 62. RC Checks for the following Remote Copy issues. • Remote Copy targets • Remote Copy links • Remote Copy Groups and VVs Format of Possible RC Exception Messages RC rc:<name> "All links for target <name> are down but target not yet marked failed." RC rc:<name> "Target <name> has failed." RC rc:<name> "Link <name> of target <target> is down." RC rc:<name> "Group <name> is not started to target <target>." RC rc:<vvname> "VV <vvname> of group <name> is stale on target <target>." RC rc:<vvname> "VV <vvname> of group <name> is not synced on target <target>." RC Example Component -Description- Qty RC Stale volumes 1 Component --Identifier--- ---------Description--------------- RC rc:yush_tpvv.rc VV yush_tpvv.rc of group yush_group.r1127 is stale on target S400_Async_Primary. RC Suggested Action Perform remote copy troubleshooting such as checking the physical links between the storage system. Useful CLI commands are showrcopy, showrcopy -d, showport -rcip, showport -rcfc, shownet -d, controlport rcip ping. SNMP Displays issues with SNMP. Attempts the showsnmpmgr command and reports errors if the CLI returns an error. Format of Possible SNMP Exception Messages SNMP -- <err> SNMP Example Component -Identifier- ----------Description--------------- SNMP -- Could not obtain snmp agent handle. Could be misconfigured. SNMP Suggested Action Any error message that can be produced by showsnmpmgr can display. SP Checks the status of the Ethernet connection between the SP and nodes. The Ethernet connection can only be checked from the SP because it performs a short Ethernet transfer check between the SP and the storage system. 62 Troubleshooting
  • 63. Format of Possible SP Exception Messages Network SP->InServ "SP ethernet Stat <stat> has increased too quickly check SP network settings" SP Example Component -Identifier- --------Description------------------------ SP ethernet "State rx_errs has increased too quickly check SP network settings" SP Suggested Action The <stat> variable can be any of the following: rx_errs, rx_dropped, rx_fifo, rx_frame, tx_errs, tx_dropped, tx_fifo. This message is usually caused by customer network issues, but may be caused by conflicting or mismatching network settings between the SP, customer switch(es), and the storage system. Check the SP network interface settings using SPmaint or SPOCC. Check the storage system settings by using commands such as shownet and shownet -d. Task Displays failed tasks. Checks for any tasks that have failed within the past 24 hours. This is the default time frame for the showtask -failed command. Format of Possible Task Exception Messages Task Task:<Taskid> "Failed Task" Task Example Component --Identifier--- -------Description-------- Task Task:6313 Failed Task In this example, checkhealth also showed an alert. The task failed because the command is entered with a syntax error: Alert sw_task:6313 Task 6313 (type 'background_command', name 'upgradecage -a -f') has failed (Task Failed). Please see task status for details. Task Suggested Action The CLI command showtask -d Task_id displays detailed information about the task. To clean up the alerts and the reporting of checkhealth, you can delete the failed-task alerts. The alerts are not auto-resolved and remain until they are manually removed with the MC (GUI) or CLI with removealert or setalert ack. To display system-initiated tasks, use showtask -all. cli% showtask -d 6313 Id Type Name Status Phase Step 6313 background_command upgradecage -a -f failed --- --- Troubleshooting Storage System Components 63
  • 64. Detailed status is as follows: 2010-10-22 10:35:36 PDT Created task. 2010-10-22 10:35:36 PDT Updated Executing "upgradecage -a -f" as 0:12109 2010-10-22 10:35:36 PDT Errored upgradecage: Invalid option: -f VLUN Displays host agent inactive and non-reported virtual LUNs (VLUNs). Also reports VLUNs that have been configured but are not currently being exported to hosts or host-ports. Format of Possible VLUN Exception Messages vlun vlun:(<vvID>, <lunID>, <hostname>)"Path to <wwn> is not reported by host agent" vlun vlun:(<vvID>, <lunID>, <hostname>)"Path to <wwn> is not is not seen by host" vlun vlun:(<vvID>, <lunID>, <hostname>) "Path to <wwn> is failed" vlun host:<hostname> "Host <ident>(<type>):<connection> is not connected to a port" VLUN Example Component ---------Description--------- Qty vlun Hosts not connected to a port 1 Component -----Identifier----- ---------Description-------- vlun host:cs-wintec-test1 Host wwn:10000000C964121D is not connected to a port VLUN Suggested Action Check the export status and port status for the VLUN and HOST by using CLI commands: showvlun, showvlun -pathsum, showhost, showhost pathsum, showport, servicehost list. cli% showvlun -host cs-wintec-test1 Active VLUNs Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type 2 BigVV cs-wintec-test1 10000000C964121C 2:5:1 host ----------------------------------------------------------- 1 total VLUN Templates Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type 2 BigVV cs-wintec-test1 ---------------- --- host cli% showhost cs-wintec-test1 Id Name Persona -WWN/iSCSI_Name- Port 0 cs-wintec-test1 Generic 10000000C964121D --- 10000000C964121C 2:5:1 cli% servicehost list HostName -WWN/iSCSI_Name- Port host0 10000000C98EC67A 1:1:2 host1 210100E08B289350 0:5:2 Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type 2 BigVV cs-wintec-test1 10000000C964121D 3:5:1 unknown VV Displays Virtual Volumes (VV) that are not optimal. Checks for abnormal state of VVs and Common Provisioning Groups (CPG). 
  • 65. Format of Possible VV Exception Messages VV vv:<vvname> "IO to this volume will fail due to no_stale_ss policy" VV vv:<vvname> "Volume has reached snapshot space allocation limit" VV vv:<vvname> "Volume has reached user space allocation limit" VV vv:<vvname> "VV has expired" VV vv:<vvname> "Detailed State: <state>" (failed or degraded) VV cpg:<cpg> "CPG is unable to grow SA (or SD) space" VV Suggested Action Check status by using CLI commands such as showvv, showvv -d, and showvv -cpg. Troubleshooting Storage System Setup If you are unable to access the SP setup wizard, SP, or the Storage System Setup wizard: 1. Collect the SmartStart log files. See “Collecting SmartStart Log Files” (page 71). 2. Collect the SP log files. See “Collecting Service Processor Log Files” (page 71). 3. Contact HP support and request support for your HP 3PAR StoreServ 7000 Storage system. See “Contacting HP Support about System Setup” (page 72). Storage System Setup Wizard Errors This section describes possible error messages that may display while using the Storage System Setup Wizard. Common error strings that appear in multiple places • The specified system is currently in the storage system initialization process. Only one initialization process can run at one time. This message displays when the wizards of two users try to initialize the same storage system on the same SP. Only one wizard can initialize a storage system. Two options are available when this error displays in a dialog box; you can click Retry or Cancel. When the error does not display in a dialog box, look for another SP by serial number or wait a while and try again later. • Could not communicate with the server. Make sure you are currently connected to the network. This message displays when the client computer that is running the wizard cannot communicate with the SP, such as when network connectivity is lost. The error can occur for one of the following reasons: ◦ Network connectivity is lost. 
◦ The SP is no longer running.
◦ The SP is not plugged into the network.
◦ The SP IP address has been changed.
• Could not communicate with the storage system. Make sure it is running and connected to the network.
This message can display if the storage system running HP 3PAR OS loses network connectivity, either because it has been unplugged or because it has gone down for some other reason.
  • 66. This message displays either in a dialog box or inline. If the message displays in a dialog box, you can click Retry or Cancel in the wizard. If the message appears inline, you can only click Next in the wizard. • Setup encountered an unknown error ({0}). Contact HP support for help. This message displays in a dialog box with Retry and Cancel buttons, where {0} is the error number. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). Errors that appear on the Enter System to Setup page • Unable to execute the command. All required data was not sent to the SP server. Contact HP support for help. This message displays as an inline error on the bottom of the wizard page. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • No uninitialized storage system with the specified serial number could be found. Make sure the SP is on the same network as the specified storage system. This message displays as an inline error on the bottom of the wizard page. In order for the Storage System Setup Wizard to work, the storage system must be on the same network as the SP, and you must type in the serial number of the storage system in order for the SP to find it. If either of these conditions is not met, this error message displays. Verify that the serial number you entered for the SP is correct, and then do one of the following: ◦ Move the SP or storage system so that they are on the same network. ◦ Use a different SP to set up the storage system. • Unable to gather the storage system information. Make sure the specified storage system is running HP 3PAR OS 3.1.2 or later. For more help, contact HP support. This message displays as an inline error on the bottom of the wizard page. The error might be caused by a defect in the Storage System Setup Wizard code or by unexpected information being returned in the CLI. 
For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The SP encountered an unknown error while finding the specified storage system. Contact HP support for help. This message displays as an inline error on the bottom of the wizard page. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The SP does not have a suitable HP 3PAR OS version installed for the specified storage system. Use SPOCC to install HP 3PAR OS version {0}. This message displays as an inline error on the bottom of the wizard page. The SP needs to have the same Major.Minor.Patch TPD package as the storage system’s HP 3PAR OS. If the package is not the same, then the SP cannot communicate with the HP 3PAR OS. 66 Troubleshooting
  • 67. {0} will be the version of the TPD package that the user must install so that the SP will work with the storage system. • The SP does not have an HP 3PAR OS version installed. Use SPOCC to install an HP 3PAR OS package. This message displays as an inline error on the bottom of the wizard page when no TPD package is installed. The SP needs a TPD package installed in order to communicate with an HP 3PAR StoreServ Storage system. Error strings specific to the prepare storage system progress step The following errors occur during the Progress and Results page: • The storage system has not yet discovered all the drive types. Make sure there are no cage problems. This error message displays in a dialog box with Retry and Cancel buttons. It occurs when the HP 3PAR StoreServ Storage is unable to determine all the drive types that are connected to the cage. Wait for about 5 minutes for drive discovery to complete. If the error persists, contact HP Support. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The storage system has not yet discovered all the drive positions. Make sure there are no cage problems. Wait for about 5 minutes for drive position discovery to complete. If the error persists, contact HP Support. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). Error strings specific to the check hardware health progress step • The storage system found an error while checking node health. Details are listed below. {0} appears to be offline. Make sure the node is plugged in all the way and powered on. This error message displays in a dialog box with Retry and Cancel buttons. {0} is the name of the node that appears to be offline. Turn the storage system on and make sure the node is plugged into the backplane. • The storage system found an error while checking node health. Details are listed below. 
This error message displays in a dialog box with Retry and Cancel buttons. {0} is the port location with the problem. Make sure the port is plugged into the node. • The storage system found an error while checking port health. Details are listed below. Port {0} appears to be offline. This error message displays in a dialog box with Retry and Cancel buttons. Information listed below the message is the CLI output for the checkhwconfig command, which occurs when the SP does not recognize the command, allowing you to see the output. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The storage system found an error while checking port health. Details are listed below. This error message displays in a dialog box with Retry and Cancel buttons. {0} is the location of the port with the problem. Troubleshooting Storage System Setup 67
  • 68. • The storage system found an error while checking cabling health. Details are listed below. This error message displays in a dialog box with Retry and Cancel buttons. The message is followed by a list of errors. The errors may include: ◦ Cage {0} is connected to the same node twice through ports {1} and {2}. Re-cable this cage. This error displays if a cage is connected to the same node twice. {0} will be the name of the cage and {1} and {2} will be the port locations where the cage is connected. Re-cable the cage using best practices. ◦ Cage {0} appears to be missing a connection to a node. It does have a connection on port {1}. Connect the loop pair. This message displays if a cage is connected to only one node. {0} will be the name of the cage, and {1} will be the single port to which the cage is connected. Re-cable the cage using best practices. ◦ Cage {0} is not connected to the same slot and port on the nodes it is connected to. Re-cable this cage. This message displays if a cage is connected to different slots, ports, and nodes. {0} will be the name of the cage with the problem. Re-cable the cage using best practices. • The storage system found an error while checking cabling health. Details are listed below. This error message displays in a dialog box with Retry and Cancel buttons. Information listed below the message is the CLI output for the checkhwconfig command, which occurs when the SP does not recognize the command, allowing you to see the output. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The storage system found an error while checking cage health. The firmware upgrade succeeded, but cage {0} has not come back. Contact HP support for help. This error message displays in a dialog box with Retry and Cancel buttons. This error might occur after the drive cages have had a firmware upgrade. {0} will be the name of the cage with the problem. 
Although the firmware upgrade may have succeeded, this error might occur if the cage does not boot back up. Contact HP Support. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The storage system found an error while checking cage health. Details are listed below. This error message displays in a dialog box with Retry and Cancel buttons. Information listed below the message is the CLI output for the checkhwconfig command, which occurs when the SP does not recognize the command, allowing you to see the output. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The storage system found an error while checking cage health. There is a problem with a drive cage that has had a firmware upgrade. Cage 68 Troubleshooting
  • 69. {0} did not come back after the firmware upgrade. Contact HP support for help. This error message displays in a dialog box with Retry and Cancel buttons. This error might occur after the drive cages have had a firmware upgrade. {0} will be the name of the cage with the problem. Contact HP Support. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • The storage system found an error while checking disk health. Details are listed below. This error message displays in a dialog box with Retry and Cancel buttons. Information listed below the message is the CLI output for the checkhwconfig command, which occurs when the SP does not recognize the command, allowing you to see the output. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). Error Strings specific to network progress step • Unable to set the storage system network configuration. The storage system's admin volume has not been created. This must be created before any networking information is set. Contact HP support for help. This message displays in a dialog box with Retry and Cancel buttons. This error occurs if a previous command failed and the wizard did not detect the error, or if the system is rebooted for any reason during the installation. Click Cancel to close the wizard, and then begin the setup process again. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • Unable to set the storage system network configuration. An invalid name was specified. A storage system name must start with an alphanumeric character followed by any combination of the following characters: a-z, A-Z, 0-9, period (.), hyphen (-), or underscore (_). This message displays in a dialog box with Retry and Cancel buttons. 
A storage system name must contain at least 6 characters, must begin with an alphanumeric character, and must include at least one of each of the following characters: lowercase letters (a-z); uppercase letters (A-Z); numbers (0-9); and a period (.), a hyphen (-), or an underscore (_). Click Cancel to close the wizard, and then begin the setup process again.
• Unable to set the storage system network configuration. An invalid IPv4 address was specified.
This message displays in a dialog box. The error occurs if the storage system detects that the defined IPv4 address is invalid. Click Back and specify a valid IPv4 address.
• Unable to set the storage system network configuration. An invalid subnet was specified.
This message displays in a dialog box. The error occurs if the storage system detects that the defined subnet address is invalid. Click Back and specify a valid subnet address.
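The name, IPv4 address, and subnet rules quoted in the messages above can be pre-checked before the values are typed into the wizard. The following is a minimal sketch under stated assumptions: precheck is a hypothetical helper, the name pattern implements only the rule quoted in the error message (an alphanumeric first character followed by letters, digits, periods, hyphens, or underscores), and the address checks rely on Python's standard ipaddress module rather than on whatever validation the storage system itself performs.

```python
import ipaddress
import re

# Rule quoted in the wizard error message: alphanumeric first character,
# then any combination of a-z, A-Z, 0-9, period, hyphen, or underscore.
NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def precheck(name: str, address: str, netmask: str) -> list[str]:
    """Return a list of problems found; an empty list means nothing was flagged.

    Hypothetical helper; it does not contact the SP or the storage system.
    """
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("invalid name")
    try:
        ipaddress.IPv4Address(address)
    except ValueError:
        problems.append("invalid IPv4 address")
    try:
        # Attaching the mask to a dummy network lets the module validate it.
        ipaddress.IPv4Network(f"0.0.0.0/{netmask}")
    except ValueError:
        problems.append("invalid subnet")
    return problems


# Illustrative values only.
print(precheck("store-01.lab", "192.168.10.20", "255.255.255.0"))
print(precheck("_bad", "192.168.10.256", "255.255.0.255"))
```

Conditions that depend on the live network, such as an address already in use by another machine, cannot be detected by an offline check like this one.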
  • 70. • Unable to set the storage system network configuration. An invalid IPv4 gateway was specified. This message displays in a dialog box. The error occurs if the storage system detects that the defined IPv4 gateway address is invalid. Click Back and specify a valid IPv4 gateway address. • Unable to set the storage system network configuration. The specified IPv4 gateway address is not reachable by using the specified storage system IPv4 address. This message displays in a dialog box. The error occurs if the storage system detects that the defined IPv4 gateway address could not be reached. Click Back and specify a valid IPv4 gateway address. If the error persists, contact HP Support. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72). • Unable to set the storage system network configuration. The storage system IPv4 address cannot be the same as the IPv4 gateway. This message displays in a dialog box. The error occurs if the storage system detects that the defined IPv4 gateway address is the same as the configured IPv4 address. Click Back and specify a different address for the IPv4 gateway address. • Unable to set the storage system network configuration. The specified address is already in use by another machine. This message displays in a dialog box. The error occurs if the storage system detects that the defined IPv4 address is already in use by another machine. Click Back and specify a different IPv4 address. • Unable to set the storage system network configuration. The storage system could not be reached at the new IP address. Make sure your network settings are configured correctly. This error message displays in a dialog box with Retry and Cancel buttons. This error displays when the SP is unable to reach the storage system at the new IP address. Click Cancel to close the wizard, and then begin the setup process again. • Unable to set the storage system network configuration. 
The storage system did not recognize its new IP address as being validated.
This error message displays in a dialog box with Retry and Cancel buttons. This error displays when the SP reaches the storage system at the new IP address, but the storage system fails to recognize that the SP was able to do so. Click Back and specify a valid IP address. If the error persists, contact HP Support. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72).
Error strings specific to the time setup progress step
• Unable to set the storage system NTP server. An invalid address was specified.
This error message displays in a dialog box. This error displays if the storage system detects that the NTP address is invalid.
  Click Cancel to close the wizard, and then begin the setup process again.
• Unable to set the storage system NTP server. The storage system's admin volume has not been created. This must be created before any networking information is created. Contact HP support for help.
  This error message displays in a dialog box with Retry and Cancel buttons. This error occurs if a previous command failed and the wizard did not detect the error, or if the system was rebooted for any reason during installation. Click Cancel to close the wizard, and then begin the setup process again. For information about contacting HP Support, see “Contacting HP Support about System Setup” (page 72).
• Unable to set the storage system time zone. An invalid time zone was specified.
  This error message displays in a dialog box. This error occurs if the storage system detects that an unfamiliar time zone was selected. Click Back and specify a valid time zone.
• Unable to set the storage system time zone. The storage system saw the time zone as invalid.
  This error message displays in a dialog box. This error occurs if the storage system detects that an unfamiliar time zone was selected. Click Back and specify a valid time zone.
• Unable to set the storage system time. An invalid time was specified.
  This error message displays in a dialog box. This error occurs if the storage system detects that an invalid time was specified. Click Back and specify a valid time.
• Unable to set the storage system time. The storage system saw the time as invalid.
  This error message displays in a dialog box. This error occurs if the storage system detects that an invalid time was specified. Click Back and specify a valid time.

Collecting SmartStart Log Files
To collect the SmartStart log files for HP support, zip all the files in this folder: C:\Users\<username>\SmartStart\log.
NOTE: You can continue to access the SmartStart log files in the Users folder after you have removed SmartStart from your system.

Collecting Service Processor Log Files
To collect the SP log files for HP support:
1. Connect to SPOCC.
2. Type the SP IP address in a browser.
3. From the navigation pane, click Files.
4. Click the folder icons for files > syslog > apilogs.
5. In the Action column, click Download for each log file:
   Service Processor setup log: SPSETLOG.log
   Storage System setup log: ARSETLOG.system_serial_number.log
   General errors: errorLog.log
6. Zip the downloaded log files.

Contacting HP Support about System Setup
For worldwide technical support information, see the HP support website:
http://www.hp.com/support
Before contacting HP about accessing the SP Setup wizard or the Storage System Setup Wizard, collect the following information:
• SmartStart log files
• SP log files
• Product model names and numbers
• Technical support registration number (if applicable)
• Product serial numbers
• Error messages
• Operating system type and revision level
• Detailed questions
When contacting HP, specify that you are requesting support for your HP 3PAR StoreServ 7000 Storage system.
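The "zip all the files" step in the log collection procedures above can be scripted. The following is a minimal sketch, not an HP-supplied tool: the function name `zip_log_folder` and the archive name in the usage note are hypothetical, and the only detail taken from this guide is the SmartStart log folder location.

```python
import zipfile
from pathlib import Path


def zip_log_folder(log_dir, archive_path):
    """Zip every file under log_dir and return the archived names."""
    log_dir = Path(log_dir)
    archive_path = Path(archive_path)
    # List the files before creating the archive so that an archive written
    # inside log_dir is not added to itself.
    files = sorted(
        p for p in log_dir.rglob("*")
        if p.is_file() and p.resolve() != archive_path.resolve()
    )
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to the log folder, with forward slashes.
            name = path.relative_to(log_dir).as_posix()
            archive.write(path, arcname=name)
            names.append(name)
    return names
```

On a Windows host where the user profile is C:\Users\<username>, a call such as `zip_log_folder(Path.home() / "SmartStart" / "log", "smartstart_logs.zip")` would cover the SmartStart folder named above. The same function can be pointed at the folder holding the downloaded SP log files.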
6 CBIOS Error Codes

LED Blink Codes
The two stacked LEDs located on the face of each node in the chassis are used by the CBIOS to communicate status and error conditions. The following are normal conditions (in sequence):

Table 19 LED Blink Codes
Status                                             Bottom    Top
Low level CPU initialization complete              off       amber
High level CPU and SRAM initialization complete    amber     off
Main memory scrubbed                               green1    off
Low 1 MB tested                                    green2    off

If a critical failure (fatal condition) is detected during CBIOS initialization or during system operation, CBIOS stops the node at a fatal error:

Status                                             Bottom    Top
Fatal error (node may be hot-plugged)              Amber     Amber

In the table above, if a number follows a color, this indicates the flash rate as follows:
• 1 - slow flash, 1 Hz (1 flash per second)
• 2 - normal flash, 2 Hz (2 flashes per second)
• 3 - fast flash, 3 Hz (3 flashes per second)
• 3p - fast flash, 3 Hz (3 flashes per second, then pause for 1 second)

A critical failure can be remedied. Simply enter Ctrl+W (^W) on the node's serial console, then reboot. Although discouraged, if ^C is entered during a fatal error, the BIOS will attempt to resume after the point of the error.

WARNING! Entering ^W on the serial console of a node that is part of a cluster will abruptly remove the node from the cluster.

NOTE: The most likely cause for a critical failure is failed or incompatible hardware in the user’s system.

InForm OS Failed Error Codes and Resolution
These hardware error codes occur while the InForm OS is running, not in CBIOS mode. Each error presents the following data when triggered: code, sub-code, and a message. The table below includes this information, as well as background information on the problem’s most likely cause. The Description column also includes the steps to take to remedy the malfunction, if anything can be done.

Failed Alerts

Subcode: 0 - 0x0 (0)
Description: This is not actually a node hardware or software initialization or test failure. This code should never occur; if it is seen, it suggests corruption of the PROM log.
Resolution: Contact 3PAR technical support.

Subcode: 1 - 0x1 (0)
Description: Bad or unknown CPU ID (non-Intel). The BIOS is unable to fully identify the processor. This sub-code indicates the CPUID string is not "GenuineIntel".
Resolution:
A) Replace the processor.
B) Try moving the processor to the other CPU socket. It could be a single socket problem.
C) Try moving the processor to another system. It could be node hardware or software.
D) Replace the node motherboard.

Subcode: 1 - 0x2 (0)
Description: Each class of CPU has a list of technology features it supports. If this error occurs, it is because the CPU is severely down-rev, the CPU is bad, or the motherboard is bad.
Resolution:
A) Replace the processor.
B) Try moving the processor to the other CPU socket. It could be a single socket problem.
C) Try moving the processor to another system. It could be node hardware or software.

Subcode: 1 - 0x3 (0)
Description: Each run of CPU has a major revision and a minor stepping number. If you receive this message, the processor has not yet been verified by 3PAR for reliable operation. If this is a new processor, it may be acceptable to press ^C to resume after this error. If you are testing a new stepping of the processor and need to use it, use the following Whack command to ignore an unknown CPUID:
Whack> set perm cpu_unqual_ok
Resolution:
A) Upgrade to the latest CBIOS to ensure newer certified processors are acceptable.
B) Replace the processor with one certified by 3PAR for use with the board.

Subcode: 1 - 0x4 (0)
Description: If more than one processor is installed, both CPUs must be certified to operate in multiprocessor mode. This error indicates that the bootstrap processor was found not to be certified to run in multiprocessor mode.
Resolution: See Code 1, sub-code 0x3 for resolution information.

Subcode: 1 - 0x5 (0)
Resolution: See Code 1, sub-code 0x3 for resolution information.

Subcode: 1 - 0x6 (0)
Description: This is an internal CBIOS consistency check error. If you see this error, most likely processor execution out of flash is not stable. The CPU identification is performed after the flash is fully CRC verified, so this error is likely the result of a failing CPU or transient bus operation.
Resolution:
A) Replace the processor.
B) Re-flash the CBIOS (no need to upgrade).

Subcode: 1 - 0x7 (0)
Description: This is another internal CBIOS consistency check error. Before each block of update microcode is uploaded to the Pentium, a checksum on it is first verified. If this checksum is not valid, the block is rejected with this error.
Resolution: See Code 1, sub-code 0x3 for resolution information.

Subcode: 1 - 0x8 (0)
Description: The processor has rejected the microcode update. This could have any number of causes, but is likely due to a failing processor. At this point a strong 64-bit CRC has been run successfully across the BIOS and a checksum for each update line has also passed.
Resolution: See Code 1, sub-code 0x4 for resolution information.

Subcode: 1 - 0x9 (0)
Description: The BIOS was not able to locate a microcode update for this particular processor, yet it is listed as a CPU which requires a microcode update. This is likely due to use of an unqualified processor.
Resolution: See Code 1, sub-code 0x4 for resolution information.

Subcode: 1 - 0xa (0)
Description: The processor has failed its own built-in self test. This strongly indicates that the processor is at fault.
Resolution:
A) Replace the processor.
B) Replace both processor VRM modules.
C) Replace the node motherboard.

Subcode: 1 - 0xb (0)
Description: The two processors in the system board do not have the same bus clock multiplier. The likely cause is that the processors are of different clock speeds (or, less likely, minor steppings). The "First CPU" as written above is the bootstrap CPU. On a PIII board, the bootstrap CPU (CPU3) is to the right, nearest the PromJet interface.
Resolution:
A) Remove both heatsinks and verify the processors are rated for the same clock speed and bus multiplier.
B) Replace each processor individually.
C) Replace the node motherboard.

Subcode: 1 - 0xc (0)
Description: This CPU does not support clock multiplier changes
In the supported configuration, the two CPUs present in the node must run at the same clock speed. If the BIOS detects CPUs which have different clock multipliers, it automatically configures all CPUs to use the highest common clock multiplier. If a CPU's multiplier cannot be changed, this fatal error results.
Resolution: See Code 1, sub-code 0x4 for resolution information.

Subcode: 1 - 0xd (0)
Description: Desired clock multiplier xx is too high for this CPU
This error indicates the CPU does not support a clock multiplier the BIOS is attempting to set.
Resolution: See Code 1, sub-code 0xc for resolution information.

Subcode: 1 - 0xe (0)
Description: Desired clock multiplier xx is illegal for this CPU
See Code 1, sub-code 0xd for information on this error.

Subcode: 3 - 0x0 (0)
Description: During initialization, memory areas are tested before they are used. SRAM is used by the processor for persistent storage during early initialization and the CPU memory tests. This sub-code indicates that the SRAM walking bits test has failed and that the onboard SRAM may not be reliable.
Resolution:
A) Power down, wait 30 seconds, and power up. This problem is unlikely to be a one-time occurrence, so it will probably recur.
B) Replace the node motherboard.

Subcode: 3 - 0x1 (0)
Description: After SRAM contents have been updated with the BIOS static data, a test is performed to ensure the data arrived intact. If it did not, this error is generated. The error could indicate an SRAM failure with the same conditions as above.
Resolution: See Code 3, sub-code 0x0 for resolution information.

Subcode: 4 - 0x1 (0)
Description: The SDRAM DIMMs located on the motherboard are used for main CPU memory and are critical to the proper operation of a node. Even before the memory is thoroughly tested for proper operation, it must be configured to appear in CPU-addressable space. Each DIMM has a small embedded serial EEPROM which holds DIMM configuration information such as the number of rows, columns, and banks, as well as memory timing. If this serial EEPROM becomes corrupt, the DIMM configuration data stored in it cannot be trusted, so the EEPROM also contains a checksum which the BIOS verifies before configuring the DIMM. If this checksum does not match the checksum the BIOS computes across the DIMM, this error results.
The minor code reported is the total count of errors for the DIMM.
Resolution:
A) Replace the defective CPU DIMM with an identical one.
B) If an identical one is not available, replace the CPU DIMM pair.
See Code 15 for more resolution information.

Subcode: 4 - 0x2 (0)
Description: This error indicates that a CPU memory DIMM was detected but that the EEPROM present on the DIMM could not be reliably read. The read operation is done through I2C.
Resolution: See Code 4 above for resolution information.

Subcode: 4 - 0x4 (0)
Description: This error indicates the BIOS detected that the CPU SDRAM DIMMs in the bank pair are of a different type.
Resolution: Ensure both DIMMs in the pair are identical. Note that two DIMMs may have the same capacity but a different number of rows, columns, or banks. The DIMM configuration must match exactly. If the DIMMs have the same manufacturer, markings, and capacity, they are probably identical. See Code 15 for more resolution information.

Subcode: 4 - 0x8 (0)
Description: This error indicates the value the DIMM reports for refresh is not valid (greater than the maximum refresh counter).
Resolution: See Code 4 above for resolution information.

Subcode: 4 - 0x10 (0)
Description: This error indicates the values the DIMM reports for rows, columns, and banks do not correspond to any known configuration for a valid DIMM. It is possible that the DIMM EEPROM data has become corrupt or that the DIMM is of a higher capacity than is currently supported.
Resolution: See Code 4 above for resolution information.

Subcode: 4 - 0x20 (0)
Description: This is P4 only. This error indicates that the BIOS failed to find a set of acceptable DQS values for every nibble or for one nibble of the DIMMs.
Resolution: See Code 4 above for resolution information.

Subcode: 4 - 0x100 (0)
Description: This error indicates the DIMM pair requires a memory controller setting which is outside tolerance for the chipset's memory controller. This DIMM pair would likely not function correctly if it were allowed to be used.
Resolution:
A) Replace CPU DIMMs with 3PAR-certified products.
B) Replace the node motherboard.
C) If there is no other choice, override this error with a BIOS variable, setting "mem_margin" to the percentage outside margin. Example:
*** Fatal error: Code 4, sub-code 0x0 (2).
Whack> set perm mem_margin=2
Whack> reboot

Subcode: 4 - 0x200 (0)
Description: This error indicates the DIMM pair requires a memory controller setting which is outside tolerance for the chipset's memory controller. This DIMM pair would likely not function correctly if it were allowed to be used.
Resolution: See Code 4, sub-code 0x100 for resolution information.

Subcode: 4 - 0x400 (0)
Description: This error indicates the DIMM pair requires a memory controller setting which is outside tolerance for the chipset's memory controller. This DIMM pair would likely not function correctly if it were allowed to be used.
Resolution: See Code 4, sub-code 0x100 for resolution information.
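When working from a saved serial console capture, it can help to pull the code, sub-code, and data value out of a fatal error line so the entry can be looked up in this table. The sketch below is illustrative only: the function names are hypothetical, and the pattern is modeled solely on the example line shown above ("*** Fatal error: Code 4, sub-code 0x0 (2)."), so other console formats may not match. The data value is kept as text because this guide does not state whether it is decimal or hexadecimal in every case.

```python
import re

# Pattern modeled on the single example line in this guide (an assumption).
FATAL_RE = re.compile(
    r"Fatal error:\s*Code\s+(?P<code>\d+),\s*"
    r"sub-code\s+(?P<sub>0x[0-9a-fA-F]+)\s*\((?P<data>[^)]*)\)"
)


def parse_fatal_error(line):
    """Return code, sub-code and raw data text, or None if no match."""
    match = FATAL_RE.search(line)
    if match is None:
        return None
    return {
        "code": int(match.group("code")),
        "subcode": int(match.group("sub"), 16),
        "data": match.group("data"),  # left as text; base is not documented
    }


def table_key(parsed):
    """Format a parsed error the way the Subcode column writes it."""
    return f"{parsed['code']} - {parsed['subcode']:#x}"
```

For the example line above, `parse_fatal_error` returns code 4, sub-code 0, and data "2", and `table_key` gives "4 - 0x0".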
Subcode: 4 - 0x800 (0)
Description: This error indicates the DIMM pair requires a memory controller setting which is outside tolerance for the chipset's memory controller. This DIMM pair would likely not function correctly if it were allowed to be used.
Resolution: See Code 4, sub-code 0x100 for resolution information.

Subcode: 4 - 0x1000 (0)
Description: This error indicates the DIMM pair requires a memory controller setting which is outside tolerance for the chipset's memory controller. This DIMM pair would likely not function correctly if it were allowed to be used.
Resolution: See Code 4, sub-code 0x100 for resolution information.

Subcode: 4 - 0x2000 (0)
Description: This error indicates the DIMM pair requires a memory controller setting which is outside tolerance for the chipset's memory controller. This DIMM pair would likely not function correctly if it were allowed to be used.
Resolution: See Code 4, sub-code 0x100 for resolution information.

Subcode: 5 - 0x1 (0)
Description: This exception should never happen unless an earlier exception was ignored by pressing ^C. It occurs only if the main initialization, diagnostic test, and boot sequence fail to complete a boot and the user then chooses to ignore the error.
Some further explanation is necessary. There are two halves to system initialization. The first half relies on only SRAM being available, so the stack and runtime variables are stored there. Once main CPU memory has been tested, initialization switches to the second half, which relies on the tested SDRAM for all data structures. This second half completes initialization and testing of all other node board devices and executes the boot process. For this last step to fail, the IDE disk must either be absent or contain an invalid boot. At that point a fatal error is generated. Do not ignore this condition. It is a final recourse, and an abort will reboot or hang the node board. It is safer at this stage to press ^W and enter Whack. From Whack, you can reboot with the "reboot" command.
Resolution:
A) Check that the control cache (CPU) DIMMs are installed and pass initialization.
B) Verify the node boot drive is present and node software has been installed.
C) Replace the node, including CPU DIMMs and boot drive.

Subcode: 6 - 0x1 (0)
Description: *** SRAM failure: address xxxxxxxx wrote yy but read zz
This failure indicates an early SRAM verification test revealed a problem with the SRAM. This is an unrecoverable error which likely requires hardware diagnostics. This error is displayed by low level init code. It is never written to the PROM log, because the hardware which writes to the PROM relies on correctly functioning SRAM.
Resolution:
A) Cycle power on the node.
B) Replace the bootstrap CPU.
C) Replace the node motherboard.

Subcode: 7 - xxxx (yyyy)
Description: This error indicates the BIOS has detected that the front side bus speed exceeds the expected speed (133 MHz on PIII, 533 MHz on P4, 1333 MHz on 5000P). The system may not perform reliably.
Resolution:
A) Cycle power on the node.
B) Replace the bootstrap CPU.
C) Replace the node motherboard.

Subcode: 8 - xxxxxxxx (yyyyyyyy)
Description: Machine check: MCG_STATUS == xxxxxxxx yyyyyyyy
During BIOS initialization and testing, the processor must execute instructions. If this error results at any point, it is likely due to failing hardware related to the CPU's instruction execution path.
Resolution:
A) Cycle power on the node.
B) Update the node firmware to the latest version.
C) Replace CPU SDRAM in pairs.
D) Replace the node motherboard.

Subcode: 9 - 0x0 (0)
Description: *** Entering memory segment test: Stack is in xxx ***
One of the first memory tests performed in diagnostic mode is a sequential address or random data test. If there is no memory in the system, the memory DIMMs are mismatched, or there is a memory subsystem problem, this error may result.
Resolution:
A) Verify memory is installed and in matched pairs (same manufacturer, exact same memory configuration and speed).
B) Replace CPU DIMMs with a set of known good ones.
C) Replace the node motherboard.

Subcode: 9 - 0x1 (0)
Description: Insufficient memory: BSS end == xxxx, stack limit == yyyy
During the first part of initialization, the system stack comes from SRAM. During the second part, the system stack comes from CPU memory. If there is insufficient SDRAM (such as no DIMMs installed), this error may result. It is a bad idea to ignore this error with ^C, as the system stack will fall past the available memory and probably hard-hang the initialization.
Resolution: See Code 9, sub-code 0x0 for resolution information.

Subcode: 9 - 0x2 (0)
Description: Expected sdram_init_test to be xxxx, but it was yyyy.
After SDRAM has been initialized and scrubbed, the BIOS copies runtime variables from Flash to CPU memory. The fact that this data was copied to SDRAM is later verified. This fatal error may be caused by a software error in the BIOS, a hardware error (such as flaky CPU memory), or user intervention such as modifying the memory containing the SDRAM copy of the runtime variables.
Resolution:
A) Reboot. If the problem is caused by flaky hardware, a prior memory test should catch this condition.
B) Upgrade BIOS version. Not a likely solution since this code path is well tested every time the system is booted.
C) Replace CPU DIMMs with a set of known good ones.
D) Replace the node motherboard.
Subcode: 9 - 0x3 (0)
Description: Low 1M test: Test completed: x iterations, y probes, z errors found
The low 1 MB of memory is thoroughly tested to ensure reliable operation, as this is the memory area that the BIOS and Whack use during further initialization and testing. If this test fails, it should not be ignored with ^C, as having reliable system memory is critical to proper operation.
Resolution:
A) Cycle power on the node. Occasionally, memory will fail during a memory test due to metallic dust.
B) Reseat CPU memory DIMMs.
C) Pull CPU DIMMs, blow dust from sockets, reseat.
D) Replace CPU memory DIMMs in pairs to ensure replacement parts are matched.
   PIII nodes: Non-paired DIMMs are proximally closest. Paired DIMMs are the leftmost-leftmost and rightmost-rightmost of each two which are proximally closest.
   P4 nodes: Paired DIMMs are proximally closest. DIMM0 and DIMM1 are a pair. DIMM2 and DIMM3 are a pair.
   E200, Ironman, and Tinman nodes: There is only a single pair of CPU memory DIMMs.
E) Replace the node motherboard.

Subcode: 9 - 0x4 (0)
Description: High 64K test: Test completed: x iterations, y probes, z errors found
In addition to the low 1 MB of memory, older BIOS versions also thoroughly tested the high 64 KB of memory. This is because the operational stack for the CBIOS and Whack used to reside at this address, which made the memory critical for proper initialization and testing. The current BIOS now uses memory below 1 MB for stack space, so this failure code is deprecated.
Resolution: See Code 9, sub-code 0x3 for resolution information.

Subcode: 9 - 0x5 (0)
Description: SDRAM walk: Test completed: xx iterations, yy probes, zz errors found
During initialization (prior to a thorough test of the low 1 MB of memory), a quick walk through all CPU memory is performed. If an error is found, this fatal error is displayed.
Resolution: See Code 9, sub-code 0x3 for resolution information.

Subcode: 9 - 0x6 (0)
Description: Full SDRAM test: Test completed: xx iterations, yy probes, zz errors found
During later testing, a full SDRAM test is performed which verifies proper memory operation more completely than the cursory SDRAM walk. This test is very similar to the initial thorough 1 MB test done during initialization.
Resolution: See Code 9, sub-code 0x3 for resolution information.

Subcode: 9 - 0x7 (0)
Description: Pairwwww DIMMxxxx: Illegal SPD value <name of value> <value>
This error indicates that a CPU DIMM was detected but that the EEPROM present on the DIMM reported an illegal or unsupported value for our memory controller. Example: Density (SPD byte 31) has more than 1 bit set (for example, 0x30), which indicates a non-standard part.
Resolution: See Code 9, sub-code 0x3 for resolution information. Most likely, the DIMM is not qualified for use in our node board. The DIMM number is logged in the Data field of the Fatal Error.
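The Density example given for Code 9, sub-code 0x7 amounts to checking that a byte has exactly one bit set. A minimal sketch of that check follows. It is not the CBIOS implementation, and the function name is hypothetical; it only illustrates why a value such as 0x30 is flagged.

```python
def has_single_bit_set(value):
    """Return True when exactly one bit of value is set."""
    # Clearing the lowest set bit (value & (value - 1)) leaves zero only
    # when that bit was the sole bit set.
    return value > 0 and (value & (value - 1)) == 0
```

With this check, `has_single_bit_set(0x30)` is False because bits 4 and 5 are both set, while `has_single_bit_set(0x20)` is True.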
Subcode: 9 - 0x10 (0)
Description: Cannot allocate xx bytes for PCI bus yy scan
or
Cannot allocate xx bytes for PCI device on bus yy
This error indicates there was not enough memory, or a memory error occurred, while attempting to allocate heap space during the PCI device probe. SDRAM is needed because the BIOS maintains a list of PCI devices present in the system.
Resolution:
A) Cycle power on the node.
B) Remove all PCI cards.
C) Replace CPU DIMMs.
D) Replace the node motherboard.

Subcode: 9 - 0x11 (0)
Description: Cannot find bus xx in scanned PCI busses
During the PCI bus scan, a list of PCI devices present is recorded in SDRAM. For each device present, a block of memory is allocated and initialized. This error indicates that a data value indicating bus number could not be found in the list of devices previously scanned. This is probably due to an SDRAM or CPU failure.
Resolution:
A) Cycle power on the node.
B) Remove all PCI cards.
C) Replace CPU DIMMs.
D) Replace bootstrap CPU.
E) Replace the node motherboard.
Subcode: 9 - 0x12 (0)
Description: No memory installed.
This error indicates that the CPU memory scan failed to locate any usable memory for the system. There must be at least one bank of SDRAM configured for the node to operate correctly.
Resolution:
A) Cycle power on the node.
B) Verify CPU DIMM scan output shows DIMMs.
C) Replace CPU DIMMs.
D) Replace the node motherboard.

Subcode: 9 - 0x13 (xxxx)
Description: Unknown DDR2 frequency (xxxx)
This error indicates that the CPU memory installed is of an unrecognized and thus unsupported memory speed. Supported speeds include 533, 667, and 800 MHz.
Resolution: Replace CPU DIMMs with 533, 667, or 800 MHz modules.

Subcode: 9 - 0x14 (0)
Description: FB-DIMM Initialization Failure
This error indicates that CBIOS was unable to initialize the CPU memory installed.
Resolution:
A) Cycle power on the node.
B) Replace CPU DIMMs.
C) Replace the node motherboard.

Subcode: 9 - 0x15 (data)
Description: This error indicates that an uncorrectable ECC error was detected on a DIMM. The data value is a bitmask that may be decoded to determine which DIMM had the error. A value of 1 indicates DIMM 0, 2 indicates DIMM 1, 4 indicates DIMM 2, and so on. More than one bit may be set if CBIOS is unable to isolate the error down to a single DIMM.
Resolution:
A) Cycle power on the node.
B) Replace FB-DIMM(s).
C) Replace the node motherboard.

Subcode: 10 - 0x1 (0)
Description: During the PCI scan, many devices which were programmed by previous PCI scan steps are examined again to verify the programming was successful. This error indicates that a bridge failed to record the PCI bus number of bridges below it.
Resolution:
A) Cycle power on the node.
B) Remove all PCI cards.
C) Replace the node motherboard.

Subcode: 10 - 0x2 (0)
Description: Several devices on the PCI bus of a node board are known by the CBIOS to have specific sizes. As a hardware consistency check, the BIOS verifies that these devices are not only present, but also have appropriate memory and I/O space requirements.
If any device is found outside of expected requirements, it will cause this error.
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Swap out the PCI card for another qualified card (if it's a card).
D) Pull all PCI cards to see if the problem persists. If so, replace any defective cards.
E) Replace the node motherboard.

Subcode: 10 - 0x3 (0)
Description: This error indicates that the system has run out of available mapping area while attempting to map this device into the CPU's I/O address range (0x0000 - 0xfe00). The likely cause of this error is that a prior PCI device is consuming too much I/O space.
Since most device I/O ranges are extremely small, the cause is likely a defective PCI card or a PCI bus problem.
Resolution:
A) Reseat all PCI cards.
B) Swap out individual PCI cards.
C) Replace the node motherboard.

Subcode: 10 - 0x4 (0)
Description: Many PCI devices (and software drivers) require DMA addressable memory within the 32 bit address space (less than 4 GB). For this reason, all 32 bit PCI devices are required to be mapped within this space. Currently, all CPU memory is also forced to be mapped within this space, limiting the maximum 32-bit CPU memory to about 3 GB.
Resolution:
A) Swap out individual PCI cards.
B) Replace the node motherboard.
See Code 10, sub-code 0x3 for diagnostic information.

Subcode: 10 - 0x5 (0)
Description: The non-prefetchable memory has the same 32 bit limitations as prefetchable memory does.
Resolution: See Code 10, sub-code 0x4 for resolution information.

Subcode: 10 - 0x6 (0)
Description: 64 bit PCI devices are not limited to a 32 bit address space. The CPU, however, can only access a 36 bit space (when virtual memory is enabled). Because most drivers need direct access to the memory a device provides on the bus, the device must be addressable by the Pentium, so the maximum 64 bit address allowed is 0xf:ffffffff. This is 64 GB.
Resolution: See Code 10, sub-code 0x4 for resolution information.

Subcode: 10 - 0x7 (0)
Description: Testing CM PCI 64-bit data lines: FAIL
The Cluster Manager (Eagle / Osprey) is used to perform a walking bit test on both PCI0 and PCI1 data paths to CPU memory. If a problem is found with either path, this error is displayed. The error is further qualified by one of the following prior lines:
PCIxxxx all data bits stuck high
PCIxxxx found data bits stuck high: BitWW, BitXX, BitYY, BitZZ
PCIxxxx all data bits stuck low
PCIxxxx found data bits stuck low: BitWW, BitXX, BitYY, BitZZ
PCIxxxx data bits possibly floating: BitWW, BitXX, BitYY, BitZZ
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Pull all PCI cards to see if the problem persists. If so, replace any defective cards.
D) Replace the node motherboard.

Subcode: 10 - 0x8 (0)
Description: CBIOS runs simple CM PCI Tests as part of POST in both normal operation and manufacturing test. The tests use XCBs to transfer data over both CM PCI interfaces from Cluster Memory to CPU Memory and back. If any test fails due to a data miscompare, the test will generate this fatal error code with sub-code '0x4'. These tests are similar to the Cluster Memory Tests and may fail due to Cluster Memory SDRAM hardware or CPU SDRAM hardware failures. Any test failure results in a fatal error.
Resolution:
A) Cycle power on the node.
B) Reseat CM memory riser card.
C) Reseat the failing Cluster memory DIMM.
D) Replace the failing Cluster memory DIMM.
E) Replace the node motherboard.

Subcode: 10 - 0x9 (0)
Description: This error indicates one of the PCI bridges on the board has a bad clock value and is refusing to accept programming of a good clock.
Resolution:
A) Cycle power on the node. The problem may occur on power cycle (only) with random chance on a bad board.
B) Pull all PCI cards which have integrated bridges (QLogic quad port cards are a good example of this). Power cycle several times to confirm that it is not an intermittent problem with the motherboard.
C) Replace the node motherboard.

Subcode: 10 - 0xa (0)
Description: This error indicates one of the PCI bridges on the board has a bad GPIO input which selects bridge clock sources on a power on condition.
Resolution:
A) Cycle power on the node. The problem may occur on power cycle (only) with random chance on a bad board.
B) Replace the node motherboard.

Subcode: 10 - 0xb (0)
Description: Warning: This node has xx PCI cards present, but yy is the required minimum. Please verify your node is properly configured. You may adjust the required minimum with the "set pci_min" command.
This error indicates this node has detected fewer PCI cards than the recommended 3PAR minimum. In a system configuration where there are fewer than the minimum active PCI cards, inactive load cards should be used to reach the required minimum.
Resolution:
A) Verify the minimum required number of PCI cards are inserted in the node. Install dummy load cards to reach the required minimum.
B) Verify all PCI cards in the system have been identified. Replace any missing card.
C) Replace the node motherboard.

Subcode: 10 - 0xc (0)
Description: Testing CM PCI 64-bit address lines: FAIL
CM XCB TEST miscompare at offset, uuuu
Expected (vvvvvvvv) Actual (wwwwwwww)
CM DIMMxx (Jyyyy): Address (zz:zzzzzzzz)
The Cluster Manager is used to perform a walking bit test on both PCI0 and PCI1 address line paths from CPU memory into cluster memory.
If a problem is found (with either path), this error is displayed. The particular memory address which caused this error is indicated.
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Pull all PCI cards to see if the problem persists. If so, replace any defective cards.
D) Replace the node motherboard.

Subcode: 10 - 0xd (zz)
Description: *** Vendor xxxx device yyyy on motherboard not yet qualified.
*** Vendor xxxx device yyyy in slot zz not yet qualified.
This error indicates that the device found is not recognized by the BIOS as a 3PAR-qualified device. This may be because the board is a new generation or because there was a PCI error in communicating with the device. In the former case, it is probably safe to press ^C to ignore this error. In the latter case, it is possible that part of the board has become non-functional, to the point where the BIOS may not be able to determine whether the rest of the board will continue to function. If you need to override this feature, enter Whack at this point by pressing ^W. Enter the following command:
Whack> set perm pci_unqual_ok
If the data field is non-zero, it indicates the BIOS discovered the problem is a card in a particular PCI slot. The specific codes are as follows:
* 30 is PCI Slot 0
* 31 is PCI Slot 1
* 32 is PCI Slot 2
* 33 is PCI Slot 3
* 34 is PCI Slot 4
* 35 is PCI Slot 5
Resolution:
A) Swap out the PCI card for a qualified card.
B) Replace the node motherboard.

Subcode: 10 - 0xe (0)
Description: This error indicates the PCI scanning code was unable to lay out a valid PCI address table mapping within 21 passes. The cause of this error is possibly either defective hardware or BIOS firmware.
Resolution:
A) Remove all PCI cards. If the error goes away, attempt to find the failed card by process of elimination (put back half of the cards and try to boot again).
B) Replace the node motherboard.

Subcode: 10 - 0x10 (0)
Description: This error indicates a possible hardware failure on the board. The bus which connects the CMIC (P4 North Bridge) to CIOB A failed to initialize properly.
Resolution:
A) Cycle power on the node. The problem may occur with random chance on a bad board.
B) Replace the node motherboard.

Subcode: 10 - 0x11 (0)
Description: The BIOS checks for specific onboard PCI devices (such as bridges) which are known to be on a particular node board. If a device listed in the BIOS table is not found on the board, this error results.
Resolution:
A) Cycle power on the node.
B) Remove PCI cards and see if the error disappears.
C) Replace the node motherboard.

Subcode: 10 - 0x12 (0)
Description: Onboard PCI devices (such as bridges) are well known by the BIOS to appear at specific bus addresses. If a device is not known by the BIOS, but it is configured on a bus which is not externally exposed (PCI slot), you will see this error.
Since the node board is a closed solution, this error might occur if an onboard device is failing and does not report a correct device vendor/ID, or corrupts the device vendor/ID reported by another device on the bus.
See Code 10, sub-code 0x11 for resolution information.

10 - 0x13 (0)
The PCI header is re-read on multiple passes of the PCI initialization. If a mismatch is found with a previous read of the PCI bus, this error results. This is a strong indicator of a flaky device or bus. If the BIOS is in Diagnostic mode (press ESC at the initial memory test), the following is also displayed at this point:
Starting infinite PCI read loop...
In Diagnostic mode, once a failure is detected, this test is repeated until manual intervention.
See Code 10, sub-code 0x3 for resolution information.
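The slot-related Code 10 errors report the slot in the data field as an ASCII character code shown in hexadecimal (30 is PCI Slot 0, 31 is PCI Slot 1, and so on). The following is a minimal sketch of that decoding, not part of CBIOS or Whack. The function name `decode_slot` is hypothetical, and treating a zero data field as an onboard device is an assumption taken from the wording of these entries.

```python
def decode_slot(data_field):
    """Translate a displayed data field such as "31" into a location string.

    Assumption: the data field is given as the hexadecimal text shown in
    the error, for example "30" for PCI Slot 0.
    """
    value = int(data_field, 16)          # the field is shown in hexadecimal
    if value == 0:
        # A zero data field points at a device on the node motherboard.
        return "node motherboard device"
    char = chr(value)                    # 0x30 -> "0", 0x31 -> "1", ...
    if not "0" <= char <= "9":
        raise ValueError(f"data field {data_field!r} is not an ASCII digit")
    return f"PCI Slot {char}"


if __name__ == "__main__":
    for field in ("0", "30", "33", "36"):
        print(field, "->", decode_slot(field))
```

A data field that does not decode to an ASCII digit raises an error here rather than guessing at a slot.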
  • 84. Subcode / Description

10 - 0x14 (0)
During PCI initialization, a 64-bit window was found on the PCI bus which is outside the 36-bit range imposed by the CPU.
See Code 10, sub-code 0x3 for resolution information.

10 - 0x15 (0)
During PCI initialization, a window was found on the PCI device with a size of zero. This fatal error may indicate that the BIOS is not able to properly communicate with the PCI device.
See Code 10, sub-code 0x3 for resolution information.

10 - 0x16 (slot)
During PCI initialization, each memory or I/O window present on each device found on the bus is programmed with a CPU memory bus address so that it may be accessed by further BIOS initialization, tests, and the main operating system. The BIOS verifies that the address for each window was correctly programmed by reading back the value just written. If the values do not match, this error is generated. The slot number is an ASCII character code shown in hexadecimal. If the slot value is 0, the failure occurred on a node motherboard device. If PCI Slot 0 was involved, slot is 30; PCI Slot 1 is 31; PCI Slot 2 is 32; PCI Slot 6 is 36, and so on.
See Code 10, sub-code 0x3 for resolution information.

10 - 0x17 (0)
See Code 10, sub-code 0x16 for information on this error.

10 - 0x18 (0)
See Code 10, sub-code 0x16 for information on this error.

10 - 0x19 (0)
During PCI initialization, each memory or I/O window present on each device found on the bus is programmed with a CPU memory bus address. The size of the window required is provided by the specific PCI device. The window size must be a power of 2 (1 KB, 2 KB, 4 KB, ... 32 MB, 64 MB, etc.). This is a consistency check the BIOS performs to ensure it is properly communicating with the PCI device.
See Code 10, sub-code 0x3 for resolution information.

During PCI initialization, the entire PCI bus is walked as a tree, and device registers are initialized and mapped into processor address space using this tree.
The bus structure 10 - 0x1a (0) is then ordered and summarized into a table so that software can later find specific devices for high level initialization. This specific error indicates the PCI scan attempted to map a PCI device into the CPU's 32-bit address space, but failed due to no more available space. Verify that NVRAM flags such as "pci_base" and "mem_max" are not set to unusual values. See Code 10, sub-code 0x3 for resolution information. This error indicates a possible hardware failure on the board. The bus which connects the CMIC (P4 North Bridge) to CIOB B failed to initialize properly. 10 - 0x1b (0) See Code 10, sub-code 0x10 for resolution information. This error indicates a possible hardware failure on the board. The CIOB (which connects the North Bridge to the I/O system) has an incorrect clock speed. 10 - 0x1c (data) Resolution: A) Cycle power on the node. B) Replace the node motherboard. This error indicates one of the PCI bridges on the board has a bad speed selection set, which could indicate an incorrect type of PCI card has been installed or that bridge mode select strappings are bad. 10 - 0x1d (0) Resolution: A) Pull all PCI cards one at a time to determine failed card. B) Replace the node motherboard. This error indicates that during a previous PCI scan, the CPU hung repeatedly. Other than this being a fatal error, this code is identical to that of sub-code 0x23. Note that 10 - 0x24 (data) if this fatal error is seen without a preceding non-fatal sub-code 0x23, then the failure is likely to be the node motherboard. 84 CBIOS Error Codes
  • 85. DescriptionSubcode If the non-fatal is not logged, then a PCI scan hung earlier in the PCI tree than a previous hang. Unless both hangs happened on the same HBA, the cause is likely a shared device on the node motherboard. See Code 10, sub-code 0x23 for resolution information. This error indicates that the PCI device does not have an EEPROM attached.10 - 0x25 (0) Resolution: Replace node motherboard. This error indicates that the EEPROM failed to be programmed.10 - 0x26 (0) Resolution: Replace node motherboard. This error indicates that BIOS was unable to verify the EEPROM contents after programming or that the data was successfully written but did not persist. 10 - 0x27 (0) Resolution: Replace node motherboard. The BIOS installs an interrupt handler to catch spurious (unexpected) interrupts and exceptions during initialization and testing of the node hardware. During initialization, 11 - yyyy (0) the BIOS even tests to verify a generated interrupt is delivered correctly. This is a serious condition and should not be ignored by pressing ^C. The specific interrupt received is the sub-code displayed. The interrupt number will be less than 0x20. Resolution: A) Cycle power on the node. B) Replace the node motherboard. PIII or P4 node:12 - 0x0 (0) --- SMI: No known cause (# zz) GPE status: yyyyyy, GPE input: zzzzzz An SMI is a System Management Interrupt, and interrupt generated by the node hardware for the BIOS to service a particular failure. This error indicates the BIOS was unable to determine the cause of the SMI delivered by hardware. See Code 11 for resolution information. Ironman, Tinman, Titan, or Atlas nodes:12 - 0x0 (0) CPU0 SMI: Bootstrap CPU0 SMI: Updating CPU0 SMI: Updated --- SMI: No known cause (# 1) on CPU6 SMSCS[0] = 0x00000000 ... 
ALT_GP_SMI_EN = 0xbfbf ALT_GP_SMI_STS = 0x0000 TMP_STS = 0x00000000:88380000 TMP_INT = 0x00000000:00000001 This fatal error indicates the BIOS received an SMI, but wasn't able to determine which device caused the interrupt. In this example, the "Bootstrap," "Updating," and "Updated" messages suggest the BIOS firmware was updated. Resolution: A) Reboot the node. B) Replace the node motherboard. During initialization, the BIOS installs an interrupt handler to verify interrupts are delivered reliably. It then generates an expected interrupt. If an interrupt is delivered 12 - 0x1 (yyyy) which is not the same as the one expected, this error is displayed. The interrupt number, yyyy, represents which interrupt occurred. See Code 11 for resolution information. InForm OS Failed Error Codes and Resolution 85
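The interrupt delivery checks behind Code 12, sub-code 0x1 and Code 13, sub-code 0x0 differ only in what went wrong and in which interrupt number is reported. The sketch below illustrates that distinction under stated assumptions. The function name and the use of `None` for "nothing delivered" are hypothetical, and this is not the BIOS implementation.

```python
def classify_interrupt_test(expected, delivered):
    """Return (error, data) for a failed delivery check, or None on success.

    expected:  interrupt number the test generated
    delivered: interrupt number actually received, or None if none arrived
    """
    if delivered is None:
        # Nothing arrived: the data field names the interrupt that
        # should have been generated.
        return ("13 - 0x0", expected)
    if delivered != expected:
        # A different interrupt arrived: the data field names the
        # interrupt that actually occurred.
        return ("12 - 0x1", delivered)
    return None


if __name__ == "__main__":
    print(classify_interrupt_test(0x21, 0x21))
    print(classify_interrupt_test(0x21, 0x22))
    print(classify_interrupt_test(0x21, None))
```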
  • 86. DescriptionSubcode During initialization, the BIOS installs an interrupt handler to verify interrupts are delivered reliably. It then generates a few expected interrupts. If the specific interrupt 13 - 0x0 (yyyy) is not delivered, this error is displayed. The interrupt number, yyyy, represents which interrupt should have been generated. See Code 11 for resolution information. The Whack "mem test ecc" command performs an ECC test over the main memory to ensure ECC memory error correction is functioning. If this test fails, this message is displayed, together with other messages giving details. 14 - 0x0 (0) Note: Running the "mem test ecc" command destroys some memory locations in the range of [0 .. 512 KB] and [1 MB .. just below the top of SDRAM]. Hence, executing this once Linux has booted will cause it to fail if it is reentered. If you see this failure often during BIOS initialization, then the cause is likely a hardware problem. Specifically, the error tells you that the hardware ECC error mechanism is not working correctly. Changing CPU memory DIMMs may solve the problem, but it's more likely a board failure. Resolution: A) Ensure the North Bridge heatsink is firmly attached. B) Replace CPU DIMMs. C) Replace bootstrap CPU. D) Replace the node motherboard. 00 - 0f: 00 00 00 00 00 00 00 00 | 00 00 00 00 00 40 0c 0014 - 0x1 (1) 10 - 1f: 01 ff 00 00 00 00 00 ff | ff ff ff ff ff ff ff ff 20 - 2f: 04 09 08 09 20 09 10 09 | 18 09 00 09 00 00 59 8e 30 - 3f: aa aa 0a 02 a8 00 00 00 | 00 00 00 c0 7b df ff ff This error indicates the BIOS ECC hardware test could not get the hardware to generate an ECC SMI in response to a corrupted memory address. It possibly indicates a failing DIMM or memory controller, or that memory timings are too fast for the DIMMs present in the node. See Code 14, sub-code 0x0 for resolution information. 
15 - 0x0 (slot)
mailbox register xxxx changed inappropriately
(yyyy) != expected (zzzz)
register test: FAIL
(slot) = PCI slot number
There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and Ironman nodes. The slots are numbered 0-2, 3-5, 6-8 on Titan and Atlas, and the top three depend on which slot the node is in. During POST, all present FCAL adapters are tested for functionality. The HBA cards sometimes require a firmware download for full capability. POST does not have access to this firmware and tests only basic register access and functionality. If the Register Test fails, POST indicates this error. If the user continues past this error (^C), software logs the error and continues testing the other PCI cards (if present).
Resolution:
A) Reseat the failing PCI Fibre Adapter.
B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI Fibre Adapter.
C) Replace the node motherboard.

15 - 0x1 (slot)
controller memory xxxx value (yyyy) != expected (zzzz)
memory test: FAIL
(slot) = PCI slot number
  • 87. DescriptionSubcode There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and Ironman nodes. The slot are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top three will depend on which slot the node is in. During POST, all present FCAL adapters are tested for functionality. The HBA cards sometimes require a firmware download for full capability. POST does not have access to this firmware and will only test basic functionality. If the Onboard Memory Test fails, POST will indicate this error. If the user continues past this error (^C), software will log the error and continue testing the other PCI cards (if present). Resolution: A) Reseat the failing PCI Fibre Adapter. B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI Fibre Adapter. B) Replace the node motherboard. data bits possibly float: Bitxxxx-Bityyyy.15 - 0x2 (slot) PCI walking bits: FAIL (slot) = PCI slot number There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and Ironman nodes. The slot are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top three will depend on which slot the node is in. During POST, all present FCAL adapters are tested for functionality. The HBA cards sometimes require a firmware download for full capability. POST does not have access to this firmware and will only test basic functionality. If the PCI Fibre Card Bus Test fails, POST will indicate this error. If the user continues past this error (^C), software will log the error and continue testing the other PCI cards (if present). Resolution: A) Reseat the failing PCI Fibre Adapter. B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI Fibre Adapter. C) Replace the node motherboard. 
data bits possibly float: Bitxxxx-Bityyyy.15 - 0x3 (slot) CM0 walking bits: FAIL (slot) = PCI slot number There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and Ironman nodes. The slot are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top three will depend on which slot the node is in. This test indicates a problem was observed with the fibre channel card talking with the Cluster Manager. If the "fibre test pci" test passed, then this problem is likely in the interface to the CM or CM memory. Resolution: A) Reseat the failing PCI Fibre Adapter. B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI Fibre Adapter. C) Replace the node motherboard. PCIe EYE test: FAIL15 - 0x4 (slot) (slot) = PCI slot number There are 6 or 9 PCI slots available to insert PCI adapter cards on the node board. The slots are numbered 0-6 from left to right when looking at the front of the P4 Eagle and Ironman nodes. The slot are numbered 0-2, 3-5, 6-8 on Titan and Atlas and the top three will depend on which slot the node is in. InForm OS Failed Error Codes and Resolution 87
  • 88. DescriptionSubcode If the "fibre test cm" test passed, then this problem is likely in the PCIe to PCIE link between the card and the switch. Resolution: A) Reseat the failing PCI Fibre Adapter. B) Analyze other failures in the system. If the CM PCI XCB test passed, replace the PCI Fibre Adapter. C) Replace the node motherboard. BIOS can not make LSI card go into Operational state.15 - 0x10 (slot) Resolution: Replace card. Send failed card back for FA. HBA card register test failure15 - 0x11 (slot) Resolution: Replace card. Send failed card back for FA. LSI card register memory copy test failure.15 - 0x13 (slot) Resolution: Replace card. Send failed card back for FA. LSI card register memory copy test failure.15 - 0x14 (slot) Resolution: Replace card. Send failed card back for FA. Firmware rev xxxx not supported. Upgrade to yyyy15 - 0x15 (slot) LSI card does not contain 3PAR-approved firmware. If you need to run with an LSI card which has an older firmware (engineering only), you can set the "lsi_downrev" flag in the BIOS. Example: Whack> set perm lsi_downrev Resolution: Replace card. Send failed card back for upgrade. Unable to get firmware rev15 - 0x16 (slot) Attempting to get the firmware version from the LSI card failed. Resolution: A) Cycle power on the node. B) Replace card. Send failed card back for FA. Manufacturing test for E200 node Only.15 - 0x17 (slot) This error occurs when the onboard LSI chips are not found. They are expected to be in slot 0 and 3, with two devices on each slot. Resolution: A) Cycle power on the node. B) Replace motherboard. The IDE controller failed its internal self test.17 - 0x0 (0) Resolution: A) Replace the IDE or SATA boot drive. B) Replace the IDE or SATA cable. C) Replace the node motherboard. The IDE controller failed to perform a self test.17 - 0x1 (0) See Code 17, sub-code 0x0 for resolution information. IDE register xx value (yyyy) != expected (zzzz)17 - 0x2 (0) The IDE register test failed during a pattern test. 
See Code 17, sub-code 0x0 for resolution information.
  • 89. DescriptionSubcode IDE register xx value (yyyy) != expected (zzzz)17 - 0x3 (0) The IDE register test failed during a walking bit test. See Code 17, sub-code 0x0 for resolution information. There was an IDE failure in data requested by the operating system bootstrap. It is possible that data on the disk has become corrupt to the point the operating system will not successfully load. 17 - 0x4 (0) Resolution: Replace the IDE or SATA boot drive. Communication with the IDE interface timed out. This error indicates the drive is not responding to commands within an acceptable amount of time. 17 - 0x5 (0) Resolution: Replace the IDE or SATA boot drive. IDE reported a failure in read verify command.17 - 0x6 (0) Resolution: Replace the IDE or SATA boot drive. A timeout (10 seconds) was detected while performing DMA operation.17 - 0x7 (0) Resolution: Replace the IDE or SATA boot drive. An error condition was detected while performing DMA operation.17 - 0x8 (0) Resolution: Replace the IDE or SATA boot drive. IDE power up: Unknown error17 - 0x9 (xx) ERROR : 80 SECCNT: 80 SECNUM: 80 CYLLOW: 80 CYHIGH: 80 DEVSEL: 80 ALT_STATUS: 80 Drive: BUSY The IDE drive had a failure at power on reset which prevents it from communicating with the chipset IDE controller. Resolution: A) Cycle power on the node. B) Reseat drive cable on both node and drive. C) Replace the IDE or SATA boot drive. D) Replace the node motherboard. IDE SMART self-test failed. The drive failed to finish a built-in self-test.17 - 0x11 (0) Resolution: Replace the IDE or SATA boot drive. Drive failed to collect SMART data. The data is vital for the drive to determine SMART trigger. 17 - 0x12 (0) Resolution: Replace the IDE or SATA boot drive. Drive refused to accept SMART commands.17 - 0x13 (0) Resolution: Replace the IDE or SATA boot drive. The SMART command issued to drive has incorrect syntax.17 - 0x14 (0) Resolution: Replace the IDE or SATA boot drive. 
17 - 0x15 (0)
The SMART commands failed to write or read attributes.
Resolution: Replace the IDE or SATA boot drive.
  • 90. Subcode / Description

17 - 0x18 (0)
The IDE controller failed the BIOS interrupt test, possibly due to a bad drive.
See Code 17, sub-code 0x0 for resolution information.

17 - 0x20 (0)
The drive did not return status to the host within a reasonable amount of time after a command.
Resolution: Replace the IDE or SATA boot drive.

17 - 0x21 (rpm)
This error occurs when the disk drive for a Harrier system is not an SSD.
Resolution: Replace the SATA drive with an SSD.

17 - 0x22 (disk size)
This error occurs when there is 32 GB or less of cluster memory and the disk drive is smaller than 128 GB. The disk is not large enough to hold the memory dumps if the node panics.
Resolution: Replace the SSD with a drive of at least 128 GB.

17 - 0x23 (disk size)
This error occurs when there is more than 32 GB of cluster memory and the disk drive is smaller than 256 GB. The disk is not large enough to hold the memory dumps if the node panics.
Resolution:
A) Replace the SSD with a drive of at least 256 GB.
B) Reduce cluster memory to 32 GB or less.

17 - 0x30 (0)
The drive returned an error status after command execution.
Resolution: Replace the IDE or SATA boot drive.

19 - 0x0 (0)
Booting from SATA IDE...
No IDE or USB drives present or boot sector is invalid.
or
Booting from SATA IDE (bootdev)...
No IDE drive present or boot sector is invalid.
or
Booting from PATA IDE...
No IDE drive present or boot sector is invalid.
or
Booting from USB...
No USB drive present or boot sector is invalid.
The IDE (PATA or SATA) or USB Flash disk is used for booting the operating system. This error indicates that either no drive was found during the hardware probe or the drive that was found is not bootable.
Resolution:
A) Cycle power on the node.
B) Verify disk power and data cables are connected to both the drive and the motherboard. The red stripe on the IDE data cable must be oriented closest to the power connector on the drive.
C) Replace the disk power cable and/or data cable.
D) Replace the drive.
E) Replace the node motherboard.

19 - 0x1 (0)
IDE TIMEOUT waiting for DRDY
The IDE disk is used for booting the operating system. This error indicates a problem communicating with the IDE controller, most likely due to a missing IDE hard drive, a disconnected cable, or a failed IDE hard drive.
See Code 19, sub-code 0x0 for resolution information.

19 - 0x2 (0)
IDE TIMEOUT waiting for DRQ
  • 91. DescriptionSubcode The IDE disk is used for booting the operating system. This error indicates that a command was issued to the IDE disk (read sectors) but the drive controller did not report back with the data within a reasonable amount of time. This may be caused by a failed sector or IDE controller failure. See Code 19, sub-code 0x0 for resolution information. IDE ERROR reading sector xxxx19 - 0x3 (0) The IDE disk is used for booting the operating system. This error indicates that a command was issued to the IDE disk (read sectors) but the drive controller reported that there was a error in reliably retrieving the requested sectors. This error may be caused by a failed sector or IDE controller failure. See Code 19, sub-code 0x0 for resolution information. If a board has more than a single CPU, only one CPU comes out of power-on executing code. The other waits in a halted state for an AP message from the bootstrap processor. 20 - 0x0 (0) All MP-capable Pentium processor has an onboard Advanced Programmable Interrupt Controller called the Local APIC (there is a complementary component called the IOAPIC located on the motherboard). Once the bootstrap processor has completed all node board initialization and testing, it starts up each application processor (which in Intel terms is defined as any processor other than the initial bootstrap processor). Each AP then does a brief identify, verify, and microcode update. In the above case, if the local APIC fails deliver an AP startup to the other processor within a reasonable amount of time, this error will result. In a single CPU system this error should not occur because an earlier probe should identify no AP processor is present. If the Local APIC cannot reliably deliver a message over the IOAPIC, then it is probably not safe to ignore this error by pressing ^C. Resolution: A) Reseat both processors in their sockets. B) Replace each processor individually. 
Do not bother with downgrading to a single processor system since this is a multiprocessor startup issue. The problem processor will not be apparent with a single processor configuration. C) Replace the node motherboard. After an AP startup message has been delivered to the application processor through the IOAPIC, the bootstrap processor waits for an indication the AP has started. If the 20 - 0x1 (0) indication is not received before a reasonable timeout, this error is given. It should be ok to ignore this message by pressing ^C and continue with further BIOS diagnostics. See Code 20, sub-code 0x0 for resolution information. Once the application processor (AP) has started initialization, it sets a flag that the bootstrap processor can use to determine when the bootstrap processor has completed. 20 - 0x2 (0) If the AP remains in the AP_INIT_START state too long, this fatal error is displayed. It is probably not safe to resume after this error since the AP may be off executing errant code or interfering with bootstrap processor bus cycles. See Code 20, sub-code 0x0 for resolution information. The application processor (AP) previously failed to complete a Built In Self Test (BIST). This is likely due to a bad processor. 20 - 0x3 (0) Resolution: Replace the application processor. During application processor (AP) initialization, it verifies that the CPU model, stepping, and clock multiplier which is being initialized matches those values of the bootstrap processor. If they do not match, this error will result. 20 - 0x4 (0) Resolution: Since the processors are possibly mismatched, remove the heatsink on both and verify that the CPU model and stepping are identical. See Code 20, sub-code 0x0 for more resolution information. The currently supported node board hardware configuration is a maximum of two physical processors. The BIOS uses this knowledge to limit the possibility of repeat 20 - 0x5 (0) InForm OS Failed Error Codes and Resolution 91
  • 92. DescriptionSubcode initialization of the application processor (AP). If this message occurs, it may be due to a variety of hardware problems, but most suspect is the application processor. See Code 20, sub-code 0x0 for resolution information. *** SMI setup error: Not expecting to install a vector on CPU xxxx21 - 0x0 (0) Intel processors support an interrupt level called SMI (System Management Interrupt) which is used for hardware management (usually by the BIOS). Events such as power management and hardware errors usually trigger an SMI. When an SMI is triggered, the system enters SMM (system management mode). In a multiprocessor system, both processors are usually triggered by an SMI at the same time. Since both processors may attempt to service an SMI at the same time, each processor must have a unique stack area where to dump processor context. SMI setup configures each processor individually with a unique stack address for SMI handling. This particular error indicates that the SMI setup handler has detected a stack setup SMI, yet one was not expected (because one had already been set up or CPU initialization had not yet reached the point of SMI setup). The bootstrap CPU delivers the setup SMI to itself and to the application processor. This error could be caused by a faulty CPU or motherboard. The CPU which reports the setup error may not be the one at fault. Resolution: A) Pull one processor at a time to determine if the problem is reproducible with a single CPU. B) Swap CPUs to see if the exact problem moves with CPU. If not, it may be the motherboard. C) Individually replace both CPUs. D) Replace the node motherboard. *** SMI setup error: CPU xxxx not found in CPU table21 - 0x1 (0) During SMI setup, each processor in turn receives an SMI and then performs stack initialization. Prior to the SMI setup, all application processors wait in a halted state for an APIC message to identify and download microcode. 
If the processor performing an SMI setup detects that it had not previously executed and added its CPU ID to the system table, then this fatal error will be displayed. See Code 20, sub-code 0x1 for resolution information. *** SMI setup error: CPU xxxx did not respond21 - 0x2 (0) During SMI setup, each processor in turn receives an SMI and then performs stack initialization. This error indicates that the bootstrap processor issued an SMI through the APIC and it was not processed by the targeted processor. This indicates that either SMIs are not being delivered properly, or that the targeted processor may be defective. See Code 20, sub-code 0x1 for resolution information. CBIOS provides service to the 3PAR kernel through a special command queue. Responses are returned to the OS through another queue, which is tested during BIOS initialization. 22 - 0x0 (0) Sub-code 0x0 indicates that the CBIOS to OS queue did not pass the built-in test. Resolution: A) Pull one processor at a time to determine if the problem is reproducible with a single CPU. B) Swap SDRAM with good SDRAM. C) Update CBIOS to the latest version. D) Replace the node motherboard. This error indicates that the CBIOS to OS queue test failed to acquire a message it previously sent. 22 - 0x1 (0) See Code 20, sub-code 0x0 for resolution information. 92 CBIOS Error Codes
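The Code 22 sub-codes describe distinct ways a queue self-test can fail: a sent message cannot be acquired, a received message differs from the one sent, or the queue holds more items than were sent. The sketch below models that kind of loopback test with a Python `deque` standing in for the CBIOS to OS queue. The function name is hypothetical, the mapping of outcomes to sub-codes follows the descriptions in this table, and pointer corruption (sub-code 0x5) and the OS to CBIOS direction (sub-code 0x4) are not modeled.

```python
from collections import deque


def queue_self_test(queue, messages):
    """Send messages through the queue and read them back.

    Returns None if every message round-trips, otherwise the Code 22
    sub-code that best matches the failure (assumed mapping).
    """
    for message in messages:
        queue.append(message)

    for message in messages:
        if not queue:
            return 0x1      # a message sent earlier could not be acquired
        if queue.popleft() != message:
            return 0x2      # message received did not match message sent

    if queue:
        return 0x3          # more items in the queue than were sent
    return None


if __name__ == "__main__":
    result = queue_self_test(deque(), ["ping", "pong"])
    print("PASS" if result is None else f"FAIL: 22 - {result:#x}")
```

A healthy queue returns every message in order and ends up empty, so the function returns `None`.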
  • 93. DescriptionSubcode This error indicates that the CBIOS to OS queue test failed because the message received did not match the message sent. 22 - 0x2 (0) See Code 20, sub-code 0x0 for resolution information. This error indicates that the CBIOS to OS queue test failed because there were more items in the queue than those sent. 22 - 0x3 (0) See Code 20, sub-code 0x0 for resolution information. This error indicates that the OS to CBIOS queue test failed. The minor code will indicate to an engineer what went wrong. 22 - 0x4 (0) See Code 20, sub-code 0x0 for resolution information. This error indicates that the CBIOS to OS queue test failed because the queue pointers became corrupt. 22 - 0x5 (0) See Code 20, sub-code 0x0 for resolution information. Invalid magic for full CBIOS23 - 0x2 (0) Prior to starting up the non-failsafe (full diagnostic) CBIOS image, the failsafe CBIOS performs some consistency checks over the image. This error indicates the failsafe BIOS could not find a proper header record for the full CBIOS. See Code 23, sub-code 0x1 for resolution information. CRC mismatch for full CBIOS23 - 0x3 (0) Prior to starting up the non-failsafe (full diagnostic) CBIOS image, the failsafe CBIOS performs a strong CRC over the full CBIOS image to verify the image's integrity. This error indicates the full CBIOS had a CRC failure. See Code 23, sub-code 0x1 for resolution information. Failsafe CBIOS is now enabling the full CBIOS ...23 - 0x4 (0) The full CBIOS either detected an error or user input (the 'f' key) which forced it to return to the failsafe BIOS. If the user did press the 'f' key, then press ^C to resume startup under the failsafe BIOS. If the user did not press the 'f' key, browse prior messages to learn of a failure which may have caused this error. Resolution: If the error was not the result of a keystroke, try pressing the 'n' key at BIOS startup to clear any initialization skips. 
It may be recorded in NVRAM to skip the full BIOS version and always execute the failsafe. See Code 23, sub-code 0x1 for more resolution information. The BIOS presents to the operating system a set of tables which describe the hardware present in the system. These tables have a rigid structure for each type of device. If the 24 - 0x0 (ptr) CBIOS configuration structure becomes corrupt, this error may result when the TURD structures are initialized for the operating system. A consistency check ensures the TURD area does not go beyond 1 MB (which is the base address where the operating system normally begins using main memory). The data to this error is the pointer address reached, and will be greater than 0x100000. ptr is the value which exceeded 0x100000. Resolution: A) Remove cards from all PCI slots. If the error no longer occurs, it may be a hardware failure on one of cards. B) Replace the node motherboard. The BIOS presents to the operating system a set of tables which describe the hardware present in the system. In this case, the BIOS detected that one of the tables had a bad checksum. 24 - 0x1 (0) Resolution: A) Remove cards from all PCI slots. If the error no longer occurs, it may be a hardware failure on one of cards. B) Replace the node motherboard. InForm OS Failed Error Codes and Resolution 93
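The consistency check described for Code 24, sub-code 0x0 amounts to making sure the pointer into the TURD area never passes the 1 MB mark (0x100000) as tables are added. The following is a minimal sketch of that bound check. The function name, the start address, and the table sizes are illustrative assumptions, and the real table layout is not described in this guide.

```python
TURD_LIMIT = 0x100000   # 1 MB: where the operating system normally begins using main memory


def check_turd_area(start, table_sizes):
    """Lay tables out back to back from start.

    Returns None if the area stays within the limit, otherwise the first
    pointer value that exceeded 0x100000 (the data field of 24 - 0x0).
    """
    ptr = start
    for size in table_sizes:
        ptr += size
        if ptr > TURD_LIMIT:
            return ptr
    return None


if __name__ == "__main__":
    # Hypothetical start address and table sizes, for illustration only.
    bad_ptr = check_turd_area(0xF0000, [0x8000, 0x8001])
    if bad_ptr is not None:
        print(f"24 - 0x0 ({bad_ptr:#x})")
```

Because the error is raised only when the pointer goes beyond the limit, the reported value is always greater than 0x100000.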
  • 94. DescriptionSubcode The BIOS presents to the operating system a set of tables which describe the hardware present in the system. In this case, the BIOS detected that it had added too many entries 24 - 0x2 (0) to the table, likely because too many PCI devices are present in the system. This error is likely due to an earlier PCI failure. Resolution: A) Remove cards from all PCI slots. If the error no longer occurs, it may be a hardware failure on one of cards. B) Replace the node motherboard. The node board has two different Serial EEPROM devices used for storing persistent board information. One PROM device is located on the I2C bus. It stores node board 25 - 0x0 (0) manufacturing, assembly, serial number, and error message log information. The second PROM device is connected through the Intel 82559ER Ethernet controller. It stores Ethernet controller information such as initialization state and the hardware MAC address. PROM checksum: FAIL The PROM which stores node board manufacturing, assembly, serial number, and error message log information does not have a valid checksum. If the PROM has not yet been initialized or if it has become corrupt, you may see this error. Resolution: A) Press ^W to enter Whack and use either "prom init" or "prom edit" to correct this error. B) If the information looks correct with "prom id" then try using "prom checksum" to rewrite the checksum. C) Replace the node motherboard. Ethernet 0 PROM checksum: FAIL25 - 0x1 (0) A) Press ^W to enter Whack and use "prom id" to verify the other PROM is valid. If not, first use "prom init" or "prom edit" to set the PROM information. If the PROM information appears valid, use "prom mac" to reprogram the Ethernet MAC address and checksum. B) Try flushing out a correct checksum. Note: You must first select the device with an error using the "eth dev" command. Example: Whack> eth dev 1 Whack> eth checksum C) Replace the node motherboard. 
27 - 0x4 (0)
This sub-code indicates that a CPU has asserted its THERMTRIP_N signal. This could mean that it has reached its case temperature, that a VRM has failed, or that there is a problem with the FPGA.
Resolution:
A) Check the environmentals.
B) Replace the node.

28 - 0x0 (0)
The Eagle/Osprey/Harrier ASICs are the Cluster Managers which are used for high speed communication between nodes of a cluster. These devices are critical for the correct operation of the node software, and hence for operation of the whole cluster. The CM exists on all PCI buses in the node. If the CM cannot be found on any of the required PCI buses, this is a serious problem. Sub-code 0x0 indicates the PCI bus scan did not locate the Cluster Manager.
Resolution:
A) Cycle power on the node.
B) Pull all PCI cards and cycle power on the node.
C) Replace the node motherboard.
28 - 0x1 (0)
Pairwwww DIMMxxxx: Bad checksum. Got yyyy, SPD said zzzz
The memory DIMMs located on the CM riser are called cluster memory. This memory is used to store data destined for the disks (dirty data) as well as data previously read from the disks (cache data). It is also used for communication among the nodes in the cluster. This memory is not required to boot the operating system, but is required for the node to participate in the cluster. Even before the memory is thoroughly tested for proper operation, it must be configured to appear in CM addressable space.
Each memory DIMM has a small embedded serial EEPROM which holds DIMM configuration information such as the number of rows, columns, and banks, as well as memory timing. If this serial EEPROM becomes corrupt, the DIMM configuration data stored in it cannot be trusted. For that reason, the EEPROM also contains a checksum which the BIOS verifies before configuring the DIMM. If this checksum does not match the checksum the BIOS computes across the DIMM, this error will result. You should look at prior output to determine if there were I2C errors. These errors suggest a problem with riser installation. The DIMM number is logged in the Data field of the Fatal Error.
Resolution:
A) Reseat Cluster Memory riser card(s).
B) Reseat Cluster Memory DIMMs.
C) Replace Cluster Memory DIMMs in pairs to ensure replacement parts are matched.
P4-Eagle and PIII-Eagle DIMM pairs are always located four riser positions apart. For example, if you number the slots from the top, Pair 0 is at positions 3 and 7 (top), Pair 1 is at positions 0 (bottom) and 4, Pair 2 is at positions 2 and 6, and Pair 3 is at positions 1 and 5.
Ironman (Tclass) and Tinman (Fclass) DIMMs are always in sets of three. The DIMMs are labeled "DIMM C.S" (channel, then set). There are two riser cards, one for channel 0 and one for channels 1 and 2.
Set 0 is DIMM 0.0, 1.0, 2.0
Set 1 is DIMM 0.1, 1.1, 2.1
Set 2 is DIMM 0.2, 1.2, 2.2
Titan and Atlas have 4 DIMM sets on the motherboard.
Set 0: DIMM 0.0 and 1.0
Set 1: DIMM 0.1 and 1.1
Set 2: DIMM 2.0 and 3.0
Set 3: DIMM 2.1 and 3.1
D) Replace the Cluster memory riser(s).
E) Replace the node motherboard.

28 - 0x2 (mm)
Pairww DIMMxx (yyyy): 'zzzz' read failed
Where zzzz is one of: row address, column address, module rows, cas latency3, refresh, banks, cas latency2, cas latency1, ras precharge, act_to_rw, act_to_deact, ras cycle, write_to_deact, density, frequency, DIMM type
This error indicates that a Cluster Memory DIMM was detected but that the Serial EEPROM present on the DIMM could not be reliably read. The DIMM number is logged in the Data field of the Fatal Error.
See Code 28, sub-code 0x1 for resolution information.

28 - 0x4 (mm)
This error indicates the Cluster Memory DIMMs reported an odd (and unsupported) number of rows. Usually the number of rows reported by a DIMM corresponds to the number of sides of the DIMM which are populated by memory. One DIMM number of the failing pair will be logged in the Data field of the Fatal Error.
See Code 28, sub-code 0x3 for resolution information.
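The pairing rules listed under Code 28, sub-code 0x1 amount to small lookup tables. The sketch below transcribes them so that the matching DIMM or set can be found before ordering replacement parts. The table and function names are hypothetical and are not part of any HP or 3PAR tool.

```python
# Pairing tables transcribed from the Code 28, sub-code 0x1 resolution text.
# All names below are hypothetical, chosen only for this illustration.
EAGLE_PAIRS = {0: (3, 7), 1: (0, 4), 2: (2, 6), 3: (1, 5)}  # riser positions
TITAN_ATLAS_SETS = {
    0: ("0.0", "1.0"),
    1: ("0.1", "1.1"),
    2: ("2.0", "3.0"),
    3: ("2.1", "3.1"),
}


def osprey_set(set_index):
    """Return the three 'channel.set' labels of one Ironman/Tinman set."""
    if set_index not in (0, 1, 2):
        raise ValueError("documented sets are 0, 1 and 2")
    return [f"{channel}.{set_index}" for channel in range(3)]


def eagle_partner(position):
    """Return (pair number, partner position) for an Eagle riser position."""
    for pair, (low, high) in EAGLE_PAIRS.items():
        if position == low:
            return pair, high
        if position == high:
            return pair, low
    raise ValueError("riser positions run from 0 to 7")


print(eagle_partner(3))  # (0, 7)
print(osprey_set(1))     # ['0.1', '1.1', '2.1']
```

Keeping the tables as data rather than arithmetic mirrors the manual, which lists the pairs explicitly instead of giving a formula.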
28 - 0x5 (0)
No Cluster Memory Installed
This error indicates that no memory was found in the Cluster memory riser. Since cluster memory is needed for proper node operation within the cluster, this condition must be resolved. You should look at prior output to determine if there were I2C errors. These errors suggest a problem with riser or DIMM installation.
See Code 28, sub-code 0x1 for resolution information.

28 - 0x6 (mm)
This error indicates the Serial EEPROM on the DIMM reports a value which is outside tolerance for the memory controller. One DIMM number of the failing pair will be logged in the Data field of the Fatal Error.
See Code 28, sub-code 0x1 for resolution information.

28 - 0x7 (mm)
Before ECC initialization of Cluster memory (scrub), a small region must be tested and configured by the CPU to set up the ECC scrub of the remainder. If an error occurs during this test (such as a memory read that does not match the value just written), then this error will be reported. The DIMM number is logged in the Data field of the Fatal Error.
See Code 28, sub-code 0x1 for resolution information.

28 - 0x8 (0)
During the ECC initialization of Cluster memory, the Cluster Manager records any memory errors it encounters. If any were recorded, this error will be displayed.
See Code 28, sub-code 0x1 for resolution information.

28 - 0x9 (0)
For each Cluster memory DIMM, there is a register in the Eagle / Osprey memory controller which specifies where the DIMM maps into CM physical memory. These mapping registers are configured during the Cluster memory probe and should not change under normal circumstances. Since this is an internal CM register, it is unlikely that reseating memory will correct this problem.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Replace the node motherboard.

28 - 0xa (mm)
The Cluster memory controller detected an Uncorrectable ECC error. Eagle / Osprey identifies the failing bank and address with the error as well as the error syndrome. The BIOS will convert the information into the failing DIMM and Riser Slot numbers. There may be multiple Uncorrectable errors. In this case, the CM will save the address/syndrome for the most recent error. The DIMM number is logged in the Data field of the Fatal Error.
Eagle nodes (S-Series and E-Series): There are 8 DIMMs maximum on the S-Series Cluster Memory Riser Card. If the DIMM number is not between 0-7 (inclusive), then the failing DIMM cannot be identified.
Osprey nodes (T-Series and F-Series): There are 6 DIMMs on T-Series and 3 DIMMs on F-Series. The data field encodes the DIMM number in the lower 4 bits of the field and the channel number in the upper 4 bits. So a data value of 12 indicates DIMM 1.2 is at fault.
Harrier nodes (V-Series, Atlas, Minime1 & 2): There are 8 DIMMs on V-Series between two different Harrier ASICs; two memory controllers with 2 DIMMs each. The data field encodes which memory channel encountered the uncorrectable error. A data value of 10 means channel one is at fault, a value of 0 means channel zero is at fault.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat the failing Cluster Memory DIMM(s).
D) Replace the failing Cluster Memory DIMM(s).
E) Replace the node motherboard.
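The Osprey data-field layout described under Code 28, sub-code 0xa (channel in the upper 4 bits, DIMM number in the lower 4 bits) can be decoded with two bit operations. This is a minimal sketch that assumes the value shown in the fatal error is hexadecimal, so "12" is read as 0x12. The function name is hypothetical.

```python
def decode_osprey_dimm(data):
    """Split an 8-bit Osprey data field into (channel, dimm, label).

    Assumption: the data value printed in the fatal error is hexadecimal,
    so the "12" in the manual corresponds to 0x12 here.
    """
    if not 0 <= data <= 0xFF:
        raise ValueError("expected an 8-bit data field")
    channel = (data >> 4) & 0xF   # upper 4 bits
    dimm = data & 0xF             # lower 4 bits
    return channel, dimm, f"{channel}.{dimm}"


print(decode_osprey_dimm(0x12))  # (1, 2, '1.2')
```

The label follows the "DIMM C.S" convention (channel, then set) used elsewhere in the Code 28 entries.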
28 - 0xb (mm)
The Cluster memory controller detected a correctable ECC error.
The CM identifies the failing bank and address with the error as well as the error syndrome. The BIOS will convert the information into the failing DIMM and Riser Slot numbers. The DIMM number is logged in the Data field of the Fatal Error.
Eagle nodes (S-Series and E-Series): There are 8 DIMMs maximum on the Cluster Memory Riser Card. If the DIMM number is not between 0-7 (inclusive), then the failing DIMM cannot be identified.
Osprey nodes (T-Series and F-Series): There are 6 DIMMs on T-Series and 3 DIMMs on F-Series. The data field encodes the DIMM number in the lower 4 bits of the field and the channel number in the upper 4 bits. So a data value of 12 indicates DIMM 1.2 is at fault.
Harrier nodes (V-Series, Atlas, Minime1 & 2): This should not occur on Harrier.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat the failing Cluster Memory DIMM.
D) Replace the failing Cluster Memory DIMM.
E) Replace the node motherboard.

28 - 0xc (mm)
The CBIOS runs Cluster Memory Tests as part of POST in both normal operation and manufacturing test. If any test fails due to a data miscompare, the test will generate this fatal error code with sub-code '0xc'. CBIOS runs the following tests:
Walking 1/0 across data
Walking 1/0 across address (512 MB Small Memory Window)
Walking 1/0 using XCB (64 bytes) across segment boundaries
Any test failure will result in a fatal error. The DIMM number is logged in the Data field of the Fatal Error.
Eagle nodes (S-Series and E-Series): There are 8 DIMMs maximum on the Cluster Memory Riser Card. If the DIMM number is not between 0-7 (inclusive), then the failing DIMM cannot be identified.
Osprey nodes (T-Series and F-Series): There are 6 DIMMs on T-Series and 3 DIMMs on F-Series.
The data field encodes the DIMM number in the lower 4 bits of the field and the channel number in the upper 4 bits. So a data value of 12 indicates DIMM 1.2 is at fault.
Harrier nodes (V-Series, Atlas, Minime1 & 2): This should not occur on Harrier.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat the failing Cluster Memory DIMM.
D) Replace the failing Cluster Memory DIMM.
E) Replace the node motherboard.

28 - 0xd (mm)
Pairwwww DIMMxxxx: Illegal SPD value <name of value> <value>
This error indicates that a Cluster Memory DIMM was detected but that the Serial EEPROM present on the DIMM reported an illegal or unsupported value for our memory controller. The DIMM number is logged in the Data field of the Fatal Error.
Example:
Density (SPD byte 31) has more than 1 bit set (e.g. 0x30), which indicates a non-standard part.
See Code 28, sub-code 0x1 for resolution information. Most likely, the DIMM is not qualified for use in our node board.

28 - 0xe (mm)
If there was a problem mapping the CM Small Cluster window into CPU 32-bit space, this error may result when attempting to initialize Cluster memory. The initialization problem could be due either to a hardware failure or to a special NVRAM variable setting that eliminates the address space normally reserved for CM memory windows. One example is setting "mem_max" to a value above 2496. Another is setting "pci_base" above 0xa0000000.
Resolution: Contact 3PAR technical support.

28 - 0xf (mm)
The Cluster memory controller detected a memory error in a specific DIMM bank. The CM memory error status register is logged in the Data field of the Fatal Error.
See Code 28, sub-code 0xb for resolution information.

28 - 0x10 (mm)
H1 LPC0 HW ERR ST [00000004]: dataq_parity
H1 LPC0 ERR Stat [00000006]: EP-Error-Rpt Fatal-Error
H1 LPC0 ERR ID [80000000]: HW-Err
The Cluster memory controller detected a hardware error. The error is printed as shown above. In mm, bits 31-28 represent the LPC number and bits 27-0 are the error bits as set in the hardware error status register. The hardware error means that the Harrier ASIC is non-functional.
Resolution:
A) Cycle power on the node.
B) Replace the node.

28 - 0x20 (mm)
Testing CM data lines with walking 1
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 1 bits test verifies that the processor may directly access CM cluster memory by performing a walking 1's test on all data lines. If any line fails, this error will result.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat Cluster Memory DIMMs.
D) Replace the node motherboard.
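The mm layout given for Code 28, sub-code 0x10 (LPC number in bits 31-28, error bits in bits 27-0) can be unpacked as in the sketch below. The function name is hypothetical, and the individual error bits are reported only by position because this section does not document their names.

```python
def decode_lpc_hw_error(mm):
    """Split the 32-bit data value into (lpc, error_bits, set_bit_positions).

    Bits 31-28 carry the LPC number and bits 27-0 carry the hardware error
    status bits, as stated for Code 28, sub-code 0x10. Bit names are not
    decoded because this section does not list them.
    """
    if not 0 <= mm <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit data value")
    lpc = (mm >> 28) & 0xF
    error_bits = mm & 0x0FFFFFFF
    set_bits = [bit for bit in range(28) if error_bits & (1 << bit)]
    return lpc, error_bits, set_bits


lpc, bits, positions = decode_lpc_hw_error(0x10000004)
print(lpc, hex(bits), positions)  # 1 0x4 [2]
```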
28 - 0x21 (mm)
Testing CM data lines with walking 0
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 0 bits test verifies that the processor may directly access cluster memory by performing a walking 0's test on all data lines. If any line fails, this error will result.
See Code 28, sub-code 0x20 for resolution information.

28 - 0x22 (mm)
ZERO CM problem at addr xxxx
Between PCI bus tests, a small portion of cluster memory is cleared. If errors in clearing the memory are detected, this error will result.
See Code 28, sub-code 0x20 for resolution information.

28 - 0x23 (mm)
Testing CM address lines with walking 1 (first 512 MB only)
The CM walking 1 address bits test verifies that the processor may directly access cluster memory by performing a walking 1's test on all address lines. If any line fails, this error will result.
See Code 28, sub-code 0x20 for resolution information.
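The walking 1 and walking 0 data-line tests named in sub-codes 0x20 and 0x21 follow a common pattern: write a word with exactly one bit set (or exactly one bit cleared), read it back, and compare. The sketch below illustrates the pattern against a simulated memory and is not the CBIOS implementation. The `FakeMemory` class, the data width, and the fault model are assumptions made for the example.

```python
class FakeMemory:
    """Simulated memory; stuck_low_mask marks data lines stuck at 0."""

    def __init__(self, stuck_low_mask=0):
        self.stuck_low_mask = stuck_low_mask
        self.cells = {}

    def write(self, addr, value):
        # A stuck-low line always stores 0 in its bit position.
        self.cells[addr] = value & ~self.stuck_low_mask

    def read(self, addr):
        return self.cells.get(addr, 0)


def walking_test(mem, addr, width=64, walk_one=True):
    """Return a list of (addr, wrote, read) tuples for every miscompare."""
    full = (1 << width) - 1
    failures = []
    for bit in range(width):
        pattern = 1 << bit
        if not walk_one:
            pattern = full & ~pattern   # walking 0: all ones except one bit
        mem.write(addr, pattern)
        got = mem.read(addr)
        if got != pattern:
            failures.append((addr, pattern, got))
    return failures


healthy = walking_test(FakeMemory(), 0x1000)
faulty = walking_test(FakeMemory(stuck_low_mask=0x20), 0x1000, width=8)
print(healthy)  # []
print(faulty)   # [(4096, 32, 0)]
```

Each failure tuple corresponds to the Addr / Wrote / Read triple in the error message. A walking 0 pass over the same faulty memory fails on every pattern that expects the stuck line to read 1.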
28 - 0x24 (mm)
Testing CM address lines with walking 0 (first 512 MB only)
The CM walking 0 address bits test verifies that the processor may directly access cluster memory by performing a walking 0's test on all address lines. If any line fails, this error will result.
See Code 28, sub-code 0x20 for resolution information.

28 - 0x25 (mm)
Testing CM segment decode boundaries
This test verifies that memory decoding at all CM DIMM pairs is working correctly. It does so by writing a unique 128 bytes at each memory decode boundary location. It then verifies the values were written correctly and looks for corruption of other addresses.
See Code 28, sub-code 0x20 for resolution information.

28 - 0x26 (eecd)
Testing CM with random XOR (all Cluster Memory)
ee = number of XOR errors.
c = Channel Number where the error took place.
d = DIMM number where the error took place.
This function performs a random data test on all cluster memory attached to the CM to verify memory under stress with random patterns. This test also exercises the CM XOR engine as several sources are used simultaneously throughout the cluster memory test.
See Code 28, sub-code 0x20 for resolution information.

28 - 0x27 (0)
This error occurs when the DQS training fails to find working values for the DQS enable, DQS out skew, and DQS in skew.
See Code 28, sub-code 0x20 for resolution information.

28 - 0x30 (mm)
Testing CM ECC lines with walking 1
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 1 bits test verifies that the processor may directly access CM cluster memory by performing a walking 1's test on all ECC lines. If any line fails, this error will result.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat Cluster Memory DIMMs.
D) Replace the node motherboard.
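The (eecd) data value of Code 28, sub-code 0x26 packs three fields into one number. The sketch below assumes each letter stands for one hexadecimal digit, so ee is the upper byte and c and d are the two low nibbles. That reading is inferred from the notation and is not stated explicitly in the table, and the function name is hypothetical.

```python
def decode_xor_test_data(data):
    """Split an (eecd) value into (error_count, channel, dimm).

    Assumption: each letter in "eecd" is one hex digit, giving an 8-bit
    error count followed by a 4-bit channel and a 4-bit DIMM number.
    """
    if not 0 <= data <= 0xFFFF:
        raise ValueError("expected a 16-bit data value")
    error_count = (data >> 8) & 0xFF
    channel = (data >> 4) & 0xF
    dimm = data & 0xF
    return error_count, channel, dimm


print(decode_xor_test_data(0x0312))  # (3, 1, 2)
```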
28 - 0x31 (mm)
Testing CM ECC lines with walking 0
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 0 bits test verifies that the processor may directly access cluster memory by performing a walking 0's test on all ECC lines. If any line fails, this error will result.
See Code 28, sub-code 0x30 for resolution information.

28 - 0x32 (mm)
Testing CM Op Codes
The CM Op Code test verifies that the processor may execute each of the available operations for this cluster manager ASIC. This error means that a particular op code is not supported. If any op code fails, this error will result.
Resolution:
A) Replace the node motherboard.

28 - 0x33 (data)
Testing CM Source Interrupts
The CM Source Interrupts test verifies that an interrupt is generated for each CMA data path, from processor, CMA, or companion CMA to either processor memory or local CMA. On systems with only one CMA, the companion tests are not done.
Resolution:
A) Replace the node motherboard.

28 - 0x34 (data)
Testing CM I2C communication test
The CM I2C communication test will read and write to various safe CMA registers or CMA memory and verify that the expected values are read. A failure means either a bad DIMM or a bad CMA.
See Code 28, sub-code 0x30 for resolution information.

28 - 0x35 (data)
Stopped on an Uncorrectable Error
The scan for errors found an uncorrectable error in one of the CMAs. The system stopped during a BIOS test when this error was discovered.
See Code 28, sub-code 0x30 for resolution information.

28 - 0x36 (data)
Stopped on a Correctable Error
The scan for errors found a correctable error in one of the CMAs. The system stopped during a BIOS test when this error was discovered.
See Code 28, sub-code 0x30 for resolution information.

28 - 0x40 (mm)
Testing CM MMW data lines with walking 1
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 1 bits test verifies that the processor may directly access CM cluster memory by performing a walking 1's test on all data lines. This test uses the Medium Memory Window (MMW). If any line fails, this error will result.
Resolution:
A) Cycle power on the node.
B) Reseat Cluster Memory riser card.
C) Reseat Cluster Memory DIMMs.
D) Replace the node motherboard.

28 - 0x41 (mm)
Testing CM MMW data lines with walking 0
Addr (xxxx) Wrote(yyyy) Read(zzzz)
The CM walking 0 bits test verifies that the processor may directly access cluster memory by performing a walking 0's test on all data lines. This test uses the Medium Memory Window (MMW). If any line fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.

28 - 0x42 (mm)
ZERO CM problem at addr xxxx
Between PCI bus MMW tests, a small portion of cluster memory is cleared. If errors in clearing the memory are detected, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x43 (mm)
Testing CM address lines with walking 1 (MMW)
The CM walking 1 address bits test verifies that the processor may directly access cluster memory by performing a walking 1's test on all address lines using the medium memory window. If any line fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.

28 - 0x44 (mm)
Testing CM address lines with walking 0 (MMW)
The CM walking 0 address bits test verifies that the processor may directly access cluster memory by performing a walking 0's test on all address lines using the medium memory window. If any line fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.

28 - 0x45 (mm)
Testing CM address lines with walking 1 (RMW)
The CM walking 1 address bits test verifies that the processor may directly access cluster memory by performing a walking 1's test on all address lines using the remote memory window. If any line fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.
28 - 0x46 (mm)
Testing CM address lines with walking 0 (RMW)
The CM walking 0 address bits test verifies that the processor may directly access cluster memory by performing a walking 0's test on all address lines using the remote memory window. If any line fails, this error will result.
See Code 28, sub-code 0x40 for resolution information.

29 - 0x0 (data)
Link 0 did not come up (0xac000000) error = (0x002022ff)
(data = link number)
CM Links are high speed connections between all of the node boards in a cluster via the center panel. During Manufacturing test, nodes are connected to a special Manufacturing Center panel that connects the link transmitter to its own receivers (external loopback). When the node senses that it is in this special Center Panel, it will initialize all of the links and perform loopback tests. If any link fails to initialize, this sub-code will be reported.
Resolution:
A) Cycle power on the node.
B) Verify that the node is securely mated with the Center Panel.
C) Turn off power, re-seat the node into the center panel, and turn power back on.
D) Replace the node motherboard.

29 - 0x1 (data)
CM Link Initialization failed
(data = LLRR) where LL is the link bit pattern: 01 is link 0, 02 is link 1, 04 is link 2, and 08 is link 3. RR is the failure reason: E4 is Hardware error, F0 is user abort.
CM Links are high speed connections between all of the node boards via the center panel. During Manufacturing test, nodes are connected to a special Manufacturing Center panel that connects each link's transmitter to its own receiver (external loopback). When the node senses that it is in this special Center Panel, it will initialize the links and run a special test to verify the operation of the transmitter/receivers of each link. If any link fails, the test will report this sub-code.
See Code 29, sub-code 0x0 for resolution information.

29 - 0x2 (data)
CM# Link XOR test: Link [0]..[FAIL] (1)
(data = the link bit pattern.
Bit 0 is link 0, bit 1 is link 1, bit 2 is link 2, and bit 3 is link 3.)
CM Links are high speed connections between all of the node boards via the center panel. During Manufacturing test, nodes are connected to a special Manufacturing Center panel that connects each link's transmitter to its own receiver (external loopback). When the node senses that it is in this special Center Panel, it will initialize the links and run a special test to verify the operation of the transmitter/receivers of each link. If any link fails, the test will report this sub-code.
See Code 29, sub-code 0x0 for resolution information.

29 - 0x3 (data)
CM# Link INT??? test: Link [0]..[FAIL] (1)
(data = the link bit pattern. Bit 0 is link 0, bit 1 is link 1, bit 2 is link 2, and bit 3 is link 3.)
The CM Link INT test verifies that setting either of the two interrupt flags (DEST, SRC) in the XCB actually generates an interrupt to the processor.
See Code 29, sub-code 0x0 for resolution information.

29 - 0x4 (data)
(data = link number)
The CM Link Round Trip Test failed due to an XCB failure. CM XCB failed during link DMA. Use the "eagle status" command for more information on the type of error. This test checks the CM link status multiple times during the test.
The "(Send)" part of the message indicates which stage failed. Another possible value is "(Receive)".
See Code 29, sub-code 0x0 for resolution information.

29 - 0x5 (data)
(data = link number)
The CM Link Round Trip Test failed due to data miscompare. All packets have a length check and timestamp check. Payload compare is optional. Use the "eagle status" command to check for Uncorrectable ECC errors.
See Code 29, sub-code 0x0 for resolution information.

29 - 0x6 (data)
(data = link number)
The CM Link Round Trip Test failed due to packet timeout. A packet was sent and not received in a reasonable timeout period. The Round Trip Test may not have been started on a remote node. Use the "eagle status" command to check for Uncorrectable ECC errors.
Resolution:
A) Start CM Link Round Trip Test on remote node.
B) Cycle power on the node.
C) Verify that the node is securely mated with the Center Panel.
D) Turn off power, re-seat the node into the center panel, and turn power back on.
E) Replace the node motherboard.

29 - 0x10 (0)
REC_EN went low. Test failed for link [x](yyyyyyyy)
The "cma link init" command is used to initialize and bring up the CM links to nodes which indicate a "Power Ok" state. If this error occurs, it is possible the remote node was transmitting BIST, but then later stopped (such as from a reset or power off).
Resolution:
A) Perform the same test again.
B) Replace the node motherboard.

29 - 0x11 (0)
The CM has XCB engines which transfer data. Software manages the producer register and the CM hardware follows with the consumer register. If these two do not agree when the CM should be idle, the CM may have halted due to the failure of some operation. This problem is likely caused by a cluster memory or link failure.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
C) Replace the link partner node.
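The LLRR data value described for Code 29, sub-code 0x1 combines a link bit pattern with a failure reason. The sketch below assumes LL is the upper byte and RR the lower byte of a 16-bit value, and that more than one link bit may be set at once. Only the two reason codes listed in the table are named, and the function and table names are hypothetical.

```python
LINK_BITS = {0x01: 0, 0x02: 1, 0x04: 2, 0x08: 3}       # bit pattern -> link
REASONS = {0xE4: "Hardware error", 0xF0: "User abort"}  # documented reasons


def decode_link_init_failure(data):
    """Return (failed_links, reason) for an LLRR data value.

    Assumption: LL occupies the upper byte and RR the lower byte.
    """
    if not 0 <= data <= 0xFFFF:
        raise ValueError("expected a 16-bit data value")
    pattern = (data >> 8) & 0xFF
    reason_code = data & 0xFF
    links = [link for bit, link in sorted(LINK_BITS.items()) if pattern & bit]
    reason = REASONS.get(reason_code, f"undocumented reason 0x{reason_code:02X}")
    return links, reason


print(decode_link_init_failure(0x05E4))  # ([0, 2], 'Hardware error')
```

The same `LINK_BITS` table applies to the link bit patterns reported by sub-codes 0x2 and 0x3, which use bit 0 through bit 3 for links 0 through 3.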
30 - 0x0 (0)
The Exar and Oxford serial chips are used for a secondary low speed link which directly connects all nodes in the cluster. They are used primarily in the event of a link failure to verify whether another node in the cluster has actually gone down. Since the part is integrated onto the motherboard and is on a PCI bus, a failure to locate the internal serial chips may indicate other PCI problems as well.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

30 - 0x1 (0)
When the node board is inserted into a Manufacturing Test Centerpanel, the internal Serial Port Manufacturing test will automatically run. This error indicates failures on all ports tested.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
30 - 0x2 (0)
Port (4): Processed 109 bytes [FAIL]
All cluster internal serial ports go through a quick internal loopback test immediately after initialization to do a short test of proper operation. This test will run regardless of the type of centerplane in which the node is connected. This error indicates failures on all ports tested.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

30 - 0x3 (0)
Internal UART is not functioning properly.
Most likely this is due to a hardware failure related to the SuperIO.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

31 - 0x2 (0)
FPGA Scratchpad registers failed, meaning bad FPGA hardware.
Resolution:
A) Cycle power on the node.
B) Replace the node.

32 - 0x0 (0)
Xor Engine Status: P0_XERR
Error Status : XOR_ERR
PCI0 Error Status:
PCI1 Error Status:
The Eagle ASIC and Osprey ASIC contain a DMA engine capable of XOR operations. This DMA engine is commonly referred to as the XCB engine. The XCB engine can DMA data between 14 different modules within the ASIC, each module capable of sinking or sourcing data. The XCB engine will stop all DMA if it encounters an error while transferring data. The XCB error status indicates the module that produced the error. Further details of the error can be gathered by inspecting the error registers of that module. Use the whack command "cma status" to get further diagnostic information. If the user continues past this error, software will attempt to reset the error and continue.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

33 - 0x0 (0)
This error indicates that an SDRAM DIMM for which information was requested is no longer available. This may be due to an intermittent I2C bus, or a hardware failure.
Resolution:
A) Cycle power on the node.
B) Replace the failing DIMM's pair.
C) Replace the node motherboard.

34 - 0x1 (0xff)
This error indicates an uncorrectable error occurred on the PCI bus. In the future, the data field may indicate the PCI slot number for the device which failed. In order to determine the cause of this error, it may be useful to review either console messages or the IDE disk log. Typical messages preceding this error are likely difficult to read, but may indicate the exact cause.
Example:
--- SMI: smm_inb(0x3a) == 0x86
GPE 9 triggered
Error in PCI device 02.02.00 (PCI/PCI Bridge #0 (controls slot 1)):
PCI status register (0x06) [62b0]: Signaled system error (SERR#), Received master abort
Secondary PCI status register (0x1e) [0aa0]: Signaled target abort
Bridge P_SERR (0x6a) [80]: Delayed transaction master initiator timeout
Error in PCI device 03.01.00 (PCI Slot 1):
PCI status register (0x06) [1290]: Received target abort
Secondary PCI status register (0x1e) [0a80]: Signaled target abort
Error in PCI device 04.06.00 (inside PCI Slot 1):
PCI status register (0x06) [1230]: Received target abort
Error in PCI device 04.06.01 (inside PCI Slot 1):
PCI status register (0x06) [1230]: Received target abort
(PCI errors not cleared)
*** Fatal error: Code 34, sub-code 0x1 (ff).
In the above case, a card in PCI Slot 1 was transferring data up to a device, likely the cluster manager, when it didn't get a response. The bridge above the card received a master abort, which it then relayed to its secondary side as a signaled target abort. The bridge on the card in PCI Slot 1 then received the target abort and signaled a target abort on its secondary side. Both PCI devices then indicated they received target aborts.
Resolution:
A) Cycle power on the node.
B) Reseat all PCI cards.
C) Replace the suspected PCI card.
D) Remove PCI cards one at a time.
E) Replace the node motherboard.

35 - 0x0 (data)
One or both DIMMs in a DIMM pair have failed. Bits 4-7 of the data value indicate the DIMM pair.
If data is 0, then DIMM pair 0 has failed.
If data is 10, then DIMM pair 1 has failed.
Example:
--- SMI: TEMPCAUT (SMALERT): 0x01 (bits reset)
Uncorrectable ECC error 0x9279a103 recorded in reg 0x98
Pair1, either DIMM1 or DIMM3 contains the error
Error in locations [0x382cd818 .. 0x382cd81f]
Uncorrectable ECC error 0x9279a101 recorded in reg 0x94
Syndrome/bit number information might not be accurate, as more than 1 error happened
Pair1, either DIMM1 or DIMM3 contains the error
Error in locations [0x382cd808 .. 0x382cd80f]
(Clearing cache line at 0x382cd800)
(Clearing cache line at 0x382cd800)
ESR == 0x0003 (expected low bit == 0)
*** Fatal error: Code 35, sub-code 0x0 (10).
Resolution:
A) Cycle power on the node.
B) Clear dust and debris from the node.
C) Remove and reseat the specified CPU DIMM pair.
D) Replace the failed CPU DIMM pair.
E) Replace the node motherboard.
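When working from a captured console or IDE log, the final "*** Fatal error" line in the examples above can be parsed to recover the code, sub-code, and data value, and the Code 35 data value can then be split into its DIMM pair. This is a sketch under two assumptions: the code number is printed in decimal, and the parenthesised data value is hexadecimal (as "(ff)" and "(10)" suggest). The function names are hypothetical.

```python
import re

# Matches lines such as "*** Fatal error: Code 35, sub-code 0x0 (10)."
FATAL_RE = re.compile(
    r"Fatal error: Code (\d+), sub-code 0x([0-9a-fA-F]+) \(([0-9a-fA-F]+)\)"
)


def parse_fatal_error(line):
    """Return (code, sub_code, data) or None if the line does not match."""
    match = FATAL_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))           # assumed decimal
    sub_code = int(match.group(2), 16)
    data = int(match.group(3), 16)       # assumed hexadecimal
    return code, sub_code, data


def cpu_dimm_pair(data):
    """Bits 4-7 of a Code 35 data value select the DIMM pair."""
    return (data >> 4) & 0xF


code, sub_code, data = parse_fatal_error(
    "*** Fatal error: Code 35, sub-code 0x0 (10)."
)
print(code, sub_code, cpu_dimm_pair(data))  # 35 0 1
```

The result agrees with the example log, where the messages before the fatal error name Pair1.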
35 - 0x1 (data)
A single DIMM of a DIMM pair has failed. The data value indicates which DIMM. Bits 4-7 of the data value indicate which DIMM pair. Bits 0-3 of the data value indicate which DIMM within that pair.
If data is 0, then DIMM 0 of pair 0 has failed.
If data is 1, then DIMM 1 of pair 0 has failed.
If data is 10, then DIMM 0 of pair 1 has failed.
If data is 11, then DIMM 1 of pair 1 has failed.
Resolution:
A) Cycle power on the node.
B) Clear dust and debris from the node.
C) Remove and reseat the specified CPU DIMM.
D) Replace the failed CPU DIMM.
E) Replace the node motherboard.

35 - 0x2 (data)
This code means an ECC error was detected, but the BIOS did not completely decode the error.
See Code 35, sub-code 0x0 for resolution information.

36 - 0x0 (0)
In the event of a hardware failure, it is normal to trigger a processor System Management Interrupt (SMI). If the SMI gets cleared before the BIOS has a chance to observe it (which should not happen), then this error will result.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

36 - 0x1 (0)
In normal operation the operating system should not write to the ACPI PM register. If the BIOS detects a write took place, it will flag this as an error caused by a failing operating system or other node hardware.
Resolution:
A) Cycle power on the node.
B) Reinstall the operating system.
C) Replace the node motherboard.

36 - 0x2 (0)
The BIOS was not able to determine the actual cause of the triggered SMI.
Resolution:
A) Cycle power on the node.
B) Reinstall the operating system.
C) Replace the node motherboard.

36 - 0x3 (0)
This error may result if there is an unknown hardware device triggering SMIs in the system and those SMIs are happening too frequently. Most likely the device continues to trigger an SMI because its problem has not been serviced, and no real work is possible at this point because immediately after returning from the SMI, another is triggered.
The BIOS attempts to recognize this condition and stop with a fatal error rather than just continuing to display errors.
Resolution:
A) Remote reset or cycle power on the node.
B) Reinstall the operating system.
C) Replace the node motherboard.

36 - 0x4 (0)
This error may result if a known SMI cause is happening too frequently. In a normally functioning node, SMIs should occur infrequently, as there is a performance impact associated with handling each SMI. The BIOS will first attempt to disable known SMIs in order to mask this problem. If that is insufficient, the BIOS will stop with this fatal error.
  • 106. DescriptionSubcode Resolution: A) Check for CPU memory DIMM correctables in the event log. Replace DIMMs if they are suspect. B) Check for hardware oscillating events in the event log (such as PS status). On some node types, board GPIO changes are reported through SMI. You may need to replace power supplies or another FRU. C) Replace the node motherboard. This error will result if the BIOS inadvertently changes the contents of CR2 while processing a SMI. This should not happen in normal operation, but might happen as 36 - 0x5 (0) the result of a `whack' command. As returning from this SMI could easily cause corruption of the OS or of a user-level program, this fatal error is flagged instead. Resolution: Cycle power on the node. Code 37 sub-codes are a bitmask of error values.37 - zz (0) This means you may find an error which will simultaneously trigger multiple GEVENTs. This event is probably one of the hardest to interpret as it often will indicate multiple board devices have detected a fatal error condition. In general, it's much more convenient to look up the decoded error in the BIOS output of the idelog rather than manually decoding this event back to indicators. Resolution: Look up each individual documented sub-code below which when OR'd together form the sub-code observed. S-Series and E-Series (P4) nodes:37 - 0x1 (0) --- SMI: smm_inb(0x39) == 0x01 CMIC_FATAL (GEVENT0) This error indicates the CMIC (North Bridge) had a fatal error. Resolution: A) Cycle power on the node. B) Verify the system is getting adequate ventilation. C) Remove any recently installed PCI cards. D) Remove all PCI cards. E) Replace the node motherboard. 
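The Code 37 lookup procedure (split the observed sub-code into the individual documented values that OR together to form it) can be sketched as follows. This is a minimal illustration only; the function name `split_code37_subcode` is hypothetical and is not part of any HP or 3PAR tool.

```python
def split_code37_subcode(subcode: int) -> list[int]:
    """Return the single-bit values that OR together to form `subcode`."""
    parts = []
    bit = 1
    while bit <= subcode:
        if subcode & bit:
            parts.append(bit)  # this bit is set, so it is one sub-code to look up
        bit <<= 1
    return parts


# Example: an observed sub-code of 0x41 combines sub-codes 0x1 and 0x40.
for part in split_code37_subcode(0x41):
    print(f"Look up Code 37, sub-code {part:#x}")
```

Each value returned corresponds to one documented Code 37 entry, so the resolutions for all of them should be reviewed together.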
37 - 0x2 (0)
S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x02 ALERT (GEVENT1)
Error in PCI device 00.00.00 (CMIC-LE Memory Controller/Thin IMB): ESR (0x4c) [0004]: IMBus error (PCI errors not cleared)
The output above can be considered "typical" but may contain any of the possible CMIC (North Bridge) Memory Controller or other PCI bus errors. An IMBus error indicates a communication problem between the North Bridge and one of the South Bridge or CIOBX2. This would likely indicate a node motherboard failure. It has been observed in the field that a flaky or bad PCI socket may also cause this.
Resolution:
A) Cycle power on the node.
B) Verify the system is getting adequate ventilation.
C) Remove any recently installed PCI cards.
D) Remove all PCI cards.
E) Replace the node motherboard.

This error indicates the MCH (North Bridge) has detected a fatal condition. Most likely there are other error messages present in the idelog to help pinpoint the issue. Since the MCH is the top of the root complex, it is very common to see the MCH indicating a Fatal error on nearly all failures.
Resolution:
A) Cycle power on the node.
B) Replace CPU DIMMs if no other error is indicated.
C) Replace the node motherboard.

37 - 0x4 (0)
S-Series (PIII) nodes:
--- SMI: smm_inb(0x39) == 0x04 GPE 2 triggered THERMT_L0_OSB (GEVENT2)
This indicates a thermal event triggered a GPIO interrupt. It is a fatal condition on Pentium III nodes, and the node will be immediately taken out of the cluster with this fatal error.
Resolution:
A) Cycle power on the node. If it is a temperature related problem, verify the system is getting adequate ventilation.
B) Replace the node motherboard.

S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x04 GPE 2 triggered P0_PROC_HOT (GEVENT2)
The Pentium 4 CPU supports clock modulation, which reduces the core frequency when the core temperature is too high. The BIOS enables this support when starting the OS, so after the node has joined the cluster, the BIOS will asynchronously notify the OS if this event occurs but will not take the node out of the cluster. At the same time, the Pentium 4 processor will automatically reduce its clock speed so as to generate less heat and not reach a shutdown temperature. This message is therefore not fatal on P4 CPUs.
Resolution:
A) Cycle power on the node. If it is a temperature related problem, verify the system is getting adequate ventilation.
B) Replace the node motherboard.

This error indicates either the PLX #0 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX bridge #0 detected a parity error. These components manage PCI slots 4 and 5 on T-Series and Slot 2 on F-Series.
See Code 37, sub-code 0x1 for resolution information.

This error indicates that the PLX #0 PCIe-PCIe bridge detected a fatal error. These components manage PCI slots 6, 7, and 8; Harrier 1 and 2 LPC2.
See Code 37, sub-code 0x1 for resolution information.

37 - 0x8 (0)
S-Series (PIII) nodes:
--- SMI: smm_inb(0x39) == 0x08 GPE 3 triggered THERMT_L1_OSB (GEVENT2)
This indicates a thermal event triggered a GPIO interrupt.
See Code 37, sub-code 0x2 for resolution information.

S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x39) == 0x08 GPE 3 triggered P1_PROC_HOT (GEVENT2)
This indicates a thermal event triggered a GPIO interrupt.
See Code 37, sub-code 0x2 for resolution information.
  • 108. DescriptionSubcode This error indicates either the PLX #0 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX bridge #0 detected a fatal error (SERR). These components manage PCI slots 4, and 5 on T-Series and Slot 2 on F-Series. See Code 37, sub-code 0x1 for resolution information. This error indicates that the PLX #1 PCIe-PCIe bridge detected a fatal error. These components manage PCI slots 3, 4, and 5; Harrier 1 and 2 LPC1. See Code 37, sub-code 0x1 for resolution information. S-Series (PIII) nodes:37 - 0x10 (0) GPE 4 triggered MIRQ (GEVENT4) This error indicates the memory controller (CNB20HE) triggered an interrupt. The CNB20HE documentation lists possible sources as correctable ECC error on Memory data bus and Processor data bus. See below (P4) for resolution information. S-Series and E-Series (P4) nodes: --- SMI: smm_inb(0x39) == 0x10 GPE 4 triggered P0_IERR (GEVENT4) This error indicates that P4 CPU 0 has asserted IERR#, which is used to indicate a processor internal error event occurred. The Intel documentation indicates one cause of this error is a machine check exception when exceptions have not yet been enabled. From our experience in the field, the problem is possibly a CPU or node motherboard failure. This error indicates either the PLX #1 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX brige #1 detected a parity error. These components manage PCI slots 2, and 3 on T-Series and Slot 1 on F-Series. See Code 37, sub-code 0x1 for resolution information. This error indicates an internal error. This should not occur in a V-Series system. See Code 37, sub-code 0x1 for resolution information. S-Series and E-Series (P4) nodes:37 - 0x20 (0) --- SMI: smm_inb(0x39) == 0x20 GPE 5 triggered P1_IERR (GEVENT5) This error indicates that P4 CPU 1 has asserted IERR#. See Code 37, sub-code 0x10 (P4) for resolution information. This error indicates either the PLX #1 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX brige #1 detected a fatal error (SERR). 
These components manage PCI slots 2, and 3 on T-Series and Slot 1 on F-Series. See Code 37, sub-code 0x1 for resolution information. This error indicates that NEMOE raised the FPGA SMI interrupt and it was not handled properly. See Code 37, sub-code 0x1 for resolution information. S-Series and E-Series (P4) nodes:37 - 0x40 (0) --- SMI: smm_inb(0x39) == 0x40 GPE 6 triggered P_SERR (GEVENT6) This error indicates one or more of the system's chipset is asserting P_SERR (primary side system error). Output is usually followed by outstanding PCI errors as indicated by chipset devices. Resolution: 108 CBIOS Error Codes
  • 109. DescriptionSubcode A) Identify and replace failing PCI card based on error output. It may be necessary to contact hardware engineering with BIOS output to determine which PCI slot is at fault. B) Remove all PCI cards. C) Replace the node motherboard. This error indicates the MCH (North Bridge) has detected an uncorrectable error. Most likely there are other error messages present in the idelog to help pinpoint the issue. Since the MCH is the top of the root complex, it's very common to see the MCH indicating Uncorrectable error on nearly all failures. Resolution: A) Cycle power on the node. B) Replace CPU DIMMs if no other error is indicated. C) Replace the node motherboard. S-Series and E-Series (P4) nodes:37 - 0x80 (0) --- SMI: smm_inb(0x39) == 0x80 GPE 7 triggered P_PERR (GEVENT7) This error indicates one or more of the system's chipset is asserting P_PERR (primary side parity error). See Code 37, sub-code 0x40 for resolution information. This error indicates either the PLX #2 PCIe-PCIX bridge or the Intel 31154 PCIX-PCIX brige #2 detected a fatal error (SERR). These components manage PCI slots 0, and 1 on T-Series and Slot 0 on F-Series. See Code 37, sub-code 0x1 for resolution information. This error indicates an internal error. This should not occur in a V-Series system. See Code 37, sub-code 0x1 for resolution information. S-Series (PIII) nodes:37 - 0x100 (0) --- SMI: smm_inb(0x3a) == 0x01 GPE 8 triggered CPU_TEMP_INTR (GEVENT8) This indicates a CPU temperature event triggered a GPIO interrupt. See Code 37, sub-code 0x2 for resolution information. S-Series and E-Series (P4) nodes: --- SMI: smm_inb(0x3a) == 0x01 GPE 8 triggered S_SERR (GEVENT8) This error indicates one or more of the system's chipset is asserting S_SERR (secondary side system error). See Code 37, sub-code 0x40 for resolution information. 
T-Series and F-Series (5000P) nodes: --- SMI request via EXT_SMI This error indicates another node in the cluster has forced this node to handle an SMI. Most likely the other node is attempting to force a panic dump because the local node has stopped responding. Resolution: A) Inspect the core dump to determine if the cause was a software or hardware failure. B) Replace the node motherboard if the issue recurs and can not be identified as a software failure. This error indicates an internal error. This should not occur in a V-Series system. See Code 37, sub-code 0x1 for resolution information. InForm OS Failed Error Codes and Resolution 109
37 - 0x200 (0)
S-Series (PIII) nodes:
This error indicates one or more of the system's chipsets is asserting SERR (system error). Output is followed by the PCI scan results, which display outstanding PCI errors of all PCI bus devices.
See below (P4) for resolution information.

S-Series and E-Series (P4) nodes:
--- SMI: smm_inb(0x3a) == 0x02 GPE 9 triggered S_PERR (GEVENT8)
This error indicates one or more of the system's chipsets is asserting S_PERR (secondary side parity error).
Resolution:
A) Identify and replace the failing PCI card based on error output. It may be necessary to contact hardware engineering with BIOS output to determine which PCI slot is at fault.
B) Remove all PCI cards.
C) Replace the node motherboard.

This error indicates that CPU 0 has asserted IERR#, which is used to indicate a processor internal error event occurred. The Intel documentation indicates one cause of this error is a machine check exception when exceptions have not yet been enabled. From our experience in the field, the problem is possibly a CPU or node motherboard failure.
Resolution:
A) Cycle power on the node.
B) Verify the system is getting adequate ventilation.
C) Remove all PCI cards.
D) Replace the node motherboard.

37 - 0x400 (0)
This error indicates that CPU 1 has asserted IERR#, which is used to indicate a processor internal error event occurred.
See Code 37, sub-code 0x200 for resolution information.

38 - 0x9 (data)
Both Power Supplies failed: DC Output Bad
This error indicates there is a hardware problem in one of the node power supplies. If this failure is transient, it could also be caused by turning the power supply off and then on, or by a quick AC loss followed by AC being restored. If both power supplies fail simultaneously (not likely), this is a fatal error. The data value may be decoded to determine which power supply triggered this error. The low 2 bits are a bitmask of the DC Output status for the two power supplies. As a Fatal error, the value will be 3, indicating PS0 and PS1 both had a DC Output Bad.
Resolution:
A) Ensure a service operation was not taking place at the time, and that AC had not also failed.
B) Replace the power supply.
C) Replace the node motherboard.

38 - 0x11 (data)
PS x has down-rev firmware (x)
This failure code indicates the power supply firmware revision is not up-to-date and therefore not supported.
Resolution: Replace power supply.

38 - 0x12 (data)
PS x Battery has down-rev firmware (rev)
This failure code indicates the battery attached to the indicated power supply has firmware that is not up-to-date and therefore not supported.
  • 111. DescriptionSubcode Resolution: Replace battery. Maximum count for no successful OS boot (xxxx) exceeded.39 - 0x1 (0) Type "unset cnt_no_os_boot" to clear this error. This error indicates that the BIOS has detected that the node has not successfully booted the OS and will now prohibit boots until operator intervention clears this error. Resolution: A) Clear this error as suggested in the error text. You may also turn off this checking mechanism if it does not meet your application. To do this, type, "unset max_no_os_boot" at a Whack prompt. B) Verify that a valid operating system image is installed on the node's internal disk. Reinstall the operating system if defective. C) Replace the IDE drive. Maximum count for OS boot with no cluster (xxxx) exceeded.39 - 0x2 (0) Type "unset cnt_no_cluster" to clear this error. This error indicates that the BIOS has detected that the node has booted, but the cluster has not successfully formed several times. The BIOS will prohibit boots until operator intervention clears this error. This is to prevent cyclic node up/down caused by a hardware or software failure. This increases the reliability of the cluster by preventing the node from continuously attempting to join the cluster. Resolution: A) Clear this error as suggested in the error text. You may also turn off this checking mechanism if it does not meet your application. To do this, type, "unset max_no_cluster" at a Whack prompt. B) Verify that a valid operating system image is installed on the node's internal disk. Reinstall the operating system if defective. C) Replace the IDE drive. Maximum count for OS panic (xxxx) exceeded.39 - 0x3 (0) Type "unset cnt_os_panic" to clear this error. This error indicates that the BIOS has detected that the node has booted and then caused a panic several times. When the OS causes a panic, it notifies the BIOS of this event, so the BIOS can track problems. 
Once a limit is exceeded, the BIOS will prohibit boots until operator intervention clears this error. This is to prevent cyclic node up/down caused by a hardware or software failure. This increases the reliability of the cluster by preventing the node from continuously attempting to join the cluster. Resolution: A) Clear this error as suggested in the error text. You may also turn off this checking mechanism if it does not meet your application. To do this, type, "unset max_os_panic" at a Whack prompt. B) Verify that a valid operating system image is installed on the node's internal disk. Reinstall the operating system if defective. C) Replace the IDE drive. Maximum count for OS cluster without shutdown (xxxx) exceeded.39 - 0x4 (0) Type "unset cnt_no_shutdown" to clear this error. This error indicates that the BIOS has detected that the node has booted, but has not been shut down properly several times. The BIOS will prohibit boots until operator intervention clears this error. This is to prevent cyclic node up/down caused by a hardware or software failure. This increases the reliability of the cluster by preventing the node from continuously attempting to join the cluster. Resolution: A) Clear this error as suggested in the error text. You may also turn off this checking mechanism if it does not meet your application. To do this, type, "unset max_no_shutdown" at a Whack prompt. InForm OS Failed Error Codes and Resolution 111
B) Verify that a valid operating system image is installed on the node's internal disk. Reinstall the operating system if defective.
C) Replace the IDE drive.

39 - 0x5 (0)
Maximum count for same fatal error (xxxx) exceeded.
Type "unset cnt_same_fatal" to clear this error.
This error indicates that the BIOS has detected that the same fatal or non-fatal error has occurred repeatedly. The BIOS will prohibit boots until operator intervention clears this error. This is to prevent cyclic node up/down caused by a hardware or software failure. This increases the reliability of the cluster by preventing the node from continuously attempting to join the cluster.
Resolution:
A) Observe other errors present in the PROM log to determine the cause of this error.
B) Clear this error as suggested in the error text. You may also turn off this checking mechanism if it does not suit your application. To do this, type "unset max_same_fatal" at a Whack prompt.
C) Verify that a valid operating system image is installed on the node's internal disk. Reinstall the operating system if defective.
D) Replace the IDE drive.

39 - 0x6 (0)
Maximum count for errors logged (xxxx) exceeded.
Type "unset cnt_log_error" to clear this error.
This error indicates that the BIOS has detected that it has recorded too many fatal or non-fatal errors in the board serial PROM and that it should prohibit further boots until operator intervention clears this error. This is to prevent cyclic node up/down caused by a hardware or software failure. This increases the reliability of the cluster by preventing the node from continuously attempting to join the cluster.
Resolution:
A) Observe other errors present in the PROM log to determine the cause of this error.
B) Clear this error as suggested in the error text. You may also turn off this checking mechanism if it does not suit your application. To do this, type "unset max_log_error" at a Whack prompt.
C) Verify that a valid operating system image is installed on the node's internal disk. Reinstall the operating system if defective.
D) Replace the IDE drive.

39 - 0x10 (0)
Invalid boot sector.
Use "boot net install" to correct this.
The IDE disk is used for booting the operating system. This error indicates the boot sector which has been loaded from the disk does not have a valid signature. The most likely cause of this error is that a fresh IDE drive has been installed in the node and it needs to be field net installed.
Disk MBR does not have a valid partition table
You may also see the above line immediately following the fatal error. This message indicates the partition table in the boot sector (Master Boot Record) was also invalid, and that an "ide log" entry could not be written.
Resolution:
A) If no hardware has been replaced, first try cycling power on the node.
B) Perform a field IDE net install on the drive, or use "boot net install".
C) Use the "ide smart status" command to acquire the drive SMART status. Replace the IDE drive if a failure is reported.
D) Replace the IDE cable.
E) Replace the IDE drive.
F) Replace the node motherboard.
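The Code 39 entries for sub-codes 0x1 through 0x6 all follow one pattern: a counter variable that is unset to clear the error, and a limit variable that is unset to turn off the check. The lookup below is a hedged sketch of that pattern. The variable names come from the error text in this table; the dictionary and function names are hypothetical, and the sketch only builds the command strings to type at a Whack prompt.

```python
# Sub-code -> (counter variable, limit variable), as named in the Code 39 entries.
CODE39_VARIABLES = {
    0x1: ("cnt_no_os_boot", "max_no_os_boot"),
    0x2: ("cnt_no_cluster", "max_no_cluster"),
    0x3: ("cnt_os_panic", "max_os_panic"),
    0x4: ("cnt_no_shutdown", "max_no_shutdown"),
    0x5: ("cnt_same_fatal", "max_same_fatal"),
    0x6: ("cnt_log_error", "max_log_error"),
}


def code39_whack_commands(subcode: int):
    """Return the Whack commands for a Code 39 sub-code, or None if not a counter error."""
    if subcode not in CODE39_VARIABLES:
        return None  # for example 0x10 (invalid boot sector) has no counter to clear
    counter, limit = CODE39_VARIABLES[subcode]
    return {
        "clear_error": f"unset {counter}",    # clears the error once
        "disable_check": f"unset {limit}",    # turns off the checking mechanism
    }


print(code39_whack_commands(0x3))
```

Clearing the counter does not address the underlying cause, so the remaining resolution steps for the sub-code still apply.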
41 - 0x0 (0)
The computed CPU speed is lower than the expected minimum supported in a 3PAR node. Most likely this is due to a hardware failure. Since the CPU speed computation depends upon access to the RTC, it is most likely there is a communication problem with the SuperIO containing the RTC.
If you need to run with a reduced CPU speed, enter the following command on the node:
Whack> set perm cpu_slow_ok
See Code 41, sub-code 0x1 for resolution information.

41 - 0x1 (0)
After the CPU speed is computed, the memory bus (FSB) speed is computed. It is computed based on the CPU speed and the bus speed multiplier as reported by the CPU.
If you need to run with a reduced memory bus speed, enter the following command on the node:
Whack> set perm mem_slow_ok
Resolution:
A) Cycle power on the node.
B) Replace the bootstrap CPU.
C) Replace the node motherboard.

42 - 0x1 (0)
Failed CP PROM ww.xx.yy.zz read
Centerpanel access using Manufacturing PROM: FAILURE
The centerpanel is used by the 3PAR cluster for the nodes to communicate. The CM links and backup serial links serve this purpose. There is also a diagnostic I2C bus present in the centerpanel which is used by nodes to diagnose error conditions and reset other nodes in the cluster. As part of the manufacturing process, this bus is tested by accessing the serial PROM which is present on a manufacturing centerpanel. If this test fails, it is likely the node will have a problem accessing the centerpanel I2C bus.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

42 - 0x2 (0)
Failed CP PROM ww.xx.yy.zz write
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.

42 - 0x3 (0)
CP PROM node data does not match what is written: Addr xxxx
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.

42 - 0x4 (0)
CP PROM pattern data read is incorrect
Addr xx Expected yy Read zz ...
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.

42 - 0x5 (0)
Failed I2C access to board register x.y.z
Centerpanel access using Manufacturing PROM: FAILURE
See Code 42, sub-code 0x1 for resolution information.

42 - 0x6 (0)
Failed I2C access to board register x.y.z
Centerpanel access using Manufacturing PROM: FAILURE
Titan specific. The BIOS performs a read accessibility check of extra I2C addresses while testing CP PROM 0.a0 and fails with a fatal error message if an address is not accessible. Note that if the failure is not related to CP PROM 0.a0, the "CP PROM at 0.a0:" message is not printed and only "Failed I2C access to board register x.yy" appears.
  • 114. DescriptionSubcode See Code 42, sub-code 0x1 for resolution information. Voltage ID indicates CPUxx present but TEMP sensor disagrees.43 - 0x0 (data) This error indicates either a CPU failure or onboard sensors are reading incorrect values for the specified CPU. The VID (voltage ID sense) lines are attached to each physical CPU and used to indicate to the VRMs (voltage regulator modules) the voltage level expected by the CPU. These lines are also connected to the LM87 which use this to determine the correct voltage which should be delivered to the CPU. The TEMP (temperature) sensor is connected to an on-die CPU thermal diode. If its reading is out of acceptable range, the BIOS determines the sensor is not reliably connected to a CPU, or a CPU is not present. Bits 0-1 of data indicate CPU non-presence as determined by the VID sense lines. Bits 8-9 of data indicate CPU non-presence as determined by connection to the thermal diode. Data Value Failure ---------- ---------------------------------------------------- 1 CPU0 does not respond to startup 2 CPU1 does not respond to startup 10 CPU0 thermal sensor/voltage ID indicates not present 20 CPU1 thermal sensor/voltage ID indicates not present Resolution: A) Cycle power on the node. B) Remove physical CPU from specific socket and test with no CPU present. B1) If error persists, replace node motherboard. B2) If error clears, replace CPU. C) Replace the node motherboard. Voltage ID indicates CPUxx not present but TEMP sensor disagrees.43 - 0x1 (data) See Code 43, sub-code 0x0 for resolution information. Physical CPUxx active, but thermal sensor disagrees43 - 0x2 (data) Bits 0-1 of data indicate CPU non-presence as determined by the running CPU APIC addresses. Bits 8-9 of data indicate CPU non-presence as determined by connection to the thermal diode. See Code 43, sub-code 0x0 for resolution information. 
Physical CPUxx not active, but thermal sensor disagrees43 - 0x3 (data) Bits 0-1 of data indicate CPU non-presence as determined by the running CPU APIC addresses. Bits 8-9 of data indicate CPU non-presence as determined by connection to the thermal diode. See Code 43, sub-code 0x0 for resolution information. Not all hyper-threads started on physical CPUxx43 - 0x4 (data) Bits 0-1 of data indicate logical CPU non-presence in physical CPU0 as determined by the running CPU APIC addresses. Bits 2-3 of data indicate logical CPU non-presence in physical CPU1 as determined by the running CPU APIC addresses. See Code 43, sub-code 0x0 for resolution information. Not all cores started on physical CPUxx43 - 0x5 (data) Bits 0-3 of data indicate logical CPU non-presence in physical CPU0 as determined by the running CPU APIC addresses. Bits 4-7 of data indicate logical CPU non-presence in physical CPU1 as determined by the running CPU APIC addresses. 114 CBIOS Error Codes
  • 115. DescriptionSubcode See Code 43, sub-code 0x0 for resolution information. CMIC heatsink disconnected: yy43 - 0x10 (xx) The GPIOs reporting proper connection of the CMIC (North Bridge) heatsink report a loss of connection. This is a board failure which requires a lab technician to reattach the heatsink. Resolution: A) Cycle power on the node. B) Replace the node motherboard. The VSC055 reports tachometer inputs for both node fans, 0 and 1. This is a dual node fan failure which requires both of the fans to be replaced. The system may overheat. 44 - 0x01 (xx) Resolution: A) Cycle power on both nodes. B) Replace both node fans. This error code indicates an error while running the QLogic45 - 0x00 (data) iSCSI POST. Failed Test (bits 8-15), Slot (bits 4-7) and Port (bits 0-3) are packed into data. Failed Test is one of the following: <QLogic internal card diagnostics> 2 Test Local RAM Size 3 Test Local RAM R/W 4 Test RISC RAM 5 Test NVRAM 6 Test Flash ROM 7 Test Network Internal Loopback 8 Test Network External Loopback 9 Test DMA Transfer 240 (0xf0) Test NOP 241 (0xf1) Test Registers 242 (0xf2) Test DMA Transfer to CPU memory 243 (0xf3) Test DMA Transfer to Cluster memory 244 (0xf4) Card Initialization Resolution: A) Cycle power on failing node. B) Re-seat failing iSCSI card C) Replace failing iSCSI card This error code indicates CBIOS does not recognize the chipset installed on the node's motherboard. 46 - 0x1 (0) Resolution: A) Cycle power on the node. B) Replace the node motherboard. No CPU SDRAM is available.47 - 0x00 (0) This error indicates that CBIOS has no working CPU memory available for it to continue with POST and ultimately boot the node. Resolution: A) Cycle power on the node. B) Replace CPU DIMMs. InForm OS Failed Error Codes and Resolution 115
  • 116. DescriptionSubcode C) Replace the node motherboard. This error code indicates CBIOS does not recognize the board type for the chipset installed on the node's motherboard. 48 - 0x0 (XXXXXXXX) Resolution: A) Cycle power on the node. B) Replace the node motherboard. Failed to find USB device handle49 - 0x1 (data) or Inquiry Request Failed rc = xxxx The USB controller failed to perform a self test. A data value of 0 indicates the BIOS failed to find a USB handle. Resolution: A) If a USB Flash drive is not expected to be present, set the "usb_nodevice_ok" NVRAM variable to override BIOS requiring a USB Flash drive be found. B) Replace the USB Flash drive. C) Replace the node motherboard. There was a USB failure in data requested by the operating system bootstrap. It is possible that data on the disk has become corrupt to the point the operating system will not successfully load. 49 - 0x4 (0) Resolution: Reinstall the operating system bootstrap with the "boot net install" command. USB reported a failure in the read verify command.49 - 0x6 (0) See Code 49, sub-code 0x1 for resolution information. USB reported a failure in the write verify command.49 - 0x7 (0) See Code 49, sub-code 0x1 for resolution information. Invalid control cache setup.50 - 0x1 (0) Resolution: Contact 3PAR technical support. Incompatible FB-DIMM installed.50 - 0x2 (<DIMM>) Resolution: Replace DIMM. Electrically isolated FB-DIMM.50 - 0x3 (<DIMM>) Resolution: A) Replace DIMM. B) Replace node. Incompatible module installed.50 - 0x4 (<DIMM>) Resolution: Replace DIMM. Mismatched DIMM pair.50 - 0x5 (<DIMM>) Resolution: Replace DIMM. Odd rank disabled.50 - 0x6 (<DIMM>) Resolution: Replace DIMM. FB-DIMM branch failed to train and lockstep mode has been disabled.50 - 0x7 (0) Resolution: A) Replace all DIMMs. B) Replace node. 116 CBIOS Error Codes
  • 117. DescriptionSubcode FB-DIMM northbound merge has been disabled.50 - 0x9 (<DIMM>) Resolution: Replace DIMM. FB-DIMM disabled due to lockstep skew.50 - 0xa (<DIMM>) Resolution: Replace DIMM. FB-DIMM rank disabled due to Built-in Self Test failure.50 - 0xb (<DIMM>) Resolution: Replace DIMM. Memory interleave range limit invalid.50 - 0xe (0) Resolution: Contact 3PAR technical support. High temp disabled.50 - 0xf (0) Resolution: Contact 3PAR technical support. Logical rank with CECC detected.50 - 0x10 (<DIMM>) Resolution: Replace DIMM. Sub-optimal FB-DIMM channel population detected.50 - 0x12 (0) Resolution: Contact 3PAR technical support. Mismatched AMB pair.50 - 0x13 (0) Resolution: Replace all DIMMs. FB-DIMM branch disabled.50 - 0x14 (0) Resolution: A) Replace all DIMMs. B) Replace node. FB-DIMM thermal throttling has been disabled.50 - 0x15 (0) Resolution: Contact 3PAR technical support. Last FB-DIMM AMB has been disabled.50 - 0x16 (0) Resolution: Contact 3PAR technical support. The FB-DIMM memory branches do not match in size.50 - 0x17 (0) Resolution: Contact 3PAR technical support. The BIST (Built-in Self Test) in Harrier reported either a BAD value or a different value from what was recorded in the node PROM during MFG board assembly. (Data = Harrier BIST result) 51 - 0x1 (Data) Resolution: Replace the node. Note for OPS that Harrier BIST failed, and that the PROM should not be wiped. The CPU was unable to communicate with the FPGA.53 - 0x0 (xxxx) Resolution: Replace node motherboard. A CPU VRM is missing.54 - 0x0 (xxyy) or A CPU VRM is not providing power. Resolution: A) Replace CPU VRM yy. B) Replace node motherboard. UEFI failed to boot, failed during PEI due to assert.55 - 0xzzzzzzzz (yyy) Look-up zzzzzzzz in doc/udk_hash_index.csv of udk2010_up3 tree to determine filename of assert. yyy specifies line number (in hex). InForm OS Failed Error Codes and Resolution 117
Resolution: Contact 3PAR technical support.

56 - 0xaabb (yyy)
UEFI failed to boot, failed during Intel MRC memory training code.
aa specifies the major code, bb specifies the minor code.
Major Code Table:
0xE8 ERR_NO_MEMORY
0xE9 ERR_LT_LOCK
0xEA ERR_DDR_INIT
0xEB ERR_MEM_TEST
0xEC ERR_VENDOR_SPECIFIC
0xED ERR_DIMM_COMPAT
0xEE ERR_MRC_COMPATIBILITY
0xEF ERR_MRC_STRUCT
Resolution: Contact 3PAR technical support.

57 - 0xzzzzzzzz (yyy)
UEFI failed to boot, failed during DXE due to assert.
Look up zzzzzzzz in doc/udk_hash_index.csv of udk2010_up3 tree to determine filename of assert. yyy specifies line number (in hex).
Resolution: Contact 3PAR technical support.

CBIOS Degraded Error Codes and Resolution

This table explains the codes, sub-codes, error code descriptions, and problem resolutions for the CBIOS error codes.

Degraded Alerts

Code    Description

2 - 0x1 (0)
The Real-Time Clock (RTC) is a function of the SuperIO which provides a battery-backed system clock and a small quantity of battery-backed Non-Volatile RAM for system configuration flags. This error indicates the RTC memory has become corrupt, possibly due to a dead battery or battery removal when no mainline power was available.
Resolution:
A) Power down, wait 30 seconds, power up. This error should self-correct (likely with a loss of current date/time and other NVRAM contents). Set the date and time using the Whack "rtc date" command.
B) Replace the RTC battery, located near the SuperIO ASIC.
C) Use the Whack command "rtc date" to set the RTC date and time.
D) Replace the node motherboard.

2 - 0x2 (0)
RTC_BATTERY_LOW
RTC / NVRAM Battery Failure - Replace battery.
The RTC / NVRAM battery was found to have a low voltage by the built-in monitoring circuit of the Real Time Clock (RTC). The RTC battery provides power to the RTC clock function of the SuperIO while the board is not drawing mainline supply power. Over time, this battery's available power will decay (rated for over five years of normal operation).
Resolution:
A) Replace the RTC lithium cell battery on the node motherboard.
B) Replace the node motherboard.

2 - 0x3 (0)
RTC_INVALID_TIME
The current RTC date/time is invalid. Enter the correct date/time or press Tab to acquire it from the network.
If the time has not yet been set, or becomes invalid due to loss of battery power, the BIOS will report this error and wait for the user to update the time.
Resolution:
A) Enter the correct time.
B) Press TAB to acquire the time from the network.
C) Press ^C to abort prompt and resume boot.

2 - 0x4 (0)
RTC_BATTERY_LOW
RTC / NVRAM Battery Failure - Replace battery.
The RTC / NVRAM battery was found to have a low voltage by the built-in monitoring circuit of the RTC (TOD clock).
Resolution:
A) Replace the lithium-ion cell battery on the node.
B) Replace the node motherboard.

10 - 0x1e (0)
The indicated PLX switch chip has incorrect hardware configuration strappings.
Resolution: Replace the node motherboard.

10 - 0x1f (YYYYYYxx)
This error indicates that the device found is not running at the correct PCIe link width.
If the "xx" portion of the data field is non-zero, it indicates a problem with a particular PCI slot. The specific codes for "xx" are as follows:
30 is PCI Slot 0
31 is PCI Slot 1
32 is PCI Slot 2
33 is PCI Slot 3
34 is PCI Slot 4
35 is PCI Slot 5
36 is PCI Slot 6
37 is PCI Slot 7
38 is PCI Slot 8
To ignore this error, enter Whack by pressing ^W and entering:
Whack> set perm pci_speed_any
Resolution:
A) Replace indicated card (if "xx" is non-zero).
B) Replace node motherboard.

10 - 0x20 (YYYYYYxx)
This error indicates that the device found is not running at the correct PCIe link speed.
See Code 10, sub-code 0x1f for resolution information.

10 - 0x21 (xxx)
This error indicates that a PCI device was found in a slot which was expected to be empty. The likely cause of this failure is an HBA which is not fully seated. If this is an expected failure, you can set "pci_missing_ok" to override this check.
Resolution:
A) Reseat or replace the indicated HBA.
B) Replace node motherboard.
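As a rough illustration of the "xx" decoding used by Code 10, sub-codes 0x1f and 0x20, the following Python sketch maps the low byte of the data field to a PCI slot number. The function name pci_slot_from_data is hypothetical and is not part of Whack or the BIOS; the sketch assumes the data field has already been read as an integer and that the slot codes 30 through 38 are hexadecimal.

```python
def pci_slot_from_data(data):
    """Return the PCI slot implied by the low byte ("xx") of the data field.

    Returns None when "xx" is zero, meaning no particular slot is implicated.
    Hypothetical helper for illustration only.
    """
    xx = data & 0xFF
    if xx == 0:
        return None
    if 0x30 <= xx <= 0x38:
        # 0x30 is PCI Slot 0, 0x31 is PCI Slot 1, ..., 0x38 is PCI Slot 8
        return xx - 0x30
    raise ValueError("unrecognized slot code: 0x%02x" % xx)
```

For example, a data value ending in 35 implicates PCI Slot 5, so the card in that slot would be the first replacement candidate.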
10 - 0x22 (xxx)
This error indicates that no PCI device was found in a slot which was expected to be populated (HBA present). The likely cause of this failure is an HBA which has failed. If this is an expected failure, you can set "pci_missing_ok" to override this check.
Resolution:
A) Reseat or replace the indicated HBA.
B) Replace node motherboard.

10 - 0x23 (data)
This error indicates that during a previous PCI scan, the CPU hung. The most probable cause of this error is a defective HBA. The data field provides several details about the suspect device. The low byte indicates which PCI slot, if known. Value 0x30 corresponds to PCI Slot 0, 0x31 is PCI Slot 1, ..., 0x38 is PCI Slot 8. Byte 2 and byte 1 correspond to the PCI bus.dev.func. Byte 3 indicates whether the failure occurred during a PCI error scan, and whether this is a repeat failure.
Decode table for data:
bits 0..7 PCI Slot (0x00=MB, 0x30..0x38=PCI Slot 0..8)
bits 8..10 PCI func
bits 11..15 PCI dev
bits 16..23 PCI bus
bits 24..28 Reserved (0)
bit 29 Repeat flag (1=repeat -- fatal error)
bit 30 Hang during (0=PCI scan, 1=PCI error scan)
bit 31 Reserved (1)
Example (data=c00a0a35): The 0x35 value implicates PCI Slot 5. The 0a0a value is bus.dev.func 0a.01.02. The c0 value indicates the hang occurred during a PCI error scan.
Example (data=a0090831): The 0x31 value implicates PCI Slot 1. The 0908 value is bus.dev.func 09.01.00. The a0 value indicates a repeated hang during the PCI scan.
Resolution:
A) Replace HBA if PCI Slot is indicated.
B) Convert to PCI bus.dev.func and match with the suspect PCI device from previous BIOS messages. If this is an onboard device, replace the node motherboard.

17 - 0x10 (data)
A disk SMART threshold was triggered. This would indicate an imminent boot drive failure.
Resolution: Replace the IDE or SATA boot drive.

17 - 0x16 (0)
No IDE device was found.
Resolution:
A) Install or replace the IDE or SATA drive.
B) Replace the node motherboard.

17 - 0x19 (0)
DMA xfer error code xxxx
The drive DMA test failed due to a timeout. Although each sequential DMA read operation is succeeding, the total test time was exceeded. The likely cause of this failure is a drive which is having to perform a large number of relocations due to failed sectors, or a drive interface failure which only shows up under stress.
Resolution: Replace the IDE or SATA boot drive.

17 - 0x40 (xxxxxxxx)
Drive returned an error status after command execution.
xxxxxxxx, AHCI Port Status register, for lab debug
Resolution: Replace the IDE or SATA boot drive.

17 - 0x41 (xxxxxxxx)
Drive returned an error status after command execution.
xxxxxxxx, AHCI Port Error register, for lab debug
Resolution: Replace the IDE or SATA boot drive.

17 - 0x42 (xxxxxxxx)
Drive returned an error status after command execution.
xxxxxxxx, AHCI Port TFD register, for lab debug
Resolution: Replace the IDE or SATA boot drive.

18 - zzzz (0)
*** Real-mode BIOS interrupt: xxxx (error: yyyy)
This error most commonly indicates a bad or missing boot area of the USB disk. Customer Service node-disks or node spares (FRUs) might not be shipped with an operating system. Attempting to boot from one of these disks without first installing the system software might produce this error message. From the Whack prompt, use the "boot net install" command to install the system software.
In order for Linux to boot, LILO must load the kernel image. It needs assistance from the BIOS in order to perform this task. Linux also acquires some information from the BIOS using 16-bit BIOS interrupts. CBIOS automatically accepts and emulates traditional 16-bit BIOS interrupts to support these methods. If LILO or Linux triggers an interrupt which is not supported by CBIOS, this possibly fatal error will result. There are many obsolete BIOS facilities which are not supported by CBIOS. In some cases, the system boot may be able to continue after this error. The sub-code and minor code indicate the specific BIOS interrupt called and the eax register parameter value. This information may be useful to Engineering.
Resolution:
A) Reboot. Attempt to reproduce the problem.
B) Reinstall system software on the disk. This may require a "boot net install" in order to reinstall the operating system.
C) There may be a bug in the OS you are using or it has been misconfigured. Confirm this version of the OS has been verified to work on a 3PAR node board. Or, temporarily swap system disks with a known good system disk.
D) Replace the boot drive and reinstall the system software.
E) Replace the node motherboard.
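For reference when working with Code 10, sub-code 0x23, the bit layout of its data field can be unpacked mechanically. The Python sketch below is a minimal illustration, not BIOS or Whack code: the function name decode_pci_hang and the returned dictionary keys are hypothetical. It assumes the PCI device number occupies bits 11..15, which is what the two worked examples for that sub-code (c00a0a35 and a0090831) imply.

```python
def decode_pci_hang(data):
    """Split a Code 10, sub-code 0x23 data value into its fields.

    Hypothetical helper for illustration; field names are not from the BIOS.
    """
    low = data & 0xFF
    # 0x30..0x38 map to PCI Slot 0..8; anything else (such as 0x00 for the
    # motherboard) is reported as "no slot" here.
    slot = low - 0x30 if 0x30 <= low <= 0x38 else None
    return {
        "slot": slot,
        "func": (data >> 8) & 0x7,     # bits 8..10
        "dev": (data >> 11) & 0x1F,    # bits 11..15 (assumed from the examples)
        "bus": (data >> 16) & 0xFF,    # bits 16..23
        "repeat": bool(data & (1 << 29)),             # 1 = repeat failure
        "during_error_scan": bool(data & (1 << 30)),  # 0 = PCI scan, 1 = PCI error scan
    }
```

Applied to the documented example c00a0a35, the sketch yields slot 5, bus 0x0a, device 1, function 2, with the hang occurring during a PCI error scan.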
23 - 0x0 (0)
CRC mismatch for failsafe CBIOS
Upon startup, CBIOS computes a strong CRC over all executable code and data stored in the flash. This guards against flash corruption and helps ensure reliable system initialization and testing. This specific sub-code indicates that a CRC error was detected in the failsafe component of CBIOS. The majority of the failsafe is only executed if corruption is detected in the main CBIOS.
Resolution:
A) Try pressing ^C to resume. Perform a flash update as soon as possible. If flash updating under Linux, make sure to specify the 'failsafe' option to update the failsafe area as well.
B) If the flash update is successful, but you still get a CRC error, verify that your flash image is intact. The Linux flash utility does this automatically using the same strong CRC algorithm as the BIOS uses.
C) Replace the node motherboard.

23 - 0x1 (0)
Invalid entry point for full CBIOS
Boot with clustering disabled and update flash immediately!
Prior to starting up the non-failsafe (full diagnostic) CBIOS image, the failsafe CBIOS performs some consistency checks over the image. This error indicates corruption was detected in the entry point to the main routine of the full CBIOS. If you have recently installed a new CBIOS which is larger than the previous one, it is possible to get this error because the failsafe BIOS present cannot properly verify the larger BIOS.
Resolution:
A) Try pressing ^C to resume. Perform a flash update as soon as possible. Boot with clustering disabled by typing "tpd nokmod" at the LILO prompt. Once the node has booted, log in as root and use the flash command. Example:
# flash /opt/tpd/bios/bios-1.9.4
Upon completion of the flash update, reboot and observe console messages to ensure the CRC error no longer occurs.
B) If the flash update is successful, but you still get this error, verify that your flash image is intact. The Linux flash utility does this automatically using the same strong CRC algorithm as the BIOS uses.
C) Replace the node motherboard.

25 - 0x2 (0)
Ethernet MAC xx:xx:xx:xx:xx:xx mismatches PROM: yy:yy:yy:yy:yy:yy
The "prom mac" command may fix this.
This error indicates the MAC address stored in the onboard Ethernet controller's PROM does not match that which can be computed from the board revision and serial number stored in the node's PROM. This mismatch suggests that one or the other PROM may contain corrupt contents. If the Ethernet MAC address was purposely set to an address (see "prom mac" command), then this check may be overridden by setting the NVRAM "oddmac" flag. Example:
Whack> set perm oddmac
Resolution:
A) Look for a prior message indicating an invalid board type or check the banner to ensure the board type and serial number are correct for this node. If either is not correct, use the 'prom edit' command to repair the corruption.
B) Use the "prom mac" command to reprogram the MAC address in the Ethernet controller's PROM.
C) Replace the node motherboard.

26 - 0x1 (ethdev)
eth0 device self test: FAIL All tests: xxxx (timeout)
During initialization, CBIOS has the Ethernet controller perform an internal test to verify correct operation. If the Ethernet controller does not respond within a reasonable amount of time, this error will be displayed. "ethdev" indicates the PCI Slot in which the failed Ethernet device is located. This is an ASCII value, so 0x30 indicates PCI slot 0. If the Ethernet device is located on the node motherboard, then ethdev will have a value of 0x00.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

26 - 0x2 (ethdev)
eth0 device self test: FAIL xxxx yyyy
If the Ethernet controller fails its internal test, this error will be displayed. Since this is an internal test, it is likely the Ethernet controller itself which has failed.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

26 - 0x3 (0)
No Ethernet devices available for loopback test
This error indicates that no Ethernet devices could be found or initialized on the node. This is possibly the result of a hardware failure.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.
26 - 0x4 (0)
No loopback connections were found. An external loopback plug is required if this node has only one Ethernet port. A crossover cable is required if this node has more than a single Ethernet port.
Resolution:
A) Make sure the Ethernet loopback plug is in the Ethernet connector (you should see link status lights illuminated). In the case of a node having two Ethernet ports, make sure a crossover cable is connected between the Ethernet ports.
B) Cycle power on the node.
C) Replace the node motherboard.

26 - 0x5 (slotid)
eth2 loopback PHY internal: FAIL
This error indicates that the internal loopback of the PHY did not correctly loop back packets. If the device being tested is onboard the node (82559ER or 82551ER), then this is a failure. Some plug-in PCI boards (such as 82557) do not fully support PHY loopback. Those devices will cause the following warning:
eth2 loopback PHY internal: Unavailable
No error stop will occur in the case of a PHY not supporting internal loopback.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

26 - 0x6 (slotid)
eth0 sends to eth1 but cannot receive from it
This is an unusual error in that one Ethernet device is able to reliably receive packets from the other, but the opposite is not true.
Resolution:
A) Run the test again. If the nodes are attached to a hub, the failure may be due to another Ethernet node flooding the network.
B) Cycle power on the node.
C) Ensure that there is no switch between the Ethernet ports. A switch may prevent the test from functioning properly if the MAC address of an interface is in use elsewhere or the switch is really an IP router.

26 - 0x7 (slotid)
eth0 loopback wwwww: FAIL - receive timeout (xx seconds)
This error indicates the Ethernet device did not successfully receive the loopback pattern sent to test the Ethernet device's transceiver. The failure to receive a loopback pattern usually means the Ethernet device has failed. "ethdev" indicates the PCI Slot in which the failed Ethernet device is located. This is an ASCII value, so 0x30 indicates PCI slot 0. If the Ethernet device is located on the node motherboard, then ethdev will have a value of 0x00.
The following are normal test results that you would expect to see. If this error occurs, then one of the following has not happened:
eth0 loopback All zeros: PASS
eth0 loopback All ones: PASS
eth0 loopback Walking ones: PASS
eth0 loopback Walking zeros: PASS
eth0 loopback Random pattern: PASS
This error indicates that within 100 packets successfully transmitted, there were no packets successfully received.
Resolution:
A) Cycle power on the node.
B) Unplug the network cable and run the test again. If the node is attached to a hub, the failure may be due to another Ethernet node flooding the network. This is not very likely.
C) If the Ethernet device is located in a PCI slot, replace the card.
D) Replace the node motherboard.

26 - 0x8 (slotid)
eth0 loopback wwwww: Packet transmit failed
This error indicates that the Ethernet device was not able to successfully transmit packets. This is a serious failure, since the Ethernet code does not fail to transmit unless the Ethernet device failed to initialize.
Resolution:
A) Use "eth reset" to reset the Ethernet device.
B) Cycle power on the node.
C) Replace the node motherboard if the failed Ethernet device is on the node.

26 - 0x9 (slotid)
eth0 loopback wwwww: FAIL - miscompare
stuck high=xxxx stuck low=yyyy toggle=zzzz
This error is displayed if one of the Ethernet tests detects a mismatch between the packet sent and the data received. It also includes a diagnostic line which shows in what way the data is different.
Resolution:
A) Use "eth reset" to reset the Ethernet device.
B) Cycle power on the node.
C) Replace the node motherboard if the failed Ethernet device is on the node.

26 - 0xa (slotid)
ethxxx device registers: FAIL
Onboard Ethernet device did not read valid config from EEPROM. A power cycle might clear this failure if this is a new node.
This error indicates the Ethernet device failed to initialize properly, probably because it read invalid content from the attached EEPROM device. If this is an onboard GigE on the 5000P chipset (Tx00, Fx00, Vx00, Gx00), then it is likely this is the first time the node has ever been powered on. Once the BIOS writes a configuration to the SPI EEPROM attached to the GigE, it is necessary for the board to be power cycled before the GigE device is usable. If the board is not new and you see this failure, then it's likely a component on the node motherboard has failed.
Resolution:
A) Power cycle the node.
B) Replace the node motherboard if the failed Ethernet device is on the node.

27 - 0x0 (#)
Each node board has multiple temperature and voltage sensors and fan RPM sensors which monitor the environment to ensure the temperature, voltage, and fan RPM are within operating tolerances. This directly results in increased reliability of the product.
If a temperature or a voltage falls outside a programmed tolerance level, CBIOS will alert the user to this condition. The sub-code displayed reflects the type of (the first) error detected. The data value is a count of the number of temperature/voltage/fan problems detected. A sub-code value of 0x0 indicates a fan RPM problem. A sub-code value of 0x1 indicates a temperature problem. A sub-code value of 0x2 indicates a voltage problem. This particular sub-code indicates a fan RPM problem.
Resolution:
A) Cycle power on the node. If it is a temperature related problem, verify the system is getting adequate ventilation.
B) Verify the limit settings are reasonable. Use the Whack "i2c env" command. The Whack "i2c env defaults" command resets all defaults.
C) Verify both power supply fans are spinning freely and that the supply amber failure light is not illuminated. If only a single supply is installed, make sure the second slot either has a fan or is covered.
D) Replace the power supply.
E) If it's CPU temperature, verify the heatsink is conducting heat well.
F) If it's CPU voltage, try swapping out the CPU voltage regulators.
G) Replace the node motherboard.

27 - 0x1 (#)
This sub-code indicates a programmed temperature limit has been exceeded.
See Code 27, sub-code 0x0 for resolution information.

27 - 0x2 (#)
This sub-code indicates a programmed voltage limit has been exceeded.
See Code 27, sub-code 0x0 for resolution information.

27 - 0x3 (0)
This sub-code indicates a sensor interrupt test failed.
See Code 27, sub-code 0x0 for resolution information.

28 - 0x3 (mm)
This error indicates the BIOS detected the SDRAM DIMMs in the cluster memory bank pair are of a different type. One DIMM number of the mismatched pair will be logged in the data field of the Fatal Error.
Resolution: Ensure both DIMMs in the pair are identical. Note that two DIMMs may have the same capacity but have a different number of rows, columns, or banks. The DIMM configuration must exactly match. If the DIMMs have similar markings and capacity, they are probably identical. See Code 28, sub-code 0x1 for more resolution information.

31 - 0x0 (0)
FAIL (high)
Port (6) Bit (4) wrote 0 (0x1)
Port (7) Bit (4) read 1, expected 0 (0x3)
The Vitesse VSC055 2 Wire Backplane Controller chip controls interfaces to the Centerplane, LEDs, Power Supplies, Nickel battery, and PCI slots. It is connected to the I2C bus. In normal 2, 4, or 8 node centerplanes, the chip will get its ports initialized as inputs or outputs and start monitoring peripheral systems. No tests available. When connected to a Manufacturing Centerplane, it will have selected pins routed to other pins for loopback testing. See the Manufacturing Centerplane Specification for details. During this test, proper VSC operation will be confirmed.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

31 - 0x1 (0)
Failed I2C VSC055 1.ce.yy write zzzz
During initialization, the VSC055 registers are programmed for proper system operation. This is done over the I2C bus. If an I2C operation fails during VSC055 initialization, this error will result.
Resolution:
A) Cycle power on the node.
B) Replace the node motherboard.

31 - 0x3 (0)
FPGA Interrupt Test failed.
Resolution:
A) Cycle power on the node.
B) Replace the node.

31 - 0x4 (0)
NEMOE Loopback Test failed.
Resolution:
A) Cycle power on the node.
B) Replace the node.

31 - 0x5 (0)
During the "Board GPIO Test", the FPGA ID is not what it expects it to be.
Resolution:
A) Cycle power on the node.
B) Replace the node.

31 - 0x6 (0)
During the "Board GPIO Test", the FPGA Revision is not what it expects it to be.
Resolution:
A) Cycle power on the node.
B) Replace the node.

31 - 0x7 (0)
Titan specific. During the "Manufacturing Centerpanel GPIO Test", one or more tests have failed depending upon the output.
A) If failed during 'Testing Expanders (o/p) <--> FPGA (i/p) connections:'
For example,
FAIL (low) Port (76) Bit (1) wrote (0x00) Port (302) Bit (4) read 0xff, expected (0xef)
1) Program the I2C expander using the following command:
Whack> cb i2c 9.76.3 0
Here "76" is the reported port number. "3" is the config register offset for the expander. "0" makes all expander bits outputs.
2) Set the bit in the I2C expander.
Whack> cb i2c 9.76.1 2
Here "1" is the rdwr register offset for the expander. "2" is the reported bit 1 (1 << "1") in the expander.
3) Read a byte from the FPGA offset.
Whack> db fpga 302 1
Here "302" is the reported FPGA offset. Confirm that bit "4" in the read value is set.
Repeat steps 2) and 3) by writing 0 to I2C expander 9.76.1 and checking that bit "4" in FPGA offset 0x302 is cleared.
B) If failed during 'Testing FPGA (o/p) <--> Expanders (i/p) connections:'
For example,
FAIL (low) Port (305) Bit (4) wrote (0x00) Port (7e) Bit (7) read 0x86, expected (0x06)
1) Program the I2C expander using the following command:
Whack> cb i2c 9.7e.3 ff
Here "7e" is the reported port number. "3" is the config register offset for the expander. "ff" makes all expander bits inputs.
2) Write a byte to the FPGA offset.
Whack> db fpga 305 10
Here "305" is the reported FPGA offset. Writing 0x10 will set bit "4" in that offset.
3) Read a byte from the I2C expander.
Whack> db i2c 9.7e.0 1
Here "7e" is the reported port number. "0" is the read register offset for the expander. Confirm that bit "7" in the read value is set.
Repeat steps 2) and 3) by writing 0 to the FPGA offset and checking that bit "7" in I2C Expander 9.7e.0 is cleared.
C) For all other failure cases refer to Section # 18.2 "Manufacturing Centerplane GPIO Test Diagnostics" of CBIOS user guide at http://engweb/twiki/bin/view/Main/TitanMfgCpFpgaGpioTestDiag

38 - 0x0 (data)
Power Supply xx indicates invalid battery configuration: y batteries
Verify battery connection and individual battery units.
The maximum count of batteries in a string which are supported by software is 3. Any greater number will result in this non-fatal error. The data value may be decoded to determine which power supply and the battery count. The high 8 bits are a bitmask of the power supply. The lower 16 bits are the number of batteries counted. Thus, a data value of 100000c indicates PS1 had a battery count of 12. A data value of 4 indicates PS0 had a battery count of 4.
Resolution:
A) Verify no more than 3 batteries in a string are connected to any one power supply.
B) Cycle power on the node.
C) Remove batteries one at a time to determine if there is a faulty connection or battery. Replace the faulty cable or battery.
D) Replace the power supply.
E) Replace the node motherboard.

38 - 0x1 (0)
RTC / NVRAM Battery Failure - Replace battery.
The RTC / NVRAM battery was found to have a low voltage by the built-in monitoring circuit of the RTC (TOD clock).
Resolution:
A) Replace the lithium-ion cell battery on the node.
B) Replace the node motherboard.

38 - 0x3 (data)
No batteries present on power supply xx
This error indicates no batteries were found on a node power supply. This warning may be enabled by setting "warn_nobat" in NVRAM. The data value may be decoded to determine which power supply triggered this error. The high 8 bits are a bitmask of the power supply. Thus, a data value of 0 indicates PS0 is not present. A data value of 1000000 indicates PS1 is not present.
Resolution:
A) Verify there is at least one battery connected.
B) Cycle power on the node.
C) Exchange cables and batteries.
D) Replace the power supply.
E) Replace the node motherboard.

38 - 0x4 (data)
Power supply missing: node power configuration is not redundant
This error indicates one of the two power supplies for a node is not present. This warning may be enabled by setting "warn_ps" in NVRAM. The data value may be decoded to determine which power supply triggered this error. The high 8 bits are a bitmask of the power supply. Thus, a data value of 0 indicates PS0 is not present. A data value of 1000000 indicates PS1 is not present.
Resolution:
A) Verify both power supplies are present and powered on.
B) Power off the missing supply, remove it, and reinsert it in the chassis.
C) Replace the power supply.
D) Replace the node motherboard.

38 - 0x5 (0)
Battery failure on Power Supply
This error indicates that a battery on the power supply has reported a hardware error. The status light on the back of the failed battery will be amber.
Resolution:
A) Verify both power supplies are present and powered on. Verify batteries are present and powered on.
B) Power off the failed battery, remove the cable, and reinsert it in the Power Supply. Turn it back on. If that does not reset the FAILED condition, replace the battery.
C) Replace the power supply.
D) Replace the node motherboard.

38 - 0x6 (data)
Powering off PSxx because it is on battery power. This will shut down the node until AC is restored.
This message indicates that a power supply lost input AC Power and that the BIOS powered down the node to avoid draining the battery. The data value may be decoded to determine which power supply triggered this error. The low 2 bits are a bitmask of the DC power status. Bit 0 represents power supply 0 and Bit 1 represents power supply 1. If a bit is 1, then the DC output from that power supply was good when the system shut down.
Resolution:
A) Apply AC power to the node.
B) Replace the power supply.

38 - 0x7 (data)
Power supply xx failure: Fan Bad
or
Power supply xx failure: Fan 0 Bad
or
Power supply xx failure: Fan 1 Bad
This error indicates there is a hardware problem in one of the node power supplies. One or more of the fans may have failed. The data value may be decoded to determine which power supply (and fan) triggered this error. The low 2 bits are a bitmask of the fan status for Power Supply 0. The next 2 bits are a bitmask of the fan status for Power Supply 1. Thus:
1: PS0 had a Fan0 failure
2: PS0 had a Fan1 failure
3: PS0 had a double fan failure
4: PS1 had a Fan0 failure
8: PS1 had a Fan1 failure
c: PS1 had a double fan failure
Resolution:
A) Replace the power supply.
B) Replace the node motherboard.
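The fan bitmask for Code 38, sub-code 0x7 can also be decoded programmatically, which helps when more than one bit is set. The following Python sketch is illustrative only: the function name decode_fan_failures is hypothetical, and it assumes the data value is available as an integer with the layout described above (bits 0..1 for Power Supply 0, bits 2..3 for Power Supply 1).

```python
def decode_fan_failures(data):
    """Return a list of (power_supply, fan) pairs flagged in the data value.

    Hypothetical helper for illustration; not part of the BIOS or Whack.
    """
    failures = []
    for ps in (0, 1):
        # Each power supply owns two adjacent bits: Fan0 then Fan1.
        bits = (data >> (2 * ps)) & 0x3
        for fan in (0, 1):
            if bits & (1 << fan):
                failures.append((ps, fan))
    return failures
```

A data value of c decodes to both fans on PS1, matching the "double fan failure" entry in the list above.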
38 - 0x8 (data)
Power supply xx failure: Charger Overload
This error indicates there is a hardware problem in one of the node power supplies, specifically that the charger cannot handle the battery charge current draw. If you need to override this error so the node continues, you can set "ignore_chargefail" in NVRAM. The data value may be decoded to determine which power supply triggered this error. The low 2 bits are a bitmask of the charger status for the two power supplies. Thus, a value of 1 indicates PS0 had a charger overload. A value of 2 indicates PS1 had a charger overload. A value of 3 indicates PS0 and PS1 both had a charger overload.
Resolution:
A) Check battery connection.
B) Exchange cables and batteries.
C) Replace the power supply.
D) Replace the node motherboard.

38 - 0x9 (data)
Power supply xx failure: DC Output Bad
This error indicates there is a hardware problem in one of the node power supplies. If this failure is transient, it could also be caused by turning the power supply off and then on or by a quick AC loss followed by AC being restored. If both power supplies fail simultaneously (not likely), this is a fatal error. The data value may be decoded to determine which power supply triggered this error. The low 2 bits are a bitmask of the DC Output status for the two power supplies. A value of 1 indicates PS0 had a DC Output Bad. A value of 2 indicates PS1 had a DC Output Bad.
Resolution:
A) Ensure a service operation was not taking place at the time, and that AC had not also failed.
B) Replace the power supply.
C) Replace the node motherboard.

38 - 0xa (data)
Power supply xx failure: AC Input Bad
This error indicates that AC input power is not being supplied to one or more power supplies. The likely cause is either a real AC Failure or that the power supply has been switched to the off position. In the case of an AC Failure, the power supply will be automatically shut down to preserve batteries (if "ignore_acfail" is set then the power supply will not be shut down). The lower 2 bits of the data value may be decoded to determine which power supply lost AC power. A value of 1 indicates PS0. A value of 2 indicates PS1. A value of 3 indicates both power supplies lost AC power.
Resolution:
A) Verify AC power is present and the power supply switch is turned on.
B) Check the Power Distribution Unit (PDU) breaker.
C) Replace the power supply.
D) Replace the node motherboard.

38 - 0xb (0)
**** Power Supplies mismatch ****
Power Supply 0: I2C accessible
Power Supply 1: I2C inaccessible
This error indicates one of the power supplies is a new style (I2C interface) and the other power supply is not responding using I2C, but has been detected as present. This is not a supported configuration. If you need to override this error, set "ignore_psdiff" in NVRAM.
Resolution:
A) Pull and reinsert the inaccessible power supply.
B) Check the Power Distribution Unit (PDU) breaker for the inaccessible power supply.
C) Replace the power supply.
D) Replace the node motherboard.

38 - 0xc (data)
This error indicates Power Supply 0 reported a limit was exceeded while performing the power supply status test. Each power supply has integrated monitors for temperature, voltage, and current draw. The BIOS reads these sensors as part of initialization to determine if the power supply is operating within specifications. The data value may be decoded to determine the particular cause of the limit failure. Each bit represents a unique sensor. Data values may be decoded as follows:
00000001 - Temperature
00000004 - 3.3V
00000008 - 3.3V Current
00000010 - 5V
00000020 - 5V Current
  • 130. DescriptionCode 00000040 - 12V 00000080 - 12V Current 00000100 - 24V 00000200 - 24V Current 00000400 - 48V 00000800 - 48V Current 00001000 - Bat0 48V 00002000 - Bat1 48V 00004000 - Bat2 48V 00008000 - Bat0 12V 00010000 - Undefined ... to ... 00400000 - Undefined 00800000 - Battery LED is Amber 01000000 - Battery Relay is Off 02000000 - PS LED is Amber 04000000 - Fan Fail 08000000 - DC Fail 10000000 - AC Fail 20000000 - Power Supply is Disabled 40000000 - Power Supply Switch is Off 80000000 - Low Limit exceeded (combined with bits above) Resolution: Contact 3PAR technical support. This error indicates Power Supply 1 reported a limit was exceeded while performing the power supply status test. 38 - 0xd (data) See Code 38, sub-code 0xc for resolution information. Each newer generation (Magnetek) power supply and battery has an I2C interface which allows the node to acquire power supply internal temperature, voltages, and 38 - 0xe (data) current loads. The BIOS will verify these readings are within acceptable limits as part of normal initialization. This failure code indicates a limit has been exceeded on a battery attached to a power supply on the node. The data value may be decoded to determine which power supply and battery. The lower 2 bits are a bitmask of the power supply. The upper 16 bits are a bitmask of the failing battery. Thus, a data value of 10002 indicates PS1 Bat0 has exceeded a limit. A data value of 40001 indicates PS0 Bat2 has exceeded a limit. Resolution: A) Check battery expiration date and replace as necessary. B) Power cycle the failing battery. C) Replace battery cable. I2C errors prevented completion of the power test.38 - 0xf (data) Each newer generation (Magnetek) power supply and battery has an I2C interface which allows the node to acquire power supply status. This failure codes indicates the BIOS was unable to read one of the Power Supply or battery status registers. 
The lower 2 bits of the data value may be decoded to determine which power supply failed. A value of 1 indicates PS0. A value of 2 indicates PS1. A value of 3 indicates both power supplies failed.
Resolution:
A) Power cycle the indicated power supply.
B) Replace power supply.
C) Replace all batteries attached to the power supply.
D) Replace the node motherboard.

38 - 0x10 (data)
PSwwww Batxxxx Switch Off
This failure code indicates a battery has its power switch in the off position, and is thus unable to supply backup power to the node in the case of AC failure. The data value may be decoded to determine which power supply and battery. See Code 38, sub-code 0xd for decoding information.
Resolution:
A) Turn battery on.
B) Power cycle the indicated battery.
C) Replace battery cable.
D) Replace power supply.

38 - 0x13 (data)
Turning off BBU because the node is on battery power. This will shut down the node until AC is restored.
This message indicates that all power supplies lost input AC power and that the BIOS powered down the node to avoid draining the battery. The data value provides a mask of power supplies which have AC good input but failed DC output.
Resolution:
A) Apply AC power to the node.
B) Replace the power supplies.

40 - 0x1 (0)
During CPU SMI initialization, the queue facility to send messages between the BIOS and TPD is tested. If there is a problem triggering an SMI, or some other error which causes message corruption, this error will result. This error is recoverable because the OS can still come up and function at a degraded level even if the communication between the OS and BIOS is not functioning.
Resolution:
A) View prom log to see if this is repeatable. If not, ignore a single occurrence.
B) Cycle power on the node.
C) Replace the bootstrap CPU.
D) Replace the node motherboard.

44 - 0x00 (xx)
The VSC055 reports tachometer inputs for both node fans, 0 and 1. This is a single node fan failure which requires the fan to be replaced.
Resolution:
A) Cycle power on the node.
B) Replace the node fan.

49 - 0x17 (0)
No USB device was found.
Resolution: Install or replace the USB Flash drive.

51 - 0x2 (Data)
During Harrier initialization, the CMA BIST test failed, but for some other reason (for example, an I2C I/O error). This error code indicates that the BIST test itself hasn't failed but there was an error which occurred either during book-keeping (PROM0 read/write) or the test was not performed at all because it failed to read a Harrier register. (Data = 0x2f)
Resolution: Monitor and replace the node if the issue recurs. If the node is replaced, note for OPS that they should verify I2C to the node PROM is functional.

52 - 0x0
One or more bits in the CPU's two General Power Management registers were set due to an abnormal power reset. The set bits are printed, describing the cause.
Resolution: Contact engineering with data.

58 - 0x0
CBIOS failed to obtain the ME firmware flash unlock code through the HECI interface. This could prevent flash commands from functioning.
Resolution: Try rebooting the node.
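
The data-value decoding described for Code 38 sub-codes 0xc and 0xd (sensor bitmask) and for sub-codes 0xe and 0xf (power supply and battery masks) can be sketched in Python as shown below. This is an illustrative helper only, not an HP tool. The function names are hypothetical, the data value is assumed to be a 32-bit hexadecimal number, and the mapping of battery bit n to "Batn" is extrapolated from the two examples given for sub-code 0xe (10002 and 40001). Bits the table marks as undefined, or does not list at all, are reported as such rather than guessed.

```python
# Sensor bit names for Code 38, sub-codes 0xc (PS0) and 0xd (PS1),
# copied from the table in this guide. Bits 00010000 through 00400000
# are documented as undefined; bit 00000002 is not listed in the table.
SENSOR_BITS = {
    0x00000001: "Temperature",
    0x00000004: "3.3V",
    0x00000008: "3.3V Current",
    0x00000010: "5V",
    0x00000020: "5V Current",
    0x00000040: "12V",
    0x00000080: "12V Current",
    0x00000100: "24V",
    0x00000200: "24V Current",
    0x00000400: "48V",
    0x00000800: "48V Current",
    0x00001000: "Bat0 48V",
    0x00002000: "Bat1 48V",
    0x00004000: "Bat2 48V",
    0x00008000: "Bat0 12V",
    0x00800000: "Battery LED is Amber",
    0x01000000: "Battery Relay is Off",
    0x02000000: "PS LED is Amber",
    0x04000000: "Fan Fail",
    0x08000000: "DC Fail",
    0x10000000: "AC Fail",
    0x20000000: "Power Supply is Disabled",
    0x40000000: "Power Supply Switch is Off",
    0x80000000: "Low Limit exceeded (combined with bits above)",
}


def decode_sensor_limits(data: int) -> list[str]:
    """Return the name of every set bit in a sub-code 0xc/0xd data value.

    Hypothetical helper: bits are reported from least to most significant.
    """
    names = []
    for bit in range(32):
        mask = 1 << bit
        if data & mask:
            names.append(
                SENSOR_BITS.get(mask, f"Undefined or unlisted bit {mask:08x}")
            )
    return names


def decode_ps_battery(data: int) -> tuple[list[str], list[str]]:
    """Split a sub-code 0xe data value into power supplies and batteries.

    The lower 2 bits are the power supply mask (1 = PS0, 2 = PS1, 3 = both).
    The upper 16 bits are the battery mask; treating bit n as "Batn" is an
    assumption based on the examples 10002 and 40001 in the table.
    """
    ps_mask = data & 0x3
    bat_mask = (data >> 16) & 0xFFFF
    supplies = [f"PS{i}" for i in range(2) if ps_mask & (1 << i)]
    batteries = [f"Bat{i}" for i in range(16) if bat_mask & (1 << i)]
    return supplies, batteries
```

For example, decode_ps_battery(0x10002) returns (["PS1"], ["Bat0"]) and decode_ps_battery(0x40001) returns (["PS0"], ["Bat2"]), matching the two examples given for sub-code 0xe. For sub-code 0xf only the power supply list is meaningful. decode_sensor_limits(0x04000001) returns ["Temperature", "Fan Fail"].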
7 Support and Other Resources

Contacting HP
For worldwide technical support information, see the HP support website:
http://www.hp.com/support
Before contacting HP, collect the following information:
• Product model names and numbers
• Technical support registration number (if applicable)
• Product serial numbers
• Error messages
• Operating system type and revision level
• Detailed questions
Specify the type of support you are requesting:
HP 3PAR storage system: HP 3PAR StoreServ 7200, 7400, and 7450 Storage systems
  Support request: StoreServ 7000 Storage
HP 3PAR storage system: HP 3PAR StoreServ 10000 Storage systems, HP 3PAR T-Class storage systems, HP 3PAR F-Class storage systems
  Support request: 3PAR or 3PAR Storage

HP 3PAR documentation
For information about: Supported hardware and software platforms
  See: The Single Point of Connectivity Knowledge for HP Storage Products (SPOCK) website: http://www.hp.com/storage/spock
For information about: Locating HP 3PAR documents
  See: The HP 3PAR StoreServ Storage site: http://www.hp.com/go/3par
  To access HP 3PAR documents, click the Support link for your product.

HP 3PAR storage system software
For information about: Storage concepts and terminology
  See: HP 3PAR StoreServ Storage Concepts Guide
For information about: Using the HP 3PAR Management Console (GUI) to configure and administer HP 3PAR storage systems
  See: HP 3PAR Management Console User's Guide
For information about: Using the HP 3PAR CLI to configure and administer storage systems
  See: HP 3PAR Command Line Interface Administrator's Manual
For information about: CLI commands
  See: HP 3PAR Command Line Interface Reference
For information about: Analyzing system performance
  See: HP 3PAR System Reporter Software User's Guide
For information about: Installing and maintaining the Host Explorer agent in order to manage host configuration and connectivity information
  See: HP 3PAR Host Explorer User's Guide
For information about: Creating applications compliant with the Common Information Model (CIM) to manage HP 3PAR storage systems
  See: HP 3PAR CIM API Programming Reference
For information about: Migrating data from one HP 3PAR storage system to another
  See: HP 3PAR-to-3PAR Storage Peer Motion Guide
For information about: Configuring the Secure Service Custodian server in order to monitor and control HP 3PAR storage systems
  See: HP 3PAR Secure Service Custodian Configuration Utility Reference
For information about: Using the CLI to configure and manage HP 3PAR Remote Copy
  See: HP 3PAR Remote Copy Software User's Guide
For information about: Updating HP 3PAR operating systems
  See: HP 3PAR Upgrade Pre-Planning Guide
For information about: Identifying storage system components, troubleshooting information, and detailed alert information
  See: HP 3PAR F-Class, T-Class, and StoreServ 10000 Storage Troubleshooting Guide
For information about: Installing, configuring, and maintaining the HP 3PAR Policy Server
  See: HP 3PAR Policy Server Installation and Setup Guide
  See: HP 3PAR Policy Server Administration Guide

Planning for HP 3PAR storage system setup
Hardware specifications, installation considerations, power requirements, networking options, and cabling information for HP 3PAR storage systems:
For information about: HP 3PAR 7200, 7400, and 7450 storage systems
  See: HP 3PAR StoreServ 7000 Storage Site Planning Manual
  See: HP 3PAR StoreServ 7450 Storage Site Planning Manual
For information about: HP 3PAR 10000 storage systems
  See: HP 3PAR StoreServ 10000 Storage Physical Planning Manual
  See: HP 3PAR StoreServ 10000 Storage Third-Party Rack Physical Planning Manual

Installing and maintaining HP 3PAR 7200, 7400, and 7450 storage systems
For information about: Installing 7200, 7400, and 7450 storage systems and initializing the Service Processor
  See: HP 3PAR StoreServ 7000 Storage Installation Guide
  See: HP 3PAR StoreServ 7450 Storage Installation Guide
  See: HP 3PAR StoreServ 7000 Storage SmartStart Software User's Guide
For information about: Maintaining, servicing, and upgrading 7200, 7400, and 7450 storage systems
  See: HP 3PAR StoreServ 7000 Storage Service Guide
  See: HP 3PAR StoreServ 7450 Storage Service Guide
For information about: Troubleshooting 7200, 7400, and 7450 storage systems
  See: HP 3PAR StoreServ 7000 Storage Troubleshooting Guide
  See: HP 3PAR StoreServ 7450 Storage Troubleshooting Guide
For information about: Maintaining the Service Processor
  See: HP 3PAR Service Processor Software User Guide
  See: HP 3PAR Service Processor Onsite Customer Care (SPOCC) User's Guide

HP 3PAR host application solutions
For information about: Backing up Oracle databases and using backups for disaster recovery
  See: HP 3PAR Recovery Manager Software for Oracle User's Guide
For information about: Backing up Exchange databases and using backups for disaster recovery
  See: HP 3PAR Recovery Manager Software for Microsoft Exchange 2007 and 2010 User's Guide
For information about: Backing up SQL databases and using backups for disaster recovery
  See: HP 3PAR Recovery Manager Software for Microsoft SQL Server User's Guide
For information about: Backing up VMware databases and using backups for disaster recovery
  See: HP 3PAR Management Plug-in and Recovery Manager Software for VMware vSphere User's Guide
For information about: Installing and using the HP 3PAR VSS (Volume Shadow Copy Service) Provider software for Microsoft Windows
  See: HP 3PAR VSS Provider Software for Microsoft Windows User's Guide
For information about: Best practices for setting up the Storage Replication Adapter for VMware vCenter
  See: HP 3PAR Storage Replication Adapter for VMware vCenter Site Recovery Manager Implementation Guide
For information about: Troubleshooting the Storage Replication Adapter for VMware vCenter Site Recovery Manager
  See: HP 3PAR Storage Replication Adapter for VMware vCenter Site Recovery Manager Troubleshooting Guide
For information about: Installing and using vSphere Storage APIs for Array Integration (VAAI) plug-in software for VMware vSphere
  See: HP 3PAR VAAI Plug-in Software for VMware vSphere User's Guide

Typographic conventions
Table 20 Document conventions
Convention: Bold text
Element:
• Keys that you press
• Text you typed into a GUI element, such as a text box
• GUI elements that you click or select, such as menu items, buttons, and so on
Convention: Monospace text
Element:
• File and directory names
• System output
• Code
• Commands, their arguments, and argument values
Convention: <Monospace text in angle brackets>
Element:
• Code variables
• Command variables
Convention: Bold monospace text
Element:
• Commands you enter into a command line interface
• System output emphasized for scannability

WARNING! Indicates that failure to follow directions could result in bodily harm or death, or in irreversible damage to data or to the operating system.
CAUTION: Indicates that failure to follow directions could result in damage to equipment or data.
NOTE: Provides additional information.
Required: Indicates that a procedure must be followed as directed in order to achieve a functional and supported implementation based on testing at HP.

HP 3PAR branding information
• The server previously referred to as the "InServ" is now referred to as the "HP 3PAR StoreServ Storage system."
• The operating system previously referred to as the "InForm OS" is now referred to as the "HP 3PAR OS."
• The user interface previously referred to as the "InForm Management Console (IMC)" is now referred to as the "HP 3PAR Management Console."
• All products previously referred to as "3PAR" products are now referred to as "HP 3PAR" products.

8 Documentation feedback
HP is committed to providing documentation that meets your needs. To help us improve the documentation, send any errors, suggestions, or comments to Documentation Feedback (docsfeedback@hp.com). Include the document title and part number, version number, or the URL when submitting your feedback.