4 colin walls - self-testing in embedded systems

Colin Walls
Self-Testing in Embedded Systems
colin_walls@mentor.com
http://blogs.mentor.com/colinwalls

Restricted © 2017 Mentor Graphics Corporation
Agenda
Introduction
CPU Failure
Peripheral Failure
Memory Failure
Software Error Conditions
Failure Recovery and Reporting
Conclusions

Introduction
 Failure is almost inevitable
 Important to accept this fact
 Key issues:
— How to reduce likelihood of failure
— How to handle impending failure
— How to recover from failure conditions

Failure and Testing
 What can fail in an embedded system?
— CPU
— Peripherals
— Memory
— Software
 Start-up testing
 Background testing
 Watchdog

CPU Failure
 Quite unlikely for just processor to fail
 Possibly no hope of recovery
 Failure most likely at power up
 Partial failure very rare
 Multicore designs offer options

Peripheral Failure
 Total failure: not responding to address
— Trap handler
 Other failures/tests are device dependent
 Loop back testing is common
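
The loop back idea can be sketched in C as follows. This is a minimal illustration, not a driver for any real device: the `serial_port` interface, the `sim_*` functions and the choice of test bytes are all assumptions made for the example. A real port would also need a timeout around the read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device interface, invented for this sketch. */
typedef struct {
    void (*set_loopback)(bool enable);  /* route transmit back to receive */
    void (*write_byte)(uint8_t b);
    bool (*read_byte)(uint8_t *b);      /* false if nothing was received  */
} serial_port;

/* Send a few bytes in loop back mode and check that each one comes back. */
bool loopback_test(const serial_port *port)
{
    static const uint8_t patterns[] = { 0x00, 0xff, 0xaa, 0x55 };
    bool ok = true;

    port->set_loopback(true);
    for (size_t i = 0; i < sizeof patterns; i++) {
        uint8_t got;
        port->write_byte(patterns[i]);
        if (!port->read_byte(&got) || got != patterns[i]) {
            ok = false;
            break;
        }
    }
    port->set_loopback(false);
    return ok;
}

/* Simulated device so the sketch can run on a host (assumption). */
static bool    sim_loop;
static bool    sim_broken;    /* set to true to imitate a dead device */
static bool    sim_has_data;
static uint8_t sim_data;

static void sim_set_loopback(bool enable) { sim_loop = enable; }

static void sim_write(uint8_t b)
{
    if (sim_loop && !sim_broken) {
        sim_data = b;
        sim_has_data = true;
    }
}

static bool sim_read(uint8_t *b)
{
    if (!sim_has_data)
        return false;
    *b = sim_data;
    sim_has_data = false;
    return true;
}

static const serial_port sim_port = { sim_set_loopback, sim_write, sim_read };
```

Calling `loopback_test(&sim_port)` passes on the simulated device and fails once `sim_broken` is set.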

Memory Failure
 Systems have a lot of memory
— Surprising that failure is not more common
 Failure is most likely at power up
— That is the time for comprehensive testing
 3 failure modes:
— Not responding
— Stuck bits
— Cross-talk

Memory Testing
 “Moving Ones” test
— Looks for stuck bits and cross talk with fine grain resolution
— Perform on start up
— Code to only use registers
— “Moving Zeros” is the same idea
 Pattern test
— Also looks for stuck bits and cross talk
— Perform in background task
 All tests can be optimized if the memory architecture is known

Moving Ones Test
set every bit of memory to 0
for each bit of memory
{
    verify that all bits are 0
    set the bit under test to 1
    verify that it is 1
    verify all other bits are 0
    set the bit under test to 0
}
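
The pseudocode above might be rendered in C roughly as follows. This is a sketch under stated assumptions: it runs over an ordinary buffer, the names `moving_ones_test` and `region_matches` are invented here, and it does not meet the "only use registers" constraint that a real start-up test of system RAM would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if mem[index] == value and every other byte in the region is 0. */
static bool region_matches(volatile uint8_t *mem, size_t len,
                           size_t index, uint8_t value)
{
    for (size_t j = 0; j < len; j++) {
        uint8_t expected = (j == index) ? value : 0x00;
        if (mem[j] != expected)
            return false;
    }
    return true;
}

/* Moving ones test over mem[0..len). Destroys the contents; the region
   is left zeroed on success. Cost grows with the square of the size. */
bool moving_ones_test(volatile uint8_t *mem, size_t len)
{
    /* Set every bit of memory to 0. */
    for (size_t i = 0; i < len; i++)
        mem[i] = 0x00;

    for (size_t i = 0; i < len; i++) {
        for (unsigned bit = 0; bit < 8; bit++) {
            uint8_t mask = (uint8_t)(1u << bit);

            /* Verify that all bits are 0. */
            if (!region_matches(mem, len, i, 0x00))
                return false;

            /* Set the bit under test to 1. */
            mem[i] = mask;

            /* Verify that it is 1 and all other bits are 0. */
            if (!region_matches(mem, len, i, mask))
                return false;

            /* Set the bit under test back to 0. */
            mem[i] = 0x00;
        }
    }
    return true;
}
```

The `volatile` qualifier is there so that the compiler does not optimize away the read-back of each location.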

Pattern Test
for each byte of memory
{
    turn off interrupts
    save memory byte contents
    for values 0x00, 0xff, 0xaa, 0x55
    {
        write value to byte under test
        verify value of byte
    }
    restore byte data
    turn on interrupts
}
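
A C version of the pattern test could look something like this. The interrupt hooks `interrupts_off` and `interrupts_on` are empty placeholders assumed for the sketch; on a real target they would mask and unmask interrupts in whatever way the CPU requires.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: empty here, target specific in a real system. */
static void interrupts_off(void) { }
static void interrupts_on(void)  { }

/* Non-destructive pattern test over mem[0..len).
   Each byte is saved, exercised with four patterns and then restored. */
bool pattern_test(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00, 0xff, 0xaa, 0x55 };

    for (size_t i = 0; i < len; i++) {
        bool ok = true;

        interrupts_off();
        uint8_t saved = mem[i];              /* save memory byte contents */
        for (size_t p = 0; p < sizeof patterns; p++) {
            mem[i] = patterns[p];            /* write value               */
            if (mem[i] != patterns[p])       /* verify value              */
                ok = false;
        }
        mem[i] = saved;                      /* restore byte data         */
        interrupts_on();

        if (!ok)
            return false;
    }
    return true;
}
```

Because the contents are restored, a background task could call this on a small slice of memory at a time, keeping each interrupts-off window short.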

Software Error Conditions
 Bugs can lead to unpredictable failure
 Defensive code can anticipate problems
 2 key failure modes:
— Data corruption
— Code looping

Data Corruption
 Pointers!
 Null pointer
— Trap handler
 Incorrect pointer
— May point anywhere
— Leads to random corruption
— MMU may help
— Special cases
– Stack overflow/underflow
– Array bound violations

Stack Overflow/Underflow
 Avoid by careful testing
— Use memory access breakpoints during debug
— Unexpected recursion depth is hard to predict
 Use guard words
— Test periodically
– Background task?
— Choose an odd value so it cannot be mistaken for a word-aligned address
— Use unlikely value
– Not 0, 1, 0xffffffff
— Only about a 1 in 4 billion chance that a random overwrite matches the guard word and goes unnoticed
— Might use MMU

[Figure: stack contents bounded by guard words of value 0x16951695]
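
The guard word check could be sketched in C like this. It models the stack as a plain array so that it runs on a host; the type and function names are invented for the example, and mapping the low guard to overflow assumes a stack that grows towards lower addresses.

```c
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORD  0x16951695u   /* the value used on the slides */
#define STACK_WORDS 8u            /* arbitrary size for the sketch */

/* Bit flags so that both faults can be reported together. */
enum {
    STACK_OK        = 0,
    STACK_OVERFLOW  = 1,   /* low guard overwritten  */
    STACK_UNDERFLOW = 2    /* high guard overwritten */
};

/* words[0] and words[STACK_WORDS + 1] are guards; the rest is usable. */
typedef struct {
    uint32_t words[STACK_WORDS + 2u];
} guarded_stack;

void guarded_stack_init(guarded_stack *s)
{
    for (size_t i = 0; i < STACK_WORDS + 2u; i++)
        s->words[i] = 0;
    s->words[0] = GUARD_WORD;
    s->words[STACK_WORDS + 1u] = GUARD_WORD;
}

/* Intended to be called periodically, for example from a background task. */
unsigned guarded_stack_check(const guarded_stack *s)
{
    unsigned status = STACK_OK;

    if (s->words[0] != GUARD_WORD)
        status |= STACK_OVERFLOW;
    if (s->words[STACK_WORDS + 1u] != GUARD_WORD)
        status |= STACK_UNDERFLOW;
    return status;
}
```

A check like this only reports damage after the fact, which is why the test needs to run often enough for the system to react in time.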

Array Bound Violations
 No checking in C
— Considered a runtime overhead
— Could add in C++
– Overload the [ ] operator
— Pointers can get around bounding
 Using guard words makes sense
— Most programming errors result in writing to memory immediately adjacent to the end of the array
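
Both ideas can be illustrated in C, as a rough sketch. A checked accessor function stands in for the overloaded [ ] operator that C++ would allow, and a guard word placed directly after the array catches writes that bypass the check. The `guarded_array` type and the `ga_*` names are assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN  4u
#define GUARD_WORD 0x16951695u

/* storage[0..ARRAY_LEN-1] is the array; storage[ARRAY_LEN] is the guard. */
typedef struct {
    uint32_t storage[ARRAY_LEN + 1u];
} guarded_array;

void ga_init(guarded_array *a)
{
    for (size_t i = 0; i < ARRAY_LEN; i++)
        a->storage[i] = 0;
    a->storage[ARRAY_LEN] = GUARD_WORD;
}

/* Checked write: refuses any index outside the array. */
bool ga_set(guarded_array *a, size_t index, uint32_t value)
{
    if (index >= ARRAY_LEN)
        return false;
    a->storage[index] = value;
    return true;
}

/* Detects writes that went one element past the end. */
bool ga_guard_intact(const guarded_array *a)
{
    return a->storage[ARRAY_LEN] == GUARD_WORD;
}
```

An unchecked loop with the classic `<=` off-by-one error, writing through `storage` directly, would overwrite the guard and make `ga_guard_intact` return false.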

Code Looping
 Infinite loops should not occur
 May be a programming error
 May be a device failing to respond
— Software should have timeout
 Hardware watchdog helpful
 In multi-threaded environment, use a watchdog task
— Event flag for each task
— Watchdog sets flags to 1 and goes to sleep
— Other tasks periodically set their flag to 0
— When watchdog wakes, any 1 flags result in alarm

Using a Watchdog Task
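[Figure: Task 1, Task 2 and Task 3 each have an event flag; the watchdog task sets all flags to 1, the tasks clear their own flags to 0 as they run, and any flag still at 1 when the watchdog wakes raises the alarm]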
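
The flag scheme can be sketched in C without tying it to any particular RTOS. The function names and the use of one bit per task are assumptions for this example, and a real system would need the flag update to be atomic, or protected by the kernel's event flag service.

```c
#include <stdint.h>

#define NUM_TASKS 3u
#define ALL_TASKS ((1u << NUM_TASKS) - 1u)   /* one flag bit per task */

static volatile uint32_t watchdog_flags;

/* Watchdog task: set every flag to 1, then go to sleep. */
void watchdog_arm(void)
{
    watchdog_flags = ALL_TASKS;
}

/* Each monitored task calls this periodically to set its flag to 0. */
void task_checkin(unsigned task)
{
    watchdog_flags &= ~(1u << task);
}

/* Watchdog task on waking: any bit still set marks a task that
   did not check in, so a nonzero result should raise the alarm. */
uint32_t watchdog_check(void)
{
    return watchdog_flags & ALL_TASKS;
}
```

The watchdog's sleep period has to be longer than the slowest task's check-in interval, otherwise healthy tasks would trigger the alarm.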

Failure Reporting and Recovery
 Action on finding a fault is very system dependent
 If there is a user interface, an alarm may be activated
 A deeply embedded system may have no option other than a reset
 Example: heart pacemaker

Sounding the Alarm
 Display
— Text/graphics
 Sound
 Network
— Send email
— Web page
 LEDs
— On/Off
— Color
— Flashing

Flashing LEDs
 Speed
— Slow – “heartbeat”
— Fast – “error”
 Duty cycle
— Morse code?
LONG = 500
SHORT = 50
flash_delay = LONG
LED_state = 0
loop-forever
{
    flags = 0xff
    sleep(flash_delay)
    set_LED(LED_state)
    if LED_state = 0
        LED_state = 1
    else
        LED_state = 0
    if flags <> 0
        flash_delay = SHORT
}
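
The body of that loop might look like this in C, written as a step function so that it can be exercised on a host. The `heartbeat` type and function names are invented for the sketch, and treating the delays as milliseconds is an assumption; the slide gives only the numbers 500 and 50.

```c
#include <stdint.h>

#define LONG_DELAY_MS  500u   /* slow flash: "heartbeat" */
#define SHORT_DELAY_MS 50u    /* fast flash: "error"     */

typedef struct {
    unsigned flash_delay_ms;  /* how long to sleep before the next step */
    int      led_state;       /* value to drive on the next step        */
} heartbeat;

void heartbeat_init(heartbeat *h)
{
    h->flash_delay_ms = LONG_DELAY_MS;
    h->led_state = 0;
}

/* One pass of the loop, run after the sleep has elapsed.
   flags_after_sleep holds the task flags as found on waking.
   Returns the LED state to output on this pass. */
int heartbeat_step(heartbeat *h, uint32_t flags_after_sleep)
{
    int output = h->led_state;

    h->led_state = !h->led_state;            /* toggle for next pass */
    if (flags_after_sleep != 0)
        h->flash_delay_ms = SHORT_DELAY_MS;  /* latch the error rate */
    return output;
}
```

As in the pseudocode, the delay never returns to the long value once an error has been seen, so the fast flash persists until reset.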

Reset
 Automatic reset is sometimes the only option
— Deeply embedded systems with no UI
— Maybe log for later reference
 User input
— Reset button may be reassuring
— Multi-key sequences not intuitive or reliable

Conclusions
 First rule is accept that failure is possible
 Consider all possible failure modes
 Add code to monitor system “health”
 Consider action on failure
— Warn
— Fix

Colin Walls
Thank you
colin_walls@mentor.com
http://blogs.mentor.com/colinwalls
