Chris Bailey, Elaine Pearson, Voula Gkatzidou.
Teesside University; AbilityNet.
Measuring and Comparing the Reliability of the
Structured Walkthrough Evaluation Method with
Novices and Experts
Email: chris.bailey@abilitynet.org.uk
Twitter: @chrisbailey000
The Overall Problem
“We haven't solved the problem of web
accessibility”
- Jeff Bigham, W4A 2013.
The Evaluation Problem
 The evaluator effect: multiple evaluators detect different sets of
problems when examining the same interface (Hertzum &
Jacobsen, 2001).
 Manual evaluation and human judgement remain significant
requirements when using automated tools (Vigo et al, 2013).
 Evaluator expertise is particularly significant for WCAG 2.0
(Brajnik, 2010) and the Barrier Walkthrough (BW) (Yesilada et al, 2009).
 Neither set of WCAG guidelines has reliability definitively above the
W3C threshold (Brajnik, 2009); only 8 of 25 SC can be reliably tested
(Alonso et al, 2010).
 Even with experts, half of WCAG 2.0 SC fail to meet the threshold
(Brajnik, 2010).
Towards A Solution
 With novices, comprehension, knowledge and effort are key
factors (Alonso et al, 2010).
 Evaluation reports (audits) have motivational and educational
value (Sloan, 2006).
 Heuristic evaluation constrains the evaluator (Brajnik, 2005) and
potentially reduces the evaluator effect.
 BW finds more severe issues and fewer false positives (Brajnik,
2006), and more issues than conformance review (CR) (Brajnik, 2008).
Accessibility Evaluation Assistant (AEA)
• Developed as an educational evaluation-support tool
for novices.
• 3 Evaluation Functions; 48 Checks; 5 Categories.
• The SWM guides the novice through the process.
• Checks are based on the potential barriers being tested,
supported by guidance and tutorials.
• Full details are in the W4A 2010 paper.
Structured Walkthrough Method
Each check is broken into a number of components (sketched as a
record below):
1. The title of the accessibility principle (heuristic).
2. A short summary.
3. A general description of the check's importance in terms of the
user group(s) affected and the nature of the barrier or
problem caused.
4. A description of the method to perform the check, with step-by-
step instructions if using the Web Accessibility Toolbar.
5. Steps to verify and record a result for the check.
6. A video demonstration of the check being performed in a live
context.
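As a minimal sketch, a check could be modelled as a simple record mirroring these six components; the type and field names below are hypothetical, not the AEA's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Check:
    """One SWM check, mirroring the six components listed above.
    Field names are hypothetical, not the AEA's actual schema."""
    title: str         # 1. accessibility principle (heuristic)
    summary: str       # 2. short summary
    importance: str    # 3. affected user group(s) and the barrier caused
    method: str        # 4. how to perform the check (e.g. with the Toolbar)
    verification: str  # 5. steps to verify and record a result
    video_url: str     # 6. demonstration video in a live context
```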
Aims of the Experiment
 Define and measure quality attributes of the SWM:
 Reliability
 Validity (Correctness, Sensitivity)
 Usefulness
 Efficiency
 Measure reliability and validity of novice evaluations:
 The extent to which the participants agree on the result of a
check.
 How 'correct' the novice evaluators were: did they reach the
same judgment as the majority of experts?
 Measure reliability of expert evaluations.
 Gain qualitative feedback on the potential usefulness and
viability of the SWM.
Experiment Methodology (Part 1)
 26 final-year Computing students on a 12-week elective Accessibility
and Adaptive Technology module.
 Conducted as an assessment within curriculum constraints.
 4 tasks over 3 weeks:
 2 evaluations: Fitness First and Pure Gym homepages.
 2 reflective pieces: personas/user group, experience of the
evaluation.
 Evaluate both pages for conformance to 15 AEA heuristics; the
heuristics are relevant to both pages, but the results may differ.
 Each check is rated Met, Partly Met or Not Met; participants explain
and justify their decision (a sketch of one decision record follows).
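A minimal sketch of how one decision might be recorded, assuming a simple three-way rating plus a free-text justification; the names and example values are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    MET = "Met"
    PARTLY_MET = "Partly Met"
    NOT_MET = "Not Met"

# One evaluator's result for one check (all values illustrative).
record = {
    "check": "Colour Contrast",
    "decision": Decision.PARTLY_MET,
    "justification": "Body text passes the contrast check, "
                     "but the navigation labels do not.",
}
print(record["check"], "->", record["decision"].value)
```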
Example of Measuring Reliability and Validity – Fitness First

Check             | Met | Part Met | Not Met | Reliability (R) | Validity (V)
Colour Contrast   | 15  | 13*      | 0       | 15/28 (54%)     | 13/28 (46%)
Text Size         | 23* | 5        | 0       | 23/28 (82%)     | 23/28 (82%)
Text Alternatives | 2   | 16       | 10*     | 16/28 (57%)     | 10/28 (38%)
Link Titles       | 2   | 4        | 21*     | 21/28 (75%)     | 21/28 (75%)
Language of Text  | 23* | 3        | 2       | 23/28 (82%)     | 23/28 (82%)

(* marks the decision that matched the expert majority.)
Calculating Overall Reliability and Validity
 In the example, 28 evaluators perform 5 checks, giving a total of 140
decisions.
 R is the extent to which evaluators reached the same decision,
expressed as a proportion of the maximum possible agreement.
 R = (15 + 23 + 16 + 21 + 23) / 140 = 98/140 = 70%
 V is the extent to which decisions match the majority of experts
(Yesilada et al, 2010).
 V = (13 + 23 + 10 + 21 + 23) / 140 = 90/140 ≈ 64%
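The same calculation expressed as a short sketch, assuming R takes each check's modal (most common) decision count and V the count for the decision matching the expert majority (the * in the table above); the variable names are illustrative.

```python
# (met, part_met, not_met) decision counts per check, plus the index
# of the decision that matches the expert majority (* in the table).
decisions = {
    "Colour Contrast":   ((15, 13, 0), 1),
    "Text Size":         ((23, 5, 0), 0),
    "Text Alternatives": ((2, 16, 10), 2),
    "Link Titles":       ((2, 4, 21), 2),
    "Language of Text":  ((23, 3, 2), 0),
}

total = 28 * len(decisions)  # 28 evaluators x 5 checks = 140 decisions
r = sum(max(counts) for counts, _ in decisions.values()) / total  # 98/140
v = sum(counts[i] for counts, i in decisions.values()) / total    # 90/140
print(f"R = {r:.0%}, V = {v:.0%}")  # prints: R = 70%, V = 64%
```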
Results: Reliability and Validity
 Summary of Reliability:

Website       | Novice Evaluations | Expert Evaluations
Fitness First | 62%                | 76%
Pure Gym      | 67%                | -
Overall       | 65%                | 76%

Earlier novice cohorts (reliability): 2011: 66% - 73%, overall 69%;
2012: 63% - 78%, overall 71%.

 Summary of Validity:

Website       | Novice Evaluations
Fitness First | 48%
Pure Gym      | -
Overall       | 48%

Earlier novice cohorts (validity): 2011: 56% - 65%, overall 60%;
2012: 62% - 73%, overall 68%.
Results: Comparison of Reliability
Check                     | Novices R | Experts R
Images of Text            | 60%       | 66%
Colour Contrast           | 54%       | 83%
Moving Elements           | 57%       | 83%
Text Size                 | 82%       | 100%
Keyboard Navigation       | 75%       | 50%
Link Names                | 57%       | 66%
Skip Navigation Link      | 68%       | 66%
Text Alternatives         | 57%       | 83%
Link Titles               | 75%       | 66%
Headings and Sub-Headings | 39%       | 66%
Form Labels               | 50%       | 50%
Identify Language of Text | 82%       | 100%
Validate (X)HTML Code     | 68%       | 100%
Site Map                  | 57%       | 66%
Accessibility Information | 50%       | 83%

 Checks performed by experts generally had a higher level of reliability.
Results: Validity of Novice Evaluations
 Validity of some novice checks was particularly low; reasons include:
 Lack of thoroughness (Alonso et al, 2010)
 Incomplete instructional information.

Check                     | Validity (V)
1. Images of Text         | 14%
Colour Contrast           | 46%
Moving Elements           | 57%
Text Size                 | 82%
Keyboard Navigation       | 75%
2. Link Names             | 0%
3. Skip Navigation Link   | 1%
Text Alternatives         | 36%
Link Titles               | 75%
Headings and Sub-Headings | 39%
Form Labels               | 50%
Identify Language of Text | 92%
Validate (X)HTML Code     | 68%
Site Map                  | 39%
Accessibility Information | 43%
Overall                   | 48%

(Numbered rows mark the lowest-validity checks.)
Expert Feedback: Viability and Usefulness
 “Simple to understand and well structured. Could easily
follow the steps based on the instructions provided.”
 “It was easy and succinct. Found it pretty useful.”
 “Much simpler (than WCAG 2.0) and more directed.”
 “The information about why it (a check) is important and
how to check it.”
Expert Feedback: Viability and Usefulness
 “The evaluation tool would be very useful for someone with
little accessibility experience. They would be able to evaluate a
web page using the instructions and video provided.”
 “I don't think it could replace a WCAG 2.0 audit but it does
have the benefit of being a quick way to evaluate a number of
pages to provide indicators as to where problem areas are
before conducting a more in-depth WCAG 2.0 audit once the
top level issues have been fixed.”
 “Works well if only Internet Explorer is used however in my
testing I will use Firefox inspectors and assistive technology to
verify issues.”
Expert Feedback: Viability and Usefulness
 “Ability to grade issues as partially met/not met felt useful.
Checkpoints seemed quite broad, allowing for some
degree of flexibility when interpreting.”
 “….judgement was still required as to how to classify a
check. If one of the points was not met does that mean
'part met' or 'not met'? How much common sense and
judgement should be applied? However this is still much
better than WCAG 2.0 where guidance at this level is a
really big issue.”
Expert Feedback: Appropriateness and
Specificity
 “Some aspects not covered (colour/sensory reliance),
heading interpretation is too strict. Not sure on
coverage of link title attributes and how much of an
impact adhering to this checkpoint would have in
practical terms.”
 “Requirement for a sitemap explicitly was good rather
than the vaguer, WCAG 2.0 equivalent.”
Conclusion
 Levels of reliability of novice evaluations have been consistent.
 Reliability of expert evaluations was high, with an overall figure
approaching 80%.
 The AEA is not an appropriate means to deliver the method to experts.
 Coverage needs improvement (e.g. notification of dynamic content).
 The current approach is useful for top-level evaluations.
 The tool needs redevelopment.
 The method has been trialled with different cohorts of novices.
 Further trials of the method with experts.
 WCAG 2.0 integration.
Chris Bailey, Elaine Pearson, Voula Gkatzidou.
Teesside University; AbilityNet.
Measuring and Comparing the Reliability of the
Structured Walkthrough Evaluation Method with
Novices and Experts
Email: chris.bailey@abilitynet.org.uk
Twitter: @chrisbailey000