Chris Bailey, Elaine Pearson, Voula Gkatzidou.
Teesside University; AbilityNet.
Measuring and Comparing the Reliability of the
Structured Walkthrough Evaluation Method with
Novices and Experts
Email: chris.bailey@abilitynet.org.uk
Twitter: @chrisbailey000
The Overall Problem
“We haven't solved the problem of web
accessibility”
- Jeff Bigham, W4A 2013.
The Evaluation Problem
 The evaluator effect: multiple evaluators detect different sets of
problems when examining the same interface (Hertzum &
Jacobsen, 2001).
 Manual evaluation and human judgement remain significant
requirements when using automated tools (Vigo et al, 2013).
 Evaluator expertise is particularly significant for WCAG 2.0
(Brajnik, 2010) and the Barrier Walkthrough (BW) (Yesilada et al, 2009).
 Neither set of WCAG guidelines has reliability definitively above the
W3C threshold (Brajnik, 2009); only 8 of 25 SC can be reliably tested
(Alonso et al, 2010).
 Even with experts, half of WCAG 2.0 SC fail to meet the threshold
(Brajnik, 2010).
Towards A Solution
 With novices, comprehension, knowledge and effort are key
factors (Alonso et al, 2010).
 Evaluation reports (audits) have motivational and educational
value (Sloan, 2006).
 Heuristic evaluation constrains the evaluator (Brajnik, 2005) and
potentially reduces the evaluator effect.
 BW finds more severe issues and fewer false positives (Brajnik,
2006), and more issues than conformance review (CR) (Brajnik, 2008).
Accessibility Evaluation Assistant (AEA)
• Developed as an educational evaluation-support tool
for novices.
• 3 Evaluation Functions; 48 Checks; 5 Categories.
• The SWM guides the novice through the process.
• Checks are based on the potential barriers being tested,
supported by guidance and tutorials.
• Full details are in the W4A 2010 paper.
Structured Walkthrough Method
Each check is broken into a number of components (sketched as a
record below):
1. The title of the accessibility principle (heuristic).
2. A short summary.
3. A general description of the check's importance in terms of the
user group(s) affected and the nature of the barrier or
problem caused.
4. A description of the method to perform the check, with step-by-
step instructions if using the Web Accessibility Toolbar.
5. Steps to verify and record a result for the check.
6. A video demonstration of the check being performed in a live
context.
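As a minimal sketch, a check could be modelled as a simple record mirroring these six components; the type and field names below are hypothetical, not the AEA's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Check:
    """One SWM check, mirroring the six components listed above.
    Field names are hypothetical, not the AEA's actual schema."""
    title: str         # 1. accessibility principle (heuristic)
    summary: str       # 2. short summary
    importance: str    # 3. affected user group(s) and the barrier caused
    method: str        # 4. how to perform the check (e.g. with the Toolbar)
    verification: str  # 5. steps to verify and record a result
    video_url: str     # 6. demonstration video in a live context
```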
Aims of the Experiment
 Define and measure quality attributes of the SWM:
 Reliability
 Validity (Correctness, Sensitivity)
 Usefulness
 Efficiency
 Measure reliability and validity of novice evaluations:
 The extent to which the participants agree on the result of a
check.
 How 'correct' the novice evaluators were: did they reach the
same judgment as the majority of experts?
 Measure reliability of expert evaluations.
 Gain qualitative feedback on the potential usefulness and
viability of the SWM.
Experiment Methodology (Part 1)
 26 final-year Computing students on a 12-week elective Accessibility
and Adaptive Technology module.
 Conducted as an assessment within curriculum constraints.
 4 tasks over 3 weeks:
 2 evaluations: Fitness First and Pure Gym homepages.
 2 reflective pieces: personas/user group, experience of the
evaluation.
 Evaluate both pages for conformance to 15 AEA heuristics; the
heuristics are relevant to both pages, but the results may differ.
 Each check is rated Met, Partly Met or Not Met; participants explain
and justify their decision (a sketch of one decision record follows).
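A minimal sketch of how one decision might be recorded, assuming a simple three-way rating plus a free-text justification; the names and example values are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    MET = "Met"
    PARTLY_MET = "Partly Met"
    NOT_MET = "Not Met"

# One evaluator's result for one check (all values illustrative).
record = {
    "check": "Colour Contrast",
    "decision": Decision.PARTLY_MET,
    "justification": "Body text passes the contrast check, "
                     "but the navigation labels do not.",
}
print(record["check"], "->", record["decision"].value)
```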
Example of Measuring Reliability and Validity – Fitness First

Check             | Met | Part Met | Not Met | Reliability (R) | Validity (V)
Colour Contrast   | 15  | 13*      | 0       | 15/28 (54%)     | 13/28 (46%)
Text Size         | 23* | 5        | 0       | 23/28 (82%)     | 23/28 (82%)
Text Alternatives | 2   | 16       | 10*     | 16/28 (57%)     | 10/28 (38%)
Link Titles       | 2   | 4        | 21*     | 21/28 (75%)     | 21/28 (75%)
Language of Text  | 23* | 3        | 2       | 23/28 (82%)     | 23/28 (82%)

(* marks the decision that matched the expert majority.)
Calculating Overall Reliability and Validity
 In the example, 28 evaluators perform 5 checks, giving a total of 140
decisions.
 R is the extent to which evaluators reached the same decision,
expressed as a proportion of the maximum possible agreement.
 R = (15 + 23 + 16 + 21 + 23) / 140 = 98/140 = 70%
 V is the extent to which decisions match the majority of experts
(Yesilada et al, 2010).
 V = (13 + 23 + 10 + 21 + 23) / 140 = 90/140 ≈ 64%
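The same calculation expressed as a short sketch, assuming R takes each check's modal (most common) decision count and V the count for the decision matching the expert majority (the * in the table above); the variable names are illustrative.

```python
# (met, part_met, not_met) decision counts per check, plus the index
# of the decision that matches the expert majority (* in the table).
decisions = {
    "Colour Contrast":   ((15, 13, 0), 1),
    "Text Size":         ((23, 5, 0), 0),
    "Text Alternatives": ((2, 16, 10), 2),
    "Link Titles":       ((2, 4, 21), 2),
    "Language of Text":  ((23, 3, 2), 0),
}

total = 28 * len(decisions)  # 28 evaluators x 5 checks = 140 decisions
r = sum(max(counts) for counts, _ in decisions.values()) / total  # 98/140
v = sum(counts[i] for counts, i in decisions.values()) / total    # 90/140
print(f"R = {r:.0%}, V = {v:.0%}")  # prints: R = 70%, V = 64%
```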
Results: Reliability and Validity
 Summary of Reliability:

Website       | Novice Evaluations | Expert Evaluations
Fitness First | 62%                | 76%
Pure Gym      | 67%                | -
Overall       | 65%                | 76%

Earlier novice cohorts (reliability): 2011: 66% - 73%, overall 69%;
2012: 63% - 78%, overall 71%.

 Summary of Validity:

Website       | Novice Evaluations
Fitness First | 48%
Pure Gym      | -
Overall       | 48%

Earlier novice cohorts (validity): 2011: 56% - 65%, overall 60%;
2012: 62% - 73%, overall 68%.
Results: Comparison of Reliability
Check                     | Novices R | Experts R
Images of Text            | 60%       | 66%
Colour Contrast           | 54%       | 83%
Moving Elements           | 57%       | 83%
Text Size                 | 82%       | 100%
Keyboard Navigation       | 75%       | 50%
Link Names                | 57%       | 66%
Skip Navigation Link      | 68%       | 66%
Text Alternatives         | 57%       | 83%
Link Titles               | 75%       | 66%
Headings and Sub-Headings | 39%       | 66%
Form Labels               | 50%       | 50%
Identify Language of Text | 82%       | 100%
Validate (X)HTML Code     | 68%       | 100%
Site Map                  | 57%       | 66%
Accessibility Information | 50%       | 83%

 Checks performed by experts generally had a higher level of reliability.
Results: Validity of Novice Evaluations
 Validity of some novice checks was particularly low; reasons include:
 Lack of thoroughness (Alonso et al, 2010)
 Incomplete instructional information.

Check                     | Validity (V)
1. Images of Text         | 14%
Colour Contrast           | 46%
Moving Elements           | 57%
Text Size                 | 82%
Keyboard Navigation       | 75%
2. Link Names             | 0%
3. Skip Navigation Link   | 1%
Text Alternatives         | 36%
Link Titles               | 75%
Headings and Sub-Headings | 39%
Form Labels               | 50%
Identify Language of Text | 92%
Validate (X)HTML Code     | 68%
Site Map                  | 39%
Accessibility Information | 43%
Overall                   | 48%

(Numbered rows mark the lowest-validity checks.)
Expert Feedback: Viability and Usefulness
 “Simple to understand and well structured. Could easily
follow the steps based on the instructions provided.”
 “It was easy and succinct. Found it pretty useful.”
 “Much simpler (than WCAG 2.0) and more directed.”
 “The information about why it (a check) is important and
how to check it.”
Expert Feedback: Viability and Usefulness
 “The evaluation tool would be very useful for someone with
little accessibility experience. They would be able to evaluate a
web page using the instructions and video provided.”
 “I don't think it could replace a WCAG 2.0 audit but it does
have the benefit of being a quick way to evaluate a number of
pages to provide indicators as to where problem areas are
before conducting a more in-depth WCAG 2.0 audit once the
top level issues have been fixed.”
 “Works well if only Internet Explorer is used however in my
testing I will use Firefox inspectors and assistive technology to
verify issues.”
Expert Feedback: Viability and Usefulness
 “Ability to grade issues as partially met/not met felt useful.
Checkpoints seemed quite broad, allowing for some
degree of flexibility when interpreting.”
 “….judgement was still required as to how to classify a
check. If one of the points was not met does that mean
'part met' or 'not met'? How much common sense and
judgement should be applied? However this is still much
better than WCAG 2.0 where guidance at this level is a
really big issue.”
Expert Feedback: Appropriateness and
Specificity
 “Some aspects not covered (colour/sensory reliance),
heading interpretation is too strict. Not sure on
coverage of link title attributes and how much of an
impact adhering to this checkpoint would have in
practical terms.”
 “Requirement for a sitemap explicitly was good rather
than the vaguer, WCAG 2.0 equivalent.”
Conclusion
 Levels of reliability of novice evaluations have been consistent.
 Reliability of expert evaluations was high, with an overall figure
approaching 80%.
 The AEA is not an appropriate means to deliver the method to experts.
 Coverage needs improvement (e.g. notification of dynamic content).
 The current approach is useful for top-level evaluations.
 The tool needs redevelopment.
 The method has been trialled with different cohorts of novices.
 Further trials of the method with experts.
 WCAG 2.0 integration.
Chris Bailey, Elaine Pearson, Voula Gkatzidou.
Teesside University; AbilityNet.
Measuring and Comparing the Reliability of the
Structured Walkthrough Evaluation Method with
Novices and Experts
Email: chris.bailey@abilitynet.org.uk
Twitter: @chrisbailey000