Summer PIT 2019
Validation and Mechanism
exploring the limits of evaluation
Alan Dix
http://alandix.com/academic/talks/PIT-2019-validation-and-mechanism/
Tiree
Tiree Tech Wave
3-7 October
Computational Foundry
Swansea University
the foundry – building, mission, community
Validation and mechanism: exploring the limits of evaluation
what is evaluation?
easy and hard questions
what does it mean anyway?
easy questions
How fast do people recognise a menu option?
Is product A easier to learn than product B?
even then …
individuals or average
better for whom?
WEIRD …
WEIRD people
Henrich, J., Heine, S. and Norenzayan, A. (2010). The weirdest people in the world?
Behavioral and Brain Sciences, 33(2–3), 61–83; discussion 83–135.
doi:10.1017/S0140525X0999152X
Western, Educated, Industrialized,
Rich, and Democratic
Harder questions
Subjective experience (UX, fun)
Long-term interactions
e.g. meetings
Long-term effects
e.g. education, sustainability, behaviour change
What do we mean by evaluation?
often
post-hoc empirical study/experiment
… but why?
what is it for?
why are you doing it?
exploration vs. validation
process vs. product
research
– exploration: finding questions
– validation: answering them
– explanation: finding why and how
methods and data span ethnography, in-depth interviews and detailed observation,
through big data, experiments and large-scale surveys –
qualitative data, quantitative data, theoretical models, mechanism
development: design – build – test
formative (process): make it better
summative (product): does it work?
purpose
two – no, three – types of evaluation
purpose / stage:
– formative: improve a design (development)
– summative: say “this is good” (contractual/sales)
– investigative: gain understanding (research, user research, big changes)
exploration / formative
– find any interesting issues
– statistics are about deciding priorities
validation / summative
– exhaustive: find all problems/issues
– verifying: is hypothesis true, does system work
– mensuration: how good, how prevalent
explanation / investigative
– matching qualitative/quantitative, small/large samples
are five users enough?
original work
Nielsen & Landauer (1993) about iterative process
not summative – not for stats!
how many?
to find enough to do in next development cycle
depends on size of project and complexity
nowadays, with cheap development, maybe n=1
but always more in next cycle
N.B. later work
on saturation
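For those who want the arithmetic behind the ‘five users’ claim, a minimal sketch of the Nielsen & Landauer problem-discovery model follows; λ ≈ 0.31 is the average they report, but it varies a lot between projects, so the percentages are illustrative only.

    # Sketch of the Nielsen & Landauer (1993) problem-discovery curve:
    # found(n) = N * (1 - (1 - lam)**n), where lam is the chance that a single
    # test user exposes any given problem. lam = 0.31 is their reported average
    # - an assumption here, not a property of your project.
    def problems_found(n_users, n_problems=100, lam=0.31):
        """Expected number of distinct problems found after n_users tests."""
        return n_problems * (1 - (1 - lam) ** n_users)

    for n in (1, 3, 5, 10, 15):
        print(f"{n:2d} users -> roughly {problems_found(n):.0f}% of problems")

With those assumptions five users surface around 84% of the problems – usually more than enough to fill the next development cycle, which is the formative purpose described above.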
Validation and mechanism: exploring the limits of evaluation
from evaluation
to validation
dealing with
harder questions
validating work
• justification
– expert opinion
– previous research
– new experiments
• evaluation
– experiments
– user studies
– peer review
your work
generative artefacts
• toolkits
• devices
• interfaces
• guidelines
• methodologies
evaluation of the artefact is a singularity –
but generative artefacts are used by different people, in different situations, by different designers
(pure) evaluation of generative artefacts
is methodologically unsound
your work – validated by justification (expert opinion, previous research, new experiments)
and evaluation (experiments, user studies, peer review)
justification vs. validation
• different disciplines
– mathematics: proof = justification
– medicine: drug trials = evaluation
• combine them:
– look for weakness in justification
– focus evaluation there
Validation and mechanism: exploring the limits of evaluation
mechanism
from what happens
to how and why
mechanism
phenomena – quantitative and statistical: what is true, end to end
mechanism – qualitative and theoretical: why and how
generalisation
empirical data
at best interpolate
understanding mechanism allows:
extrapolation
application in new contexts
mechanism
• reduction reconstruction
– formal hypothesis testing
+ may be qualitative too
– more scientific precision
• wholistic analytic
– field studies, ethnographies
+ ‘end to end’ experiments
– more ecological validity
example: mobile font size
early paper on fonts in mobile menus:
well conducted experiment
statistically significant results
conclusion gives best font size
but … a menu selection task includes:
1. visual search (better big fonts)
2. if not found scroll/page display (better small fonts)
3. when found touch target (better big fonts)
no single best size – the balance depends on menu length, etc.
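A toy cost model makes the trade-off concrete; every constant below is invented for illustration (nothing is taken from the paper), and the only point is that the optimum font size shifts with menu length instead of being a single number.

    import math

    # Toy model - all constants are made up for illustration only.
    def selection_time(font_pt, menu_items, screen_pt=200):
        items_per_screen = max(1, screen_pt // font_pt)
        screens = math.ceil(menu_items / items_per_screen)
        search = (menu_items / 2) * (0.25 + 0.8 / font_pt)  # quicker with bigger fonts
        scroll = 0.6 * (screens - 1)                        # worse with bigger fonts
        touch = 3.0 / font_pt                               # easier with bigger fonts
        return search + scroll + touch

    for n in (8, 30, 100):
        best = min(range(8, 31), key=lambda pt: selection_time(pt, n))
        print(f"{n:3d}-item menu -> lowest modelled time around {best}pt")

In this made-up model, short menus favour the largest font that still fits on one screen, while long menus push the optimum down because scrolling starts to dominate – exactly the kind of interaction a single ‘best size’ hides.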
Validation and mechanism: exploring the limits of evaluation
what have you
really shown?
stats are about the measure,
but what does it measure?
what have you really shown
• think about the conditions
– are there other explanations for data?
• individual or population
– small number of groups/individuals, many measurements (see the sketch after this list)
– significant statistics => effect reliable for each individual tested
– but are individuals representative of all?
• systems vs properties
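One way to see the individual-versus-population point is a small simulation (all numbers hypothetical): piling up trials makes each participant’s mean very precise, but with only three participants any claim about people in general still rests on a sample of three.

    import random, statistics

    # Hypothetical numbers: 3 participants, 200 trials each.
    random.seed(1)
    population_mean = 50    # 'true' average effect across people (ms)
    between_people_sd = 30  # how much individuals differ (ms)
    trial_sd = 80           # trial-to-trial noise within one person (ms)

    person_means = []
    for _ in range(3):
        person_true = random.gauss(population_mean, between_people_sd)
        trials = [random.gauss(person_true, trial_sd) for _ in range(200)]
        person_means.append(statistics.mean(trials))

    print("per-person means:", [round(m, 1) for m in person_means])
    print("spread between people:", round(statistics.stdev(person_means), 1))
    # 200 trials pin down each person's mean (SE ~ 80/sqrt(200) ~ 6 ms),
    # but generalising to other people rests on the n=3 sample of people.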
a little story …
BIG ACM conference – ‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
A – domain specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous
experiment
sensible quality measures
reasonable numbers of subjects in each condition
significant results p<0.05
domain spec. > generic
asynchronous > synchronous
conclusion: really want asynchronous domain specific
(diagram: the three systems on a domain-specific/generic × sync/async grid – A: domain specific, sync; B: generic, sync; C: generic, async)
what’s wrong with that?
interaction effects
gap is interesting to study
not necessarily good to implement
more important …
if you blinked at the wrong moment …
NOT independent variables
three different pieces of software
like experiment on 3 people!
say system B was just bad
(diagram: on the same grid, B sits below both A and C – B < A, B < C – and the domain-specific/async cell is empty)
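To see how one weak program can masquerade as two ‘effects’, here is a sketch with invented scores: neither domain-specificity nor asynchrony matters in this data, yet both comparisons come out the published way simply because each pits system B against the others.

    # Invented scores: B is simply a weaker program; the factors do nothing.
    systems = {
        "A": {"domain": "specific", "timing": "sync",  "score": 7.2},
        "B": {"domain": "generic",  "timing": "sync",  "score": 4.1},
        "C": {"domain": "generic",  "timing": "async", "score": 7.0},
    }

    def mean_score(**where):
        vals = [s["score"] for s in systems.values()
                if all(s[k] == v for k, v in where.items())]
        return sum(vals) / len(vals)

    print("domain specific:", mean_score(domain="specific"))  # A only -> 7.2
    print("generic:        ", mean_score(domain="generic"))   # B and C -> 5.55
    print("synchronous:    ", mean_score(timing="sync"))      # A and B -> 5.65
    print("asynchronous:   ", mean_score(timing="async"))     # C only  -> 7.0
    # Both contrasts favour whichever cells avoid B: the 'effect' is system B,
    # not domain-specificity or asynchrony.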
what went wrong?
borrowed psych method
… but method embodies assumptions
single simple cause, controlled environment
interaction needs ecologically valid experiment
multiple causes, open situations
what to do?
understand assumptions and modify
Validation and mechanism: exploring the limits of evaluation
diversity – individual/task
‘good for’, not just ‘good’
don’t just look at average!
e.g. overall system A lower error rate than system B
but … system B better for experts
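A tiny worked example (numbers invented) shows how an overall average can hide this: A wins on the headline figure, yet B is the better system for experts.

    # Invented error counts per (system, user group): (errors, trials)
    errors = {
        "A": {"novice": (10, 100), "expert": (8, 100)},
        "B": {"novice": (18, 100), "expert": (4, 100)},
    }

    for system, groups in errors.items():
        overall = sum(e for e, _ in groups.values()) / sum(n for _, n in groups.values())
        by_group = {g: round(e / n, 2) for g, (e, n) in groups.items()}
        print(system, "overall:", overall, "by group:", by_group)
    # A: 9% overall (novice 10%, expert 8%); B: 11% overall (novice 18%, expert 4%)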
… and tasks too
e.g. PieTree
(interactive circular treemap)
exploding pie chart – good for finding large things
unfolding hierarchical text view – good for finding small things
more important to know
who or what
something is good for
Validation and mechanism: exploring the limits of evaluation
types of evaluation
you’ve designed it, but is it right?
points of comparison
• measures:
– average satisfaction 3.2 on a 5-point scale
– time to complete task in range 13.2–27.6 seconds
– good or bad?
• need a point of comparison
– but what?
– self, similar system, created or real??
– think purpose ...
• what constitutes a ‘control’
– think!!
types of knowledge
• descriptive
– explaining what happened
• predictive
– saying what will happen
cause → effect
– where science often ends
• synthetic
– working out what to do to make what you want happen
effect → cause
– design and engineering
different kinds of evaluation
endless arguments
– quantitative vs. qualitative
– in the lab vs. in the wild
– experts vs. real users (vs UG students!)
really
– combine methods
e.g. quantitative – what is true & qualitative – why
– what is appropriate and possible
when does it end?
in a world of perpetual beta ...
real use is the ultimate evaluation
• logging, bug reporting, etc.
• how do people really use the product?
• are some features never used?
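As a minimal sketch of what mining such logs can look like (the event format and feature names are hypothetical, standing in for whatever logging the product already produces):

    from collections import Counter

    # Hypothetical interaction log: each event records who used which feature.
    events = [
        {"user": "u1", "feature": "search"},
        {"user": "u1", "feature": "export"},
        {"user": "u2", "feature": "search"},
        {"user": "u3", "feature": "search"},
    ]
    all_features = {"search", "export", "batch_rename", "share"}

    usage = Counter(e["feature"] for e in events)
    print("usage counts:", dict(usage))
    print("never used:", sorted(all_features - set(usage)))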
Validation and mechanism: exploring the limits of evaluation