Summer PIT 2019
Validation and Mechanism
exploring the limits of evaluation
Alan Dix
http://alandix.com/academic/talks/PIT-2019-validation-and-mechanism/
Tiree
Tiree Tech Wave
3-7 October
Computational Foundry
Swansea University
the foundry – building, mission, community
Validation and mechanism: exploring the limits of evaluation
what is evaluation?
easy and hard questions
what does it mean anyway?
easy questions
How fast do people recognise a menu option?
Is product A easier to learn than product B?
even then …
individuals or average
better for whom?
WEIRD …
WEIRD people
Henrich, J., Heine, S. and Norenzayan, A. (2010). The weirdest people in the world?
Behavioral and Brain Sciences, 33(2–3), 61–83; discussion 83–135.
doi:10.1017/S0140525X0999152X
Western, Educated, Industrialized,
Rich, and Democratic
Harder questions
Subjective experience (UX, fun)
Long-term interactions
e.g. meetings
Long-term effects
e.g. education, sustainability, behaviour change
What do we mean by evaluation?
often
post-hoc empirical study/experiment
… but why?
what is it for?
why are you doing it?
exploration vs. validation
process vs. product
research
– exploration: finding questions
– validation: answering them
– explanation: finding why and how
methods and data span ethnography, in-depth interviews and detailed observation,
through big data, experiments and large-scale surveys –
qualitative data, quantitative data, theoretical models, mechanism
development: design – build – test
formative (process): make it better
summative (product): does it work?
purpose
two – no, three – types of evaluation
purpose / stage:
– formative: improve a design (development)
– summative: say “this is good” (contractual/sales)
– investigative: gain understanding (research, user research, big changes)
exploration / formative
– find any interesting issues
– statistics are about deciding priorities
validation / summative
– exhaustive: find all problems/issues
– verifying: is hypothesis true, does system work
– mensuration: how good, how prevalent
explanation / investigative
– matching qualitative/quantitative, small/large samples
are five users enough?
original work
Nielsen & Landauer (1993) about iterative process
not summative – not for stats!
how many?
to find enough to do in next development cycle
depends on size of project and complexity
nowadays, with cheap development, maybe n=1
but always more in next cycle
N.B. later work
on saturation
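For those who want the arithmetic behind the ‘five users’ claim, a minimal sketch of the Nielsen & Landauer problem-discovery model follows; λ ≈ 0.31 is the average they report, but it varies a lot between projects, so the percentages are illustrative only.

    # Sketch of the Nielsen & Landauer (1993) problem-discovery curve:
    # found(n) = N * (1 - (1 - lam)**n), where lam is the chance that a single
    # test user exposes any given problem. lam = 0.31 is their reported average
    # - an assumption here, not a property of your project.
    def problems_found(n_users, n_problems=100, lam=0.31):
        """Expected number of distinct problems found after n_users tests."""
        return n_problems * (1 - (1 - lam) ** n_users)

    for n in (1, 3, 5, 10, 15):
        print(f"{n:2d} users -> roughly {problems_found(n):.0f}% of problems")

With those assumptions five users surface around 84% of the problems – usually more than enough to fill the next development cycle, which is the formative purpose described above.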
Validation and mechanism: exploring the limits of evaluation
from evaluation
to validation
dealing with
harder questions
validating work
• justification
– expert opinion
– previous research
– new experiments
• evaluation
– experiments
– user studies
– peer review
your work
generative artefacts
• toolkits
• devices
• interfaces
• guidelines
• methodologies
evaluation of the artefact is a singularity –
but generative artefacts are used by different people, in different situations, by different designers
(pure) evaluation of generative artefacts
is methodologically unsound
your work – validated by justification (expert opinion, previous research, new experiments)
and evaluation (experiments, user studies, peer review)
justification vs. validation
• different disciplines
– mathematics: proof = justification
– medicine: drug trials = evaluation
• combine them:
– look for weakness in justification
– focus evaluation there
Validation and mechanism: exploring the limits of evaluation
mechanism
from what happens
to how and why
mechanism
phenomena – quantitative and statistical: what is true, end to end
mechanism – qualitative and theoretical: why and how
generalisation
empirical data
at best interpolate
understanding mechanism allows:
extrapolation
application in new contexts
mechanism
• reduction reconstruction
– formal hypothesis testing
+ may be qualitative too
– more scientific precision
• wholistic analytic
– field studies, ethnographies
+ ‘end to end’ experiments
– more ecological validity
example: mobile font size
early paper on fonts in mobile menus:
well conducted experiment
statistically significant results
conclusion gives best font size
but … a menu selection task includes:
1. visual search (better big fonts)
2. if not found scroll/page display (better small fonts)
3. when found touch target (better big fonts)
no single best size – the balance depends on menu length, etc.
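A toy cost model makes the trade-off concrete; every constant below is invented for illustration (nothing is taken from the paper), and the only point is that the optimum font size shifts with menu length instead of being a single number.

    import math

    # Toy model - all constants are made up for illustration only.
    def selection_time(font_pt, menu_items, screen_pt=200):
        items_per_screen = max(1, screen_pt // font_pt)
        screens = math.ceil(menu_items / items_per_screen)
        search = (menu_items / 2) * (0.25 + 0.8 / font_pt)  # quicker with bigger fonts
        scroll = 0.6 * (screens - 1)                        # worse with bigger fonts
        touch = 3.0 / font_pt                               # easier with bigger fonts
        return search + scroll + touch

    for n in (8, 30, 100):
        best = min(range(8, 31), key=lambda pt: selection_time(pt, n))
        print(f"{n:3d}-item menu -> lowest modelled time around {best}pt")

In this made-up model, short menus favour the largest font that still fits on one screen, while long menus push the optimum down because scrolling starts to dominate – exactly the kind of interaction a single ‘best size’ hides.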
Validation and mechanism: exploring the limits of evaluation
what have you
really shown?
stats are about the measure,
but what does it measure?
what have you really shown
• think about the conditions
– are there other explanations for data?
• individual or population
– small number of groups/individuals, many measurements (see the sketch after this list)
– significant statistics => effect reliable for each individual tested
– but are individuals representative of all?
• systems vs properties
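One way to see the individual-versus-population point is a small simulation (all numbers hypothetical): piling up trials makes each participant’s mean very precise, but with only three participants any claim about people in general still rests on a sample of three.

    import random, statistics

    # Hypothetical numbers: 3 participants, 200 trials each.
    random.seed(1)
    population_mean = 50    # 'true' average effect across people (ms)
    between_people_sd = 30  # how much individuals differ (ms)
    trial_sd = 80           # trial-to-trial noise within one person (ms)

    person_means = []
    for _ in range(3):
        person_true = random.gauss(population_mean, between_people_sd)
        trials = [random.gauss(person_true, trial_sd) for _ in range(200)]
        person_means.append(statistics.mean(trials))

    print("per-person means:", [round(m, 1) for m in person_means])
    print("spread between people:", round(statistics.stdev(person_means), 1))
    # 200 trials pin down each person's mean (SE ~ 80/sqrt(200) ~ 6 ms),
    # but generalising to other people rests on the n=3 sample of people.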
a little story …
BIG ACM conference – ‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
A – domain specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous
experiment
sensible quality measures
reasonable numbers of subjects in each condition
significant results p<0.05
domain spec. > generic
asynchronous > synchronous
conclusion: really want asynchronous domain specific
(diagram: the three systems on a domain-specific/generic × sync/async grid – A: domain specific, sync; B: generic, sync; C: generic, async)
what’s wrong with that?
interaction effects
gap is interesting to study
not necessarily good to implement
more important …
if you blinked at the wrong moment …
NOT independent variables
three different pieces of software
like experiment on 3 people!
say system B was just bad
(diagram: on the same grid, B sits below both A and C – B < A, B < C – and the domain-specific/async cell is empty)
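To see how one weak program can masquerade as two ‘effects’, here is a sketch with invented scores: neither domain-specificity nor asynchrony matters in this data, yet both comparisons come out the published way simply because each pits system B against the others.

    # Invented scores: B is simply a weaker program; the factors do nothing.
    systems = {
        "A": {"domain": "specific", "timing": "sync",  "score": 7.2},
        "B": {"domain": "generic",  "timing": "sync",  "score": 4.1},
        "C": {"domain": "generic",  "timing": "async", "score": 7.0},
    }

    def mean_score(**where):
        vals = [s["score"] for s in systems.values()
                if all(s[k] == v for k, v in where.items())]
        return sum(vals) / len(vals)

    print("domain specific:", mean_score(domain="specific"))  # A only -> 7.2
    print("generic:        ", mean_score(domain="generic"))   # B and C -> 5.55
    print("synchronous:    ", mean_score(timing="sync"))      # A and B -> 5.65
    print("asynchronous:   ", mean_score(timing="async"))     # C only  -> 7.0
    # Both contrasts favour whichever cells avoid B: the 'effect' is system B,
    # not domain-specificity or asynchrony.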
what went wrong?
borrowed psych method
… but method embodies assumptions
single simple cause, controlled environment
interaction needs ecologically valid experiment
multiple causes, open situations
what to do?
understand assumptions and modify
Validation and mechanism: exploring the limits of evaluation
diversity – individual/task
‘good for’, not just ‘good’
don’t just look at average!
e.g. overall system A lower error rate than system B
but … system B better for experts
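A tiny worked example (numbers invented) shows how an overall average can hide this: A wins on the headline figure, yet B is the better system for experts.

    # Invented error counts per (system, user group): (errors, trials)
    errors = {
        "A": {"novice": (10, 100), "expert": (8, 100)},
        "B": {"novice": (18, 100), "expert": (4, 100)},
    }

    for system, groups in errors.items():
        overall = sum(e for e, _ in groups.values()) / sum(n for _, n in groups.values())
        by_group = {g: round(e / n, 2) for g, (e, n) in groups.items()}
        print(system, "overall:", overall, "by group:", by_group)
    # A: 9% overall (novice 10%, expert 8%); B: 11% overall (novice 18%, expert 4%)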
… and tasks too
e.g. PieTree
(interactive circular treemap)
exploding pie chart – good for finding large things
unfolding hierarchical text view – good for finding small things
more important to know
who or what
something is good for
Validation and mechanism: exploring the limits of evaluation
types of evaluation
you’ve designed it, but is it right?
points of comparison
• measures:
– average satisfaction 3.2 on a 5-point scale
– time to complete task in range 13.2–27.6 seconds
– good or bad?
• need a point of comparison
– but what?
– self, similar system, created or real??
– think purpose ...
• what constitutes a ‘control’
– think!!
types of knowledge
• descriptive
– explaining what happened
• predictive
– saying what will happen
cause → effect
– where science often ends
• synthetic
– working out what to do to make what you want happen
effect → cause
– design and engineering
different kinds of evaluation
endless arguments
– quantitative vs. qualitative
– in the lab vs. in the wild
– experts vs. real users (vs UG students!)
really
– combine methods
e.g. quantitative – what is true & qualitative – why
– what is appropriate and possible
when does it end?
in a world of perpetual beta ...
real use is the ultimate evaluation
• logging, bug reporting, etc.
• how do people really use the product?
• are some features never used?
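As a minimal sketch of what mining such logs can look like (the event format and feature names are hypothetical, standing in for whatever logging the product already produces):

    from collections import Counter

    # Hypothetical interaction log: each event records who used which feature.
    events = [
        {"user": "u1", "feature": "search"},
        {"user": "u1", "feature": "export"},
        {"user": "u2", "feature": "search"},
        {"user": "u3", "feature": "search"},
    ]
    all_features = {"search", "export", "batch_rename", "share"}

    usage = Counter(e["feature"] for e in events)
    print("usage counts:", dict(usage))
    print("never used:", sorted(all_features - set(usage)))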
Validation and mechanism: exploring the limits of evaluation