Demystifying p<.05: A Balanced Approach to Significance Testing (or Avoiding it Altogether) 📈🧐

Feeling shamed for not adhering to a p<.05 statistical significance rule in your UX research? Don’t.

The p<.05 standard is a benchmark popularized by Ronald Fisher in the 1920s, long before modern computers were available. Roughly speaking, the p-value is the probability of getting results at least as extreme as yours purely by random chance, that is, if there were no real difference. For example, if five users prefer Design A and four prefer Design B, would you be confident that the larger population prefers Design A? Of course not. Why? Because if you reran the test with new users, you could easily get four who prefer A and five who prefer B. There's no evidence that your designs differ in preference, because the likelihood of getting results like these by chance is high. In Fisher's world, the odds that the results occurred by random chance are "greater than 5%."
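For the curious, here's what that calculation looks like today: a minimal sketch using SciPy's exact binomial test on the 5-vs-4 split above (the specific code is my illustration, not from the original article):

```python
# Exact two-sided binomial test for the 5-vs-4 preference split:
# "If preference were really 50/50, how likely is a split at least this lopsided?"
from scipy.stats import binomtest

result = binomtest(k=5, n=9, p=0.5)  # 5 of 9 users preferred Design A
print(result.pvalue)                 # far above .05 -> no evidence of a real preference
```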

So, how did we end up with the .05 standard? Early statisticians thought it was reasonable, and scientific journals picked it up and made it gospel. Besides, calculating the exact probability of your results occurring by chance by hand could have taken months! So statisticians published tables of "critical values" that you could compare your test statistic against to see whether it fell over or under the value corresponding to 5%. For its time, it was a useful concept.

Old traditions die hard. Even though we can now calculate exact p-values in a flash, many people (and journals) still cling to the old p<.05 cutoff religiously.

But think about it. What if there’s a 6% chance that your results occurred by chance? Fisher would say your results are not statistically significant. But, if you’re in business, is there an appreciable difference in your decision-making when the chance of a false positive is 6% versus 5%? What about 9%?

The answer, of course, is "It depends." What are the costs of a false positive? If they are relatively small, p<.20 might be a reasonable standard. If they are life and death, p<.05 seems woefully inadequate. Would you take an experimental treatment if there were "only" a 5% chance it would kill you?

Statistical significance testing also often ignores the importance of "effect size." Let's say you have a very large sample, and your new design is preferred over the old one with a statistical significance level of p<.01. Great, right? Fisher would be proud. Now let's say the mean preference on a 1-10 scale is 7.6 for the new design and 7.5 for the old one. It's a reliable, statistically significant difference that would almost certainly replicate time and time again. But is it worth implementing given the associated costs? No, because the effect, although statistically significant, is too small to matter in practice.
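To make that concrete, here's a minimal simulation sketch (the sample size, spread, and library calls are illustrative assumptions, not figures from the article) showing how a 7.6-vs-7.5 difference can be highly "significant" while the effect size stays negligible:

```python
# Simulated 1-10 preference ratings: a trivial 0.1-point difference
# becomes "statistically significant" once the sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000                                    # very large sample per design
old = rng.normal(loc=7.5, scale=1.5, size=n)   # old design ratings
new = rng.normal(loc=7.6, scale=1.5, size=n)   # new design ratings

t_stat, p_value = stats.ttest_ind(new, old)

# Effect size (Cohen's d): mean difference relative to the pooled spread
pooled_sd = np.sqrt((old.var(ddof=1) + new.var(ddof=1)) / 2)
cohens_d = (new.mean() - old.mean()) / pooled_sd

print(f"p-value:   {p_value:.2e}")    # tiny -> "significant" by any conventional cutoff
print(f"Cohen's d: {cohens_d:.3f}")   # ~0.07 -> a negligible effect in practice
```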

Is there a better way? Enter Bayesian analysis. Bayesian methods shift the focus from rigid, binary "significant or not" decisions to probabilistic reasoning. Think of it as a nuanced conversation with your data. Instead of asking, "Is this result statistically significant at the p<.05 level?" Bayesian analysis prompts a more relevant question: "Given the data and our prior knowledge, what is the probability that one design is genuinely better than the other?" This approach is particularly advantageous in the complex or uncertain scenarios common in UX research. It allows you to incorporate prior knowledge and expertise into the analysis, yielding insights that are contextually richer and often more directly applicable to business decisions.
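As a rough illustration of that question, here's a minimal sketch that revisits the earlier 5-vs-4 preference split, assuming a flat Beta(1,1) prior (the simplest possible choice; a real analysis could encode genuine prior knowledge instead):

```python
# Minimal Bayesian sketch: given 5 of 9 users preferring Design A,
# how plausible is it that the wider population truly prefers A?
import numpy as np

rng = np.random.default_rng(0)

prefers_a, n = 5, 9   # the 5-vs-4 split from the example above

# Posterior for the preference rate under a flat Beta(1, 1) prior:
# Beta(1 + successes, 1 + failures)
samples = rng.beta(1 + prefers_a, 1 + (n - prefers_a), size=200_000)

prob_a_preferred = (samples > 0.5).mean()
print(f"P(majority truly prefers A) ≈ {prob_a_preferred:.2f}")  # ≈ 0.62, hardly decisive
```

The output reads directly as a business statement, "there's roughly a 60% chance the population prefers A," rather than a pass/fail verdict against an arbitrary cutoff.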

Let's be clear: advocating for a more nuanced approach than the p<.05 standard is not a call to abandon hypothesis evaluation; far from it. Statistical analysis remains a cornerstone of robust UX research. But it's time to rethink our adherence to the p<.05 dogma in UX research and embrace a more flexible, nuanced approach.

It's crucial to consider the real-world implications of our findings, the magnitude of effect sizes, and the consequences and practicality of decision-making thresholds. With their probabilistic and contextual richness, Bayesian methods offer a compelling alternative. So, let's break free from the shackles of p<.05 and step into a more informed and adaptable era of data analysis, where the true goal is insightful, actionable conclusions, not just statistical victories.

#UXResearchInsights #BeyondP05 #StatisticalSignificance #BayesianAnalysisUX #DataDrivenDesign

#RethinkStatistics

Paula Bach

Principal Director, Product Research, Microsoft Azure Data and Fabric

Joshua Noble - Bayesian!

Aaron Mooney

EHS Professional with experience in Statistics, Analytics, and System Design

Krystal Cooper

AI Researcher | Creative Engineer | Content Creator | #GHC25 #specsquad | Startup Advisor

What will it take to change the tide? p<.05 feels like the UX version of developers ending up debating whether something is deterministic or probabilistic for every complex code challenge. So many other things can affect variance and the variables involved, and they are worthy of discussion.

Karla H.R

Chevening scholar at LSE's MSc Management of Information Systems and Digital Innovation

Thanks for the article! John Neuhoff, back in the day practical significance was a real eye-opener for me. It's crucial to communicate this understanding when presenting outcomes to stakeholders, because they are likely to think in terms of p<.05.
