SlideShare a Scribd company logo
2
Most read
3
Most read
4
Most read
On the transformation of a response in regression modelling
and hypothesis testing
Adrian Olszewski
Originally posted at: Research Gate, March 21st
2020, URL: https://tinyurl.com/yyv2ryus
Updated and enhanced: November, 13th
2020
Find the most updated version here:
https://www.dropbox.com/s/62bh8cvbkjuu21n/data%20transformations.pdf?dl=0
I do suggest avoiding any variable transformations as much as possible, except the cases you
can thoroughly and convincingly justify the reason and explain the outcome. It applies
especially to Box-Cox.
1. It completely changes the formulation, and affects the interpretation. Only in "clean"
cases you will get interpretable outcome, like log-transformed data generated by
multiplicative process (not *any right skewed data*!). log, exp, reciprocal, square/cube
root, power of 2, 3 transformations may be meaningful in *special scenarios*, e.g.
velocity, area, volume, concentration, length (square root of area), but y^-0.67 doesn't
mean anything. And most of your audience will have no idea how your response
changes with the predictor unless you draw the curve. Easy for singe response, but if
you have more? You will need marginal effects to give some idea.
Sometimes you can decide to approximate the obtained coefficient with well-known
ones, e.g. 0.45 is close to 0.5 (square root), but it’s not easy in general. So why doing
that?
By transforming, you *force your variables to follow certain distribution* and to *tell
it your story*. For example, log-transformation assumes your data comes from log-
normal distribution. Just look what it does with the equation. As a consequence...
2. … It changes the model along with the errors! In our case - from additive to multiple
including errors. Maybe it's good maybe not, depends on your case.
3. It will also affect the variance along with means - many people blindly use
transformations completely forgetting about that! Well, it can be useful if we want to
stabilize variance, BUT it changes more! For example, in normal distribution mean and
variance are independent, in log-normal it's not! Of course this is an idealized case. Box-
Cox may return any weird coefficient - guess how will the model and the mean-variance
relationship change?
4. The Jensen's inequality says clearly, log(E(y)) ≠ E(log(y)) (except the identity link). By
running regression you are interested in modelling the conditional expectation of the
response, rather than response itself (transformed). And remember that no
transformation can handle certain response distribution properly, like counts (it makes
no sense, by the way).
5. In case of testing, it changes the null hypothesis, which is likely not the one you wanted
to assess anymore! In our case: it switches from testing the shift in arithmetic means to
the ratio of geometric means.
I can hear you: “I was told it leads to valid inference!”.
Yes, it leads to valid inference... of a hypothesis you did not started with initially (unless
you can justify that transition).
You obtain a valid answer to *unasked question*. And yes, log(y) may results may
differ from results returned by a model with log link (e.g. gamma regression). You will
have to decide which one to choose.
Sometimes there are industry guidelines, like those given by the FDA for clinical
biostatistics, which advises using log on PK data (for a good reason), but *even those
guidelines* warn you against unconditional and *unjustified* transformations!
6. Your back-transformed confidence intervals will be biased. Another disease to the
collection.
7. BoxCox and any other transformation does NOT guarantee the properties you need.
And then what are you going to do? Transform again the data, until satisfied? You know
this will only complicate already complicated situation? Not to mention that you may
turn your right skewed data into... left-skewed one and fall into more troubles.
I know there are many proponents of unconventional data transformation ("skewed data?  go
transform it!") on ResearchGate. They were taught this for tens of years. Moreover, some of
them were told by authorities to continue using it.
But in the light of the above arguments I collected I strongly suggest considering (practically
always better) alternatives.
Except the mentioned few scenarios, the transformations can cause more harm than good in
confirmatory and exploratory modelling.
Can it be useful? Yes, it may be OK in predictive modelling, especially if you agree on using a
“black box” approach, where you care mostly of the predicted outcomes and not the rest of the
story. If the predicted outcome agrees with the expectations – you are fine with that. Then –
it’s OK.
“OK, so what techniques and methods do you recommend instead”?
In the 21st century we have a plenty of models, estimation methods and other techniques (being
here for ca 50 years) allowing us to deal with certain violations of the assumptions (normality,
homoscedasticity, independence of observations and so on), including:
1) generalized models (GLM and GAM), like: gamma, beta, logistic (and probit),
fractional logit, Poisson, negative binomial, etc. regressions. Trucated (most of real
variables have truncated domain, keep it in mind!) and censored regression (e.g. tobit
model). I’m sure you will be able to find a tool suitable for you. Remember, that this
generalizes nicely to the mixed-effect models.
2) non-linear models
3) robust and non-parametric methods and tests (there are over 280 statistical tests! Lots
of them do not require or relax certain parametric assumptions, like Yuen, Brunner-
Munzel, ATS, WTS, ART ANOVA, Welch, Mann-Whitney/Wilcoxon, and many,
many more). At the end of this document I added an longer set of the literature that I
read and can wholeheartedly recommend it.
a. If you need ANOVA on non-normal or heterogeneous data, remember that you
can a run more advanced model (e.g. robust regression, quantile regression,
mixed models) or use GLS or GEE estimation and follow the modelling with a
set of LR (likelihood ratio) tests to mimic the type-3 ANOVA and get the main
and interaction effects!
Yes, you read well – that’s what the anova() (or car::Anova(… type=2/3)
function does in R when dealing with so many kinds models, performing the
assessment of reduction of the residual deviance. Which – in case of the simple
linear model – reduces to the analysis of certain contrasts, which is nothing but
comparing group means. See? The dots connect with themselves!
Yes, the outcome will be approximated (Chi2 rather than F), but hey – it’s still
a worthwhile and flexible solution!
4) quantile regression (which handles also mixed effects) – it’s one of the most powerful
method, requiring no distributional assumptions yet still offering good
interpretability!
5) Generalized Least Square and Generalized Estimation Equations estimation
6) Passing-Bablock and Deming regression
7) resampling (permutation/exact tests, approximate permutation tests, bootstrapped
interval estimation). Only remember those methods aren’t accepted by the regulators in
the Clinical Research industry when used to analyse the primary outcomes
8) In case of serious skewness you can also try adding categorical covariate(s) to your
model which may split your dataset into more homogeneous subgroups. Why? Because
the skewness often comes from mixed data coming from 2+ populations with different
variability.
Afterword
As always in statistics – there’s no easy solution to all cases. There are justified cases, where
the transformations are not only applicable, but also demanded by the regulations – see an
example here: FDA: Guidance for Industry - Statistical Approaches to Establishing
Bioequivalence
Or here: EMA - ICH Topic E 9 Statistical Principles for Clinical Trials, step 5
Also: “THE LOG TRANSFORMATION IS SPECIAL”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.530.9640&rep=rep1&type=pdf
Also, find my diagram (on DropBox) showing a few of the families of models (along with the
relationships) and estimation methods, that may be useful for you more than data
transformation:
https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf
A note on “what we call a regression” may also interest you, especially if you:
1) you advise people to transform their DV (response) with Box-Cox (or log) without a
thought on the consequences, to "achieve normality" in DV or residuals. It MAY be
OK when predicting with the "black-box" approach, but is NOT OK when you use a
model to explain / confirm relationships between variables.
2) you "chase" for normality of the raw DV (response) in the linear regression.
3) you believe that strongly skewed data cannot be modelled with the linear regression
4) …and vice versa - you overuse it to everything (counts, %, categorical data from
questionnaires, concentrations)
5) you say that linear model is named so as it "produces" straight line
6) you say that the logistic r. is "not a regression, because it models binary response"
7) you believe the "stepwise regression" is a regression
https://www.linkedin.com/posts/adrianolszewski_rockyourr-datascience-dataanalysis-activity-
6691521288101531648-jkuW
A few URL linking to discussions and resources worth reading:
1. Log-transformation and its implications for data analysis
2. GLM with a Gamma-distributed Dependent Variable (PDF)
3. CrossValidated: When to use gamma GLMs?
4. To transform or not to transform: using generalized linear mixed models to analyse
reaction time data
5. Stat 504 - Introduction to Generalized Linear Models
6. Do Not Log-Transform Count Data, Bitches!
7. Generalized linear models - An introduction by Christoph Scherber
8. https://www.theanalysisfactor.com/the-difference-between-link-functions-and-data-
transformations/
9. Notes on Transformations and Generalized Linear Models
10. Handling Skewed Data: A Comparison of Two Popular Methods
11. CrossValidated: Linear model with log-transformed response vs. generalized linear
model with log link
12. CrossValidated: How to decide which glm family to use?
13. CrossValidated: Family of GLM represents the distribution of the response variable or
residuals?
14. CrossValidated: Why is GLM different than an LM with transformed variable
15. CrossValidated: GLM vs square root data transformation
16. https://stats.idre.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-
terms-of-percent-change-in-linear-regression/
17. CrossValidated: How to interpret regression coefficients when response was
transformed by the 4th root?
18. CrossValidated: Express answers in terms of original units, in Box-Cox transformed
data
Books worth reading (yes, I read or “familiarized enough” with and use(d) at work to
recommend them):
1. Alan Agresti, Foundations of Linear and Generalized Linear Models
2. John Fox, Applied regression analysis and generalized linear model
3. Roger Koenker, Victor Chernozhukov, Xuming He, Limin Peng, Handbook of Quantile
Regression
4. Young, Derek S, Handbook of regression methods
5. Andreas Ziegler, Generalized Estimating Equations
6. Daryl S. Paulson, Handbook of Regression and Modeling Applications for the Clinical
and Pharmaceutical Industries
7. Myles Hollander, Douglas A. Wolfe, Eric Chicken, Nonparametric Statistical Methods
8. Jason C. Hsu, Multiple Comparisons, Theory and methods
9. Alex Dmitrienko, Ajit C. Tamhane, Frank Bretz, Multiple Testing Problems in
Pharmaceutical Statistics
10. Michael G. Akritas and Dimitris N. Politis, Recent Advances and Trends in
Nonparametric Statistics
11. W. J. Conover practical nonparametric statistics
+ some more literature about the modern and flexible non-parametric methods (there’s lots of
more beyond the Mann-Whitney-Wilcoxon, Friedman, Kruska-Wallis!), so you don’t have to
transform your data ;]
1. Erceg-Hurn, David & Mirosevich, Vikki. (2008). Modern Robust Statistical Methods
An Easy Way to Maximize the Accuracy and Power of Your Research. The American
psychologist. 63. 591-601. 10.1037/0003-066X.63.7.591.
https://www.researchgate.net/publication/23319441_Modern_Robust_Statistical_Meth
ods_An_Easy_Way_to_Maximize_the_Accuracy_and_Power_of_Your_Research
https://pdfs.semanticscholar.org/88cb/15520b2f84fd2a5a09e0341e791f40ab4118.pdf
2. Edgar Brunner, Madan L. Puri, Nonparametric Methods in Factorial Designs
https://www.researchgate.net/profile/Jos_Feys/post/What_statistical_tests_can_I_use_
to_compare_mean_values_for_my_study/attachment/59d6558b79197b80779acad7/A
S:526088510111744@1502440683536/download/Brunner.pdf
3. Brunner, E., & Puri, M. L. (1996). Nonparametric methods in design and analysis of
experiments. In Design and Analysis of Experiments (Vol. 13, pp. 631–703). Elsevier.
https://doi.org/https://doi.org/10.1016/S0169-7161(96)13021-2
4. Wobbrock, J.O., Findlater, L., Gergle, D. and Higgins, J.J. (2011). The Aligned Rank
Transform for nonparametric factorial analyses using only ANOVA procedures.
Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI
'11). Vancouver, British Columbia (May 7-12, 2011). New York: ACM Press, pp.
143-146. http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf
5. Christophe Leys, Sandy Schumann, A nonparametric method to analyze interactions:
The adjusted rank transform test http://cescup.ulb.be/wp-
content/uploads/2015/04/Leys_and_Schumann_nonparametric_interactions.pdf
6. Haiko Lüpsen, The Aligned Rank Transform and discrete Variables -a Warning
https://kups.ub.uni-koeln.de/7554/1/ART-discrete.pdf
7. Friedrich, S., Konietschke, F., & Pauly, M. (2017). GFD: An R Package for the
Analysis of General Factorial Designs. Journal of Statistical Software, 79(Code
Snippet 1), 1 - 18. doi:http://dx.doi.org/10.18637/jss.v079.c01
8. Kimihiro Noguchi, Yulia R. Gel, Edgar Brunner, Frank Konietschke,“nparLD: An R
Software Package for the Nonparametric Analysis of Longitudinal Data in Factorial
Experiments”
9. Edgar Brunner, Arne C. Bathke, Frank Konietschke, Rank and Pseudo-Rank
Procedures for Independent Observations in Factorial Designs: Using R and SAS,
Springer, 2019, ISBN: 303002914X, 9783030029142, page 134
https://books.google.pl/books?id=t9KiDwAAQBAJ&lpg=PA134&ots=_Jgi9Rt0Kz&h
l=pl&pg=PA134#v=onepage&q&f=false
10. Feys, Jos. "New Nonparametric Rank Tests for Interactions in Factorial Designs with
Repeated Measures." Journal of Modern Applied Statistical Methods 15.1 (2016): 78-
99. Web.
https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1924&context=jmasm
11. Friedrich, S., Konietschke, F., Pauly, M.(2017). GFD - An R-package for the Analysis
of GeneralFactorial Designs. Journal of Statistical Software, Code Snippets 79(1), 1–
18, doi:10.18637/jss.v079.c01.Pauly, M., Brunner, E., Konietschke, F.(2015).
Asymptotic Permutation Tests in General FactorialDesigns. Journal of the Royal
Statistical Society - Series B 77, 461-473
12. Akritas, M. G., & Politis, D. N. (2003). Recent Advances and Trends in
Nonparametric Statistics. Elsevier B.V. https://doi.org/10.1016/B978-0-444-51378-
6.X5000-5
13. Peterson, K.M. (2002). Six Modifications Of The Aligned Rank Transform Test For
Interaction.
https://pdfs.semanticscholar.org/ad4b/54e104acf7356b53c075e959ba8c24e23fea.pdf
14. Schacht, A., Bogaerts, K., Bluhmki, E., & Lesaffre, E. (2008). A new nonparametric
approach for baseline covariate adjustment for two-group comparative studies.
Biometrics, 64 4, 1110-6
15. Shah DA, Madden LV. Nonparametric analysis of ordinal data in designed factorial
experiments. Phytopathology. 2004;94(1):33-43. doi:10.1094/PHYTO.2004.94.1.33,
https://apsjournals.apsnet.org/doi/pdf/10.1094/PHYTO.2004.94.1.33
16. Versace, V., Schwenker, K., Langthaler, P. B., Golaszewski, S., Sebastianelli, L.,
Brigo, F., Pucks-Faes, E., Saltuari, L., & Nardone, R. (2020). Facilitation of Auditory
Comprehension After Theta Burst Stimulation of Wernicke's Area in Stroke Patients:
A Pilot Study. Frontiers in neurology, 10, 1319.
https://doi.org/10.3389/fneur.2019.01319,
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960103/
17. Prossegger, J., Huber, D., Grafetstätter, C., Pichler, C., Braunschmid, H., Weisböck-
Erdheim, R., & Hartl, A. (2019). Winter Exercise Reduces Allergic Airway
Inflammation: A Randomized Controlled Study. International journal of
environmental research and public health, 16(11), 2040.
https://doi.org/10.3390/ijerph16112040
18. Akritas, M.G. and E. Brunner. 1997. A unified approach to rank tests for mixed
models. Journal of Statistical Planning and Inference. 61:249–277.
19. Haiko Lüpsen, Anova with binary variables - Alternatives for a dangerous F-test (dac
lepszy citation)
20. Haiko Lüpsen, Comparison of nonparametric analysis of variance methods a Monte
Carlo study - Part A: Between subjects designs - A Vote for van der Waerden
+ my list of various non-parametric and robust alternatives to the classic n-way ANOVA:
https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way-
ANOVA-in-biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz
2020-11-13 (Friday ), LinkedIn, Adrian Olszewski

More Related Content

PPTX
Types of graphs
PPT
Measures of Variation
PPTX
Basic stat analysis using excel
PPTX
Analysis of variance
PPTX
Introduction To SPSS
PPTX
Graphical representation of data
PDF
Confidence Intervals: Basic concepts and overview
PPT
BIAS AND CHANCE AND CONFOUNDING
Types of graphs
Measures of Variation
Basic stat analysis using excel
Analysis of variance
Introduction To SPSS
Graphical representation of data
Confidence Intervals: Basic concepts and overview
BIAS AND CHANCE AND CONFOUNDING

What's hot (20)

PPT
Data Analysis with SPSS : One-way ANOVA
PPTX
pie chart.pptx
PPTX
box plot or whisker plot
PPTX
Scatter Plot.pptx
PPTX
SPSS How to use Spss software
PDF
Data Analysis with SPSS PPT.pdf
PPT
Basic guide to SPSS
DOCX
Descriptive Statistics - SPSS
PPTX
Boxplot
PPTX
"A basic guide to SPSS"
PPTX
Inferential statistics
PDF
Choosing a statistical test
PPTX
Analysis of data in research
PPTX
Spss by vijay ambast
PDF
Regression analysis
PPTX
PPTX
Understanding statistics in research
PDF
Categorical data analysis
PPTX
Statistical test
PPTX
Data Analysis with SPSS : One-way ANOVA
pie chart.pptx
box plot or whisker plot
Scatter Plot.pptx
SPSS How to use Spss software
Data Analysis with SPSS PPT.pdf
Basic guide to SPSS
Descriptive Statistics - SPSS
Boxplot
"A basic guide to SPSS"
Inferential statistics
Choosing a statistical test
Analysis of data in research
Spss by vijay ambast
Regression analysis
Understanding statistics in research
Categorical data analysis
Statistical test
Ad

Similar to Why are data transformations a bad choice in statistics (20)

PDF
Data analysis01 singlevariable
PDF
M08 BiasVarianceTradeoff
PDF
4_5_Model Interpretation and diagnostics part 4.pdf
PDF
StatsModelling
PDF
deep larning
PDF
Analysing The Data
PDF
Robust Estimation And Hypothesis Testing New Age Moti L Tiku
PDF
Are Evolutionary Algorithms Required to Solve Sudoku Problems
DOCX
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
PPT
Analyzing Performance Test Data
PDF
Algo sobre cladista to read
DOC
Poor man's missing value imputation
PDF
copy for Gary Chin.
PPTX
BASIC MATH PROBLEMS IN STATISCTICSS.pptx
PPTX
Project Analytics
PDF
Real Estate Data Set
DOCX
A researcher in attempting to run a regression model noticed a neg.docx
PDF
Factor analysis using spss 2005
PPT
Pentaho Meeting 2008 - Statistics & BI
PDF
0 Model Interpretation setting.pdf
Data analysis01 singlevariable
M08 BiasVarianceTradeoff
4_5_Model Interpretation and diagnostics part 4.pdf
StatsModelling
deep larning
Analysing The Data
Robust Estimation And Hypothesis Testing New Age Moti L Tiku
Are Evolutionary Algorithms Required to Solve Sudoku Problems
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
Analyzing Performance Test Data
Algo sobre cladista to read
Poor man's missing value imputation
copy for Gary Chin.
BASIC MATH PROBLEMS IN STATISCTICSS.pptx
Project Analytics
Real Estate Data Set
A researcher in attempting to run a regression model noticed a neg.docx
Factor analysis using spss 2005
Pentaho Meeting 2008 - Statistics & BI
0 Model Interpretation setting.pdf
Ad

More from Adrian Olszewski (12)

PDF
Challenging the cult of the normal distribution in science
PDF
Logistic regression vs. logistic classifier. History of the confusion and the...
PPTX
Logistic regression - one of the key regression tools in experimental research
PPTX
Meet a 100% R-based CRO - The summary of a 5-year journey
PDF
SAS and R Team in Clinical Research, EPC 11-2016 p18-21.pdf
PPTX
Meet a 100% R-based CRO. The summary of a 5-year journey
PDF
Flextable and Officer
PDF
Modern statistical techniques
PDF
Dealing with outliers in Clinical Research
PPTX
The use of R statistical package in controlled infrastructure. The case of Cl...
PDF
Rcommander - a menu-driven GUI for R
PDF
GNU R in Clinical Research and Evidence-Based Medicine
Challenging the cult of the normal distribution in science
Logistic regression vs. logistic classifier. History of the confusion and the...
Logistic regression - one of the key regression tools in experimental research
Meet a 100% R-based CRO - The summary of a 5-year journey
SAS and R Team in Clinical Research, EPC 11-2016 p18-21.pdf
Meet a 100% R-based CRO. The summary of a 5-year journey
Flextable and Officer
Modern statistical techniques
Dealing with outliers in Clinical Research
The use of R statistical package in controlled infrastructure. The case of Cl...
Rcommander - a menu-driven GUI for R
GNU R in Clinical Research and Evidence-Based Medicine

Recently uploaded (20)

PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
Pre independence Education in Inndia.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Insiders guide to clinical Medicine.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PPTX
master seminar digital applications in india
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
Cell Types and Its function , kingdom of life
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
Cell Structure & Organelles in detailed.
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Pre independence Education in Inndia.pdf
Renaissance Architecture: A Journey from Faith to Humanism
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Insiders guide to clinical Medicine.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Sports Quiz easy sports quiz sports quiz
Microbial diseases, their pathogenesis and prophylaxis
master seminar digital applications in india
O7-L3 Supply Chain Operations - ICLT Program
O5-L3 Freight Transport Ops (International) V1.pdf
Anesthesia in Laparoscopic Surgery in India
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Cell Types and Its function , kingdom of life
GDM (1) (1).pptx small presentation for students
Cell Structure & Organelles in detailed.
Microbial disease of the cardiovascular and lymphatic systems
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx

Why are data transformations a bad choice in statistics

  • 1. On the transformation of a response in regression modelling and hypothesis testing Adrian Olszewski Originally posted at: Research Gate, March 21st 2020, URL: https://tinyurl.com/yyv2ryus Updated and enhanced: November, 13th 2020 Find the most updated version here: https://www.dropbox.com/s/62bh8cvbkjuu21n/data%20transformations.pdf?dl=0 I do suggest avoiding any variable transformations as much as possible, except the cases you can thoroughly and convincingly justify the reason and explain the outcome. It applies especially to Box-Cox. 1. It completely changes the formulation, and affects the interpretation. Only in "clean" cases you will get interpretable outcome, like log-transformed data generated by multiplicative process (not *any right skewed data*!). log, exp, reciprocal, square/cube root, power of 2, 3 transformations may be meaningful in *special scenarios*, e.g. velocity, area, volume, concentration, length (square root of area), but y^-0.67 doesn't mean anything. And most of your audience will have no idea how your response changes with the predictor unless you draw the curve. Easy for singe response, but if you have more? You will need marginal effects to give some idea. Sometimes you can decide to approximate the obtained coefficient with well-known ones, e.g. 0.45 is close to 0.5 (square root), but it’s not easy in general. So why doing that? By transforming, you *force your variables to follow certain distribution* and to *tell it your story*. For example, log-transformation assumes your data comes from log- normal distribution. Just look what it does with the equation. As a consequence... 2. … It changes the model along with the errors! In our case - from additive to multiple including errors. Maybe it's good maybe not, depends on your case. 3. It will also affect the variance along with means - many people blindly use transformations completely forgetting about that! Well, it can be useful if we want to stabilize variance, BUT it changes more! For example, in normal distribution mean and variance are independent, in log-normal it's not! Of course this is an idealized case. Box- Cox may return any weird coefficient - guess how will the model and the mean-variance relationship change? 4. The Jensen's inequality says clearly, log(E(y)) ≠ E(log(y)) (except the identity link). By running regression you are interested in modelling the conditional expectation of the response, rather than response itself (transformed). And remember that no transformation can handle certain response distribution properly, like counts (it makes no sense, by the way).
  • 2. 5. In case of testing, it changes the null hypothesis, which is likely not the one you wanted to assess anymore! In our case: it switches from testing the shift in arithmetic means to the ratio of geometric means. I can hear you: “I was told it leads to valid inference!”. Yes, it leads to valid inference... of a hypothesis you did not started with initially (unless you can justify that transition). You obtain a valid answer to *unasked question*. And yes, log(y) may results may differ from results returned by a model with log link (e.g. gamma regression). You will have to decide which one to choose. Sometimes there are industry guidelines, like those given by the FDA for clinical biostatistics, which advises using log on PK data (for a good reason), but *even those guidelines* warn you against unconditional and *unjustified* transformations! 6. Your back-transformed confidence intervals will be biased. Another disease to the collection. 7. BoxCox and any other transformation does NOT guarantee the properties you need. And then what are you going to do? Transform again the data, until satisfied? You know this will only complicate already complicated situation? Not to mention that you may turn your right skewed data into... left-skewed one and fall into more troubles. I know there are many proponents of unconventional data transformation ("skewed data?  go transform it!") on ResearchGate. They were taught this for tens of years. Moreover, some of them were told by authorities to continue using it. But in the light of the above arguments I collected I strongly suggest considering (practically always better) alternatives. Except the mentioned few scenarios, the transformations can cause more harm than good in confirmatory and exploratory modelling. Can it be useful? Yes, it may be OK in predictive modelling, especially if you agree on using a “black box” approach, where you care mostly of the predicted outcomes and not the rest of the story. If the predicted outcome agrees with the expectations – you are fine with that. Then – it’s OK. “OK, so what techniques and methods do you recommend instead”? In the 21st century we have a plenty of models, estimation methods and other techniques (being here for ca 50 years) allowing us to deal with certain violations of the assumptions (normality, homoscedasticity, independence of observations and so on), including: 1) generalized models (GLM and GAM), like: gamma, beta, logistic (and probit), fractional logit, Poisson, negative binomial, etc. regressions. Trucated (most of real variables have truncated domain, keep it in mind!) and censored regression (e.g. tobit model). I’m sure you will be able to find a tool suitable for you. Remember, that this generalizes nicely to the mixed-effect models.
  • 3. 2) non-linear models 3) robust and non-parametric methods and tests (there are over 280 statistical tests! Lots of them do not require or relax certain parametric assumptions, like Yuen, Brunner- Munzel, ATS, WTS, ART ANOVA, Welch, Mann-Whitney/Wilcoxon, and many, many more). At the end of this document I added an longer set of the literature that I read and can wholeheartedly recommend it. a. If you need ANOVA on non-normal or heterogeneous data, remember that you can a run more advanced model (e.g. robust regression, quantile regression, mixed models) or use GLS or GEE estimation and follow the modelling with a set of LR (likelihood ratio) tests to mimic the type-3 ANOVA and get the main and interaction effects! Yes, you read well – that’s what the anova() (or car::Anova(… type=2/3) function does in R when dealing with so many kinds models, performing the assessment of reduction of the residual deviance. Which – in case of the simple linear model – reduces to the analysis of certain contrasts, which is nothing but comparing group means. See? The dots connect with themselves! Yes, the outcome will be approximated (Chi2 rather than F), but hey – it’s still a worthwhile and flexible solution! 4) quantile regression (which handles also mixed effects) – it’s one of the most powerful method, requiring no distributional assumptions yet still offering good interpretability! 5) Generalized Least Square and Generalized Estimation Equations estimation 6) Passing-Bablock and Deming regression 7) resampling (permutation/exact tests, approximate permutation tests, bootstrapped interval estimation). Only remember those methods aren’t accepted by the regulators in the Clinical Research industry when used to analyse the primary outcomes 8) In case of serious skewness you can also try adding categorical covariate(s) to your model which may split your dataset into more homogeneous subgroups. Why? Because the skewness often comes from mixed data coming from 2+ populations with different variability. Afterword As always in statistics – there’s no easy solution to all cases. There are justified cases, where the transformations are not only applicable, but also demanded by the regulations – see an example here: FDA: Guidance for Industry - Statistical Approaches to Establishing Bioequivalence Or here: EMA - ICH Topic E 9 Statistical Principles for Clinical Trials, step 5
  • 4. Also: “THE LOG TRANSFORMATION IS SPECIAL” http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.530.9640&rep=rep1&type=pdf Also, find my diagram (on DropBox) showing a few of the families of models (along with the relationships) and estimation methods, that may be useful for you more than data transformation: https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf
  • 5. A note on “what we call a regression” may also interest you, especially if you: 1) you advise people to transform their DV (response) with Box-Cox (or log) without a thought on the consequences, to "achieve normality" in DV or residuals. It MAY be OK when predicting with the "black-box" approach, but is NOT OK when you use a model to explain / confirm relationships between variables. 2) you "chase" for normality of the raw DV (response) in the linear regression. 3) you believe that strongly skewed data cannot be modelled with the linear regression 4) …and vice versa - you overuse it to everything (counts, %, categorical data from questionnaires, concentrations) 5) you say that linear model is named so as it "produces" straight line 6) you say that the logistic r. is "not a regression, because it models binary response" 7) you believe the "stepwise regression" is a regression https://www.linkedin.com/posts/adrianolszewski_rockyourr-datascience-dataanalysis-activity- 6691521288101531648-jkuW A few URL linking to discussions and resources worth reading: 1. Log-transformation and its implications for data analysis 2. GLM with a Gamma-distributed Dependent Variable (PDF) 3. CrossValidated: When to use gamma GLMs? 4. To transform or not to transform: using generalized linear mixed models to analyse reaction time data 5. Stat 504 - Introduction to Generalized Linear Models 6. Do Not Log-Transform Count Data, Bitches! 7. Generalized linear models - An introduction by Christoph Scherber 8. https://www.theanalysisfactor.com/the-difference-between-link-functions-and-data- transformations/ 9. Notes on Transformations and Generalized Linear Models 10. Handling Skewed Data: A Comparison of Two Popular Methods
  • 6. 11. CrossValidated: Linear model with log-transformed response vs. generalized linear model with log link 12. CrossValidated: How to decide which glm family to use? 13. CrossValidated: Family of GLM represents the distribution of the response variable or residuals? 14. CrossValidated: Why is GLM different than an LM with transformed variable 15. CrossValidated: GLM vs square root data transformation 16. https://stats.idre.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in- terms-of-percent-change-in-linear-regression/ 17. CrossValidated: How to interpret regression coefficients when response was transformed by the 4th root? 18. CrossValidated: Express answers in terms of original units, in Box-Cox transformed data Books worth reading (yes, I read or “familiarized enough” with and use(d) at work to recommend them): 1. Alan Agresti, Foundations of Linear and Generalized Linear Models 2. John Fox, Applied regression analysis and generalized linear model 3. Roger Koenker, Victor Chernozhukov, Xuming He, Limin Peng, Handbook of Quantile Regression 4. Young, Derek S, Handbook of regression methods 5. Andreas Ziegler, Generalized Estimating Equations 6. Daryl S. Paulson, Handbook of Regression and Modeling Applications for the Clinical and Pharmaceutical Industries 7. Myles Hollander, Douglas A. Wolfe, Eric Chicken, Nonparametric Statistical Methods 8. Jason C. Hsu, Multiple Comparisons, Theory and methods 9. Alex Dmitrienko, Ajit C. Tamhane, Frank Bretz, Multiple Testing Problems in Pharmaceutical Statistics 10. Michael G. Akritas and Dimitris N. Politis, Recent Advances and Trends in Nonparametric Statistics 11. W. J. Conover practical nonparametric statistics + some more literature about the modern and flexible non-parametric methods (there’s lots of more beyond the Mann-Whitney-Wilcoxon, Friedman, Kruska-Wallis!), so you don’t have to transform your data ;] 1. Erceg-Hurn, David & Mirosevich, Vikki. (2008). Modern Robust Statistical Methods An Easy Way to Maximize the Accuracy and Power of Your Research. The American psychologist. 63. 591-601. 10.1037/0003-066X.63.7.591. https://www.researchgate.net/publication/23319441_Modern_Robust_Statistical_Meth ods_An_Easy_Way_to_Maximize_the_Accuracy_and_Power_of_Your_Research https://pdfs.semanticscholar.org/88cb/15520b2f84fd2a5a09e0341e791f40ab4118.pdf 2. Edgar Brunner, Madan L. Puri, Nonparametric Methods in Factorial Designs https://www.researchgate.net/profile/Jos_Feys/post/What_statistical_tests_can_I_use_
  • 7. to_compare_mean_values_for_my_study/attachment/59d6558b79197b80779acad7/A S:526088510111744@1502440683536/download/Brunner.pdf 3. Brunner, E., & Puri, M. L. (1996). Nonparametric methods in design and analysis of experiments. In Design and Analysis of Experiments (Vol. 13, pp. 631–703). Elsevier. https://doi.org/https://doi.org/10.1016/S0169-7161(96)13021-2 4. Wobbrock, J.O., Findlater, L., Gergle, D. and Higgins, J.J. (2011). The Aligned Rank Transform for nonparametric factorial analyses using only ANOVA procedures. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '11). Vancouver, British Columbia (May 7-12, 2011). New York: ACM Press, pp. 143-146. http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf 5. Christophe Leys, Sandy Schumann, A nonparametric method to analyze interactions: The adjusted rank transform test http://cescup.ulb.be/wp- content/uploads/2015/04/Leys_and_Schumann_nonparametric_interactions.pdf 6. Haiko Lüpsen, The Aligned Rank Transform and discrete Variables -a Warning https://kups.ub.uni-koeln.de/7554/1/ART-discrete.pdf 7. Friedrich, S., Konietschke, F., & Pauly, M. (2017). GFD: An R Package for the Analysis of General Factorial Designs. Journal of Statistical Software, 79(Code Snippet 1), 1 - 18. doi:http://dx.doi.org/10.18637/jss.v079.c01 8. Kimihiro Noguchi, Yulia R. Gel, Edgar Brunner, Frank Konietschke,“nparLD: An R Software Package for the Nonparametric Analysis of Longitudinal Data in Factorial Experiments” 9. Edgar Brunner, Arne C. Bathke, Frank Konietschke, Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs: Using R and SAS, Springer, 2019, ISBN: 303002914X, 9783030029142, page 134 https://books.google.pl/books?id=t9KiDwAAQBAJ&lpg=PA134&ots=_Jgi9Rt0Kz&h l=pl&pg=PA134#v=onepage&q&f=false 10. Feys, Jos. "New Nonparametric Rank Tests for Interactions in Factorial Designs with Repeated Measures." Journal of Modern Applied Statistical Methods 15.1 (2016): 78- 99. Web. https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1924&context=jmasm 11. Friedrich, S., Konietschke, F., Pauly, M.(2017). GFD - An R-package for the Analysis of GeneralFactorial Designs. Journal of Statistical Software, Code Snippets 79(1), 1– 18, doi:10.18637/jss.v079.c01.Pauly, M., Brunner, E., Konietschke, F.(2015). Asymptotic Permutation Tests in General FactorialDesigns. Journal of the Royal Statistical Society - Series B 77, 461-473 12. Akritas, M. G., & Politis, D. N. (2003). Recent Advances and Trends in Nonparametric Statistics. Elsevier B.V. https://doi.org/10.1016/B978-0-444-51378- 6.X5000-5
  • 8. 13. Peterson, K.M. (2002). Six Modifications Of The Aligned Rank Transform Test For Interaction. https://pdfs.semanticscholar.org/ad4b/54e104acf7356b53c075e959ba8c24e23fea.pdf 14. Schacht, A., Bogaerts, K., Bluhmki, E., & Lesaffre, E. (2008). A new nonparametric approach for baseline covariate adjustment for two-group comparative studies. Biometrics, 64 4, 1110-6 15. Shah DA, Madden LV. Nonparametric analysis of ordinal data in designed factorial experiments. Phytopathology. 2004;94(1):33-43. doi:10.1094/PHYTO.2004.94.1.33, https://apsjournals.apsnet.org/doi/pdf/10.1094/PHYTO.2004.94.1.33 16. Versace, V., Schwenker, K., Langthaler, P. B., Golaszewski, S., Sebastianelli, L., Brigo, F., Pucks-Faes, E., Saltuari, L., & Nardone, R. (2020). Facilitation of Auditory Comprehension After Theta Burst Stimulation of Wernicke's Area in Stroke Patients: A Pilot Study. Frontiers in neurology, 10, 1319. https://doi.org/10.3389/fneur.2019.01319, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960103/ 17. Prossegger, J., Huber, D., Grafetstätter, C., Pichler, C., Braunschmid, H., Weisböck- Erdheim, R., & Hartl, A. (2019). Winter Exercise Reduces Allergic Airway Inflammation: A Randomized Controlled Study. International journal of environmental research and public health, 16(11), 2040. https://doi.org/10.3390/ijerph16112040 18. Akritas, M.G. and E. Brunner. 1997. A unified approach to rank tests for mixed models. Journal of Statistical Planning and Inference. 61:249–277. 19. Haiko Lüpsen, Anova with binary variables - Alternatives for a dangerous F-test (dac lepszy citation) 20. Haiko Lüpsen, Comparison of nonparametric analysis of variance methods a Monte Carlo study - Part A: Between subjects designs - A Vote for van der Waerden + my list of various non-parametric and robust alternatives to the classic n-way ANOVA: https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way- ANOVA-in-biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz 2020-11-13 (Friday ), LinkedIn, Adrian Olszewski