Why are data transformations a bad choice in statistics

On the transformation of a response in regression modelling
and hypothesis testing
Adrian Olszewski
Originally posted at: ResearchGate, March 21st, 2020, URL: https://tinyurl.com/yyv2ryus
Updated and enhanced: November 13th, 2020
Find the most up-to-date version here:
https://www.dropbox.com/s/62bh8cvbkjuu21n/data%20transformations.pdf?dl=0
I do suggest avoiding any variable transformations as much as possible, except in cases where
you can thoroughly and convincingly justify the reason and explain the outcome. This applies
especially to Box-Cox.
1. It completely changes the formulation and affects the interpretation. Only in "clean"
cases will you get an interpretable outcome, like log-transformed data generated by a
multiplicative process (not *any right-skewed data*!). The log, exp, reciprocal, square/cube
root, and power of 2 or 3 transformations may be meaningful in *special scenarios*, e.g.
velocity, area, volume, concentration, length (square root of area), but y^-0.67 doesn't
mean anything. And most of your audience will have no idea how your response
changes with the predictor unless you draw the curve. Easy for a single response, but what
if you have more? You will need marginal effects to give some idea.
Sometimes you can decide to approximate the obtained power coefficient with a well-known
one, e.g. 0.45 is close to 0.5 (square root), but it’s not easy in general. So why do
that?
By transforming, you *force your variables to follow a certain distribution* and to *tell
your story*. For example, a log transformation assumes your data come from a log-normal
distribution. Just look at what it does to the equation. As a consequence...
2. … It changes the model along with the errors! In our case - from additive to
multiplicative, including the errors. Maybe it's good, maybe not; it depends on your case.
3. It will also affect the variance along with the means - many people blindly use
transformations, completely forgetting about that! Well, it can be useful if we want to
stabilize the variance, BUT it changes more! For example, in the normal distribution the mean
and variance are independent; in the log-normal they are not! Of course this is an idealized
case. Box-Cox may return any weird coefficient - guess how the model and the mean-variance
relationship will change?
4. Jensen's inequality says it clearly: for a non-linear transformation g, g(E(y)) ≠ E(g(y))
in general; for the log, E(log(y)) < log(E(y)) unless y is constant (with the identity, i.e.
no transformation, both sides coincide). In regression you are interested in modelling the
conditional expectation of the response, not the expectation of the transformed response.
And remember that no transformation can properly handle certain response distributions, like
counts (it makes no sense, by the way).

5. In case of testing, it changes the null hypothesis, which is likely no longer the one you
wanted to assess! In our case: it switches from testing the shift in arithmetic means to
testing the ratio of geometric means.
I can hear you: “I was told it leads to valid inference!”.
Yes, it leads to valid inference... about a hypothesis you did not start with (unless
you can justify that transition).
You obtain a valid answer to an *unasked question*. And yes, results from a model fitted to
log(y) may differ from results returned by a model with a log link (e.g. gamma regression).
You will have to decide which one to choose.
Sometimes there are industry guidelines, like those given by the FDA for clinical
biostatistics, which advise using log on PK data (for a good reason), but *even those
guidelines* warn you against unconditional and *unjustified* transformations!
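6. Your back-transformed confidence intervals will be biased as intervals for the mean on the
original scale (they describe the back-transformed quantity, e.g. the geometric mean after a
log). Another disease for the collection.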
7. Box-Cox, like any other transformation, does NOT guarantee the properties you need.
And then what are you going to do? Transform the data again, until satisfied? You know
this will only complicate an already complicated situation? Not to mention that you may
turn your right-skewed data into... left-skewed data and run into more trouble.
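To make points 2-4 and 6 concrete, here is a short worked derivation for the simplest case of one predictor. It is only a sketch under stated assumptions: the symbols \beta_0, \beta_1 and \sigma are introduced here for illustration, and the last two lines additionally assume normally distributed errors on the log scale (the idealized log-normal case).

```latex
% Requires amsmath. Assumed notation: one predictor x, error term \varepsilon.
\begin{align*}
\text{Untransformed model:} \quad & y = \beta_0 + \beta_1 x + \varepsilon,
  \qquad \mathrm{E}(y \mid x) = \beta_0 + \beta_1 x \\[4pt]
\text{Log-transformed model:} \quad & \log y = \beta_0 + \beta_1 x + \varepsilon
  \;\Longrightarrow\; y = e^{\beta_0} \cdot e^{\beta_1 x} \cdot e^{\varepsilon} \\[4pt]
\text{If } \varepsilon \sim N(0, \sigma^2),\ \sigma^2 > 0: \quad
  & \mathrm{E}(y \mid x) = e^{\beta_0 + \beta_1 x + \sigma^2/2}
    \;>\; e^{\beta_0 + \beta_1 x} = \exp\{\mathrm{E}(\log y \mid x)\} \\[4pt]
  & \mathrm{Var}(y \mid x) = \bigl(e^{\sigma^2} - 1\bigr)\, \mathrm{E}(y \mid x)^2
\end{align*}
```

The error enters multiplicatively, the naive back-transformation recovers the conditional median (geometric mean) rather than the conditional mean, and the variance is tied to the square of the mean.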
I know there are many proponents of unconditional data transformation ("skewed data? go
transform it!") on ResearchGate. They were taught this for tens of years. Moreover, some of
them were told by authorities to continue using it.
But in the light of the above arguments I strongly suggest considering the (practically
always better) alternatives.
Except for the few scenarios mentioned, transformations can cause more harm than good in
confirmatory and exploratory modelling.
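The gap between arithmetic and geometric means described in points 4-6 above can be checked numerically. Below is a minimal sketch in Python (standard library only). The log-normal parameters mu and sigma, the sample size and the seed are arbitrary values chosen for illustration; they do not come from any real study.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical log-normal sample: log(y) ~ Normal(mu, sigma)
mu, sigma = 1.0, 0.8
y = [random.lognormvariate(mu, sigma) for _ in range(100_000)]

# What an analysis of the raw data targets: the arithmetic mean
arithmetic_mean = statistics.fmean(y)

# What an analysis of log(y) targets after naive back-transformation:
# exp(mean of logs), i.e. the geometric mean
mean_of_logs = statistics.fmean([math.log(v) for v in y])
geometric_mean = math.exp(mean_of_logs)

# Theoretical values for the log-normal distribution
theoretical_mean = math.exp(mu + sigma ** 2 / 2)
theoretical_median = math.exp(mu)

print(f"arithmetic mean    : {arithmetic_mean:.3f}")
print(f"exp(mean of logs)  : {geometric_mean:.3f}")
print(f"theoretical mean   : {theoretical_mean:.3f}")
print(f"theoretical median : {theoretical_median:.3f}")
```

The back-transformed mean of the logs lands near the median of the distribution, not near its arithmetic mean, which is the bias mentioned in points 4 and 6.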
Can it be useful? Yes, it may be OK in predictive modelling, especially if you agree to use a
“black box” approach, where you care mostly about the predicted outcomes and not the rest of
the story. If the predicted outcome agrees with the expectations – you are fine with that.
Then – it’s OK.
“OK, so what techniques and methods do you recommend instead?”
In the 21st century we have plenty of models, estimation methods and other techniques
(available for ca. 50 years) that allow us to deal with certain violations of the assumptions
(normality, homoscedasticity, independence of observations and so on), including:
1) generalized models (GLM and GAM), like: gamma, beta, logistic (and probit),
fractional logit, Poisson, negative binomial, etc. regressions. Truncated (most real
variables have a truncated domain, keep it in mind!) and censored regression (e.g. the tobit
model). I’m sure you will be able to find a tool suitable for you. Remember that this
generalizes nicely to mixed-effects models.

2) non-linear models
3) robust and non-parametric methods and tests (there are over 280 statistical tests! Lots
of them do not require, or relax, certain parametric assumptions, like Yuen,
Brunner-Munzel, ATS, WTS, ART ANOVA, Welch, Mann-Whitney/Wilcoxon, and many,
many more). At the end of this document I added a longer list of the literature that I
have read and can wholeheartedly recommend.
a. If you need ANOVA on non-normal or heterogeneous data, remember that you
can run a more advanced model (e.g. robust regression, quantile regression,
mixed models) or use GLS or GEE estimation and follow the modelling with a
set of LR (likelihood ratio) tests to mimic the type-3 ANOVA and get the main
and interaction effects!
Yes, you read that right – that’s what the anova() (or car::Anova(… type=2/3))
function does in R when dealing with so many kinds of models: it assesses the
reduction of the residual deviance. Which – in the case of the simple
linear model – reduces to the analysis of certain contrasts, which is nothing but
comparing group means. See? The dots connect!
Yes, the outcome will be approximate (Chi2 rather than F), but hey – it’s still
a worthwhile and flexible solution!
4) quantile regression (which also handles mixed effects) – it’s one of the most powerful
methods, requiring no distributional assumptions yet still offering good
interpretability!
5) Generalized Least Squares (GLS) and Generalized Estimating Equations (GEE) estimation
6) Passing-Bablok and Deming regression
7) resampling (permutation/exact tests, approximate permutation tests, bootstrapped
interval estimation). Just remember that these methods aren’t accepted by the regulators in
the Clinical Research industry when used to analyse the primary outcomes.
8) In case of serious skewness you can also try adding categorical covariate(s) to your
model which may split your dataset into more homogeneous subgroups. Why? Because
the skewness often comes from mixed data coming from 2+ populations with different
variability.
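As an illustration of the resampling approach from point 7, here is a minimal sketch of an approximate (Monte Carlo) permutation test for the difference in arithmetic means of two skewed samples, with no transformation involved. It uses plain Python only. The function name permutation_test_mean_diff, the simulated log-normal groups and all parameter values are hypothetical choices made for this example, not a reference implementation.

```python
import random
import statistics


def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided approximate permutation test for a difference in arithmetic means."""
    rng = random.Random(seed)
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels by shuffling the pooled data
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical right-skewed data: two log-normal groups with different locations
data_rng = random.Random(1)
group_a = [data_rng.lognormvariate(0.0, 0.7) for _ in range(60)]
group_b = [data_rng.lognormvariate(1.0, 0.7) for _ in range(60)]

observed, p_value = permutation_test_mean_diff(group_a, group_b)
print(f"difference in arithmetic means: {observed:.3f}")
print(f"approximate permutation p-value: {p_value:.4f}")
```

The hypothesis being tested stays on the original scale (a difference in arithmetic means), so the question asked is the one answered.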
Afterword
As always in statistics – there’s no easy solution to all cases. There are justified cases, where
the transformations are not only applicable, but also demanded by the regulations – see an
example here: FDA: Guidance for Industry - Statistical Approaches to Establishing
Bioequivalence
Or here: EMA - ICH Topic E 9 Statistical Principles for Clinical Trials, step 5

Also: “THE LOG TRANSFORMATION IS SPECIAL”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.530.9640&rep=rep1&type=pdf
Also, find my diagram (on Dropbox) showing a few of the families of models (along with the
relationships) and estimation methods that may be more useful to you than data
transformation:
https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf

A note on “what we call a regression” may also interest you, especially if:
1) you advise people to transform their DV (response) with Box-Cox (or log) without a
thought for the consequences, to "achieve normality" in the DV or residuals. It MAY be
OK when predicting with the "black-box" approach, but is NOT OK when you use a
model to explain / confirm relationships between variables.
2) you "chase" normality of the raw DV (response) in linear regression.
3) you believe that strongly skewed data cannot be modelled with linear regression
4) …and vice versa - you overuse it for everything (counts, %, categorical data from
questionnaires, concentrations)
5) you say that the linear model is named so because it "produces" a straight line
6) you say that logistic regression is "not a regression, because it models binary response"
7) you believe the "stepwise regression" is a regression
https://www.linkedin.com/posts/adrianolszewski_rockyourr-datascience-dataanalysis-activity-6691521288101531648-jkuW
A few URLs linking to discussions and resources worth reading:
1. Log-transformation and its implications for data analysis
2. GLM with a Gamma-distributed Dependent Variable (PDF)
3. CrossValidated: When to use gamma GLMs?
4. To transform or not to transform: using generalized linear mixed models to analyse
reaction time data
5. Stat 504 - Introduction to Generalized Linear Models
6. Do Not Log-Transform Count Data, Bitches!
7. Generalized linear models - An introduction by Christoph Scherber
8. https://www.theanalysisfactor.com/the-difference-between-link-functions-and-data-transformations/
9. Notes on Transformations and Generalized Linear Models
10. Handling Skewed Data: A Comparison of Two Popular Methods

11. CrossValidated: Linear model with log-transformed response vs. generalized linear
model with log link
12. CrossValidated: How to decide which glm family to use?
13. CrossValidated: Family of GLM represents the distribution of the response variable or
residuals?
14. CrossValidated: Why is GLM different than an LM with transformed variable
15. CrossValidated: GLM vs square root data transformation
16. https://stats.idre.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/
17. CrossValidated: How to interpret regression coefficients when response was
transformed by the 4th root?
18. CrossValidated: Express answers in terms of original units, in Box-Cox transformed
data
Books worth reading (yes, I have read them, or familiarized myself with them enough, and
use(d) them at work, so I can recommend them):
1. Alan Agresti, Foundations of Linear and Generalized Linear Models
2. John Fox, Applied Regression Analysis and Generalized Linear Models
3. Roger Koenker, Victor Chernozhukov, Xuming He, Limin Peng, Handbook of Quantile
Regression
4. Derek S. Young, Handbook of Regression Methods
5. Andreas Ziegler, Generalized Estimating Equations
6. Daryl S. Paulson, Handbook of Regression and Modeling Applications for the Clinical
and Pharmaceutical Industries
7. Myles Hollander, Douglas A. Wolfe, Eric Chicken, Nonparametric Statistical Methods
8. Jason C. Hsu, Multiple Comparisons: Theory and Methods
9. Alex Dmitrienko, Ajit C. Tamhane, Frank Bretz, Multiple Testing Problems in
Pharmaceutical Statistics
10. Michael G. Akritas and Dimitris N. Politis, Recent Advances and Trends in
Nonparametric Statistics
11. W. J. Conover, Practical Nonparametric Statistics
+ some more literature about the modern and flexible non-parametric methods (there’s a lot
more beyond the Mann-Whitney-Wilcoxon, Friedman, Kruskal-Wallis!), so you don’t have to
transform your data ;]
1. Erceg-Hurn, David & Mirosevich, Vikki. (2008). Modern Robust Statistical Methods:
An Easy Way to Maximize the Accuracy and Power of Your Research. American
Psychologist, 63(7), 591-601. doi:10.1037/0003-066X.63.7.591.
https://www.researchgate.net/publication/23319441_Modern_Robust_Statistical_Methods_An_Easy_Way_to_Maximize_the_Accuracy_and_Power_of_Your_Research
https://pdfs.semanticscholar.org/88cb/15520b2f84fd2a5a09e0341e791f40ab4118.pdf
2. Edgar Brunner, Madan L. Puri, Nonparametric Methods in Factorial Designs
https://www.researchgate.net/profile/Jos_Feys/post/What_statistical_tests_can_I_use_to_compare_mean_values_for_my_study/attachment/59d6558b79197b80779acad7/AS:526088510111744@1502440683536/download/Brunner.pdf
3. Brunner, E., & Puri, M. L. (1996). Nonparametric methods in design and analysis of
experiments. In Design and Analysis of Experiments (Vol. 13, pp. 631–703). Elsevier.
https://doi.org/10.1016/S0169-7161(96)13021-2
4. Wobbrock, J.O., Findlater, L., Gergle, D. and Higgins, J.J. (2011). The Aligned Rank
Transform for nonparametric factorial analyses using only ANOVA procedures.
Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI
'11). Vancouver, British Columbia (May 7-12, 2011). New York: ACM Press, pp.
143-146. http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf
5. Christophe Leys, Sandy Schumann, A nonparametric method to analyze interactions:
The adjusted rank transform test
http://cescup.ulb.be/wp-content/uploads/2015/04/Leys_and_Schumann_nonparametric_interactions.pdf
6. Haiko Lüpsen, The Aligned Rank Transform and discrete Variables - a Warning
https://kups.ub.uni-koeln.de/7554/1/ART-discrete.pdf
7. Friedrich, S., Konietschke, F., & Pauly, M. (2017). GFD: An R Package for the
Analysis of General Factorial Designs. Journal of Statistical Software, 79(Code
Snippet 1), 1 - 18. doi:http://dx.doi.org/10.18637/jss.v079.c01
8. Kimihiro Noguchi, Yulia R. Gel, Edgar Brunner, Frank Konietschke, “nparLD: An R
Software Package for the Nonparametric Analysis of Longitudinal Data in Factorial
Experiments”
9. Edgar Brunner, Arne C. Bathke, Frank Konietschke, Rank and Pseudo-Rank
Procedures for Independent Observations in Factorial Designs: Using R and SAS,
Springer, 2019, ISBN: 303002914X, 9783030029142, page 134
https://books.google.pl/books?id=t9KiDwAAQBAJ&lpg=PA134&ots=_Jgi9Rt0Kz&hl=pl&pg=PA134#v=onepage&q&f=false
10. Feys, Jos. "New Nonparametric Rank Tests for Interactions in Factorial Designs with
Repeated Measures." Journal of Modern Applied Statistical Methods 15.1 (2016): 78-99.
https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1924&context=jmasm
11. Friedrich, S., Konietschke, F., Pauly, M. (2017). GFD - An R-package for the Analysis
of General Factorial Designs. Journal of Statistical Software, Code Snippets 79(1), 1–18,
doi:10.18637/jss.v079.c01.
Pauly, M., Brunner, E., Konietschke, F. (2015). Asymptotic Permutation Tests in General
Factorial Designs. Journal of the Royal Statistical Society - Series B 77, 461-473
12. Akritas, M. G., & Politis, D. N. (2003). Recent Advances and Trends in
Nonparametric Statistics. Elsevier B.V.
https://doi.org/10.1016/B978-0-444-51378-6.X5000-5

13. Peterson, K.M. (2002). Six Modifications Of The Aligned Rank Transform Test For
Interaction.
https://pdfs.semanticscholar.org/ad4b/54e104acf7356b53c075e959ba8c24e23fea.pdf
14. Schacht, A., Bogaerts, K., Bluhmki, E., & Lesaffre, E. (2008). A new nonparametric
approach for baseline covariate adjustment for two-group comparative studies.
Biometrics, 64(4), 1110-1116
15. Shah DA, Madden LV. Nonparametric analysis of ordinal data in designed factorial
experiments. Phytopathology. 2004;94(1):33-43. doi:10.1094/PHYTO.2004.94.1.33,
https://apsjournals.apsnet.org/doi/pdf/10.1094/PHYTO.2004.94.1.33
16. Versace, V., Schwenker, K., Langthaler, P. B., Golaszewski, S., Sebastianelli, L.,
Brigo, F., Pucks-Faes, E., Saltuari, L., & Nardone, R. (2020). Facilitation of Auditory
Comprehension After Theta Burst Stimulation of Wernicke's Area in Stroke Patients:
A Pilot Study. Frontiers in neurology, 10, 1319.
https://doi.org/10.3389/fneur.2019.01319,
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960103/
17. Prossegger, J., Huber, D., Grafetstätter, C., Pichler, C., Braunschmid, H., Weisböck-
Erdheim, R., & Hartl, A. (2019). Winter Exercise Reduces Allergic Airway
Inflammation: A Randomized Controlled Study. International journal of
environmental research and public health, 16(11), 2040.
https://doi.org/10.3390/ijerph16112040
18. Akritas, M.G. and E. Brunner. 1997. A unified approach to rank tests for mixed
models. Journal of Statistical Planning and Inference. 61:249–277.
19. Haiko Lüpsen, Anova with binary variables - Alternatives for a dangerous F-test (better
citation needed)
20. Haiko Lüpsen, Comparison of nonparametric analysis of variance methods: a Monte
Carlo study - Part A: Between subjects designs - A Vote for van der Waerden
+ my list of various non-parametric and robust alternatives to the classic n-way ANOVA:
https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way-ANOVA-in-biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz
2020-11-13 (Friday ), LinkedIn, Adrian Olszewski
