12 Transformation

Data transformations are common in (microbial) ecology (Legendre and Gallagher 2001). They are used to mitigate technical biases in the data, to obtain more interpretable values, to enhance the comparability of samples/features or to make data compatible with the assumptions of certain statistical methods.

Legendre, Pierre, and Eugene D. Gallagher. 2001. “Ecologically Meaningful Transformations for Ordination of Species Data.” Oecologia 129 (2): 271–80. https://doi.org/10.1007/s004420100716.

Examples include transforming feature counts into relative abundances (i.e., “normalising as proportions”) and applying compositionality-aware transformations such as the centered log-ratio transformation (clr).

12.1 Characteristics of microbiome data to inform data transformations

Transformations are important when working with microbiome data because of several characteristics of sequencing data. For example, the feature transformations mentioned above are commonly used with microbiome data because sequencing reads are inherently proportional: due to the nature of sequencing technology, read counts of taxa do not represent real counts of microbes in the original sample, and raw counts of taxa are thus not comparable between samples.

This compositionality means that a change in the absolute abundance of one taxon will lead to apparent variations in the relative abundances of other taxa in the same sample. If neglected, this property may cause significant bias in the results of statistical tests. The transformations mentioned above were developed to overcome these issues by making taxon abundances comparable across samples.
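A toy illustration of this effect in base R may help. The taxon names and numbers below are made up for the sketch and do not come from any dataset used in this book.

```r
# Hypothetical absolute abundances of three taxa in one community
before <- c(taxonA = 100, taxonB = 100, taxonC = 100)

# Only taxonA grows; taxonB and taxonC are unchanged in absolute terms
after <- before
after["taxonA"] <- 400

# Relative abundances, which is what sequencing reflects
rel_before <- before / sum(before)
rel_after  <- after / sum(after)

round(rel_before, 3)
round(rel_after, 3)
```

Although taxonB and taxonC did not change in absolute terms, their proportions drop from one third to one sixth, purely because taxonA increased.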

In addition to this compositionality, other special characteristics of microbiome sequencing data are high variability and zero-inflation. High variability means that abundance of taxa often varies by several orders of magnitude from sample to sample. Zero-inflation means that typically more than 70% of the taxa-per-sample values are zeros, which could be due to either physical absence (structural zeros) or insufficient sequencing coverage (sampling zeros).
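Both properties are easy to quantify for any count matrix. The following base R sketch uses a small hypothetical matrix (rows are taxa, columns are samples); the values are invented for illustration only.

```r
# Hypothetical count matrix: rows are taxa, columns are samples
counts <- matrix(
    c(0, 120, 0,     3,
      0,   0, 0, 45000,
      7,   0, 0,     0),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:4))
)

# Zero-inflation: fraction of taxa-per-sample values that are zero
zero_fraction <- mean(counts == 0)

# High variability: span of the non-zero values in orders of magnitude
nonzero <- counts[counts > 0]
magnitude_span <- log10(max(nonzero) / min(nonzero))

zero_fraction
magnitude_span
```

With a real dataset, the same two lines can be applied to the matrix returned by assay(tse, "counts").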

12.2 Common transformation methods

Let us now summarize some commonly used transformations in microbiome data science; further details and benchmarks are available in the references.

alr: The additive log-ratio transformation belongs to the broader Aitchison family of transformations, together with ‘clr’ and ‘rclr’. Its main difference from them is that it selects a single feature or component as a reference and expresses all other features as log-ratios relative to it. Greenacre, Martínez-Álvaro, and Blasco (2021) provide guidance on choosing an appropriate reference feature.
clr: The centered log-ratio transformation (Aitchison 1986) is used to reduce data skewness and compositionality bias in relative abundances, while bringing the data onto a logarithmic scale. This transformation is frequently applied in microbial ecology as it enhances comparability of relative differences between samples (Gloor et al. 2017). However, the resulting transformed values are difficult to interpret directly, and the transformation can only be applied to positive values, not zeros. The usual way to make all values non-zero is to add a pseudocount, which introduces another type of bias because true taxon absences are not taken into account. The robust clr transformation mitigates this (see rclr below).
hellinger: Hellinger transformation is equal to the square root of relative abundances. This ecological transformation can be useful when the focus is on how species proportions vary across samples, rather than on absolute counts.
log, log2, log10: Logarithmic transformations, used e.g. to reduce data skewness. With compositional data, the clr (or rclr) transformation is often preferred.
pa: Presence/absence transformation ignores abundances and only indicates whether the given feature is detected above the given threshold (default: 0). This simple transformation is relatively widely used in ecological research, whenever we are more interested in which taxa are present in which samples than in their abundances. An example is the inference of microbial ecological association networks from the covariation of taxa across samples. Here, microbes that are rarely seen in the same samples are expected to have a negative ecological association (due to either competitive exclusion or different ecological niche preferences). In contrast, negative correlations in their relative abundances would not be a meaningful measure of ecological association. Moreover, the transformation has performed well in machine learning classification models (Karwowska et al. 2024).
rank: Rank transformation replaces each feature abundance value by its rank. The relative rank transformation (rrank) is similar but uses relative ranks. These are useful, for instance, in non-parametric statistics.
rclr: The robust clr (rclr) is similar to the regular clr (see above) but allows data with zeros and avoids the need to add a pseudocount (Martino et al. 2019).
relabundance: Relative transformation, also known as normalising as proportions, total sum scaling (TSS), or compositional transformation. This converts counts into proportions (on the scale [0, 1]) that sum up to 1. Much of the currently available taxonomic abundance data from high-throughput assays (16S, metagenomic sequencing) is compositional by nature, even if the data is provided as “counts” (Gloor et al. 2017).
standardize: Standardize (or z-score) transformation scales data to zero mean and unit variance. This is used to bring features (or samples) to more comparable levels in terms of mean and scale of the values. This can enhance visualization and interpretation of the data.
Other available transformations include chi-square (chi.square), frequency transformation (frequency), and scaling the margin sum of squares to one (normalize).
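To make the definitions above concrete, here is a minimal base R sketch of a few of them, applied sample-wise to a hypothetical count matrix. The helper functions are written for illustration only and are not the implementations used by mia; the natural logarithm and the pseudocount of 1 are assumptions of this sketch.

```r
# Hypothetical counts: rows are taxa (features), columns are samples
counts <- matrix(
    c(10,  0,
      30,  5,
      60, 45),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("t1", "t2", "t3"), c("s1", "s2"))
)

# relabundance: divide each sample (column) by its total
relabundance <- function(x) sweep(x, 2, colSums(x), "/")

# hellinger: square root of relative abundances
hellinger <- function(x) sqrt(relabundance(x))

# pa: 1 if the feature exceeds the threshold, 0 otherwise
pa <- function(x, threshold = 0) (x > threshold) * 1

# clr: log of each value minus the mean log of its sample;
# requires strictly positive input
clr <- function(x) {
    logx <- log(x)
    sweep(logx, 2, colMeans(logx), "-")
}

# alr: log-ratio of every other feature to one chosen reference feature
alr <- function(x, ref) {
    logx <- log(x)
    others <- rownames(x) != ref
    sweep(logx[others, , drop = FALSE], 2, logx[ref, ], "-")
}

shifted <- counts + 1  # pseudocount of 1, an arbitrary choice for this sketch

colSums(relabundance(counts))  # each sample sums to 1
colSums(clr(shifted))          # clr values sum to (numerically) zero per sample
alr(shifted, ref = "t3")       # one row fewer than the input
```

Note that alr returns one feature fewer than it receives, because the reference feature is absorbed into the ratios.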

Greenacre, Michael, Marina Martínez-Álvaro, and Agustín Blasco. 2021. “Compositional Data Analysis of Microbiome and Any-Omics Datasets: A Validation of the Additive Logratio Transformation.” Frontiers in Microbiology 12 (October). https://doi.org/10.3389/fmicb.2021.727398.

Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London, UK: Chapman & Hall.

Gloor, GB, JM Macklaim, V Pawlowsky-Glahn, and JJ Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Frontiers in Microbiology 8. https://doi.org/10.3389/fmicb.2017.02224.

Karwowska, Zuzanna, Oliver Aasmets, Estonian Biobank Research Team, Tomasz Kosciolek, and Elin Org. 2024. “Effects of Data Transformation and Model Selection on Feature Importance in Microbiome Classification Data.” bioRxiv. https://doi.org/10.1101/2023.09.19.558406v2.

Martino, C, J. T. Morton, C. A. Marotz, L. R. Thompson, A Tripathi, R Knight, and K Zengler. 2019. “A Novel Sparse Compositional Technique Reveals Microbial Perturbations.” mSystems 4.

Transformations on abundance assays can be performed with mia::transformAssay(), which keeps both the original and the transformed assay(s) in the data object. The function applies a sample-wise (column-wise) transformation when MARGIN = "cols" and a feature-wise (row-wise) transformation when MARGIN = "rows". A complete list of available transformations and parameters is available in the function help.
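The difference between the two margins can be sketched in base R on a hypothetical matrix. This only illustrates what "sample-wise" and "feature-wise" mean; it does not reproduce how transformAssay() works internally.

```r
# Hypothetical counts: rows are features, columns are samples
counts <- matrix(
    c(1, 2, 3, 4, 5, 6),
    nrow = 2,
    dimnames = list(c("f1", "f2"), c("s1", "s2", "s3"))
)

# Sample-wise (column-wise): each sample is divided by its own total
by_sample <- sweep(counts, 2, colSums(counts), "/")

# Feature-wise (row-wise): each feature is z-scored across samples
by_feature <- t(scale(t(counts)))

colSums(by_sample)    # every sample sums to 1
rowMeans(by_feature)  # every feature has mean 0
```

Relative abundance is naturally a sample-wise operation, whereas standardization is often applied feature-wise so that features become comparable with each other.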

Important

A pseudocount is a small positive value (e.g., 1) added to the normalized data to avoid taking the logarithm of zero. Its value can have a significant impact on the results when applying a logarithm transformation to normalized data, as the logarithm transformation is a nonlinear operation that can fundamentally change the data distribution (Costea et al. 2014).

The pseudocount should be chosen consistently across all normalization methods being compared, for example, by setting it to a value smaller than the minimum abundance value before transformation. Some tools, like ancombc2, take the effect of the pseudocount into account by performing sensitivity tests using multiple pseudocount values. See Chapter 17.
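The sensitivity to the pseudocount can be seen in a small base R sketch with hypothetical relative abundances. The rule used here, half of the smallest non-zero value, is just one possible way to stay below the minimum abundance and is an assumption of this example.

```r
# Hypothetical relative abundances for one sample, including a zero
rel <- c(t1 = 0.70, t2 = 0.29, t3 = 0.01, t4 = 0)

# One possible rule (an assumption for this sketch): half of the smallest
# non-zero value, so the pseudocount stays below any observed abundance
pseudo_small <- min(rel[rel > 0]) / 2

# A pseudocount of 1 is far larger than any proportion
log_small <- log10(rel + pseudo_small)
log_large <- log10(rel + 1)

# Spread between the most and least abundant taxon under each choice
diff(range(log_small))
diff(range(log_large))
```

With the large pseudocount, the differences between taxa are compressed to a fraction of a log10 unit, whereas the small pseudocount preserves a spread of about two orders of magnitude.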

Costea, Paul I., Georg Zeller, Shinichi Sunagawa, and Peer Bork. 2014. “A Fair Comparison.” Nature Methods 11: 359. https://doi.org/10.1038/nmeth.2897.

12.3 Rarefaction

Another approach to control for uneven sampling depths is to apply rarefaction with rarefyAssay(), which normalizes the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed (Schloss 2024a, 2024b). Moreover, this practice has been discouraged for the analysis of differentially abundant microorganisms (McMurdie and Holmes 2014).
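The idea behind rarefaction can be sketched in base R as random subsampling of reads without replacement. The counts and the target depth below are hypothetical, and this sketch is not necessarily how rarefyAssay() is implemented.

```r
set.seed(1)  # subsampling is random; fix the seed for reproducibility

# Hypothetical read counts for one sample (5,000 reads in total)
sample_counts <- c(t1 = 3000, t2 = 1500, t3 = 499, t4 = 1)
depth <- 1000  # target number of reads, an arbitrary choice

# Expand counts into one entry per read, draw `depth` reads without
# replacement, and tabulate them back into per-taxon counts
reads <- rep(names(sample_counts), times = sample_counts)
drawn <- sample(reads, size = depth, replace = FALSE)
rarefied <- table(factor(drawn, levels = names(sample_counts)))

rarefied
sum(rarefied)  # equals the target depth
```

Because four fifths of the reads are discarded here, rare taxa such as t4 are likely to disappear from the rarefied sample, which is the information loss referred to above.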

Schloss, Patrick D. 2024a. “Rarefaction Is Currently the Best Approach to Control for Uneven Sequencing Effort in Amplicon Sequence Analyses.” mSphere 9 (2): e00354–23. https://doi.org/10.1128/msphere.00354-23.

Schloss, Patrick D. 2024b. “Waste Not, Want Not: Revisiting the Analysis That Called into Question the Practice of Rarefaction.” mSphere 9 (1): e00355–23. https://doi.org/10.1128/msphere.00355-23.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4): e1003531.

12.4 Transformations in practice

Below, we apply the relative abundance transformation to the counts table.

# Load example data
library(mia)
data("Tengeler2020")
tse <- Tengeler2020

# Transform counts assay to relative abundances
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")

Retrieve the resulting assay and view its first entries with head().

assay(tse, "relabundance") |> head()
##                     A110     A12     A15     A19     A21     A23     A25
##  Bacteroides     0.47393 0.28657 0.00000 0.22459 0.27397 0.32796 0.21594
##  Bacteroides_1   0.32230 0.00000 0.16664 0.07080 0.08503 0.04193 0.00000
##  Parabacteroides 0.00000 0.02390 0.00000 0.01400 0.02283 0.00000 0.00994
##  Bacteroides_2   0.00000 0.04709 0.00000 0.14019 0.10376 0.00000 0.05362
##  Akkermansia     0.03057 0.04659 0.07539 0.01489 0.01323 0.12818 0.04012
##  Bacteroides_3   0.00000 0.16011 0.00000 0.11362 0.09605 0.00000 0.04760
##                      A28     A29     A34     A36     A37     A39    A111
##  Bacteroides     0.19379 0.14221 0.27229 0.37622 0.38072 0.00000 0.49423
##  Bacteroides_1   0.00000 0.00000 0.30309 0.38768 0.00000 0.00000 0.39163
##  Parabacteroides 0.01752 0.01749 0.00000 0.00000 0.10521 0.43546 0.00000
##  Bacteroides_2   0.07981 0.07957 0.00000 0.02852 0.32992 0.00000 0.02786
##  Akkermansia     0.03593 0.01411 0.07693 0.05196 0.02641 0.05413 0.01383
##  Bacteroides_3   0.07525 0.19865 0.00000 0.00000 0.00000 0.00000 0.00000
##                      A13     A14     A16     A17     A18    A210     A22
##  Bacteroides     0.05534 0.22500 0.25188 0.21775 0.50314 0.22023 0.09614
##  Bacteroides_1   0.00000 0.05667 0.00000 0.00000 0.29137 0.09577 0.00000
##  Parabacteroides 0.33893 0.00000 0.10925 0.06554 0.00000 0.00000 0.28324
##  Bacteroides_2   0.02042 0.00000 0.09913 0.10589 0.00000 0.00000 0.04164
##  Akkermansia     0.05270 0.09199 0.09739 0.10253 0.03738 0.13349 0.05105
##  Bacteroides_3   0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
##                       A24      A26     A27     A33     A35     A38
##  Bacteroides     0.375716 0.076844 0.24740 0.34204 0.05687 0.20908
##  Bacteroides_1   0.266715 0.000000 0.10112 0.04590 0.00000 0.16765
##  Parabacteroides 0.000000 0.172826 0.00000 0.00000 0.27061 0.00000
##  Bacteroides_2   0.010230 0.006608 0.02002 0.04444 0.02484 0.03583
##  Akkermansia     0.038485 0.118447 0.07714 0.02582 0.07364 0.04671
##  Bacteroides_3   0.008525 0.000000 0.02200 0.03333 0.00000 0.01621

In the ‘pa’ transformation, the abundance table is converted to a presence/absence table that ignores abundances and only indicates whether the given feature is detected in the sample.

tse <- transformAssay(tse, method = "pa")
assay(tse, "pa") |> head()
##                  A110 A12 A15 A19 A21 A23 A25 A28 A29 A34 A36 A37 A39 A111
##  Bacteroides        1   1   0   1   1   1   1   1   1   1   1   1   0    1
##  Bacteroides_1      1   0   1   1   1   1   0   0   0   1   1   0   0    1
##  Parabacteroides    0   1   0   1   1   0   1   1   1   0   0   1   1    0
##  Bacteroides_2      0   1   0   1   1   0   1   1   1   0   1   1   0    1
##  Akkermansia        1   1   1   1   1   1   1   1   1   1   1   1   1    1
##  Bacteroides_3      0   1   0   1   1   0   1   1   1   0   0   0   0    0
##                  A13 A14 A16 A17 A18 A210 A22 A24 A26 A27 A33 A35 A38
##  Bacteroides       1   1   1   1   1    1   1   1   1   1   1   1   1
##  Bacteroides_1     0   1   0   0   1    1   0   1   0   1   1   0   1
##  Parabacteroides   1   0   1   1   0    0   1   0   1   0   0   1   0
##  Bacteroides_2     1   0   1   1   0    0   1   1   1   1   1   1   1
##  Akkermansia       1   1   1   1   1    1   1   1   1   1   1   1   1
##  Bacteroides_3     0   0   0   0   0    0   0   1   0   1   1   0   1

You can now view the entire list of abundance assays in your data object with:

assays(tse)
##  List of length 3
##  names(3): counts relabundance pa

A common question is whether the centered log-ratio (clr) transformation should be applied directly to raw counts or if a prior transformation, such as conversion to relative abundances, is necessary.

In theory, the clr transformation is scale-invariant, meaning it does not matter whether it is applied to raw or relative abundances, as long as the relative scale of abundances remains the same. However, in practice, there are some differences due to the introduction of a pseudocount, which can introduce bias.

There is no single correct answer, but the following considerations may help:

Data imputation should typically be applied to raw abundances, regardless of the microbial profiling pipeline used or whether the obtained abundances are counts or relative abundances.
Once a pseudocount has been added, it makes no difference whether one first converts to relative abundances before applying clr or applies clr directly to the adjusted counts.
Since applying clr directly to raw counts is the simpler approach, it is generally recommended.
One might also consider using robust clr instead.

tse <- transformAssay(
    x = tse,
    assay.type = "counts",
    method = "clr",
    pseudocount = TRUE,
    name = "clr"
)
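The scale-invariance argument above can be checked numerically. The base R sketch below uses hypothetical counts for a single sample and a hand-written clr helper (not mia's implementation); the pseudocount of 1 is an arbitrary choice.

```r
# Hypothetical counts for one sample, including a zero
counts <- c(t1 = 50, t2 = 30, t3 = 20, t4 = 0)

# Minimal clr for a single sample (natural logarithm)
clr <- function(x) log(x) - mean(log(x))

# Pseudocount added to the counts first, then optionally normalised
adjusted   <- counts + 1
clr_counts <- clr(adjusted)
clr_rel    <- clr(adjusted / sum(adjusted))
all.equal(clr_counts, clr_rel)  # TRUE: clr is scale-invariant

# Adding the same pseudocount AFTER converting to proportions is a
# different operation and yields different clr values
clr_rel_late <- clr(counts / sum(counts) + 1)
isTRUE(all.equal(clr_counts, clr_rel_late))  # FALSE
```

The order of normalisation and clr does not matter once the data are strictly positive; what matters is the stage at which the pseudocount is added and how large it is relative to the values.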

To incorporate phylogenetic information, one can apply the phylogenetic isometric log-ratio (PhILR) transformation (Silverman et al. 2017). Unlike standard transformations, PhILR accounts for the genetic relationships between taxonomic features. This is important because closely related species often share similar properties, which traditional transformations fail to capture.

Silverman, Justin D, Alex D Washburne, Sayan Mukherjee, and Lawrence A David. 2017. “A Phylogenetic Transform Enhances Analysis of Compositional Microbiota Data.” eLife 6. https://doi.org/10.7554/eLife.21887.

tse <- transformAssay(tse, method = "philr", MARGIN = 1L, pseudocount = TRUE)

Unlike other transformations, PhILR outputs a table in which rows represent nodes of the phylogeny. These new features do not match the features of the TreeSE, which is why the new dataset is stored in altExp.

altExp(tse, "philr")
##  class: TreeSummarizedExperiment 
##  dim: 149 27 
##  metadata(0):
##  assays(1): philr
##  rownames(149): node_1 node_2 ... node_148 node_149
##  rowData names(0):
##  colnames(27): A110 A12 ... A35 A38
##  colData names(4): patient_status cohort patient_status_vs_cohort
##    sample_name
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: NULL
##  rowTree: NULL
##  colLinks: NULL
##  colTree: NULL

Summary

Microbiome data is characterized by the following features:

Compositionality
High variability
Zero-inflation

The OSCA book provides additional information on normalization from the perspective of single-cell analysis.

Exercises

Goal: The goal is to learn how to apply different transformations.

Exercise 1: Transform data

Load any of the example datasets mentioned in Section 4.2.
Visualize counts with a histogram. Describe the data distribution. Are there many zeros?
Transform the counts assay into relative abundances and store it into the TreeSE as an assay named relabund.
Similarly, perform a CLR transformation on the counts assay with a pseudocount of 1 and add it to the TreeSE as a new assay.
List the available assays by name.
Visualize the CLR-transformed data with a histogram. Compare the distribution with the distribution of the counts data.
Access the CLR assay and store it in a variable. Select a subset of its first 100 features and 10 samples, and print the abundance table. Explore the data.
Agglomerate the data with agglomerateByRanks(), then apply transformations to the resulting alternative experiments. Use the option altexp = altExpNames(tse).
If the data has a phylogenetic tree, agglomerate the data to the family level and perform the PhILR transformation. Where was the transformed data stored? Compare the feature names with the original data. Why do the names differ?

Useful functions:

utils::data(), miaViz::plotHistogram(), mia::transformAssay(), mia::assayNames(), SummarizedExperiment::assay(), mia::agglomerateByRanks(), mia::altExp(), BiocGenerics::rownames()