Handbook Of Statistical Bioinformatics 2nd Edition Henry Horngshing Lu


Springer Handbooks of Computational Statistics
Henry Horng-Shing Lu
Bernhard Schölkopf
Martin T. Wells
Hongyu Zhao
Editors
Handbook of Statistical Bioinformatics
Second Edition

Series Editors
James E. Gentle, George Mason University, Fairfax, VA, USA
Wolfgang Karl Härdle, Humboldt-Universität zu Berlin, Berlin, Germany
Yuichi Mori, Okayama University of Science, Okayama, Japan


Editors
Henry Horng-Shing Lu
Institute of Statistics
National Yang Ming Chiao Tung University
Hsinchu, Taiwan, ROC
Bernhard Schölkopf
Department of Empirical Inference
Max Planck Institute for Intelligent Systems
Tübingen, Germany
Martin T. Wells
Department of Statistics and Data Science
Cornell University
Ithaca, NY, USA
Hongyu Zhao
Department of Biostatistics
Yale University
New Haven, CT, USA
ISSN 2197-9790 ISSN 2197-9804 (electronic)
ISBN 978-3-662-65901-4 ISBN 978-3-662-65902-1 (eBook)
https://doi.org/10.1007/978-3-662-65902-1
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer-Verlag GmbH,
DE, part of Springer Nature 2011, 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer-Verlag GmbH, DE, part of
Springer Nature.
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany

Preface
Numerous fascinating and important breakthroughs in biotechnology have generated
massive volumes of diverse high-throughput data that demand
novel developments of efficient and appropriate tools in computational statistics
that are integrated with biological knowledge and computational algorithms. This
updated volume collects contributed chapters from leading researchers to survey
many recent active research topics that have developed since the previous edition of
the Handbook of Statistical Bioinformatics. This updated handbook is intended to
serve as both an introductory and reference monograph for students and researchers
who are interested in learning the state-of-the-art developments in computational
statistics as applied to computational biology.
This collection of articles from leading scholars in the field is primarily a
monograph that will be of interest to educational, academic, and professional
organizations of statisticians, computer scientists, and biological and biomedical
researchers with strong interests in computational biology. Although other
volumes on computational statistics and bioinformatics are available on the market,
few books focus, as this one does, on the interface between computational
statistics and cutting-edge developments in computational biology. This completely
updated collection aims to build that bridge. This handbook
covers many significant up-to-date topics in probabilistic and statistical modeling
as well as the analysis of massive data sets generated from modern biotechnology.
These methods and technologies will change the perspectives of biology, healthcare,
and medicine in the twenty-first century! This collection is an extended version
of the previous edited handbook. The advanced research topics cover statistical
methods for single-cell analysis, network analysis, and systems biology.
During the editing of this handbook, the world was upended by
the COVID-19 pandemic and other challenges. The editors
would like to thank the contributing authors, the Springer management team,
supporting colleagues, and family members for their incredible support and patience
during this challenging period, which made it possible for this handbook to reach
the related scholarly communities!
Hsinchu, Taiwan, ROC Henry Horng-Shing Lu
Tübingen, Germany Bernhard Schölkopf
Ithaca, NY, USA Martin T. Wells
New Haven, CT, USA Hongyu Zhao
May 8, 2022

Contents
Part I Single-Cell Analysis
Computational and Statistical Methods for Single-Cell RNA Sequencing Data
Zuoheng Wang and Xiting Yan
Pre-processing, Dimension Reduction, and Clustering for Single-Cell RNA-seq Data
Jialu Hu, Yiran Wang, Xiang Zhou, and Mengjie Chen
Integrative Analyses of Single-Cell Multi-Omics Data: A Review from a Statistical Perspective
Zhixiang Lin
Approaches to Marker Gene Identification from Single-Cell RNA-Sequencing Data
Ronnie Y. Li, Wenjing Ma, and Zhaohui S. Qin
Model-Based Clustering of Single-Cell Omics Data
Xinjun Wang, Haoran Hu, and Wei Chen
Deep Learning Methods for Single-Cell Omics Data
Jingshu Wang and Tianyu Chen
Part II Network Analysis
Probabilistic Graphical Models for Gene Regulatory Networks
Zhenwei Zhou, Xiaoyu Zhang, Peitao Wu, and Ching-Ti Liu
Additive Conditional Independence for Large and Complex Biological Structures
Kuang-Yao Lee, Bing Li, and Hongyu Zhao
Integration of Boolean and Bayesian Networks
Meng-Yuan Tsai and Henry Horng-Shing Lu
Computational Methods for Identifying MicroRNA-Gene Regulatory Modules
Yin Liu
Causal Inference in Biostatistics
Shasha Han and Xiao-Hua Zhou
Bayesian Balance Mediation Analysis in Microbiome Studies
Lu Huang and Hongzhe Li
Part III Systems Biology
Identifying Genetic Loci Associated with Complex Trait Variability
Jiacheng Miao and Qiongshi Lu
Cell Type-Specific Analysis for High-throughput Data
Ziyi Li and Hao Wu
Recent Development of Computational Methods in the Field of Epitranscriptomics
Zijie Zhang, Shun Liu, Chuan He, and Mengjie Chen
Estimation of Tumor Immune Signatures from Transcriptomics Data
Xiaoqing Yu
Cross-Linking Mass Spectrometry Data Analysis
Chen Zhou and Weichuan Yu
Cis-regulatory Element Frequency Modules and their Phase Transition across Hominidae
Lei M Li, Mengtian Li, and Liang Li
Improved Method for Rooting and Tip-Dating a Viral Phylogeny
Xuhua Xia

Computational and Statistical Methods
for Single-Cell RNA Sequencing Data
Zuoheng Wang and Xiting Yan
Z. Wang · X. Yan
Yale University, New Haven, CT, USA
e-mail: zuoheng.wang@yale.edu; xiting.yan@yale.edu
© The Author(s), under exclusive license to Springer-Verlag GmbH, DE, part of Springer Nature 2022
H. H.-S. Lu et al. (eds.), Handbook of Statistical Bioinformatics, Springer Handbooks of Computational Statistics, https://doi.org/10.1007/978-3-662-65902-1_1
Abstract In recent years, advances in droplet-based technology have boosted
the popularity of single-cell RNA sequencing (scRNA-seq) for
investigating transcriptomic and cell population composition changes in various
tissues and diseases. Despite the potential of these technologies for understanding
disease pathogenesis and developing novel personalized therapeutics, analyses
of the generated scRNA-seq data are challenging, mainly because of high noise
levels, prevalent dropout events, and heterogeneous sources of variation that confound
the phenotype of interest. In this chapter, we introduce these challenges in
analyses of scRNA-seq data and the corresponding computational and statistical
methods developed to address them. The topics include data preprocessing, data
normalization, dropout imputation, and differential expression analysis.
1 Introduction
Gene expression profiling measures levels of mRNA to understand transcriptomic
changes due to disease, treatment, environment, time, and so on. Traditional bulk
RNA gene expression profiling using microarrays and RNA sequencing pools RNAs
from a large population of cells consisting of various and often unknown cell types.
It measures the average expression profile in mixed cell populations with unknown
contributions from different cells or cell types. Thus, bulk RNA gene expression
data cannot precisely identify the cellular source of transcriptomic changes of
interest, especially when high cell-to-cell heterogeneity exists [1–5]. Investigating
transcriptomic changes at single-cell resolution poses two major challenges:
(1) isolating cells from each other without perturbing them so strongly that
systematic transcriptomic changes result and (2) amplifying the extremely low
amount of mRNA in each cell. To address these challenges, multiple types and
generations of single-cell transcriptomic technology, including single-cell qPCR [6–11],
single-cell microarray [12–14], and single-cell RNA sequencing [15–18], have
been developed. These technologies differ mainly in the steps of
single-cell capturing, cDNA amplification, and cDNA profiling. There are five main
ways to capture single cells: micropipetting micromanipulation, laser
capture microdissection, fluorescence-activated cell sorting (FACS), microfluidics,
and microdroplets [19]. The early-stage methods, micropipetting micromanipulation and
laser capture microdissection, capture a low number of cells, are time consuming,
and require microliter volumes of specimen. FACS and microfluidics can both
capture hundreds of cells and are fast, although microfluidics requires nanoliter
volumes. Microdroplets, the most popular cell capturing method, can capture the
largest number of cells (currently from thousands to tens of thousands), are fast, and
require nanoliter volumes. After cells are captured at single-cell resolution, mRNAs
are reverse transcribed into cDNAs and further amplified using PCR, which may
introduce amplification bias, that is, uneven amplification across different genes. Some
of the single-cell sequencing technologies reduce or remove this amplification bias
by in vitro transcription (IVT) or unique molecular identifiers (UMIs). The amplified
cDNAs are then profiled using qPCR, microarrays, or RNA sequencing, with
RNA sequencing being the most popular because it captures genes without bias
and avoids the nonspecific probe binding found in microarrays. Taken together, the
single-cell transcriptomics field has moved from capturing a few targeted genes
in fewer than 100 cells by single-cell qPCR to unbiased whole-transcriptome profiling in
hundreds of thousands of cells by droplet-based single-cell
RNA sequencing.
Instead of pooling RNAs from all cells together, droplet-based single-cell RNA
sequencing (scRNA-seq) technologies isolate cells in oil droplets and measure
transcriptome-wide mRNA expression levels in each single cell separately. Despite
differences in protocols, each scRNA-seq technology follows a similar basic
strategy. As an example, we show the workflow of the 10x Genomics Chromium
Single Cell 3′ v3 assay in Fig. 1. First, organ or tissue samples are processed to
generate a single-cell suspension in which cells are separated. Most of the time, this
process uses proteases to digest attachments between cells, especially
for solid tissues. Because the dissociation step can perturb different cell types
differently, tissue dissociation needs to be optimized to balance
releasing cell types that are difficult to dissociate against avoiding damage to fragile cell
types. Second, each cell is co-encapsulated with a distinctively barcoded microparticle
(bead) in an oil droplet. Ideally, each droplet contains only one cell and one bead.
Cells are lysed in the droplet. Third, mRNAs in each droplet are reverse transcribed into
full-length cDNAs, during which oligo primers on beads are ligated onto cDNAs.
Each oligo primer consists of a sequencing primer, a cell barcode, a UMI, and poly(dT).
The cell barcode is the same across all oligo primers on the same bead, but each UMI is
unique. As a result, cell barcodes and UMIs can identify the cell origin
and transcript origin of each cDNA, respectively. Finally, the full-length cDNAs
from different droplets are pooled, amplified, and enzymatically fragmented into smaller cDNA
inserts for sequencing. The fragmented inserts are cleaned up to keep only
those with oligo primers, which come from the 3′ end of cDNAs with poly(A) tails.
In each of these inserts, one end has the oligo primer, and the other end has sequence
from the cDNA template. Both ends are sequenced using a pair of sequencing reads
(Read 1 and Read 2). Read 1 sequences the cell barcode and UMI to determine the cell
origin and remove PCR duplicates. Read 2 measures the sequence content of a
small fragment of the transcript close to the 3′ end, which can be mapped to the human
genome to determine the gene origin of the mRNA. In this way, sequencing reads
can be demultiplexed into different cells and different transcripts to enable
single-cell transcriptome profiling and reduction of PCR amplification bias.
Fig. 1 Workflow of 10x Genomics Chromium 3′ v3 chip. (a) Structure of the oligo primer on the
gel beads. (b) For each GEM, steps to capture mRNAs with poly(A) tails and reverse transcribe
them into cDNAs for amplification. (c) Steps to fragment the amplified cDNAs into small pieces
and ligate the sample index, P5, and P7 for sequencing. (d) Structure of the final cDNA templates in
the library for sequencing. Read 1 and Read 2 are copies of the cDNA template from the corresponding
location, representing the cell barcode + UMI and a small fragment of cDNA from the 3′ end of the
transcript. Created with BioRender.com
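The role of Read 1 in demultiplexing can be illustrated with a minimal Python sketch. It assumes the 3′ v3 layout of a 16-base cell barcode followed by a 12-base UMI at the start of Read 1; the function names `split_read1` and `group_reads` are hypothetical and are not part of any pipeline discussed in this chapter.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# Assumed Read 1 layout (10x Genomics 3' v3 chemistry):
# 16-base cell barcode followed by a 12-base UMI.
CB_LEN = 16
UMI_LEN = 12


class Read1Tag(NamedTuple):
    cell_barcode: str
    umi: str


def split_read1(read1_seq: str, cb_len: int = CB_LEN,
                umi_len: int = UMI_LEN) -> Read1Tag:
    """Split a Read 1 sequence into its cell barcode and UMI."""
    if len(read1_seq) < cb_len + umi_len:
        raise ValueError("Read 1 is shorter than cell barcode + UMI")
    return Read1Tag(read1_seq[:cb_len], read1_seq[cb_len:cb_len + umi_len])


def group_reads(read_pairs: Iterable[Tuple[str, str]]) -> Dict[Read1Tag, List[str]]:
    """Group Read 2 sequences by the (cell barcode, UMI) tag found in Read 1."""
    groups: Dict[Read1Tag, List[str]] = {}
    for read1_seq, read2_seq in read_pairs:
        groups.setdefault(split_read1(read1_seq), []).append(read2_seq)
    return groups
```

Read 2 sequences that share a tag are candidates for PCR duplicates of one captured transcript, which is why the tag alone is enough to assign a read to a cell and a molecule before any mapping is done.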
To date, the scRNA-seq platforms on the market differ mainly in the total
number of captured cells, whether full-length cDNAs are profiled, and whether
UMI or IVT is used to reduce PCR amplification bias. Applications of scRNA-seq
technologies in different human diseases and tissues have revealed potential
disease-associated rare cell types, cell population composition changes, and cell
type-specific transcriptomic changes [20–24]. These scRNA-seq datasets also have
the potential to provide information on disease-associated in vivo cell-to-cell
communication in different tissues. Moreover, scRNA-seq technology has
served as a base for the development of single-nucleus RNA sequencing (snRNA-seq)
and spatial single-cell transcriptomic technology. snRNA-seq measures the
transcriptome in single nuclei, and spatial single-cell transcriptomics measures the
single-cell transcriptome together with the spatial location of each cell in intact tissue.
The extra information gained through these technologies has further boosted our
understanding of single-cell biology in challenging tissues, the spatial structure of
tissues at single-cell resolution, and in vivo cell-to-cell communication.
2 Data Preprocessing
Raw scRNA-seq data are sequencing reads in FASTQ or BAM files that
need to be preprocessed and quantified for downstream analysis. In this section,
we describe the preprocessing of scRNA-seq data from the 10x Genomics Chromium
platform, which is currently the most popular scRNA-seq platform. Preprocessing
of data from other scRNA-seq technologies follows the same principles with
small variations. Multiple tools have been developed to preprocess 10x Genomics
scRNA-seq data, including the Cell Ranger pipeline from 10x Genomics, STARsolo [25],
Alevin/Alevin-fry [26], Kallisto-bustools [27], UMI-tools [28], and zUMIs
[29]. Despite their differences, these methods share four key steps:
(1) reads mapping, (2) cell barcode demultiplexing with or without error
correction, (3) UMI deduplication with or without error correction, and (4) cell
barcode selection. These methods have been previously reviewed and compared
[30, 31]. The output of data preprocessing is a matrix of counts, in which rows are
genes, columns are cells, and each entry is the number of UMIs of the corresponding
gene in the corresponding cell.
2.1 Reads Mapping
The first key preprocessing step is to map the reads from cDNA templates back to the
target genome or transcriptome to identify their transcript origin. The existing
methods use two main types of aligners. The first type maps reads back to
the genome. STAR [32] is the most popular method of this type because of its high
mapping accuracy, its capacity to identify novel exons and construct splice
junction libraries from the data, and its two-pass mapping option for
more accurate mapping results. Other genome aligners used in the existing
scRNA-seq data preprocessing pipelines include BWA [33], TopHat2 [34], Subread
[35], and Bowtie2 [36]. The second type maps reads to the transcriptome instead
of the genome and includes RapMap [37] and kallisto [38], used in Alevin and
Kallisto-bustools, respectively. These mappers are lightweight and very efficient in both
memory usage and speed. However, their results strongly depend on the transcriptome
annotation: incomplete annotation of exons and splice junctions can
lead to inaccurate mapping results.
2.2 Cell Barcodes Demultiplexing
The second key step is to correct for sequencing errors in cell barcodes so that reads
with the same cell barcode can be assigned to the same cell. For the 10x Genomics
Chromium platform, a barcode whitelist is provided that contains all known
barcode sequences included in the assay kit. In the ideal scenario, all observed
cell barcodes can be compared to this list to split reads into different cells. However,
sequencing errors cause some observed cell barcodes to differ slightly from
the true cell barcodes. A common approach to correct these sequencing errors
is to treat cell barcodes within a given Hamming distance of each other as the same cell
barcode. The Hamming distance-based approach relies entirely on the number
of base differences between the observed cell barcodes and the whitelist barcodes,
which may be inaccurate because sequencing quality varies across the bases of
the observed cell barcodes. To address this, the Cell Ranger pipeline estimates
the posterior probability that an observed cell barcode originates from a given
whitelisted cell barcode based on the sequencing quality scores and the number of reads
exactly matching the whitelist barcode. For technologies without a
manufacturer-provided cell barcode whitelist, Alevin and STARsolo provide the option to run a
first pass of demultiplexing without cell barcode correction and use the uncorrected cell
barcodes from the first pass as the “whitelist” for a second pass of demultiplexing
with correction.
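As an illustration of the Hamming distance-based approach, the following Python sketch corrects an observed barcode against a whitelist. It is a simplified illustration, not the Cell Ranger implementation: it ignores sequencing quality scores, and the function names, the default distance of 1, and the rule of discarding ambiguous matches are assumptions made here.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str, whitelist: Iterable[str],
                    max_dist: int = 1) -> Optional[str]:
    """Return the whitelist barcode that the observed barcode is assigned to.

    An exact match is returned as is. Otherwise the observed barcode is
    assigned only if exactly one whitelist barcode lies within max_dist;
    barcodes with no match or several equally plausible matches yield None.
    """
    whitelist = set(whitelist)
    if observed in whitelist:
        return observed
    candidates = [wl for wl in whitelist if hamming(observed, wl) <= max_dist]
    return candidates[0] if len(candidates) == 1 else None
```

For example, with the toy whitelist `{"AAAA", "CCCC"}`, the observed barcode `"AAAT"` is assigned to `"AAAA"`, while `"AACC"` is two mismatches away from both entries and is left unassigned. A quality-aware method such as the posterior probability used by Cell Ranger would instead weigh each mismatch by how reliable the base call is.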
2.3 UMI Collapsing
The third key preprocessing step is to deduplicate UMIs with or without error
correction. Ideally, reads with the same UMI and cell barcode originate from the
same transcript and therefore should be counted as one single UMI. However, in real
data, sequencing errors (nucleotide substitutions, nucleotide miscalling, insertion,
deletion, and recombination) in both the UMI and the cDNA read have two consequences.
Reads originating from the same transcript can have slightly different UMIs, which
leads to an overestimated number of UMIs, and they can be mapped to different genes
or transcripts, which leads to multigene or multi-transcript UMIs. To correct for
errors in UMIs, considering that miscalling during sequencing is the most prevalent
error, both zUMIs and the Cell Ranger pipeline collapse UMIs within a given
Hamming distance. UMI-tools [28] implemented two previously proposed methods,
unique and percentile, and developed three network-based UMI error correction
methods: cluster, adjacency, and directional. The directional method was
shown to have the highest accuracy and robustness in both simulated data and real
data. STARsolo provides options to use either the Hamming distance or the directional
method for UMI error correction. Kallisto-bustools reported a low percentage of reads
recovered by UMI error correction (0.5% and 0.6% for 10-base-pair and 12-base-pair
UMIs, respectively) and therefore does not perform UMI error correction. Alevin
does not correct for errors in UMIs either. To resolve the multigene UMIs, Alevin
uses transcript-level information and a parsimonious UMI graph (PUG) to find
a minimal set of transcripts that covers the PUG and to split the multigene or
multi-transcript UMIs. Both STARsolo and the Cell Ranger pipeline compare the number
of reads supporting each of the genes a UMI is associated with and keep the
gene with the largest number of supporting reads. In addition, STARsolo provides
options to filter out all multigene UMIs, to distribute the multigene
UMIs uniformly to all genes, to distribute multigene UMIs among all genes using maximum
likelihood estimation (MLE) that considers other UMIs from the same cell, and to
distribute multigene UMIs to their gene set proportionally to the sum of the number
of unique-gene UMIs and uniformly distributed multigene UMIs in each gene.
Kallisto-bustools performs naïve collapsing based on its report of a low percentage
of lost counts (0.4% and 0.17% for the 10xv2 and 10xv3 datasets, respectively).
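The idea behind network-based UMI error correction can be sketched in Python as follows. The sketch follows the directional rule as it is commonly described for UMI-tools: UMI a is linked to UMI b when they differ at one position and count(a) ≥ 2·count(b) − 1, and every network grown from a high-count UMI is counted as one molecule. The chapter does not spell out this rule, so treat the threshold, the function names, and the tie-breaking order as assumptions rather than a description of the UMI-tools code.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts: Dict[str, int]) -> List[List[str]]:
    """Group the UMIs of one gene in one cell with a directional network.

    A directed edge a -> b exists when the UMIs differ at one position and
    count(a) >= 2 * count(b) - 1, so that b is plausibly an error copy of a.
    Groups are grown from the most abundant unassigned UMI.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    edges = {
        a: [b for b in umis
            if b != a and hamming(a, b) == 1
            and umi_counts[a] >= 2 * umi_counts[b] - 1]
        for a in umis
    }
    assigned = set()
    groups = []
    for seed in umis:
        if seed in assigned:
            continue
        group, stack = [seed], [seed]
        assigned.add(seed)
        while stack:  # depth-first walk along directed edges
            for nxt in edges[stack.pop()]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    group.append(nxt)
                    stack.append(nxt)
        groups.append(group)
    return groups


def count_molecules(umi_counts: Dict[str, int]) -> int:
    """Deduplicated UMI count: one molecule per directional group."""
    return len(directional_groups(umi_counts))
```

With read counts `{"ACGT": 100, "ACGG": 60, "TTTT": 50, "ACGA": 3}`, the low-count `ACGA` is absorbed into the `ACGT` group, whereas `ACGG` is too abundant to be explained as an error copy of `ACGT` and stays separate, giving three molecules. Plain Hamming-distance collapsing would have merged all three `ACG*` UMIs into one.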
2.4 Cell Barcodes Selection
The previous key steps generate the raw gene × cell barcode count matrix, which
includes cell barcodes from empty droplets containing ambient RNAs as well as
target cells. Since empty droplets usually have significantly lower RNA content,
Cell Ranger pipeline v2.2 [17] simply kept cell barcodes with a total number of
UMIs (nUMI) higher than 10% of the robust maximum count, defined as the 99th
percentile of the largest N UMI counts, where N is the expected number of cells to be
captured in the experiment. This approach is similar to the knee-point thresholding
approach [18], which searches for an inflection point or “knee” in the cumulative
frequency of total nUMI per barcode and filters out barcodes with nUMI lower than
the identified knee point. The most recent and popular method is the EmptyDrops
approach [39]. It estimates the profile of cell barcodes containing ambient RNAs and
tests each cell barcode for deviations from the estimated profile using a
Dirichlet-multinomial model of UMI count sampling. Barcodes with significant deviations
are considered cells and included for downstream analysis. This approach
allows inclusion of cells with low total RNA content and thus small total nUMIs.
Different versions of the Cell Ranger pipeline provide different cell barcode selection
approaches but together cover all three methods described above. STARsolo provides
options for both the Cell Ranger v2.2 approach and EmptyDrops. Alevin applies
the knee-based approach at the beginning of the pipeline and a naïve Bayes classifier
[40] to differentiate between high- and low-quality cells at the end of the pipeline.
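The Cell Ranger v2.2 rule described above reduces to a few lines of arithmetic. The Python sketch below follows the description in the text (10% of the 99th percentile of the N largest per-barcode totals); the function names and the linear-interpolation percentile are choices made here for illustration, and the actual pipeline may compute the percentile differently.

```python
from typing import List, Sequence, Tuple


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def select_barcodes(total_umis: Sequence[float], expected_cells: int,
                    pct: float = 99.0, frac: float = 0.10
                    ) -> Tuple[List[bool], float]:
    """Flag barcodes whose total nUMI exceeds frac * robust maximum.

    The robust maximum is the pct-th percentile of the expected_cells
    largest per-barcode totals.
    """
    top_n = sorted(total_umis, reverse=True)[:expected_cells]
    threshold = frac * percentile(top_n, pct)
    return [t > threshold for t in total_umis], threshold
```

For a toy experiment with 10 expected cells, 10 barcodes carrying 1000 UMIs each, and 90 barcodes carrying 5 UMIs each, the robust maximum is 1000, the threshold is 100, and only the 10 high-count barcodes are kept. A small cell with, say, 80 UMIs would be discarded by this rule, which is the limitation that EmptyDrops addresses.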
2.5 Summary
In general, STARsolo provides the most comprehensive set of options for
implementing different approaches at each key data preprocessing step and can be applied to
data generated by different platforms. STARsolo also provides the flexibility of
using exonic reads only (gene), exonic and intronic reads together (pre-mRNA),
or annotated and novel splice junctions. Together, these options have so far made STARsolo
the most widely used scRNA-seq data preprocessing pipeline alongside the Cell
Ranger pipelines.
3 Data Normalization and Visualization
3.1 Background
Preprocessing of scRNA-seq data generates a matrix of nUMIs, in which rows are
genes, columns are cells, and each entry is the number of UMIs of each gene in each
cell. The UMI counts of the same gene from different cells are not directly comparable
because of cell-to-cell technical variation associated with different technical factors,
including sequencing depth, cell lysis, reverse transcription efficiency, molecular
sampling during sequencing, and so on. Although the use of UMIs removes
variation associated with PCR amplification bias, substantial technical
variation remains in the UMI count data and needs to be corrected before downstream
analysis. Normalization is therefore critically important for scRNA-seq data
analysis because it makes the data from different cells and samples comparable. It was
also shown to have a larger impact on the performance of downstream analyses than
data preprocessing or the choice of downstream analytical method [41].
Many different normalization methods have been developed, which can be roughly
divided into two groups: global scaling normalization approaches and probabilistic
model-based normalization approaches.
3.2 Global Scaling Normalization for UMI Data
The global scaling normalization methods estimate a global “size factor” to
represent the technical variation in each cell. The UMI counts of all genes in each
cell are then divided by the estimated size factor for that cell to scale the data
for normalization. Note that the nUMIs of different genes in the same cell are scaled by
the same factor. Many normalization methods designed for bulk RNA sequencing
data, including TPM, TMM, DESeq2, and edgeR, fit well into this
category and therefore have been used in some scRNA-seq studies. The other global
scaling methods, designed for scRNA-seq data, include library size normalization,
BASiCS [42], scran [43], census [44], and PsiNorm [45], among which BASiCS is
the only method requiring spike-in controls.
To better explain different normalization methods, we use the uniform notation
as follows. Suppose in total, there are $q$ genes measured in $n$ cells. Let $X_{ij}$ denote the
nUMI of gene $i$ in cell $j$ as a random variable and $x_{ij}$ denote the observed realization
of $X_{ij}$. The library size normalization scales the nUMI of all genes by dividing them
by the total number of UMIs $L_j = \sum_{i=1}^{q} X_{ij}$ in cell $j$. This approach is one of the
most used methods and is implemented in the Seurat R package [46].
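Library size normalization is simple enough to write out directly. The Python sketch below divides each entry of a genes × cells count matrix by the column total of its cell; the `scale` argument, which multiplies the normalized values by a constant, is a convenience added here and is not part of the definition above.

```python
from typing import List, Sequence, Tuple


def library_size_normalize(counts: Sequence[Sequence[float]],
                           scale: float = 1.0
                           ) -> Tuple[List[List[float]], List[float]]:
    """Divide each UMI count by the library size of its cell.

    counts[i][j] is the observed nUMI of gene i in cell j. Returns the
    normalized matrix and the library sizes (sum over genes for each cell).
    """
    n_cells = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    if any(size == 0 for size in lib_sizes):
        raise ValueError("a cell with zero total UMIs cannot be normalized")
    normalized = [
        [scale * row[j] / lib_sizes[j] for j in range(n_cells)]
        for row in counts
    ]
    return normalized, lib_sizes
```

For the two-gene, two-cell matrix `[[1, 0], [3, 10]]`, the library sizes are 4 and 10, and the normalized matrix is `[[0.25, 0.0], [0.75, 1.0]]`: every column sums to 1, so cells sequenced to different depths become comparable.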
BASiCS requires the data to have nonbiological spike-in genes, which are added
into the lysis buffer at known concentration levels and therefore present at the same
level in every cell. These spike-in genes provide information for BASiCS to quantify
technical variation and separate it from the biological variation in the data. Suppose
the first q0 (i = 1, · · · , q0) genes are biological genes and the remaining genes
(i = q0 + 1, · · · , q) are spike-in controls. BASiCS models the UMI counts in each
cell j using the following hierarchical model:
Xij | μi, vj ∼

Poisson

φj vj μiρij

, i = 1, · · · , q0
Poisson

vj μi

, i = q0 + 1, · · · , q
(1)
with vj | sj , θ ∼ Gamma

1
θ
,
1
sj θ

, ρij | δi ∼ Gamma

1
δi
,
1
δi

,
where μi is the true normalized expression level of gene i in the cells, φj represents
the differences in total mRNA content of the cells, and vj and ρij are independent
random effects: vj represents the cell-to-cell technical variability, with a mean of sj
and a variance of $s_j^2 \theta$, and ρij represents the gene-specific biological
cell-to-cell variability, with a mean of 1 and a variance of δi. Because μq0+1, · · · , μq
are known from the spike-in genes' experimental design, the sj's can be identified. The
δi's and θ can also be identified based on the variance of the biological and technical
expression counts.
However, because the scale of the φj's is arbitrary, a restriction is needed to make the
model identifiable. This can be done by assuming that
$n^{-1} \sum_{j=1}^{n} \phi_j = \phi_0$ or by reparametrizing the model in terms of
κ1, · · · , κn so that
\[
\phi_j = \phi_0 \frac{e^{\kappa_j}}{\sum_{l=1}^{n} e^{\kappa_l}}, \qquad \kappa_1 = 0. \tag{2}
\]
All model parameters are assumed to have independent priors, with a flat
non-informative prior for the normalized expression levels μ1, · · · , μq0 and conjugate
informative priors for all other model parameters, including the sj's, θ, the δi's, and
the κj's.

Bayesian inference is implemented using an adaptive Metropolis (AM) within Gibbs
sampling (GS) algorithm. The estimations of φj and sj are eventually used to
calculate the scaling factor for cell j.
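The means and variances quoted for the random effects can be checked directly. The short derivation below assumes that Gamma(a, b) in Eq. (1) denotes shape a and rate b, which is the reading consistent with the moments stated in the text; the marginal moments for the spike-in genes then follow from the laws of total expectation and total variance.

```latex
% Gamma(a, b) with shape a and rate b has mean a/b and variance a/b^2.
\begin{align*}
E(v_j) &= \frac{1/\theta}{1/(s_j\theta)} = s_j, &
\mathrm{Var}(v_j) &= \frac{1/\theta}{1/(s_j\theta)^2} = s_j^2\,\theta, \\
E(\rho_{ij}) &= \frac{1/\delta_i}{1/\delta_i} = 1, &
\mathrm{Var}(\rho_{ij}) &= \frac{1/\delta_i}{1/\delta_i^2} = \delta_i .
\end{align*}
% Spike-in genes (i = q_0 + 1, ..., q): X_{ij} | v_j ~ Poisson(v_j \mu_i), so
\begin{align*}
E(X_{ij}) &= s_j \mu_i, &
\mathrm{Var}(X_{ij}) &= E(v_j \mu_i) + \mathrm{Var}(v_j \mu_i)
                     = s_j \mu_i + \mu_i^2 s_j^2\,\theta .
\end{align*}
```

Because the spike-in μi are known, the mean identifies sj and the excess of the variance over the mean identifies θ, which is the identification argument given above.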
Scran first generates pools of cells to calculate the pool-based size factors,
which are then deconvoluted to yield the cell-based size factors to scale the data.
Scran assumes that E(Xij) = θjλi0, where θj describes cell-specific bias and λi0
is the expected UMI count of gene i. So θj can serve as the scaling size factor
for cell j. Define $Z_{ij} = X_{ij}/t_j$, where $t_j$ is the library size of cell j. We
have that $E(Z_{ij}) = \theta_j \lambda_{i0}/t_j$. Consider an arbitrary set of cells
$S_k$. Define $V_{ik} = \sum_{j \in S_k} Z_{ij}$, so we have
$E(V_{ik}) = \lambda_{i0} \sum_{j \in S_k} \theta_j t_j^{-1}$. Also define
$U_i = n^{-1} \sum_{j=1}^{n} Z_{ij}$, so we have
$E(U_i) = \lambda_{i0}\, n^{-1} \sum_{j=1}^{n} \theta_j t_j^{-1}$. Define
$R_{ik} = V_{ik}/U_i$, so we have
\[
E(R_{ik}) \approx \frac{E(V_{ik})}{E(U_i)}
= \frac{\sum_{j \in S_k} \theta_j t_j^{-1}}{n^{-1} \sum_{j=1}^{n} \theta_j t_j^{-1}}
= \frac{\sum_{j \in S_k} \theta_j t_j^{-1}}{C}
\tag{3}
\]
where $C = n^{-1} \sum_{j=1}^{n} \theta_j t_j^{-1}$ is a constant independent of genes,
cells, or pool of cells Sk and can be set to 1 since it does not affect the differences in
size factor θj. Denote the realizations of Vik, Ui, and Rik as vik, ui, and rik. Based on
Eq. (3), we have that $r_{ik} = \sum_{j \in S_k} \theta_j t_j^{-1}$ for each Sk. By
constructing different pools of cells, we can have an overdetermined system of linear
equations in which $\theta_j t_j^{-1}$ for cell j is represented at least once. This
cell pool construction is achieved by ordering cells by their total nUMI and
dividing them into two groups with odd and even ranks, respectively. The cells
are arranged in a ring, with odd-ranked cells on the left and even-ranked cells on
the right. Starting at the 12 o'clock position on the ring, a sliding window of a given
size moves clockwise cell by cell around the ring so that each window contains the same
number of cells. The cells in each window define one pool Sk. This construction
yields pools of cells with similar library sizes, which provides robustness to
estimation errors for small $\theta_j t_j^{-1}$. Although the estimation steps of
scran seem circuitous, summing across the cells in each pool reduces the number of
stochastic zeros that cause problems in some other existing normalization methods.
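The ring-based pool construction can be sketched as follows. This is a simplified illustration, not the scran implementation: the function names are hypothetical, and the choice to place odd-ranked cells in ascending order and even-ranked cells in descending order (so that neighbors on the ring have similar library sizes) is an assumption about how the two halves are laid out.

```python
def ring_pools(library_sizes, window):
    """Return one pool of cell indices per starting position on the ring."""
    # Rank cells by library size (total nUMI), smallest first.
    order = sorted(range(len(library_sizes)), key=lambda j: library_sizes[j])
    odd_ranked = order[0::2]    # ranks 1, 3, 5, ...
    even_ranked = order[1::2]   # ranks 2, 4, 6, ...
    # Assumed layout: one half ascending, the other descending, so that
    # adjacent cells on the ring have similar library sizes.
    ring = odd_ranked + even_ranked[::-1]
    n = len(ring)
    # Slide a window of fixed size around the ring, one cell at a time.
    return [[ring[(start + offset) % n] for offset in range(window)]
            for start in range(n)]


def pool_design_matrix(pools, n_cells):
    """0/1 matrix A with A[k][j] = 1 if cell j belongs to pool k.

    Each row corresponds to one linear equation r_k = sum_j A[k][j] * theta_j / t_j.
    """
    return [[1 if j in set(pool) else 0 for j in range(n_cells)] for pool in pools]


library_sizes = [500, 120, 900, 300, 700, 250]
pools = ring_pools(library_sizes, window=3)
design = pool_design_matrix(pools, n_cells=len(library_sizes))
```

With n cells and one window per starting position, every cell appears in exactly `window` pools, which gives the overdetermined linear system described above once several window sizes are combined.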
Census considers the relative abundance of genes on the TPM scale. The
generative model of scRNA-seq predicts that when only a small portion of the
transcripts in a cell is captured, the signal from most detectable genes will originate
from a single mRNA. Therefore, the TPMs of these genes will be very similar. Based on
this prediction, Census first identifies the TPM value $x_j^*$ defined by the mode of the
log-transformed TPM distribution for cell j. Genes with detectable TPM smaller than
$x_j^*$ correspond to genes whose signal originates from a single transcript. Therefore,
the total number of mRNAs captured for cell j is calculated as
\[
M_j = \frac{1}{\theta} \cdot \frac{n_j}{F_{X_j}\left(x_j^*\right) - F_{X_j}(\epsilon)}
\tag{4}
\]
where θ is the expected number of cDNA molecules generated from each RNA
molecule or simply the capture rate, $F_{X_j}$ is the cumulative distribution function of
TPMs in cell j, $\epsilon$ is a TPM value below which no mRNA is believed to be present
(default = 0.1), and $n_j$ is the number of genes with TPM between $\epsilon$ and
$x_j^*$. The capture rate θ is unknown a priori; it is highly protocol dependent and has
little dependence on cell type or state. Based on estimations from existing data with
spike-in controls, Census sets θ = 0.25 by default. Mj is then taken as the
scaling size factor for cell j, and the Census normalized count for gene i in cell j is
\[
\hat{Y}_{ij} = \mathrm{TPM}_{ij} \cdot \frac{M_j}{10^6}
\tag{5}
\]
SCnorm does not model the cell-specific technical variations. It directly estimates
the relationship between the observed un-normalized UMI counts and sequencing
depth using quantile regression. Let $S_j$ denote the log sequencing depth of cell j and
$Y_{ij}$ denote the log nonzero UMI count for gene i in cell j. SCnorm divides the genes
into K groups with substantially different UMI count-depth relationships. Within
each group k, the overall relationship between log un-normalized UMI count and log
sequencing depth for all genes is estimated via the following quantile regression:
\[
Q^{\tau_k, d_k}\left(Y_j \mid S_j\right)
= \beta_0^{\tau_k} + \beta_1^{\tau_k} S_j + \cdots + \beta_{d_k}^{\tau_k} S_j^{d_k},
\]
where $\tau_k$ and $d_k$ are chosen to minimize
$\left| \hat{\eta}_1^{\tau_k} - \mathrm{mode}_g\, \hat{\beta}_{g,1} \right|$. Here
$\mathrm{mode}_g\, \hat{\beta}_{g,1}$ is the mode of the gene-specific count-depth slopes
in the group, and $\hat{\eta}_1^{\tau_k}$ describes the UMI count-depth relationship of
the predicted expression values, estimated by median quantile regression using a
first-degree polynomial:
\[
Q^{0.5}\left(\hat{Y}_j^{\tau_k} \mid S_j\right) = \eta_0^{\tau_k} + \eta_1^{\tau_k} S_j .
\]
The scaling factor for cell j is then defined as
\[
SF_j = \frac{e^{\hat{Y}_j^{\tau_k, d_k}}}{e^{Y_{\tau_k}}},
\]
where $Y_{\tau_k}$ is the $\tau_k$th quantile of expression counts in the kth group of
genes. The normalized count of gene i in cell j is given by $Y'_{ij} = X_{ij}/SF_j$.
PsiNorm assumes that the UMI count follows the Pareto distribution, based on
which the PsiNorm normalized counts of cell j are
\[
\tilde{x}_j = x_j \cdot \frac{q}{\sum_{i=1}^{q} \log\left(x_{ij} + 1\right)} .
\]
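The PsiNorm scaling factor is simple enough to compute by hand. The sketch below follows the formula above for a single cell; the function names are illustrative, and the guard for an all-zero cell is an added assumption. The factor has the same form as the maximum likelihood estimate of the Pareto shape parameter with minimum value 1, applied to the shifted counts x + 1.

```python
import math


def psinorm_factor(cell_counts):
    """Scaling factor q / sum_i log(x_ij + 1) for one cell j."""
    q = len(cell_counts)
    log_sum = sum(math.log(x + 1) for x in cell_counts)
    if log_sum == 0:
        # Assumption: a cell with no counts cannot be normalized.
        raise ValueError("cell has no nonzero counts")
    return q / log_sum


def psinorm(cell_counts):
    """Multiply every count in the cell by the PsiNorm scaling factor."""
    factor = psinorm_factor(cell_counts)
    return [x * factor for x in cell_counts]


# Toy cell with q = 4 genes.
normalized_cell = psinorm([0, 3, 1, 12])
```

Cells with many large counts receive a smaller factor, so their counts are shrunk relative to sparsely sequenced cells.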
In general, global scaling normalization methods are computationally efficient and
highly scalable. However, they assume that the technical variations are cell-specific
and uniform across different genes. Although UMI-based protocols in principle
remove biases due to PCR amplification and sequencing depth, this holds
only if all the cDNAs are sequenced, that is, if sequencing reaches saturation.
When the sequencing is not saturated, some UMI-tagged transcripts will be lost,

and systematic differences in nUMI caused by these lost transcripts will emerge. In
addition, the UMI tags are added onto the cDNAs during reverse transcription,
so they cannot address differences in capture efficiency before reverse
transcription or differences in the amount of mRNA content.
3.3 Probabilistic Model-Based Normalization for UMI Data
Normalization methods in this section build probabilistic models for the observed
UMI count data, which account for gene-specific technical variations and the high
sparsity of scRNA-seq UMI count data. The most popular distribution used by these
methods is the negative binomial distribution, which can accommodate the
overdispersion in the data. There are mainly two methods in this group: sctransform
[47] and ZINB-WaVE [48]. For notation, Xij denotes the nUMI of gene
i in cell j as a random variable and xij denotes the realization of Xij. In total, we
assume that there are q genes measured in n cells.
Sctransform assumes that the UMI counts of each gene follow a negative
binomial (NB) distribution NB(μi, θi), for which the log-transformed mean is
a linear function of the log sequencing depth:
\[
\log\left(E(x_i)\right) = \beta_{0i} + \beta_{1i} \log(m),
\]
where xi is the expression of gene i in all cells and m is the vector of sequencing
depths for all cells. Fitting this model for each gene separately results in
overfitting. So after fitting this model, sctransform estimates the relationship between
the estimated model parameter values and the mean gene expression across all
genes using kernel regression. Based on the kernel regression curve, the model
parameter estimates are then regularized and re-estimated. Let β̂0, β̂1, and θ̂ be
the regularized estimates; the normalized UMI counts are calculated as
\[
z_{ij} = \frac{x_{ij} - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}},
\]
where $\hat{\mu}_{ij} = \exp\left(\hat{\beta}_{0i} + \hat{\beta}_{1i} \log(m_j)\right)$ and
$\hat{\sigma}_{ij} = \sqrt{\hat{\mu}_{ij} + \hat{\mu}_{ij}^2 / \hat{\theta}_i}$.
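To make the residual formula concrete, the sketch below computes a single normalized value from given regularized parameters. It is not the sctransform package code: the function name and the example parameter values are illustrative assumptions, and σ̂ij is taken to be the NB standard deviation, that is, the square root of μ̂ij + μ̂ij²/θ̂i.

```python
import math


def pearson_residual(x, depth, beta0, beta1, theta):
    """Pearson residual z = (x - mu) / sigma under an NB model.

    mu    = exp(beta0 + beta1 * log(depth))   (expected count at this depth)
    sigma = sqrt(mu + mu^2 / theta)           (NB standard deviation)
    """
    mu = math.exp(beta0 + beta1 * math.log(depth))
    sigma = math.sqrt(mu + mu * mu / theta)
    return (x - mu) / sigma


# Hypothetical regularized parameters for one gene: the expected count is
# 0.001 * depth, so a cell with 2000 UMIs is expected to show 2 UMIs of this gene.
z = pearson_residual(x=6, depth=2000, beta0=math.log(0.001), beta1=1.0, theta=2.0)
```

Here μ̂ = 2 and σ̂ = sqrt(2 + 4/2) = 2, so an observed count of 6 gives a residual of about 2, meaning the gene is roughly two standard deviations above what sequencing depth alone would predict.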
Due to the low amount of RNAs in a single cell and the low sequencing depth per
cell, some genes, especially the lowly expressed genes, may fail to be detected even
if they are being expressed in the cell. This causes an excessive number of zeroes
in the UMI count data and challenges in removing unwanted technical variations
in the data. ZINB-WaVE [48] models the UMI count using a zero-inflated negative
binomial distribution:
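\[
f_{\mathrm{ZINB}}\left(x_{ij}; \mu_{ij}, \theta_{ij}, \pi_{ij}\right)
= \pi_{ij}\, \delta_0\left(x_{ij}\right)
+ \left(1 - \pi_{ij}\right) f_{\mathrm{NB}}\left(x_{ij}; \mu_{ij}, \theta_{ij}\right),
\tag{6}
\]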

where πij is the probability that the observed count is a zero rather than the actual
count, δ0(x) is an indicator function of whether x is zero, and μij and θij are the
mean and dispersion of the negative binomial distribution describing the actual
count distribution. To account for various technical and biological effects, ZINB-WaVE
considers the following regression models:
\[
\ln\left(\mu_{ij}\right) = \left(C_\mu \beta_\mu + \left(V \gamma_\mu\right)^T + W \alpha_\mu + O_\mu\right)_{ij},
\]
\[
\mathrm{logit}\left(\pi_{ij}\right) = \left(C_\pi \beta_\pi + \left(V \gamma_\pi\right)^T + W \alpha_\pi + O_\pi\right)_{ij},
\]
\[
\ln\left(\theta_{ij}\right) = \zeta_i,
\]
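where Cμ and Cπ are known n × M matrices representing M cell-level covariates,
V is a known q × L matrix representing L gene-specific covariates, W
is an unobserved n × K matrix representing K unknown cell-level covariates, Oμ
and Oπ are known n × q matrices of offsets, and ζi is the gene-specific dispersion
parameter. The parameters of this model are inferred by maximizing the following
penalized likelihood function to reduce overfitting:
\[
\max_{\beta, \gamma, W, \alpha, \zeta}
\left\{ l(\beta, \gamma, W, \alpha, \zeta)
- \frac{\epsilon_\beta}{2} \left\| \beta^0 \right\|^2
- \frac{\epsilon_\gamma}{2} \left\| \gamma^0 \right\|^2
- \frac{\epsilon_W}{2} \left\| W \right\|^2
- \frac{\epsilon_\alpha}{2} \left\| \alpha \right\|^2
- \frac{\epsilon_\zeta}{2} \mathrm{var}(\zeta) \right\},
\]
where l(β, γ, W, α, ζ) is the log-likelihood function of the model in Eq. (6), the
$\epsilon$'s are nonnegative regularization parameters, $\beta^0$ contains the
coefficients for the columns of Cμ and Cπ that are not constant columns of ones
($\gamma^0$ is defined analogously), and $\|\cdot\|$ is the Frobenius matrix norm.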
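The mixture density in Eq. (6) can be evaluated directly. The following sketch is not taken from the ZINB-WaVE package; the function names are illustrative, and it assumes the NB parametrization with mean μ and variance μ + μ²/θ, which matches the variance used for sctransform earlier in this section.

```python
import math


def nb_pmf(x, mu, theta):
    """NB probability of count x, with mean mu > 0 and variance mu + mu^2 / theta."""
    log_p = (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
             + theta * math.log(theta / (theta + mu))
             + x * math.log(mu / (theta + mu)))
    return math.exp(log_p)


def zinb_pmf(x, mu, theta, pi):
    """Zero-inflated NB: a point mass at zero with weight pi, NB otherwise."""
    point_mass = pi if x == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(x, mu, theta)


# Probability of observing a zero with and without zero inflation.
p_zero_nb = nb_pmf(0, mu=2.0, theta=1.5)
p_zero_zinb = zinb_pmf(0, mu=2.0, theta=1.5, pi=0.3)
```

A zero can arise from either component, which is why the probability of a zero under the ZINB model exceeds that of the plain NB with the same μ and θ, while the mean shrinks to (1 − π)μ.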
3.4 Dimension Reduction and Cell Clustering
Normalized scRNA-seq data serve as the basis for many downstream analyses. The first
two analyses, which are essential in practice, are dimension reduction and
unsupervised clustering of cells. Dimension reduction reduces the noise level and helps
identify outliers and understand systematic differences and variations in the data.
Unsupervised clustering of cells helps identify groups of cells that are potentially
different cell types or even cell subtypes within a given cell type.
Principal component analysis (PCA) has been successfully and commonly used
for dimension reduction in microarray expression data, bulk RNA-seq data, and
genome-wide genotyping data. However, multiple studies [49, 50] have shown that PCA
performs poorly when applied to scRNA-seq data, possibly due to the linear
nature of PCA, the excessive number of zeroes, and the high technical and biological
variations in the data. Multiple methods have been developed and designed for
dimension reduction in scRNA-seq data, including canonical correlation analysis
(CCA) [51], independent components analysis (ICA) [52], Laplacian eigenmaps
[53, 54], t-distributed stochastic neighbor embedding (t-SNE) [18, 55, 56], and
uniform manifold approximation and projection (UMAP) [57–60]. Among these
methods, t-SNE and UMAP are the most popular, with UMAP better preserving
global distances and t-SNE preserving local distances. Both methods distort
distances because they represent the data in only two or three dimensions [61].
In common practice, highly variable genes are selected, and PCA is conducted on
these genes to select the top PCs that explain substantial variation. t-SNE or UMAP
is then applied to the PCA pre-conditioned data to reduce the dimension for data
visualization. Note that the dimension reduction discussed here is only for data
visualization and cell clustering. Many of the downstream analyses, including data
imputation and differential expression analysis, are still conducted on normalized
data or even the un-normalized UMI count data with their original dimensions.
Unsupervised clustering of cells is usually conducted on the reduced dimensional
space or by using highly variable genes. Different types of clustering methods
have been applied to scRNA-seq data, including the traditional k-means clustering
and hierarchical clustering. These traditional unsupervised clustering methods are
limited when applied to scRNA-seq data because of their poor scalability to the total
number of cells in terms of computational time and memory, their sensitivity to
outlying cells or cell clusters, and their bias toward equal-sized clusters, which
merges rare cell types into larger clusters [62, 63]. Other types of methods have been
developed to address these issues, including mixture model-based clustering,
density-based clustering, neural network clustering, and affinity propagation
clustering [63]. Among the alternatives, community detection-based approaches have
gained the most popularity due to their scalability and robustness to noise in the data.
Instead of clustering cells close to each other based on a chosen distance, community
detection identifies groups of cells that are densely connected in a k-nearest-neighbors
graph constructed using the PCA-reduced dimensional space or highly variable
genes. The number of clusters is affected by the number of nearest neighbors in the
constructed k-nearest-neighbors graph and, indirectly, by resolution parameters. Although
the Louvain algorithm [46, 64, 65] is currently the most widely used approach
for scRNA-seq data, there are many other community detection approaches [66]
available, and some of them have demonstrated better performance in benchmarking
studies [67, 68].
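The graph-construction step that precedes community detection can be illustrated with a small sketch. It builds an undirected k-nearest-neighbors graph from low-dimensional coordinates (for example, top PCs). Connected components are used here only as a crude stand-in for a community detection algorithm such as Louvain, which is not implemented; all function names are illustrative assumptions.

```python
import math


def knn_graph(points, k):
    """Undirected edges (a, b), a < b, linking each point to its k nearest neighbors."""
    edges = set()
    for a, pa in enumerate(points):
        dists = sorted((math.dist(pa, pb), b) for b, pb in enumerate(points) if b != a)
        for _, b in dists[:k]:
            edges.add((min(a, b), max(a, b)))
    return edges


def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components (stand-in for Louvain)."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Two well-separated groups of three "cells" in a 2-D reduced space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups_k2 = connected_components(len(points), knn_graph(points, k=2))
groups_k3 = connected_components(len(points), knn_graph(points, k=3))
```

With k = 2 the two groups stay disconnected, whereas with k = 3 each cell must reach into the other group and the graph becomes a single component, which illustrates how the number of neighbors influences the number of clusters found.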

4 Dropout Imputation
4.1 Background
Analysis of scRNA-seq data can be challenging due to low library size, high
technical noise, and prevalent dropout events [49, 69, 70]. In scRNA-seq data, due to
the tiny amount of mRNA in each cell, some mRNAs may be missed entirely during
the reverse transcription and cDNA amplification steps and thus cannot be detected in
the sequencing step. This phenomenon is referred to as a dropout event, in which
a given gene is observed at a moderate expression level in one cell but is not
detected in another cell of the same type from the same sample. Dropout events
increase the sparsity of single-cell data, especially for genes with low or moderate
expression [71]. The observed zero values can thus reflect either biological variation
in actual expression levels among cells or technical imperfection in measuring small
numbers of molecules. Dropouts lead to inaccurate assessment of gene expression
levels that may mislead downstream analyses such as cell clustering, differential
expression analysis, and cell trajectory inference [72]. To alleviate the increased
sparsity observed in scRNA-seq data, many data imputation methods have been
developed and compared [73, 74]. They can be classified into four categories [75].
4.2 Cell-Cell Similarity-Based Imputation
The first category of methods evaluates cell-cell similarities and imputes dropouts
in each cell using information from cells that are similar to the cell to be imputed,
including kNN-smoothing [76], MAGIC [77], scImpute [78], drImpute [79], and
VIPER [80]. Specifically, kNN-smoothing imputes dropouts by aggregating infor-
mation from the k closest neighboring cells of each cell using the stepwise k-nearest
neighbors approach [76]. MAGIC constructs a cell-cell affinity matrix based on their
expression profiles across genes and diffuses the gene expression values in cells
with similar expression profiles for imputation [77]. scImpute infers dropout events
based on the dropout probability estimated from a Gamma-Gaussian mixture model
and only imputes these events by combining information from similar cells within
cell clusters identified by spectral clustering [78]. drImpute defines similar cells
using k-means clustering and performs imputation by averaging the gene expression
values in cells within the same cluster [79]. While improving the quality of scRNA-seq
data to some extent, the above methods were found to eliminate the natural
cell-to-cell stochasticity, which is an important piece of information available in
scRNA-seq data compared to bulk RNA-seq data [80]. VIPER instead addresses
this limitation by selecting a sparse set of neighboring cells for imputation to
preserve variation in gene expression across cells [80]. In general, the first category
of imputation methods, which borrow information across similar cells, tends to
intensify subject variation in scRNA-seq datasets with multiple subjects, making cells
from the same subject more similar to each other than to cells from different subjects.
4.3 Gene-Gene Similarity-Based Imputation
The second category of methods relies on the gene-gene similarities for imputa-
tion, including SAVER [81], G2S3 [82], netNMF-sc [83], and netSmooth [84].
SAVER borrows information across similar genes instead of cells to impute gene
expression using a penalized regression model [81]. G2S3 recovers gene expression
by borrowing information from adjacent genes in a sparse gene graph learned
from gene expression profiles across cells using graph signal processing [82].
netNMF-sc uses network-regularized nonnegative matrix factorization to leverage
gene-gene interactions for imputation [83]. netSmooth smooths gene expression
values by incorporating protein-protein interaction networks [84]. Both netNMF-sc
and netSmooth require prior information on gene-gene interactions from RNA-seq
or microarray studies of bulk tissue.
4.4 Gene-Gene and Cell-Cell Similarity-Based Imputation
The third category of methods leverages information from both genes and cells.
For example, ALRA imputes gene expression using low-rank matrix approximation
[85], and scTSSR uses two-side sparse self-representation matrices to capture gene-
gene and cell-cell similarities for imputation [86].
4.5 Deep Neural Network-Based Imputation
The last category consists of machine learning-based methods, such as autoImpute
[87], DCA [88], deepImpute [89], and SAUCIE [90], that use deep neural networks
to impute dropout events. While computationally more efficient, these methods
were found to generate false-positive results in differential expression analyses [91].
Recently, an ensemble approach, EnImpute, was developed to integrate results from
multiple imputation methods using weighted trimmed mean [92].
4.6 G2S3
In this section, we give a detailed presentation on the imputation method G2S3
developed by our group. G2S3 uses graph signal processing to learn a sparse gene

graph from scRNA-seq data and imputes dropouts by borrowing information from
nearby genes in the graph. G2S3 first constructs a sparse graph representation of
gene network under the assumption that expression values change smoothly between
closely connected genes. Suppose $X = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{n \times m}$
is the observed transcript counts of m genes in n cells, where the column
$x_j \in \mathbb{R}^n$ represents the expression vector of gene j, for j = 1, . . . , m.
We consider a weighted gene graph G = (V, E), in which each vertex Vj represents gene j
and the edge between genes j and k is associated with a weight Wjk.
The gene graph is determined by the weighted adjacency matrix
$W \in \mathbb{R}_+^{m \times m}$. Assuming signals on the graph are smooth and sparse,
G2S3 searches for an optimal adjacency matrix W from the space
$\mathcal{W} = \left\{ W \in \mathbb{R}_+^{m \times m} : W = W^T, \mathrm{diag}(W) = 0 \right\}$.
To accomplish this, we optimize the objective function adapted from Kalofolias's
model [93]:
\[
\min_{W \in \mathcal{W}} \left\| W \circ Z \right\|_{1,1}
- \mathbf{1}^T \log\left(W \mathbf{1}\right)
+ \frac{1}{2} \left\| W \right\|_F^2 ,
\tag{7}
\]
where $Z \in \mathbb{R}_+^{m \times m}$ is the pairwise squared Euclidean distance matrix
of genes, defined as $Z_{jk} = \left\| x_j - x_k \right\|^2$, $\mathbf{1}$ is a vector of
ones, $\|\cdot\|_{1,1}$ is the elementwise L1 norm, $\circ$ is the Hadamard product, and
$\|\cdot\|_F$ is the Frobenius norm. In Eq. (7), the first term is equivalent to
$2\,\mathrm{tr}(X L X^T)$, which quantifies how smooth the signals are on the graph,
where L is the graph Laplacian and tr(·) is the trace of a matrix. This term penalizes
edges between distant genes, so it favors a sparse set of edges between the nodes
with a small distance in Z. The second term acts on the node degrees $W\mathbf{1}$ and
forces the degree of each gene to be positive, which improves the overall connectivity
of the gene graph. The third term controls graph sparsity by penalizing large edge
weights between genes. Equation (7) can be optimized via primal-dual techniques [94]
by rewriting it as
\[
\min_{w \in \omega} I\{w \ge 0\} + 2 w^T z - \mathbf{1}^T \log(d) + \left\| w \right\|^2,
\quad \text{where } \omega = \left\{ w \in \mathbb{R}_+^{m(m-1)/2} \right\},
\tag{8}
\]
where w and z are vector forms of W and Z, respectively; I{·} is the indicator
function that takes value 0 when the condition in the brackets is satisfied and infinity
otherwise; $d = Kw \in \mathbb{R}^m$; and K is the linear operator that satisfies
$W\mathbf{1} = Kw$.
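The link between the first term of the objective and the graph Laplacian can be verified in a few lines. The derivation below assumes that genes are the columns of X, that Zjk is the squared Euclidean distance between the expression vectors of genes j and k, that D is the diagonal degree matrix with Djj = Σk Wjk, and that L = D − W; it uses only the symmetry of W.

```latex
\begin{align*}
\left\| W \circ Z \right\|_{1,1}
  &= \sum_{j,k} W_{jk} \left\| x_j - x_k \right\|^2
   = \sum_{j,k} W_{jk} \left( \|x_j\|^2 + \|x_k\|^2 - 2\, x_j^T x_k \right) \\
  &= 2 \sum_{j} D_{jj} \|x_j\|^2 - 2 \sum_{j,k} W_{jk}\, x_j^T x_k \\
  &= 2\, \mathrm{tr}\left( X D X^T \right) - 2\, \mathrm{tr}\left( X W X^T \right)
   = 2\, \mathrm{tr}\left( X L X^T \right), \qquad L = D - W .
\end{align*}
```

A small value of this term therefore means that strongly connected genes have similar expression profiles across cells, which is the smoothness assumption stated above.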
After obtaining the optimal W, a lazy random walk matrix can be constructed on
the graph as $M = (D^{-1} W + I)/2$, where D is an m-dimensional diagonal matrix
with $D_{jj} = \sum_k W_{jk}$, the degree of gene j, and I is the identity matrix. We
then obtain the imputed count matrix $X_{\mathrm{imputed}}$ by taking a t-step random
walk on the graph:
\[
X_{\mathrm{imputed}}^T = M^t X^T .
\]
By default, G2S3 takes a one-step random walk (t = 1) to avoid over-smoothing.
Adapted from a diffusion-based imputation method [95], we also
implement hyperparameter tuning based on an objective function that minimizes
the mean squared error (MSE) between the imputed and observed data, i.e.,
$t^* = \arg\min_t \left\| M^t X^T - X^T \right\|$. A good imputation method is not
expected to

deviate too far away from the raw data structure in the process of denoising. This
criterion enables us to denoise the observed gene expression through attenuating
noise due to technical variation while preserving biological structure and variation.
Like other diffusion-based methods, G2S3 spreads out counts while keeping the
sum constant in the random walk step. As a result, the average value of the nonzero
matrix entries decreases after imputation. To match the observed expression at the
gene level, we rescale the values in Ximputed so that the mean expression of each
gene in the imputed data matches that of the observed data. The pseudo-code for
G2S3 is given in Algorithm 1.
Algorithm 1: Pseudo-code of G2S3
1: Input: X
2: Result: X_imputed = G2S3(X)
3: Z = distance(X)
4: W = argmin over w in R_+^{m(m-1)/2} of I{w ≥ 0} + 2 w^T z − 1^T log(d) + ‖w‖^2
5: D = degree(W)
6: M = (D^{-1} W + I)/2
7: t* = argmin over t of ‖M^t X^T − X^T‖
8: X_imputed^T = M^{t*} X^T
9: X_rescaled = rescale(X_imputed)
10: X_imputed = X_rescaled
11: End
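Steps 5 to 10 of Algorithm 1 can be sketched in a few lines once a gene graph is available. The code below is an illustration under stated assumptions, not the G2S3 package: the adjacency matrix W is taken as given (the graph learning in step 4 is not implemented), the number of steps t is fixed by the caller rather than tuned, and the function names are hypothetical.

```python
def lazy_walk_matrix(W):
    """M = (D^{-1} W + I) / 2, where D is the diagonal matrix of gene degrees."""
    m = len(W)
    M = []
    for j in range(m):
        degree = sum(W[j])
        if degree <= 0:
            raise ValueError("every gene needs a positive degree")
        M.append([(W[j][k] / degree + (1.0 if j == k else 0.0)) / 2.0
                  for k in range(m)])
    return M


def impute(X, W, t=1):
    """X has n cells in rows and m genes in columns; returns a matrix of the same shape."""
    n, m = len(X), len(W)
    M = lazy_walk_matrix(W)
    # Work with X^T (genes x cells) so that M acts on the gene dimension.
    observed_t = [[X[c][j] for c in range(n)] for j in range(m)]
    imputed_t = observed_t
    for _ in range(t):
        imputed_t = [[sum(M[j][k] * imputed_t[k][c] for k in range(m))
                      for c in range(n)] for j in range(m)]
    # Rescale each gene so that its mean matches the observed mean.
    rescaled_t = []
    for j in range(m):
        observed_sum, imputed_sum = sum(observed_t[j]), sum(imputed_t[j])
        factor = observed_sum / imputed_sum if imputed_sum > 0 else 0.0
        rescaled_t.append([v * factor for v in imputed_t[j]])
    return [[rescaled_t[j][c] for j in range(m)] for c in range(n)]


# Toy chain graph over 3 genes (gene 1 is connected to genes 0 and 2), 2 cells.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[4, 0, 2],
     [0, 2, 6]]
imputed = impute(X, W, t=1)
```

In this toy example the zero observed for gene 0 in the second cell is replaced by a positive value borrowed from its neighbor gene 1, while the per-gene totals are unchanged by the rescaling step.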
4.7 Methods Evaluation and Comparison
In this section, we evaluated and compared the performance of 11 imputation
methods, kNN-smoothing, MAGIC, scImpute, VIPER, SAVER, G2S3, ALRA,
scTSSR, DCA, SAUCIE, and EnImpute, in recovering gene expression using
three unique molecular identifier (UMI)-based datasets. The three datasets are the
Reyfman dataset from human lung tissue [21], the peripheral blood mononuclear
cell (PBMC) dataset from human peripheral blood [17], and the Zeisel dataset from
the mouse cortex and hippocampus [55]. In Reyfman, the raw data include 33,694
genes and 5437 cells. We selected cells with a total number of UMIs greater than
10,000 and genes that have nonzero expression in more than 20% of cells. This
resulted in 3918 genes and 2457 cells as the reference dataset. The PBMC dataset
was downloaded from the 10x Genomics website (https://support.10xgenomics.com/
single-cell-gene-expression/datasets). The raw data include 33,538 genes and 7865
cells. We selected cells with a total number of UMIs greater than 5000 and genes
that have nonzero expression in more than 20% of cells. This resulted in 2308 genes
and 2081 cells as the reference dataset. In Zeisel, the raw data include 19,972 genes
and 3005 cells. We selected cells with a total number of UMIs greater than 10,000
and genes that have nonzero expression in more than 40% of cells. This resulted in
3529 genes and 1800 cells as the reference dataset.

[Figure 2: box plots of correlation with reference (y-axis, 0.0 to 1.0) for the Reyfman,
PBMC, and Zeisel datasets (x-axis); panels: Gene (top) and Cell (bottom); legend:
Observed, G2S3, SAVER, kNN-smoothing, MAGIC, scImpute, VIPER, ALRA, scTSSR, DCA,
SAUCIE, EnImpute]
Fig. 2 Evaluation of expression data recovery of all imputation methods by down-sampling.
Performance of imputation methods measured by correlation with reference data from the first
category of datasets, using gene-wise (top) and cell-wise (bottom) correlation. Box plots show the
median (centerline), interquartile range (hinges), and 1.5 times the interquartile range (whiskers)

In each of the three scRNA-seq datasets, the reference dataset was treated as the
true expression. Down-sampling was performed to generate benchmarking observed
datasets. We performed random binary masking of UMIs in the reference datasets
to mimic the inefficient capturing of transcripts in dropout events. The binary
masking process masked out each UMI independently with a given probability.
In each reference dataset, we randomly masked out 80% of UMIs to create the
down-sampled observed dataset. All imputation methods were applied to each
down-sampled dataset to generate imputed data separately. We performed library
size normalization on all imputed data. Figure 2 shows the gene-wise Pearson
correlation and cell-wise Spearman correlation between the imputed and reference
data from each dataset. The correlation between the observed data without imputa-
tion and reference data was set as a benchmark. In all datasets, G2S3 consistently
achieved the highest correlation with the reference data at both gene and cell levels;

SAVER and scTSSR had slightly worse performance. EnImpute had comparable
performance to G2S3 based on the cell-wise correlation but performed worse than
G2S3, SAVER, and scTSSR based on the gene-wise correlation. VIPER performed
well in the Reyfman and PBMC datasets but not in the Zeisel dataset based on
the gene-wise correlation, although the cell-wise correlations were much lower
than G2S3, SAVER, scTSSR, and EnImpute in all datasets. The other methods,
kNN-smoothing, MAGIC, scImpute, ALRA, and DCA, did not have comparable
performance, especially based on the gene-wise correlation. SAUCIE did not have
comparable performance to the other methods in any of the datasets. To quantify the
performance improvement of G2S3, a one-sided t-test was applied to compare the
gene-wise and cell-wise correlations of G2S3 to those of the other methods. G2S3
had significantly higher correlations than all the other methods across the three
datasets for both gene-wise and cell-wise correlations (p < 0.05, Table 1). Overall,
G2S3 provided the most accurate recovery of gene expression levels.
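The down-sampling and scoring procedure described in this section can be sketched as follows. This is an illustrative reconstruction, not the code used for the reported results: the function names, the toy reference matrix, and the random seed are assumptions. Each UMI is masked independently with the given probability, which amounts to binomial thinning of every count.

```python
import math
import random


def mask_umis(counts, mask_prob, rng):
    """Drop each UMI independently with probability mask_prob (binomial thinning)."""
    return [[sum(1 for _ in range(x) if rng.random() >= mask_prob) for x in row]
            for row in counts]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences (NaN if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


# Toy reference data: 2 genes (rows) by 4 cells (columns); 80% of UMIs are masked.
rng = random.Random(0)
reference = [[50, 0, 12, 30],
             [3, 40, 7, 0]]
observed = mask_umis(reference, mask_prob=0.8, rng=rng)
# Gene-wise correlation between the down-sampled and reference data for gene 0.
gene0_correlation = pearson(observed[0], reference[0])
```

In a full evaluation, the same correlation would be computed between each imputed matrix and the reference, gene by gene and cell by cell, and compared against this unimputed baseline.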
5 Differential Expression Analysis
5.1 Background
Although aims vary widely across different scRNA-seq studies, one common task
is to identify disease- or phenotype-associated genes [96] within each identified cell
type, which provides a potential list of candidate genes for further therapeutic
development and a better understanding of the disease pathogenesis. However, this
task is challenging due to prevalent dropout events and substantial subject effect, or
so-called between-replicate variation [97], in scRNA-seq data. We have described
dropout events in Sect. 4. For subject effect, many studies have consistently shown
that within the same cell type, cells of the same subject cluster together but separate
well from cells of other subjects regardless of the phenotype of subjects [21, 98, 99].
For example, Fig. 3 shows a clear separation between cells from different subjects
in both alveolar macrophages and natural killer cells from patients with idiopathic
pulmonary fibrosis (IPF) [98]. This suggests that the across-subject variation is
dominant and much higher than the within-subject variation across cells, possibly
due to heterogeneous genetic backgrounds or environmental exposures. DE analysis
of scRNA-seq data is severely confounded by this dominant subject effect because
the across-subject difference driving genes are likely to be significantly different
between two groups of subjects [97, 100, 101]. In summary, it is critical to dissect
subject effect from disease effect with considerations of dropout events in the DE
analysis of scRNA-seq data with multiple subjects.
Table 1 Comparison of the gene-wise and cell-wise correlations of G2S3 and other
methods in down-sampling experiments. P-values of testing the difference of correlations
of G2S3 and other methods with the reference data

                Gene-wise correlation                        Cell-wise correlation
Method          Reyfman        PBMC           Zeisel         Reyfman        PBMC           Zeisel
SAVER           2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   1.7 × 10^-10   6.1 × 10^-9
kNN-smoothing   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
MAGIC           2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
scImpute        2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
VIPER           2.3 × 10^-10   6.3 × 10^-4    2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
ALRA            2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
scTSSR          2.2 × 10^-16   1.4 × 10^-4    8.1 × 10^-3    2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
DCA             2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
SAUCIE          2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16
EnImpute        2.2 × 10^-16   2.2 × 10^-16   2.2 × 10^-16   1.5 × 10^-6    2.4 × 10^-2    2.2 × 10^-16

[Figure 3: four UMAP panels (axes UMAP_1 and UMAP_2): Alveolar Macrophages Colored by
Subjects, Alveolar Macrophages Colored by Disease, NK Cells Colored by Subjects, NK Cells
Colored by Disease; disease legend: Control, IPF]
Fig. 3 UMAPs of the alveolar macrophages (top row) and natural killer cells (bottom row)
demonstrate a dominant subject effect in both cell types. In each row, the figures on the left and
the right show UMAPs of cells colored by subject and disease status, respectively

Sometimes, subject effect can be easily confused with technical batch effect
because early-stage scRNA-seq datasets profiled freshly collected samples and
thus each sample forms a separate batch. Since transcriptomic data is known to
be sensitive to batches, one possible explanation for the observed large variation
across subjects is batch effect. Recent advances in preserving cells using dimethyl
sulfoxide (DMSO) enabled processing multiple samples from different subjects in
the same batch [102]. In the scRNA-seq data of sputum samples from patients with
asthma (data unpublished), comparison of scRNA-seq data from the same sputum
sample with and without DMSO preservation showed no significant difference
between the fresh and DMSO data, but significant separation between different
subjects was still present. This confirmed that the dominant between-subjects
variation was a real biological subject effect instead of a technical batch effect.
Therefore, it is inappropriate to remove the across-subject variation using batch effect
adjustment tools. More importantly, removing the across-subject variation using
batch effect adjustment tools will also remove the disease effect of interest because
the subject effect is confounded with the disease effect. Therefore, DE analysis of
scRNA-seq data does not use data adjusted by batch effect removal tools, including
the integrated analysis in Seurat. All DE analysis methods for scRNA-seq data
use either the normalized UMI counts or the un-normalized UMI counts.
Many DE analysis methods have been developed and compared for scRNA-seq
data [103–105] although not all of them consider subject effects or dropout events.
There are mainly two categories of methods depending on whether subject effect is
considered.

5.2 DE Methods Ignoring Subject Effects
DE methods that ignore subject effect and treat all cells as independent have been
used in many scRNA-seq studies. These methods may be appropriate for identifying
cell cluster marker genes, but they are inappropriate for DE analysis that aims to identify
disease- or phenotype-associated genes because the dominant subject effect is
confounded with the disease effect, as described above.
This category includes both methods designed for scRNA-seq data and methods
adopted from bulk RNA-seq DE analysis. Among the ones designed for scRNA-
seq data, BASiCS [42] and TASC [106] require external RNA spike-ins to provide
information on technical variation and use a Bayesian hierarchical Poisson-Gamma
model and a hierarchical Poisson-lognormal model, respectively, to fit the data.
Monocle [44, 52, 107] and NBID [108] model the UMI counts of each gene using
a negative binomial distribution without considering dropouts. A group of methods
were developed to consider dropouts, including DEsingle [109], DECENT [110],
SC2P [111], SCDE [71], and MAST [50]. These methods utilize mixture models
or hierarchical models, mostly zero-inflated, to model dropouts and captured
transcripts. DEsingle fits a zero-inflated negative binomial model in each group and
conducts a likelihood ratio test for significance assessment. DECENT models UMI
counts using a hierarchical model under the assumption that the true underlying
expression follows a zero-inflated negative binomial distribution, and the capturing
process follows a beta-binomial distribution. SC2P models dropout events using a
zero-inflated Poisson distribution and fits the detected transcripts using a lognormal-
Poisson distribution. The assumption in SC2P that the cell-specific dropout rate and
dropout distribution are shared by all genes may eliminate the natural stochasticity
in scRNA-seq data. SCDE employs a two-component mixture model with a negative
binomial and a low-magnitude Poisson component to model efficiently amplified
read-outs and dropout events, respectively. The dropout rate for a given gene is
determined by its true underlying expression level in the cell, which is estimated
based on a selected subset of highly expressed genes. MAST uses a two-part
hurdle model in which dropout rates are modeled by a logistic regression model
and nonzero expression follows a Gaussian distribution. SC2P, SCDE, and MAST
were originally designed for transcripts per million (TPM) data, which
have different technical noise and data distribution from UMI count data [41].
Multimodality has been observed in scRNA-seq data due to cellular heterogeneity
within the same cell type. To consider multimodality, scDD [112] was designed
to model count data with a Dirichlet process to detect genes with difference in
mean expression, proportion of the same component, or modality between groups.
D3E [113], a nonparametric method, fits a bursting model for transcriptional
regulation and compares the gene expression distribution between two groups.
It was previously reported to generate false-positive results on negative control
datasets [104].
For the methods adopted from bulk RNA-seq DE analysis, a recent study [103]
showed that when applied to the cell-level UMI count data, DESeq2 [114],
limma-trend [115], and the Wilcoxon rank sum test [116] have comparable performance to
those designed for scRNA-seq data, especially after filtering out the lowly expressed
genes.
5.3 DE Methods Considering Subject Effects
Among the DE methods that consider subject effect, the simplest one is to aggregate
expression levels of cells from the same subject by averaging. These aggregated
sample-level “pseudo-bulk” expression levels are then compared between two
groups of subjects using Student’s t-test. This method is denoted as subject-t-test
(subT). Two recent studies proposed the following three DE analysis methods to
consider subject effect. Zimmerman et al. [100] developed MAST-RE by adding a
subject random effect to the nonzero expression part of the hurdle model in MAST.
The muscat package [101] provides two approaches to consider subject effect:
(1) muscat-PB that aggregates cell-level UMI counts into sample-level “pseudo-
bulk” counts which are then compared between two groups using edgeR that
was developed for DE analysis in bulk RNA-seq data and (2) muscat-MM that
fits a generalized linear mixed model (GLMM) on the cell-level UMI counts to
account for subject variation. Both muscat-PB and muscat-MM were compared to
other methods and shown to have power gain by considering subject effect [97].
Recently, we developed a novel method, iDESC, to consider both subject effect
and dropouts using a zero-inflated negative binomial mixed model. Dropout events
are modeled as inflated zeros and non-dropout events are modeled using a negative
binomial distribution. In the negative binomial component, a random effect is used
to represent subject effect and thus separate it from group effect. A Wald statistic is
used to assess the significance of the group effect. Our method is the only one that
explicitly considers both dropout events and subject effects in the same model.
Although the necessity of the zero-inflated model for scRNA-seq UMI count data and
its risk of overfitting are debated [117], many publications have shown
a better goodness-of-fit of the zero-inflated model compared to the regular GLMM
for genes with relatively high dropout rates [48, 118]. More importantly,
ignoring dropout events in the GLMM introduces bias into the estimation of the
between-group fold change in expression, which is an important part of DE analysis
results in addition to assessing the significance of differential expression.
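The subject-level aggregation behind subT described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy counts, and the subject and group labels are all hypothetical, and only the test statistic and its degrees of freedom are computed (no p-value).

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def pseudo_bulk_means(counts, subjects):
    """Average one gene's cell-level expression within each subject."""
    per_subject = defaultdict(list)
    for value, subject in zip(counts, subjects):
        per_subject[subject].append(value)
    return {subject: mean(values) for subject, values in per_subject.items()}


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t_stat, df


# Hypothetical toy data: three cells from each of four subjects, one gene.
counts = [4, 6, 5, 9, 7, 8, 1, 3, 2, 2, 4, 0]
subjects = ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3", "s4", "s4", "s4"]
group_of = {"s1": "disease", "s2": "disease", "s3": "control", "s4": "control"}

# Step 1: collapse cells to one value per subject (the "pseudo-bulk" level).
subject_means = pseudo_bulk_means(counts, subjects)

# Step 2: compare subjects, not cells, between the two groups.
disease = [m for s, m in subject_means.items() if group_of[s] == "disease"]
control = [m for s, m in subject_means.items() if group_of[s] == "control"]
t_stat, df = student_t(disease, control)
```

The point of the design is that the unit of analysis becomes the subject: the test here has 2 degrees of freedom (four subjects), not 10 (twelve cells), which is what protects against treating correlated cells as independent replicates.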
5.4 iDESC
In this section, we provide details of the iDESC model. There are two components
in the model: a zero component representing dropouts and a negative binomial com-
ponent representing captured expression. Both dropout rate and mean expression are
allowed to be different between the two groups. Suppose cells are collected from $n$
subjects. In a given cell type of interest, subject $i$ has $n_i$ cells so that there are in total
$N = \sum_{i=1}^{n} n_i$ cells of the given type. Let $X_i$ be the group label of subject $i$, where $X_i$
is 1 if subject $i$ belongs to group 1 and 0 if subject $i$ belongs to group 2. For a given
gene, let $Y_{ij}$ denote the UMI count of this gene in cell $j$ from subject $i$. We model the
UMI count as
$$
Y_{ij} \mid \pi_1, \pi_2, \lambda_{ij}, d \sim \left[\pi_1 X_i + \pi_2 (1 - X_i)\right] \times I_{\{Y_{ij} = 0\}}
+ \left[1 - \pi_1 X_i - \pi_2 (1 - X_i)\right] \times \mathrm{NB}\left(S_{ij} \lambda_{ij}, d\right)
$$
with
$$
\log\left(\lambda_{ij}\right) = \alpha + \beta X_i + \gamma_i,
$$
where $\pi_1$ and $\pi_2$ are the dropout rates in the two groups representing the probability
of this gene being dropped out; $I_{\{\cdot\}}$ is an indicator function that takes value 1 when
the condition in the brackets is satisfied, 0 otherwise; $S_{ij}$ is the total UMI count of
cell $j$ from subject $i$; $\lambda_{ij}$ is the rate parameter of the negative binomial distribution
representing the true underlying relative expression level of this gene; and $d$ is the
dispersion parameter. The rate parameter $\lambda_{ij}$ is further modeled as a GLMM with log
link, where $\alpha$ is the intercept, $\beta$ is the group effect representing the log fold change
of mean expression between the two groups, and $\gamma_i$ is the subject random effect. We
assume the $\gamma_i$ are independent and $\gamma_i \sim N(0, \sigma^2)$. A Wald statistic is constructed to test
$H_0: \beta = 0$ against $H_1: \beta \neq 0$ using the R package “glmmTMB” [119], which tests
whether the given gene is differentially expressed between the two groups or not.
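To make the two-component structure of the model concrete, the sketch below evaluates the probability of observing a given UMI count for one cell. It is an illustration only and not the iDESC implementation. The text does not state how the dispersion d enters the negative binomial, so the code assumes the common mean-dispersion form with mean S·λ and variance mean + d·mean²; the function names and the numeric values in the usage lines are likewise hypothetical.

```python
from math import exp, lgamma, log


def nb_log_pmf(y, mu, d):
    """Log pmf of a negative binomial with mean mu and variance mu + d * mu**2.

    This parameterization is an assumption made for illustration.
    """
    r = 1.0 / d  # size parameter implied by the assumed dispersion form
    return (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
            + r * log(r / (r + mu)) + y * log(mu / (r + mu)))


def zinb_pmf(y, x, s, pi1, pi2, alpha, beta, gamma, d):
    """P(Y_ij = y) for a cell with total UMI count s from a subject with group label x."""
    pi = pi1 * x + pi2 * (1 - x)          # group-specific dropout rate
    lam = exp(alpha + beta * x + gamma)   # log link: intercept, group effect, subject effect
    nb = exp(nb_log_pmf(y, s * lam, d))   # captured-expression component
    point_mass = 1.0 if y == 0 else 0.0   # dropout component puts all mass at zero
    return pi * point_mass + (1 - pi) * nb


# Hypothetical cell: group 1, 10,000 total UMIs, subject effect fixed at 0.
params = dict(x=1, s=10000, pi1=0.01, pi2=0.04, alpha=-9.3, beta=0.1, gamma=0.0, d=0.5)
p_zero = zinb_pmf(0, **params)
total = sum(zinb_pmf(y, **params) for y in range(500))  # should be close to 1
```

In the model the subject effect is random rather than fixed, so a full likelihood would integrate this probability over the normal distribution of the subject effect; that integration is what a mixed-model fitting package handles and it is not attempted here.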
5.5 DE Methods Evaluation and Comparison
In this section, we evaluate and compare the performance of the 12 DE methods
shown in Table 2. Method performance was compared based on type I error
and statistical power using both simulated and real datasets, including the Kaminski
dataset [98] and the Kropski dataset [120]. Both datasets contain scRNA-seq data
from whole lung tissue of patients with IPF and normal controls.
5.5.1 Type I Error Comparison
To avoid performance evaluation bias due to data distribution assumptions, we
permuted the group labels of subjects in both the Kaminski and Kropski datasets
500 times. For each gene, an empirical type I error was calculated as the proportion of
permuted datasets with a p-value < 0.05. Figure 4 shows the empirical type I errors
of all methods. In both datasets, methods that consider subject effects, including
iDESC, MAST-RE, muscat-MM, muscat-PB, and subT, had well-controlled type I
error, although MAST-RE had slightly inflated type I error for some genes, likely
due to the deviation of log-normalized UMI counts from the assumed Gaussian

Table 2 Overview of the 12 DE analysis methods for comparison
Method     Dropout  Subject effect  Test                     Model
iDESC      √        Mixed model     Wald test                Zero-inflated negative binomial mixed model
MAST-RE    √        Mixed model     Likelihood ratio test    Two-part hurdle mixed model
muscat-MM  ×        Mixed model     Wald test                Negative binomial mixed model
muscat-PB  ×        Aggregation     Quasi-likelihood F-test  edgeR on sample-level aggregated data
subT       ×        Aggregation     Student’s t-test         t-test on sample-level aggregated data
DEsingle   √        ×               Likelihood ratio test    Group-specific zero-inflated negative binomial model
MAST       √        ×               Likelihood ratio test    Two-part hurdle model
scDD       √        ×               Kolmogorov-Smirnov test  Dirichlet process mixture of normals
NBID       ×        ×               Likelihood ratio test    Negative binomial model with group-specific dispersion
DESeq2     ×        ×               Wald test                Negative binomial model with the same dispersion between groups
limma      ×        ×               Moderated t-test         Linear regression model
Wilcoxon   ×        ×               Wilcoxon rank sum test   Nonparametric test

Fig. 4 Empirical type I error of all 12 DE methods using permuted Kaminski and Kropski datasets.
Box plots show the median (centerline), interquartile range (hinges), and 1.5 times the interquartile
range (whiskers) of empirical type I error at the nominal level of 0.05. The confidence interval of
type I error is marked by two dashed lines (0.031–0.069)
distribution. In contrast, methods that do not consider subject effect had severely
inflated type I error. Among these methods, DEsingle, MAST, and scDD had the
largest inflation in type I error. DESeq2 had the largest variation in type I error
across all genes. In summary, these results suggested that it is important to consider
subject effect for type I error control in the DE analysis of scRNA-seq data with
multiple subjects.
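The per-gene empirical type I error and the interval quoted in the Fig. 4 caption can be illustrated with a short calculation. The text does not say how the interval (0.031–0.069) was derived; the sketch below assumes a normal approximation to a binomial proportion with 500 permutations at the nominal level 0.05, which happens to reproduce those bounds. Function names are hypothetical.

```python
from math import sqrt


def empirical_type1_error(p_values, level=0.05):
    """Proportion of permuted datasets in which the gene has a p-value below the level."""
    return sum(p < level for p in p_values) / len(p_values)


def nominal_interval(level=0.05, n_perm=500, z=1.96):
    """Normal-approximation interval for a proportion estimated from n_perm permutations.

    Assumed form: level +/- z * sqrt(level * (1 - level) / n_perm).
    """
    half_width = z * sqrt(level * (1 - level) / n_perm)
    return level - half_width, level + half_width


lower, upper = nominal_interval()
print(round(lower, 3), round(upper, 3))  # prints 0.031 0.069
```

A gene whose empirical type I error falls above the upper bound is one for which the method rejects the null more often than permutation noise alone would explain.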
5.5.2 Statistical Power Comparison
Since methods that ignore subject effects had highly inflated type I error, we mainly
compared the statistical power of methods that consider subject effects. To achieve this,
we simulated scRNA-seq data under a wide range of parameter settings estimated
from real datasets. The DE analysis was conducted using the 12 methods on the
simulated data. Method performance was assessed by the area under a receiver
operating characteristic curve (AUC), which describes the sensitivity and specificity of
the identified DE genes under different significance levels, where genes with nonzero
group effect in the simulation model were treated as ground truth.
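As a small illustration of this evaluation, the AUC can be computed directly from per-gene p-values and the simulated ground truth using its rank interpretation: the probability that a randomly chosen true DE gene receives a smaller p-value than a randomly chosen non-DE gene. The function name and the toy p-values below are hypothetical and are not taken from the study.

```python
def auc_from_p_values(p_values, is_de):
    """AUC as the fraction of (DE, non-DE) gene pairs that are correctly ordered."""
    de = [p for p, truth in zip(p_values, is_de) if truth]
    null = [p for p, truth in zip(p_values, is_de) if not truth]
    wins = 0.0
    for p_de in de:
        for p_null in null:
            if p_de < p_null:
                wins += 1.0   # DE gene ranked as more significant
            elif p_de == p_null:
                wins += 0.5   # ties count as half
    return wins / (len(de) * len(null))


# Hypothetical p-values for six simulated genes; the first three have nonzero group effect.
p_values = [0.001, 0.20, 0.03, 0.60, 0.04, 0.90]
is_de = [True, True, True, False, False, False]
auc = auc_from_p_values(p_values, is_de)  # 8 of the 9 pairs are correctly ordered
```

Because the AUC depends only on the ranking of genes, it summarizes performance across all significance thresholds at once and does not require choosing a single cutoff.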

Fig. 5 Statistical power comparison across methods that consider subject effects. Area under an
ROC curve (AUC) was calculated to measure the accuracy of identified DE genes under two
scenarios of dropout rate settings in disease (π1) and control (π2) groups: (a) Scenario I with
the same dropout rate (0 and 0.01) between groups and (b) Scenario II with dropout rate higher in
the control group
Two scenarios were considered regarding dropout rate settings. In the first
scenario (Scenario I), there was no dropout or the dropout rates were the same between
the two groups. Figure 5a shows that iDESC had the highest AUC in most of the
simulation settings. SubT had the second highest AUC but was slightly better than
iDESC when the intercept was low (α = −9.3) and the group effect was positive
and low (β = 0.1), corresponding to the situation of overall low expression in both
groups and slightly higher expression in the disease group. The other three methods,
MAST-RE, muscat-MM, and muscat-PB, had lower AUC than iDESC across all
simulation settings.
In the second scenario (Scenario II), the control group had a higher dropout rate
(π2 > π1), and Fig. 5b shows that iDESC had the highest AUC when the group effect
was negative. SubT had the largest decrease in performance when the difference
in dropout rate between the two groups increased (π1 = 0, π2 = 0.04). However,
when the group effect was positive, subT had improved performance and achieved
comparable or even higher AUC than iDESC. This was expected because subT is
designed to detect the observed mean expression difference between groups. When
the mean gene expression of the disease group is higher and the dropout rate of
the disease group is lower than those of the control group, the observed mean
expression difference will be larger than the true difference between the two group
means, which makes it easier for subT to detect the expression difference between the
two groups, especially when the difference in dropout rate is large and in the opposite
direction to the group mean difference. In contrast, when the mean gene expression and the
dropout rate of the disease group are both lower than those of the control group, the
observed mean expression difference becomes smaller than the true difference so
that subT loses power to detect the expression difference between the two groups,
especially when the difference in dropout rate is large and in the same direction as
the group mean difference. This suggests that subT is highly sensitive to differences
in dropout rate between the two groups. The other three methods, MAST-RE,
muscat-MM, and muscat-PB, had lower AUC than iDESC across all simulation
settings.
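The direction of this bias can be checked with a short calculation. Under the iDESC model, a cell's expected observed count is (1 − π) times its negative binomial mean, so the log ratio of observed group means equals β plus log((1 − π1)/(1 − π2)). The sketch below assumes comparable total UMI counts and the same subject-effect distribution in both groups, which is a simplification made here for illustration; the function name is hypothetical, and the parameter values are the Scenario II settings quoted above.

```python
from math import log


def observed_log_fold_change(beta, pi_disease, pi_control):
    """Log ratio of observed group means when dropouts are counted as zeros."""
    return beta + log((1 - pi_disease) / (1 - pi_control))


for beta in (0.1, -0.1):
    obs = observed_log_fold_change(beta, pi_disease=0.0, pi_control=0.04)
    print(f"true beta = {beta:+.1f}, observed = {obs:+.4f}")
# prints:
# true beta = +0.1, observed = +0.1408
# true beta = -0.1, observed = -0.0592
```

With a higher dropout rate in the control group, a positive group effect appears larger than it is and a negative one appears smaller in magnitude, which matches the gain and loss of power described for subT.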
In summary, iDESC had the highest AUC except for the settings with positive
group effect and higher dropout rate in the control group, where the observed
difference is larger than the true group mean difference, which favors methods
that ignore dropouts. Note that comparison using real datasets is strongly needed.
However, scRNA-seq datasets with multiple subjects from the same disease and
same tissue are hard to find, making it challenging to evaluate and compare methods
using real datasets.
6 Concluding Remarks
Data preprocessing, normalization, data imputation, and differential expression
analysis are four important components of scRNA-seq data analysis. Methods
discussed in this chapter represent a selected collection of existing methods that
are widely used or have the potential to become so in these four components.
These methods help researchers extract signals from scRNA-seq data and identify
robust disease-associated changes in cell types of interest. Identifying cell
type-specific transcriptomic changes associated with disease is one of the first steps to
study cellular systems in disease samples. Broader scientific questions include (1)
what causes transcriptomic changes, (2) how cells communicate with each other, (3)
how cell-to-cell communications change in disease samples, and (4) what potential
therapeutics can be identified or developed based on the identified changes. To
answer these questions, we may consider integrating drug perturbation data and
scRNA-seq data, combining ligand-receptor and signaling pathway information
with scRNA-seq data, conducting cell trajectory and RNA velocity analysis of
scRNA-seq data, and performing in silico cell lineage tracing. Although the cost
of scRNA-seq experiments has dropped over time, it is still costly to generate
scRNA-seq datasets with large sample sizes. Deconvolving existing bulk RNA-seq
data using gene signatures identified from scRNA-seq data of the same tissue has
great potential to increase the sample size available for understanding cell
type-specific changes.
With the rapid development of high-throughput scRNA-seq technologies, spatial
transcriptomic data in intact tissue are emerging. Spatial location of cells facilitates
accurate and robust learning of cell-to-cell communications and disease-associated
tissue microenvironment. Currently, due to limitations in spatial transcriptomic
technologies, scRNA-seq data are helpful for deconvolving spatial transcriptomic data
to single-cell resolution. Computational and statistical methods are greatly needed
to integrate these two types of data. In addition, the scale and complexity of
cell-to-cell communication inference increase exponentially with the number
of cells included in the analysis. Therefore, highly scalable graph-based methods for
network inference that integrate prior graph information hold great promise in the
field.
References
1. Eldar A, Elowitz MB (2010) Functional roles for noise in genetic circuits. Nature
467(7312):167–173
2. Huang S (2009) Non-genetic heterogeneity of cells in development: more than just noise.
Development 136(23):3853–3862
3. Li L, Clevers H (2010) Coexistence of quiescent and active adult stem cells in mammals.
Science 327(5965):542–545
4. Shalek AK et al (2014) Single-cell RNA-seq reveals dynamic paracrine control of cellular
variation. Nature 510(7505):363–369
5. Maamar H, Raj A, Dubnau D (2007) Noise in gene expression determines cell fate in Bacillus
subtilis. Science 317(5837):526–529
6. Huang H et al (2014) Non-biased and efficient global amplification of a single-cell cDNA
library. Nucleic Acids Res 42(2):e12
7. Taniguchi K, Kajiyama T, Kambara H (2009) Quantitative analysis of gene expression in a
single cell by qPCR. Nat Methods 6(7):503–506
8. Bengtsson M et al (2008) Quantification of mRNA in single cells and modelling of RT-qPCR
induced noise. BMC Mol Biol 9:63
9. Warren L et al (2006) Transcription factor profiling in individual hematopoietic progenitors
by digital RT-PCR. Proc Natl Acad Sci U S A 103(47):17807–17812
10. Eberwine J et al (1992) Analysis of gene expression in single live neurons. Proc Natl Acad
Sci U S A 89(7):3010–3014
11. Brady G, Barbara M, Iscove NN (1990) Representative in vitro cDNA amplification from
individual hemopoietic cells and colonies. Methods Mol Cell Biol 2(1):17–25
12. Subkhankulova T, Gilchrist MJ, Livesey FJ (2008) Modelling and measuring single cell RNA
expression levels find considerable transcriptional differences among phenotypically identical
cells. BMC Genomics 9:268
13. Kurimoto K et al (2007) Global single-cell cDNA amplification to provide a template for
representative high-density oligonucleotide microarray analysis. Nat Protoc 2(3):739–752
14. Kurimoto K et al (2006) An improved single-cell cDNA amplification method for efficient
high-density oligonucleotide microarray analysis. Nucleic Acids Res 34(5):e42
15. Tang F et al (2009) mRNA-seq whole-transcriptome analysis of a single cell. Nat Methods
6(5):377–382

16. Cloonan N et al (2008) Stem cell transcriptome profiling via massive-scale mRNA sequenc-
ing. Nat Methods 5(7):613–619
17. Zheng GX et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat
Commun 8:14049
18. Macosko EZ et al (2015) Highly parallel genome-wide expression profiling of individual cells
using nanoliter droplets. Cell 161(5):1202–1214
19. Kolodziejczyk AA et al (2015) The technology and biology of single-cell RNA sequencing.
Mol Cell 58(4):610–620
20. Vieira Braga FA et al (2019) A cellular census of human lungs identifies novel cell states in
health and in asthma. Nat Med 25(7):1153–1163
21. Reyfman PA et al (2019) Single-cell transcriptomic analysis of human lung provides insights
into the pathobiology of pulmonary fibrosis. Am J Respir Crit Care Med 199(12):1517–1536
22. Azizi E et al (2018) Single-cell map of diverse immune phenotypes in the breast tumor
microenvironment. Cell 174(5):1293–1308.e36
23. Adams TS et al (2019) Single cell RNA-seq reveals ectopic and aberrant lung resident cell
populations in idiopathic pulmonary fibrosis. bioRxiv: 759902
24. Chung W et al (2017) Single-cell RNA-seq enables comprehensive tumour and immune cell
profiling in primary breast cancer. Nat Commun 8:15081
25. Kaminow B, Yunusov D, Dobin A (2021) STARsolo: accurate, fast and versa-
tile mapping/quantification of single-cell and single-nucleus RNA-seq data. bioRxiv:
2021.05.05.442755
26. Srivastava A et al (2019) Alevin efficiently estimates accurate gene abundances from
dscRNA-seq data. Genome Biol 20(1):65
27. Melsted P et al (2021) Modular, efficient and constant-memory single-cell RNA-seq prepro-
cessing. Nat Biotechnol 39(7):813–818
28. Smith T, Heger A, Sudbery I (2017) UMI-tools: modeling sequencing errors in unique
molecular identifiers to improve quantification accuracy. Genome Res 27(3):491–499
29. Parekh S et al (2018) zUMIs – a fast and flexible pipeline to process RNA sequencing data
with UMIs. Gigascience 7(6)
30. You Y et al (2021) Benchmarking UMI-based single-cell RNA-seq preprocessing workflows.
Genome Biol 22(1):339
31. Bruning RS et al (2022) Comparative analysis of common alignment tools for single-cell
RNA sequencing. Gigascience 11
32. Dobin A et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
33. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25(14):1754–1760
34. Kim D et al (2013) TopHat2: accurate alignment of transcriptomes in the presence of
insertions, deletions and gene fusions. Genome Biol 14(4):R36
35. Liao Y, Smyth GK, Shi W (2013) The Subread aligner: fast, accurate and scalable read
mapping by seed-and-vote. Nucleic Acids Res 41(10):e108
36. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods
9(4):357–359
37. Srivastava A et al (2016) RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq
reads to transcriptomes. Bioinformatics 32(12):i192–i200
38. Bray NL et al (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol
34(5):525–527
39. Lun ATL et al (2019) EmptyDrops: distinguishing cells from empty droplets in droplet-based
single-cell RNA sequencing data. Genome Biol 20(1):63
40. Petukhov V et al (2018) dropEst: pipeline for accurate estimation of molecular counts in
droplet-based single-cell RNA-seq experiments. Genome Biol 19(1):78
41. Vieth B et al (2019) A systematic evaluation of single cell RNA-seq analysis pipelines. Nat
Commun 10(1):4667
42. Vallejos CA, Marioni JC, Richardson S (2015) BASiCS: Bayesian analysis of single-cell
sequencing data. PLoS Comput Biol 11(6):e1004333

43. Lun AT, Bach K, Marioni JC (2016) Pooling across cells to normalize single-cell RNA
sequencing data with many zero counts. Genome Biol 17:75
44. Qiu X et al (2017) Single-cell mRNA quantification and differential analysis with census. Nat
Methods 14(3):309–315
45. Borella M et al (2021) PsiNorm: a scalable normalization for single-cell RNA-seq data.
Bioinformatics
46. Satija R et al (2015) Spatial reconstruction of single-cell gene expression data. Nat Biotechnol
33(5):495–502
47. Hafemeister C, Satija R (2019) Normalization and variance stabilization of single-cell RNA-
seq data using regularized negative binomial regression. Genome Biol 20(1):296
48. Risso D et al (2018) A general and flexible method for signal extraction from single-cell
RNA-seq data. Nat Commun 9(1):284
49. Hicks SC et al (2018) Missing data and technical variability in single-cell RNA-sequencing
experiments. Biostatistics 19(4):562–578
50. Finak G et al (2015) MAST: a flexible statistical framework for assessing transcriptional
changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol
16:278
51. Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
52. Trapnell C et al (2014) The dynamics and regulators of cell fate decisions are revealed by
pseudotemporal ordering of single cells. Nat Biotechnol 32(4):381–386
53. Campbell K, Ponting CP, Webber C (2015) Laplacian eigenmaps and principal curves for
high resolution pseudotemporal ordering of single-cell RNA-seq profiles. bioRxiv: 027219
54. Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput 15(6):1373–1396
55. Zeisel A et al (2015) Brain structure. Cell types in the mouse cortex and hippocampus revealed
by single-cell RNA-seq. Science 347(6226):1138–1142
56. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–
2605
57. McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection
for dimension reduction. arXiv preprint arXiv:1802.03426
58. Kobak D, Berens P (2019) The art of using t-SNE for single-cell transcriptomics. Nat
Commun 10
59. Kharchenko PV (2021) The triumphs and limitations of computational methods for scRNA-
seq. Nat Methods 18(7):723–732
60. Heiser CN, Lau KS (2020) A quantitative framework for evaluating single-cell data structure
preservation by dimensionality reduction techniques. Cell Rep 31(5)
61. Chari T, Banerjee J, Pachter L (2021) The specious art of single-cell genomics. bioRxiv:
2021.08.25.457696
62. Kiselev VY, Andrews TS, Hemberg M (2019) Challenges in unsupervised clustering of
single-cell RNA-seq data. Nat Rev Genet 20(5):273–282
63. Petegrosso R, Li Z, Kuang R (2020) Machine learning and statistical methods for clustering
single-cell RNA-sequencing data. Brief Bioinform 21(4):1209–1223
64. Levine JH et al (2015) Data-driven phenotypic dissection of AML reveals progenitor-like
cells that correlate with prognosis. Cell 162(1):184–197
65. Wolf FA, Angerer P, Theis FJ (2018) SCANPY: large-scale single-cell gene expression data
analysis. Genome Biol 19(1):15
66. Xie JR, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the
state-of-the-art and comparative study. ACM Comput Surv 45(4)
67. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis.
Phys Rev E Stat Nonlinear Soft Matter Phys 80(5 Pt 2):056117
68. Puxeddu MG et al (2017) Community detection: comparison among clustering algorithms
and application to EEG-based brain networks. Annu Int Conf IEEE Eng Med Biol Soc
2017:3965–3968

69. McDavid A et al (2013) Data exploration, quality control and testing in single-cell qPCR-
based gene expression experiments. Bioinformatics 29(4):461–467
70. Karaayvaz M et al (2018) Unravelling subclonal heterogeneity and aggressive disease states
in TNBC through single-cell RNA-seq. Nat Commun 9(1):1–10
71. Kharchenko PV, Silberstein L, Scadden DT (2014) Bayesian approach to single-cell differen-
tial expression analysis. Nat Methods 11(7):740–742
72. Yuan G-C et al (2017) Challenges and emerging directions in single-cell analysis. Genome
Biol 18(1):1–8
73. Hou W et al (2020) A systematic evaluation of single-cell RNA-sequencing imputation
methods. Genome Biol 21(1):1–30
74. Zhang L, Zhang S (2018) Comparison of computational methods for imputing single-cell
RNA-sequencing data. IEEE/ACM Trans Comput Biol Bioinform 17(2):376–389
75. Lähnemann D et al (2020) Eleven grand challenges in single-cell data science. Genome Biol
21(1):1–35
76. Wagner F, Yan Y, Yanai I (2018) K-nearest neighbor smoothing for high-throughput single-
cell RNA-Seq data. bioRxiv: 217737
77. Van Dijk D et al (2018) Recovering gene interactions from single-cell data using data
diffusion. Cell 174(3):716–729.e27
78. Li WV, Li JJ (2018) An accurate and robust imputation method scImpute for single-cell RNA-
seq data. Nat Commun 9(1):1–9
79. Gong W et al (2018) DrImpute: imputing dropout events in single cell RNA sequencing data.
BMC Bioinformatics 19(1):1–10
80. Chen M, Zhou X (2018) VIPER: variability-preserving imputation for accurate gene expres-
sion recovery in single-cell RNA sequencing studies. Genome Biol 19(1):1–15
81. Huang M et al (2018) SAVER: gene expression recovery for single-cell RNA sequencing. Nat
Methods 15(7):539–542
82. Wu W et al (2021) G2S3: a gene graph-based imputation method for single-cell RNA
sequencing data. PLoS Comput Biol 17(5):e1009029
83. Elyanow R et al (2020) netNMF-sc: leveraging gene–gene interactions for imputation and
dimensionality reduction in single-cell expression analysis. Genome Res 30(2):195–204
84. Ronen J, Akalin A (2018) netSmooth: network-smoothing based imputation for single cell
RNA-seq. F1000Research 7
85. Linderman GC, Zhao J, Kluger Y (2018) Zero-preserving imputation of scRNA-seq data
using low-rank approximation. bioRxiv: 397588
86. Jin K et al (2020) scTSSR: gene expression recovery for single-cell RNA sequencing using
two-side sparse self-representation. Bioinformatics 36(10):3131–3138
87. Talwar D et al (2018) AutoImpute: autoencoder based imputation of single-cell RNA-seq
data. Sci Rep 8(1):1–11
88. Eraslan G et al (2019) Single-cell RNA-seq denoising using a deep count autoencoder. Nat
Commun 10(1):1–14
89. Arisdakessian C et al (2019) DeepImpute: an accurate, fast, and scalable deep neural network
method to impute single-cell RNA-seq data. Genome Biol 20(1):1–14
90. Amodio M et al (2019) Exploring single-cell data with deep multitasking neural networks.
Nat Methods 16(11):1139–1145
91. Andrews TS, Hemberg M (2018) False signals induced by single-cell imputation.
F1000Research 7
92. Zhang X-F et al (2019) EnImpute: imputing dropout events in single-cell RNA-sequencing
data via ensemble learning. Bioinformatics 35(22):4827–4829
93. Kalofolias V (2016) How to learn a graph from smooth signals. In: Artificial intelligence and
statistics. PMLR
94. Komodakis N, Pesquet J-C (2015) Playing with duality: an overview of recent primal-dual
approaches for solving large-scale optimization problems. IEEE Signal Process Mag
32(6):31–54

95. Tjärnberg A et al (2021) Optimal tuning of weighted kNN-and diffusion-based methods for
denoising single cell genomics data. PLoS Comput Biol 17(1):e1008569
96. Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: a
tutorial. Mol Syst Biol 15(6):e8746
97. Squair JW et al (2021) Confronting false discoveries in single-cell differential expression. Nat
Commun 12(1):5692
98. Adams TS et al (2020) Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell
populations in idiopathic pulmonary fibrosis. Sci Adv 6(28):eaba1983
99. Yao C et al (2019) Single-cell RNA-seq reveals TOX as a key regulator of CD8(+) T cell
persistence in chronic infection. Nat Immunol 20(7):890–901
100. Zimmerman KD, Espeland MA, Langefeld CD (2021) A practical solution to pseudoreplica-
tion bias in single-cell studies. Nat Commun 12(1)
101. Crowell HL et al (2020) Muscat detects subpopulation-specific state transitions from multi-
sample multi-condition single-cell transcriptomics data. Nat Commun 11(1):6077
102. Wohnhaas CT et al (2019) DMSO cryopreservation is the method of choice to preserve cells
for droplet-based single-cell RNA sequencing. Sci Rep 9(1):10699
103. Soneson C, Robinson MD (2018) Bias, robustness and scalability in single-cell differential
expression analysis. Nat Methods 15(4):255–261
104. Dal Molin A, Baruzzo G, Di Camillo B (2017) Single-cell RNA-sequencing: assessment of
differential expression analysis methods. Front Genet 8:62
105. Jaakkola MK et al (2017) Comparison of methods to detect differentially expressed genes
between single-cell populations. Brief Bioinform 18(5):735–743
106. Jia C et al (2017) Accounting for technical noise in differential expression analysis of single-
cell RNA sequencing data. Nucleic Acids Res 45(19):10978–10988
107. Qiu X et al (2017) Reversed graph embedding resolves complex single-cell trajectories. Nat
Methods 14(10):979–982
108. Chen W et al (2018) UMI-count modeling and differential expression analysis for single-cell
RNA sequencing. Genome Biol 19(1):70
109. Miao Z et al (2018) DEsingle for detecting three types of differential expression in single-cell
RNA-seq data. Bioinformatics 34(18):3223–3224
110. Ye C, Speed TP, Salim A (2019) DECENT: differential expression capture efficiency
adjustmeNT for single-cell RNA-seq data. Bioinformatics 35(24):5155–5162
111. Wu Z et al (2018) Two-phase differential expression analysis for single cell RNA-seq.
Bioinformatics 34(19):3340–3348
112. Korthauer KD et al (2016) A statistical approach for identifying differential distributions in
single-cell RNA-seq experiments. Genome Biol 17(1):222
113. Delmans M, Hemberg M (2016) Discrete distributional differential expression (D3E)–a tool
for gene expression analysis of single-cell RNA-seq data. BMC Bioinformatics 17:110
114. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for
RNA-seq data with DESeq2. Genome Biol 15(12):550
115. Ritchie ME et al (2015) limma powers differential expression analyses for RNA-sequencing
and microarray studies. Nucleic Acids Res 43(7):e47
116. Wilcoxon F (1946) Individual comparisons of grouped data by ranking methods. J Econ
Entomol 39:269
117. Svensson V (2020) Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38(2):147–150
118. Sarkar A, Stephens M (2021) Separating measurement and expression models clarifies
confusion in single-cell RNA sequencing analysis. Nat Genet 53(6):770–777
119. Brooks ME et al (2017) glmmTMB balances speed and flexibility among packages for zero-
inflated generalized linear mixed modeling. R Journal 9(2):378–400
120. Habermann AC et al (2020) Single-cell RNA sequencing reveals profibrotic roles of distinct
epithelial and mesenchymal lineages in pulmonary fibrosis. Sci Adv 6(28):eaba1972

Pre-processing, Dimension Reduction,
and Clustering for Single-Cell RNA-seq
Data
Jialu Hu, Yiran Wang, Xiang Zhou, and Mengjie Chen
Abstract The advent of scRNA-seq technologies enables us to quantitatively
characterize gene expression at the single-cell level. This high resolution has
transformed many areas of genomics. For example, scRNA-seq technologies
have been applied to the characterization of immunophenotypes, genome sequences,
lineage origins, cell development, and cell type-specific responses to stimuli.
However, scRNA-seq data suffer from various sources of technical noise, such as
excessive zeros, batch effects, and bias caused by library sizes, which hinder the
interpretation of scRNA-seq data and obscure the underlying biology. Hence, there
is an ever-increasing interest in developing statistical and computational tools to
analyze scRNA-seq data and gain biological insights into cellular compositions,
lineage origins, and cell type-specific responses to diseases in complex tissues.
This chapter summarizes key algorithmic insights and statistical models of some
state-of-the-art methods developed in recent years, to facilitate the understanding
of current scRNA-seq data analysis workflows and to promote the application of
these tools.
J. Hu · Y. Wang
School of Computer Science, Northwestern Polytechnical University, Xi’an, China
e-mail: jhu@nwpu.edu.cn; 1119755397@qq.com
X. Zhou
Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA
e-mail: xzhousph@umich.edu
M. Chen (✉)
Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL,
USA
Department of Human Genetics, The University of Chicago, Chicago, IL, USA
e-mail: mengjiechen@uchicago.edu
© The Author(s), under exclusive license to Springer-Verlag GmbH, DE,
part of Springer Nature 2022
H. H.-S. Lu et al. (eds.), Handbook of Statistical Bioinformatics,
Springer Handbooks of Computational Statistics,
https://doi.org/10.1007/978-3-662-65902-1_2
1 Introduction
The scRNA-seq technology was first proposed by Tang et al. [1] to measure
mRNA expression in a single mouse blastomere; they found that this novel
technology could detect the expression of 75% more genes than microarray
techniques. Subsequently, a number of scRNA-seq protocols were developed to
obtain a high-resolution view of cellular heterogeneity in a variety of applications,
including, but not limited to, identification of rare cell types [2], cell type
annotation [3], cell development [4], and cell response to stimuli [5]. scRNA-seq
technologies are now widely used in almost all fields of biology and medicine.
Although many scRNA-seq technologies are available for quantitative
measurement of RNA expression at the single-cell level, each technology has its
own strengths and weaknesses [6, 7]. To carry out a successful scRNA-seq study,
it is necessary to make a series of choices of suitable experimental procedures and
computational methods that address different technical biases. The first step is to
isolate the transcriptome of each individual cell from that of the others [8].
Typically, there are five ways of isolation: (1) dilution; (2) micromanipulation
[9, 10]; (3) fluorescence-activated cell sorting (FACS) [11]; (4) laser capture
microdissection (LCM) [12, 13]; and (5) microfluidics [14]. The dilution method
isolates cells with pipettes, leveraging the statistical distribution of diluted cells,
but only about one-third of the prepared wells end up containing a single cell.
Micromanipulation is used for isolating single cells from microorganisms; it is
time-consuming and low throughput. FACS is a commonly used method for
isolating highly purified single cells. It first tags cells of interest with a fluorescent
monoclonal antibody that recognizes specific surface markers, which enables
sorting of distinct cell subpopulations. LCM uses a laser capture system, but it is
low throughput. Microfluidics is currently the most widely used isolation
technology for scRNA-seq owing to its precise fluid control and scalability, and it
enables quantification of single-cell gene expression profiles in a high-throughput
manner. For example, microdroplet-based microfluidics enables the manipulation
and screening of tens of thousands of cells in one assay [14]. The second step of
scRNA-seq is library preparation, which includes cell lysis, reverse transcription
(RT), and cDNA amplification [15]. Typically, polyA selection is used to capture
mRNA after cell lysis, but only 10–20% of transcripts are captured at this stage
[16]. More recently, unique molecular identifiers (UMIs) or barcodes have been
introduced in the reverse transcription step, which helps to group reads by their
cell of origin [16, 17]. cDNA amplification can be accomplished by either
conventional PCR [1] or in vitro transcription (IVT) [18]. Several widely used
scRNA-seq library preparation platforms have been developed, including
Smart-seq [19], MARS-seq [18], CEL-seq [20], Drop-seq [21], and Microwell-seq
[22]. More detailed discussions of scRNA-seq technologies are provided in several
reviews [6, 7, 23].
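The one-third figure quoted for the dilution method can be illustrated with a short calculation. The sketch below assumes that the number of cells landing in a well follows a Poisson distribution, which is a common modeling assumption and not something stated in the text; the function name and the list of mean loadings are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a well receives exactly k cells when the mean is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Fraction of wells holding exactly one cell for several mean loadings.
single_cell_fraction = {lam: poisson_pmf(1, lam) for lam in (0.25, 0.5, 1.0, 2.0)}
best_lam = max(single_cell_fraction, key=single_cell_fraction.get)

for lam, frac in single_cell_fraction.items():
    print(f"mean cells per well = {lam:4.2f} -> single-cell wells = {frac:.3f}")
```

Under this assumption the single-cell fraction is lam * exp(-lam), which peaks at a mean of one cell per well and never exceeds exp(-1), roughly 37% of wells.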
With the fast expansion of scRNA-seq technologies, we face great computational
challenges in interpreting the raw experimental data through in-depth analysis
and in gaining biological insights into cellular composition, cell lineage, or cell
response in heterogeneous or complex disease tissues. Due to the noise caused by
low capture rate, low RT efficiency, and amplification bias, it is common practice
to filter low-quality cells or genes, impute dropouts (i.e., excess zeros), and
normalize the raw count data into continuous data in the pre-processing stage.
Removal of batch effects is also an essential step when analyzing samples
generated from different batches or protocols. After pre-processing, clustering,
trajectory inference, and differential gene expression analysis are common
strategies for studying cell lineage and heterogeneity in complex tissues.
In the following sections, we review the statistical models and key algorithmic
ideas of several current state-of-the-art methods for the pre-processing and
downstream analysis of single-cell RNA-seq data, with a focus on unsupervised
analyses.
2 Pre-processing of scRNA-seq Data
2.1 Removal of Batch Effects
Many factors can introduce unwanted variation in an scRNA-seq dataset and
confound the downstream analysis. Such unwanted variation is often due to a
range of scRNA-seq-specific conditions, including amplification bias, low amounts
of input material, and low transcript capture efficiency; dropout events that are
driven by both biological and technical factors; global changes in expression due to
transcriptional bursts; and changes in cell cycle and cell size. Common hidden
confounding factors include various technical artifacts arising during library
preparation and sequencing, as well as unwanted biological confounders such as
cell cycle status. These confounders introduce unwanted variation, called batch
effects, within an scRNA-seq dataset and can cause systematic bias.
Many algorithms have been proposed to remove batch effects within an scRNA-seq
dataset; they can be broadly grouped into two categories: supervised and
unsupervised methods. The supervised methods are designed to infer the
confounding factors in the presence of a known predictor variable. One supervised
method is surrogate variable analysis (SVA) [24], which borrows information
across genes to capture signatures of batch effects directly from the expression
data, rather than attempting to estimate specific unmodeled factors such as age or
gender. Mathematically, the statistical model of SVA can be written as below:
$$x_{ij} = \mu_i + f_i(y_j) + \sum_{l=1}^{L} \gamma_{li} g_{lj} + e_{ij}, \tag{1}$$
where $x_{ij}$ is the normalized expression for the i-th gene on the j-th array; $\mu_i$ is
the baseline level of expression; $f_i(y_j)$ gives the relationship between the measured
variable of interest and gene i; $g_l$ is an arbitrarily complicated factor across all
arrays; $\gamma_{li}$ is a gene-specific coefficient for the l-th unmodeled factor; $e_{ij}$ is the
true gene-specific noise. The SVA algorithm does not attempt to estimate the
specific variables $g_l$ directly. Instead, it chooses an orthogonal set of vectors
that spans the same linear space to estimate the linear combination
$\sum_{l=1}^{L} \gamma_{li} g_{lj}$. Other supervised methods include sparse regression
models [25, 26] and remove unwanted variation (RUV) [27]. However, these
supervised methods are restricted to cases where the variables of interest are
measured, and they reach their limits when the primary variable of interest is not
observed.
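The idea of replacing the unmodeled factors by an orthogonal set of vectors spanning the same space can be sketched as follows. This is a deliberately simplified illustration, not the SVA algorithm itself, which additionally selects informative genes and estimates the number of factors. It assumes simulated data that follow Eq. (1) with a linear relationship to a known two-group variable; all variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples, n_factors = 200, 40, 2

# Simulated data following Eq. (1), assuming a linear f_i(y_j) = b_i * y_j.
y = np.repeat([0.0, 1.0], n_samples // 2)          # known variable of interest
g = rng.normal(size=(n_factors, n_samples))        # unmodeled factors g_l
gamma = rng.normal(size=(n_genes, n_factors))      # gene-specific coefficients
mu = rng.normal(size=(n_genes, 1))                 # baseline expression
b = rng.normal(size=(n_genes, 1))                  # effect of y on each gene
x = mu + b * y + gamma @ g + 0.1 * rng.normal(size=(n_genes, n_samples))

# Step 1: regress each gene on the known variable and keep the residuals.
design = np.column_stack([np.ones(n_samples), y])  # intercept + y
coef, *_ = np.linalg.lstsq(design, x.T, rcond=None)
residuals = x - (design @ coef).T

# Step 2: the leading right singular vectors of the residual matrix form an
# orthonormal set that approximately spans the space of the unmodeled factors.
_, singular_values, vt = np.linalg.svd(residuals, full_matrices=False)
surrogates = vt[:n_factors]                        # shape (n_factors, n_samples)
```

The rows of `surrogates` are not estimates of the individual factors; they only span (approximately) the same space, which is all that is needed to adjust for the summed term in Eq. (1).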
The unsupervised methods, therefore, are designed to infer the confounding factors
without knowing the predictor variable of interest. For example, an unsupervised
version of RUV [28] adjusts for nuisance technical effects by performing factor
analysis on suitable sets of control genes or samples. Another algorithm, the
single-cell latent variable model (scLVM) [29], uses latent factors to account
for different hidden factors that might result in gene expression heterogeneity.
The scLVM algorithm fits a Gaussian process latent variable model, and the
resulting covariance matrix can be used for a range of analyses to decompose
variance, test for gene-gene correlation, or produce residuals. A more recent
unsupervised method, called single-cell partial least squares (scPLS), was proposed
by Chen et al. [30] based on a linear factor model. Before modeling the expression
levels, it splits all genes into two groups: a control set of genes and a target set of
genes. Assuming that we have the expression of an individual cell i measured on q
control genes and p target genes, scPLS uses the following equations to jointly
model both the control and target genes:
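$$x_i = \Lambda_x z_i + \varepsilon_{xi}, \quad \varepsilon_{xi} \sim \mathrm{MVN}(0, \Psi_{xi}) \tag{2}$$
$$y_i = \Lambda_y z_i + \Lambda_u u_i + \varepsilon_{yi}, \quad \varepsilon_{yi} \sim \mathrm{MVN}(0, \Psi_{yi}). \tag{3}$$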
Here, $x_i$ represents the expression vector of the q control genes; $y_i$ represents the
expression vector of the p target genes; $z_i$ represents a $k_z$-vector of unknown
confounding factors that affect both control and target genes; $u_i$ represents a
$k_u$-vector of unknown factors in target genes; $\Lambda_x$, $\Lambda_y$, and $\Lambda_u$ are the
coefficient loading matrices; and $\varepsilon_{xi}$ and $\varepsilon_{yi}$ are idiosyncratic errors with
diagonal covariance matrices $\Psi_{xi}$ and $\Psi_{yi}$, respectively. The model assumes that
$\varepsilon_{xi}$, $\varepsilon_{yi}$, $z_i$, and $u_i$ are all independent of each other, that
$z_i \sim \mathrm{MVN}(0, I)$ and $u_i \sim \mathrm{MVN}(0, I)$, that the expression vectors are
transformed data, and that the expression levels of each gene have been centered
to mean zero. There are two latent factors in the above model: $z_i$ and $u_i$. The first
factor $z_i$ represents the unknown confounding factors that can introduce batch
effects on both control and target genes. The coefficient matrices $\Lambda_x$ and $\Lambda_y$
capture the effects of $z_i$ on the two sets of genes, respectively. The second factor
$u_i$ can be interpreted as the biological signal of interest (e.g., cell subtypes), and
its coefficient matrix $\Lambda_u$ captures the structure of the expression levels of the p
target genes. By modeling the two factors jointly, scPLS is able to make full use of
the data to improve the inference of confounding effects.
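To make the structure of Eqs. (2) and (3) concrete, the following sketch simulates data from the model and compares the empirical covariances with those implied by the loadings. It is only a generative illustration and does not implement the scPLS inference procedure. The names `lambda_x`, `lambda_y`, `lambda_u`, `psi_x`, and `psi_y` are hypothetical labels for the loading matrices and diagonal error covariances, and the sketch assumes, for simplicity, that each error covariance is shared across cells.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, q, p, k_z, k_u = 5000, 4, 6, 2, 1

lambda_x = rng.normal(size=(q, k_z))   # effect of confounders z on control genes
lambda_y = rng.normal(size=(p, k_z))   # effect of confounders z on target genes
lambda_u = rng.normal(size=(p, k_u))   # effect of biological factor u on targets
psi_x = np.diag(rng.uniform(0.1, 0.5, size=q))   # diagonal error covariances
psi_y = np.diag(rng.uniform(0.1, 0.5, size=p))

# Latent factors and idiosyncratic errors, all mutually independent.
z = rng.normal(size=(n_cells, k_z))
u = rng.normal(size=(n_cells, k_u))
eps_x = rng.multivariate_normal(np.zeros(q), psi_x, size=n_cells)
eps_y = rng.multivariate_normal(np.zeros(p), psi_y, size=n_cells)

# Eqs. (2) and (3), with one cell per row.
x = z @ lambda_x.T + eps_x
y = z @ lambda_y.T + u @ lambda_u.T + eps_y

# Covariances implied by the model versus their empirical counterparts.
cov_x_model = lambda_x @ lambda_x.T + psi_x
cov_xy_model = lambda_x @ lambda_y.T
cov_x_emp = np.cov(x, rowvar=False)
cov_xy_emp = (x - x.mean(axis=0)).T @ (y - y.mean(axis=0)) / (n_cells - 1)
```

The cross-covariance between control and target genes depends only on the confounder loadings and not on the biological factor, which is why the control genes carry information about the confounding effects in the target genes.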
With the advance of scRNA-seq technologies, a tremendous number of scRNA-seq
datasets have been generated across samples, organisms, species, technologies,
and labs. This raises the problem of cross-sample single-cell data alignment,
whose solution benefits many integrated downstream analyses, such as
comparative analysis of two or more heterogeneous tissues across different
conditions and cellular composition analysis based on a reference atlas. To solve
this problem, many alignment methods have been developed.
The first class of alignment tools employs reference-based strategies to assign
each cell of a query dataset to a specific cell subtype based on knowledge learned
from a well-annotated reference dataset. These methods include scmap-cluster [31]
and scAlign [32]. However, they might be inaccurate when aligning query data
that contain new cell types onto the reference.
In the second class, the concept of mutual nearest neighbors (MNNs) is used by
many alignment tools to find matched cell pairs across batches, including
mnnCorrect [33] (batchelor/mnn_correct), Scanorama [34], and Seurat V3 [35].
mnnCorrect first finds the most similar cells by detecting MNN pairs across
batches, and then obtains a batch correction vector by averaging over these MNN
pairs. Inspired by the idea of panorama stitching in computer vision, Scanorama
employs a generalized MNN matching method to identify overlapping cells among
all datasets, and uses these matches for batch correction and integration. Seurat V3
is one of the most widely used alignment tools. In its pipeline, a linear model
(i.e., canonical correlation analysis [36]) is first used to project individual cells
from different datasets into a lower-dimensional space. Then, the most similar
MNN pairs (termed “anchors”) are used for anchor scoring, anchor weighting, and
calculating a transformation matrix, in which each column represents the
difference between the two expression vectors of a pair of anchor cells. Finally, an
integrated expression matrix is calculated by subtracting the transformation matrix
from the original expression matrix.
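The MNN idea described above can be sketched in a few lines. The example assumes two simulated batches in a two-dimensional space that share the same two cell populations and differ by a constant shift; the function name and all parameters are hypothetical. It computes a single global correction vector by averaging over MNN pairs, which is a simplification: the published tools apply further steps, such as locally weighted corrections, that are not reproduced here.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=1):
    """Return index pairs (i, j) where a[i] and b[j] are among each other's k nearest neighbors."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (n_a, n_b)
    nn_ab = np.argsort(dist, axis=1)[:, :k]    # for each cell in a: k nearest cells in b
    nn_ba = np.argsort(dist, axis=0)[:k, :].T  # for each cell in b: k nearest cells in a
    return [(i, int(j)) for i in range(a.shape[0]) for j in nn_ab[i] if i in nn_ba[j]]

rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # two hypothetical cell populations
batch1 = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centers])
shift = np.array([1.0, -1.0])                  # simulated batch effect
batch2 = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centers]) + shift

pairs = mutual_nearest_neighbors(batch1, batch2, k=3)
# Average difference over MNN pairs, used as one global correction vector.
correction = np.mean([batch1[i] - batch2[j] for i, j in pairs], axis=0)
batch2_corrected = batch2 + correction
```

Because MNN pairs tend to lie on the facing edges of the two batches, the averaged vector usually underestimates the true shift somewhat; it moves the batches closer together without necessarily aligning them exactly.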
Another group of methods was developed based on factor analysis, including
scMerge [37], LIGER [38], SPOTLight [39], and Duren’s method [40].
Specifically, scMerge characterizes gene expression patterns using a
gamma-Gaussian mixture model and uses the RUVIII model to remove the
unwanted variation across datasets. LIGER uses integrative non-negative matrix
factorization (iNMF) to delineate shared and dataset-specific factors of cell
identity. It attempts to reconstruct each original dataset ($E_i$) using three matrix
factors in the lower-dimensional space, such that $E_i \approx H_i(W + V_i)$, where all
these matrices are constrained to be non-negative. Mathematically, its objective
function can be written as follows:
$$\underset{H_i, V_i, W \ge 0}{\arg\min} \; \sum_{i}^{d} \left\| E_i - H_i (W + V_i) \right\|_F^2 + \lambda \sum_{i}^{d} \left\| H_i V_i \right\|_F^2. \tag{4}$$
Note that the $H_i$ and $V_i$ matrices are unique to each dataset, while the $W$ matrix
is shared across all datasets. The $H_i$ matrix is the cell representation matrix in the
lower-dimensional space, which can be used for many downstream analyses, such
as identifying cellular heterogeneity. SPOTLight and Duren’s method were also
proposed based on factor analysis and can integrate single-cell datasets across two
different modalities (e.g., scRNA-seq and scATAC-seq), which is beyond the scope
of this chapter.
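As a small illustration of Eq. (4), the function below evaluates the iNMF objective for given factor matrices. It only computes the value of the objective and does not perform the optimization used by LIGER. The layout (cells in rows of each dataset, genes in columns), the sizes, and the name `inmf_objective` are assumptions made for this sketch.

```python
import numpy as np

def inmf_objective(E, H, V, W, lam):
    """Value of Eq. (4): sum_i ||E_i - H_i (W + V_i)||_F^2 + lam * sum_i ||H_i V_i||_F^2."""
    total = 0.0
    for E_i, H_i, V_i in zip(E, H, V):
        total += np.linalg.norm(E_i - H_i @ (W + V_i), "fro") ** 2   # reconstruction
        total += lam * np.linalg.norm(H_i @ V_i, "fro") ** 2         # dataset-specific penalty
    return total

rng = np.random.default_rng(3)
n_genes, k = 30, 3
W = rng.uniform(size=(k, n_genes))                        # shared factors
H = [rng.uniform(size=(n, k)) for n in (50, 80)]          # cell loadings per dataset
V = [0.1 * rng.uniform(size=(k, n_genes)) for _ in H]     # dataset-specific factors
E = [H_i @ (W + V_i) for H_i, V_i in zip(H, V)]           # noiseless datasets

obj_no_penalty = inmf_objective(E, H, V, W, lam=0.0)
obj_with_penalty = inmf_objective(E, H, V, W, lam=5.0)
print(obj_no_penalty, obj_with_penalty)
```

With noiseless data and the true factors, the reconstruction term vanishes and only the penalty remains, so a larger value of the tuning parameter pushes the solution toward smaller dataset-specific components.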
2.2 Quality Control and Feature Selection
As stated above, UMIs have been used to group reads into their original cells in
the RT step. However, some studies [41, 42] suggest that UMI barcoding leads
to excessive zeros, or “drop-outs”, in scRNA-seq data. A wide array of tools
has been developed to impute these excessive zeros as one step of quality control.
For example, DrImpute [43] imputes the reads using a zero-inflated model to
reduce the noise from drop-outs. sctransform [44] models the data with negative
binomial regression. SAVER [45] fits the data using Poisson LASSO regression.
DCA [46] combines a zero-inflated negative binomial model with a deep-learning
framework to learn cell representations in a reduced-dimensional space. All of the
above imputation methods assume that pre-processing is a necessary step before
feature selection and downstream analysis. However, imputation or normalization
before resolving the cellular heterogeneity may lead to loss of biological signal.
To address this problem, Kim et al. [47] developed a computational tool called
HIPPO, which provides a new perspective on scRNA-seq data analysis by
integrating a zero-inflation test with a hierarchical clustering framework. HIPPO
leverages zero proportions to detect different levels of cell type heterogeneity in
each gene of UMI data, such as datasets generated by 10X protocols.
Due to the low capture rate and low RT efficiency, technical noise is mixed with
intrinsic biological variation, which makes it very challenging to distinguish
technical artifacts from real biological signals. Therefore, a major task is to
develop methods for detecting and filtering out technical artifacts. Before read
mapping, a very commonly used quality control tool is FastQC
(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which returns
quality metrics for sequencing data. For example, GC content is one of these
quality metrics and can be an indicator of sample contamination. Once the raw
count matrices are obtained, the collected quality metrics can be used to filter the
data so that only true cells and high-quality genes are retained. These quality
metrics usually include total read counts or UMIs per cell (nCount_RNA), total
read counts per gene, number of genes detected per cell (nFeature_RNA), number
of cells in which each gene is detected, and mitochondrial ratio (percent.mt). The
mitochondrial ratio is the percentage of a cell’s reads that originate from
mitochondrial genes. Cells with a high mitochondrial ratio are filtered out because
poor-quality cells often result from the loss of cytoplasmic RNA, which inflates
the mitochondrial fraction [48]. In addition, overall gene expression patterns,
features of housekeeping genes and spike-in RNA, and the total number of mapped
reads have also been used to detect technical artifacts in previous studies [49, 50].
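The per-cell quality metrics listed above can be computed directly from a raw count matrix. The following sketch uses a tiny made-up matrix with cells in rows and genes in columns, assumes that mitochondrial genes are recognizable by an "MT-" prefix, and applies illustrative thresholds that would need tuning for any real dataset; none of these values come from the text.

```python
import numpy as np

gene_names = np.array(["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E", "MS4A1"])
# Rows are cells, columns are genes (raw UMI counts); values are made up.
counts = np.array([
    [ 5,  3, 40, 30, 12,  0],   # healthy-looking cell
    [ 2,  1, 25, 20,  0,  9],   # healthy-looking cell
    [30, 25,  3,  2,  0,  0],   # mostly mitochondrial reads
    [ 0,  0,  1,  0,  0,  0],   # nearly empty droplet
])

is_mito = np.char.startswith(gene_names, "MT-")
n_count = counts.sum(axis=1)                                  # analogous to nCount_RNA
n_feature = (counts > 0).sum(axis=1)                          # analogous to nFeature_RNA
percent_mt = 100 * counts[:, is_mito].sum(axis=1) / n_count   # analogous to percent.mt

# Thresholds are illustrative only and should be tuned per dataset.
keep_cells = (n_count >= 10) & (n_feature >= 3) & (percent_mt < 20)
keep_genes = (counts[keep_cells] > 0).sum(axis=0) >= 1        # detected in at least one kept cell
filtered = counts[keep_cells][:, keep_genes]
```

In this toy example the third cell is removed for its high mitochondrial percentage and the fourth for its low total count and low number of detected genes.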
After de-noising or normalization, feature selection is another important step
before the pre-processed datasets are passed to downstream analysis tools such
as clustering [51] or lineage inference [52]. There are usually tens of thousands
of genes in a typical genome, but only a subset of genes is relevant for specific
functions. Feature selection can significantly improve the computational efficiency
and the signal-to-noise ratio by reducing the number of genes under consideration.
Existing feature selection methods usually identify the top 500–2000 features
by ranking genes either by their variability, yielding highly variable genes (HVGs)
[53], or by their average expression level across all cells. The computation of
HVGs is provided in many software packages, including scLVM [29], M3Drop
[54], scran, Seurat [35], and Scanpy [55] (see more in [56]).
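A minimal version of HVG selection ranks genes by a dispersion statistic. The sketch below uses the plain variance-to-mean ratio on simulated counts; this is a simplification chosen for illustration, and the packages named above use more refined, mean-adjusted statistics. The function name and the simulation settings are hypothetical.

```python
import numpy as np

def top_variable_genes(counts, n_top):
    """Rank genes by variance-to-mean ratio and return the indices of the top n_top."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    # Genes with zero mean get dispersion 0 instead of a division error.
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return np.argsort(dispersion)[::-1][:n_top]

rng = np.random.default_rng(4)
n_cells, n_genes = 300, 50
counts = rng.poisson(5.0, size=(n_cells, n_genes)).astype(float)
# Make the first five genes differ between two hypothetical cell groups.
counts[: n_cells // 2, :5] = rng.poisson(30.0, size=(n_cells // 2, 5))

hvg = top_variable_genes(counts, n_top=5)
print(sorted(hvg.tolist()))
```

Genes whose expression differs between cell groups show much more variance than expected for their mean, so they rise to the top of the ranking, while genes with pure Poisson noise have a ratio close to one.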
3 Dimension Reduction and Clustering
3.1 Dimension Reduction
Dimension reduction seeks to embed the high-dimensional expression profile of
each cell into a low-dimensional representation to facilitate visualization and
clustering. A plethora of dimensionality reduction methods have been used in the
analysis of scRNA-seq data [57].
Principal component analysis (PCA) [58] is a classic dimension reduction
method. When PCA is applied to scRNA-seq data, it captures the cross-cell
variability through linear combinations of genes. The first few principal
components capture the main sources of variation in the data and give the best
approximation, in the least-squares sense, of the original data in the
low-dimensional space. Ideally, each new dimension after PCA would correspond
to a biological meaning with a straightforward interpretation, but in real data
analysis this is often difficult to achieve. Nevertheless, the simplicity and high
efficiency of PCA make it widely used in scRNA-seq data analysis.
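PCA as described above can be computed with a singular value decomposition of the centered data matrix. The sketch below assumes a cells-by-genes matrix of already normalized values and uses simulated data with two hypothetical cell groups; the function name `pca` and the simulation are illustrative only.

```python
import numpy as np

def pca(x, n_components):
    """Return PC scores, loadings, and explained-variance ratios for x (cells x genes)."""
    centered = x - x.mean(axis=0)                 # center each gene
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # cell coordinates on the PCs
    explained = s**2 / np.sum(s**2)
    return scores, vt[:n_components], explained[:n_components]

rng = np.random.default_rng(5)
# Two hypothetical cell groups separated along a single direction in 20 genes.
labels = np.repeat([0, 1], 100)
direction = rng.normal(size=20)
x = rng.normal(size=(200, 20)) + 3.0 * labels[:, None] * direction

scores, loadings, explained = pca(x, n_components=2)
```

Each row of `loadings` is a linear combination of genes, and the first component here aligns with the direction that separates the two groups because that direction carries the largest share of the variance.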
Multidimensional scaling (MDS) [59] is another commonly used linear dimension
reduction method, which projects points from the high-dimensional coordinates
into a low-dimensional space while keeping the distance between any two points
in the new space as close as possible to that in the original space. The MDS
algorithm requires no prior knowledge and the calculation is simple. However, the
new dimensions from an MDS analysis can contribute very unequally to the
separation of points. The distance between two points in MDS is often defined as
their Euclidean distance, which makes it inappropriate for data with non-Euclidean
structure, such as manifolds (spaces that are only locally Euclidean).
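One standard way to realize MDS with Euclidean distances is classical (Torgerson) MDS, which double-centers the squared distance matrix and takes its leading eigenvectors. The sketch below assumes this variant; other MDS formulations minimize a stress function iteratively and are not shown. The function name and the random test points are hypothetical.

```python
import numpy as np

def classical_mds(dist, n_components):
    """Classical (Torgerson) MDS from a matrix of pairwise Euclidean distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist**2) @ centering    # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(gram)              # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]    # keep the largest ones
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

rng = np.random.default_rng(6)
points = rng.normal(size=(30, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

embedding = classical_mds(dist, n_components=2)
embedded_dist = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=2)
```

Since the input points are two-dimensional, a two-dimensional embedding reproduces all pairwise distances up to rotation and reflection; with fewer components than the intrinsic dimension, the distances are only approximated.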
Isometric Mapping (Isomap) [60] is a nonlinear dimension reduction method
based on the local Euclidean property of manifolds: the distance between two
points is approximated by the sum of the lengths of the short hops along a chain
of neighboring points that connects them. Another dimension reduction method,
non-negative matrix factorization (NMF) [61], decomposes a given non-negative
matrix $V$ into the product of a non-negative matrix $W$ and a non-negative matrix
$H$. Non-negative matrix factorization is NP-hard in general, so an iterative
multiplicative update method is used to solve for $W$ and $H$. The multiplicative
update rules are as follows:
$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}},$$
where the subscript $a\mu$ refers to the element in row $a$ and column $\mu$ of the
matrix. The dimension reduction of the original matrix can then be realized by
replacing the original matrix $V$ with the matrix $H$. NMF outperforms classic
low-rank approximation approaches such as PCA in some cases, and makes
post-processing much easier. NMF has been successfully applied to scRNA-seq
data to extract meaningful non-negative features.
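The multiplicative update rules translate almost line by line into code. The sketch below assumes the squared Frobenius reconstruction error as the objective and adds a small constant `eps` to the denominators to avoid division by zero; the function name, the iteration count, and the simulated rank-3 matrix are illustrative choices.

```python
import numpy as np

def nmf(v, rank, n_iter=500, eps=1e-10, seed=0):
    """Factor a non-negative matrix v into w @ h using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = v.shape
    w = rng.uniform(0.1, 1.0, size=(n, rank))
    h = rng.uniform(0.1, 1.0, size=(rank, m))
    errors = []
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)   # update rule for H
        w *= (v @ h.T) / (w @ h @ h.T + eps)   # update rule for W
        errors.append(np.linalg.norm(v - w @ h, "fro"))
    return w, h, errors

rng = np.random.default_rng(8)
v = rng.uniform(size=(40, 3)) @ rng.uniform(size=(3, 25))   # exactly rank 3, non-negative
w, h, errors = nmf(v, rank=3)
print(f"error after first iteration: {errors[0]:.4f}, after last: {errors[-1]:.4f}")
```

Because every update multiplies the current entries by a non-negative ratio, `w` and `h` stay non-negative throughout, and the reconstruction error does not increase from one iteration to the next.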
t-SNE [62, 63] and Uniform Manifold Approximation and Projection (UMAP)
[64, 65] are two further nonlinear dimensionality reduction methods, whose
widespread adoption has been driven largely by the need to analyze scRNA-seq
data. Unlike the idea of distance invariance in Isomap, t-SNE first transforms
Euclidean distances into conditional probabilities $p_{j|i}$ that express the
similarity between point $x_i$ and point $x_j$. Mathematically, the conditional
probability $p_{j|i}$ is given by
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},$$
where $\sigma_i^2$ is the variance of the Gaussian centered on point $x_i$. t-SNE then
minimizes a divergence (the Kullback-Leibler divergence) between the similarity
distributions in the high- and low-dimensional spaces, so that the similarity
structure between points is preserved as closely as possible. Compared with
the original SNE method [66], t-SNE uses a heavy-tailed t-distribution in the
low-dimensional space to alleviate crowding and optimization problems. However,
t-SNE focuses on capturing local structural similarity, so it may exaggerate the
differences between cell populations and ignore the potential connections between
them. UMAP is similar to t-SNE, but it better retains the relationships between
cells in the high-dimensional space. UMAP can therefore better reflect the
underlying topological structure of cell populations, and is more practical for cell
trajectory inference analysis.
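The conditional probabilities that t-SNE starts from can be computed as follows. The sketch takes the bandwidths as given inputs, whereas t-SNE implementations choose each bandwidth by a search that matches a user-specified perplexity; that search and the subsequent low-dimensional optimization are omitted. The function name and the two simulated clusters are hypothetical.

```python
import numpy as np

def conditional_probabilities(x, sigma):
    """Row i holds p_{j|i} for all j, with p_{i|i} = 0 and each row summing to 1."""
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    logits = -sq_dist / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # exclude k = i from the sum
    logits -= logits.max(axis=1, keepdims=True)   # subtract row maximum for stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(7)
# Two well-separated hypothetical clusters of 10 cells each in 5 dimensions.
x = np.vstack([rng.normal(0, 0.5, size=(10, 5)), rng.normal(4, 0.5, size=(10, 5))])
p = conditional_probabilities(x, sigma=np.full(len(x), 1.0))
```

Each row of `p` is a probability distribution over the other points, and points in the same cluster receive nearly all of the mass, which illustrates why t-SNE emphasizes local structure. The matrix is generally not symmetric, since every point has its own normalization.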
Some dimension reduction methods leverage the latest advances in deep neural
networks, such as VASC [67] and DCA [46]. The natural network architecture for
dimension reduction is an autoencoder [46, 68], whose input and output are both
high-dimensional data, while the low-dimensional middle layer retains almost all
of the information in the input and output; the middle-layer values can therefore
be used to represent the original high-dimensional data.
