Principal component analysis (PCA) to analyze data
Rajiv Gandhi University
Rono Hills, Doimukh (A.P.)
Assignment on Principal Component Analysis (PCA) in biological
studies.
Paper code: ZOOC-511
Paper title: Biostatistics and bioinformatics.
Submitted to: Dr. M. Stelin Singh
By: Bethel Modi
Roll no: 23/ZOO/PG/00-004
Class: 4th semester
Contents:
1. Introduction
2. PCA in biological research
3. PCA in gene expression analysis
4. Advantages of PCA
5. Disadvantages of PCA
6. Conclusion
References
INTRODUCTION
Principal component analysis (PCA) is a standard tool in multivariate
data analysis for reducing the number of dimensions while retaining as
much as possible of the data’s variation. Widely used in machine
learning and statistics to simplify complex data sets, it
transforms a set of correlated variables into a smaller set of
uncorrelated variables called principal components.
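As a minimal illustration of this transformation, the following Python sketch builds two correlated variables and checks that their principal component scores are uncorrelated. It assumes NumPy is available; the data are synthetic and all variable names are illustrative, not taken from any cited source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples of two strongly correlated variables.
x = rng.normal(size=200)
y = 0.8 * x + 0.3 * rng.normal(size=200)
data = np.column_stack([x, y])

# Centre each variable, then diagonalise the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so sort them descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Principal component scores: projections of the data onto the eigenvectors.
scores = centered @ eigenvectors

corr_original = np.corrcoef(data, rowvar=False)[0, 1]
corr_scores = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"correlation of original variables:   {corr_original:.3f}")
print(f"correlation of principal components: {corr_scores:.3f}")
```

The original variables are strongly correlated, while the correlation between the two component scores is zero up to rounding error.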
Principal components are a few linear combinations of the original
variables that maximally explain the variance of all the variables. In
the process, the method provides an approximation of the original
data table using only these few linear combinations.
Principal component analysis often reveals relationships that
were not previously suspected and thereby allows interpretations that
would not ordinarily result.
The visualization and statistical analysis of these new variables, the
principal components, can help to find similarities and differences
between samples. Important original variables that are major
contributors to the first component can be discovered as well.
The main graphical result is often in the form of a biplot, using the
major components to map the cases and adding the original variables
to support the distance interpretation of the cases’ positions.
Variants of the method are also treated, such as the analysis of
grouped data, as well as the analysis of categorical data, known as
correspondence analysis.
The advantages of using PCA are:
Reduces computational complexity.
Removes noise and redundancy.
Improves visualization of high-dimensional data.
Helps reduce multicollinearity in models.
The goal of PCA is to identify a reduced set of features that represent
the original data in a lower dimensional subspace with minimal loss
of information. PCA and related methods provide means to
summarize the data and extract information about individual
differences. This makes these methods particularly useful in the era of
big data and personalized medicine. (Ferath Kherif, Adeliya Latypova)
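The phrase "minimal loss of information" can be made precise. The block below is a sketch under assumed notation (none of these symbols appear in the original text): X is the centred data matrix with n samples and p variables, and the eigenvalues of its covariance matrix are sorted in decreasing order.

```latex
% Assumed notation: X is the centred n x p data matrix,
% S = X^T X / (n - 1) is its covariance matrix with eigenvalues
% \lambda_1 \ge \dots \ge \lambda_p, and W_k is the p x k matrix whose
% columns are the first k eigenvectors of S.
\begin{aligned}
T_k &= X W_k, \qquad \hat{X}_k = T_k W_k^{\top},\\
\lVert X - \hat{X}_k \rVert_F^2 &= (n-1) \sum_{j=k+1}^{p} \lambda_j .
\end{aligned}
```

In words: the scores T_k are the data expressed in the first k principal components, and the information lost by keeping only k components equals the sum of the discarded eigenvalues. No other projection onto a k-dimensional subspace gives a smaller reconstruction error.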
There is a neat distinction between general purpose statistical
techniques and quantitative models developed for specific problems.
Principal component analysis (PCA) blurs this distinction: while
being a general purpose statistical technique, it implies a peculiar
style of reasoning. PCA is a hypothesis generating tool that creates a
statistical mechanics frame for modeling biological systems without
the need for strong a priori theoretical assumptions. This makes PCA
of utmost importance for approaching drug discovery from a systematic
perspective that overcomes an overly narrow reductionist approach.
PCA in biological research
The application of PCA ranges across all the main themes of
pharmacology and the biomedical sciences, from
quantitative structure-activity relationships to data mining and the
different omics approaches. PCA is applied in different biological
fields as follows.
1. Genomics and transcriptomics
Gene expression analysis: PCA helps reduce the
complexity of RNA sequencing and microarray data by
identifying patterns in gene expression across different
conditions (e.g. healthy vs. diseased state).
Population genetics: PCA is used to detect population
structure and ancestry relationships in genome-wide association
studies (GWAS).
2. Proteomics and Metabolomics.
Biomarker Discovery: PCA is used to identify key
proteins or metabolites that differentiate disease and
control samples.
Spectroscopy data analysis: Helps in processing and
interpreting mass spectrometry (MS) and nuclear magnetic
resonance (NMR) data.
3. Microbiology and Ecology
Microbial community analysis: PCA is applied to 16S
rRNA sequencing data to visualize microbiome
differences across samples.
Ecological Studies: Helps in analyzing species distribution
patterns based on environmental factors.
4. Medical and clinical research
Disease classification: PCA assists in clustering patients
based on clinical and genetic features.
Drug response analysis: Used to analyze how different
individuals respond to treatment based on multi-omics
data.
5. Structural Biology and imaging.
Protein structural analysis: Helps identify structural
variations in protein dynamics simulations.
Medical imaging: PCA is used for dimensionality
reduction in MRI, CT and histopathological image
analysis.
PCA is a fundamental tool in computational biology, enabling
researchers to make sense of large, complex datasets efficiently.
PCA in Gene Expression Analysis.
Karl Pearson succinctly defined the main goal of PCA: ‘In many
physical, statistical and biological investigations it is desirable to
represent a system of points in plane, three or higher dimensioned
space by the best fitting straight line or plane’.
Pearson continues: ‘In nearly all the cases dealt with in the text books
of least squares, the variables on the right of our equations are treated
as the independent, those on the left as the dependent variables.’ This
implies that the minimization of the sum of squared distances only
deals with the dependent (y) variable. The variance along the independent
(x) variable, being the consequence of the choice of the scientist (e.g.
dose, time of observation), is supposed to be strictly controlled and
thus does not enter the least squares computation.
The novelty of PCA lies in a different look at reality: in many cases of
physics and biology, however, the ‘independent’ variable is subject
to just as much deviation or error as the ‘dependent’ variable. We do
not, for example, know x accurately and then proceed to find y, but
both x and y are found by experiment or observation.
The principal axes represent the directions of maximum variability and
provide a simple and parsimonious description of the covariance
structure.
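Pearson's point can be checked numerically. The sketch below is an illustration on synthetic data, not Pearson's own example. It assumes NumPy, a true relationship y = x, and equal noise in both variables. It compares the ordinary least squares slope, which only accounts for error in y, with the slope of the first principal axis, which treats x and y symmetrically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example (assumption): a true line y = x, with noise in BOTH x and y.
true_values = rng.normal(size=500)
x = true_values + 0.5 * rng.normal(size=500)
y = true_values + 0.5 * rng.normal(size=500)

xc = x - x.mean()
yc = y - y.mean()

# Ordinary least squares: minimises vertical (y) distances only.
ols_slope = (xc @ yc) / (xc @ xc)

# PCA: the first eigenvector of the covariance matrix gives the line that
# minimises perpendicular distances from the points.
cov = np.cov(np.vstack([xc, yc]))
eigenvalues, eigenvectors = np.linalg.eigh(cov)
first_axis = eigenvectors[:, np.argmax(eigenvalues)]
pca_slope = first_axis[1] / first_axis[0]

print(f"least squares slope:        {ols_slope:.2f}")
print(f"first principal axis slope: {pca_slope:.2f}")
```

Because x is itself noisy, the least squares slope is pulled below the true value of 1, while the slope of the first principal axis stays close to it.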
There are different types of cells. Unfortunately, we cannot observe the
differences from outside, so we sequence the mRNA in each cell to
identify which genes are active; this tells us what the cell is doing.
          Cell 1   Cell 2   Cell 3   Cell 4
Gene 1    3        0.25     2.8      0.1
Gene 2    2.9      0.8      2.2      1.8
Gene 3    2.2      1        1.5      3.2
Gene 4    2        1.4      2        0.3
Gene 5    1.3      1.6      1.6      0
Gene 6    1.5      2        2.1      3
Gene 7    1.1      2.2      0.9      2.8
Gene 8    1        2.7      0.9      0.3
Gene 9    0.4      3        0.6      0.1
Each column shows how much each gene is transcribed in that cell. Gene 1
is transcribed highly in cell 1 and lowly in cell 2, whereas gene 9 is
transcribed lowly in cell 1 and highly in cell 2. In
general, cell 1 and cell 2 have an inverse correlation. This means that
they are probably two different types of cells, since they are using
different genes.
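The correlation between cells can be computed directly from the table above. The following is a small sketch assuming NumPy; the array simply copies the table values, and the variable names are illustrative.

```python
import numpy as np

# Expression values copied from the table: rows = genes 1-9, columns = cells 1-4.
expression = np.array([
    [3.0, 0.25, 2.8, 0.1],
    [2.9, 0.8,  2.2, 1.8],
    [2.2, 1.0,  1.5, 3.2],
    [2.0, 1.4,  2.0, 0.3],
    [1.3, 1.6,  1.6, 0.0],
    [1.5, 2.0,  2.1, 3.0],
    [1.1, 2.2,  0.9, 2.8],
    [1.0, 2.7,  0.9, 0.3],
    [0.4, 3.0,  0.6, 0.1],
])

# Pearson correlation between cells (each column is treated as one variable).
cell_corr = np.corrcoef(expression, rowvar=False)

print("correlation matrix between cells:")
print(np.round(cell_corr, 2))
```

In the resulting matrix, the entry for cell 1 and cell 2 is strongly negative, while the entry for cell 1 and cell 3 is positive, which suggests that cells 1 and 3 use their genes in a similar way.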
Alternatively, we could try to plot three cells at once on a
3-dimensional graph, which shows how these cells are related to each
other. When there are more than three cells, we can no longer draw
such a graph, so we use a principal component analysis (PCA)
plot instead.
The axes are ranked in order of importance: differences along the first
principal component axis (PC1) are more important than differences
along the second principal component axis (PC2).
A PCA plot converts the correlations (or lack thereof) among all of
the cells into a 2-dimensional graph. Cells that are highly correlated
cluster close together.
Once we have identified the clusters in the PCA plot, we can go back to
the original cells and see that they represent three different types of
cells doing three different things with their genes.
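The coordinates of such a PCA plot can be sketched for the four cells in the table. The code below assumes NumPy and treats each cell as a sample and each gene as a variable; this is one common convention and an assumption here, since the text does not specify how the plot was computed.

```python
import numpy as np

# Expression values copied from the table: rows = genes 1-9, columns = cells 1-4.
expression = np.array([
    [3.0, 0.25, 2.8, 0.1],
    [2.9, 0.8,  2.2, 1.8],
    [2.2, 1.0,  1.5, 3.2],
    [2.0, 1.4,  2.0, 0.3],
    [1.3, 1.6,  1.6, 0.0],
    [1.5, 2.0,  2.1, 3.0],
    [1.1, 2.2,  0.9, 2.8],
    [1.0, 2.7,  0.9, 0.3],
    [0.4, 3.0,  0.6, 0.1],
])

# Treat each cell as a sample and each gene as a variable.
cells = expression.T                       # shape: (4 cells, 9 genes)
centered = cells - cells.mean(axis=0)

# SVD of the centred matrix: rows of vt are the principal axes.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                             # coordinates of each cell on the PCs
explained = s**2 / np.sum(s**2)            # share of variance per component

for name, (pc1, pc2) in zip(["Cell 1", "Cell 2", "Cell 3", "Cell 4"], scores[:, :2]):
    print(f"{name}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")
print("share of variance on PC1 and PC2:", np.round(explained[:2], 2))
```

Plotting the PC1 and PC2 columns against each other gives the 2-dimensional PCA plot. Cells 1 and 3 land close together, while cells 2 and 4 lie far from them and from each other.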
Advantages of PCA
1. Useful for noise reduction: PCA can filter out noise by
capturing only the most significant variations in the data.
2. Improves interpretability: By transforming data into principal
components, it can reveal hidden structures and patterns that
were not obvious in the original features.
3. Removes redundancy: PCA combines correlated, redundant
features into uncorrelated components, improving model efficiency.
4. Enhances computational efficiency: By reducing the number of
dimensions, PCA lowers the cost of later computations and helps in
visualizing patterns by projecting data into 2D or 3D space.
5. Reduces dimensionality: It simplifies complex datasets by
reducing the number of features while preserving important
patterns, making data easier to visualize and analyze.
However, PCA should be used carefully, as it can lead to
information loss if too many components are removed.
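The noise reduction advantage can be illustrated with a small sketch. It assumes NumPy and uses synthetic data in which 20 variables are driven by 2 hidden factors plus random noise; the number of retained components, k = 2, is chosen to match that assumption and is not a general recommendation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumption): 100 samples, 20 variables driven by 2 hidden factors.
factors = rng.normal(size=(100, 2))
loadings = rng.normal(size=(2, 20))
signal = factors @ loadings
noisy = signal + 0.5 * rng.normal(size=signal.shape)

# PCA via SVD of the centred noisy data.
mean = noisy.mean(axis=0)
u, s, vt = np.linalg.svd(noisy - mean, full_matrices=False)

# Keep only the first k components and reconstruct the data from them.
k = 2
denoised = (u[:, :k] * s[:k]) @ vt[:k] + mean

error_noisy = np.mean((noisy - signal) ** 2)
error_denoised = np.mean((denoised - signal) ** 2)
print(f"mean squared error before PCA: {error_noisy:.3f}")
print(f"mean squared error after PCA:  {error_denoised:.3f}")
```

The reconstruction from the first two components is closer to the noise-free signal than the noisy data, because most of the noise lies in the discarded components.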
Disadvantages of PCA
While PCA has many advantages, it also has several disadvantages
that should be considered:
1. Difficulty in choosing the number of components: Deciding
how many principal components to retain requires careful
analysis, such as using the explained variance ratio.
2. May not work for categorical data: PCA is designed for
continuous numerical data and may not perform well with
categorical variables unless properly encoded.
3. Computational cost: For very large datasets, computing the
covariance matrix and eigenvectors can be computationally
expensive.
4. Assumes linearity: PCA assumes that relationships between
variables are linear, which may not be true for real world data.
5. Risk of information loss: If too many principal components are
removed, important data variations may be lost, leading to
reduced model performance.
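The explained variance ratio mentioned in point 1 can be computed as in the following sketch. It assumes NumPy and synthetic data with 3 underlying factors; the 95% threshold is an illustrative cut-off, not a universal rule.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumption): 150 samples, 10 variables, 3 underlying factors.
factors = rng.normal(size=(150, 3))
loadings = rng.normal(size=(3, 10))
data = factors @ loadings + 0.2 * rng.normal(size=(150, 10))

# Singular values of the centred data give the variance along each component.
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)

explained_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # illustrative cut-off (assumption)
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", np.round(cumulative, 3))
print("components needed for 95% of the variance:", n_components)
```

A plot of the explained variance ratio against the component number (a scree plot) is often inspected in the same way, by looking for the point where additional components stop adding much variance.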
Conclusion:
Principal component analysis is a powerful dimensionality
reduction technique widely used in data analysis and machine
learning. It helps simplify complex data sets by transforming
correlated variables into a smaller set of uncorrelated principal
components while preserving the most significant variance.
However, it also has limitations, such as potential information
loss, difficulty in interpretation and sensitivity to data scaling.
Despite these drawbacks, PCA remains a valuable tool for feature
extraction, noise reduction and improving model performance
when applied appropriately.
REFERENCES:
1. Andrzej Mackiewicz, Waldemar Ratajczak (1993). Principal component
analysis.
2. Andreas Daffertshofer, Claudine J.C. Lamoth, Onno G. Meijer, Peter J. Beek
(2004). PCA in studying coordination and variability: a tutorial. Elsevier
publications.
3. K.Y. Yeung, W.L. Ruzzo (2001). Principal component analysis
for clustering gene expression data.
4. Herve Abdi, Lynne J. Williams (2010). Principal component analysis.
WIREs Computational Statistics. Wiley publications.
5. Yang Chen (2016). Reference-related component analysis: A new method
inheriting the advantages of PLS and PCA for separating interesting
information and reducing data dimension. Elsevier publications.
6. Kelly J. Egan, L. Brian Ready (1994). Patient satisfaction with intravenous
PCA or epidural morphine. Canadian Journal of Anesthesia.
7. Vincent Barra (2004). Analysis of gene expression data using functional
principal components.