Curse of dimensionality

Dalmas Chituyi

| AI Engineer | Data Scientist | Applied Machine Learning | ML Researcher.

Published Mar 9, 2024

Introduction

In data science, high-dimensional data is a common challenge: as the number of features grows, the observations become sparse relative to the space they occupy, and models need far more data to generalize well. This problem is often referred to as the "curse of dimensionality". Two popular techniques for combating it are Principal Component Analysis (PCA) and Partial Least Squares (PLS). Both are used for dimensionality reduction, but they differ in one key respect: PCA ignores the response variable, while PLS uses it.

Principal Component Analysis (PCA)

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible while staying orthogonal to the components before it. PCA is sensitive to the relative scaling of the original variables, so features measured in different units are usually standardized first. It is also unsupervised: it does not consider the response variable, even in a supervised learning setting.
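To make the description above concrete, here is a minimal sketch of PCA computed with a singular value decomposition in NumPy. The function name `pca`, its `standardize` flag, and the synthetic data are assumptions made for illustration; they are not taken from any particular library.

```python
import numpy as np

def pca(X, n_components, standardize=False):
    """Illustrative PCA via SVD (hypothetical helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    if standardize:
        Xc = Xc / Xc.std(axis=0, ddof=1)     # put variables on a common scale
    # Rows of Vt are orthonormal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T               # data projected onto the components
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return scores, components, ratio[:n_components]

# Synthetic data (an assumption): 5 correlated variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

scores, components, ratio = pca(X, n_components=2)
scores_std, _, _ = pca(X, n_components=2, standardize=True)
print("explained variance ratio:", ratio)
```

Because the five variables here are generated from two latent factors, two components should capture nearly all of the variance. Passing `standardize=True` is one way to handle the scaling sensitivity mentioned above.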

Partial Least Squares (PLS)

PLS, on the other hand, is a method that bears some relation to principal components regression. Unlike PCA, it uses the response variable to guide the data compression process. Hence, it is particularly useful when we need to predict an outcome variable.

PLS looks for directions in the X space that explain as much as possible of the variance in the Y space. In practice, each component is chosen to maximize the covariance between the projected predictors and the response. PLS regression is particularly well suited to cases where the matrix of predictors has more variables than observations, and where there is multicollinearity among the X values.
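The idea can be sketched for a single response variable (often called PLS1) in a few lines of NumPy. This is a simplified illustration, not a reference implementation; the names `pls1_fit` and `r_squared` and the synthetic data (30 observations, 100 collinear predictors) are assumptions chosen to mirror the setting described above.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Illustrative PLS1 with deflation (hypothetical helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                       # scores along that direction
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = (yk @ t) / tt                # regression of y on the scores
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original X space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic data (an assumption): more predictors than observations, strongly collinear.
rng = np.random.default_rng(1)
n, p = 30, 100
Z = rng.normal(size=(n, 3))                          # hidden factors
X = Z @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = 2.0 * Z[:, 0] - Z[:, 1] + 0.1 * rng.normal(size=n)

coef1, b1 = pls1_fit(X, y, n_components=1)
coef3, b3 = pls1_fit(X, y, n_components=3)
r2_1 = r_squared(y, X @ coef1 + b1)
r2_3 = r_squared(y, X @ coef3 + b3)
print("in-sample R^2 with 1 and 3 components:", r2_1, r2_3)
```

Ordinary least squares has no unique solution when there are more predictors than observations, whereas this sketch fits a model through a handful of components. The R^2 values here are in-sample only; on real data the number of components would normally be chosen by cross-validation.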

PCA vs PLS

While both PCA and PLS are used for dimensionality reduction, the choice between the two depends on the specific problem at hand. PCA is a good choice when the goal is to reduce the dimensionality of the independent variables without considering a dependent variable. PLS, on the other hand, takes the dependent variable into account, which can result in better predictive performance when the directions that best predict the dependent variable are not the directions of highest variance in the data.
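A small experiment can illustrate the difference. In the sketch below, the synthetic data (an assumption made for illustration) contain one high-variance feature unrelated to the response and one low-variance feature that drives it. The first PCA direction follows the variance, while the first PLS direction follows the covariance with the response.

```python
import numpy as np

# Synthetic data (an assumption): variance and predictive power sit in different features.
rng = np.random.default_rng(42)
n = 500
big = rng.normal(scale=3.0, size=n)      # high variance, unrelated to y
small = rng.normal(scale=1.0, size=n)    # low variance, drives y
X = np.column_stack([big, small])
y = 3.0 * small + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PCA direction: the unit vector that maximizes the variance of the projection.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v_pca = Vt[0]

# First PLS direction: the unit vector that maximizes the covariance with y.
w_pls = Xc.T @ yc
w_pls = w_pls / np.linalg.norm(w_pls)

corr_pca = abs(np.corrcoef(Xc @ v_pca, yc)[0, 1])
corr_pls = abs(np.corrcoef(Xc @ w_pls, yc)[0, 1])
print("|corr| of first PCA score with y:", corr_pca)
print("|corr| of first PLS score with y:", corr_pls)
```

The first PLS score is always at least as correlated with the response as the first PCA score on the same centered data, because it has the largest covariance with the response while the PCA score has the largest variance. How large the gap is depends on the data.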

Choosing Your Champion 😊

So, PCA or PLS? It depends on your objective:

i. General Dimensionality Reduction. Go with PCA for overall data compression, visualization, or as a preprocessing step for other algorithms. (default to this!)

ii. Prediction with High-Dimensional Data. Choose PLS if your primary goal is to predict a specific target variable, especially when the features are numerous and correlated with one another.

Conclusion

In conclusion, both PCA and PLS can help mitigate the curse of dimensionality. They let us simplify our data while retaining most of the relevant information, making it easier to perform further analysis or build predictive models. The choice between PCA and PLS will depend on the specifics of your data and the problem you are trying to solve.
