Linear Regression from a Machine Learning Perspective
Let's discuss one of the simplest machine learning (ML) algorithms today: regression. Like other ML algorithms, regression can predict both continuous outcomes (e.g., linear regression) and categorical outcomes (e.g., binomial or multinomial regression). Today, we'll focus on linear regression.
Differences Between Traditional Statistics and Machine Learning
Before we start, let's briefly discuss the differences between linear regression from a traditional statistics standpoint versus a machine learning standpoint.
[Comparison table: Traditional Statistics vs. Machine Learning]
Loss Function
The basis of linear regression is the loss function, so to understand how linear regression works, you must first understand how the loss function works. The loss function is a mathematical function that quantifies the distance between observed and predicted values; fitting the model means finding the coefficients that make this distance, and therefore the prediction error, as small as possible. The function can be written as:
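In its standard single-predictor form, this is:

$$\mathrm{SSR}(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$$

where n is the number of observations.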
SSR refers to the sum of squared residuals. The X and Y values are the observed data (i.e., the X and Y columns of the data you collected), and {β0, β1} are the unknown parameters. The goal of the optimization is to find the pair {β0, β1} that minimizes this function (i.e., the distance between observed and predicted values). If you are still unsure about the idea, let's start by looking at a typical linear regression equation below:
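In its usual form:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$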
Here, i refers to sample i. Say you have a dataset with 100 samples or observations, and you randomly guess that your first pair of β0 and β1 should be 2.5 and 0.8, respectively. If sample i has X = 20, the model would predict its y as 2.5 + 0.8(20) = 18.5. However, if its true y score is 20, the error term is 20 - 18.5 = 1.5. Remember, we have 100 observations in the dataset, so we need to aggregate the errors in some way to get the total error. This can be done with different strategies, such as SSR, which I mentioned previously, or the sum of absolute residuals (SAR). SSR is usually preferred because absolute values are mathematically harder to work with (for example, they are not differentiable at zero).
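To make this concrete, here is a minimal Python sketch (the five data points are made up purely for illustration) that computes both SSR and SAR for the guessed coefficients:

```python
import numpy as np

# Made-up toy data: 5 observations (purely for illustration)
x = np.array([20, 15, 32, 8, 25])
y = np.array([20.0, 13.5, 27.0, 9.0, 22.5])

beta0, beta1 = 2.5, 0.8           # the guessed coefficients from the example above
y_pred = beta0 + beta1 * x        # predicted values, e.g., 2.5 + 0.8 * 20 = 18.5
residuals = y - y_pred            # observed minus predicted, e.g., 20 - 18.5 = 1.5

ssr = np.sum(residuals ** 2)      # sum of squared residuals
sar = np.sum(np.abs(residuals))   # sum of absolute residuals
print(f"SSR = {ssr:.2f}, SAR = {sar:.2f}")
```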
Now, imagine we get the SSR for all samples, representing the total errors in the dataset based on our β0 and β1, which are 2.5 and 0.8.
One challenge is that we do not know whether these values of β0 and β1 are the best coefficient estimates. Thus, we have to try out many pairs of β0 and β1 and see which pair best reduces the sum of squared residuals.
If you try different values of β0 and β1 ranging from -10 to 10 and plot the beta values against the SSR, you will get something like the figure below.
The lowest point is called the global minimum: the point where the values of β0 and β1 yield the least error in the model. In other words, this is where the regression coefficients provide the optimal solution. You may hear alternative phrases for this, such as the point where the model converges; because the loss surface of linear regression is convex, this minimum is unique.
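To see this search in action, here is a brute-force sketch in Python (the simulated data, grid range, and step size are arbitrary choices for the demo) that evaluates the SSR over a grid of beta values and keeps the pair with the lowest error:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data (assumed values, just for the demo): true beta0 = 2.5, beta1 = 0.8
x = rng.uniform(0, 10, size=100)
y = 2.5 + 0.8 * x + rng.normal(0, 1, size=100)

# Evaluate the SSR for every (beta0, beta1) pair on a grid from -10 to 10
betas = np.arange(-10, 10.01, 0.1)
best_pair, best_ssr = None, np.inf
for b0 in betas:
    for b1 in betas:
        ssr = np.sum((y - (b0 + b1 * x)) ** 2)
        if ssr < best_ssr:
            best_pair, best_ssr = (b0, b1), ssr

print(best_pair, best_ssr)  # the grid point closest to the global minimum, near (2.5, 0.8)
```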
Behind the scenes, when you run a linear regression model, your machine (e.g., your laptop) uses linear algebra to find the global minimum. The model can be written in matrix form, as shown below:
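$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where X is the design matrix whose first column is all 1s (for the intercept) and whose remaining columns hold the predictor values.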
Keep in mind that in this matrix example, there is only one predictor (with values I set to range from 1 to 10) plus the intercept β0, which appears in the design matrix as a column of 1s for simplicity. In the real world, you will almost always have more than one predictor and a larger, more complex matrix equation. Now that we have the matrix form, we can find the values of the betas by performing some matrix transposition and inversion, as shown in the equation:
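Solving for the coefficients that minimize the SSR gives the familiar normal equation:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$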
Note that in linear regression, the equation above provides a closed-form solution because the loss function is convex and therefore has a single global minimum. For other, non-linear machine learning models (e.g., neural networks), the loss surface may have many local minima, as shown in the picture below.
This makes optimization more complicated. For those algorithms, we use an iterative process such as gradient descent to get as close as possible to the lowest point and thereby obtain good parameter values. These models do not have a closed-form solution because their loss surfaces are non-convex and can have multiple local minima.
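For intuition only, here is a minimal gradient descent sketch applied to the convex loss of simple linear regression (the learning rate and iteration count are assumptions, and the function expects a standardized predictor; in practice you would rely on the closed-form solution instead):

```python
import numpy as np

def gradient_descent_ols(x, y, lr=0.01, n_iter=10_000):
    """Minimize the mean squared error of y = b0 + b1 * x by gradient descent.

    Assumes x is standardized (mean 0, standard deviation 1); otherwise
    the learning rate may need to be much smaller to keep updates stable.
    """
    b0, b1 = 0.0, 0.0                               # arbitrary starting coefficients
    n = len(x)
    for _ in range(n_iter):
        residuals = y - (b0 + b1 * x)               # observed minus predicted
        grad_b0 = -2.0 / n * residuals.sum()        # derivative of MSE w.r.t. b0
        grad_b1 = -2.0 / n * (residuals * x).sum()  # derivative of MSE w.r.t. b1
        b0 -= lr * grad_b0                          # step downhill on the loss surface
        b1 -= lr * grad_b1
    return b0, b1
```

Because the surface here is convex, this converges to the same solution as the closed-form formula; for non-convex models, the same procedure can only promise a nearby local minimum.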
In reality, you don't have to calculate the loss function manually, as there are packages for linear regression in Python such as scikit-learn, statsmodels, and numpy, or caret if you are an R user.
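For example, a minimal scikit-learn fit (with simulated data, purely for illustration) gives essentially the same estimates as the normal-equation solution above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(100, 1))               # one predictor, 100 observations
y = 2.5 + 0.8 * X[:, 0] + rng.normal(0, 2, size=100)

# scikit-learn handles the minimization for you
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)                # close to 2.5 and 0.8

# The same estimates via the normal equation with numpy
X_design = np.column_stack([np.ones(len(X)), X])    # prepend the intercept column of 1s
betas = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y
print(betas)
```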
Multicollinearity and Feature Selection
Now that you understand the loss function, let's discuss ways to deal with multicollinearity from an ML perspective. Spoiler alert: the idea is similar to traditional statistics, with some additional tools we can play with!
Tolerance
VIF
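If you want to compute these diagnostics in code, here is a minimal statsmodels sketch (the data frame and column names are made up; a VIF above roughly 5-10 is a common rule of thumb for flagging problematic multicollinearity):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors (column names are made up for the demo)
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "age": rng.normal(40, 10, size=200),
    "flights_per_week": rng.poisson(2, size=200).astype(float),
})
X["years_worked"] = X["age"] - 22 + rng.normal(0, 2, size=200)  # strongly correlated with age

X_const = add_constant(X)  # VIF is computed on the design matrix including the intercept
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vifs)        # age and years_worked should show inflated values
print(1.0 / vifs)  # tolerance is simply the reciprocal of VIF
```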
Regularization
Regularization is the process of adding penalty terms (e.g., lambda, λ, or alpha, α) to the loss function during model fitting to prevent overfitting, including overfitting driven by multicollinearity. This is especially important when there are many predictors in the model but only some of them are meaningful. There are three major types of regularization: ridge, lasso, and elastic net (a combination of ridge and lasso).
1. Ridge Regression
For ridge regression, we add a penalty, controlled by lambda (λ), on the size of every coefficient in the model. Imagine we have p coefficients (i.e., p predictors). The penalty term can be written as the following equation (keep in mind that we do not penalize the intercept, only the predictors!):
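$$\lambda \sum_{j=1}^{p} \beta_j^2$$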
When we add the penalty term to the loss function, it becomes:
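$$\mathrm{Loss}_{\text{ridge}} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$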
Technically, λ can take any non-negative value, from 0 to ∞. If we set λ = 0, the loss function of ridge regression reduces to the loss function of a typical regression.
If you still don't see how adding a penalty term alters the regression solution, imagine fitting a model to a dataset where the β1 that minimizes the SSR is quite large. The loss surface might look like the plot below:
However, machine learning aims to produce a model that predicts future datasets well. Simply put, we want a model that generalizes. Imagine you have to explain what a cup is to someone who has never seen one before. You would want to show them picture A below, not picture B, because picture A looks more like a typical cup.
Now apply the cup metaphor to the beta plot from earlier. Suppose we penalize β1 by applying a λ term to the coefficient. In that case, the loss surface in 3D space will look more general and less specific to your current data. It would look something like the plot below on the right.
Notice that the scale of β1 is smaller after being penalized, ranging from -3 to 3, compared with -10 to 10 before. This is because the original β1 coefficient was quite large, so it shrinks the most when the penalized loss function is minimized.
Keep in mind that a real dataset will usually have more than one predictor, so the 3D visual above is only a simplified demonstration. In the real world, your data could have hundreds of dimensions, far too many for human eyes to visualize!
Earlier, I mentioned standardizing your variables. This is very important for ridge regression: if you don't standardize, the penalty is applied unevenly across predictors simply because they are measured on different scales (e.g., age versus the number of flights per week). A coefficient can then be shrunk more or less not because of how much variance it contributes, but merely because of the units its variable happens to be measured in, which can lead to an inaccurately fitted model.
Finally, just like other hyperparameters of ML algorithms, λ can be fine-tuned. You can use cross-validation to test different values of λ, with increments as small as 0.01 if you have enough computational power and time. Note that ridge regression will shrink coefficients close to zero but will never set them exactly to zero.
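Putting the last two points together, here is a minimal scikit-learn sketch (with simulated data; note that scikit-learn calls the ridge penalty `alpha` rather than λ) that standardizes the predictors and picks the penalty by cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                                  # hypothetical predictors
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(0, 1, size=200)

# Standardize first, then let cross-validation choose lambda (called alpha in scikit-learn)
alphas = np.arange(0.01, 10.0, 0.01)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge.fit(X, y)

print(ridge.named_steps["ridgecv"].alpha_)   # selected penalty strength
print(ridge.named_steps["ridgecv"].coef_)    # shrunken coefficients, none exactly zero
```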
2. Lasso Regression
The concept of lasso regression is quite similar to ridge regression. The difference is that lasso penalizes the absolute values of the coefficients instead of their squared values. As a result, some coefficients can be forced exactly to zero, which makes lasso useful for feature selection: the features whose coefficients survive are the ones the model treats as meaningful. See the equation below:
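In the same notation as the ridge loss above:

$$\mathrm{Loss}_{\text{lasso}} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$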
Now, you may wonder when to use Ridge or Lasso regression, as these functions work similarly. I suggest using Ridge regression if you don’t want to completely exclude any features from the model (for example, if you do not have many predictors and all predictors are conceptually meaningful). However, if you have a large dataset, such as World Bank data with tens of thousands of predictors, using Lasso regression to force certain features to zero can be beneficial to ensure the model has generalizability and fewer variance issues.
3. Elastic Net
Elastic net combines the ridge and lasso penalties into one model, as you can see in the equation below:
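$$\mathrm{Loss}_{\text{elastic net}} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda_1 \sum_{j=1}^{p}\lvert\beta_j\rvert + \lambda_2 \sum_{j=1}^{p}\beta_j^2$$

where λ1 controls the lasso (L1) penalty and λ2 controls the ridge (L2) penalty.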
Unlike ridge and lasso regression, elastic net has two penalty parameters to fine-tune: λ1 (the lasso, or L1, penalty) and λ2 (the ridge, or L2, penalty). If you set λ2 = 0, you get a lasso model, whereas setting λ1 = 0 gives you a ridge model; nonzero values of both blend the two penalties. In practice, many implementations (e.g., scikit-learn's ElasticNet) re-parameterize this as an overall penalty strength plus a mixing ratio between 0 and 1, where 1 corresponds to pure lasso and 0 to pure ridge. Either way, we can consider combinations of the two hyperparameters and find the optimal one using cross-validation.
Elastic net is useful when you want to deal with variance issues by shrinking large coefficients (the ridge part) while still performing feature selection by forcing some coefficients to zero (the lasso part), which is handy when you have a large dataset with many predictors. Regularization is also used in several other ML algorithms (e.g., neural networks and support vector machines).
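Here is a minimal scikit-learn sketch (with simulated data; ElasticNetCV uses the penalty-strength-plus-mixing-ratio parameterization mentioned above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                         # hypothetical predictors
true_betas = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ true_betas + rng.normal(0, 1, size=200)

# l1_ratio mixes the lasso (1.0) and ridge (toward 0.0) penalties; alphas set the overall strength
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], alphas=np.logspace(-3, 1, 50), cv=5),
)
enet.fit(X, y)

fitted = enet.named_steps["elasticnetcv"]
print(fitted.l1_ratio_, fitted.alpha_)   # best mixing ratio and penalty strength
print(fitted.coef_)                      # several coefficients shrunk exactly to zero
```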
To continue reading more with a real-world example and full code access, feel free to visit my GitHub page: https://github.com/KayChansiri/LinearRegressionML/blob/main/README.md