Linear Regression from a Machine Learning Perspective


Let's discuss one of the simplest machine learning (ML) algorithms: regression. Like other ML algorithms, regression can predict both continuous outcomes (e.g., linear regression) and categorical outcomes (e.g., binomial or multinomial regression). Today, we'll focus on linear regression.

Differences Between Traditional Statistics and Machine Learning

Before we start, let's briefly discuss the differences between linear regression from a traditional statistics standpoint versus a machine learning standpoint.

Traditional Statistics

  • Traditional statistics uses a single dataset for modeling, without dividing it into training, validation, and testing sets.
  • The traditional approach also involves standard data preparation, such as encoding variables, checking for and imputing missing data, and testing the assumptions of linear regression (e.g., linearity, normality, independence of errors, homoscedasticity). Linear regression from a traditional statistics perspective also involves checking for multicollinearity using methods such as the VIF (Variance Inflation Factor) or correlation tests.
  • For the VIF, the higher the value for a specific feature, the more likely that feature is highly correlated with the other features in the dataset. Some sources use a VIF above 10 as the cutoff if they want to be more liberal; others use 4 to be more conservative. (A minimal Python sketch of this check appears after this list.)
  • Regarding Pearson correlations, if any pair of variables is correlated above 0.9 (in absolute value), the variables are considered highly correlated. If multicollinearity issues exist, remedies such as dimensionality reduction (e.g., exploratory factor analysis, confirmatory factor analysis), centering (especially when interaction terms are included in the model), or excluding conceptually less important predictors may be applied.
  • As linear regression relies on ordinary least squares (for small datasets) or gradient descent (for large datasets) to identify the best regression coefficients, putting predictors on the same scale is important: it helps gradient descent converge and makes the fitted coefficients comparable. If your features are on different scales, interpreting the model fit and the relative importance of predictors becomes challenging.
  • For example, imagine you are predicting customer satisfaction with an airline service on a scale of 1 to 5 (your y, or target outcome) using customer age (ranging from 5 to 100) (X1) and the number of flights a customer takes per week (ranging from 0 to 7) (X2). A one-unit increase in X1 is not comparable to a one-unit increase in X2: X1 can go up to 100, whereas X2 can only go up to 7. To determine which feature has more impact on the outcome, you have to standardize them so they are on the same scale.
  • While standardization is important for linear regression, it is less crucial for other ML algorithms, such as random forests or decision trees. For those algorithms, the scale of the features matters little, since they split on one feature at a time and the resulting decision boundaries do not depend on fitted coefficients.

  • Finally, the traditional approach to linear regression emphasizes inference: determining whether a model explains the observed data well, rather than how well it predicts future data.
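
If it helps to see these checks in code, below is a minimal Python sketch; the small airline-style dataset and column names are made up purely for illustration. It standardizes two features with scikit-learn and computes their VIFs with statsmodels.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical airline-style features (made up for illustration)
df = pd.DataFrame({
    "age": [23, 45, 31, 60, 18, 52, 37, 29],
    "flights_per_week": [1, 3, 0, 5, 2, 4, 2, 1],
})

# Standardize so both features are on the same scale
X = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# VIF for each feature; values near 1 indicate little collinearity
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X_const.values, i), 2))
```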

Machine Learning

  • Unlike traditional statistical methods, linear regression from an ML perspective separates data into training, validation, and testing sets to ensure the model works well with unseen data. Splitting the data into only training and testing sets, without a separate validation set, may be sufficient for projects with limited computational power or budget.
  • Similar to linear regression from a traditional standpoint, the ML approach should ensure that linear assumptions are met, missing data are imputed or excluded, all continuous variables are standardized, and no multicollinearity issues exist. Going a step beyond the traditional statistical approach, linear regression from an ML perspective uses regularization techniques (e.g., ridge, lasso, elastic net) and fine-tunes their hyperparameters (e.g., lambda, alpha) to deal with multicollinearity or variance caused by uninformative variables. The process also involves K-fold cross-validation to tune the model's hyperparameters during training, before testing and deployment (see the sketch after this list). You will understand these concepts better later in the post.
  • Model Emphasis: Unlike traditional statistics that explain existing data, ML focuses on building robust models that can generalize well to new data.
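
As a concrete illustration of the workflow described above, here is a minimal scikit-learn sketch on synthetic data; the held-out test set plays the role of unseen data, and 5-fold cross-validation on the training set stands in for a separate validation step.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for a real project
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# Hold out 20% as a test set for the final check on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set acts as the validation step
model = LinearRegression()
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print("Mean 5-fold CV R^2:", cv_r2.mean())

# Fit on the full training set and evaluate once on the held-out test set
model.fit(X_train, y_train)
print("Test R^2:", model.score(X_test, y_test))
```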


Figure 1. The Difference in Data Management between Statistics and Machine Learning


Loss Function

The basis of linear regression is the loss function. To understand how linear regression works, you must first understand how the loss function works. The loss function is a mathematical function that quantifies the distance between observed and predicted values; fitting the model means minimizing this distance to make the model as accurate as possible. The function can be represented by the following equation:


SSR(β0, β1) = Σᵢ (Yᵢ − (β0 + β1Xᵢ))², summing over all n observations


SSR refers to the sum of squared residuals. The X and Y values are the observed data (i.e., the X and Y columns of the data you collected), and {β0, β1} are the unknown parameters. The goal of the optimization is to find the pair {β0, β1} that minimizes this function (i.e., the distance between observed and predicted values). If you are still confused, let's start by looking at a typical linear regression equation below:


Yᵢ = β0 + β1Xᵢ + εᵢ

Here, i refers to sample i. Say you have a dataset with 100 samples, or observations, and you randomly pick 2.5 and 0.8 as your first pair of β0 and β1. If sample i has X = 20, the model predicts its y as 2.5 + 0.8(20) = 18.5. However, if its true y score is 20, the error term is 20 - 18.5 = 1.5. Remember, we have 100 observations in the dataset, so we need to aggregate the errors in some way to get the total amount of error. This can be done with different strategies, such as the SSR I mentioned previously, or the sum of absolute residuals (SAR). SSR is usually preferred because the absolute values in SAR are mathematically harder to work with (for example, they are not differentiable at zero).

Now, imagine we get the SSR for all samples, representing the total errors in the dataset based on our β0 and β1, which are 2.5 and 0.8.

One challenge is that we do not know whether this β0 and β1 are the best coefficient estimates. Thus, we have to try out many pairs of β0 and β1 and see which pair best reduces the sum of squared errors.

If you try different values of β0 and β1 ranging from -10 to 10 and plot the beta values against the SSR, you will get something like the figure below.

Figure 2. The Relationship between SSR, β1, and β0
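
If you want to reproduce the idea behind Figure 2 yourself, here is a rough NumPy sketch. The data are made up to roughly follow y = 2.5 + 0.8x; the code evaluates the SSR over a grid of candidate β0 and β1 values from -10 to 10 and reports the pair with the smallest SSR.

```python
import numpy as np

# Made-up data roughly following y = 2.5 + 0.8x
rng = np.random.default_rng(4)
x = rng.uniform(0, 25, size=100)
y = 2.5 + 0.8 * x + rng.normal(scale=2.0, size=100)

# Grid of candidate (beta0, beta1) pairs from -10 to 10
b0_grid = np.linspace(-10, 10, 201)
b1_grid = np.linspace(-10, 10, 201)
B0, B1 = np.meshgrid(b0_grid, b1_grid)

# Residuals for every (beta0, beta1) pair at once, shape (201, 201, 100)
residuals = y - (B0[..., None] + B1[..., None] * x)
ssr = (residuals ** 2).sum(axis=-1)

# The grid point with the smallest SSR approximates the global minimum
i, j = np.unravel_index(ssr.argmin(), ssr.shape)
print("approximate best beta0, beta1:", B0[i, j], B1[i, j])
```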

The lowest point is called the global minimum, which is the point where the values of β0 and β1 yield the least error in the model. In other words, this is where the regression coefficients provide the optimal solution. You may hear alternative phrases for this, such as the point where the model converges; because the loss surface is convex, this minimum is unique.

Behind the scenes, when you run a linear regression model, the machine (e.g., your laptop) uses linear algebra to find the global minimum, which can be written in a matrix operation form, as shown below.


In matrix form, the model is Y = Xβ + ε, where Y is the n × 1 vector of outcomes, X is the n × 2 design matrix whose first column is all 1s (for the intercept β0) and whose second column holds the predictor values, β = (β0, β1)ᵀ, and ε is the vector of errors.

Keep in mind that in this matrix operation example, we have only one predictor, whose values I set to range from 1 to 10, plus an intercept column of 1s (for β0) for simplicity. In the real world, you will almost certainly have more than one predictor and a more complex matrix equation. Now that we have the matrix form, we can find the values of the betas through matrix transposition and inversion, as shown in the equation:

β̂ = (XᵀX)⁻¹XᵀY
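
Here is a small NumPy sketch of that closed-form (normal equation) solution, using a toy design matrix like the one described above: a column of 1s for the intercept and a single predictor whose values run from 1 to 10. The generating coefficients and noise are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: one predictor with values 1..10 plus an intercept column of 1s
x = np.arange(1, 11, dtype=float)
X = np.column_stack([np.ones_like(x), x])              # design matrix [1, x]
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)    # hypothetical outcome

# Normal equation: beta_hat = (X'X)^-1 X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("intercept, slope:", beta_hat)

# In practice, np.linalg.lstsq is more numerically stable than an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq check:", beta_lstsq)
```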

Note that in linear regression, the equation above provides a closed-form solution as we have the global minimum due to the convex nature of the loss function. For other non-linear machine learning models (e.g., neural networks), you may have many local minima, as shown in the picture below.


[Figure: a non-convex loss surface with multiple local minima]



This makes optimization more complicated. For those algorithms, we use an iterative process like gradient descent to get as close as possible to the lowest point and obtain good coefficient values. These models do not have a closed-form solution because the loss surface is non-convex and can have multiple local minima.

In reality, you don't have to compute the loss function by hand, as there are packages for linear regression in Python such as scikit-learn, statsmodels, and NumPy, or caret if you are an R user.
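
For instance, here is a minimal sketch on made-up data that fits the same simple model with scikit-learn and with statsmodels; scikit-learn exposes a prediction-oriented interface, while statsmodels prints inference-style output such as standard errors and p-values.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Made-up data roughly following y = 2.5 + 0.8x
rng = np.random.default_rng(3)
x = rng.uniform(0, 25, size=100)
y = 2.5 + 0.8 * x + rng.normal(scale=1.0, size=100)

# scikit-learn: coefficients only, geared toward prediction
sk_model = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", sk_model.intercept_, sk_model.coef_)

# statsmodels: full summary with standard errors, t-values, and p-values
sm_model = sm.OLS(y, sm.add_constant(x)).fit()
print(sm_model.summary())
```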

Multicollinearity and Feature Selection

Now that you understand the loss function, let's discuss ways to deal with multicollinearity from an ML perspective. Spoiler alert: the idea is similar to traditional statistics, with some additional tools we can play with!

Tolerance

  • Tolerance refers to the proportion of variance in one predictor that cannot be explained by the rest of the predictors in the model.
  • When a variable’s tolerance is 0 or close to zero, you will get a singularity issue warning when you try to fit the model. The warning indicates that two or more variables explain the same thing in the outcome.
  • When we encounter the singularity warning, ordinary least squares cannot find a unique global minimum. Variables that explain the same thing lead to infinitely many coefficient combinations with the same minimum error, so the model cannot converge on a single solution.

VIF

  • VIF is the inverse of tolerance (VIF = 1/tolerance) and explains how much the variance of a regression coefficient is inflated due to collinearity with other features.
  • For instance, a VIF value of 10.45 for feature X indicates that the variance of the regression coefficient for the feature is 10.45 times higher than it would be if X were not correlated with other features in the model.
  • The standard error of a regression coefficient that you often see in regression output is the square root of its variance. Thus, a VIF of 10.45 means the standard error is √10.45, or about 3.23 times larger than it would be if feature X were not collinear with other features in the model. It is important not to include features with high VIFs in the model, as larger standard errors mean the regression coefficients are estimated less precisely. This could result in a final model with high variance and weaker generalization to unseen data.
  • You may then wonder what to do if a feature in your dataset suffers from a high VIF. Just like traditional linear regression, there are several techniques you can use for machine learning linear regression, such as factor analysis (e.g., principal axis factoring, exploratory factor analysis, confirmatory factor analysis), variable selection (e.g., forward selection, backward elimination, or stepwise regression, similar to traditional statistical methods), or using previous literature in your area of interest to inform feature selection conceptually.
  • Another technique for dealing with feature selection that appears in both traditional statistics and ML but is more discussed in the ML world is regularization.

Regularization

Regularization is the process of adding a penalty term (controlled by hyperparameters such as lambda and alpha) to the loss function to prevent overfitting, including overfitting driven by multicollinearity. This is especially important when there are many predictors in the model but only some are meaningful. There are three major types of regularization: ridge, lasso, and elastic net (a combination of ridge and lasso).

1. Ridge Regression

For ridge regression, we add a penalty equal to λ times the sum of the squared coefficients. Imagine we have p coefficients (i.e., p predictors). The penalty term can be written as the following equation (keep in mind that we do not penalize the intercept, only the predictors!):

λ(β1² + β2² + … + βp²)

When we fit the penalty term into a loss function, the function becomes something like this:



Loss = Σᵢ (Yᵢ − Ŷᵢ)² + λ(β1² + β2² + … + βp²), where Ŷᵢ = β0 + β1Xᵢ1 + … + βpXᵢp


Technically, λ can take any non-negative value from 0 to ∞. If we set λ = 0, the ridge loss function reduces to the loss function of a typical regression.

If you still don't see how adding a penalty term alters the regression solution, imagine fitting a model to a dataset where the β1 that best reduces the SSE is quite large. The loss surface might look like the plot below:


Figure 3. The Relationship between Betas and SSE without Regularization.


However, machine learning aims to produce a model that predicts future datasets well. Simply put, we want a model whose shape is generalizable. Imagine you have to explain what a cup is to someone who has never seen one before. You would want to show them picture A below, not picture B, as picture A is closer to what most cups look like.


Figure 4. Picture A (on the left) and Picture B (on the right)

Now, apply the cup metaphor to the beta plot above. Suppose we penalize β1 by applying a λ term to its coefficient. The loss surface in 3D space then becomes more general and less specific to your current data, looking something like the plot below on the right.


Figure 5. The Left Plot is Regression, and the Right Plot is Regression with Regularization.


Notice that the scale of β1 is smaller after being penalized, ranging from -3 to 3, compared to -10 to 10 before. Because the original β1 coefficient was quite large, it shrinks the most as the penalized loss function is minimized.

Just so you know, your dataset will have more than one predictor. So, the 3D visual I inserted above is just for simplicity in the demonstration. In the real world, your data could have hundreds of dimensions that would be too challenging for human eyes to comprehend!

At the beginning, I mentioned standardizing your variables. This is very important for ridge regression. If you don't standardize, the size of each coefficient, and therefore how strongly ridge penalizes it, depends on the arbitrary scale of its variable (e.g., age versus the number of flights per week). Features then get penalized based on their units rather than on how much variance they actually contribute, which can distort the fit.

Finally, just like other hyperparameters of ML algorithms, λ can be fine-tuned. You can use cross-validation to test different values of λ, with increments as small as 0.01 if you have enough computational power and time. Note that ridge regression shrinks coefficients close to zero but never exactly to zero.
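
As a rough illustration of tuning λ by cross-validation, here is a minimal scikit-learn sketch on made-up data (scikit-learn calls the penalty strength alpha). The pipeline standardizes the features first, for the reasons discussed above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: five features, some with large true coefficients, one irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.5, 0.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=200)

# Search a grid of lambda values in 0.01 increments via 5-fold cross-validation
alphas = np.arange(0.01, 10.0, 0.01)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

ridge = model.named_steps["ridgecv"]
print("best lambda:", ridge.alpha_)
print("coefficients:", ridge.coef_)  # shrunk toward, but never exactly, zero
```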

2. Lasso Regression

The concept of lasso regression is quite similar to ridge regression. The difference is that lasso penalizes the absolute values of the coefficients instead of their squared values. As a result, some coefficients can be shrunk exactly to zero after applying the penalty. This makes lasso useful for feature selection, because the surviving nonzero coefficients indicate which features are meaningful. See the equation below:

Loss = Σᵢ (Yᵢ − Ŷᵢ)² + λ(|β1| + |β2| + … + |βp|)


Now, you may wonder when to use Ridge or Lasso regression, as these functions work similarly. I suggest using Ridge regression if you don’t want to completely exclude any features from the model (for example, if you do not have many predictors and all predictors are conceptually meaningful). However, if you have a large dataset, such as World Bank data with tens of thousands of predictors, using Lasso regression to force certain features to zero can be beneficial to ensure the model has generalizability and fewer variance issues.
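
Here is a rough sketch of that feature-selection behavior on made-up data where only two of ten features actually matter; LassoCV chooses λ by cross-validation, and the coefficients of the irrelevant features come out as exactly zero (or very close to it).

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 10 features, only features 0 and 3 have nonzero true effects
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
true_coefs = np.zeros(10)
true_coefs[[0, 3]] = [4.0, -2.5]
y = X @ true_coefs + rng.normal(scale=1.0, size=300)

# LassoCV picks lambda by 5-fold cross-validation
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

coefs = model.named_steps["lassocv"].coef_
print("selected features:", np.flatnonzero(coefs))
print("coefficients:", np.round(coefs, 2))
```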

3. Elastic Net

Elastic Net combines Ridge and Lasso regression into one by mixing them, as you can see in the equation below:

Loss = Σᵢ (Yᵢ − Ŷᵢ)² + λ1(|β1| + … + |βp|) + λ2(β1² + … + βp²)

Unlike ridge and lasso regression, elastic net has two penalty parameters to fine-tune: λ1 for the lasso (absolute-value) term and λ2 for the ridge (squared) term. If you set λ2 = 0, you get a lasso model, whereas if you set λ1 = 0, you get a ridge model. Nonzero values of both balance the contributions of the two penalties (many implementations, such as scikit-learn, reparameterize this as a single overall penalty strength plus a mixing ratio between 0 and 1). We can consider combinations of these hyperparameters and find the optimal one using cross-validation.

Elastic net is useful when you want to shrink large coefficients to reduce variance (the ridge part) and still perform feature selection by forcing some coefficients to zero (the lasso part), which helps when you have a large dataset with many predictors. Regularization also appears in other ML algorithms (e.g., neural networks and support vector machines).
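
Finally, here is a minimal elastic net sketch on made-up data. scikit-learn parameterizes the ridge/lasso balance as a single mixing ratio (l1_ratio, where 1 is pure lasso and 0 is pure ridge) plus an overall penalty strength, and ElasticNetCV searches both by cross-validation.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 20 features, only the first two matter
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=1.0, size=300)

# Cross-validate both the mixing ratio and the overall penalty strength
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0], cv=5, random_state=0),
)
model.fit(X, y)

enet = model.named_steps["elasticnetcv"]
print("best l1_ratio:", enet.l1_ratio_, "| best lambda:", enet.alpha_)
print("nonzero coefficients:", np.flatnonzero(enet.coef_))
```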

To continue reading more with a real-world example and full code access, feel free to visit my GitHub page: https://github.com/KayChansiri/LinearRegressionML/blob/main/README.md
