The Art of Gradient Boosting Machines: A Practical Approach

Welcome to the third article in my series on ensemble techniques for machine learning. In my previous post, I discussed Random Forest, an ensemble technique that combines decision trees in parallel. In this post, I will introduce you to boosting, another powerful ensemble technique that works sequentially.

There are several boosting methods available, but the most popular by far are AdaBoost (adaptive boosting) and gradient boosting. Today, we’ll focus on gradient boosting, also known as gradient boosted trees (GBT).

Whenever I think of gradient boosting, it reminds me of sculpting a layered structure, like carefully crafting a sandcastle at the beach. The foundation you build influences each layer that follows, and every adjustment contributes to the overall shape of the final castle. It’s an artistic and iterative approach to machine learning, where every decision impacts the next. Let’s dive deeper into the art of gradient boosting together.

What is Boosting?

Understanding the Weak Learner:

Before diving into boosting, let me first introduce the concept of a weak learner. A weak learner, often represented by a single decision tree, can be a classifier or regressor whose predictive power is only slightly better than random guessing. The depth of a weak learner is usually very shallow, ranging from one to five levels. Despite its simplicity and limited predictive performance, a weak learner becomes valuable when combined with other weak learners to create a more powerful model. It's also important to note that a weak learner is considered adequate as long as its predictions are not completely uncorrelated (orthogonal) with the observed values in a vector space. If they are, the weak learner is no longer useful.

The Concept of Boosting:

Boosting is an ensemble technique where each classifier or regressor tries to correct the errors made by the previous one. Unlike bagging (used in Random Forest), which builds trees in parallel, boosting builds trees sequentially, creating more dependency between the trees in the model.

How Gradient Boosting Works:

In GBT, the process starts by using a simple model (a weak learner) to make predictions. The residuals (errors) from this model—calculated as the difference between the actual values and predicted values—are then used as the target for the next model. The next model is trained specifically to predict these residuals, and the new predictions are combined with the previous model’s predictions to update the overall prediction. This process continues iteratively, with each new model trying to reduce the residuals further until the overall residuals are minimized. Keep in mind that the concept of adjusted residuals is specific to GBT. AdaBoost, another boosting algorithm, relies on weight adjustment to handle misclassified samples, which I will cover in my next post.

Final Prediction in Boosting:

The final prediction depends on what type of boosting we are talking about. For classification GBT, the final prediction is the sum of the log odds from the initial prediction and the weighted sum of the log-odds contributions from all trees. This cumulative sum is then converted to a probability using a sigmoid function to get the final prediction. I know that concept sounds confusing for now, but bear with me. I promise you will understand it better after you finish reading the post. For regression tasks, GBT calculates the final prediction as the sum of the initial prediction (e.g., the mean of the outcome) and the weighted sum of the residual corrections from all the trees. In this case, the weighted sum simply refers to the sum of residual corrections across the trees multiplied by the learning rate, which controls how much each tree's residual prediction contributes to the final prediction. Learning rates prevent overfitting by scaling down the updates. For AdaBoost, things work a bit differently. I will not focus on it now though.

We can write a one-line equation representing GBT boosting as below:

F(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)

F(x) is the final boosted model.

F0(x) is the initial model (e.g., the mean for regression, the log of the odds for classification).

hm(x) is the weak learner at iteration m.

η is the learning rate, which is a parameter that can be fine-tuned.

M is the total number of iterations or trees.
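To make the equation concrete, here is a minimal sketch (not the article's code) of how a fitted boosted ensemble produces predictions, assuming `trees` is a list of already-fitted weak learners with a scikit-learn-style `.predict()` method:

```python
import numpy as np

def boosted_predict(X, initial_prediction, trees, learning_rate):
    """Combine F0(x) with the scaled contribution of each weak learner h_m(x)."""
    # Start every sample at the same initial prediction F0(x).
    prediction = np.full(X.shape[0], initial_prediction, dtype=float)
    # Add eta * h_m(x) for each weak learner m = 1..M.
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction
```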

Data Preparation Assumptions for Boosting Techniques

  1. Handling of Scale: Similar to bagging in Random Forest, boosting algorithms do not require standardization or normalization of the input features, as the algorithm is based on decision trees, which are invariant to the scale of the features.
  2. Dimensionality: Unlike certain algorithms that may suffer from the curse of dimensionality, such as KNN, boosting can handle high-dimensional data relatively well. However, keep in mind that if the number of features is much larger than the number of samples, the model might suffer from overfitting issues, just like many ML algorithms. Dimensionality reduction techniques (like PCA or EFA) or feature selection based on your project or research objectives can help in such cases. Although GBT works well for high-dimensional data, the algorithm does not work well for high-dimensional sparse data, a limitation it shares with other tree-based models.
  3. Data Quality: Boosting is sensitive to noisy data and outliers because it tries to correct errors from previous learners. If the data is noisy, boosting can overfit to the noise. Some preprocessing, like removing outliers or using regularization techniques, can help mitigate this issue.
  4. Handling Missing Values: Many boosting algorithms like XGBoost, which I will discuss in my next post, can handle missing values internally. However, it is still a good practice to handle missing values before applying the model to improve model performance. Imputation techniques such as expectation maximization or multiple imputation are helpful when your data are not missing completely at random. When data are missing at random and the missing percentage is low, a simpler technique such as single imputation (mean imputation for continuous predictors or mode imputation for categorical predictors) is often sufficient. A minimal preprocessing-plus-GBT pipeline is sketched after this list.
  5. Data Size: Boosting techniques can be computationally expensive, especially with large datasets, because of their sequential nature. Certain boosting libraries, such as XGBoost or LightGBM, provide optimizations for faster computations.
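To tie these preparation points together, here is a minimal scikit-learn sketch of the pipeline hinted at in point 4: mean imputation for missing values, no scaling step, and a gradient boosted regressor. The hyperparameter values are placeholders for illustration, not recommendations.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor

# No scaling step: tree-based boosting is invariant to feature scale.
# Mean imputation handles missing values before the booster sees the data.
gbt_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("gbt", GradientBoostingRegressor(
        n_estimators=300,     # more trees, smaller steps
        learning_rate=0.1,
        max_depth=3,          # shallow weak learners
        subsample=0.8,        # stochastic gradient boosting
        random_state=42,
    )),
])

# Usage (X_train, y_train are a hypothetical feature matrix and continuous outcome):
# gbt_pipeline.fit(X_train, y_train)
# predictions = gbt_pipeline.predict(X_test)
```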

Figure 1. The Differences and Workflow of Boosting versus Bagging Algorithms


Types of GBT

GBT can be used for both classification and regression problems, unlike certain boosting algorithms, such as AdaBoost, that focus only on classification tasks. For a GBT regression problem, the goal is to minimize a loss function, which measures the difference between the predicted values and the actual values. Common loss functions for regression include Mean Squared Error (MSE) and Mean Absolute Error (MAE). For classification problems, the goal is to minimize the log loss or cross-entropy loss. Minimizing log loss is equivalent to maximizing the log-likelihood, which you might have heard of if you have worked with logistic regression before. Both minimizing log loss and maximizing log likelihood are mathematically related and represent the same objective.

Let's spend a moment here to take a closer look at the concept of log loss. Basically, log loss measures how well a classification model's predicted probabilities match the actual class labels. It is a measure of error or "loss" that quantifies the distance between the true class labels and the predicted probabilities. The log loss formula for binary classification is shown below. A lower log loss means better probability estimates from the model.



\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]

where y_i is the actual label (0 or 1) and p_i is the predicted probability for sample i.

Now, for log-likelihood, the concept measures how likely it is that the observed data (actual labels) would occur given the probabilities predicted by the model. The higher the likelihood, the better the model fits the data. In classification tasks, we aim to maximize the log likelihood because a higher log likelihood indicates that the model is producing predictions that align well with the actual data. The equation of the log likelihood can be represented as below:


\text{Log Likelihood} = \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]

Notice that the log loss is just the negative of the log likelihood, averaged across samples. That's why I mentioned previously that the concepts of log loss and log likelihood are mathematically related.
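A quick numeric check, using made-up labels and predicted probabilities, shows this relationship directly:

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for five samples.
y_true = np.array([1, 0, 0, 1, 0])
p_pred = np.array([0.8, 0.3, 0.2, 0.6, 0.4])

log_likelihood = np.sum(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
log_loss = -log_likelihood / len(y_true)  # negate and average to get log loss

print(log_likelihood, log_loss)
```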

On a side note, don't get confused between the concepts of log loss and log of the odds. Those are totally different things. The log loss measures the distance between the true labels and the predicted probabilities. The log of the odds (AKA logit) is defined as the logarithm of the ratio of the probability of the event occurring to the probability of it not occurring.

Now that you understand the types of GBT and the relevant concepts, let’s start with GBT for regression.

1. GBT for Regression

The core concept behind GBT for regression is gradient descent. If you're unfamiliar with gradient descent, think of it this way: A gradient is a vector that points in the direction of the steepest increase in the loss function (the function we are trying to minimize). The magnitude of the gradient tells us how steep the slope is. In optimization, which is fundamental to machine learning algorithms, we aim to move in the opposite direction of the gradient—the steepest descent—because our goal is to minimize the loss. This is why we talk about gradient descent, focusing on the "negative gradient."

To understand how GBT for regression works, let’s start with the most basic concept: the initial prediction. Before we build any decision trees, we begin with a simple, standalone prediction, much like the intercept in a linear regression model. This initial prediction, or what we call the initial leaf, is simply the average of the observed outcome across all samples in the dataset.

For example, if you're building a model to predict customer satisfaction, and the average satisfaction score across all customers is 3.4, this initial leaf is set to 3.4. This value stands alone and is not yet associated with any tree—it's just the model's first guess, much like an intercept in a regression model.

Now, imagine your data looks like this:


Table 1. Customer Satisfaction Data.


For each data point, if we subtract the predicted satisfaction score (i.e., 3.4) from the actual satisfaction score, we get the residual. These residuals in the Residual column represent the errors of the current model across customers. Now that we have the initial leaf and the residuals for all customers, let's build the first tree.

Unlike trees in Random Forest, the trees we build for GBT predict the residuals, NOT the observed data, which in this case is the actual customer satisfaction scores. However, similar to a typical decision tree, at each node in this first tree of the GBT, the algorithm evaluates splits using predictors in the dataset, which could be demographic variables or any variables important for your project, to determine which split best reduces the loss function of the residuals. In this context, where the outcome is continuous, the loss function is measured by MSE.

For example, suppose we're considering a split on 'Income.' This first tree evaluates different split points (e.g., "Income < $50,000" vs. "Income ≥ $50,000"). For each possible split, it calculates the MSE of the residuals for the data points falling into the left and right child nodes after the split. The split value that results in the lowest total MSE for the residuals is chosen.

Still confused? Suppose the algorithm is considering a split on the Income variable, with a threshold of $50,000. This means we create two groups of samples: 1) Group 1 representing customers with Income < $50,000, and 2) Group 2 representing customers with Income ≥ $50,000. For each group, the algorithm computes the mean of the residuals. Again, I emphasize the mean of the residuals, NOT the actual customer satisfaction scores themselves. For example, if the residuals for Group 1 (Income < $50,000) are [0.6, −0.4, −0.4], the mean residual for this group would be −0.07. Similarly, the algorithm computes the mean residual for Group 2 in the same way.



Figure 2. The First Weak Learner in the GBT of the Customer Satisfaction Data.

For each group, the algorithm calculates the Mean Squared Error (MSE) between the actual residuals and the predicted residuals, using the equation below as an example for Group 1:


\text{MSE}_1 = \frac{1}{n_1} \sum_{i \in \text{Group 1}} (r_i - \bar{r}_1)^2 = \frac{1}{3}\left[ (0.6 + 0.07)^2 + (-0.4 + 0.07)^2 + (-0.4 + 0.07)^2 \right] \approx 0.22

The total MSE for this split is a weighted sum of the MSEs for each group:


\text{MSE}_{\text{total}} = \frac{n_1}{n}\,\text{MSE}_1 + \frac{n_2}{n}\,\text{MSE}_2

Here, n1 and n2 represent the number of samples in Group 1 and Group 2, respectively, while n is the total number of samples. The split that results in the lowest total MSE is chosen as the best split for that node. For the current fictitious data, let's assume that 50,000 is the best split.
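Here is a small sketch of that split search in code. Only the three residuals for the Income < $50,000 group come from the example above; the income values, the remaining residuals, and the candidate thresholds are made up for illustration.

```python
import numpy as np

def split_mse(income, residuals, threshold):
    """Weighted MSE of residuals for the candidate split 'income < threshold'."""
    left = income < threshold
    total = 0.0
    for mask in (left, ~left):
        if mask.sum() == 0:
            continue
        leaf_prediction = residuals[mask].mean()                 # the group's averaged residual
        group_mse = np.mean((residuals[mask] - leaf_prediction) ** 2)
        total += (mask.sum() / len(income)) * group_mse          # weight by group size
    return total

income = np.array([35_000, 42_000, 48_000, 61_000, 75_000])      # hypothetical incomes
residuals = np.array([0.6, -0.4, -0.4, 0.1, 0.1])                # first three match Group 1 above
best_threshold = min([40_000, 50_000, 60_000],
                     key=lambda t: split_mse(income, residuals, t))
```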

Now that we have the averaged residuals from Tree 1, we will use them to update the predicted value (i.e., customer satisfaction score) for each customer. In the real world, you will likely have more than one predictor. However, I have built the first tree using only income as the predictor here for simplicity. To update the predicted value for each customer, you simply pass the data point down the tree according to the decision rules at each node, find the averaged residual in the leaf it lands in, and add it to the initial leaf value to obtain the new predicted value.

For example, for customer #1 in Table 1, their income is less than 50,000, and the averaged residual for them is -0.07, according to the tree in Figure 2. Thus, the new predicted value is:

New Predicted Value = Initial Leaf + (Averaged Residual from Tree n) = 3.4 + (-0.07) = 3.33

Now, we apply this process for every customer, and you will get the updated table below:


Table 2. Customer Satisfaction Data with Adjusted Predicted Values.

You may notice that the updated predicted values for certain customers, such as customer #5 (3.33), are close to the observed value (3). Isn't this excellent?

The answer is no. We are facing an overfitting issue because Tree 1, as the first weak learner, learns too quickly to predict the observed values, leading to a potential lack of generalizability to new, unseen data. To address this issue, we need to apply a learning rate, a hyperparameter that can be fine-tuned and typically ranges from 0 to 1. Lowering the learning rate scales down the influence of each new tree, reducing the overfitting problem. The equation is as follows:

New Predicted Value = Initial Leaf + (learning rate × Averaged Residual from Tree n)

With a learning rate of 0.1, the updated predicted value for the first customer is 3.4 + 0.1(−0.07). We apply this for every customer and get the updated table below:



Table 3. Customer Satisfaction Data with Adjusted Predicted Values and Learning Rate Application.

Now, you can see that the predicted values for the first weak learner are less close to the observed values, yet still closer to the actual values than the initial leaf predictions.

With a lower learning rate, the model takes smaller steps toward reducing the loss, requiring more trees to reach a similar level of accuracy. This slower, more controlled process allows the model to capture underlying patterns gradually without fitting to noise. I have often found that using more trees with a smaller learning rate results in a model that generalizes better than one built with a few large steps.

The concept I explained above applies to building all subsequent trees in the boosting process. In other words, for each new tree, we calculate residuals based on the updated predicted values from the previous tree. Then we use these residuals to continue building trees, each time reducing the model's error further in an iterative process. The final prediction for any customer is the sum of the initial leaf, adjusted by the weighted residuals from each tree:

Final Predicted Value = Initial Leaf + (Learning Rate × Residuals from Tree 1) + (Learning Rate × Residuals from Tree 2) + …
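Putting the whole regression procedure together, here is a compact from-scratch sketch that uses scikit-learn regression trees as the weak learners. It is a simplified illustration of the idea, not production code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt_regressor(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Minimal gradient boosting for regression with squared-error loss."""
    initial_leaf = y.mean()                          # F0: the average outcome
    predictions = np.full(len(y), initial_leaf, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - predictions                  # pseudo-residuals under MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                       # the weak learner predicts residuals
        predictions += learning_rate * tree.predict(X)
        trees.append(tree)
    return initial_leaf, trees

def predict_gbt(X, initial_leaf, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], initial_leaf, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```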

2. GBT for Classification

To understand GBT for classification, we need to first grasp the concepts of log odds (logit), odds, probability, and log likelihood. These are foundational concepts of logistic regression, which I explained in my previous post.

Let’s briefly discuss these concepts. For binary outcomes (e.g., yes/no, true/false), a linear algorithm cannot be directly used to predict the outcome because the predictions might fall outside the intended boundary [0,1]. To restrict the predicted values within this range, we use functions like the sigmoid function. The sigmoid function is mathematically represented as below (note that z is the output of a linear regression equation, which can be represented as B0 + B1X1 + B2X2 + ...).


\sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid function transforms the output of a typical linear regression model so that its values are constrained within the [0,1] range. This transformation allows the output to be interpreted as a probability, as shown in the plot below.


[Plot: the S-shaped sigmoid curve, mapping any real-valued input z to an output between 0 and 1]


However, it is challenging to interpret this non-linear plot directly. For instance, you cannot say exactly how much a one-unit increase in the predictor changes Y, because the relationship is not linear. Therefore, we apply the log function to transform the non-linear sigmoid output back into a linear form. This transformation is expressed as:


\log\left(\frac{p}{1 - p}\right) = z = B_0 + B_1X_1 + B_2X_2 + \dots

In the equation, p is the probability obtained from the sigmoid function. After obtaining a linear function with log odds, it may still not be intuitive to interpret how a one-unit increase in X changes Y in terms of log odds. Therefore, we often convert log odds to odds, which essentially refers to the ratio of the probability of an event of interest happening to the probability of it not happening; the corresponding odds ratio describes how those odds change for a one-unit increase in predictor X. Probability, log odds (logit), and odds can be converted back and forth using the following equations:

\text{odds} = \frac{p}{1 - p}, \qquad \text{log odds} = \log\left(\frac{p}{1 - p}\right), \qquad p = \frac{e^{\text{log odds}}}{1 + e^{\text{log odds}}}
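In code, these conversions are just a few lines. The probability of 0.4 below is an arbitrary example value:

```python
import numpy as np

p = 0.4                                # probability of the event

odds = p / (1 - p)                     # ≈ 0.667: odds of the event
log_odds = np.log(odds)                # ≈ -0.41: the logit
p_back = 1 / (1 + np.exp(-log_odds))   # the sigmoid maps the logit back to 0.4
```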

Now that you understand these concepts, let’s return to gradient boosting for classification.

Unlike gradient boosting for regression, where the initial leaf is the average of the outcome, the initial leaf of gradient boosting for classification is the log odds (i.e., logit), which is the logarithm of the odds of an event happening versus not happening, as explained earlier. Consider the same example of age and income predicting customer satisfaction, but this time the outcome is whether the customer is satisfied (1) or dissatisfied (0). Imagine we have 5 customers, and two of them are satisfied, as depicted in the table below:



Table 4. Customer Satisfaction Data.

The log of the odds can be calculated using the equation:


\log(\text{odds}) = \log\left(\frac{\text{number of satisfied customers}}{\text{number of dissatisfied customers}}\right)


In our example, the log of the odds is:


\log(\text{odds}) = \log\left(\frac{2}{3}\right) \approx -0.4

This logit value is our initial leaf. Now, you may wonder how we use the log odds to get the residual values so that we can use them to grow a tree when the values in our table are just 0 and 1. The answer is that we convert the log odds to a probability so that we can subtract the predicted probability from the actual observed values (0 or 1). If you recall the equations above, we can get the probability as:


p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = \frac{e^{-0.4}}{1 + e^{-0.4}} \approx 0.4

Now that we have the predicted probability implied by our initial leaf, we can calculate the residuals by subtracting the predicted probability from the actual observed values (0 or 1), just like we did in gradient boosting regression.
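In code, the initial leaf, its implied probability, and the first set of residuals take only a few lines; the labels below follow the five-customer example (two satisfied, three dissatisfied):

```python
import numpy as np

y = np.array([1, 1, 0, 0, 0])                             # 2 satisfied, 3 dissatisfied

initial_log_odds = np.log(y.sum() / (len(y) - y.sum()))   # ln(2/3) ≈ -0.4 (the initial leaf)
initial_prob = 1 / (1 + np.exp(-initial_log_odds))        # ≈ 0.4
residuals = y - initial_prob                              # observed label minus predicted probability
```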

Table 5. Customer Satisfaction Data with Residuals.

Similar to what we did earlier in gradient boosting for regression, these residuals will then be used to build our first tree using the same logic we used previously for GBT regression. For example, if we want to use income to predict customer satisfaction, the algorithm evaluates various potential cut-off points (e.g., "income < 40,000", "income < 60,000", etc.) and selects the one that minimizes the Gini Index or Cross-Entropy, making the leaf nodes as "pure" as possible in terms of residuals. If "income < 50,000" is considered a potential cut-off point, the residuals of customers with an income below this threshold are considered in the split calculation. The goal is to find the split that results in more homogeneity in terms of their residuals. We can calculate this using the formula below:



[Formula: the split-quality criterion evaluated on the residuals in each candidate node, analogous to the weighted criterion used for the regression trees above]

If the split results in nodes where the residuals are closer to zero, it is considered a good split. The process involves iteratively adjusting the cut-off point until the leaf nodes are as pure as possible, meaning they contain residuals that are minimal and represent better predictions for the classification task. Let’s imagine the first tree we obtained is shown below:



Figure 3. The First Weak Learner in the GBT for Classification of the Customer Satisfaction Data.


Note that for GBT tasks, including both classification and regression, residuals are referred to as "pseudo-residuals" because they represent the difference between the actual outcomes and the model's current predictions (the running predictions for a regression task and the predicted probabilities for a classification task), rather than the difference between the actual and predicted values as in linear regression. A more technical way to explain this is that pseudo-residuals are derived from the gradient of the loss function used, such as the negative log-likelihood (log loss) or binary cross-entropy for classification tasks, and MSE for regression tasks.

Now that we have the residuals as the final leaves of the tree and the initial leaf (e.g., −0.4), we can calculate the predicted probability for each customer. However, since the initial leaf is in the log of the odds (logit) form and the tree is built based on probabilities, we can't simply sum the log odds and residuals. As explained earlier, log odds and probabilities are two different representations.

To address this issue, we need to perform some mathematical transformations. For each final leaf, we will plug its residual into the formula below to get the adjusted residuals in a form that allows us to combine the value with the initial leaf.

\text{adjusted residual} = \frac{\sum \text{residuals in the leaf}}{\sum \left[ p_{\text{previous}} \times (1 - p_{\text{previous}}) \right]}

Let's use Final Leaf 1 in the tree below as an example and apply the formula to calculate the adjusted residual. Note that the previous probability for all customers in this leaf is 0.4 (see Table 4). Thus, for the numerator, we have 0.6 + (−0.4) + (−0.4) = −0.2. For the denominator, we calculate (0.4 × 0.6) + (0.4 × 0.6) + (0.4 × 0.6) = 0.72. Therefore, the adjusted residual for the first leaf is −0.2 / 0.72 ≈ −0.28. If you repeat this calculation for all final leaves, you will get a tree that looks like the one shown below:


Figure 4. The First Weak Learner in the GBT for Classification of the Customer Satisfaction Data with Adjusted Residual.
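As a quick check on Figure 4, here is the leaf calculation from the paragraph above written out in code:

```python
# Residuals and previous predicted probabilities for the customers in Final Leaf 1.
leaf_residuals = [0.6, -0.4, -0.4]
previous_probs = [0.4, 0.4, 0.4]

numerator = sum(leaf_residuals)                          # -0.2
denominator = sum(p * (1 - p) for p in previous_probs)   # 0.72
adjusted_residual = numerator / denominator              # ≈ -0.28
```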


Now, we have the final output for each leaf in a form that can be combined with the initial leaf, which is in the log-odds form. To do this, we can use the formula below.


\text{new log odds} = \text{previous log odds} + (\text{learning rate} \times \text{adjusted residual})

For Customer 1, we calculate −0.4 + (0.1 × −0.28). Note that the learning rate can be fine-tuned, but I will use 0.1 for now just for demonstration purposes. We repeat this calculation for every customer to obtain the log-odds for each of them. Then, we convert the log-odds to probabilities so that we can subtract these new predicted probabilities from the actual observed values (0 or 1) to get the new set of residuals, as shown in the table below:



Table 6. Customer Satisfaction Data with New Predicted Values and Residuals.

You may notice that the new residuals for some customers are smaller than their initial residuals. This indicates that we are on the right track. However, for certain customers, the new residuals are larger than the original ones. This is why we need to keep repeating the process until all residuals become sufficiently small or until we reach the maximum number of trees specified, indicating that further improvement is minimal. By controlling the learning rate and the number of trees, we can ensure that the model does not overfit the data and generalizes well to unseen data. The ultimate goal is to build a robust model that balances bias and variance. In the end, we will have n trees, where n is a parameter we fine-tune.

When we get a new, unseen data point, we will run it through every tree and then sum the final log-odds predictions across all trees, along with the initial leaf (-0.4), to obtain the final predicted log-odds for that specific data point. We then convert the log-odds into a probability to obtain the final prediction. Typically, we set the threshold at 0.5, meaning that if someone has a final probability greater than 0.5, their customer satisfaction will be predicted as 1. If it is less than 0.5, it will be predicted as 0.
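A minimal sketch of that prediction step might look like the code below, assuming `fitted_trees` is a list of regression trees whose leaves output the adjusted residuals (on the log-odds scale), as built above:

```python
import numpy as np

def predict_proba_gbt(X, initial_log_odds, fitted_trees, learning_rate=0.1):
    """Sum the initial log-odds and each tree's scaled contribution, then apply the sigmoid."""
    log_odds = np.full(X.shape[0], initial_log_odds, dtype=float)
    for tree in fitted_trees:
        log_odds += learning_rate * tree.predict(X)   # each tree adds its adjusted residual
    return 1 / (1 + np.exp(-log_odds))                # convert log-odds to probability

# probabilities = predict_proba_gbt(X_new, -0.4, fitted_trees)
# labels = (probabilities > 0.5).astype(int)          # threshold at 0.5
```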

Parameters to Fine-Tune in GBT

Now that you understand the concepts of how GBT for regression and classification work, let's talk more about which parameters can be fine-tuned.

1. Learning Rate (Range between 0 and 1):

GBT models include a learning rate (also known as shrinkage or step size) that scales the contribution of each tree. A smaller learning rate (closer to 0) requires more trees to model the data but often results in a more robust model that generalizes better to unseen data. A larger learning rate (closer to 1) may speed up the learning process but risks overfitting if the model captures too much noise from the training data. The choice of learning rate is a trade-off between the number of trees needed and the model's generalization ability.

2. Tree Depth:

The individual trees in GBT are usually shallow, ranging from one to a few levels deep. These shallow trees, again known as "weak learners," are not very powerful on their own but become highly effective when combined through the boosting process. Deeper trees provide more complex decision boundaries, leading to faster learning but also increasing the risk of overfitting, which adds more variance to the model. Conversely, shallower trees take smaller steps in learning (learning more slowly), requiring more iterations to capture the patterns in the data. However, they tend to reduce variance, leading to a more stable model. The optimal depth depends on the complexity of the underlying data.

3. Regularization Techniques:

In addition to the learning rate, GBT can be regularized through other techniques, such as subsampling the training data (also known as stochastic gradient boosting) and controlling the complexity of the trees (e.g., limiting tree depth, minimum samples per leaf). These techniques help prevent overfitting to the training data, especially when dealing with large datasets.
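If you tune these settings with scikit-learn, a simple grid search over the parameters discussed above might look like the sketch below; the grid values are illustrative, not recommendations.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],   # smaller values need more trees
    "n_estimators": [100, 300, 500],
    "max_depth": [1, 2, 3],               # shallow weak learners
    "subsample": [0.7, 1.0],              # <1.0 enables stochastic gradient boosting
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
# search.fit(X_train, y_train)   # X_train, y_train are assumed to exist
# best_model = search.best_estimator_
```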


As always, this demo doesn't end here. If you're interested in seeing how these fundamentals can be applied to a real-world example, feel free to check out my GitHub page here, where I’ve provided the full code along with the relevant data. I hope you enjoyed today's content as much as I enjoyed writing it. See you in the next algorithm!
