Classification Tree - Read This Before Applying Your Random Forest Algorithms
Hello, LinkedIn people! Today, I am back with another post about a simple yet powerful machine learning algorithm: Decision Tree. You might wonder: Why is this straightforward algorithm so compelling, and why should I deeply understand it before applying a more advanced algorithm like Random Forest to my project? Here is the answer:
Beyond Linear Constraints: Unlike many algorithms that seek linear paths through data, decision trees, as nonparametric models, embrace the data's non-linear nature. This adaptability is crucial, acknowledging the rarity of perfectly linear relationships in our real-world projects.
Preparation Made Simple: Decision trees streamline the often tedious process of data preparation by eliminating the need for feature scaling or centering. The algorithm makes decisions by selecting features and threshold values that split the data into subsets. These decisions depend on the order of the data points within a feature, not on their absolute values. Because scaling or centering a feature does not change the order of its data points, it does not affect the tree's decision-making process. For instance, whether a feature is measured in centimeters or inches would not change the decisions made by the tree, as it merely looks for the best splitting points regardless of the actual units or scale. A short code sketch after this list illustrates the point.
Handling Heterogeneous Data: Decision trees can handle features with different scales or units naturally, without requiring normalization or standardization. This capability is particularly useful when dealing with heterogeneous data, where different features might represent vastly different types of data (e.g., age in years and income in dollars).
Improving Prediction Accuracy with Ensemble Techniques: A single decision tree may overfit or struggle to balance bias and variance: overly simplistic trees tend to exhibit high bias, while overly complex trees suffer from high variance, and both compromise the model's performance on unseen data. Incorporating trees into ensemble techniques such as bagging (bootstrap aggregating, in which each tree is trained on a random sample drawn with replacement) can help improve prediction accuracy; see the brief comparison sketched below.
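To see the "Preparation Made Simple" point in action, here is a minimal sketch (my own illustration with arbitrary settings, not code from the original post) that fits one tree on the raw iris features and another on standardized features and checks that they make the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit one tree on the raw features and another on the standardized features.
tree_raw = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_scaled, y)

# Scaling preserves the ordering of values within each feature, so the two trees
# partition the samples identically; only the printed threshold values differ.
print(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))  # expected: True
```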
Thus, before diving deeper into Random Forest or other ensemble methods, gaining a solid understanding of decision trees is crucial. This foundation will enable you to navigate the complexities of ensemble machine learning algorithms much more effectively.
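On the ensemble point above, here is a rough sketch (again my own illustration, with arbitrary settings) comparing a single tree with a bagged ensemble of trees via cross-validation; scikit-learn's BaggingClassifier uses a decision tree as its default base estimator.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
# Each tree in the ensemble is fit on a bootstrap sample (random sampling with
# replacement), and the trees' predictions are combined by majority vote.
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```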
Elements of A Decision Tree
Before we dive into how the algorithm works, let's familiarize ourselves with what makes up a tree. To simplify the topic, I've used the iris dataset, which many of us recognize from statistics classes. At this stage, there's no need to worry about interpreting values like the Gini index or the meanings of the numbers within each box. Instead, let's concentrate on getting to know the basic structure of a tree.
Root Node: This is the starting point of a decision tree. The root node is the top box, which, according to the figure above, splits based on the condition 'sepal length <= 5.45'. It's where the first decision is made and splits into two or more branches.
Parent Node: These are nodes that split into other nodes. Any box that has a subsequent split is a parent node, and the subsequent node is a child node. For example, the node with 'sepal width <= 2.8' (the node at the first level on the left; the root node is considered the zero level) is a parent node.
Leaf Node or Terminal Node: These are the final nodes that do not split further and provide the outcome or classification. The boxes that have a 'class' label but no further branches are leaf nodes. They are the boxes at the last level, representing the final decision or classification.
Branch: These are subsections of the tree that split from either the root or parent nodes. Each condition leads to another decision, which could be a leaf or another decision node. In the figure above, the branches are the lines connecting nodes (i.e., boxes).
Depth of a tree: The root node is considered depth 0, and each split increases the depth of the tree by one. According to the figure, the depth is 2. A short code sketch for building a tree like this follows below.
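For readers who want to reproduce a figure like the one described above, the following sketch (an assumed reconstruction, so the exact thresholds and class counts may differ slightly from the original figure) fits a depth-2 tree on the two sepal features and draws it with scikit-learn's plot_tree.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # sepal length and sepal width only

# max_depth=2 gives a root node (depth 0), parent nodes at depth 1,
# and leaf nodes at depth 2, matching the structure described above.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

plt.figure(figsize=(8, 5))
plot_tree(tree, feature_names=iris.feature_names[:2],
          class_names=list(iris.target_names), filled=True)
plt.show()
```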
Classification Trees
Now that you have a basic understanding of what a decision tree is and its key elements, let's explore the first type of decision tree: the classification tree.
This type of tree is designed to predict categorical outcomes, which can be either binary or multiclass. For simplicity, this post will focus on binary targets.
While classification trees predict categorical outcomes, the features used to develop the tree can be both categorical and continuous.
The feature selected as the first node (i.e., root node) is not chosen randomly. Instead, the algorithm selects a feature from the dataset and evaluates how effectively it can categorize the target. This process is iterative, continuing until the algorithm identifies the optimal root node that results in the subsequent nodes being as pure as possible. For instance, in the iris example mentioned above, 'sepal length' is the root node because this feature better purifies the subsequent nodes compared to other features in the dataset.
You might be wondering how the purity of a node is defined. Imagine selecting two features from your dataset and visualizing them in a 2D space where the features represent the X and Y axes, and the outcomes are data points within this space. The algorithm assesses each feature to find the optimal split point or 'decision boundary' along its axis. All splits are perpendicular to the axis, meaning decision trees have orthogonal decision boundaries.
But what is a 'decision boundary', really? This term refers to the boundary that demarcates the different classes within the feature space. Each region on one side of the boundary is predicted by the algorithm to belong to a specific class. In the simplest scenario, where your dataset comprises only two features, the decision boundary manifests as a line. With three features, it takes the form of a plane. In scenarios involving more than three features, it becomes a hyperplane. It's important to note that the concept of a decision boundary is not exclusive to decision trees but is prevalent across various algorithms. In linear models such as logistic regression, the decision boundary is linear, delineating a straightforward separation between classes. Conversely, decision trees typically exhibit a non-linear decision boundary, which can be more complex, reflecting the algorithm's ability to capture more nuanced patterns within the data.
According to the decision boundary plot created using the Iris dataset above, the areas filled in yellow, purple, and blue represent the decision regions as determined by the tree. These regions indicate where the algorithm predicts a specific class (i.e., Setosa, Versicolor, Virginica) for any given point based on the input features (i.e., sepal length and sepal width). The lines separating each of the three colored regions constitute the decision boundaries, which demarcate where the model's predictions shift from one flower class to another. Ideally, each decision region should correspond to a single class. However, as observed, the real world rarely aligns perfectly with our models, especially when constructing a decision tree with only a few features. Within the yellow decision region, despite the predominance of Versicolor points, there are several blue dots. These represent instances where the model has misclassified Virginica as Versicolor.
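A decision-region plot like the one described above can be produced by classifying every point on a dense grid over the two sepal features and coloring the resulting regions. The sketch below is an assumed reconstruction, so its colors and styling will differ from the original figure.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # sepal length, sepal width
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Build a grid covering the feature space and classify every grid point.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# The filled areas are the decision regions; their edges are the orthogonal
# decision boundaries. The observed flowers are scattered on top.
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
```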
At each candidate split point along each feature axis, the algorithm calculates an impurity measure such as the Gini index or entropy. The value on the axis that yields the lowest Gini index or entropy in the resulting child nodes, that is, the one producing nodes that are pure enough, is used as the split point. In essence, a node is considered pure enough when the samples within it are more homogeneous than heterogeneous. The algorithm repeats this process with all features until it discovers the optimal decision boundary. Let's delve deeper into the concepts of Gini index and entropy for a clearer understanding of how these parameters influence classification tree growth.
Gini Index
Gini Index is a metric used to determine whether each node of a decision tree is sufficiently pure, containing homogeneous rather than heterogeneous samples. It calculates the probability of a randomly selected sample from a node being incorrectly classified. The closer the Gini index is to zero, the higher the purity of the leaf node, with a lower Gini index indicating greater purity.
If the split points of a feature result in a high Gini index, the tree might adjust these points or opt for new features to construct an alternative tree that reduces impurity.
Here's the formula for the Gini index of a specific node:

Gini = 1 − (P₁² + P₂² + … + Pₘ²) = 1 − Σᵢ Pᵢ²

where:
m is the number of classes. For the iris example, we have three classes. Thus m = 3.
Pi is the proportion of class i within the node. For example, in the decision tree of the Iris dataset below, at the first-level node displayed within the orange box, the Gini index is calculated by plugging each class's proportion into the formula above.
The Gini index value of 0.2147 indicates that the node exhibits a certain level of purity, predominantly classifying cases as Setosa. However, being the result of the first split, this value is not optimal. For binary targets, the Gini index ranges from 0 (complete purity) to 0.5 (maximum impurity), which means a value of 0.2147 isn't deemed low enough to consider this node purely homogeneous. For multiclass targets, the maximum Gini index is 1 − 1/m, which approaches 1 as the number of classes grows, accommodating a broader spectrum of impurity levels.
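To make the formula concrete, here is a tiny helper (my own sketch, not code from the original post) that computes the Gini index from a node's class counts; the two example nodes are hypothetical.

```python
def gini(counts):
    """Gini index = 1 - sum(P_i^2) over the m classes in the node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0, 0]))   # 0.0 -> a perfectly pure node
print(gini([25, 25, 0]))  # 0.5 -> a maximally impure two-class node
```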
Entropy
Entropy is a concept borrowed from information theory, signifying the measure of uncertainty or unpredictability in the information content. It helps quantify how much information there is in an event's outcome.
Consider a scenario where a node is associated with three possible classes, each with a distinct probability of occurrence. Although this situation implies the leaf is not perfectly pure, it raises an important question: is it beneficial to grow a tree with such a leaf, or is it better not to grow the tree at all? This is where entropy comes into play.
Similar to the Gini index, entropy serves as a criterion for determining where to make splits in a decision tree. Both metrics aim to diminish impurity with each subsequent split, with lower values indicating a higher degree of node purity. While some folks lean towards the Gini index due to its marginally lower computational demands, making it quicker to calculate, others prefer entropy for its foundational ties to information theory.
For Python users, scikit-learn's DecisionTreeClassifier defaults to using the Gini index. However, you can opt for entropy by adjusting the criterion hyperparameter. I recommend experimenting with both criteria, along with other decision tree hyperparameters, to identify the optimal setup for your particular dataset.
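As a hedged illustration of how such an experiment might look, the sketch below runs a small grid search over both criteria and a few tree depths; the parameter grid is arbitrary rather than a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion='gini' is the default; 'entropy' uses information gain instead.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```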
The entropy of a dataset D is calculated using the equation:

Entropy(D) = −(P₁·log₂(P₁) + P₂·log₂(P₂) + … + Pₘ·log₂(Pₘ)) = −Σᵢ Pᵢ·log₂(Pᵢ)

where:
m represents the total number of classes.
Pi denotes the proportion of examples in the dataset that belong to class i.
Let's revisit the orange box at the first-level node of the iris decision tree for an illustrative example. The node contains a total of 52 cases, with 45 belonging to the class Setosa. Therefore, the proportion P for the Setosa class is 45/52. For the remaining two classes, their proportions are 6/52 and 1/52, respectively.
To calculate the entropy of D, we insert these numbers into the entropy equation:

Entropy(D) = −[(45/52)·log₂(45/52) + (6/52)·log₂(6/52) + (1/52)·log₂(1/52)] ≈ 0.6496
The calculated entropy value of 0.6496 signals a moderate level of purity in the node, shaped largely by the majority class, Setosa. However, considering that entropy ranges from 0 (complete purity) up to 1 for binary outcomes (and up to log₂(m) for m classes), the value of 0.6496 falls short of being considered ideal. A higher entropy value signifies a more even class distribution within the node, that is, a higher degree of impurity or disorder.
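As a quick sanity check of the worked example, the small helper below (my own sketch) computes the entropy from the class counts read off the figure (45, 6, and 1).

```python
from math import log2

def entropy(counts):
    """Entropy = -sum(P_i * log2(P_i)) over classes with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([45, 6, 1]), 4))  # 0.6496, matching the calculation above
```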
It's worth noting that in the context of classification problems, such as with classification trees, the Gini index and entropy serve as criteria for deciding whether to split or further develop the tree. These metrics do not measure model performance. Instead, performance metrics for classification problems include accuracy, precision, recall, F1 scores, and the area under the curve, among others, which I will explain later in the post.
Interested in learning more about when to stop growing a tree, how to fine-tune a decision tree, using cross-validation to find the best hyperparameter values, understanding training complexity, and exploring a real-world example using a banking dataset? Click https://github.com/KayChansiri/Demo_Classification_Tree to continue reading on my GitHub, where you'll find all of the code examples and datasets.
Understanding decision trees is key to unlocking the potential of advanced algorithms. Stay tuned! Kay Chansiri, Ph.D.