PRUNING in Decision Trees

Pruning is a technique used in decision tree algorithms to prevent overfitting and improve the model's ability to generalize. Overfitting occurs when a decision tree grows too complex and captures noise in the training data rather than the underlying patterns. Pruning removes parts of the tree that contribute little predictive power, making the tree simpler and more interpretable.

There are two main types of pruning:

  1. Pre-Pruning: The tree is pruned while it is being built. At each node, the algorithm checks whether splitting further is worthwhile (for example, whether the split improves performance on validation data, or whether limits such as a maximum depth or a minimum number of samples per leaf have been reached). If not, the node is made a leaf and is not split further.
  2. Post-Pruning: As the name suggests, the tree is first grown to its full size, and then subtrees that do not provide significant predictive power are removed. This is typically done using techniques like cost-complexity pruning. A sketch of both styles follows this list.
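To make the distinction concrete, here is a minimal sketch using scikit-learn (the dataset and the particular parameter values are illustrative assumptions, not tuned recommendations). Pre-pruning appears as growth-limiting parameters such as max_depth and min_samples_leaf, while post-pruning is exposed through the cost-complexity parameter ccp_alpha:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then collapse weak subtrees via the alpha penalty.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned test accuracy: ", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))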

Cost-Complexity Pruning:

Cost-complexity pruning is a common method for post-pruning decision trees. Each candidate subtree of the fully grown tree is assigned a cost that balances its error against its size, and the subtree with the smallest cost is kept as the pruned tree. The cost depends on a complexity parameter (often denoted alpha) and the number of leaf nodes in the subtree.

  1. Calculate Cost for Subtrees: For each candidate subtree, calculate the total error (such as the misclassification count) on the validation data and add a complexity penalty based on the number of leaf nodes and the complexity parameter:

Cost = Error + alpha * (Number of Leaf Nodes)

  2. Select the Best Subtree: Keep the subtree with the smallest cost as the pruned tree. A scikit-learn sketch of this procedure follows this list.
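scikit-learn automates this procedure: cost_complexity_pruning_path enumerates the effective alpha values at which subtrees get collapsed, and each candidate alpha can then be scored on held-out data. A minimal sketch (the dataset and split are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree and list the alphas at which subtrees would be collapsed.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the subtree that scores best on validation data.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.5f}, validation accuracy: {best_score:.3f}")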

Example:

Consider a decision tree for predicting whether a customer will buy a product based on two features: age and income. The fully grown tree might look like this:


IF age < 30 AND income < 50000 THEN Classify as "Not Buy"
ELSE IF age >= 30 AND income >= 50000 THEN Classify as "Buy"
ELSE IF age >= 30 AND income < 50000 THEN Classify as "Not Buy"
ELSE Classify as "Buy"
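For concreteness, here are the same rules written as a plain Python function (a direct, illustrative translation of the pseudocode above):

def predict_full_tree(age: float, income: float) -> str:
    """Fully grown example tree: one leaf per age/income region."""
    if age < 30 and income < 50000:
        return "Not Buy"
    elif age >= 30 and income >= 50000:
        return "Buy"
    elif age >= 30 and income < 50000:
        return "Not Buy"
    else:  # age < 30 and income >= 50000
        return "Buy"

print(predict_full_tree(25, 40000))  # -> "Not Buy"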


In this example, if the tree is overfitting the data, pruning might occur as follows:

  1. Assign Costs (alpha = 0.05; each subtree here is a single leaf):
     Subtree 1 (age < 30 AND income < 50000): Error = 5 misclassifications, Cost = 5 + 0.05 * 1 = 5.05
     Subtree 2 (age >= 30 AND income >= 50000): Error = 2 misclassifications, Cost = 2 + 0.05 * 1 = 2.05
     Subtree 3 (age >= 30 AND income < 50000): Error = 3 misclassifications, Cost = 3 + 0.05 * 1 = 3.05
     Subtree 4 (Else): Error = 4 misclassifications, Cost = 4 + 0.05 * 1 = 4.05
     (This arithmetic is reproduced in the short sketch after the pruned tree below.)
  2. Prune the Tree: We keep the version of the tree with the smallest total cost; branches that add complexity without reducing error enough are collapsed into a single leaf, while low-cost branches such as Subtree 2 are kept as they are. Here, Subtrees 3 and 4 are merged into one Else leaf predicting "Not Buy". The pruned tree would then look like this:

IF age < 30 AND income < 50000 THEN Classify as "Not Buy"
ELSE IF age >= 30 AND income >= 50000 THEN Classify as "Buy"
ELSE Classify as "Not Buy"
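The cost arithmetic in step 1 is easy to verify directly. This short sketch simply reproduces the numbers from the example (the error counts and alpha are taken from the text above):

alpha = 0.05
errors = {"Subtree 1": 5, "Subtree 2": 2, "Subtree 3": 3, "Subtree 4": 4}

for name, error in errors.items():
    n_leaves = 1  # each subtree in this example is a single leaf
    cost = error + alpha * n_leaves
    print(f"{name}: cost = {error} + {alpha} * {n_leaves} = {cost:.2f}")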


Pruning reduces the complexity of the tree, making it less likely to overfit the training data. It is essential for ensuring that the decision tree generalizes well to unseen data and makes accurate predictions.


