PRUNING in Decision Trees

Pruning is a technique used in decision tree algorithms to prevent overfitting and improve the model's ability to generalize. Overfitting occurs when a decision tree grows too complex and captures noise in the training data rather than the underlying patterns. Pruning removes parts of the tree that contribute little predictive power, making the tree simpler and more interpretable.

There are two main types of pruning:

  1. Pre-Pruning: The tree is pruned while it is being built. At each node, the algorithm checks whether splitting further is worthwhile (for example, whether the split improves performance on validation data, or whether limits such as a maximum depth or a minimum number of samples per leaf have been reached). If not, the node is made a leaf and is not split further.
  2. Post-Pruning: As the name suggests, the tree is first grown to its full size, and then subtrees that do not provide significant predictive power are removed. This is typically done using techniques like cost-complexity pruning. A sketch of both styles follows this list.
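To make the distinction concrete, here is a minimal sketch using scikit-learn (the dataset and the particular parameter values are illustrative assumptions, not tuned recommendations). Pre-pruning appears as growth-limiting parameters such as max_depth and min_samples_leaf, while post-pruning is exposed through the cost-complexity parameter ccp_alpha:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then collapse weak subtrees via the alpha penalty.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned test accuracy: ", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))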

Cost-Complexity Pruning:

Cost-complexity pruning is a common method for post-pruning decision trees. Each candidate subtree of the fully grown tree is assigned a cost that balances its error against its size, and the subtree with the smallest cost is kept as the pruned tree. The cost depends on a complexity parameter (often denoted alpha) and the number of leaf nodes in the subtree.

  1. Calculate Cost for Subtrees: For each candidate subtree, calculate the total error (such as the misclassification count) on the validation data and add a complexity penalty based on the number of leaf nodes and the complexity parameter:

Cost = Error + alpha * (Number of Leaf Nodes)

  2. Select the Best Subtree: Keep the subtree with the smallest cost as the pruned tree. A scikit-learn sketch of this procedure follows this list.
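scikit-learn automates this procedure: cost_complexity_pruning_path enumerates the effective alpha values at which subtrees get collapsed, and each candidate alpha can then be scored on held-out data. A minimal sketch (the dataset and split are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree and list the alphas at which subtrees would be collapsed.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the subtree that scores best on validation data.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.5f}, validation accuracy: {best_score:.3f}")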

Example:

Consider a decision tree for predicting whether a customer will buy a product based on two features: age and income. The fully grown tree might look like this:


IF age < 30 AND income < 50000 THEN Classify as "Not Buy"
ELSE IF age >= 30 AND income >= 50000 THEN Classify as "Buy"
ELSE IF age >= 30 AND income < 50000 THEN Classify as "Not Buy"
ELSE Classify as "Buy"
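For concreteness, here are the same rules written as a plain Python function (a direct, illustrative translation of the pseudocode above):

def predict_full_tree(age: float, income: float) -> str:
    """Fully grown example tree: one leaf per age/income region."""
    if age < 30 and income < 50000:
        return "Not Buy"
    elif age >= 30 and income >= 50000:
        return "Buy"
    elif age >= 30 and income < 50000:
        return "Not Buy"
    else:  # age < 30 and income >= 50000
        return "Buy"

print(predict_full_tree(25, 40000))  # -> "Not Buy"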


In this example, if the tree is overfitting the data, pruning might occur as follows:

  1. Assign Costs (alpha = 0.05; each subtree here is a single leaf):
     Subtree 1 (age < 30 AND income < 50000): Error = 5 misclassifications, Cost = 5 + 0.05 * 1 = 5.05
     Subtree 2 (age >= 30 AND income >= 50000): Error = 2 misclassifications, Cost = 2 + 0.05 * 1 = 2.05
     Subtree 3 (age >= 30 AND income < 50000): Error = 3 misclassifications, Cost = 3 + 0.05 * 1 = 3.05
     Subtree 4 (Else): Error = 4 misclassifications, Cost = 4 + 0.05 * 1 = 4.05
     (This arithmetic is reproduced in the short sketch after the pruned tree below.)
  2. Prune the Tree: We keep the version of the tree with the smallest total cost; branches that add complexity without reducing error enough are collapsed into a single leaf, while low-cost branches such as Subtree 2 are kept as they are. Here, Subtrees 3 and 4 are merged into one Else leaf predicting "Not Buy". The pruned tree would then look like this:

IF age < 30 AND income < 50000 THEN Classify as "Not Buy"
ELSE IF age >= 30 AND income >= 50000 THEN Classify as "Buy"
ELSE Classify as "Not Buy"
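The cost arithmetic in step 1 is easy to verify directly. This short sketch simply reproduces the numbers from the example (the error counts and alpha are taken from the text above):

alpha = 0.05
errors = {"Subtree 1": 5, "Subtree 2": 2, "Subtree 3": 3, "Subtree 4": 4}

for name, error in errors.items():
    n_leaves = 1  # each subtree in this example is a single leaf
    cost = error + alpha * n_leaves
    print(f"{name}: cost = {error} + {alpha} * {n_leaves} = {cost:.2f}")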


Pruning reduces the complexity of the tree, making it less likely to overfit the training data. It is essential for ensuring that the decision tree generalizes well to unseen data and makes accurate predictions.


