Know How to Create and Visualize a Decision
Tree with Python
Decision trees are a popular and important family of Machine Learning (ML) models. Their
appeal comes from their easy-to-understand visualizations and fast deployment into
production. To visualize a decision tree well, it is essential to understand the concepts
behind the decision tree algorithm, so that the analysis can be carried out properly.
Knowing how decision trees work, and which elements a decision tree visualization should
highlight, makes it much easier to create and read one. A great decision tree visualization
speaks for itself, so it pays to gather all the inputs before creating it and to improve on
the old ways of plotting trees so that the result is easy to understand.
Decision Trees
Decision trees are the core building blocks of several advanced algorithms, including two
of the most popular machine learning models for structured data: XGBoost and Random
Forest. A decision tree is a supervised machine learning algorithm that can be used for
both classification and regression tasks.
A decision tree is a tree of nodes whose branches are based on a number of factors; it
keeps splitting the data into branches until a stopping threshold is reached. A decision
tree consists of a root node, internal (child) nodes, and leaf nodes, and each leaf is
responsible for a specific prediction. A decision tree learns the relationships present in
the observations of a training set, represented as feature vectors x and target values y,
by examining the training data and condensing it into a binary tree of interior nodes and
leaf nodes.
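As a quick illustration of that binary tree structure, here is a minimal sketch (not part
of the original code listing) that fits a classifier on the iris data and inspects the
resulting nodes; the random_state value is an arbitrary choice for reproducibility.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# The learned structure is a binary tree of interior (decision) and leaf nodes.
print("total nodes:", clf.tree_.node_count)
print("leaf nodes: ", clf.get_n_leaves())
print("tree depth: ", clf.get_depth())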
The disadvantage of decision trees is that the split made at each node is optimized for
the dataset the tree is fit to; this splitting process rarely generalizes well to other
data. However, one can generate a large number of decision trees, tuned in slightly varied
ways, and combine their predictions to create some of the best-performing models.
Visualizing a decision tree is a tremendous help in learning, understanding, and
interpreting how the model works. One of the biggest benefits of decision trees is their
interpretability: after fitting, the model is effectively a set of rules for predicting
the target variable, and one does not need to be familiar with ML techniques at all to
understand what a decision tree is doing. This is the main reason decision trees are so
useful in practice; it is easy to plot the rules and show them to stakeholders, so they
can easily understand the model’s underlying logic.
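For instance, scikit-learn can print this rule set directly. The following is a minimal
sketch, assuming the iris data used later in this article; export_text renders the fitted
tree as plain-text if/else rules that a non-ML audience can read.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
dtree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Print the fitted tree as nested if/else split rules, one line per node.
print(export_text(dtree, feature_names=list(iris.feature_names)))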
Even so, it is hard to find a library that visualizes how the decision nodes split up the
feature space, and it is also uncommon for libraries to support visualizing a specific
feature vector as it weaves down through a tree's decision nodes; such illustrations are
rarely found.
Essential elements of decision tree
visualization:
Before digging deeper, it is essential to know the most important elements that decision
tree visualizations must highlight (a short plotting sketch after this list illustrates
several of them):
• Decision node feature versus target value distributions:
A decision node is where the tree splits according to the value of some
attribute/feature of the dataset. One must understand how separable the
target values are, given the feature and a split point.
• Decision node feature name and feature split value:
The root node is where the first split takes place. One must know which
feature each decision node is testing and where in that feature space the
node splits the observations.
• Leaf node purity, which affects the prediction confidence:
Leaves with low variance among the target values (regression) or a clear
majority target class (classification) are more reliable predictors.
• Leaf node prediction value:
A leaf node is a terminal node that yields the decision tree's prediction.
One must understand what the leaf predicts from its collection of target
values.
• Numbers of samples in decision nodes:
It is sometimes helpful to know where most of the samples are being routed
through the decision nodes.
• Numbers of samples in leaf nodes:
The main objective of a decision tree is to have larger, purer leaves;
nodes with few samples are a possible sign of overfitting.
• An understanding of how a particular feature vector is run down the
tree to a leaf:
This helps explain why a particular feature vector gets the prediction it
does. For instance, in a regression tree predicting apartment rent prices,
one might find that a feature vector is routed into a high-predicted-price
leaf because of a decision node that checks for more than three bedrooms.
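A minimal sketch of how these elements can be surfaced, using scikit-learn's built-in
plot_tree (a matplotlib-based alternative to the graphviz approach used later in this
article; the depth limit of 3 is an arbitrary choice for readability):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

plt.figure(figsize=(12, 6))
plot_tree(clf,
          feature_names=iris.feature_names,  # decision node feature names and split values
          class_names=iris.target_names,     # leaf prediction values
          filled=True,                       # shading hints at node purity
          impurity=True)                     # Gini impurity shown per node
plt.show()

The rendered boxes also report the number of samples in each decision and leaf node.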
Creating and visualizing decision trees
with Python
While creating a decision tree, the key step is to select the best attribute, out of the
dataset's full list of features, for the root node and for each sub-node. This selection
is achieved with a technique known as an Attribute Selection Measure (ASM), such as Gini
impurity or information gain. Using an ASM, one can quickly and easily select the best
feature for each node of the decision tree.
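As an illustration, the following is a hand-rolled sketch (not from the original article,
and separate from scikit-learn's internal implementation) of how Gini impurity can score a
candidate split: the lower the weighted impurity after the split, the better the attribute
and threshold.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_k^2).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    # Weighted Gini impurity after splitting on feature <= threshold.
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy example: a threshold of 2.0 separates the classes perfectly, so impurity is 0.0.
x = np.array([1.0, 1.2, 3.1, 3.3, 3.5])
y = np.array([0, 0, 1, 1, 1])
print(split_impurity(x, y, threshold=2.0))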
The classic iris dataset will be used for creating and visualizing decision trees with
Python. Here is the code for loading it.
• Data: Iris Dataset

import sklearn.datasets as datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
When the following code is run, sklearn will fit a decision tree to the dataset using an
optimized version of the Classification And Regression Trees (CART) algorithm.
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(df, y)
One can also import DecisionTreeRegressor from sklearn.tree to use a decision tree for
predicting a numerical target variable.
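For example, here is a minimal regression sketch; using the diabetes dataset bundled with
scikit-learn as the numerical target is an assumption of this example, not part of the
original article.

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y_num = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_num)

# Each leaf predicts the mean target value of its training samples.
print(reg.predict(X[:3]))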
• Model: Random Forest Classifier

Here two versions are created: one where the maximum depth is limited to 3, and another
where the maximum depth is unlimited. A single decision tree would work for this as well,
but let's use a random forest for modeling.
from sklearn.ensemble import RandomForestClassifier

# Limit max depth
model = RandomForestClassifier(max_depth=3, n_estimators=10)
# Train
model.fit(iris.data, iris.target)
# Extract single tree
estimator_limited = model.estimators_[5]
estimator_limited

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1538259045, splitter='best')

# No max depth
model = RandomForestClassifier(max_depth=None, n_estimators=10)
model.fit(iris.data, iris.target)
estimator_nonlimited = model.estimators_[5]
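As a quick sanity check (not in the original listing), the two extracted trees can be
compared; get_depth and get_n_leaves are available on fitted trees in recent scikit-learn
versions.

print("limited tree depth:  ", estimator_limited.get_depth())     # capped at 3
print("unlimited tree depth:", estimator_nonlimited.get_depth())  # typically deeper
print("unlimited leaf count:", estimator_nonlimited.get_n_leaves())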
• Creation of visualization

Now that the decision tree has been created, let's use the pydotplus package to visualize
it. Both pydotplus and graphviz are needed; these can be installed with a package manager.
Graphviz is a tool for drawing graphs described in dot files, and pydotplus is a Python
interface to Graphviz's Dot language. Here is the code:
from io import StringIO  # sklearn.externals.six was removed in recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Export the fitted tree to Graphviz dot format in an in-memory buffer.
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)

# Render the dot description to a PNG and display it inline.
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
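The same export can be applied to one of the random-forest trees extracted earlier. The
sketch below assumes the imports and variables from the previous code blocks; passing
feature and class names makes the nodes easier to read, and pydotplus can also write the
rendering to disk.

dot_data = StringIO()
export_graphviz(estimator_limited, out_file=dot_data,
                feature_names=iris.feature_names,
                class_names=iris.target_names,
                filled=True, rounded=True, special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("estimator_limited.png")  # save the image alongside the notebook
Image(graph.create_png())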
The ‘value’ row in each node shows how many of the observations sorted into that node fall
into each of the classes. The feature X2, which is the petal length, was able to completely
separate one species of flower (Iris setosa) from the rest.
Conclusion
Visualizing a single decision tree helps provide an idea of how an entire random forest
makes predictions: it is not random, but rather an ordered, logical sequence of steps.
Plots created with these libraries are much easier to understand for people who do not
work with ML on a daily basis, and they help convey the model’s logic to stakeholders.
Industrial data science is about building a smarter company infrastructure; it should
blur, or at least thin, the line between operational and digital processes. Big data
analytics provides innovative opportunities to establish efficient processes, reduce cost
and risk, improve safety measures, maintain regulatory compliance, and support better
decision-making.