Multiclassification
with Decision Tree
in Spark MLlib 1.3
References
● 201504 Advanced Analytics with Spark
● 201409 (National Tsing Hua University) Data Mining and Big Data Analytics
● Apache Spark MLlib API
● UCI Machine Learning Repository
A simple Decision Tree
Figure 4-1. Decision tree: Is it spoiled?
LabeledPoint
The Spark MLlib abstraction for a labeled feature vector is the
LabeledPoint, which consists of a Spark MLlib Vector of features
and a target value, called the label.
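For example, a minimal sketch of building one by hand (the feature values are arbitrary):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// label 1.0, three numeric features
val point = LabeledPoint(1.0, Vectors.dense(2596.0, 51.0, 3.0))
point.label     // 1.0
point.features  // [2596.0,51.0,3.0]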
1-of-n Encoding
LabeledPoint can be used with categorical features, with
appropriate encoding.
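As a sketch, a hypothetical helper (not part of MLlib) that produces the 1-of-n encoding of a category index:

// Turn category `index` out of `arity` possible values into a
// 1-of-n (one-hot) vector, e.g. category 2 of 4 => (0,0,1,0)
def oneHot(index: Int, arity: Int): Array[Double] = {
  val v = Array.fill(arity)(0.0)
  v(index) = 1.0
  v
}
oneHot(2, 4)  // Array(0.0, 0.0, 1.0, 0.0)

The Covtype dataset below ships its categorical features already encoded this way.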
Covtype Dataset
The data set records the types of forest covering parcels of
land in Colorado, USA.
Name Data Type Measurement
Elevation quantitative meters
Aspect quantitative azimuth
Slope quantitative degrees
Horizontal_Distance_To_Hydrology quantitative meters
Vertical_Distance_To_Hydrology quantitative meters
Horizontal_Distance_To_Roadways quantitative meters
Hillshade_9am quantitative 0 to 255 index
Hillshade_Noon quantitative 0 to 255 index
Hillshade_3pm quantitative 0 to 255 index
Horizontal_Distance_To_Fire_Points quantitative meters
Wilderness_Area (4 binary columns) qualitative 0 or 1
Soil_Type (40 binary columns) qualitative 0 or 1
Cover_Type (7 types) integer 1 to 7
2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
A First Decision Tree (1)
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")
val data = rawData.map { line =>
  val values: Array[Double] = line.split(',').map(_.toDouble)
  // All columns but the last are features; note this is an MLlib Vector
  val featureVector: Vector = Vectors.dense(values.init)
  // DecisionTree needs labels starting at 0, so shift 1..7 down to 0..6
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
// 80% train / 10% cross-validation / 10% test
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache(); cvData.cache(); testData.cache()
Note: for classification, labels should take values
{0, 1, ..., numClasses-1}.
A First Decision Tree (2)
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree

// numClasses = 7, no categorical features, "gini" impurity,
// maxDepth = 4, maxBins = 100
val model = DecisionTree.trainClassifier(
  trainData, 7, Map[Int,Int](), "gini", 4, 100)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
A First Decision Tree (3)
println(model.toDebugString)
metrics.precision = 0.6996101063190258
metrics.confusionMatrix
DecisionTreeModel classifier of depth 4 with 31 nodes
  If (feature 0 <= 3046.0)
    If (feature 0 <= 2497.0)
      If (feature 3 <= 0.0)
        If (feature 12 <= 0.0)
          Predict: 3.0
        Else (feature 12 > 0.0)
          Predict: 5.0
Actual \ Predicted   cat0      cat1      cat2      cat3     cat4   cat5   cat6
cat0                 14248.0   6615.0    5.0       0.0      0.0    1.0    422.0
cat1                 5556.0    22440.0   355.0     19.0     0.0    4.0    41.0
cat2                 0.0       452.0     3050.0    74.0     0.0    14.0   0.0
cat3                 0.0       0.0       163.0     109.0    0.0    0.0    0.0
cat4                 0.0       885.0     40.0      1.0      0.0    0.0    0.0
cat5                 0.0       564.0     1091.0    37.0     0.0    53.0   0.0
cat6                 1078.0    24.0      0.0       0.0      0.0    0.0    883.0
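The overall precision above averages across all classes; a sketch of asking MulticlassMetrics for per-class precision and recall instead:

// `labels` holds the distinct class labels 0.0 through 6.0
metrics.labels.foreach { cat =>
  println(s"cat$cat: precision=${metrics.precision(cat)}" +
          s" recall=${metrics.recall(cat)}")
}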
Tuning Decision Trees (1)
val evaluations: Array[((String, Int, Int), Double)] =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 20);
       bins     <- Array(10, 300))
  yield {
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map[Int,Int](), impurity, depth, bins)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
    ((impurity, depth, bins), accuracy)
  }
Tuning Decision Trees (2)
// Sort by accuracy, descending, and print
evaluations.sortBy {
  case ((impurity, depth, bins), accuracy) => accuracy
}.reverse.foreach(println)
...
((entropy,20,300),0.9119046392195256)
((gini   ,20,300),0.9058758867075454)
((entropy,20,10 ),0.8968585218391989)
((gini   ,20,10 ),0.89050342659865)
((gini   ,1 ,10 ),0.6330018378248399)
((gini   ,1 ,300),0.6323319764346198)
((entropy,1 ,300),0.48406932206592124)
((entropy,1 ,10 ),0.48406932206592124)
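Rather than reading the best setting off the sorted printout, it can also be taken programmatically; a sketch (the val names are illustrative):

val ((bestImpurity, bestDepth, bestBins), bestAccuracy) =
  evaluations.maxBy { case (_, accuracy) => accuracy }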
Revising Categorical Features (1)
With one 40-valued categorical feature, the decision tree can base a
single decision on groups of categories, which may be more direct and
optimal. Conversely, representing one 40-valued categorical feature as
40 numeric features increases memory usage and slows training.
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  // Collapse the 4 one-hot wilderness columns (10-13) and the 40
  // one-hot soil columns (14-53) back into single categorical values
  val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
  val soil = values.slice(14, 54).indexOf(1.0).toDouble
  val featureVector =
    Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
..1000.. => 0
..0100.. => 1
..0010.. => 2
..0001.. => 3
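The decoding relies on indexOf locating the single column that is set to 1.0, for example:

Array(0.0, 1.0, 0.0, 0.0).indexOf(1.0)  // 1, the second wilderness type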
Revising Categorical Features (2)
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(10, 20, 30);
       bins     <- Array(40, 300))
  yield {
    // Features 10 and 11 are now categorical, with 4 and 40 values
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map(10 -> 4, 11 -> 40), impurity, depth, bins)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
    ((impurity, depth, bins), accuracy)
  }
evaluations.sortBy {
  case ((impurity, depth, bins), accuracy) => accuracy
}.reverse.foreach(println)
...
((entropy,30,300),0.9446513552658804)
((gini   ,30,300),0.9391509759293745)
((entropy,30,40 ),0.9389268225394855)
((gini   ,30,40 ),0.9355817642596042)
vs. tuned 1-of-n encoding DT: ((entropy,20,300),0.9119046392195256)
Map storing arity of categorical features. E.g., an entry (n ->
k) indicates that feature n is categorical with k categories
indexed from 0: {0, 1, ..., k-1}.
CV set vs. Test set
If the purpose of the CV set was to evaluate parameters fit to the
training set, then the purpose of the test set is to evaluate
hyperparameters that were “fit” to the CV set. That is, the test
set ensures an unbiased estimate of the accuracy of the final,
chosen model and its hyperparameters.
// Train on train + CV data with the best hyperparameters found,
// then evaluate once on the held-out test set
val model = DecisionTree.trainClassifier(
  trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
val predictionsAndLabels = testData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision = 0.9161946933031271
Random Decision Forests
It would be great to have not one tree but many, each producing
reasonable but different and independent estimates of the right target
value. Their collective average prediction should fall closer to the
true answer than any individual tree's. The randomness injected into
the building process creates this independence, and that is the key
to random decision forests.
import org.apache.spark.mllib.tree.RandomForest

// 20 trees, "auto" feature-subset strategy, "entropy" impurity,
// maxDepth = 30, maxBins = 300
val model = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision = 0.9630068932322555
vs. categorical-features DT: ((entropy,30,300),0.9446513552658804)
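A sketch of scoring one new example with the trained forest (the feature values are hypothetical; the last two are the collapsed wilderness and soil categories):

import org.apache.spark.mllib.linalg.Vectors

val input = Vectors.dense(
  2709.0, 125.0, 28.0, 67.0, 23.0, 3224.0,
  253.0, 207.0, 61.0, 6094.0, 0.0, 29.0)
model.predict(input) + 1  // shift label 0..6 back to Cover_Type 1..7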