Multiclassification
with Decision Tree
in Spark MLlib 1.3
References
● 201504 Advanced Analytics with Spark
● 201409 (National Tsing Hua University) Data Mining and Big Data Analytics
● Apache Spark MLlib API
● UCI Machine Learning Repository
A simple Decision Tree
Figure 4-1. Decision tree: Is it spoiled?
LabeledPoint
The Spark MLlib abstraction for a labeled feature vector is the
LabeledPoint, which consists of a Spark MLlib Vector of features
and a target value, called the label.
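For example, a minimal sketch of building one by hand (the feature values are arbitrary):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// label 1.0, three numeric features
val point = LabeledPoint(1.0, Vectors.dense(2596.0, 51.0, 3.0))
point.label     // 1.0
point.features  // [2596.0,51.0,3.0]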
1-of-n Encoding
LabeledPoint can be used with categorical features, with
appropriate encoding.
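As a sketch, a hypothetical helper (not part of MLlib) that produces the 1-of-n encoding of a category index:

// Turn category `index` out of `arity` possible values into a
// 1-of-n (one-hot) vector, e.g. category 2 of 4 => (0,0,1,0)
def oneHot(index: Int, arity: Int): Array[Double] = {
  val v = Array.fill(arity)(0.0)
  v(index) = 1.0
  v
}
oneHot(2, 4)  // Array(0.0, 0.0, 1.0, 0.0)

The Covtype dataset below ships its categorical features already encoded this way.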
Covtype Dataset
The data set records the types of forest covering parcels of
land in Colorado, USA.
Name Data Type Measurement
Elevation quantitative meters
Aspect quantitative azimuth
Slope quantitative degrees
Horizontal_Distance_To_Hydrology quantitative meters
Vertical_Distance_To_Hydrology quantitative meters
Horizontal_Distance_To_Roadways quantitative meters
Hillshade_9am quantitative 0 to 255 index
Hillshade_Noon quantitative 0 to 255 index
Hillshade_3pm quantitative 0 to 255 index
Horizontal_Distance_To_Fire_Points quantitative meters
Wilderness_Area (4 binary columns) qualitative 0 or 1
Soil_Type (40 binary columns) qualitative 0 or 1
Cover_Type (7 types) integer 1 to 7
2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
A First Decision Tree (1)
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")
val data = rawData.map { line =>
  val values: Array[Double] = line.split(',').map(_.toDouble)
  // All columns but the last are features; note this is an MLlib Vector
  val featureVector: Vector = Vectors.dense(values.init)
  // DecisionTree needs labels starting at 0, so shift 1..7 down to 0..6
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
// 80% train / 10% cross-validation / 10% test
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache(); cvData.cache(); testData.cache()
Note: for classification, labels should take values
{0, 1, ..., numClasses-1}.
A First Decision Tree (2)
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree

// numClasses = 7, no categorical features, "gini" impurity,
// maxDepth = 4, maxBins = 100
val model = DecisionTree.trainClassifier(
  trainData, 7, Map[Int,Int](), "gini", 4, 100)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
A First Decision Tree (3)
println(model.toDebugString)
metrics.precision = 0.6996101063190258
metrics.confusionMatrix
DecisionTreeModel classifier of depth 4 with 31 nodes
  If (feature 0 <= 3046.0)
    If (feature 0 <= 2497.0)
      If (feature 3 <= 0.0)
        If (feature 12 <= 0.0)
          Predict: 3.0
        Else (feature 12 > 0.0)
          Predict: 5.0
Actual \ Predicted   cat0      cat1      cat2      cat3     cat4   cat5   cat6
cat0                 14248.0   6615.0    5.0       0.0      0.0    1.0    422.0
cat1                 5556.0    22440.0   355.0     19.0     0.0    4.0    41.0
cat2                 0.0       452.0     3050.0    74.0     0.0    14.0   0.0
cat3                 0.0       0.0       163.0     109.0    0.0    0.0    0.0
cat4                 0.0       885.0     40.0      1.0      0.0    0.0    0.0
cat5                 0.0       564.0     1091.0    37.0     0.0    53.0   0.0
cat6                 1078.0    24.0      0.0       0.0      0.0    0.0    883.0
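The overall precision above averages across all classes; a sketch of asking MulticlassMetrics for per-class precision and recall instead:

// `labels` holds the distinct class labels 0.0 through 6.0
metrics.labels.foreach { cat =>
  println(s"cat$cat: precision=${metrics.precision(cat)}" +
          s" recall=${metrics.recall(cat)}")
}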
Tuning Decision Trees (1)
val evaluations: Array[((String, Int, Int), Double)] =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 20);
       bins     <- Array(10, 300))
  yield {
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map[Int,Int](), impurity, depth, bins)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
    ((impurity, depth, bins), accuracy)
  }
Tuning Decision Trees (2)
// Sort by accuracy, descending, and print
evaluations.sortBy {
  case ((impurity, depth, bins), accuracy) => accuracy
}.reverse.foreach(println)
...
((entropy,20,300),0.9119046392195256)
((gini   ,20,300),0.9058758867075454)
((entropy,20,10 ),0.8968585218391989)
((gini   ,20,10 ),0.89050342659865)
((gini   ,1 ,10 ),0.6330018378248399)
((gini   ,1 ,300),0.6323319764346198)
((entropy,1 ,300),0.48406932206592124)
((entropy,1 ,10 ),0.48406932206592124)
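Rather than reading the best setting off the sorted printout, it can also be taken programmatically; a sketch (the val names are illustrative):

val ((bestImpurity, bestDepth, bestBins), bestAccuracy) =
  evaluations.maxBy { case (_, accuracy) => accuracy }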
Revising Categorical Features (1)
With one 40-valued categorical feature, the decision tree can base a
single decision on groups of categories, which may be more direct and
optimal. Conversely, representing one 40-valued categorical feature as
40 numeric features increases memory usage and slows training.
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  // Collapse the 4 one-hot wilderness columns (10-13) and the 40
  // one-hot soil columns (14-53) back into single categorical values
  val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
  val soil = values.slice(14, 54).indexOf(1.0).toDouble
  val featureVector =
    Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
..1000.. => 0
..0100.. => 1
..0010.. => 2
..0001.. => 3
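The decoding relies on indexOf locating the single column that is set to 1.0, for example:

Array(0.0, 1.0, 0.0, 0.0).indexOf(1.0)  // 1, the second wilderness type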
Revising Categorical Features (2)
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(10, 20, 30);
       bins     <- Array(40, 300))
  yield {
    // Features 10 and 11 are now categorical, with 4 and 40 values
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map(10 -> 4, 11 -> 40), impurity, depth, bins)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
    ((impurity, depth, bins), accuracy)
  }
evaluations.sortBy {
  case ((impurity, depth, bins), accuracy) => accuracy
}.reverse.foreach(println)
...
((entropy,30,300),0.9446513552658804)
((gini   ,30,300),0.9391509759293745)
((entropy,30,40 ),0.9389268225394855)
((gini   ,30,40 ),0.9355817642596042)
vs. tuned 1-of-n encoding DT: ((entropy,20,300),0.9119046392195256)
Map storing arity of categorical features. E.g., an entry (n ->
k) indicates that feature n is categorical with k categories
indexed from 0: {0, 1, ..., k-1}.
CV set vs. Test set
If the purpose of the CV set was to evaluate parameters fit to the
training set, then the purpose of the test set is to evaluate
hyperparameters that were “fit” to the CV set. That is, the test
set ensures an unbiased estimate of the accuracy of the final,
chosen model and its hyperparameters.
// Train on train + CV data with the best hyperparameters found,
// then evaluate once on the held-out test set
val model = DecisionTree.trainClassifier(
  trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
val predictionsAndLabels = testData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision = 0.9161946933031271
Random Decision Forests
It would be great to have not one tree but many, each producing
reasonable but different and independent estimates of the right target
value. Their collective average prediction should fall closer to the
true answer than any individual tree's. The randomness injected into
the building process creates this independence, and that is the key
to random decision forests.
import org.apache.spark.mllib.tree.RandomForest

// 20 trees, "auto" feature-subset strategy, "entropy" impurity,
// maxDepth = 30, maxBins = 300
val model = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision = 0.9630068932322555
vs. categorical-features DT: ((entropy,30,300),0.9446513552658804)
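A sketch of scoring one new example with the trained forest (the feature values are hypothetical; the last two are the collapsed wilderness and soil categories):

import org.apache.spark.mllib.linalg.Vectors

val input = Vectors.dense(
  2709.0, 125.0, 28.0, 67.0, 23.0, 3224.0,
  253.0, 207.0, 61.0, 6094.0, 0.0, 29.0)
model.predict(input) + 1  // shift label 0..6 back to Cover_Type 1..7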