Email Classifier using Spark 1.3 MLlib / ML Pipeline
leoricklin@gmail.com

Inspired by
201503 Email Classifier using Mahout on Hadoop
● Dataset from Apache SpamAssassin
o One file per email, with mail headers and HTML tags
o #spam = 501, #ham = 2501
● Output of Confusion Matrix
  Actual \ Predicted    spam       ham
  spam                  69 (TP)    1 (FN)      Recall = 98.5714%
  ham                   1 (FP)     382 (TN)
  Precision = 98.5714%, Accuracy = 99.5585%
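
As a check on the figures above, the three metrics can be recomputed from the four counts. This is a minimal sketch in plain Scala (no Spark needed); the `Confusion` case class and the `pct` helper are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper type; the names are illustrative, not from the slides.
final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
  def precision: Double = tp.toDouble / (tp + fp)
  def recall: Double    = tp.toDouble / (tp + fn)
  def accuracy: Double  = (tp + tn).toDouble / (tp + fp + tn + fn)
}

// Counts reported for the Mahout run above.
val mahout = Confusion(tp = 69, fp = 1, tn = 382, fn = 1)

// Format a ratio as a percentage with four decimals.
def pct(x: Double): String = f"${x * 100}%.4f%%"

println(s"Precision = ${pct(mahout.precision)}")
println(s"Recall    = ${pct(mahout.recall)}")
println(s"Accuracy  = ${pct(mahout.accuracy)}")
```

The same helper can be applied to the MLlib and ML Pipeline counts reported later in the deck.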

Spam Sample
From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
Return-Path: <12a1mailbot1@web.de>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32
for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=ype>
<META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>
<BODY>
<TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=
<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff=
0000
face=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">
<CENTER>Why Spend More Than You Have To?
<CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=
SIZE=3D"10">
<CENTER>Life Quote Savings
The sample shows three parts: email headers, HTML tags, and the email body.

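Using Spark MLlib (1)
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
// Featurization: hash each token into one of 100 term-frequency buckets.
val tf = new HashingTF(numFeatures = 100)
// wholeTextFiles yields one (path, content) pair per email file.
val spam = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam")
val ham = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham")
// Tokenization: split on single spaces; label spam as 1.0 and ham as 0.0.
val spamTrain = spam.map{ case (file, text) => tf.transform(text.split(" "))
}.map(features => LabeledPoint(1.0, features))
val hamTrain = ham.map{ case (file, text) => tf.transform(text.split(" "))
}.map(features => LabeledPoint(0.0, features))
val sampleData = spamTrain ++ hamTrain
sampleData.cache()
// Note: two separate sample() calls do not guarantee disjoint sets;
// sampleData.randomSplit(Array(0.85, 0.15), 707L) would.
val trainData = sampleData.sample(false, 0.85, 707L)
val testData = sampleData.sample(false, 0.15, 707L)
// Training: logistic regression with SGD, default optimizer settings.
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainData)
#samples = 3002
#trainData = 2549 (#spam = 431, #ham = 2118)
#testData = 431 (#spam = 73, #ham = 358)
Baseline accuracy = 83.0626% ( (0+358)/431 ), i.e. always predicting ham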
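
The featurization step relies on the hashing trick: each token is mapped to a bucket index by its hash code, and the bucket holds the token's count. The sketch below illustrates that idea in plain Scala on one line of the spam sample. `nonNegativeMod` and `hashingTf` are hypothetical helpers written for this illustration; they are not Spark's implementation, and the indices they produce need not match what `HashingTF` produces.

```scala
// Map any Int (hash codes can be negative) into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Sparse term-frequency vector: bucket index -> number of tokens hashed to it.
// Distinct tokens may collide in one bucket, which is why numFeatures matters.
def hashingTf(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
  tokens
    .groupBy(token => nonNegativeMod(token.hashCode, numFeatures))
    .map { case (index, hits) => index -> hits.size.toDouble }

// Same tokenization as the slide: split on single spaces.
val tokens = "Save up to 70% on Life Insurance.".split(" ").toSeq
val features = hashingTf(tokens, numFeatures = 100)

println(features.toSeq.sortBy(_._1))
```

With only 100 buckets, collisions between unrelated tokens are likely on real emails, which is one plausible reason to tune `numFeatures` later in the deck.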

Using Spark MLlib (2)
// Validation: pair each test label with the model's prediction.
val validation = testData.map{ lpoint => (lpoint.label, model.predict(lpoint.features)) }
// Turn each pair into a one-hot count in the order (TP, FP, TN, FN), then sum.
val matrix = validation.map{
  case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
  case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
  case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
  case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
}.reduce{
  (ary1, ary2) => Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}
matrix: Array[Int] = Array(37, 11, 347, 36) // TP, FP, TN, FN
Accuracy = 89.0951% ( (37+347)/431 ), vs. 99.5585% using Mahout
Precision = 77.0833% ( 37/(37+11) ), vs. 98.5714% using Mahout
Recall = 50.6849% ( 37/(37+36) ), vs. 98.5714% using Mahout
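
The tally above can also be sketched on a local collection with a single fold, which makes the (TP, FP, TN, FN) ordering explicit and adds a catch-all case for unexpected values. The `pairs` data below is made up purely for illustration; it is not taken from the experiment.

```scala
// (label, prediction) pairs; made-up values for illustration only.
val pairs = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))

// Counts in the same order as the slide: TP, FP, TN, FN.
val counts = pairs.foldLeft(Vector(0, 0, 0, 0)) { (acc, pair) =>
  val slot = pair match {
    case (1.0, 1.0) => 0 // TP
    case (0.0, 1.0) => 1 // FP
    case (0.0, 0.0) => 2 // TN
    case (1.0, 0.0) => 3 // FN
    case other      => sys.error(s"unexpected (label, prediction): $other")
  }
  acc.updated(slot, acc(slot) + 1)
}

println(counts)
```

On an RDD the map/reduce form from the slide is the natural fit; the fold here is only meant to show the bookkeeping.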

Model Parameters
class org.apache.spark.mllib.feature.HashingTF
● val numFeatures: Int
number of features (default: 2^20)
class LogisticRegressionWithSGD
● val optimizer: GradientDescent
The optimizer to solve the problem.
class GradientDescent
● def setNumIterations(iters: Int): GradientDescent.this.type
Set the number of iterations for SGD.
● def setRegParam(regParam: Double): GradientDescent.this.type
Set the regularization parameter.
How do we find the best combination of these parameters?

ML Pipeline Concepts
● Transformer
o A feature transformer might take a dataset, read a column (e.g., text), convert it into a new column (e.g., feature vectors)
o A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, append the labels as a new column.
● Estimator
o An Estimator abstracts the concept of a learning algorithm or any algorithm which fits or trains on data.
● Pipeline
o A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input dataset is modified as it passes through each stage.
Source: Spark 1.3.0 ML Programming Guide
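
To make the three concepts concrete, here is a toy model of them in plain Scala. Everything in it (`Stage`, `Transformer`, `Estimator`, `WhitespaceTokenizer`, `MajorityLabel`, `fitPipeline`, and the `Row`/`Dataset` aliases) is a simplified stand-in invented for this sketch and is not the `org.apache.spark.ml` API; it only mirrors the described behavior on an in-memory list of rows.

```scala
// Toy stand-ins for the concepts above; NOT the org.apache.spark.ml API.
type Row = Map[String, Any]
type Dataset = Seq[Row]

sealed trait Stage
trait Transformer extends Stage { def transform(data: Dataset): Dataset }
trait Estimator extends Stage { def fit(data: Dataset): Transformer }

// A feature transformer: reads the "text" column, appends a "words" column.
object WhitespaceTokenizer extends Transformer {
  def transform(data: Dataset): Dataset =
    data.map(row => row.updated("words", row("text").toString.split("\\s+").toSeq))
}

// An estimator: learns the most frequent label and returns a model
// (itself a Transformer) that appends it as the "prediction" column.
object MajorityLabel extends Estimator {
  def fit(data: Dataset): Transformer = {
    val majority = data.groupBy(row => row("label")).maxBy(_._2.size)._1
    new Transformer {
      def transform(d: Dataset): Dataset = d.map(_.updated("prediction", majority))
    }
  }
}

// Fitting runs the stages in order: transformers are applied as-is,
// estimators are fit on the data so far and replaced by their model.
def fitPipeline(stages: Seq[Stage], train: Dataset): Seq[Transformer] = {
  val (_, fitted) = stages.foldLeft((train, Vector.empty[Transformer])) {
    case ((data, done), t: Transformer) => (t.transform(data), done :+ t)
    case ((data, done), e: Estimator) =>
      val model = e.fit(data)
      (model.transform(data), done :+ model)
  }
  fitted
}

val train: Dataset = Seq(
  Map("text" -> "cheap life insurance", "label" -> 1.0),
  Map("text" -> "meeting at noon", "label" -> 0.0),
  Map("text" -> "lunch tomorrow", "label" -> 0.0)
)

val fitted = fitPipeline(Seq(WhitespaceTokenizer, MajorityLabel), train)
// Applying the fitted pipeline: every stage is now a Transformer.
val scored = fitted.foldLeft(train)((data, stage) => stage.transform(data))
scored.foreach(println)
```

The point of the design is that a fitted pipeline contains only transformers, so the same sequence of steps is replayed identically on training and test data.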

Example ML Workflow
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib

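Using ML Pipeline (1)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import sqlContext.implicits._ // enables toDF() on RDDs of case classes
case class Email(text:String)
case class EmailLabeled(text:String, label:Double)
// One labeled row per email file: spam = 1.0, ham = 0.0.
val spamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam").map {
  case (file, content) => EmailLabeled(content, 1.0) }
val hamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham").map {
  case (file, content) => EmailLabeled(content, 0.0) }
val sampleSet = (spamTrain ++ hamTrain).toDF()
sampleSet.cache()
// Note: two separate sample() calls do not guarantee disjoint sets.
val trainSet = sampleSet.sample(false, 0.85, 100L)
val testSet = sampleSet.sample(false, 0.15, 100L)
// Pipeline stages: tokenizer -> hashing term frequency -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages( Array(tokenizer, hashingTF, lr) )
val crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
// Parameter grid searched by the cross-validator: 3 x 2 x 4 combinations.
val paramGrid = new ParamGridBuilder(
).addGrid( hashingTF.numFeatures, Array(10, 100, 1000)
).addGrid( lr.regParam, Array(0.1, 0.01)
).addGrid( lr.maxIter, Array(10, 20, 30, 50)
).build()
#samples = 3002
#trainData = 2528 (#spam = 421, #ham = 2107)
#testData = 437 (#spam = 84, #ham = 353)
Baseline accuracy = 80.7780% ( (0+353)/437 ), i.e. always predicting ham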
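
To get a feel for the size of the search, the grid above can be enumerated by hand. The sketch below is plain Scala and only mirrors the Cartesian product that a parameter grid represents; the three value lists are copied from the slide, and `numFolds = 3` is the value passed to `setNumFolds` on the next slide.

```scala
// Values copied from the grid on the slide.
val numFeatures = Seq(10, 100, 1000)
val regParam    = Seq(0.1, 0.01)
val maxIter     = Seq(10, 20, 30, 50)

// Cartesian product: one candidate per (numFeatures, regParam, maxIter) triple.
val candidates = for {
  n <- numFeatures
  r <- regParam
  m <- maxIter
} yield (n, r, m)

val numFolds = 3
println(s"${candidates.size} candidates, " +
  s"${candidates.size * numFolds} pipeline fits during cross-validation")
```

Each added grid dimension multiplies the number of fits, so the grid is worth keeping small when every fit re-tokenizes and re-hashes the whole training set.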

Using ML Pipeline (2)
All in one: tokenization, featurization, model training, model validation, and prediction
import org.apache.spark.sql.Row
// Evaluate every parameter combination with 3-fold cross-validation.
crossval.setEstimatorParamMaps(paramGrid).setNumFolds(3)
val cvModel = crossval.fit(trainSet)
// The fitted model reruns all pipeline stages on the test set.
val validation = cvModel.transform(testSet)
// Turn each row into a one-hot count in the order (TP, FP, TN, FN), then sum.
val matrix = validation.select("label","prediction").map{
  case Row(label: Double, prediction: Double) => (label, prediction) match {
    case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
    case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
    case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
    case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
  }
}.reduce{
  (ary1, ary2) => Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}
matrix: Array[Int] = Array(84, 1, 352, 0) // TP, FP, TN, FN
Accuracy = 99.7712% ( (84+352)/437 ), vs. 99.5585% using Mahout
Precision = 98.8235% ( 84/(84+1) ), vs. 98.5714% using Mahout
Recall = 100% ( 84/(84+0) ), vs. 98.5714% using Mahout
cvModel.bestModel.fittingParamMap = {
  LogisticRegression-3cb51fc7-maxIter: 20,
  HashingTF-cb518e45-numFeatures: 1000,
  LogisticRegression-3cb51fc7-regParam: 0.1 }
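
The idea behind `setNumFolds(3)` can be sketched without Spark: the training data is divided into three parts, and each part serves once as the validation set while the other two are used for fitting. The `kFold` helper below is hypothetical and written only for this illustration; Spark's cross-validator uses its own splitting logic, which may assign rows to folds differently.

```scala
import scala.util.Random

// Hypothetical helper: returns one (training, validation) pair per fold.
def kFold[A](data: List[A], k: Int, seed: Long): Seq[(List[A], List[A])] = {
  // Shuffle once, then assign items to folds round-robin by position.
  val shuffled = new Random(seed).shuffle(data)
  val tagged = shuffled.zipWithIndex.map { case (item, i) => (item, i % k) }
  (0 until k).map { fold =>
    val (validation, training) = tagged.partition(_._2 == fold)
    (training.map(_._1), validation.map(_._1))
  }
}

// Nine stand-in items instead of emails, three folds as on the slide.
val folds = kFold((1 to 9).toList, k = 3, seed = 100L)
folds.foreach { case (training, validation) =>
  println(s"train=${training.sorted} validation=${validation.sorted}")
}
```

Every item lands in exactly one validation set, so each parameter combination is scored on all of the training data without ever being scored on rows it was fit on.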

Pipelines: Recap
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
