Email Classifier using Spark 1.3 MLlib / ML Pipeline
leoricklin@gmail.com
Inspired by
201503 Email Classifier using Mahout on Hadoop
● Dataset from Apache SpamAssassin
o One file per email, with mail headers and HTML tags
o #spam = 501, #ham = 2501
● Output of Confusion Matrix
Actual \ Predicted    spam       ham
spam                  69 (TP)    1 (FN)
ham                   1 (FP)     382 (TN)
Recall = 98.5714%, Precision = 98.5714%, Accuracy = 99.5585%
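The three percentages above follow directly from the four confusion-matrix counts. The sketch below shows that arithmetic in plain Scala; `ConfusionMatrix` and `MahoutResult` are hypothetical names introduced here for illustration and are not part of Mahout or Spark.

```scala
// Hypothetical helper (not a Mahout or Spark class): derives summary
// metrics from the four confusion-matrix counts.
final case class ConfusionMatrix(tp: Int, fp: Int, tn: Int, fn: Int) {
  private val total: Int = tp + fp + tn + fn

  // Share of all emails classified correctly.
  def accuracy: Double = (tp + tn).toDouble / total

  // Share of emails predicted as spam that really are spam.
  def precision: Double = tp.toDouble / (tp + fp)

  // Share of actual spam emails that were caught.
  def recall: Double = tp.toDouble / (tp + fn)
}

object MahoutResult {
  // Counts copied from the Mahout confusion matrix above.
  val matrix: ConfusionMatrix = ConfusionMatrix(tp = 69, fp = 1, tn = 382, fn = 1)
}
```

With these counts, accuracy is (69+382)/453 and both precision and recall are 69/70, which matches the reported figures.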
Spam Sample
From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
Return-Path: <12a1mailbot1@web.de>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32
for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=ype>
<META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>
<BODY><!-- Inserted by Calypso -->
<TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=
<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff=
0000
face=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">
<CENTER>Why Spend More Than You Have To?
<CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=
SIZE=3D"10">
<CENTER>Life Quote Savings
The sample contains three parts: email headers, HTML tags, and the email body.
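Using Spark MLlib (1)
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Featurization: hash each token into one of 100 term-frequency buckets
val tf = new HashingTF(numFeatures = 100)
val spam = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam")
val ham = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham")
// Tokenization: split each email on single spaces; label spam as 1.0 and ham as 0.0
val spamTrain = spam.map{ case (file, text) => tf.transform(text.split(" "))
}.map(features => LabeledPoint(1.0, features))
val hamTrain = ham.map{ case (file, text) => tf.transform(text.split(" "))
}.map(features => LabeledPoint(0.0, features))
val sampleData = spamTrain ++ hamTrain
sampleData.cache()
// Note: sample() is called twice, so the train and test sets are not guaranteed
// to be disjoint; randomSplit would give a disjoint split
val trainData = sampleData.sample(false, 0.85, 707L)
val testData = sampleData.sample(false, 0.15, 707L)
// Training
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainData)
#samples=3002
#trainData=2549 (#spam=431, #ham=2118)
#testData=431 (#spam=73, #ham=358)
Baseline accuracy = 83.0626% ( (0+358)/431 ), i.e. predicting ham for every test email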
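To make the featurization step concrete, here is a minimal plain-Scala sketch of the hashing trick that a hashed term-frequency vector relies on. It is an illustration only: `HashingSketch`, `bucket`, and `termFrequencies` are hypothetical names, and the sketch assumes `String.hashCode` as the hash function, which may differ from what Spark's `HashingTF` does internally.

```scala
object HashingSketch {
  // Map a token to a bucket index in [0, numFeatures).
  // Assumption: plain String.hashCode; the modulo result is shifted
  // so that negative hash codes still land in a valid bucket.
  def bucket(token: String, numFeatures: Int): Int = {
    val raw = token.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector: bucket index -> number of tokens hashed there.
  // Tokens are produced by splitting on single spaces, as in the slide code.
  def termFrequencies(text: String, numFeatures: Int): Map[Int, Double] =
    text.split(" ")
      .groupBy(token => bucket(token, numFeatures))
      .map { case (index, tokens) => index -> tokens.length.toDouble }
}
```

Because different tokens can share a bucket, a small `numFeatures` such as 100 merges many unrelated words (and, here, header fields and HTML tags) into the same feature.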
Using Spark MLlib (2)
// Validation: pair each test label with the model's prediction
val validation = testData.map{ lpoint => (lpoint.label, model.predict(lpoint.features)) }
// Encode each (label, prediction) pair as Array(TP, FP, TN, FN), then sum element-wise
val matrix = validation.map{
  case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
  case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
  case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
  case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
}.reduce{
  (ary1, ary2) => Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}
matrix: Array[Int] = Array(37, 11, 347, 36) // TP, FP, TN, FN
Accuracy = 89.0951% ( (37+347)/431 ), vs. 99.5585% using Mahout
Precision = 77.0833% ( 37/(37+11) ), vs. 98.5714% using Mahout
Recall = 50.6849% ( 37/(37+36) ), vs. 98.5714% using Mahout
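The positional `Array(TP, FP, TN, FN)` encoding is easy to misread. As an alternative sketch on a local collection rather than an RDD, the same tally can use named fields; `Counts` and `Tally` are hypothetical names introduced here, not Spark classes.

```scala
// Hypothetical named-field replacement for the positional Array(TP, FP, TN, FN).
final case class Counts(tp: Int = 0, fp: Int = 0, tn: Int = 0, fn: Int = 0) {
  // Element-wise sum of two tallies.
  def +(other: Counts): Counts =
    Counts(tp + other.tp, fp + other.fp, tn + other.tn, fn + other.fn)
}

object Tally {
  // Classify one (label, prediction) pair; 1.0 means spam, 0.0 means ham.
  def single(label: Double, prediction: Double): Counts =
    (label, prediction) match {
      case (1.0, 1.0) => Counts(tp = 1)
      case (0.0, 1.0) => Counts(fp = 1)
      case (0.0, 0.0) => Counts(tn = 1)
      case (1.0, 0.0) => Counts(fn = 1)
      case other      => throw new IllegalArgumentException(s"unexpected pair: $other")
    }

  // Sum the per-pair tallies, starting from an all-zero Counts.
  def tally(pairs: Seq[(Double, Double)]): Counts =
    pairs.map { case (label, prediction) => single(label, prediction) }
      .foldLeft(Counts())(_ + _)
}
```

Folding from an all-zero `Counts()` also returns a result for an empty input, whereas `reduce` on an empty RDD throws.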
Model Parameters
class org.apache.spark.mllib.feature.HashingTF
● val numFeatures: Int
number of features (default: 2^20)
class LogisticRegressionWithSGD
● val optimizer: GradientDescent
The optimizer to solve the problem.
class GradientDescent
● def setNumIterations(iters: Int): GradientDescent.this.type
Set the number of iterations for SGD.
● def setRegParam(regParam: Double): GradientDescent.this.type
Set the regularization parameter.
How to find the best combination of these parameters?
ML Pipeline Concepts
Transformer
A feature transformer might take a dataset, read a column (e.g., text), and convert it into a new column (e.g., feature vectors).
A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, and append the labels as a new column.
Estimator
An Estimator abstracts the concept of a learning algorithm or any algorithm which fits or trains on data.
Pipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input dataset is modified as it passes through each stage.
Spark 1.3.0 ML Programming Guide
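The three concepts can be illustrated with a toy model in plain Scala. This is a sketch under loud assumptions: the traits below only mirror the ideas, they are not Spark's API, and `Tokenize` and `MajorityLabel` are hypothetical stages invented for the example.

```scala
object ToyPipeline {
  // A dataset is a sequence of rows; a row maps column names to values.
  type Row = Map[String, Any]
  type Table = Seq[Row]

  sealed trait Stage
  // A Transformer maps a dataset to a dataset with extra columns.
  trait Transformer extends Stage { def transform(data: Table): Table }
  // An Estimator is fitted on a dataset and produces a Transformer (the model).
  trait Estimator extends Stage { def fit(data: Table): Transformer }

  // Transformer: reads the "text" column and appends a "words" column.
  object Tokenize extends Transformer {
    def transform(data: Table): Table =
      data.map(row => row + ("words" -> row("text").toString.toLowerCase.split("\\s+").toSeq))
  }

  // Estimator: learns the most frequent label; the fitted model appends
  // that label as the "prediction" column.
  object MajorityLabel extends Estimator {
    def fit(data: Table): Transformer = {
      val majority = data.groupBy(_("label")).maxBy(_._2.size)._1
      new Transformer {
        def transform(d: Table): Table = d.map(_ + ("prediction" -> majority))
      }
    }
  }

  // A Pipeline is itself an Estimator: fitting it runs the stages in order,
  // fitting each Estimator on the data as transformed by the earlier stages.
  final case class Pipeline(stages: Seq[Stage]) extends Estimator {
    def fit(data: Table): Transformer = {
      val start = (Vector.empty[Transformer], data)
      val (fitted, _) = stages.foldLeft(start) { case ((done, current), stage) =>
        val transformer = stage match {
          case e: Estimator   => e.fit(current)
          case t: Transformer => t
        }
        (done :+ transformer, transformer.transform(current))
      }
      // The fitted pipeline applies every fitted stage in the same order.
      new Transformer {
        def transform(d: Table): Table = fitted.foldLeft(d)((acc, t) => t.transform(acc))
      }
    }
  }
}
```

Fitting `Pipeline(Seq(Tokenize, MajorityLabel))` on labeled rows returns a single Transformer that tokenizes and predicts in one call, which is the property the real Pipeline API builds on.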
Example ML Workflow
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
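Using ML Pipeline (1)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Enables .toDF(); assumes a SQLContext named sqlContext, as in spark-shell
import sqlContext.implicits._

case class Email(text:String)
case class EmailLabeled(text:String, label:Double)
// Label spam as 1.0 and ham as 0.0
val spamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam").map {
  case (file, content) => EmailLabeled(content, 1.0) }
val hamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham").map {
  case (file, content) => EmailLabeled(content, 0.0) }
val sampleSet = (spamTrain ++ hamTrain).toDF()
sampleSet.cache()
// Note: sample() is called twice, so the train and test sets are not guaranteed to be disjoint
val trainSet = sampleSet.sample(false, 0.85, 100L)
val testSet = sampleSet.sample(false, 0.15, 100L)
// Pipeline stages: tokenizer -> hashed term frequencies -> logistic regression
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages( Array(tokenizer, hashingTF, lr) )
val crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
// Parameter grid to search: 3 x 2 x 4 = 24 combinations
val paramGrid = new ParamGridBuilder()
  .addGrid( hashingTF.numFeatures, Array(10, 100, 1000) )
  .addGrid( lr.regParam, Array(0.1, 0.01) )
  .addGrid( lr.maxIter, Array(10, 20, 30, 50) )
  .build()
#samples=3002
#trainData=2528 (#spam=421, #ham=2107)
#testData=437 (#spam=84, #ham=353)
Baseline accuracy = 80.7780% ( (0+353)/437 ), i.e. predicting ham for every test email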
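The size of the search space follows from the grid: every value of one parameter is combined with every value of the others. The plain-Scala sketch below enumerates those combinations; `GridSketch`, `Candidate`, and `totalFits` are hypothetical names used only for illustration and do not reproduce `ParamGridBuilder` itself.

```scala
object GridSketch {
  // One candidate setting; field names follow the parameters in the grid above.
  final case class Candidate(numFeatures: Int, regParam: Double, maxIter: Int)

  // Value lists copied from the parameter grid above.
  val numFeaturesGrid: Seq[Int] = Seq(10, 100, 1000)
  val regParamGrid: Seq[Double] = Seq(0.1, 0.01)
  val maxIterGrid: Seq[Int]     = Seq(10, 20, 30, 50)

  // Cartesian product of the three value lists.
  val candidates: Seq[Candidate] =
    for {
      numFeatures <- numFeaturesGrid
      regParam    <- regParamGrid
      maxIter     <- maxIterGrid
    } yield Candidate(numFeatures, regParam, maxIter)

  // With k-fold cross-validation every candidate is fitted once per fold
  // (not counting any final refit of the winning candidate).
  def totalFits(numFolds: Int): Int = candidates.size * numFolds
}
```

For this grid `candidates.size` is 3 x 2 x 4 = 24, so a 3-fold search fits the pipeline 72 times, which is worth keeping in mind before widening the grid.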
Using ML Pipeline (2)
All in one: tokenization, featurization, model training, model validation, and prediction
import org.apache.spark.sql.Row

crossval.setEstimatorParamMaps(paramGrid).setNumFolds(3)
// Fit every parameter combination with 3-fold cross-validation and keep the best model
val cvModel = crossval.fit(trainSet)
val validation = cvModel.transform(testSet)
// Encode each (label, prediction) pair as Array(TP, FP, TN, FN), then sum element-wise
val matrix = validation.select("label","prediction").map{
  case Row(label: Double, prediction: Double) => (label, prediction) match {
    case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
    case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
    case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
    case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
  }
}.reduce{
  (ary1, ary2) => Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}
matrix: Array[Int] = Array(84, 1, 352, 0) // TP, FP, TN, FN
Accuracy = 99.7712% ( (84+352)/437 ), vs. 99.5585% using Mahout
Precision = 98.8235% ( 84/(84+1) ), vs. 98.5714% using Mahout
Recall = 100% ( 84/(84+0) ), vs. 98.5714% using Mahout
cvModel.bestModel.fittingParamMap = {
LogisticRegression-3cb51fc7-maxIter: 20,
HashingTF-cb518e45-numFeatures: 1000,
LogisticRegression-3cb51fc7-regParam: 0.1 }
Pipelines: Recap
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
