19.4.2016 MachineLearning - Databricks
file:///Users/lhaferkamp/Downloads/MachineLearning.html 1/6
Machine Learning with Spark
Scikit Learn Cheat Sheet
Load basic dependencies
inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv
ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/
import java.util.Base64
import java.nio.charset.StandardCharsets
encB64: (str: String)String
decB64: (str: String)String
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.FileStatus
listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus]
ls3: (s3FolderPath: String)Unit
rm3: (s3Path: String)Boolean
s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB]
Read taxi data as dataframe from parquet
%run "/meetup/kickoff/connect_s3"
// read Parquet files
val parquetTable = sqlContext.read.parquet(ouputParquetDir)
val toDouble = udf[Double, Float]( _.toDouble)
val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount")))
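As a side note, the same conversion can be expressed without a UDF by using Spark's built-in `Column.cast`, which avoids a round trip through Scala for every row. A minimal sketch, assuming the `parquetTable` DataFrame from the cell above:

```scala
// Equivalent to the toDouble udf above: cast the Float column to Double natively.
val taxiDataCast = parquetTable
  .withColumn("tip_amount_d", parquetTable.col("tip_amount").cast("double"))
```
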
Showing the first 1000 rows (first columns shown):

medallion                        hack_license                     vendor_id rate_code store_and_fwd_flag pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT 1 N 2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT 1 N 2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT 1 N 2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT 1 N 2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT 1 N 2013-01-07T14:46:55.000+0000
Scatter plot for tip amount and fare amount
[Scatter plot of tip_amount (y) vs. fare_amount (x). Showing sample based on the first 1000 rows.]
Transformation of data with standard dataframe operations
The pipeline concept of Spark ML
taxiData.registerTempTable("ml_nyc_taxi")
%sql SELECT * FROM ml_nyc_taxi
%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData = taxiData
  .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
  .withColumn("label", toDouble(taxiData.col("tip_amount")))
  .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))
A Pipeline chains Transformers and Estimators.
A Transformer can also be the model produced by a previously trained Estimator.
This makes it easy to:
repeat training with different model parameters, e.g. for cross-validation
train with different test and training data (train-validation split)
repeat the transformation steps before estimation
Watch out for KeystoneML (http://keystone-ml.org), an ML pipeline framework on Spark with a richer set of operators.
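The cross-validation point above can be sketched with Spark ML's CrossValidator, which refits the whole pipeline for every parameter combination in a grid. A hedged sketch, assuming the taxiDataSelector and trainingDataAssembler transformers and the trainingTaxiData split defined elsewhere in this notebook:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

val lr = new LinearRegression()
val pipeline = new Pipeline()
  .setStages(Array(taxiDataSelector, trainingDataAssembler, lr))

// Grid of parameter combinations; the pipeline is re-run for each one.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.3))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.8))
  .build()

// 3-fold cross-validation, scored by RMSE on the held-out fold.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingTaxiData)  // best parameter combination wins
```

Because every candidate re-runs the transformation stages, wrapping them in the pipeline (rather than pre-transforming once by hand) keeps the evaluation honest: no information from the held-out fold leaks into the fitted transformers.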
SQL transformer:
Select and filter the relevant data
VectorAssembler:
Transform the data into labeled data as needed for ML estimators
+------------------+----------+
| label| features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
| 1.0| [1.0,5.0]|
| 1.25| [1.0,4.5]|
| 3.0|[6.0,26.0]|
| 1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
| 2.0|[1.0,22.0]|
| 6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
| 7.5|[1.0,24.5]|
Initialize the estimator
import org.apache.spark.ml.feature.SQLTransformer
val taxiDataSelector = new SQLTransformer().setStatement(
"SELECT tip_amount_d as label, passenger_count, fare_amount FROM ml_nyc_taxi WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
val trainingDataAssembler = new VectorAssembler()
.setInputCols(Array("passenger_count", "fare_amount"))
.setOutputCol("features")
val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()
LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
import org.apache.spark.ml.regression.LinearRegression
linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd
Split the data into training and test sets
Set up the transformation and estimation pipeline
Use the pipeline to train the model
Predict with the trained model on the test data
[Scatter plot of label (y) vs. prediction (x). Showing sample based on the first 1000 rows.]
How to get started with Spark ML
Set up your laptop (16+ GB RAM recommended)
import org.apache.spark.ml.regression.LinearRegression
// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")
val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)
import org.apache.spark.ml.{Pipeline, PipelineModel}
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))
// Learn a LinearRegression model via the full pipeline.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)
display(lrModel.transform(testTaxiData)
.select("label", "prediction"))
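To put a number on the fit beyond the scatter plot, a RegressionEvaluator can score the predictions, e.g. with RMSE. A minimal sketch reusing lrModel and testTaxiData from the cells above:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

val predictions = lrModel.transform(testTaxiData)

// Compare the "prediction" column against "label" using root mean squared error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

val rmse = evaluator.evaluate(predictions)
println(s"RMSE on test data: $rmse")
```

Other supported metrics include "mse", "r2", and "mae"; switching is a one-line change via setMetricName.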
mac$ brew install apache-spark
or get a Databricks Community Edition notebook (wait list)
Get data
Join an ML competition and get big datasets from Kaggle
Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
Visualize the data (Databricks or Zeppelin notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!
Have a coffee
and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
Read the Kaggle competition forums and blogs
Graphs from the Panama Papers

Machine Learning: Spark & Hadoop User Group Munich Meetup 2016
