19.4.2016 MachineLearning - Databricks
file:///Users/lhaferkamp/Downloads/MachineLearning.html 1/6
Machine Learning with Spark
Scikit Learn Cheat Sheet
Load basic dependencies
inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv
ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/
import java.util.Base64
import java.nio.charset.StandardCharsets
encB64: (str: String)String
decB64: (str: String)String
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.FileStatus
listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus]
ls3: (s3FolderPath: String)Unit
rm3: (s3Path: String)Boolean
s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB]
Read taxi data as dataframe from parquet
%run "/meetup/kickoff/connect_s3"
// read Parquet files
val parquetTable = sqlContext.read.parquet(ouputParquetDir)
val toDouble = udf[Double, Float]( _.toDouble)
val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount")))
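As a side note, the same conversion can be expressed without a UDF by using Spark's built-in `Column.cast`, which avoids a round trip through Scala for every row. A minimal sketch, assuming the `parquetTable` DataFrame from the cell above:

```scala
// Equivalent to the toDouble udf above: cast the Float column to Double natively.
val taxiDataCast = parquetTable
  .withColumn("tip_amount_d", parquetTable.col("tip_amount").cast("double"))
```
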
Showing the first 1000 rows (first columns shown):

medallion                        hack_license                     vendor_id rate_code store_and_fwd_flag pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT 1 N 2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT 1 N 2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT 1 N 2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT 1 N 2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT 1 N 2013-01-07T14:46:55.000+0000
Scatter plot for tip amount and fare amount
[Scatter plot of tip_amount (y) vs. fare_amount (x). Showing sample based on the first 1000 rows.]
Transformation of data with standard dataframe operations
The pipeline concept of Spark ML
taxiData.registerTempTable("ml_nyc_taxi")
%sql SELECT * FROM ml_nyc_taxi
%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData = taxiData
  .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
  .withColumn("label", toDouble(taxiData.col("tip_amount")))
  .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))
A Pipeline chains Transformers and Estimators.
A Transformer can also be the model produced by a previously trained Estimator.
This makes it easy to:
repeat training with different model parameters, e.g. for cross-validation
train with different test and training data (train-validation split)
repeat the transformation steps before estimation
Watch out for KeystoneML (http://keystone-ml.org), an ML pipeline framework on Spark with a richer set of operators.
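The cross-validation point above can be sketched with Spark ML's CrossValidator, which refits the whole pipeline for every parameter combination in a grid. A hedged sketch, assuming the taxiDataSelector and trainingDataAssembler transformers and the trainingTaxiData split defined elsewhere in this notebook:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

val lr = new LinearRegression()
val pipeline = new Pipeline()
  .setStages(Array(taxiDataSelector, trainingDataAssembler, lr))

// Grid of parameter combinations; the pipeline is re-run for each one.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.3))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.8))
  .build()

// 3-fold cross-validation, scored by RMSE on the held-out fold.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingTaxiData)  // best parameter combination wins
```

Because every candidate re-runs the transformation stages, wrapping them in the pipeline (rather than pre-transforming once by hand) keeps the evaluation honest: no information from the held-out fold leaks into the fitted transformers.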
SQL transformer:
Select and filter the relevant data
VectorAssembler:
Transform the data into labeled data as needed for ML estimators
+------------------+----------+
| label| features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
| 1.0| [1.0,5.0]|
| 1.25| [1.0,4.5]|
| 3.0|[6.0,26.0]|
| 1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
| 2.0|[1.0,22.0]|
| 6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
| 7.5|[1.0,24.5]|
Initialize the estimator
import org.apache.spark.ml.feature.SQLTransformer
val taxiDataSelector = new SQLTransformer().setStatement(
"SELECT tip_amount_d as label, passenger_count, fare_amount FROM ml_nyc_taxi WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
val trainingDataAssembler = new VectorAssembler()
.setInputCols(Array("passenger_count", "fare_amount"))
.setOutputCol("features")
val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()
LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
import org.apache.spark.ml.regression.LinearRegression
linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd
Split the data into training and test sets
Set up the transformation and estimation pipeline
Use the pipeline to train the model
Predict with the trained model on the test data
[Scatter plot of label (y) vs. prediction (x). Showing sample based on the first 1000 rows.]
How to get started with Spark ML
Set up your laptop (16+ GB RAM recommended)
import org.apache.spark.ml.regression.LinearRegression
// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")
val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)
import org.apache.spark.ml.{Pipeline, PipelineModel}
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))
// Learn a LinearRegression model via the full pipeline.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)
display(lrModel.transform(testTaxiData)
.select("label", "prediction"))
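To put a number on the fit beyond the scatter plot, a RegressionEvaluator can score the predictions, e.g. with RMSE. A minimal sketch reusing lrModel and testTaxiData from the cells above:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

val predictions = lrModel.transform(testTaxiData)

// Compare the "prediction" column against "label" using root mean squared error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

val rmse = evaluator.evaluate(predictions)
println(s"RMSE on test data: $rmse")
```

Other supported metrics include "mse", "r2", and "mae"; switching is a one-line change via setMetricName.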
mac$ brew install apache-spark
or get a Databricks Community Edition notebook (wait list)
Get data
Join an ML competition and get big datasets from Kaggle
Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
Visualize the data (Databricks or Zeppelin notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!
Have a coffee
and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
Read the Kaggle competition forums and blogs
Graphs from the Panama Papers

Machine Learning: Spark & Hadoop User Group Munich Meetup 2016
