Machine Learning & Spark
Predict Prices of Houses
About me
Ran Silberman
Architect at Tikal Knowledge
Big Data Consultant
mailto:ran@tikalk.com
Predict house prices
Use Case:
Predict house prices in Tel Aviv based on parameters of houses
Use data from Kaggle
Hypotheses:
1. ML: not only for mathematicians!
2. ML + Big-Data = Spark!
Technology Stack
Spark Machine Learning library
Scala
Apache Spark - RDD
[Diagram: the blocks of an HDFS input file (Block-1 to Block-4) are loaded into the partitions of an RDD (Partition-1 to Partition-4). A map step produces a second RDD with the same four partitions, which is then written to the blocks of an HDFS output file.]
Spark MLlib
[Diagram: training data (Block-1 to Block-3) is loaded into an RDD (Partition-1 to Partition-3), and an algorithm function builds an ML model from it. Test data (Block-1 to Block-3) is loaded into a second RDD, and the model is used to predict on it.]
Spark MLlib API
RDD-based API - Package: spark.mllib
DataFrame-based API (the primary API since Spark 2.0) - Package: spark.ml
Approach
1. Explore the Data
2. Assess Algorithms
3. Put it all together
Explore the Data
Cond  Bsmt Cond  Year built  Bsmt area  Roof Style  Grnd Liv Area  Grg cars  Grg area  Year sold  Sale Price
7     Good       2003        856        Gable       1710           2         548       2008       208500
6     Good       1976        1262       Gable       1262           2         460       2007       181500
7     Exc        2001        920        Hip         1786           2         608       2008       223500
7     Fair       1915        756        Hip         1717           3         642       2006       140000
8     Typical    2000        1145       Gable       2198           3         836       2008       250000
8     No Bsmt    2004        1686       Flat        1694           (null)    636       2007       307000
Data Format
208500 1:7.00 2:5.00 3:2003.00 4:856.00 5:1710.00 6:2.00 7:548.00 8:856.00 9:2.00 10:8.00
181500 1:6.00 2:8.00 3:1976.00 4:1262.00 5:1262.00 6:2.00 7:460.00 8:1262.00 9:2.00 10:6.00
223500 1:7.00 2:5.00 3:2001.00 4:920.00 5:1786.00 6:2.00 7:608.00 8:920.00 9:2.00 10:6.00
Each line starts with the "Label" value, followed by the 1st feature, 2nd feature, 3rd feature ... 10th feature as index:value pairs.
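To make the format concrete, here is a minimal sketch in plain Scala (no Spark) of how one such line could be split into a label and its indexed features. The names `LibSvmLine`, `Record` and `parse` are made up for this illustration; in the deck itself the files are read by Spark's "libsvm" data source.

```scala
object LibSvmLine {
  // One parsed record: the label plus (1-based feature index -> value) pairs.
  final case class Record(label: Double, features: Map[Int, Double])

  // Parse one line such as "208500 1:7.00 2:5.00 3:2003.00".
  def parse(line: String): Record = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble              // first token is the label (sale price)
    val features = tokens.tail.map { token =>
      val Array(index, value) = token.split(":")  // each remaining token is index:value
      index.toInt -> value.toDouble
    }.toMap
    Record(label, features)
  }
}
```

For example, `LibSvmLine.parse("208500 1:7.00 2:5.00")` gives a record with label 208500.0 and two features.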
Roof Types
One-Hot Encoding
RoofStyle: ["Gable","Hip","Flat"]
[{"Gable",0},{"Hip",1},{"Flat",2}]
Line # 1: RoofStyle: "Gable" Vector: [1.0, 0, 0]
Line # 2: RoofStyle: "Hip" Vector: [0, 1.0, 0]
Line # 3: RoofStyle: "Flat" Vector: [0, 0, 1.0]
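The two steps above (index the strings, then turn each index into a vector) can be sketched in plain Scala as follows. `OneHot`, `index` and `encode` are hypothetical names for this illustration only. Spark's own StringIndexer and OneHotEncoder differ in some details, for example in how indices are ordered and in whether the last category gets its own slot.

```scala
object OneHot {
  // Assign each distinct category an index, in order of first appearance.
  def index(categories: Seq[String]): Map[String, Int] =
    categories.distinct.zipWithIndex.toMap

  // Build a vector with 1.0 at the category's index and 0.0 everywhere else.
  def encode(category: String, indexOf: Map[String, Int]): Vector[Double] =
    Vector.tabulate(indexOf.size)(i => if (i == indexOf(category)) 1.0 else 0.0)
}
```

With the three roof styles from the slide, "Hip" gets index 1 and is encoded as a vector whose second element is 1.0.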
Pipelines in Spark MLlib
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.ml.regression.LinearRegression
val indexer = new StringIndexer() // transform strings to a numeric index
  .setInputCol("roofType").setOutputCol("roofIndex")
val encoder = new OneHotEncoder() // encode index values to a one-hot vector
  .setInputCol("roofIndex").setOutputCol("features")
val lr = new LinearRegression() // Linear Regression algorithm
val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr)) // Pipeline puts it all together
val model = pipeline.fit(dataFrame) // train model
Linear Regression using Spark-MLlib
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate() // set Spark session
val training = spark.read.format("libsvm").load(trainingFile) // load data file
val lr = new LinearRegression() // Linear Regression class
val model = lr.fit(training) // train model
model.transform(test).show() // predict using model (test is assumed to be loaded like training)
println(s" ${model.summary.rootMeanSquaredError}") // print RMSE
Cost Function - Root Mean Squared Error
A way to measure how well our model can predict
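As a reminder of what the metric computes, here is a minimal plain-Scala sketch of RMSE: the square root of the mean of the squared differences between predicted and actual prices. The object and parameter names are illustrative assumptions; in the deck the value comes from Spark (model.summary or RegressionEvaluator).

```scala
object Rmse {
  // RMSE = sqrt( (1/n) * sum over i of (predicted_i - actual_i)^2 )
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.nonEmpty && predicted.size == actual.size,
      "inputs must be non-empty and of equal length")
    val squaredErrors = predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same unit as the price.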
Root Mean Square Error
Linear Regression - 10 Features 49972
Find Correlations
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val seriesX: RDD[Double] = sc.parallelize(grdLivArea) // X-axis
val seriesY: RDD[Double] = sc.parallelize(SalePrice) // Y-axis
val correlation: Double = Statistics.corr(seriesX, seriesY) // compute correlation using the Spark MLlib Statistics library
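For intuition, the Pearson correlation coefficient can be sketched in a few lines of plain Scala: the covariance of the two series divided by the product of their standard deviations. `Pearson.corr` is a hypothetical helper for illustration, not the Spark implementation.

```scala
object Pearson {
  // Pearson correlation: covariance(x, y) / (stddev(x) * stddev(y)).
  // The 1/n factors cancel, so plain sums of deviations are enough.
  def corr(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two series of equal length > 1")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val covariance = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varianceX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varianceY = ys.map(y => (y - meanY) * (y - meanY)).sum
    covariance / math.sqrt(varianceX * varianceY)
  }
}
```

The result lies between -1 and 1. Features whose correlation with the sale price is close to 1 or -1 are the better candidates for the training set.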
2. Find Best Correlated Features
Correlation Matrix - Best features to use
Bivariate Analysis of several Features
Linear Regression - Improve Training Data
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
val training = spark.read.format("libsvm").load(betterFeaturesFile) // feature set with the best correlation
val lr = new LinearRegression()
val model = lr.fit(training)
model.transform(test).show()
println(s" ${model.summary.rootMeanSquaredError}")
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression (Bivariate Analysis)
Gradient Descent of Cost Function (= RMSE) on one feature
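The idea of gradient descent on a single feature can be sketched in plain Scala as below. This is an illustration under stated assumptions, not Spark's solver: it fits price ≈ weight * x + bias with batch gradient descent and minimizes the mean squared error, which has the same minimum as RMSE. The names, the learning rate and the toy data in the usage note are invented.

```scala
object GradientDescent {
  // Fit y ≈ weight * x + bias by repeatedly stepping against the gradient of the MSE.
  def fit(xs: Seq[Double], ys: Seq[Double],
          learningRate: Double, iterations: Int): (Double, Double) = {
    require(xs.nonEmpty && xs.size == ys.size, "need two series of equal, non-zero length")
    val n = xs.size.toDouble
    var weight = 0.0
    var bias = 0.0
    for (_ <- 1 to iterations) {
      // Prediction error of the current line on every sample.
      val errors = xs.zip(ys).map { case (x, y) => weight * x + bias - y }
      // Partial derivatives of the MSE with respect to weight and bias.
      val gradWeight = 2.0 / n * errors.zip(xs).map { case (e, x) => e * x }.sum
      val gradBias = 2.0 / n * errors.sum
      weight -= learningRate * gradWeight
      bias -= learningRate * gradBias
    }
    (weight, bias)
  }
}
```

On toy data such as xs = 1, 2, 3, 4 and ys = 3, 5, 7, 9 with a learning rate of 0.05, the fit approaches weight 2 and bias 1. With real, unscaled features (living area in the thousands) the learning rate would have to be much smaller, or the features scaled first. More iterations let the descent get closer to the minimum.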
Linear Regression - Set more iterations
val lr = new LinearRegression().setMaxIter(50) // set higher number of iterations (was 10)
val model = lr.fit(training)
Output Window:
numIterations: 33
RMSE: 35583
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree Regressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.DecisionTreeRegressor
val data = spark.read.format("libsvm").load(logFile) // load data file
val Array(train, test) = data.randomSplit(Array(0.7, 0.3)) // split data to training and test
val dt = new DecisionTreeRegressor() // Decision Tree Regressor class
  .setMaxDepth(3)
val model = dt.fit(train) // train model
val predictions = model.transform(test) // predict using model
val rmse = new RegressionEvaluator().evaluate(predictions) // RMSE is the evaluator's default metric
println(s"RMSE = $rmse") // print RMSE
Decision Tree
[Diagram: a decision tree with three levels of yes/no splits. As far as the extracted labels show, level-1 splits on a > 7; level-2 on a > 6 and a > 8; level-3 on b > 1389, b > 1964, b > 1964 and c > 1996. The eight leaf predictions are $126,901, $162,206, $194,954, $257,919, $252,771, $320,193, $551,666 and $368,137.]
a: Overall Quality (1-10)
b: Grd. Living Area (ft.)
c: Year built
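A tree like this is just nested yes/no questions ending in a fixed price. The sketch below shows that structure in plain Scala. The types `Leaf` and `Split` and the example tree are hypothetical: the thresholds echo the slide (a > 7, b > 1964), but the shape and the leaf prices are invented for illustration and do not reproduce the slide's tree.

```scala
object TreeSketch {
  sealed trait Node
  // A leaf holds the price predicted for every house that reaches it.
  final case class Leaf(prediction: Double) extends Node
  // Go to `yes` when house(feature) > threshold, otherwise to `no`.
  final case class Split(feature: String, threshold: Double, yes: Node, no: Node) extends Node

  // Walk from the root to a leaf, answering one question per level.
  def predict(node: Node, house: Map[String, Double]): Double = node match {
    case Leaf(prediction) => prediction
    case Split(feature, threshold, yes, no) =>
      if (house(feature) > threshold) predict(yes, house) else predict(no, house)
  }

  // Hypothetical two-level tree; leaf prices are made up.
  val example: Node =
    Split("a", 7.0,
      yes = Split("b", 1964.0, yes = Leaf(300000.0), no = Leaf(250000.0)),
      no = Leaf(150000.0))
}
```

Whatever house is passed in, this example tree can only ever answer one of its three leaf prices, which is why a shallow tree predicts from a short, predefined list of values.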
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree 46643
Decision Tree - set Tree Depth
val dt = new DecisionTreeRegressor()
.setMaxDepth(3)
RMSE = 46643
val dt = new DecisionTreeRegressor()
.setMaxDepth(20)
RMSE = 40571
val dt = new DecisionTreeRegressor()
.setMaxDepth(8)
RMSE = 37198
Bias-Variance tradeoff: Tree-depth
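One way to read the three runs: a shallow tree (depth 3) is too coarse, a very deep tree (depth 20) starts to overfit, and something in between does best on the test split. A minimal sketch of picking the depth by lowest RMSE, using the three values reported on the slides (the object and method names are assumptions):

```scala
object DepthChoice {
  // RMSE reported on the slides for each maximum tree depth that was tried.
  val rmseByDepth: Map[Int, Double] = Map(3 -> 46643.0, 20 -> 40571.0, 8 -> 37198.0)

  // Pick the depth whose RMSE on the test data is lowest.
  def bestDepth(results: Map[Int, Double]): Int = results.minBy(_._2)._1
}
```

With these three results, `DepthChoice.bestDepth(DepthChoice.rmseByDepth)` returns 8. Only three depths were tried, so this says nothing about depths in between.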
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest Regressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.RandomForestRegressor
val data = spark.read.format("libsvm").load(logFile)
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))
val rf = new RandomForestRegressor() // Random Forest Regressor class
  .setMaxDepth(8).setNumTrees(20) // set num of trees
val model = rf.fit(train)
val predictions = model.transform(test)
val rmse = new RegressionEvaluator().evaluate(predictions)
println(s"RMSE = $rmse")
Random Forest
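For regression, a random forest combines its trees by averaging their predictions. The plain-Scala sketch below shows only that combining step; how each tree is trained (on a random sample of rows and features) is left out, and the names `ForestSketch` and `Tree` are assumptions for illustration.

```scala
object ForestSketch {
  // A "tree" here is any function from a feature vector to a predicted price.
  type Tree = Vector[Double] => Double

  // A regression forest predicts the mean of its trees' predictions.
  def predict(trees: Seq[Tree], features: Vector[Double]): Double = {
    require(trees.nonEmpty, "a forest needs at least one tree")
    trees.map(tree => tree(features)).sum / trees.size
  }
}
```

Averaging many trees smooths out the errors of individual deep trees, which is one reason more trees can be combined with a larger depth.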
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest (depth = 8, #Trees = 10) 35023
Random Forest - set Number of Trees
val rf = new RandomForestRegressor()
  .setMaxDepth(8).setNumTrees(10)
RMSE = 35023
val rf = new RandomForestRegressor()
  .setMaxDepth(20).setNumTrees(100)
RMSE = 34607
Root Mean Square Error
Linear Regression - 10 Arbitrary Features 49972
Linear Regression - 10 Best Features 35654
Linear Regression - Higher num of iterations 35583
Decision Tree (depth = 3) 46643
Decision Tree (depth = 20) 40571
Decision Tree (depth = 8) 37198
Random Forest (depth = 8, #Trees = 10) 35023
Random Forest (depth = 20, #Trees = 100) 34607
Summary
1. Explored the Data - data transformations, correlations
2. Assessed Algorithms: Regression, Tree, Forest
3. Played with parameters: # iterations, tree depth, # of trees
Thank You!


Editor's Notes

  • #17, #19: Look how simple this code is. It can deal with very big data, but all features need to have float values.
  • #22, #26, #30, #35, #41: Look how simple this code is. It can deal with very big data, but things are not that simple...
  • #33: Some problems with trees: if the tree has only a few levels of depth, all predicted values come from a predefined list; if the tree has too many levels of depth, we get an overfitting problem.