Chapter 5
Call Housing Data
I worked on the Call-housing dataset provided by Prof. Cesa Bianchi. The data contains several variables that
may or may not affect house prices in a given area. I explored the relationship between these variables and
house prices using decision trees and regression trees, and began by building a model of how prices
vary by location across a region.
longitude and latitude = location of the area the houses are in
housing median age = median age of the houses in the area
total rooms = total number of rooms across the houses in the area
total bedrooms = total number of bedrooms among those rooms
population = population living in the area
households = number of households in the area
median income = median income of the households living in the area
median house value = median house value in the area
ocean proximity = proximity of the houses to the ocean, classified into:
< 1H OCEAN
NEAR BAY
INLAND
ISLAND
NEAR OCEAN
Chapter 6
Decision Tree with Data on R
6.1 Analyzing the Data
Figure 6.1: Appendix Ref 1
There are 20640 observations corresponding to 20640 census tracts. I am interested in building a model of
how prices vary by location across a region. So, let’s first see how the points are laid out.
Figure 6.2: Appendix Ref 2
Next, I plot the different region types on the map. From studying the data and from general real-estate
knowledge, houses near the ocean tend to be more expensive, and houses on islands, being
more exclusive, tend to be luxury properties.
< 1H OCEAN = Purple
NEAR BAY = Blue
INLAND = Black
ISLAND = Green
NEAR OCEAN = Pink
Figure 6.3: Appendix Ref 3
6.1.1 Applying Linear Regression to the problem
Since the target to be predicted (house price) is continuous, this is a regression problem, and linear
regression is the natural first algorithm to try. However, the previous figure shows that house prices are
distributed over the area according to ocean proximity in a way that is clearly not linear, so I do not
expect linear regression to work very well here.
Figure 6.4: Appendix Ref 4
• R-squared is around 0.24, so the model explains only about 24% of the variance in house value, which is not great.
• Latitude is not significant, which means the north-south location differences are hardly used
by the model at all. This seems unlikely to reflect reality.
• Longitude is significant, but its coefficient is negative, which means predicted house prices decrease
linearly as we go east. This is also unlikely.
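As a reminder of what the R-squared figure measures, here is a minimal Python sketch (the report itself was produced in R, and the numbers below are hypothetical, not taken from the dataset). R-squared compares the model's squared error with the squared error of always predicting the mean.

```python
def r_squared(actual, predicted):
    """Return 1 - SSE/SST for paired lists of actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    # SSE: squared error of the model's predictions.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared error of the baseline that always predicts the mean.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Hypothetical house values and model predictions (illustrative only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [150.0, 150.0, 350.0, 350.0]
r2 = r_squared(actual, predicted)

# A model that always predicts the mean scores exactly 0.
baseline_r2 = r_squared(actual, [250.0] * 4)
```

A value of 0.24 therefore means the fitted line removes only about a quarter of the error that the mean-only baseline makes.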
To see how this linear regression model looks on a plot, I plot the census tracts again and mark the tracts
with above-median house prices with bright red dots. The red dots show where in the data houses are
actually costly. I then mark the tracts that the linear regression predicts to be costly with a blue
sign.
Figure 6.5: Appendix Ref 5
The linear regression model plots a blue euro sign for every census tract whose predicted value is above
300,000. The boundary it defines is almost a sharp line, and it is almost vertical because the latitude
variable was not very significant in the regression. The blue signs and the red dots do not overlap well,
especially in the east and north-west. The linear regression model is not doing a good job: it has
completely ignored everything on the right and left sides of the picture, which is interesting and
clearly wrong.
6.1.2 Applying Regression Trees to the problem
I built a regression tree using the rpart command, predicting median house value as a function of
latitude and longitude on this dataset.
Figure 6.6: Appendix Ref 6
The leaves of the tree are important.
• In a classification tree, each leaf holds the class that is assigned.
• In a regression tree, each leaf instead predicts a number. Here that number is the average of the median
house prices in that bucket.
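The idea that a leaf predicts the average of its bucket can be sketched in a few lines of Python. This is an illustration only, not the rpart implementation: it searches one predictor for the single threshold that minimizes the summed squared error of the two buckets, and the longitudes and values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on one predictor that minimizes the total SSE
    of the two resulting buckets. Returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1], best[2], best[3]


# Hypothetical longitudes and median house values (illustrative only).
longitude = [-124.0, -123.0, -122.0, -119.0, -118.0, -117.0]
value = [100.0, 120.0, 110.0, 300.0, 320.0, 310.0]
threshold, left_mean, right_mean = best_split(longitude, value)


def predict(x):
    """A leaf predicts the mean of the training values in its bucket."""
    return left_mean if x < threshold else right_mean
```

A full regression tree repeats this search inside each bucket and over every predictor, which is how it can carve the map into rectangles instead of a single straight boundary.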
Figure 6.7: Appendix Ref 7
Tracts whose fitted values are greater than 300,000 are coloured blue. The regression tree has done a much
better job and largely overlaps the red dots. It has left the low-value area out and has correctly
classified some of the points in the bottom right and top right, but it is still not accurate, since it
misses the houses in the center. Next, I extend the regression tree to improve the model and its predictions.
6.1.3 Making a Regression Tree Model
• Splitting the data into training and test sets, with median house value as the target
• Finding and adding important variables to the regression: latitude, longitude, median income and
population
• Making the regression tree
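The splitting step can be sketched as follows in Python. The 30% test fraction, the seed and the example rows are assumptions made for illustration, since the report does not state the split that was actually used in R.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows:
# (latitude, longitude, median_income, population, median_house_value)
rows = [
    (34.0 + i * 0.1, -118.0 - i * 0.1, 2.0 + i, 1000 + 10 * i, 150000.0 + 20000 * i)
    for i in range(10)
]
train, test = train_test_split(rows)
```

The tree is then fitted on `train` only, and its error is measured on `test`, so the reported error reflects tracts the model has not seen.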
Figure 6.8: Appendix Ref 8
Results: Median income is the most important variable and forms the first split, so the decision starts
from there. The algorithm then distinguishes further between income values, and after that it makes
decisions based on location (longitude and latitude).
tree.sse= [1] 3.269545e+13
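The SSE above is hard to read on its own, because it grows with the number of observations and is in squared price units. Dividing by the number of observations it was computed on and taking the square root gives the root mean squared error (RMSE), which is in the same units as the house values. A minimal Python sketch with hypothetical values (the number of observations behind the reported SSE is not stated here, so no RMSE is derived from it):

```python
import math


def sum_squared_error(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error: the typical error in the target's own units."""
    return math.sqrt(sum_squared_error(actual, predicted) / len(actual))


# Hypothetical median house values and tree predictions (illustrative only).
actual = [200000.0, 350000.0, 150000.0]
predicted = [210000.0, 320000.0, 170000.0]
total_sse = sum_squared_error(actual, predicted)
typical_error = rmse(actual, predicted)
```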
Figure 6.9: Appendix Ref 9
6.1.4 Conclusions
Even though decision trees appear to work very well under certain conditions, they come with their
own drawbacks. The model has a high chance of over-fitting: if no limit is set, in the worst case it
will end up putting each observation into its own leaf node. Using PCA and cross-validation with
regression might reduce the over-fitting that I currently see with the decision tree. I will therefore
implement PCA and cross-validation on this same data in the project on
UNSUPERVISED LEARNING.
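The over-fitting risk can be illustrated with a small Python sketch. It is a simplified one-predictor tree, not rpart, and the data are hypothetical. The `min_leaf` parameter is an assumed name for the limit on leaf size (rpart exposes a comparable control called `minbucket`). With `min_leaf=1` the tree keeps splitting until every training observation has its own leaf.

```python
def mean(values):
    return sum(values) / len(values)


def build_tree(points, min_leaf):
    """Recursively split (x, y) points on x; a split is allowed only if it
    leaves at least min_leaf points on each side."""
    points = sorted(points)
    ys = [y for _, y in points]
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot separate equal x values
        left, right = ys[:i], ys[i:]
        cost = (sum((y - mean(left)) ** 2 for y in left)
                + sum((y - mean(right)) ** 2 for y in right))
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return {"leaf": mean(ys)}  # no admissible split: predict the bucket mean
    i = best[1]
    return {"threshold": (points[i - 1][0] + points[i][0]) / 2,
            "left": build_tree(points[:i], min_leaf),
            "right": build_tree(points[i:], min_leaf)}


def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]


def sse(tree, points):
    return sum((y - predict(tree, x)) ** 2 for x, y in points)


# Hypothetical data: a noisy step in the training set, a clean step in the test set.
train = [(1.0, 90.0), (2.0, 110.0), (3.0, 95.0), (4.0, 105.0),
         (5.0, 310.0), (6.0, 290.0), (7.0, 305.0), (8.0, 295.0)]
test = [(1.4, 100.0), (2.4, 100.0), (3.4, 100.0),
        (5.6, 300.0), (6.6, 300.0), (7.6, 300.0)]

deep = build_tree(train, min_leaf=1)     # one leaf per training observation
shallow = build_tree(train, min_leaf=4)  # a single split, two leaves

deep_train, deep_test = sse(deep, train), sse(deep, test)
shallow_train, shallow_test = sse(shallow, train), sse(shallow, test)
```

In this toy case the unlimited tree has zero training error but memorizes the noise and does worse on the held-out points, while the limited tree has a higher training error and generalizes better. Cross-validation is one way to choose such a limit from the data.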
