Supervised learning 1

Chapter 5
California Housing Data
I worked on the California housing dataset provided by Prof. Cesa Bianchi. The data contain several variables
that may or may not affect house prices in a given area. I explored how house prices relate to these
variables with the aid of decision trees and regression trees, and started by building a model of how prices
vary by location across the region.
longitude and latitude = location of the area in which the houses lie
housing median age = median age of the houses in that area
total rooms = total number of rooms of the houses in that area
total bedrooms = total number of bedrooms among those rooms
population = population living in that area
households = number of households in that area
median income = median income of the households living in that area
median house value = median house value in that area
ocean proximity = proximity of the houses to the ocean, classified into:
<1H OCEAN
NEAR BAY
INLAND
ISLAND
NEAR OCEAN

Chapter 6
Decision Trees on the Data in R
6.1 Analyzing the Data
Figure 6.1: Appendix Ref 1
There are 20,640 observations, corresponding to 20,640 census tracts. I am interested in building a model of
how prices vary by location across the region, so let's first see how the points are laid out.

Now let's plot the different region types on the map. From studying the data and from general real-estate
knowledge, we know that houses near the ocean tend to be more expensive, and that houses on islands,
being more exclusive, tend to be high-end luxury properties.
<1H OCEAN = Purple
NEAR BAY = Blue
INLAND = Black
ISLAND = Green
NEAR OCEAN = Pink
6.1.1 Applying Linear Regression to the problem
Since the target value to be predicted (house price) is continuous, this is a regression problem, and it is
natural to try linear regression first. However, as the last figure showed, house prices are distributed over
the area in an interesting way according to ocean proximity, and certainly not in a linear way, so I do not
expect linear regression to work very well here.

• R-squared is around 0.24, which is not great.
• Latitude is not significant, which means that north-south differences in location are barely used by
the model. This seems unlikely to reflect reality.
• Longitude is significant but has a negative coefficient, which means that predicted house prices
decrease linearly as we move east. This is also unlikely.
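To make the low R-squared concrete, here is a minimal sketch of how a straight-line fit can fail on data that vary in a non-linear way. It is written in plain Python rather than the R used for this project, the numbers are made-up toy values (not the housing data), and the function names `fit_simple_ols` and `r_squared` are my own illustrative choices.

```python
# Toy illustration only: xs and ys are invented, not taken from the housing data.

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# A symmetric, U-shaped relationship: high at both ends, low in the middle.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.0, 1.0, 0.0, 1.0, 4.0]

a, b = fit_simple_ols(xs, ys)
r2 = r_squared(xs, ys, a, b)
print(a, b, r2)
```

On this toy data the best straight line is flat, so it explains none of the variation even though y depends strongly on x. The housing prices are not this extreme, but the same mechanism can keep R-squared low when prices cluster by region instead of changing steadily with latitude or longitude.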
To see how this linear regression model looks on a plot, I plot the census tracts again and mark the tracts
with above-median house prices with bright red dots. The red dots show where in the data houses are
actually expensive. I then mark the linear regression predictions on the same plot using a blue sign.
The linear regression model plots a blue euro sign wherever it predicts that a census tract's value is above
300,000. The boundary that the linear regression defines is almost a sharp line, and it is almost vertical
because the latitude variable was not very significant in the regression. The blue signs and the red dots do
not overlap, especially in the east and north-west. The linear regression model is not doing a good job: it
has completely ignored everything on the right and left sides of the picture. That is interesting, and
clearly wrong.

6.1.2 Applying Regression Trees to the problem
I built a regression tree using the rpart command, predicting median house value as a function of
latitude and longitude.
The leaves of the tree are important.
• In a classification tree, the leaves would be the class that I assign.
• In a regression tree, the leaves instead predict a number. Here that number is the average of the median
house prices in that bucket.
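The idea that each leaf predicts the average of its bucket can be sketched as follows. This is a plain-Python illustration of a single split on one variable, not what rpart does internally in full; the longitudes and values are invented toy numbers, and the function names are hypothetical.

```python
# Toy illustration only: a single regression-tree split on one variable.

def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Try the midpoint between each pair of neighbouring distinct x values.

    Returns (threshold, left_mean, right_mean) for the split with the lowest
    total SSE. The two means are what the two leaves would predict.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1:]


# Invented longitudes and house values (in thousands): two clear groups.
longitudes = [-124.0, -123.0, -122.0, -118.0, -117.0, -116.0]
values = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]

threshold, left_mean, right_mean = best_split(longitudes, values)
print(threshold, left_mean, right_mean)
```

A full tree repeats this search inside each bucket and over every predictor, which is how it can carve the map into rectangles instead of drawing one straight boundary.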
Where the fitted values are greater than 300,000, the colour is blue. The regression tree has done a much
better job and largely overlaps the red dots. It has left the low-value area out and has correctly picked up
some of the points in the bottom right and top right, but it is still not accurate, as it missed the houses
in the centre. I therefore go on to build a further regression tree to improve the model and its predictions.
6.1.3 Making a Regression Tree Model
• Splitting the data into training and test sets on median house value
• Finding and adding the important variables to the regression: latitude, longitude, median income and
population
• Building the regression tree
Results: Median income is the most important variable and forms the first split, so the decisions start
from there. The algorithm then distinguishes further between income levels, and after that makes decisions
based on location (longitude and latitude).
tree.sse= [1] 3.269545e+13
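The name tree.sse suggests a sum of squared errors between the tree's predictions and the actual median house values; I am assuming that reading here. A minimal plain-Python sketch of that calculation, using three invented predictions and not the project data, would look like this.

```python
# Toy illustration only: the predicted and actual values are invented.

def sum_squared_errors(predicted, actual):
    """Sum over all rows of (prediction - actual value) squared."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


predicted = [200000.0, 150000.0, 310000.0]
actual = [210000.0, 150000.0, 300000.0]

tree_sse = sum_squared_errors(predicted, actual)

# Root mean squared error: same information, but in the units of the prices.
rmse = (tree_sse / len(actual)) ** 0.5
print(tree_sse, rmse)
```

Because the SSE is in squared price units and grows with the number of rows, a value on the order of 1e13 is hard to judge on its own. Dividing by the number of test rows and taking the square root gives an error in the same units as the house values, which is easier to compare between models.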

6.1.4 Conclusions
Even though decision trees appear to work very well in certain conditions, they come with their own
drawbacks. The model has a very high chance of overfitting: if no limit is set, in the worst case it will
end up putting each observation into its own leaf node. If I further apply PCA and cross-validation, I may
be able to reduce the overfitting that I currently see with the decision tree. I am going to implement PCA
and cross-validation in the unsupervised learning project, using this same data.
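As a pointer to what cross-validation involves, here is a minimal plain-Python sketch of how k-fold cross-validation divides the rows so that every observation is used for testing exactly once. It is an assumption about how the follow-up could be set up, not work done in this project, and the function names are hypothetical.

```python
# Sketch only: builds k-fold train/test index splits, with no model fitting.

def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds of near-equal size."""
    if not 1 < k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    base, extra = divmod(n_rows, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_pairs(n_rows, k):
    """For each fold, return (training indices, test indices)."""
    folds = k_fold_indices(n_rows, k)
    pairs = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((train_idx, test_idx))
    return pairs


pairs = train_test_pairs(10, 3)
for train_idx, test_idx in pairs:
    print(len(train_idx), test_idx)
```

In practice the rows would be shuffled before splitting. A tree-size limit would then be chosen by fitting on each training part, scoring on the held-out part, and keeping the setting with the lowest average error, which guards against a tree that has simply memorised its training rows.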
