3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 1/23
Machine learning with Python: A Tutorial
Vik Paruchuri, 21 Oct 2015, in tutorial
Machine learning is a field that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an algorithm and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it's also starting to be widely used, making it an extremely practical skill to learn.
In this tutorial, we'll guide you through the basic principles of machine learning, and how to get started with machine learning with Python. Luckily for us, Python has an amazing ecosystem of libraries that make machine learning easy to get started with. We'll be using the excellent Scikit-learn, Pandas, and Matplotlib libraries in this tutorial.
If you want to dive more deeply into machine learning, and apply algorithms in your browser, check out our courses here.
The dataset
Before we dive into machine learning, we're going to explore a dataset and figure out what might be interesting to predict. The dataset is from BoardGameGeek, and contains data on roughly 80,000 board games. Here's a single boardgame on the site. This information was kindly scraped into csv format by Sean Beck, and can be downloaded here.
The dataset contains several data points about each board game. Here's a list of the interesting ones:
name – name of the board game.
playingtime – the playing time (given by the manufacturer).
minplaytime – the minimum playing time (given by the manufacturer).
maxplaytime – the maximum playing time (given by the manufacturer).
minage – the minimum recommended age to play.
users_rated – the number of users who rated the game.
average_rating – the average rating given to the game by users (0-10).
total_weights – the number of weights given by users. Weight is a subjective measure that is made up by BoardGameGeek. It's how "deep" or involved a game is. Here's a full explanation.
average_weight – the average of all the subjective weights (0-5).
Introduction to Pandas
The first step in our exploration is to read in the data and print some quick summary statistics. In order to do this, we'll use the Pandas library. Pandas provides data structures and data analysis tools that make manipulating data in Python much quicker and more effective. The most common data structure is called a dataframe. A dataframe is an extension of a matrix, so we'll talk about what a matrix is before coming back to dataframes.
Our data file looks like this (we removed some columns to make it easier to look at):
id,type,name,yearpublished,minplayers,maxplayers,playingtime
12333,boardgame,Twilight Struggle,2005,2,2,180
120677,boardgame,Terra Mystica,2012,2,5,150
This is in a format called csv, or comma-separated values, which you can read more about here. Each row of the data is a different board game, and different data points about each board game are separated by commas within the row. The first row is the header row, and describes what each data point is. The entire set of one data point, going down, is a column.
We can easily conceptualize a csv file as a matrix:
      1        2           3                    4
1     id       type        name                 yearpublished
2     12333    boardgame   Twilight Struggle    2005
3     120677   boardgame   Terra Mystica        2012
We removed some of the columns here for display purposes, but you can still get a sense of how the data looks visually. A matrix is a two-dimensional data structure, with rows and columns. We can access elements in a matrix by position. The first row starts with id, the second row starts with 12333, and the third row starts with 120677. The first column is id, the second is type, and so on. Matrices in Python can be used via the NumPy library.
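To make access by position concrete, here is a minimal sketch using NumPy. The array holds only three numeric values per game (yearpublished, minplayers, maxplayers) taken from the two sample rows above; the variable names are ours, not part of the dataset.

```python
import numpy as np

# A small numeric matrix: every element shares one integer datatype.
# Columns: yearpublished, minplayers, maxplayers (from the sample rows above).
matrix = np.array([
    [2005, 2, 2],
    [2012, 2, 5],
])

first_row = matrix[0]        # all values in the first row
first_column = matrix[:, 0]  # all values in the first column
single_value = matrix[1, 2]  # second row, third column

print(matrix.shape)  # (rows, columns)
print(first_row)
print(first_column)
print(single_value)
```

Everything is addressed by row and column number, so the code has to remember which position holds which data point.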
A matrix has some downsides, though. You can't easily access columns and rows by name, and every value has to have the same datatype. This means that we can't effectively store our board game data in a matrix: the name column contains strings, and the yearpublished column contains integers, which means that we can't store them both in the same matrix.
A dataframe, on the other hand, can have different datatypes in each column. It also has a lot of built-in niceties for analyzing data, such as looking up columns by name. Pandas gives us access to these features, and generally makes working with data much simpler.
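As a small illustration of those two points, the sketch below builds a dataframe by hand from the two sample rows shown earlier. The variable names sample and years are ours; the column names come from the dataset.

```python
import pandas as pd

# Two sample rows from the csv excerpt above, stored column by column.
sample = pd.DataFrame({
    "name": ["Twilight Struggle", "Terra Mystica"],
    "yearpublished": [2005, 2012],
    "playingtime": [180, 150],
})

# Each column keeps its own datatype: text for name, integers for the rest.
print(sample.dtypes)

# Columns can be looked up by name rather than by position.
years = sample["yearpublished"]
print(years.tolist())
```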
Reading in our data
We'll now read in our data from a csv file into a Pandas dataframe, using the read_csv function.
# Import the pandas library.
import pandas

# Read in the data.
games = pandas.read_csv("board_games.csv")
# Print the names of the columns in games.
print(games.columns)
Index(['id', 'type', 'name', 'yearpublished', 'minplayers', 'maxplayers',
       'playingtime', 'minplaytime', 'maxplaytime', 'minage', 'users_rated',
       'average_rating', 'bayes_average_rating', 'total_owners',
       'total_traders', 'total_wanters', 'total_wishers', 'total_comments',
       'total_weights', 'average_weight'],
      dtype='object')
The code above reads the data in and shows us all of the column names. The columns that are in the data but weren't described in the list earlier should be fairly self-explanatory.
print(games.shape)
(81312, 20)
We can also see the shape of the data, which shows that it has 81312 rows, or games, and 20 columns, or data points describing each game.
Plotting our target variable
It could be interesting to predict the average score that a human would give to a new, unreleased board game. This is stored in the average_rating column, which is the average of all the user ratings for a board game. Predicting this column could be useful to board game manufacturers who are thinking of what kind of game to make next, for instance.
We can access a column in a dataframe with Pandas using games["average_rating"]. This will extract a single column from the dataframe.
Let's plot a histogram of this column so we can visualize the distribution of ratings. We'll use Matplotlib to generate the visualization. Matplotlib is the main plotting infrastructure in Python, and most other plotting libraries, like seaborn and the Python port of ggplot, are built on top of Matplotlib.
We import Matplotlib's plotting functions with import matplotlib.pyplot as plt. We can then draw and show plots.
# Import matplotlib.
import matplotlib.pyplot as plt

# Make a histogram of all the ratings in the average_rating column.
plt.hist(games["average_rating"])

# Show the plot.
plt.show()
What we see here is that there are quite a few games with a 0 rating. There's a fairly normal distribution of ratings, with some right skew, and a mean rating around 6 (if you remove the zeros).
Exploring the 0 ratings
Are there truly so many terrible games that were given a 0 rating? Or is something else happening? We'll need to dive into the data a bit more to check on this.
With Pandas, we can select subsets of data using Boolean series (vectors, or one column/row of data, are known as series in Pandas). Here's an example:
games[games["average_rating"] == 0]
The code above will create a new dataframe, with only the rows in games where the value of the average_rating column equals 0.
We can then index the resulting dataframe to get the values we want. There are two ways to index in Pandas: we can index by the name of the row or column, or we can index by position. Indexing by names looks like games["average_rating"], which will return the whole average_rating column of games. Indexing by position looks like games.iloc[0], which will return the whole first row of the dataframe. We can also pass in multiple index values at once: games.iloc[0,0] will return the value in the first column of the first row of games. Read more about Pandas indexing here.
# Print the first row of all the games with zero scores.
# The .iloc method on dataframes allows us to index by position.
print(games[games["average_rating"] == 0].iloc[0])
# Print the first row of all the games with scores greater than 0.
print(games[games["average_rating"] > 0].iloc[0])
id                             318
type                     boardgame
name                    Looney Leo
users_rated                      0
average_rating                   0
bayes_average_rating             0
Name: 13048, dtype: object
id                                  12333
type                            boardgame
name                    Twilight Struggle
users_rated                         20113
average_rating                    8.33774
bayes_average_rating              8.22186
Name: 0, dtype: object
This shows us that the main difference between a game with a 0 rating and a game with a rating above 0 is that the 0 rated game has no reviews. The users_rated column is 0. By filtering out any board games with 0 reviews, we can remove much of the noise.
Removing games without reviews
# Remove any rows without user reviews.
games = games[games["users_rated"] > 0]
# Remove any rows with missing values.
games = games.dropna(axis=0)
We just filtered out all of the rows without user reviews. While we were at it, we also took out any rows with missing values. Many machine learning algorithms can't work with missing values, so we need some way to deal with them. Filtering them out is one common technique, but it means that we may potentially lose valuable data. Other techniques for dealing with missing data are listed here.
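To show the trade-off, the sketch below compares dropping rows with one alternative, filling the gap with the column median. The toy values are made up, and median filling is just one option we picked for illustration, not something the tutorial itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing minage value.
toy = pd.DataFrame({
    "minage": [8.0, np.nan, 12.0, 10.0],
    "average_rating": [6.5, 7.0, 5.5, 8.0],
})

# Option 1: drop every row that has a missing value (what we did above).
dropped = toy.dropna(axis=0)

# Option 2: fill the gap with the column median so no rows are lost.
filled = toy.fillna({"minage": toy["minage"].median()})

print(len(dropped), len(filled))
print(filled["minage"].tolist())
```

Dropping is simpler and never invents values, while filling keeps the row's other columns available to the model.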
CluteringΒ game
We’ve seen that there may be distinct sets of games. One set (which we
just removed) was the set of games without reviews. Another set could be
a set of highly rated games. One way to gure out more about these sets
of games is a technique called clustering. Clustering enables you to nd
patterns within your data easily by grouping similar rows (in this case,
games), together.
We’ll use a particular type of clustering called k-means clustering. Scikit-
learn has an excellent implementation of k-means clustering that we can
use. Scikit-learn is the primary machine learning library in Python, and
contains implementations of most common algorithms, including random
forests, support vector machines, and logistic regression. Scikit-learn has
a consistent API for accessing these algorithms.
# Import the kmeans clustering model.
from sklearn.cluster import KMeans

# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=5, random_state=1)
# Get only the numeric columns from games.
good_columns = games.select_dtypes(include="number")
# Fit the model using the good columns.
kmeans_model.fit(good_columns)
# Get the cluster assignments.
labels = kmeans_model.labels_
In order to use the clustering algorithm in Scikit-learn, we'll first initialize it using two parameters: n_clusters defines how many clusters of games we want, and random_state is a random seed we set in order to reproduce our results later. Here's more information on the implementation.
We then only get the numeric columns from our dataframe. Most machine learning algorithms can't directly operate on text data, and can only take numbers as input. Getting only the numeric columns removes type and name, which aren't usable by the clustering algorithm.
Finally, we fit our kmeans model to our data, and get the cluster assignment labels for each row.
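To see what the cluster labels look like, here is a minimal sketch on six made-up points that form two obvious groups. The points and the names toy_model and toy_labels are ours. We also pass n_init explicitly, because its default has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
])

# random_state fixes the seed so the run can be reproduced.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_model.fit(points)

# One label per row; rows in the same group share a label.
toy_labels = toy_model.labels_
print(toy_labels)
```

The label numbers themselves are arbitrary. What matters is which rows end up sharing a label.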
Plotting clusters
Now that we have cluster labels, let's plot the clusters. One sticking point is that our data has many columns, and we can't directly visualize things in more than 3 dimensions. So we'll have to reduce the dimensionality of our data without losing too much information. One way to do this is a technique called principal component analysis, or PCA. PCA takes multiple columns and turns them into fewer columns while trying to preserve the unique information in each column. To simplify, say we have two columns, total_owners and total_traders. There is some correlation between these two columns, and some overlapping information. PCA will compress this information into one column with new numbers while trying to lose as little information as possible.
We'll try to turn our board game data into two dimensions, or columns, so we can easily plot it out.
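The two-column example above can be sketched on synthetic data. The numbers below are randomly generated stand-ins for two correlated columns, not values from the dataset, and the variable names are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for two correlated columns (not real dataset values).
rng = np.random.default_rng(0)
owners = rng.normal(loc=1000, scale=200, size=200)
traders = 0.1 * owners + rng.normal(loc=0, scale=5, size=200)
two_columns = np.column_stack([owners, traders])

# Compress the two columns into a single new column.
pca_1 = PCA(n_components=1)
one_column = pca_1.fit_transform(two_columns)

print(one_column.shape)
# Fraction of the original variance kept by the single component.
print(pca_1.explained_variance_ratio_[0])
```

Because the two columns overlap so heavily, one component keeps almost all of the variance. PCA is sensitive to scale, so columns measured in very different units are often standardized first.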
We first initialize a PCA model from Scikit-learn. PCA isn't a machine learning technique, but Scikit-learn also contains other models that are useful for performing machine learning. Dimensionality reduction techniques like PCA are widely used when preprocessing data for machine learning algorithms.
We then turn our data into 2 columns, and plot the columns. When we plot the columns, we shade them according to their cluster assignment.
# Import the PCA model.
from sklearn.decomposition import PCA

# Create a PCA model.
pca_2 = PCA(2)
# Fit the PCA model on the numeric columns from earlier.
plot_columns = pca_2.fit_transform(good_columns)
# Make a scatter plot of each game, shaded according to cluster assignment.
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
# Show the plot.
plt.show()
The plot shows us that there are 5 distinct clusters. We could dive more
into which games are in each cluster to learn more about what factors
cause games to be clustered.
Figuring out what to predict
There are two things we need to determine before we jump into machine learning: how we're going to measure error, and what we're going to predict. We thought earlier that average_rating might be good to predict on, and our exploration reinforces this notion.
There are a variety of ways to measure error (many are listed here). Generally, when we're doing regression, and predicting continuous variables, we'll need a different error metric than when we're performing classification, and predicting discrete values.
For this, we'll use mean squared error: it's easy to calculate, and simple to understand. It shows us how far, on average, our predictions are from the actual values, with the differences squared so that large misses count more.
Finding correlations
njoingΒ thi pot?Β LearnΒ data cienceΒ withΒ Dataquet!
> LearnΒ fromΒ theΒ comfortΒ of our rower.
> WorkΒ withΒ realΒ­lifeΒ data et.
> ξ€€uildΒ aΒ portfolioΒ ofΒ project.
tartΒ forΒ Free
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 13/23
Now that we want to predict average_rating, let's see what columns might be interesting for our prediction. One way is to find the correlation between average_rating and each of the other columns. This will show us which other columns might predict average_rating the best. We can use the corr method on Pandas dataframes to easily find correlations. This will give us the correlation between each column and each other column. Since the result of this is a dataframe, we can index it and only get the correlations for the average_rating column.
game.corr()["average_rating"]
idΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 0.304201Β 
earpulihedΒ Β Β Β Β Β Β Β Β Β Β 0.108461Β 
minplaer             -0.032701Β 
maxplaer             -0.008335Β 
plaingtimeΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.048994Β 
minplatimeΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.043985Β 
maxplatimeΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.048994Β 
minageΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 0.210049Β 
uer_ratedΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.112564Β 
average_ratingΒ Β Β Β Β Β Β Β Β Β 1.000000Β 
ae_average_ratingΒ Β Β Β 0.231563Β 
total_owner            0.137478Β 
total_trader           0.119452Β 
total_wanter           0.196566Β 
total_wiher           0.171375Β 
total_comment          0.123714Β 
total_weight           0.109691Β 
average_weightΒ Β Β Β Β Β Β Β Β Β 0.351081Β 
Name:Β average_rating,Β dtpe:Β float64Β 
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 14/23
We see that the average_weight and id columns correlate best with rating. ids are presumably assigned when the game is added to the database, so this likely indicates that games created later score higher in the ratings. Maybe reviewers were not as nice in the early days of BoardGameGeek, or older games were of lower quality. average_weight indicates the "depth" or complexity of a game, so it may be that more complex games are reviewed better.
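As a sketch of how to read a correlation table, the example below builds a tiny made-up dataframe and ranks its columns by correlation with the target. The values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical values chosen so one column tracks the rating closely.
toy = pd.DataFrame({
    "average_weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "minplayers": [4, 2, 3, 2, 4],
    "average_rating": [5.0, 6.0, 6.5, 7.5, 8.0],
})

# Pearson correlation of every column with every other column.
correlations = toy.corr()

# Keep only the correlations with the target, strongest first.
with_target = correlations["average_rating"].drop("average_rating")
print(with_target.sort_values(ascending=False))
```

Values near 1 or -1 indicate a strong linear relationship with the target, and values near 0 indicate little or none.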
Picking predictor columns
Before we get started predicting, let's only select the columns that are relevant when training our algorithm. We'll want to remove certain columns that aren't numeric.
We'll also want to remove columns that can only be computed if you already know the average rating. Including these columns will destroy the purpose of the model, which is to predict the rating without any previous knowledge. Using columns that can only be computed with knowledge of the target can lead to overfitting, where your model is good in a training set, but doesn't generalize well to future data.
The bayes_average_rating column appears to be derived from average_rating in some way, so let's remove it.
# Get all the columns from the dataframe.
columns = games.columns.tolist()
# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["bayes_average_rating", "average_ratin
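The filtering step above can be sketched on a plain list of column names. The exclusion list here is our assumption, based on the surrounding text: the target itself, the column derived from it, and the two text columns. The names all_columns, excluded and predictors are ours.

```python
# Hypothetical subset of the column names; in the tutorial the full list
# would come from games.columns.tolist().
all_columns = [
    "id", "type", "name", "yearpublished", "minage", "users_rated",
    "average_rating", "bayes_average_rating", "average_weight",
]

# Assumed exclusions: the target, the column derived from it, and the
# non-numeric columns. This list is an assumption, not the source's own.
excluded = ["bayes_average_rating", "average_rating", "type", "name"]

predictors = [c for c in all_columns if c not in excluded]
print(predictors)
```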
Splitting into train and test sets
We want to be able to figure out how accurate an algorithm is using our error metrics. However, evaluating the algorithm on the same data it has been trained on hides overfitting and makes the error look better than it really is. We want the algorithm to learn generalized rules to make predictions, not memorize how to make specific predictions. An example is learning math. If you memorize that 1+1=2 and 2+2=4, you'll be able to perfectly answer any questions about 1+1 and 2+2. You'll have 0 error. However, the second anyone asks you something outside of your training set, like 3+3, you won't be able to solve it. On the other hand, if you're able to generalize and learn addition, you'll make occasional mistakes because you haven't memorized the solutions (maybe you'll get 3453 + 353535 off by one), but you'll be able to solve any addition problem thrown at you.
If your error looks surprisingly low when you're training a machine learning algorithm, you should always check to see if you're overfitting.
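Here is a minimal sketch of what memorizing looks like in practice, using synthetic data rather than the board game data. We picked an unconstrained decision tree for the demonstration because it can memorize every training row; that choice and all names below are ours.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise (not the board game data).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

# First 160 rows for training, last 40 held out for testing.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# An unconstrained tree can memorize every training row, noise included.
memorizer = DecisionTreeRegressor(random_state=1)
memorizer.fit(X_train, y_train)

train_error = mean_squared_error(y_train, memorizer.predict(X_train))
test_error = mean_squared_error(y_test, memorizer.predict(X_test))
print(train_error, test_error)
```

The training error comes out at essentially zero while the error on held-out rows is much larger. That gap is the warning sign to look for.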
In order to detect overfitting, we'll train our algorithm on a set consisting of 80% of the data, and test it on another set consisting of 20% of the data. To do this, we first randomly sample 80% of the rows to be in the training set, then put everything else in the testing set.
# Store the variable we'll be predicting on.
target = "average_rating"
# Import a convenience function to split the sets.
from sklearn.model_selection import train_test_split

# Generate the training set. Set random_state to be able to replicate results.
train = games.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set.
test = games.loc[~games.index.isin(train.index)]
# Print the shapes of both sets.
print(train.shape)
print(test.shape)
(45515, 20)
(11379, 20)
Above, we exploit the fact that every Pandas row has a unique index to select any row not in the training set to be in the testing set.
Fitting a linear regression
Linear regression is a powerful and commonly used machine learning algorithm. It predicts the target variable using linear combinations of the predictor variables. Let's say we have two values, 3 and 4. A linear combination would be 3 * .5 + 4 * .5. A linear combination involves multiplying each number by a constant, and adding the results. You can read more here.
Linear regression only works well when the predictor variables and the target variable are linearly correlated. As we saw earlier, a few of the predictors are correlated with the target, so linear regression should work reasonably well for us.
We can use the linear regression implementation in Scikit-learn, just as we used the k-means implementation earlier.
# Import the linear regression model.
from sklearn.linear_model import LinearRegression

# Initialize the model class.
model = LinearRegression()
# Fit the model to the training data.
model.fit(train[columns], train[target])
When we fit the model, we pass in the predictor matrix, which consists of all the columns from the dataframe that we picked earlier. If you pass a list to a Pandas dataframe when you index it, it will generate a new dataframe with all of the columns in the list. We also pass in the target variable, which we want to make predictions for.
The model learns the equation that maps the predictors to the target with minimal error.
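To connect the learned equation back to linear combinations, the sketch below fits a regression on synthetic data generated from a known rule and then rebuilds one prediction by hand. The data, the rule, and the names toy_model, by_hand and by_model are ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule: y = 1 + 0.5 * a + 0.5 * b.
X = np.array([[3.0, 4.0], [1.0, 2.0], [5.0, 1.0], [2.0, 6.0], [4.0, 4.0]])
y = 1.0 + 0.5 * X[:, 0] + 0.5 * X[:, 1]

toy_model = LinearRegression()
toy_model.fit(X, y)

# The fitted model is an intercept plus one weight per predictor.
print(toy_model.intercept_, toy_model.coef_)

# A prediction is the linear combination of the inputs with those weights.
row = np.array([3.0, 4.0])
by_hand = toy_model.intercept_ + np.dot(toy_model.coef_, row)
by_model = toy_model.predict(row.reshape(1, -1))[0]
print(by_hand, by_model)
```

Since the data has no noise, the model recovers the weights used to generate it, and the hand-built combination matches predict.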
Predicting error
After we train the model, we can make predictions on new data with it. This new data has to be in the exact same format as the training data, or the model won't make accurate predictions. Our testing set has the same columns as the training set (only the rows contain different board games). We select the same subset of columns from the test set, and then make predictions on it.
# Import the scikit-learn function to compute error.
from sklearn.metrics import mean_squared_error

# Generate our predictions for the test set.
predictions = model.predict(test[columns])

# Compute error between our test predictions and the actual values.
mean_squared_error(test[target], predictions)
1.8239281903519875
Once we have the predictions, we're able to compute error between the test set predictions and the actual values. Mean squared error has the formula $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Basically, we subtract each predicted value from the actual value, square the differences, and add them together. Then we divide the result by the total number of predicted values. This will give us the average squared error for each prediction.
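The steps above can be worked through on four made-up ratings. The numbers are hypothetical. The by-hand result is compared with scikit-learn's mean_squared_error to show that they agree.

```python
from sklearn.metrics import mean_squared_error

# Hypothetical actual ratings and predictions for four games.
actual = [7.0, 6.0, 8.0, 5.0]
predicted = [6.0, 6.5, 9.0, 5.0]

# By hand: square each difference, add them up, divide by the count.
squared_differences = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse_by_hand = sum(squared_differences) / len(actual)

# The scikit-learn helper computes the same quantity.
mse_sklearn = mean_squared_error(actual, predicted)

print(mse_by_hand, mse_sklearn)
```

The squared differences are 1, 0.25, 1 and 0, so the sum is 2.25 and the mean squared error is 2.25 / 4 = 0.5625.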
TringΒ aΒ differentΒ model
One of the nice things about Scikit-learn is that it enables us to try more
powerful algorithms very easily. One such algorithm is called random for-
est. The random forest algorithm can nd nonlinearities in data that a
linear regression wouldn’t be able to pick up on. Say, for example, that if
the minage of a game, is less than 5, the rating is low, if it’s 5-10 , it’s high,
and if it is between 10-15 , it is low. A linear regression algorithm wouldn’t
be able to pick up on this because there isn’t a linear relationship between
the predictor and the target. Predictions made with a random forest usu-
ally have less error than predictions made by a linear regression.
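The minage example can be tried out on synthetic data that follows exactly that low, high, low pattern. The data and variable names below are ours, and the ratings of 4 and 8 are arbitrary stand-ins for "low" and "high".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: low rating when minage is under 5, high from 5 to 10,
# low again above 10, plus a little noise.
rng = np.random.default_rng(1)
minage = rng.uniform(0, 15, size=600)
rating = np.where((minage >= 5) & (minage < 10), 8.0, 4.0)
rating = rating + rng.normal(0, 0.3, size=600)
X = minage.reshape(-1, 1)

# Hold out the last 120 rows for testing.
X_train, X_test = X[:480], X[480:]
y_train, y_test = rating[:480], rating[480:]

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=10, random_state=1
).fit(X_train, y_train)

linear_error = mean_squared_error(y_test, linear.predict(X_test))
forest_error = mean_squared_error(y_test, forest.predict(X_test))
print(linear_error, forest_error)
```

On data like this the straight line can do little better than predicting the overall average, while the forest can split minage into ranges and follow the pattern.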
# Import the random forest model.
from sklearn.ensemble import RandomForestRegressor

# Initialize the model with some parameters.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state
# Fit the model to the data.
model.fit(train[columns], train[target])
# Make predictions.
predictions = model.predict(test[columns])
# Compute the error.
mean_squared_error(test[target], predictions)
1.4144905030983794
Further exploration
We've managed to go from data in csv format to making predictions. Here are some ideas for further exploration:
Try a support vector machine.
Try ensembling multiple models to create better predictions.
Try predicting a different column, such as average_weight.
Generate features from the text, such as length of the name of the game, number of words, etc.
Want to learn more about machine learning?
At Dataquest, we offer interactive lessons on machine learning and data science. We believe in learning by doing, and you'll learn interactively in your browser by analyzing real data and building projects. Check out our machine learning lessons here.
Vik Paruchuri
Developer and Data Scientist in San Francisco; Founder of Dataquest.io (Learn Data Science in your Browser).
Get in touch @vikparuchuri.
Tutorial machine learning with python - a tutorial

  • 1. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 1/23 ξ€€LOGΒ HOM DATAQUT.IO LARNΒ DATA CINC INΒ YOURΒ ξ€€ROWR ο˜… MachineΒ learningΒ withΒ Pthon:Β A Tutorial VikΒ ParuchuriΒ  21Β OCTΒ 2015Β inΒ tutorial Machine learning is a eld that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an al- gorithm, and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it’s also starting to be widely used, making it an extremely practical skill to learn. In this tutorial, we’ll guide you through the basic principles of machine learning, and how to get started with machine learning with Python. Luckily for us, Python has an amazing ecosystem of libraries that make machine learning easy to get started with. We’ll be using the excellent Scikit-learn, Pandas, and Matplotlib libraries in this tutorial. If you want to dive more deeply into machine learning, and apply algo- rithms in your browser, check out our courses here. TheΒ dataet
  • 2. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 2/23 Before we dive into machine learning, we’re going to explore a dataset, and gure out what might be interesting to predict. The dataset is from BoardGameGeek, and contains data on 80000 board games. Here’s a single boardgame on the site. This information was kindly scraped into csv for- mat by Sean Beck, and can be downloaded here. The dataset contains several data points about each board game. Here’s a list of the interesting ones: name – name of the board game. plaingtime – the playing time (given by the manufacturer). minplatime – the minimum playing time (given by the manufacturer). maxplatime – the maximum playing time (given by the manufacturer). minage – the minimum recommended age to play. uer_rated – the number of users who rated the game. average_rating – the average rating given to the game by users. (0-10) total_weight – Number of weights given by users. Weight is a subjective measure that is made up by BoardGameGeek. It’s how β€œdeep” or in- volved a game is. Here’s a full explanation. average_weight – the average of all the subjective weights (0-5). IntroductionΒ toΒ Panda The rst step in our exploration is to read in the data and print some quick summary statistics. In order to do this, we’ll us the Pandas library. Pandas provides data structures and data analysis tools that make manip- ulating data in Python much quicker and more e ective. The most com- mon data structure is called a dataframe. A dataframe is an extension of a
  • 3. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 3/23 matrix, so we’ll talk about what a matrix is before coming back to dataframes. Our data le looks like this (we removed some columns to make it easier to look at): id,tpe,name,earpulihed,minplaer,maxplaer,plaingtimeΒ  12333,oardgame,Twilight truggle,2005,2,2,180Β  120677,oardgame,TerraΒ Mtica,2012,2,5,150Β  This is in a format called csv, or comma-separated values, which you can read more about here. Each row of the data is a di erent board game, and di erent data points about each board game are separated by commas within the row. The rst row is the header row, and describes what each data point is. The entire set of one data point, going down, is a column. We can easily conceptualize a csv le as a matrix: Β Β Β Β 1Β Β Β Β Β Β Β 2Β Β Β Β Β Β Β Β Β Β Β 3Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 4Β  1Β Β Β idΒ Β Β Β Β Β tpeΒ Β Β Β Β Β Β Β name                earpulihedΒ  2Β Β Β 12333   oardgameΒ Β Β Twilight truggleΒ Β Β 2005Β  3Β Β Β 120677  oardgameΒ Β Β TerraΒ MticaΒ Β Β Β Β Β Β 2012Β  We removed some of the columns here for display purposes, but you can still get a sense of how the data looks visually. A matrix is a two-dimen- sional data structure, with rows and columns. We can access elements in a matrix by position. The rst row starts with id , the second row starts with 12333 , and the third row starts with 120677 . The rst column is id ,
  • 4. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 4/23 the second is tpe , and so on. Matrices in Python can be used via the NumPy library. A matrix has some downsides, though. You can’t easily access columns and rows by name, and each column has to have the same datatype. This means that we can’t e ectively store our board game data in a matrix – the name column contains strings, and the earpulihed column contains integers, which means that we can’t store them both in the same matrix. A dataframe, on the other hand, can have di erent datatypes in each col- umn. It has has a lot of built-in niceities for analyzing data as well, such as looking up columns by name. Pandas gives us access to these features, and generally makes working with data much simpler. ReadingΒ inΒ ourΒ data We’ll now read in our data from a csv le into a Pandas dataframe, using the read_cv method. #Β ImportΒ theΒ panda lirar.Β  importΒ panda  Β  #Β ReadΒ inΒ theΒ data.Β  game =Β panda.read_cv("oard_game.cv")Β  #Β PrintΒ theΒ name ofΒ theΒ column inΒ game.Β  print(game.column) Index(['id',Β 'tpe',Β 'name',Β 'earpulihed',Β 'minplaer',Β 'maxplaer', Β Β Β Β Β Β Β 'plaingtime',Β 'minplatime',Β 'maxplatime',Β 'minage',Β 'uer_rated', Β Β Β Β Β Β Β 'average_rating',Β 'ae_average_rating',Β 'total_owner',Β 
  • 5. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 5/23 The code above read the data in, and shows us all of the column names. The columns that are in the data but aren’t listed above should be fairly self-explanatory. print(game.hape) (81312,Β 20)Β  We can also see the shape of the data, which shows that it has 81312 rows, or games, and 20 columns, or data points describing each game. PlottingΒ ourΒ targetΒ variale It could be interesting to predict the average score that a human would give to a new, unreleased, board game. This is stored in the average_rating column, which is the average of all the user ratings for a board game. Pre- dicting this column could be useful to board game manufacturers who are thinking of what kind of game to make next, for instance. We can access a column is a dataframe with Pandas using game["aver- age_rating"] . This will extract a single column from the dataframe. Let’s plot a histogram of this column so we can visualize the distribution of ratings. We’ll use Matplotlib to generate the visualization. Matplotlib is Β Β Β Β Β Β Β 'total_trader',Β 'total_wanter',Β 'total_wiher',Β 'total_comment', Β Β Β Β Β Β Β 'total_weight',Β 'average_weight'],Β  Β Β Β Β Β Β dtpe='oject')Β 
  • 6. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 6/23 the main plotting infrastructure in Python, and most other plotting li- braries, like seaborn and ggplot2 are built on top of Matplotlib. We import Matplotlib’s plotting functions with importΒ matplotli.pplotΒ a  plt . We can then draw and show plots. #Β ImportΒ matplotli  importΒ matplotli.pplotΒ a pltΒ  Β  #Β MakeΒ aΒ hitogramΒ ofΒ allΒ theΒ rating inΒ theΒ average_ratingΒ column.Β  plt.hit(game["average_rating"])Β  Β  # howΒ theΒ plot.Β  plt.how() What we see here is that there are quite a few games with a 0 rating. There’s a fairly normal distribution of ratings, with some right skew, and a mean rating around 6 (if you remove the zeros). xploringΒ theΒ 0Β rating
  • 7. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 7/23 Are there truly so many terrible games that were given a 0 rating? Or is something else happening? We’ll need to dive into the data bit more to check on this. With Pandas, we can select subsets of data using Boolean series (vectors, or one column/row of data, are known as series in Pandas). Here’s an example: game[game["average_rating"]Β ==Β 0]Β  The code above will create a new dataframe, with only the rows in game where the value of the average_rating column equals 0 . We can then index the resulting dataframe to get the values we want. There are two ways to index in Pandas – we can index by the name of the row or column, or we can index by position. Indexing by names looks like game["average_rating"] – this will return the whole average_rating column of game . Indexing by position looks like game.iloc[0] – this will return the whole rst row of the dataframe. We can also pass in multiple index val- ues at once – game.iloc[0,0] will return the rst column in the rst row of game . Read more about Pandas indexing here. #Β PrintΒ theΒ firtΒ rowΒ ofΒ allΒ theΒ game withΒ zero core.Β  #Β TheΒ .ilocΒ methodΒ onΒ dataframe allow u toΒ index  poition.Β  print(game[game["average_rating"]Β ==Β 0].iloc[0])Β  #Β PrintΒ theΒ firtΒ rowΒ ofΒ allΒ theΒ game with core greaterΒ thanΒ 0.Β  print(game[game["average_rating"]Β >Β 0].iloc[0])
  • 8. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 8/23 idΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 318Β  tpe                     oardgameΒ  nameΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Loone LeoΒ  uer_ratedΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 0Β  average_ratingΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 0Β  ae_average_ratingΒ Β Β Β Β Β Β Β Β Β Β Β Β 0Β  Name:Β 13048,Β dtpe:Β ojectΒ  idΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 12333Β  tpe                            oardgameΒ  nameΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Twilight truggleΒ  uer_ratedΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 20113Β  average_ratingΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 8.33774Β  ae_average_ratingΒ Β Β Β Β Β Β Β Β Β Β Β Β Β 8.22186Β  Name:Β 0,Β dtpe:Β ojectΒ  This shows us that the main di erence between a game with a 0 rating and a game with a rating above 0 is that the 0 rated game has no reviews. The uer_rated column is 0 . By ltering out any board games with 0 re- views, we can remove much of the noise. RemovingΒ game withoutΒ review #Β RemoveΒ an row withoutΒ uerΒ review.Β  game =Β game[game["uer_rated"]Β >Β 0]Β  #Β RemoveΒ an row withΒ miingΒ value.Β  game =Β game.dropna(axi=0) We just ltered out all of the rows without user reviews. While we were at it, we also took out any rows with missing values. Many machine learning algorithms can’t work with missing values, so we need some way to deal
  • 9. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 9/23 with them. Filtering them out is one common technique, but it means that we may potentially lose valuable data. Other techniques for dealing with missing data are listed here. CluteringΒ game We’ve seen that there may be distinct sets of games. One set (which we just removed) was the set of games without reviews. Another set could be a set of highly rated games. One way to gure out more about these sets of games is a technique called clustering. Clustering enables you to nd patterns within your data easily by grouping similar rows (in this case, games), together. We’ll use a particular type of clustering called k-means clustering. Scikit- learn has an excellent implementation of k-means clustering that we can use. Scikit-learn is the primary machine learning library in Python, and contains implementations of most common algorithms, including random forests, support vector machines, and logistic regression. Scikit-learn has a consistent API for accessing these algorithms. #Β ImportΒ theΒ kmean cluteringΒ model.Β  from klearn.cluterΒ importΒ KMean  Β  #Β InitializeΒ theΒ modelΒ withΒ 2Β parameter --Β numerΒ ofΒ cluter andΒ random tate. kmean_modelΒ =Β KMean(n_cluter=5,Β random_tate=1)Β  #Β GetΒ onl theΒ numericΒ column fromΒ game.Β  good_column =Β game._get_numeric_data()Β  #Β FitΒ theΒ modelΒ uingΒ theΒ goodΒ column.Β  kmean_model.fit(good_column)Β  #Β GetΒ theΒ cluterΒ aignment.Β  lael =Β kmean_model.lael_
  • 10. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 10/23 In order to use the clustering algorithm in Scikit-learn, we’ll rst intialize it using two parameters – n_cluter de nes how many clusters of games that we want, and random_tate is a random seed we set in order to repro- duce our results later. Here’s more information on the implementation. We then only get the numeric columns from our dataframe. Most machine learning algorithms can’t directly operate on text data, and can only take numbers as input. Getting only the numeric columns removes tpe and name , which aren’t usable by the clustering algorithm. Finally, we t our kmeans model to our data, and get the cluster assign- ment labels for each row. PlottingΒ cluter Now that we have cluster labels, let’s plot the clusters. One sticking point is that our data has many columns – it’s outside of the realm of human understanding and physics to be able to visualize things in more than 3 dimensions. So we’ll have to reduce the dimensionality of our data, with- out losing too much information. One way to do this is a technique called principal component analysis, or PCA. PCA takes multiple columns, and turns them into fewer columns while trying to preserve the unique infor- mation in each column. To simplify, say we have two columns, total_own- er , and total_trader . There is some correlation between these two col- umns, and some overlapping information. PCA will compress this infor- mation into one column with new numbers while trying not to lose any information. We’ll try to turn our board game data into two dimensions, or columns, so we can easily plot it out.
  • 11. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 11/23 We rst initialize a PCA model from Scikit-learn. PCA isn’t a machine learning technique, but Scikit-learn also contains other models that are useful for performing machine learning. Dimensionality reduction tech- niques like PCA are widely used when preprocessing data for machine learning algorithms. We then turn our data into 2 columns, and plot the columns. When we plot the columns, we shade them according to their cluster assignment. #Β ImportΒ theΒ PCAΒ model.Β  from klearn.decompoitionΒ importΒ PCAΒ  Β  #Β CreateΒ aΒ PCAΒ model.Β  pca_2Β =Β PCA(2)Β  #Β FitΒ theΒ PCAΒ modelΒ onΒ theΒ numericΒ column fromΒ earlier.Β  plot_column =Β pca_2.fit_tranform(good_column)Β  #Β MakeΒ a catterΒ plotΒ ofΒ eachΒ game, hadedΒ accordingΒ toΒ cluterΒ aignment. plt.catter(x=plot_column[:,0], =plot_column[:,1],Β c=lael)Β  # howΒ theΒ plot.Β  plt.how()
  • 12. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 12/23 The plot shows us that there are 5 distinct clusters. We could dive more into which games are in each cluster to learn more about what factors cause games to be clustered. FiguringΒ outΒ whatΒ toΒ predict There are two things we need to determine before we jump into machine learning – how we’re going to measure error, and what we’re going to predict. We thought earlier that average_rating might be good to predict on, and our exploration reinforces this notion. There are a variety of ways to measure error (many are listed here). Gen- erally, when we’re doing regression, and predicting continuous variables, we’ll need a di erent error metric than when we’re performing classi ca- tion, and predicting discrete values. For this, we’ll use mean squared error – it’s easy to calculate, and simple to understand. It shows us how far, on average, our predictions are from the actual values. FindingΒ correlation njoingΒ thi pot?Β LearnΒ data cienceΒ withΒ Dataquet! > LearnΒ fromΒ theΒ comfortΒ of our rower. > WorkΒ withΒ realΒ­lifeΒ data et. > ξ€€uildΒ aΒ portfolioΒ ofΒ project. tartΒ forΒ Free
  • 13. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 13/23 Now that we want to predict average_rating , let’s see what columns might be interesting for our prediction. One way is to nd the correlation be- tween average_rating and each of the other columns. This will show us which other columns might predict average_rating the best. We can use the corr method on Pandas dataframes to easily nd correlations. This will give us the correlation between each column and each other column. Since the result of this is a dataframe, we can index it and only get the correla- tions for the average_rating column. game.corr()["average_rating"] idΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 0.304201Β  earpulihedΒ Β Β Β Β Β Β Β Β Β Β 0.108461Β  minplaer             -0.032701Β  maxplaer             -0.008335Β  plaingtimeΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.048994Β  minplatimeΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.043985Β  maxplatimeΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.048994Β  minageΒ Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β Β 0.210049Β  uer_ratedΒ Β Β Β Β Β Β Β Β Β Β Β Β 0.112564Β  average_ratingΒ Β Β Β Β Β Β Β Β Β 1.000000Β  ae_average_ratingΒ Β Β Β 0.231563Β  total_owner            0.137478Β  total_trader           0.119452Β  total_wanter           0.196566Β  total_wiher           0.171375Β  total_comment          0.123714Β  total_weight           0.109691Β  average_weightΒ Β Β Β Β Β Β Β Β Β 0.351081Β  Name:Β average_rating,Β dtpe:Β float64Β 
  • 14. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 14/23 We see that the average_weight and id columns correlate best to rating. id are presumably assigned when the game is added to the database, so this likely indicates that games created later score higher in the ratings. Maybe reviewers were not as nice in the early days of BoardGameGeek, or older games were of lower quality. average_weight indicates the β€œdepth” or complexity of a game, so it may be that more complex games are re- viewed better. PickingΒ predictorΒ column Before we get started predicting, let’s only select the columns that are relevant when training our algorithm. We’ll want to remove certain col- umns that aren’t numeric. We’ll also want to remove columns that can only be computed if you al- ready know the average rating. Including these columns will destroy the purpose of the classi er, which is to predict the rating without any previ- ous knowledge. Using columns that can only be computed with knowledge of the target can lead to over tting, where your model is good in a train- ing set, but doesn’t generalize well to future data. The ae_average_rating column appears to be derived from average_rating in some way, so let’s remove it. #Β GetΒ allΒ theΒ column fromΒ theΒ dataframe.Β  column =Β game.column.tolit()Β  #Β FilterΒ theΒ column toΒ removeΒ one weΒ don'tΒ want.Β  column =Β [cΒ forΒ cΒ inΒ column ifΒ cΒ notΒ inΒ ["ae_average_rating",Β "average_ratin Β 
  • 15. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 15/23 plittingΒ intoΒ trainΒ andΒ tet et We want to be able to gure out how accurate an algorithm is using our error metrics. However, evaluating the algorithm on the same data it has been trained on will lead to over tting. We want the algorithm to learn generalized rules to make predictions, not memorize how to make speci c predictions. An example is learning math. If you memorize that 1+1=2 , and 2+2=4 , you’ll be able to perfectly answer any questions about 1+1 and 2+2 . You’ll have 0 error. However, the second anyone asks you something out- side of your training set where you know the answer, like 3+3 , you won’t be able to solve it. On the other hand, if you’re able to generalize and learn addition, you’ll make occasional mistakes because you haven’t memorized the solutions – maybe you’ll get 3453Β +Β 353535 o by one, but you’ll be able to solve any addition problem thrown at you. If your error looks surprisingly low when you’re training a machine learning algorithm, you should always check to see if you’re over tting. In order to prevent over tting, we’ll train our algorithm on a set consist- ing of 80% of the data, and test it on another set consisting of 20% of the data. To do this, we rst randomly samply 80% of the rows to be in the training set, then put everything else in the testing set. # toreΒ theΒ varialeΒ we'll eΒ predictingΒ on.Β  targetΒ =Β "average_rating" #Β ImportΒ aΒ convenienceΒ functionΒ to plitΒ the et.Β  from klearn.cro_validationΒ importΒ train_tet_plitΒ  Β  #Β GenerateΒ theΒ training et.  etΒ random_tateΒ to eΒ aleΒ toΒ replicateΒ reult.
  • 16. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 16/23 (45515,Β 20)Β  (11379,Β 20)Β  Above, we exploit the fact that every Pandas row has a unique index to select any row not in the training set to be in the testing set. FittingΒ aΒ linearΒ regreion Linear regression is a powerful and commonly used machine learning al- gorithm. It predicts the target variable using linear combinations of the predictor variables. Let’s say we have a 2 values, 3 , and 4 . A linear com- bination would be 3Β *Β .5Β +Β 4Β *Β .5 . A linear combination involves multiply- ing each number by a constant, and adding the results. You can read more here. Linear regression only works well when the predictor variables and the target variable are linearly correlated. As we saw earlier, a few of the pre- dictors are correlated with the target, so linear regression should work well for us. We can use the linear regression implementation in Scikit-learn, just as we used the k-means implementation earlier. trainΒ =Β game.ample(frac=0.8,Β random_tate=1)Β  # electΒ anthingΒ notΒ inΒ theΒ training etΒ andΒ putΒ itΒ inΒ theΒ teting et.Β  tetΒ =Β game.loc[~game.index.iin(train.index)]Β  #Β PrintΒ the hape of oth et.Β  print(train.hape)Β  print(tet.hape)
  • 17. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 17/23 #Β ImportΒ theΒ linearregreionΒ model.Β  from klearn.linear_modelΒ importΒ LinearRegreionΒ  Β  #Β InitializeΒ theΒ modelΒ cla.Β  modelΒ =Β LinearRegreion()Β  #Β FitΒ theΒ modelΒ toΒ theΒ trainingΒ data.Β  model.fit(train[column],Β train[target]) When we t the model, we pass in the predictor matrix, which consists of all the columns from the dataframe that we picked earlier. If you pass a list to a Pandas dataframe when you index it, it will generate a new dataframe with all of the columns in the list. We also pass in the target variable, which we want to make predictions for. The model learns the equation that maps the predictors to the target with minimal error. PredictingΒ error After we train the model, we can make predictions on new data with it. This new data has to be in the exact same format as the training data, or the model won’t make accurate predictions. Our testing set is identical to the training set (except the rows contain di erent board games). We se- lect the same subset of columns from the test set, and then make predic- tions on it. #Β ImportΒ the cikit-learnΒ functionΒ toΒ computeΒ error.Β  from klearn.metric importΒ mean_quared_errorΒ  Β  #Β GenerateΒ ourΒ prediction forΒ theΒ tet et.Β 
  • 18. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 18/23 prediction =Β model.predict(tet[column])Β  Β  #Β ComputeΒ error etweenΒ ourΒ tetΒ prediction andΒ theΒ actualΒ value.Β  mean_quared_error(prediction,Β tet[target]) 1.8239281903519875Β  Once we have the predictions, we’re able to compute error between the test set predictions and the actual values. Mean squared error has the for- mula . Basically, we subtract each predicted value from the actual value, square the di erences, and add them together. Then we divide the result by the total number of predicted values. This will give us the average error for each prediction. TringΒ aΒ differentΒ model One of the nice things about Scikit-learn is that it enables us to try more powerful algorithms very easily. One such algorithm is called random for- est. The random forest algorithm can nd nonlinearities in data that a linear regression wouldn’t be able to pick up on. Say, for example, that if the minage of a game, is less than 5, the rating is low, if it’s 5-10 , it’s high, and if it is between 10-15 , it is low. A linear regression algorithm wouldn’t be able to pick up on this because there isn’t a linear relationship between the predictor and the target. Predictions made with a random forest usu- ally have less error than predictions made by a linear regression. #Β ImportΒ theΒ randomΒ foretΒ model.Β  from klearn.enemleΒ importΒ RandomForetRegreorΒ  Β  #Β InitializeΒ theΒ modelΒ with omeΒ parameter.Β  ( βˆ’ 1 n βˆ‘ n i=1 yi yΜ‚Β  i ) 2
  • 19. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 19/23 1.4144905030983794Β  FurtherΒ exploration We’ve managed to go from data in csv format to making predictions. Here are some ideas for further exploration: Try a support vector machine. Try ensembling multiple models to create better predictions. Try predicting a di erent column, such as average_weight . Generate features from the text, such as length of the name of the game, number of words, etc. WantΒ toΒ learnΒ moreΒ aoutΒ machine learning? At Dataquest, we o er interactive lessons on machine learning and data science. We believe in learning by doing, and you’ll learn interactively in your browser by analyzing real data and building projects. Check out our machine learning lessons here. modelΒ =Β RandomForetRegreor(n_etimator=100,Β min_ample_leaf=10,Β random_tate #Β FitΒ theΒ modelΒ toΒ theΒ data.Β  model.fit(train[column],Β train[target])Β  #Β MakeΒ prediction.Β  prediction =Β model.predict(tet[column])Β  #Β ComputeΒ theΒ error.Β  mean_quared_error(prediction,Β tet[target])
  • 20. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 20/23 Join 40,000+ Data Scientists: Get Data Science Tips, Tricks and Tuto- rials, Delivered Weekly. Enter your email address Sign me up! VikΒ Paruchuri Developer and Data Scientist in San Francisco; Founder of Dataque- st.io (Learn Data Science in your Browser). Get in touch @vikparuchuri. hareΒ thi pot ο˜‚ ο˜„ ο˜ƒ
  • 21. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 21/23 16 Comments Dataquest Login ξ˜ƒ 1 Share β€€ Sort by Oldest Join the discussion… β€’ Reply β€’ Egor Ignatenkov β€’ a year ago jupyter notebook says: "Unrecognized alias: '--profile=pyspark', it will probably have no effect." 1 β–³ β–½ β€’ Reply β€’ Vik Paruchuri β€’ a year ago Mod > Egor Ignatenkov This may be due to a newer version of Jupyter no longer supporting the flag. I think Srini used IPython 3. β–³ β–½ β€’ Reply β€’ ron β€’ 10 months ago > Vik Paruchuri I'm having this same error as well - how can I get it working? β–³ β–½ β€’ Reply β€’ Robert Dupont β€’ 6 months ago > ron Not sure if that will help you, after I did all of that i got the same error, I tried to run spark/bin/pyspark.cmd and it lunched the notebook with the sc environment. β–³ β–½ β€’ Reply β€’ essay writers online β€’ a year ago Machine learning is very complicated to learn. It involves different codes that contain complex terms. All the data's that you have posted was relevant to a common data structures that are applicable in the common data frame. β–³ β–½ β€’ Reply β€’ Samantha Zeitlin β€’ a year ago It would help if you could show an example of what the expected output should look like if the spark context was initialized properly? 1 β–³ β–½ β€’ Reply β€’ Pushpam Choudhury β€’ 10 months ago How can I add some user-defined property such as username, in the SparkConf object? I can hardcode the details in pyspark.context by using setExecutorEnv, but how for security I would like to use the details(username etc) captured from the notebook's login page. Can you please suggest a viable approach? β–³ β–½ Recommend ο„… Share β€Ί Share β€Ί Share β€Ί Share β€Ί Share β€Ί Share β€Ί Share β€Ί
  • 22. raj chandrasekaran (10 months ago): I found http://tonysyu.github.io/py... useful for adding sparkInstallationFolder/python to the Python path, so that there is no need for a profile:
      pip install pypath_magic
      %load_ext pypath_magic
      %cd <your spark python path>
      %pypath -a
    Ebisa (10 months ago): Very helpful explanation. I am using Windows, and Jupyter Notebook is installed and running fine. When I run the command "Jupyter profile create pyspark" in the Windows cmd prompt, I receive an error message that says "jupyter-profile is not recognized as an internal or external command, operable program or batch file". Any ideas on how to get around this?
    Asher King Abramson (9 months ago): In getting Spark up and running on my machine, I hit an error that said "Exception: Java gateway process exited before sending the driver its port number", and I had to change one of the lines you guys add to .bash_profile. I had to change
      export PYSPARK_SUBMIT_ARGS="--master local[2]"
    to
      export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
    to get it running. Hope this helps someone else who hits the error down the road!
      Josh (Mod, 9 months ago) > Asher King Abramson: Hey Asher, thanks for the note. Because there have been a few changes to the packages since this post was written, we're planning a mini-update real soon :)
    Yuanwen Wang (8 months ago): Hi. Just wondering, since Jupyter does not have the concept of "profile" anymore (http://jupyter.readthedocs....), how could we do this step: "Jupyter profile create pyspark"? We get errors like "jupyter: 'profile' is not a Jupyter command".
  • 23. Josh (Mod, 8 months ago) > Yuanwen Wang: I'd recommend using findspark. We'll update this post soon! https://github.com/minrk/fi...
      Agustin Luques (7 months ago) > Josh: Please help me, I was following this post. My problem: http://stackoverflow.com/qu...