Tutorial machine learning with python - a tutorial

3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/

Machine learning with Pthon: A
Tutorial
Vik Paruchuri 21 OCT 2015 in tutorial
Machine learning is a field that uses algorithms to learn from data and
make predictions. Practically, this means that we can feed data into an
algorithm, and use it to make predictions about what might happen in the
future. This has a vast range of applications, from self-driving cars to
stock price prediction. Not only is machine learning interesting, it’s also
starting to be widely used, making it an extremely practical skill to learn.
In this tutorial, we’ll guide you through the basic principles of machine
learning, and how to get started with machine learning with Python.
Luckily for us, Python has an amazing ecosystem of libraries that make
machine learning easy to get started with. We’ll be using the excellent
Scikit-learn, Pandas, and Matplotlib libraries in this tutorial.
If you want to dive more deeply into machine learning, and apply
algorithms in your browser, check out our courses here.
The dataset

Before we dive into machine learning, we’re going to explore a dataset,
and figure out what might be interesting to predict. The dataset is from
BoardGameGeek, and contains data on 80000 board games. Here’s a single
boardgame on the site. This information was kindly scraped into csv
format by Sean Beck, and can be downloaded here.
The dataset contains several data points about each board game. Here’s a
list of the interesting ones:
name – name of the board game.
playingtime – the playing time (given by the manufacturer).
minplaytime – the minimum playing time (given by the manufacturer).
maxplaytime – the maximum playing time (given by the manufacturer).
minage – the minimum recommended age to play.
users_rated – the number of users who rated the game.
average_rating – the average rating given to the game by users. (0-10)
total_weights – the number of weights given by users. Weight is a
subjective measure that is made up by BoardGameGeek. It’s how “deep” or
involved a game is. Here’s a full explanation.
average_weight – the average of all the subjective weights (0-5).
Introduction to Pandas
The first step in our exploration is to read in the data and print some
quick summary statistics. In order to do this, we’ll use the Pandas library.
Pandas provides data structures and data analysis tools that make
manipulating data in Python much quicker and more effective. The most
common data structure is called a dataframe. A dataframe is an extension
of a matrix, so we’ll talk about what a matrix is before coming back to
dataframes.
Our data file looks like this (we removed some columns to make it easier
to look at):
id,type,name,yearpublished,minplayers,maxplayers,playingtime
12333,boardgame,Twilight Struggle,2005,2,2,180
120677,boardgame,Terra Mystica,2012,2,5,150
This is in a format called csv, or comma-separated values, which you can
read more about here. Each row of the data is a different board game, and
different data points about each board game are separated by commas
within the row. The first row is the header row, and describes what each
data point is. The entire set of one data point, going down, is a column.
We can easily conceptualize a csv file as a matrix:
    1       2           3                   4
1   id      type        name                yearpublished
2   12333   boardgame   Twilight Struggle   2005
3   120677  boardgame   Terra Mystica       2012
We removed some of the columns here for display purposes, but you can
still get a sense of how the data looks visually. A matrix is a
two-dimensional data structure, with rows and columns. We can access
elements in a matrix by position. The first row starts with id, the second
row starts with 12333, and the third row starts with 120677. The first
column is id, the second is type, and so on. Matrices in Python can be used
via the NumPy library.
A matrix has some downsides, though. You can’t easily access columns
and rows by name, and every element has to have the same datatype. This
means that we can’t effectively store our board game data in a matrix –
the name column contains strings, and the yearpublished column contains
integers, which means that we can’t store them both in the same matrix.
A dataframe, on the other hand, can have a different datatype in each
column. It also has a lot of built-in niceties for analyzing data, such
as looking up columns by name. Pandas gives us access to these features,
and generally makes working with data much simpler.
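To make the matrix versus dataframe difference concrete, here is a minimal sketch. It hand-types two rows from the sample file instead of reading the real csv, and the variable names (rows, matrix, frame) are only illustrative.

```python
import numpy as np
import pandas as pd

# Two rows from the sample file above, typed in by hand for illustration.
rows = [
    [12333, "Twilight Struggle", 2005],
    [120677, "Terra Mystica", 2012],
]

# A NumPy array holds a single datatype, so mixing integers and text
# forces every element to be stored as text.
matrix = np.array(rows)
print(matrix.dtype.kind)  # "U" means unicode text

# A dataframe keeps a separate datatype per column and allows lookup by name.
frame = pd.DataFrame(rows, columns=["id", "name", "yearpublished"])
print(frame.dtypes)
print(frame["yearpublished"].mean())
```

Because the year column stays numeric in the dataframe, arithmetic such as taking the mean works directly, which it would not on the all-text array.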
Reading in our data
We’ll now read in our data from a csv file into a Pandas dataframe, using
the read_csv function.
# Import the pandas library.
import pandas

# Read in the data.
games = pandas.read_csv("board_games.csv")
# Print the names of the columns in games.
print(games.columns)
Index(['id', 'tpe', 'name', 'earpulihed', 'minplaer', 'maxplaer',
'plaingtime', 'minplatime', 'maxplatime', 'minage', 'uer_rated',
'average_rating', 'ae_average_rating', 'total_owner',

The code above read the data in, and shows us all of the column names.
The columns that are in the data but aren’t listed above should be fairly
self-explanatory.
print(game.hape)
(81312, 20)
We can also see the shape of the data, which shows that it has 81312 rows,
or games, and 20 columns, or data points describing each game.
Plotting our target variale
It could be interesting to predict the average score that a human would
give to a new, unreleased, board game. This is stored in the average_rating
column, which is the average of all the user ratings for a board game. Pre-
dicting this column could be useful to board game manufacturers who are
thinking of what kind of game to make next, for instance.
We can access a column is a dataframe with Pandas using game["aver-
age_rating"] . This will extract a single column from the dataframe.
Let’s plot a histogram of this column so we can visualize the distribution
of ratings. We’ll use Matplotlib to generate the visualization. Matplotlib is
       'total_trader', 'total_wanter', 'total_wiher', 'total_comment',
       'total_weight', 'average_weight'],
      dtpe='oject')

the main plotting infrastructure in Python, and most other plotting li-
braries, like seaborn and ggplot2 are built on top of Matplotlib.
We import Matplotlib’s plotting functions with import matplotli.pplot a
plt . We can then draw and show plots.
# Import matplotlib
import matplotlib.pyplot as plt

# Make a histogram of all the ratings in the average_rating column.
plt.hist(games["average_rating"])

# Show the plot.
plt.show()
What we see here is that there are quite a few games with a 0 rating.
There’s a fairly normal distribution of ratings, with some right skew, and
a mean rating around 6 (if you remove the zeros).
Exploring the 0 ratings

Are there truly so many terrible games that were given a 0 rating? Or is
something else happening? We’ll need to dive into the data a bit more to
check on this.
With Pandas, we can select subsets of data using Boolean series (vectors,
or one column/row of data, are known as series in Pandas). Here’s an
example:
games[games["average_rating"] == 0]
The code above will create a new dataframe, with only the rows in games
where the value of the average_rating column equals 0.
We can then index the resulting dataframe to get the values we want.
There are two ways to index in Pandas – we can index by the name of the
row or column, or we can index by position. Indexing by names looks like
games["average_rating"] – this will return the whole average_rating column
of games. Indexing by position looks like games.iloc[0] – this will return
the whole first row of the dataframe. We can also pass in multiple index
values at once – games.iloc[0,0] will return the first column in the first
row of games. Read more about Pandas indexing here.
# Print the firt row of all the game with zero core.
# The .iloc method on dataframe allow u to index  poition.
print(game[game["average_rating"] == 0].iloc[0])
# Print the firt row of all the game with core greater than 0.
print(game[game["average_rating"] > 0].iloc[0])

id                             318
tpe                     oardgame
name                    Loone Leo
uer_rated                      0
average_rating                   0
ae_average_rating             0
Name: 13048, dtpe: oject
id                                  12333
tpe                            oardgame
name                    Twilight truggle
uer_rated                         20113
average_rating                    8.33774
ae_average_rating              8.22186
Name: 0, dtpe: oject
This shows us that the main di erence between a game with a 0 rating
and a game with a rating above 0 is that the 0 rated game has no reviews.
The uer_rated column is 0 . By ltering out any board games with 0 re-
views, we can remove much of the noise.
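The Boolean filtering and positional indexing used above can be tried on a tiny, made-up dataframe. This is only a sketch: the three toy rows are invented, and only the column names mirror the board game data.

```python
import pandas as pd

# Made-up toy data; only the column names mirror the board game dataset.
toy = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "users_rated": [0, 120, 45],
    "average_rating": [0.0, 7.1, 6.4],
})

# The comparison yields a Boolean series with one True/False per row.
mask = toy["average_rating"] > 0

# Indexing with the mask keeps only the rows where it is True.
rated = toy[mask]

# .iloc selects by position: the first row, then first row / first column.
first_row = rated.iloc[0]
first_cell = rated.iloc[0, 0]
print(mask.tolist())
print(first_row["name"], first_cell)
```

Note that .iloc counts positions within the filtered dataframe, so rated.iloc[0] is the first row that survived the filter, not the first row of the original data.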
Removing game without review
# Remove an row without uer review.
game = game[game["uer_rated"] > 0]
# Remove an row with miing value.
game = game.dropna(axi=0)
We just ltered out all of the rows without user reviews. While we were at
it, we also took out any rows with missing values. Many machine learning
algorithms can’t work with missing values, so we need some way to deal

with them. Filtering them out is one common technique, but it means that
we may potentially lose valuable data. Other techniques for dealing with
missing data are listed here.
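As a rough illustration of the trade-off, the sketch below compares dropping rows with one alternative, filling each gap with its column mean. Mean filling is an assumption chosen for the example, not a technique this tutorial uses, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing value in each column.
toy = pd.DataFrame({
    "minage": [8.0, np.nan, 12.0, 10.0],
    "playingtime": [30.0, 60.0, np.nan, 90.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna(axis=0)

# Option 2 (assumed alternative): fill each gap with the mean of its column.
filled = toy.fillna(toy.mean())

print(len(toy), len(dropped), len(filled))
print(filled)
```

Dropping keeps only two of the four rows here, while filling keeps all four at the cost of inventing plausible values.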
Clutering game
We’ve seen that there may be distinct sets of games. One set (which we
just removed) was the set of games without reviews. Another set could be
a set of highly rated games. One way to gure out more about these sets
of games is a technique called clustering. Clustering enables you to nd
patterns within your data easily by grouping similar rows (in this case,
games), together.
We’ll use a particular type of clustering called k-means clustering. Scikit-
learn has an excellent implementation of k-means clustering that we can
use. Scikit-learn is the primary machine learning library in Python, and
contains implementations of most common algorithms, including random
forests, support vector machines, and logistic regression. Scikit-learn has
a consistent API for accessing these algorithms.
# Import the kmean clutering model.
from klearn.cluter import KMean

# Initialize the model with 2 parameter -- numer of cluter and random tate.
kmean_model = KMean(n_cluter=5, random_tate=1)
# Get onl the numeric column from game.
good_column = game._get_numeric_data()
# Fit the model uing the good column.
kmean_model.fit(good_column)
# Get the cluter aignment.
lael = kmean_model.lael_

In order to use the clustering algorithm in Scikit-learn, we’ll rst intialize
it using two parameters – n_cluter de nes how many clusters of games
that we want, and random_tate is a random seed we set in order to repro-
duce our results later. Here’s more information on the implementation.
We then only get the numeric columns from our dataframe. Most machine
learning algorithms can’t directly operate on text data, and can only take
numbers as input. Getting only the numeric columns removes tpe and
name , which aren’t usable by the clustering algorithm.
Finally, we t our kmeans model to our data, and get the cluster assign-
ment labels for each row.
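To see what the cluster labels mean, here is a small sketch on synthetic points rather than the board game data. The two point groups are invented, and n_init is passed explicitly only because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated groups of 2-D points.
rng = np.random.RandomState(1)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
points = np.vstack([group_a, group_b])

# Ask for two clusters; random_state makes the run reproducible.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_model.fit(points)

# labels_ holds one cluster number per input row, in the same order as the rows.
toy_labels = toy_model.labels_
print(toy_labels[:20])
print(toy_labels[20:])
```

Each group should receive a single label. Which group is called 0 and which is called 1 is arbitrary, so only the grouping carries meaning.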
Plotting cluter
Now that we have cluster labels, let’s plot the clusters. One sticking point
is that our data has many columns – it’s outside of the realm of human
understanding and physics to be able to visualize things in more than 3
dimensions. So we’ll have to reduce the dimensionality of our data, with-
out losing too much information. One way to do this is a technique called
principal component analysis, or PCA. PCA takes multiple columns, and
turns them into fewer columns while trying to preserve the unique infor-
mation in each column. To simplify, say we have two columns, total_own-
er , and total_trader . There is some correlation between these two col-
umns, and some overlapping information. PCA will compress this infor-
mation into one column with new numbers while trying not to lose any
information.
We’ll try to turn our board game data into two dimensions, or columns, so
we can easily plot it out.
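The two-column example can be sketched with synthetic numbers. The owners and traders arrays below are invented stand-ins for total_owners and total_traders, built so that they are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for two correlated columns.
rng = np.random.RandomState(1)
owners = rng.normal(loc=100.0, scale=20.0, size=200)
traders = 0.5 * owners + rng.normal(loc=0.0, scale=2.0, size=200)
two_columns = np.column_stack([owners, traders])

# Compress the two columns into one.
pca_1 = PCA(n_components=1)
one_column = pca_1.fit_transform(two_columns)

print(one_column.shape)
# Fraction of the original variance that the single new column retains.
print(pca_1.explained_variance_ratio_)
```

Because the two inputs overlap so heavily, the single output column keeps almost all of the variance. With weakly correlated columns, more would be lost.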

We first initialize a PCA model from Scikit-learn. PCA isn’t a machine
learning technique, but Scikit-learn also contains other models that are
useful for performing machine learning. Dimensionality reduction
techniques like PCA are widely used when preprocessing data for machine
learning algorithms.
We then turn our data into 2 columns, and plot the columns. When we
plot the columns, we shade them according to their cluster assignment.
# Import the PCA model.
from sklearn.decomposition import PCA

# Create a PCA model.
pca_2 = PCA(2)
# Fit the PCA model on the numeric columns from earlier.
plot_columns = pca_2.fit_transform(good_columns)
# Make a scatter plot of each game, shaded according to cluster assignment.
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
# Show the plot.
plt.show()

The plot shows us that there are 5 distinct clusters. We could dive more
into which games are in each cluster to learn more about what factors
cause games to be clustered.
Figuring out what to predict
There are two things we need to determine before we jump into machine
learning – how we’re going to measure error, and what we’re going to
predict. We thought earlier that average_rating might be good to predict on,
and our exploration reinforces this notion.
There are a variety of ways to measure error (many are listed here).
Generally, when we’re doing regression, and predicting continuous variables,
we’ll need a different error metric than when we’re performing
classification, and predicting discrete values.
For this, we’ll use mean squared error – it’s easy to calculate, and simple
to understand. It shows us how far, on average, our predictions are from
the actual values.
Finding correlations
njoing thi pot? Learn data cience with Dataquet!
> Learn from the comfort of our rower.
> Work with reallife data et.
> uild a portfolio of project.
tart for Free

Now that we want to predict average_rating, let’s see what columns might
be interesting for our prediction. One way is to find the correlation
between average_rating and each of the other columns. This will show us
which other columns might predict average_rating the best. We can use the
corr method on Pandas dataframes to easily find correlations. This will
give us the correlation between each column and each other column. Since
the result of this is a dataframe, we can index it and only get the
correlations for the average_rating column.
# numeric_only=True skips the text columns; pandas 2.0+ raises an error without it.
games.corr(numeric_only=True)["average_rating"]
id                      0.304201
yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64

We see that the average_weight and id columns correlate best with rating.
IDs are presumably assigned when the game is added to the database, so this
likely indicates that games created later score higher in the ratings.
Maybe reviewers were not as nice in the early days of BoardGameGeek, or
older games were of lower quality. average_weight indicates the “depth” or
complexity of a game, so it may be that more complex games are
reviewed better.
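The correlation lookup can be reproduced on a few invented rows. The numbers below are made up so that average_weight tracks the rating closely and minplayers does not; only the column names come from the dataset.

```python
import pandas as pd

# Invented toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "average_weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "minplayers": [2, 4, 2, 4, 2],
    "average_rating": [5.0, 6.0, 6.5, 7.5, 8.0],
})

# corr() returns a square dataframe of pairwise Pearson correlations.
correlations = toy.corr()

# Selecting one column gives every column's correlation with the target.
with_target = correlations["average_rating"].sort_values(ascending=False)
print(with_target)
```

The target always correlates perfectly with itself, so the interesting entries start from the second row of the sorted result.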
Picking predictor columns
Before we get started predicting, let’s only select the columns that are
relevant when training our algorithm. We’ll want to remove certain
columns that aren’t numeric.
We’ll also want to remove columns that can only be computed if you
already know the average rating. Including these columns will defeat the
purpose of the model, which is to predict the rating without any previous
knowledge. Using columns that can only be computed with knowledge
of the target can lead to overfitting, where your model is good in a
training set, but doesn’t generalize well to future data.
The bayes_average_rating column appears to be derived from average_rating in
some way, so let’s remove it.
# Get all the columns from the dataframe.
columns = games.columns.tolist()
# Filter the columns to remove ones we don't want: the derived rating,
# the target itself, and the two non-numeric columns (type and name).
columns = [c for c in columns if c not in ["bayes_average_rating", "average_rating", "type", "name"]]

plitting into train and tet et
We want to be able to gure out how accurate an algorithm is using our
error metrics. However, evaluating the algorithm on the same data it has
been trained on will lead to over tting. We want the algorithm to learn
generalized rules to make predictions, not memorize how to make speci c
predictions. An example is learning math. If you memorize that 1+1=2 , and
2+2=4 , you’ll be able to perfectly answer any questions about 1+1 and 2+2 .
You’ll have 0 error. However, the second anyone asks you something out-
side of your training set where you know the answer, like 3+3 , you won’t
be able to solve it. On the other hand, if you’re able to generalize and
learn addition, you’ll make occasional mistakes because you haven’t
memorized the solutions – maybe you’ll get 3453 + 353535 o by one, but
you’ll be able to solve any addition problem thrown at you.
If your error looks surprisingly low when you’re training a machine
learning algorithm, you should always check to see if you’re over tting.
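One way to run that check is to compare the error on the rows the model was trained on with the error on rows it has never seen. The sketch below is an illustration under assumptions: the data is synthetic, and an unconstrained decision tree is used only because it memorizes its training rows, much like the 1+1=2 example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus random noise.
rng = np.random.RandomState(1)
x = rng.uniform(0.0, 10.0, size=(200, 1))
y = 2.0 * x[:, 0] + rng.normal(0.0, 1.0, size=200)

# First 160 rows for training, last 40 held out for testing.
x_train, x_test = x[:160], x[160:]
y_train, y_test = y[:160], y[160:]

# An unconstrained tree can memorize every training row.
memorizer = DecisionTreeRegressor(random_state=1)
memorizer.fit(x_train, y_train)

train_error = mean_squared_error(y_train, memorizer.predict(x_train))
test_error = mean_squared_error(y_test, memorizer.predict(x_test))
print(train_error, test_error)
```

A training error of essentially zero next to a clearly larger test error is the signature of memorization rather than generalization.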
In order to prevent overfitting, we’ll train our algorithm on a set
consisting of 80% of the data, and test it on another set consisting of 20%
of the data. To do this, we first randomly sample 80% of the rows to be in
the training set, then put everything else in the testing set.
# tore the variale we'll e predicting on.
target = "average_rating"
# Import a convenience function to plit the et.
from klearn.cro_validation import train_tet_plit

# Generate the training et. et random_tate to e ale to replicate reult.

(45515, 20)
(11379, 20)
Above, we exploit the fact that every Pandas row has a unique index to
select any row not in the training set to be in the testing set.
Fitting a linear regreion
Linear regression is a powerful and commonly used machine learning al-
gorithm. It predicts the target variable using linear combinations of the
predictor variables. Let’s say we have a 2 values, 3 , and 4 . A linear com-
bination would be 3 * .5 + 4 * .5 . A linear combination involves multiply-
ing each number by a constant, and adding the results. You can read more
here.
Linear regression only works well when the predictor variables and the
target variable are linearly correlated. As we saw earlier, a few of the pre-
dictors are correlated with the target, so linear regression should work
well for us.
We can use the linear regression implementation in Scikit-learn, just as
we used the k-means implementation earlier.
train = game.ample(frac=0.8, random_tate=1)
# elect anthing not in the training et and put it in the teting et.
tet = game.loc[~game.index.iin(train.index)]
# Print the hape of oth et.
print(train.hape)
print(tet.hape)

# Import the linearregreion model.
from klearn.linear_model import LinearRegreion

# Initialize the model cla.
model = LinearRegreion()
# Fit the model to the training data.
model.fit(train[column], train[target])
When we fit the model, we pass in the predictor matrix, which consists of
all the columns from the dataframe that we picked earlier. If you pass a
list to a Pandas dataframe when you index it, it will generate a new
dataframe with all of the columns in the list. We also pass in the target
variable, which we want to make predictions for.
The model learns the equation that maps the predictors to the target with
minimal error.
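As a small sketch of what "learning the equation" means, the example below first computes the linear combination 3 * .5 + 4 * .5, then fits a regression on invented data whose target is exactly 2*a + 3*b + 1. The names toy_predictors, toy_target, and toy_model are illustrative, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A linear combination: multiply each value by a constant and add the results.
values = np.array([3.0, 4.0])
weights = np.array([0.5, 0.5])
combination = float(np.dot(values, weights))
print(combination)

# Synthetic data where the target is exactly 2*a + 3*b + 1.
toy_predictors = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
toy_target = 2.0 * toy_predictors[:, 0] + 3.0 * toy_predictors[:, 1] + 1.0

# Fitting recovers the constants (coef_) and the offset (intercept_).
toy_model = LinearRegression()
toy_model.fit(toy_predictors, toy_target)
print(toy_model.coef_, toy_model.intercept_)
```

On real data the relationship is never exact, so the fitted constants are the ones that minimize the squared error instead of reproducing the target perfectly.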
Predicting error
After we train the model, we can make predictions on new data with it.
This new data has to be in the exact same format as the training data, or
the model won’t make accurate predictions. Our testing set has the same
columns as the training set (only the rows contain different board games).
We select the same subset of columns from the test set, and then make
predictions on it.
# Import the scikit-learn function to compute error.
from sklearn.metrics import mean_squared_error

# Generate our predictions for the test set.
predictions = model.predict(test[columns])

# Compute error between our test predictions and the actual values.
mean_squared_error(predictions, test[target])
1.8239281903519875
Once we have the predictions, we’re able to compute error between the
test set predictions and the actual values. Mean squared error has the
formula $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is an
actual value and $\hat{y}_i$ is the matching prediction. Basically, we
subtract each predicted value from the actual value, square the
differences, and add them together. Then we divide the result by the total
number of predicted values. This will give us the average squared error for
each prediction.
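The arithmetic can be checked by hand on four invented ratings. The sketch computes the mean squared error step by step and compares it with scikit-learn's mean_squared_error; the actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual ratings and predictions.
actual = np.array([7.0, 6.0, 8.0, 5.0])
predicted = np.array([6.5, 6.0, 7.0, 7.0])

# Subtract, square, add up, and divide by the number of predictions.
differences = actual - predicted       # 0.5, 0.0, 1.0, -2.0
by_hand = np.mean(differences ** 2)    # (0.25 + 0 + 1 + 4) / 4

from_sklearn = mean_squared_error(actual, predicted)
print(by_hand, from_sklearn)
```

Squaring means the single miss of 2 points contributes 4 of the 5.25 total, so large errors dominate the metric.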
Tring a different model
One of the nice things about Scikit-learn is that it enables us to try more
powerful algorithms very easily. One such algorithm is called random for-
est. The random forest algorithm can nd nonlinearities in data that a
linear regression wouldn’t be able to pick up on. Say, for example, that if
the minage of a game, is less than 5, the rating is low, if it’s 5-10 , it’s high,
and if it is between 10-15 , it is low. A linear regression algorithm wouldn’t
be able to pick up on this because there isn’t a linear relationship between
the predictor and the target. Predictions made with a random forest usu-
ally have less error than predictions made by a linear regression.
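The minage example can be sketched on synthetic data. Everything below is invented to follow the low / high / low pattern described above, and the forest settings are illustrative choices, not the ones used on the board game data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: low rating below age 5, high for 5-10, low again for 10-15.
rng = np.random.RandomState(1)
toy_minage = rng.uniform(0.0, 15.0, size=(600, 1))
in_middle = (toy_minage[:, 0] >= 5.0) & (toy_minage[:, 0] < 10.0)
toy_rating = np.where(in_middle, 8.0, 4.0) + rng.normal(0.0, 0.3, size=600)

# Hold out the last 120 rows for testing.
x_train, x_test = toy_minage[:480], toy_minage[480:]
y_train, y_test = toy_rating[:480], toy_rating[480:]

linear = LinearRegression().fit(x_train, y_train)
forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=1)
forest.fit(x_train, y_train)

linear_error = mean_squared_error(y_test, linear.predict(x_test))
forest_error = mean_squared_error(y_test, forest.predict(x_test))
print(linear_error, forest_error)
```

A straight line cannot rise and then fall, so the linear model ends up predicting close to the overall average, while the forest can split the age range into separate bands.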
# Import the random foret model.
from klearn.enemle import RandomForetRegreor

# Initialize the model with ome parameter.
( −
1
n
∑
n
i=1
yi ŷ
i
)
2

1.4144905030983794
Further exploration
We’ve managed to go from data in csv format to making predictions. Here
are some ideas for further exploration:
Try a support vector machine.
Try ensembling multiple models to create better predictions.
Try predicting a di erent column, such as average_weight .
Generate features from the text, such as length of the name of the
game, number of words, etc.
Want to learn more aout machine
learning?
At Dataquest, we o er interactive lessons on machine learning and data
science. We believe in learning by doing, and you’ll learn interactively in
your browser by analyzing real data and building projects. Check out our
machine learning lessons here.
model = RandomForetRegreor(n_etimator=100, min_ample_leaf=10, random_tate
# Fit the model to the data.
model.fit(train[column], train[target])
# Make prediction.
prediction = model.predict(tet[column])
# Compute the error.
mean_quared_error(prediction, tet[target])

Vik Paruchuri
Developer and Data Scientist in San Francisco; Founder of Dataquest.io
(Learn Data Science in your Browser).
Get in touch @vikparuchuri.
hare thi pot
  

16 Comments Dataquest Login

1
Share
⤤ Sort by Oldest
Join the discussion…
• Reply •
Egor Ignatenkov • a year ago
jupyter notebook says: "Unrecognized alias: '--profile=pyspark', it will probably have no
effect."
1 △ ▽
• Reply •
Vik Paruchuri • a year ago
Mod > Egor Ignatenkov
This may be due to a newer version of Jupyter no longer supporting the flag. I think
Srini used IPython 3.
△ ▽
• Reply •
ron • 10 months ago
> Vik Paruchuri
I'm having this same error as well - how can I get it working?
△ ▽
• Reply •
Robert Dupont • 6 months ago
> ron
Not sure if that will help you, after I did all of that i got the same
error, I tried to run spark/bin/pyspark.cmd and it lunched the
notebook with the sc environment.
△ ▽
• Reply •
essay writers online • a year ago
Machine learning is very complicated to learn. It involves different codes that contain
complex terms. All the data's that you have posted was relevant to a common data
structures that are applicable in the common data frame.
△ ▽
• Reply •
Samantha Zeitlin • a year ago
It would help if you could show an example of what the expected output should look like if
the spark context was initialized properly?
1 △ ▽
• Reply •
Pushpam Choudhury • 10 months ago
How can I add some user-defined property such as username, in the SparkConf object? I
can hardcode the details in pyspark.context by using setExecutorEnv, but how for security I
would like to use the details(username etc) captured from the notebook's login page. Can
you please suggest a viable approach?
△ ▽
Recommend

Share ›
Share ›
Share ›
Share ›
Share ›
Share ›
Share ›

• Reply •
raj chandrasekaran • 10 months ago
I found http://tonysyu.github.io/py... useful to add sparkInstallationFolder/python to add
so that there is no need of profile.
pip install pypath_magic
%load_ext pypath_magic
%cd <your spark="" python="" path="">
%pypath –a
△ ▽
• Reply •
Ebisa • 10 months ago
Very helpful explanation. I am using Windows operating system. And Jupyter Notebook is
installed and running fine. When I run the command "Jupyter profile create pyspark" on
windows cmd, I receive an error message that says "jupyter-profile is not recognized as an
internal or external command, operable program or batch file". Any ideas to go around
this?
1 △ ▽
• Reply •
Asher King Abramson • 9 months ago
In getting Spark up and running on my machine ( ), I hit an error that said "Exception:
Java gateway process exited before sending the driver its port number" and I had to change
one of the lines you guys add to .bash_profile.
I had to change
export PYSPARK_SUBMIT_ARGS="--master local[2]"
to
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
to get it running.
Hope this helps someone else who hits the error down the road!
1 △ ▽
• Reply •
Josh • 9 months ago
Mod > Asher King Abramson
Hey Asher - thanks for the note. Because there have been a few changes to the
packages since this post was written, we're planning a mini-update real soon :)
△ ▽
Yuanwen Wang • 8 months ago
Hi Just wondering since Jupyter does not have the concept of "profile" anymore
http://jupyter.readthedocs....
How could we do this step:
Share ›
Share ›
Share ›
Share ›

• Reply •
"Jupyter profile create pyspark"
We get errors like
"jupyter: 'profile' is not a Jupyter command"
△ ▽
• Reply •
Mod > Yuanwen Wang
I'd recommend using findspark - we'll update this post soon!
https://github.com/minrk/fi...
△ ▽
• Reply •
Agustin Luques • 7 months ago
> Josh
Please help me, I was following this post.
My problem: http://stackoverflow.com/qu...
△ ▽
Mod > Agustin Luques
Share ›
Share ›
Share ›
njoing thi pot? Learn data cience with Dataquet!
> Learn from the comfort of our rower.
> Work with reallife data et.
> uild a portfolio of project.
tart for Free
Dataquet log ©
2017 • All right
reerved.

Tutorial machine learning with python - a tutorial

More Related Content

Similar to Tutorial machine learning with python - a tutorial (20)

Recently uploaded (20)

Tutorial machine learning with python - a tutorial