Python Machine Learning
using the Scikit-Learn package
Dr. Sarwan Singh
Agenda
• Introduction (SciKit-Learn Toolkit)
• History, contributors
• Data representation in Machine Learning
• Supervised learning example
• Classification model
• Machine Learning Project
using Iris dataset
(Diagram: Artificial Intelligence, Machine Learning, Deep Learning.)
Machine learning is a branch of computer science that studies the design of algorithms that can learn.
History
• Scikit-learn was originally authored by data scientist David Cournapeau, who started it in 2007 as a Google Summer of Code project. Later that year, Matthieu Brucher started work on the project as part of his thesis.
• In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took over leadership of the project and made the first public release on February 1, 2010.
• Since then, releases have appeared on a roughly three-month cycle, and a thriving international community has been leading the development.
• Of the various scikits, scikit-learn as well as scikit-image were described as "well-maintained and popular" in November 2012.
Introduction
• Machine learning library written
in Python
• Simple and efficient, for both
experts and non-experts
• Classical, well-established
machine learning algorithms
• BSD 3-clause license
• Characterized by a clean, uniform,
and streamlined API
• Community driven development
• ~20 core developers (mostly
researchers)
• 500+ occasional contributors
• All working publicly
together on GitHub
• Emphasis on keeping the project
maintainable
• Style consistency
• Unit-test coverage
• Documentation and examples
• Code review
Pandas NumPy Scikit-Learn workflow
• Start with CSV
• Convert to Pandas DataFrame
• Slice and dice in Pandas
• Convert to NumPy array to feed to Scikit-Learn
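A minimal sketch of this workflow; the file name and the 'label' column are illustrative, not from the deck:

import pandas as pd

df = pd.read_csv('data.csv')           # CSV -> Pandas DataFrame (hypothetical file)
df = df.dropna()                       # slice and dice in Pandas
X = df.drop(columns=['label']).values  # NumPy features matrix for Scikit-Learn
y = df['label'].values                 # NumPy target vector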
Additional web resources:
• UCI Machine Learning Dataset Repository The University of California at Irvine (UCI) maintains an online
repository of machine learning datasets (at the time of writing, they are listing 233 datasets).
The repository is available online: http://archive.ics.uci.edu/ml/
• https://github.com/rasbt/pattern_classification/blob/master/resources/machine_learning_ebooks.md
Data Representation in Scikit-Learn
• Machine learning is about creating models from data
• The best way to think about data within Scikit-Learn is in terms of
tables of data.
• Data as table : A basic table is a two-dimensional grid of data, in
which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.
• E.g. the Iris dataset, famously analyzed by Ronald Fisher in 1936.
• It can be downloaded in the form of a Pandas DataFrame
using the Seaborn library, as sketched below.
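A minimal sketch (load_dataset fetches the data over the network on first use):

import seaborn as sns

iris = sns.load_dataset('iris')   # returns a Pandas DataFrame
iris.head()                       # rows = flowers, columns = measurements + species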
Layman’s view of Machine Learning
• Loading the dataset.
• Summarizing the dataset.
• Visualizing the dataset.
• Evaluating some algorithms.
• Making some predictions.
Basics of the Scikit-Learn estimator API
1. Choose a class of model by importing the appropriate estimator class
from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired
values.
3. Arrange data into a features matrix and target vector
4. Fit the model to your data by calling the fit() method
of the model instance.
5. Apply the model to new data:
• For supervised learning, often we predict labels for unknown data using the
predict() method.
• For unsupervised learning, we often transform the data or infer its properties
using the transform() or predict() method.
Basics of the Scikit-Learn estimator API
(Flowchart: choose a class of model → choose model hyperparameters → arrange data into a features matrix and target vector → fit the model to your data → apply the model to new data.)
Supervised learning example: Simple linear regression
• Let's learn with an example: the common
case of fitting a line to (x, y) data.
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y)
1. Choose a class of model. - In Scikit-Learn, every class of model is
represented by a Python class.
from sklearn.linear_model import LinearRegression
• Once the model class is selected, hyperparameters are chosen.
Supervised learning example:
Simple linear regression
2. Choose model hyperparameters. An important point is that a class of
model is not the same as an instance of a model.
• hyperparameters are parameters that must be set before the model
is fit to data
• In Scikit-Learn, hyperparameters are chosen by passing values at
model instantiation.
model = LinearRegression( fit_intercept=True )
Printing the model now shows its hyperparameters:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
(Output from an older scikit-learn release; recent versions removed the normalize parameter and print only non-default settings.)
• The model is not yet applied to any data: the Scikit-Learn API makes
very clear the distinction between the choice of a model and the application of
that model to data.
Supervised learning example: Simple linear regression
3. Arrange data into a features matrix and target vector.
• Make a two-dimensional features matrix (X) and
a one-dimensional target array (y)
• target variable y is already in the correct form (a length-n_samples
array)
• Make the data x into a matrix of size [n_samples, n_features].
X = x[:, np.newaxis]
X.shape   # output: (50, 1)
Supervised learning example: Simple linear regression
Earlier state :
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
4. Fit the model to your data.
• apply model to data using fit() method
model.fit( X , y )
Final: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
• The fit() call triggers a number of model-dependent internal
computations, and the results of these computations
are stored in model-specific attributes.
• In Scikit-Learn, by convention, all model parameters that were
learned during the fit() process carry a trailing underscore, as in the sketch below.
Supervised learning example: Simple linear regression
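For the regression example above, the learned parameters can be inspected after fit(); the printed values are approximate for this random data:

print(model.coef_)        # approx. [2.0]  - the learned slope
print(model.intercept_)   # approx. -1.0   - the learned intercept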
4. Fit the model to your data.(contd..)
• The two parameters represent the slope and intercept of the simple
linear fit to the data. For our generated data, they are very close to the
input slope of 2 and intercept of –1.
• In general, Scikit-Learn does not provide tools to draw conclusions
from internal model parameters themselves: interpreting model
parameters is much more a statistical modeling question than a
machine learning question.
• Machine learning rather focuses on what the model predicts.
Supervised learning example: Simple linear regression
5. Predict labels for unknown data.
• Once the model is trained, the main task of supervised machine
learning is to evaluate it based on what it says about new data that
was not part of the training set.
• In Scikit-Learn, the predict() method is used.
xfit = np.linspace(-1, 11)
#coerce x values into a [n_samples, n_features] features matrix
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
#visualize the result
plt.scatter(x, y)
plt.plot(xfit, yfit);
Supervised learning example: Simple linear regression
What makes up a classification model?
• The structure of the model: here, we use a threshold on a single feature.
• The search procedure: we try every possible combination of feature and threshold.
• The loss function: with it, we decide which of the possibilities is least bad
(because we can rarely talk about the perfect solution). We can use the
training error, or define the point the other way around and ask for
the best accuracy.
• Traditionally, the loss function is minimized; a sketch of this model follows.
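A minimal sketch of such a model, assuming a NumPy features matrix X and binary 0/1 labels y (all names are illustrative):

import numpy as np

def train_threshold_model(X, y):
    """Exhaustively search every (feature, threshold) pair; keep the most accurate."""
    best_acc, best_fi, best_t = -1.0, None, None
    for fi in range(X.shape[1]):          # the search procedure: every feature...
        for t in np.unique(X[:, fi]):     # ...and every candidate threshold
            pred = X[:, fi] > t           # the model structure: one threshold test
            acc = np.mean(pred == y)      # the loss, stated as training accuracy
            if acc > best_acc:
                best_acc, best_fi, best_t = acc, fi, t
    return best_fi, best_t, best_acc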
• Alternatively, we might have different loss functions. It might be that
one type of error is much more costly than another. In a medical
setting, false negatives and false positives are not equivalent.
• A false negative (when the result of a test comes back negative, but
that is false) might lead to the patient not receiving treatment for a
serious disease.
• A false positive (when the test comes back positive even though the
patient does not actually have that disease) might lead to additional
tests for confirmation purposes or unnecessary treatment (which can
still have costs, including side effects from the treatment).
• With spam filtering, we may face the same problem; incorrectly
deleting a non-spam e-mail can be very dangerous for the user, while
letting a spam e-mail through is just a minor annoyance.
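One simple way to encode such asymmetric costs is a weighted sum over the error counts; a sketch with purely illustrative numbers:

false_negatives = 8            # counts from some validation run (illustrative)
false_positives = 40
COST_FN, COST_FP = 10.0, 1.0   # hypothetical: a missed disease is 10x costlier
total_cost = COST_FN * false_negatives + COST_FP * false_positives
print(total_cost)              # 120.0 - the quantity we would want to minimize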
• What the cost function should be is always dependent on the exact
problem you are working on.
• When we present a general-purpose algorithm, we often focus on
minimizing the number of mistakes (achieving the highest accuracy).
• However, if some mistakes are more costly than others, it might be
better to accept a lower overall accuracy to minimize overall costs.
• This is a general area normally termed feature engineering; it is
sometimes seen as less glamorous than algorithms, but it may matter
more for performance (a simple algorithm on well-chosen features
will perform better than a fancy algorithm on not-so-good features).
• Related topics: features and feature engineering, and feature selection (a sketch of the latter follows).
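As an illustration, Scikit-Learn ships basic feature-selection utilities; a sketch using univariate selection on the Iris data (the choice of k=2 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep the 2 most informative features
X_best.shape  # (150, 2)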
First Machine Learning
Project using Iris dataset
Hello world program of machine learning
“classification of iris flowers”
(Images: Iris virginica, Iris setosa, Iris versicolor.)
Question
• After looking at a new flower in the field,
could we make a good prediction about
its species from its measurements?
Iris dataset
• The Iris dataset is a classic dataset from the 1930s; it is
one of the first modern examples of statistical
classification.
• The setting is that of Iris flowers, of which there
are multiple species that can be identified by their
morphology.
• Today, the species would be defined by their genomic
signatures, but in the 1930s, DNA had not even been
identified as the carrier of genetic information.
• The following four attributes of each plant were
measured:
• Sepal length , Sepal width, Petal length, Petal width
Iris dataset
• Generally, we can use any measurement from our data as a feature.
• This is the supervised learning or classification problem; given labeled
examples, we can design a rule that will eventually be applied to
other examples.
• Other modern application examples of pattern classification: Optical
Character Recognition (OCR) in the post office, spam filtering in our
email clients (spam messages vs. “ham” {= not-spam} messages),
barcode scanners in the supermarket, etc.
Hello World of Machine Learning with Iris
• The best small project to start with on a new tool is the classification of
iris flowers. Why the iris dataset?
• Attributes are numeric so you have to figure out how to load and
handle data.
• It is a classification problem, allowing you to practice with perhaps an
easier type of supervised learning algorithm.
• It is a multi-class classification problem (multi-nominal) that may
require some specialized handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily
fits into memory (and a screen or A4 page).
• All of the numeric attributes are in the same units and the same
scale, not requiring any special scaling or transforms to get started.
Iris Dataset
• Iris dataset contains 150
observations of iris
flowers.
• Has four columns of
measurements of the
flowers in centimeters.
• The fifth column is the
species of the flower
observed.
• All observed flowers
belong to one of three
species.
Inputs from: machinelearningmastery, Google, Kaggle, etc.
Summarize dataset
• Take a statistical summary using
describe().
• Grouping the rows/records based
on class of flower, using
irisDataframe.groupby('class').size()
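A sketch of these steps, assuming the dataset is loaded into a DataFrame named irisDataframe with a 'class' column (the UCI URL and column names follow the common tutorial convention):

import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
irisDataframe = pd.read_csv(url, names=names)   # UCI serves the file without a header row

print(irisDataframe.describe())                 # count, mean, std, min, quartiles, max
print(irisDataframe.groupby('class').size())    # 50 observations per species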
Data Visualization
Two types of plots:
• Univariate plots to better understand
each attribute.
• Multivariate plots to better understand
the relationships between attributes.
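A sketch of both plot types using Pandas' built-in plotting, continuing with irisDataframe from above:

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

irisDataframe.hist()            # univariate: one histogram per attribute
plt.show()
scatter_matrix(irisDataframe)   # multivariate: scatterplots of all attribute pairs
plt.show()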
Multivariate plots
• Scatterplots of all pairs of attributes.
• Helpful for spotting structured relationships between input variables.
• The diagonal grouping of some pairs of attributes suggests a high
correlation and a predictable relationship.
Create a Validation Dataset
Split the loaded dataset into two:
• 80% of which we will use to train our models and
• 20% that we will hold back as a validation dataset.
The split stores the training data in:
• X_train and Y_train for preparing models, and
• X_validation and Y_validation for the final check (a sketch follows).
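A sketch with Scikit-Learn's train_test_split, assuming X and y are arranged as a features matrix and target vector (the arrangement itself is sketched on the next slide); the seed value is arbitrary:

from sklearn.model_selection import train_test_split

X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)   # fixed seed so the split is reproducible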
Arranging data into a features matrix and target vector
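The original slide presents this step as an image; a minimal sketch, with column positions assuming the DataFrame layout used above:

array = irisDataframe.values   # DataFrame -> NumPy array
X = array[:, 0:4]              # features matrix: the four measurement columns
y = array[:, 4]                # target vector: the species column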
K-fold cross validation
• Cross-validation, sometimes called rotation
estimation or out-of-sample testing is any
of various similar model validation
techniques for assessing how the results of
a statistical analysis will generalize to an
independent data set.
Source: wikipedia.org
• Mainly used in settings where the goal is prediction, and one wants to
estimate how accurately a predictive model will perform in practice.
• In a prediction problem, a model is usually given a dataset of known data on
which training is run (training dataset), and a dataset of unknown data (or
first seen data) against which the model is tested (called the validation
dataset or testing set).
• The goal of cross-validation is to test the model’s ability to predict new data
that were not used in estimating it, in order to flag problems like overfitting.
Test Harness
• Use 10-fold cross validation to estimate accuracy.
• This splits the dataset into 10 parts, trains on 9, tests on 1, and
repeats for all combinations of train-test splits.
• Use the ‘accuracy’ metric to evaluate models.
• This is the number of correctly predicted instances divided
by the total number of instances in the dataset, multiplied by 100 to give
a percentage (e.g. 95% accurate). A sketch of the harness follows.
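A sketch of this harness for one model; StratifiedKFold is one reasonable choice of splitter, and the seed value is arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = LogisticRegression(max_iter=200)   # max_iter raised to ensure convergence
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
print('%.3f (%.3f)' % (scores.mean(), scores.std()))   # mean accuracy and spread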
Evaluate 6 different algorithms:
• Logistic Regression (LR)
• Linear Discriminant Analysis (LDA)
• K-Nearest Neighbours (KNN).
• Classification and Regression Trees
(CART).
• Gaussian Naive Bayes (NB).
• Support Vector Machines (SVM).
This is a good mix of simple linear
(LR and LDA) and nonlinear
(KNN, CART, NB and SVM) algorithms.
To ensure the results are directly
comparable, reset the random number
seed before each run so that each
algorithm is evaluated on exactly
the same data splits, as in the sketch below.
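A sketch of the comparison loop; model settings are kept at defaults except where noted, and the short names follow the slide:

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

models = [('LR', LogisticRegression(max_iter=200)),
          ('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('SVM', SVC())]

results, names = [], []
for name, model in models:
    # same seed each time -> identical splits for every algorithm
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    cv = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv)
    names.append(name)
    print('%s: %.3f (%.3f)' % (name, cv.mean(), cv.std()))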
Compare algorithms
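The original slide shows the comparison as a box-and-whisker plot; a sketch continuing from results and names above:

import matplotlib.pyplot as plt

plt.boxplot(results, labels=names)   # one box per algorithm's ten CV scores
plt.title('Algorithm Comparison')
plt.show()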
Fit the model to your data
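The original slide shows this step as a screenshot; a sketch that fits one model on the training data and checks it against the held-back validation set (the choice of SVM here is an assumption, not from the deck):

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

model = SVC()                       # SVM picked purely as an example
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(classification_report(Y_validation, predictions))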