Python Machine Learning
using the Scikit-Learn package
Dr. Sarwan Singh
Agenda
• Introduction (SciKit-Learn Toolkit)
• History, contributors
• Data representation in Machine Learning
• Supervised learning example
• Classification model
• Machine Learning Project
using Iris dataset
(Diagram: Artificial Intelligence, Machine Learning, Deep Learning.)
Machine learning is a branch of computer science that studies the design of algorithms that can learn.
History
• Scikit-learn was originally authored by data scientist David Cournapeau, who started it in 2007 as a Google Summer of Code project. Later that year, Matthieu Brucher started work on the project as part of his thesis.
• In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took over leadership of the project and made the first public release on February 1, 2010.
• Since then, releases have appeared on a roughly three-month cycle, and a thriving international community has been leading the development.
• Of the various scikits, scikit-learn as well as scikit-image were described as "well-maintained and popular" in November 2012.
Introduction
• Machine learning library written
in Python
• Simple and efficient, for both
experts and non-experts
• Classical, well-established
machine learning algorithms
• BSD 3-clause license
• Characterized by a clean, uniform,
and streamlined API
• Community driven development
• ~20 core developers (mostly
researchers)
• 500+ occasional contributors
• All working publicly
together on GitHub
• Emphasis on keeping the project
maintainable
• Style consistency
• Unit-test coverage
• Documentation and examples
• Code review
Pandas NumPy Scikit-Learn workflow
• Start with CSV
• Convert to Pandas DataFrame
• Slice and dice in Pandas
• Convert to NumPy array to feed to Scikit-Learn
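A minimal sketch of this workflow; the file name and the 'label' column are illustrative, not from the deck:

import pandas as pd

df = pd.read_csv('data.csv')           # CSV -> Pandas DataFrame (hypothetical file)
df = df.dropna()                       # slice and dice in Pandas
X = df.drop(columns=['label']).values  # NumPy features matrix for Scikit-Learn
y = df['label'].values                 # NumPy target vector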
Additional web resources:
• UCI Machine Learning Dataset Repository The University of California at Irvine (UCI) maintains an online
repository of machine learning datasets (at the time of writing, they are listing 233 datasets).
The repository is available online: http://archive.ics.uci.edu/ml/
• https://github.com/rasbt/pattern_classification/blob/master/resources/machine_learning_ebooks.md
Data Representation in Scikit-Learn
• Machine learning is about creating models from data
• The best way to think about data within Scikit-Learn is in terms of
tables of data.
• Data as table : A basic table is a two-dimensional grid of data, in
which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.
• E.g. the Iris dataset, famously analyzed by Ronald Fisher in 1936.
• It can be downloaded in the form of a Pandas DataFrame
using the Seaborn library, as sketched below.
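A minimal sketch (load_dataset fetches the data over the network on first use):

import seaborn as sns

iris = sns.load_dataset('iris')   # returns a Pandas DataFrame
iris.head()                       # rows = flowers, columns = measurements + species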
Layman’s view of Machine Learning
• Loading the dataset.
• Summarizing the dataset.
• Visualizing the dataset.
• Evaluating some algorithms.
• Making some predictions.
Basics of the Scikit-Learn estimator API
1. Choose a class of model by importing the appropriate estimator class
from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired
values.
3. Arrange data into a features matrix and target vector
4. Fit the model to your data by calling the fit() method
of the model instance.
5. Apply the model to new data:
• For supervised learning, often we predict labels for unknown data using the
predict() method.
• For unsupervised learning, we often transform the data or infer its properties
using the transform() or predict() method.
Basics of the Scikit-Learn estimator API
(Flowchart: choose a class of model → choose model hyperparameters → arrange data into a features matrix and target vector → fit the model to your data → apply the model to new data.)
Supervised learning example: Simple linear regression
• Let's learn with an example: the common
case of fitting a line to (x, y) data.
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y)
1. Choose a class of model. - In Scikit-Learn, every class of model is
represented by a Python class.
from sklearn.linear_model import LinearRegression
• Once the model class is selected, hyperparameters are chosen.
Supervised learning example:
Simple linear regression
2. Choose model hyperparameters. An important point is that a class of
model is not the same as an instance of a model.
• hyperparameters are parameters that must be set before the model
is fit to data
• In Scikit-Learn, hyperparameters are chosen by passing values at
model instantiation.
model = LinearRegression( fit_intercept=True )
Printing the model now shows its hyperparameters:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
(Output from an older scikit-learn release; recent versions removed the normalize parameter and print only non-default settings.)
• The model is not yet applied to any data: the Scikit-Learn API makes
very clear the distinction between the choice of a model and the application of
that model to data.
Supervised learning example: Simple linear regression
3. Arrange data into a features matrix and target vector.
• Make a two-dimensional features matrix (X) and
a one-dimensional target array (y)
• target variable y is already in the correct form (a length-n_samples
array)
• Make the data x into a matrix of size [n_samples, n_features].
X = x[:, np.newaxis]
X.shape   # output: (50, 1)
Supervised learning example: Simple linear regression
Earlier state :
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
4. Fit the model to your data.
• apply model to data using fit() method
model.fit( X , y )
Final: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
• The fit() call triggers a number of model-dependent internal
computations, and the results of these computations
are stored in model-specific attributes.
• In Scikit-Learn, by convention, all model parameters that were
learned during the fit() process carry a trailing underscore, as in the sketch below.
Supervised learning example: Simple linear regression
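For the regression example above, the learned parameters can be inspected after fit(); the printed values are approximate for this random data:

print(model.coef_)        # approx. [2.0]  - the learned slope
print(model.intercept_)   # approx. -1.0   - the learned intercept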
4. Fit the model to your data.(contd..)
• The two parameters represent the slope and intercept of the simple
linear fit to the data. For our generated data, they are very close to the
input slope of 2 and intercept of –1.
• In general, Scikit-Learn does not provide tools to draw conclusions
from internal model parameters themselves: interpreting model
parameters is much more a statistical modeling question than a
machine learning question.
• Machine learning rather focuses on what the model predicts.
Supervised learning example: Simple linear regression
5. Predict labels for unknown data.
• Once the model is trained, the main task of supervised machine
learning is to evaluate it based on what it says about new data that
was not part of the training set.
• In Scikit-Learn, the predict() method is used.
xfit = np.linspace(-1, 11)
#coerce x values into a [n_samples, n_features] features matrix
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
#visualize the result
plt.scatter(x, y)
plt.plot(xfit, yfit);
Supervised learning example: Simple linear regression
What makes up a classification model?
• The structure of the model: here, we use a threshold on a single feature.
• The search procedure: we try every possible combination of feature and threshold.
• The loss function: with it, we decide which of the possibilities is least bad
(because we can rarely talk about the perfect solution). We can use the
training error, or define the point the other way around and ask for
the best accuracy.
• Traditionally, the loss function is minimized; a sketch of this model follows.
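A minimal sketch of such a model, assuming a NumPy features matrix X and binary 0/1 labels y (all names are illustrative):

import numpy as np

def train_threshold_model(X, y):
    """Exhaustively search every (feature, threshold) pair; keep the most accurate."""
    best_acc, best_fi, best_t = -1.0, None, None
    for fi in range(X.shape[1]):          # the search procedure: every feature...
        for t in np.unique(X[:, fi]):     # ...and every candidate threshold
            pred = X[:, fi] > t           # the model structure: one threshold test
            acc = np.mean(pred == y)      # the loss, stated as training accuracy
            if acc > best_acc:
                best_acc, best_fi, best_t = acc, fi, t
    return best_fi, best_t, best_acc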
• Alternatively, we might have different loss functions. It might be that
one type of error is much more costly than another. In a medical
setting, false negatives and false positives are not equivalent.
• A false negative (when the result of a test comes back negative, but
that is false) might lead to the patient not receiving treatment for a
serious disease.
• A false positive (when the test comes back positive even though the
patient does not actually have that disease) might lead to additional
tests for confirmation purposes or unnecessary treatment (which can
still have costs, including side effects from the treatment).
• With spam filtering, we may face the same problem; incorrectly
deleting a non-spam e-mail can be very dangerous for the user, while
letting a spam e-mail through is just a minor annoyance.
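One simple way to encode such asymmetric costs is a weighted sum over the error counts; a sketch with purely illustrative numbers:

false_negatives = 8            # counts from some validation run (illustrative)
false_positives = 40
COST_FN, COST_FP = 10.0, 1.0   # hypothetical: a missed disease is 10x costlier
total_cost = COST_FN * false_negatives + COST_FP * false_positives
print(total_cost)              # 120.0 - the quantity we would want to minimize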
• What the cost function should be is always dependent on the exact
problem you are working on.
• When we present a general-purpose algorithm, we often focus on
minimizing the number of mistakes (achieving the highest accuracy).
• However, if some mistakes are more costly than others, it might be
better to accept a lower overall accuracy to minimize overall costs.
• This is a general area normally termed feature engineering; it is
sometimes seen as less glamorous than algorithms, but it may matter
more for performance (a simple algorithm on well-chosen features
will perform better than a fancy algorithm on not-so-good features).
• Related topics: features and feature engineering, and feature selection (a sketch of the latter follows).
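As an illustration, Scikit-Learn ships basic feature-selection utilities; a sketch using univariate selection on the Iris data (the choice of k=2 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep the 2 most informative features
X_best.shape  # (150, 2)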
First Machine Learning
Project using Iris dataset
Hello world program of machine learning
“classification of iris flowers”
(Images: Iris virginica, Iris setosa, Iris versicolor.)
Question
• After looking at a new flower in the field,
could we make a good prediction about
its species from its measurements?
Iris dataset
• The Iris dataset is a classic dataset from the 1930s; it is
one of the first modern examples of statistical
classification.
• The setting is that of Iris flowers, of which there
are multiple species that can be identified by their
morphology.
• Today, the species would be defined by their genomic
signatures, but in the 1930s, DNA had not even been
identified as the carrier of genetic information.
• The following four attributes of each plant were
measured:
• Sepal length , Sepal width, Petal length, Petal width
Iris dataset
• Generally, we can use any measurement from our data as a feature.
• This is the supervised learning or classification problem; given labeled
examples, we can design a rule that will eventually be applied to
other examples.
• Other modern application examples of pattern classification: Optical
Character Recognition (OCR) in the post office, spam filtering in our
email clients (spam messages vs. “ham” {= not-spam} messages),
barcode scanners in the supermarket, etc.
Hello World of Machine Learning with Iris
• The best small project to start with on a new tool is the classification of
iris flowers. Why the iris dataset?
• Attributes are numeric so you have to figure out how to load and
handle data.
• It is a classification problem, allowing you to practice with perhaps an
easier type of supervised learning algorithm.
• It is a multi-class classification problem (multi-nominal) that may
require some specialized handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily
fits into memory (and a screen or A4 page).
• All of the numeric attributes are in the same units and the same
scale, not requiring any special scaling or transforms to get started.
Iris Dataset
• Iris dataset contains 150
observations of iris
flowers.
• Has four columns of
measurements of the
flowers in centimeters.
• The fifth column is the
species of the flower
observed.
• All observed flowers
belong to one of three
species.
Inputs from: machinelearningmastery, Google, Kaggle, etc.
Summarize dataset
• Take a statistical summary using
describe().
• Grouping the rows/records based
on class of flower, using
irisDataframe.groupby('class').size()
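A sketch of these steps, assuming the dataset is loaded into a DataFrame named irisDataframe with a 'class' column (the UCI URL and column names follow the common tutorial convention):

import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
irisDataframe = pd.read_csv(url, names=names)   # UCI serves the file without a header row

print(irisDataframe.describe())                 # count, mean, std, min, quartiles, max
print(irisDataframe.groupby('class').size())    # 50 observations per species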
Data Visualization
Two types of plots:
• Univariate plots to better understand
each attribute.
• Multivariate plots to better understand
the relationships between attributes.
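A sketch of both plot types using Pandas' built-in plotting, continuing with irisDataframe from above:

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

irisDataframe.hist()            # univariate: one histogram per attribute
plt.show()
scatter_matrix(irisDataframe)   # multivariate: scatterplots of all attribute pairs
plt.show()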
Multivariate plots
• Scatterplots of all pairs of attributes.
• Helpful for spotting structured relationships between input variables.
• The diagonal grouping of some pairs of attributes suggests a high
correlation and a predictable relationship.
Create a Validation Dataset
Split the loaded dataset into two:
• 80% of which we will use to train our models and
• 20% that we will hold back as a validation dataset.
The split stores the training data in:
• X_train and Y_train for preparing models, and
• X_validation and Y_validation for the final check (a sketch follows).
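A sketch with Scikit-Learn's train_test_split, assuming X and y are arranged as a features matrix and target vector (the arrangement itself is sketched on the next slide); the seed value is arbitrary:

from sklearn.model_selection import train_test_split

X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)   # fixed seed so the split is reproducible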
Arranging data into a features matrix and target vector
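The original slide presents this step as an image; a minimal sketch, with column positions assuming the DataFrame layout used above:

array = irisDataframe.values   # DataFrame -> NumPy array
X = array[:, 0:4]              # features matrix: the four measurement columns
y = array[:, 4]                # target vector: the species column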
K-fold cross validation
• Cross-validation, sometimes called rotation
estimation or out-of-sample testing is any
of various similar model validation
techniques for assessing how the results of
a statistical analysis will generalize to an
independent data set.
Source: wikipedia.org
• Mainly used in settings where the goal is prediction, and one wants to
estimate how accurately a predictive model will perform in practice.
• In a prediction problem, a model is usually given a dataset of known data on
which training is run (training dataset), and a dataset of unknown data (or
first seen data) against which the model is tested (called the validation
dataset or testing set).
• The goal of cross-validation is to test the model’s ability to predict new data
that were not used in estimating it, in order to flag problems like overfitting.
Test Harness
• Use 10-fold cross validation to estimate accuracy.
• This splits the dataset into 10 parts, trains on 9, tests on 1, and
repeats for all combinations of train-test splits.
• Use the ‘accuracy’ metric to evaluate models.
• This is the number of correctly predicted instances divided
by the total number of instances in the dataset, multiplied by 100 to give
a percentage (e.g. 95% accurate). A sketch of the harness follows.
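A sketch of this harness for one model; StratifiedKFold is one reasonable choice of splitter, and the seed value is arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = LogisticRegression(max_iter=200)   # max_iter raised to ensure convergence
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
print('%.3f (%.3f)' % (scores.mean(), scores.std()))   # mean accuracy and spread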
Evaluate 6 different algorithms:
• Logistic Regression (LR)
• Linear Discriminant Analysis (LDA)
• K-Nearest Neighbours (KNN).
• Classification and Regression Trees
(CART).
• Gaussian Naive Bayes (NB).
• Support Vector Machines (SVM).
This is a good mix of simple linear
(LR and LDA) and nonlinear
(KNN, CART, NB and SVM) algorithms.
To ensure the results are directly
comparable, reset the random number
seed before each run so that each
algorithm is evaluated on exactly
the same data splits, as in the sketch below.
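A sketch of the comparison loop; model settings are kept at defaults except where noted, and the short names follow the slide:

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

models = [('LR', LogisticRegression(max_iter=200)),
          ('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('SVM', SVC())]

results, names = [], []
for name, model in models:
    # same seed each time -> identical splits for every algorithm
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    cv = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv)
    names.append(name)
    print('%s: %.3f (%.3f)' % (name, cv.mean(), cv.std()))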
Compare algorithms
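The original slide shows the comparison as a box-and-whisker plot; a sketch continuing from results and names above:

import matplotlib.pyplot as plt

plt.boxplot(results, labels=names)   # one box per algorithm's ten CV scores
plt.title('Algorithm Comparison')
plt.show()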
Fit the model to your data
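The original slide shows this step as a screenshot; a sketch that fits one model on the training data and checks it against the held-back validation set (the choice of SVM here is an assumption, not from the deck):

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

model = SVC()                       # SVM picked purely as an example
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(classification_report(Y_validation, predictions))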