Ml programming with python

Machine Learning with Python
Compiled by : Dr. Kumud Kundu

Outline
● The general concepts of machine learning
● The three types of learning and basic terminology
● The building blocks for successfully designing machine learning systems
● Introduction to Pandas, Matplotlib and the sklearn framework
○ For basics of Python refer to (https://www.python.org/)
○ For basics of NumPy refer to (http://www.numpy.org/)
● Simple Program for Plotting Graphs with matplotlib.pyplot
● Coding Template for Analyzing and Visualizing a DataFrame with Pandas
● Simple Program for supervised learning (predictive modelling) with Linear Regression
● Simple Program for unsupervised learning (clustering) with K-Means

Machine Learning
Machine learning is the application and science of algorithms that make sense of data.
Or
Machine learning uses algorithms that take input data, learn from the data and make informed decisions.
Or
Machine learning is the design and implementation of programs that improve with experience.

ML: Giving Computers the Ability to Learn from Data

Machine Learning is…
Automating automation
Getting computers to program themselves
Let the data do the work instead!
[Figure: training data (past) → model/predictor; testing data (future) → model/predictor]

JOURNEY FROM DATA TO PREDICTIONS
“Machine learning is the next Internet”

Traditional Programming Vs. Machine Learning Programming
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program

Machine learning is inherently a multi-disciplinary field
It draws on results from:
Artificial intelligence,
Probability
Statistics
Computational complexity theory
Information theory
Philosophy
Psychology
Neurobiology
and other fields.

Machine Learning
Most machine learning methods work well because of human-designed representations and input features.
ML then becomes just optimizing weights to best make a final prediction.

How Do Machines Learn?
Learning is all about discovering the best parameter values (a, b, c, …) that map input to output.
Or
The main goal of learning is to find out how the output values are calculated from the input (the relationship between output and input), i.e.
machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y): Y = f(X).
The relationship can be linear or non-linear.
The learned parameter values enable the model to output results for new instances based on previously learned ones.
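As a minimal sketch of "discovering the best parameter values", the snippet below fits a straight line Y = a·X + b to a few points with NumPy's least-squares polynomial fit. The data values and the helper name f are made up for illustration and do not come from the slides.

```python
import numpy as np

# Hypothetical training data generated from Y = 2X + 1 (no noise)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

# "Learning": least-squares estimate of the slope a and the intercept b
a, b = np.polyfit(X, Y, deg=1)

def f(x):
    """Learned model: maps a new input to a predicted output."""
    return a * x + b

print(a, b)      # learned parameter values
print(f(10.0))   # prediction for an input not seen during fitting
```

Once a and b are known, f can be applied to any new instance, which is what the last sentence above describes.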

Learning a function from data is a difficult problem, and this is the reason why the field of machine learning and machine learning algorithms exist.
● Error creeps in when predicting output from real-life input data instances (X), i.e. Y = f(X) + e
● This error might arise, for example, from not having enough attributes to sufficiently characterize the best mapping from X to Y.
As an example, a face identification program may judge Subject 1 to be similar to Subject 2 on the basis of intensity profile, though the expected output is Subject 1 with pose.
[Figure: face images labelled Subject 1, Subject 2, and Subject 1 with pose]

The following diagram shows a typical workflow for
using machine learning in predictive modeling:

ML Program
● A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.

Python for Machine Learning Program

Why Python??
Python is one of the most popular programming languages for data science. Thanks to its very active developer and open source community, a large number of useful libraries for scientific computing and machine learning, such as NumPy and SciPy, have been developed.
For machine learning programming tasks, the scikit-learn library, one of the most popular and accessible open source machine learning libraries, will be used.

Python on Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you
to create and share documents that contain live code, equations,
visualizations and narrative text.
The core programming languages supported by Jupyter are Julia, Python
and R.
Use it on Google Colab (colab.research.google.com)
or use Jupyter Notebook with Anaconda
● Anaconda is a Python distribution and package manager.
● The Anaconda installer can be downloaded at https://docs.anaconda.com/anaconda/install/, and an
Anaconda quick start guide is available at https://docs.anaconda.com/anaconda/user-guide/getting-started/.

Key Terms in a Machine Learning Program
● Training example: A row in a table representing the dataset. Synonymous with observation, record, instance, or sample (in most contexts, sample refers to a collection of training examples).
● Training: Model fitting; for parametric models, similar to parameter estimation.
● Feature (x): A column in a data table or data (design) matrix. Synonymous with predictor, variable, input, attribute, or covariate.
● Target (y): Synonymous with outcome, output, response variable, dependent variable, (class) label, and ground truth.
● Loss function / Cost function / Error function: A function that measures the deviation of the predicted output from the expected output.
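The terms above can be tied to array shapes with a short sketch. The numbers, the hand-picked "model" (summing the two features) and the helper name mse_loss are assumptions made only for this illustration; mean squared error is used as one common choice of loss function.

```python
import numpy as np

# Hypothetical dataset: 4 training examples (rows), 2 features (columns)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([3.0, 2.0, 5.0, 6.0])   # target: one value per training example

def mse_loss(y_true, y_pred):
    """Mean squared error: average squared deviation of predictions from targets."""
    return float(np.mean((y_true - y_pred) ** 2))

# A hand-picked (untrained) model that predicts the sum of the two features
y_pred = X.sum(axis=1)

print(X.shape, y.shape)      # (examples, features) and (examples,)
print(mse_loss(y, y_pred))   # deviation of predicted from expected output
```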

Import the Libraries into the Jupyter Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Matplotlib: A Plotting Library for Python
● It makes heavy use of NumPy.
● Importing matplotlib (either form works):
from matplotlib import pyplot as plt
import matplotlib.pyplot as plt
● Examples:
# plotting a bar graph
x = [1, 23, 4, 5, 6, 7]
y = [23, 45, 67, 89, 90, 100]
plt.bar(x, y)
plt.title('Bar Graph')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

# plotting a scatter plot of the same data
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

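For subplots (simultaneous plotting)
● matplotlib.pyplot.subplot(nrows, ncols, index) selects one panel in a grid of plots.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 10, 0.01)
plt.subplot(1, 3, 1)   # 1 row, 3 columns, first panel
plt.plot(x, np.sin(x))
plt.subplot(1, 3, 2)   # second panel
plt.plot(x, np.cos(x))
plt.subplot(1, 3, 3)   # third panel
plt.plot(x, np.sin(2 * x))
plt.show()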

Pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool.
Pandas in data analysis:
Importing Data
Writing to different formats
Pandas Data Structures
Data Exploration
Data Manipulation
Aggregating Data
Merging Data

DataFrame
● A DataFrame is a two-dimensional, labeled data structure whose columns can hold different data types (heterogeneous data).
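A small sketch of what "heterogeneous" means in practice: each column of a DataFrame has its own data type. The records below are invented; the column names only echo the ratings table used elsewhere in these slides.

```python
import pandas as pd

# Hypothetical records: text, integer and floating-point columns side by side
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "nations": ["IND", "AUS", "IND"],
    "rank": [1, 2, 3],
    "rating": [900.5, 880.0, 875.25],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one data type per column
```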

Reading and Writing into DataFrames
● import pandas as pd
● Reading data into a DataFrame using Pandas
○ df = pd.read_csv('File Name')  # from a Comma Separated Values (CSV) file
○ df = pd.read_csv('C:fdpbatsmen_ratings_all091217.csv')
○ df = pd.read_excel('File Name')
● Writing data from DataFrames to files on the system
○ df.to_csv('File Name' or 'Destination path including the file name')
○ df.to_excel('File Name' or 'Destination path including the file name')
● To display all the records of the file: display(df)
● To list the data type of each column:
types = df.dtypes
print(types)
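The read and write calls can be tried without any existing file by writing a small DataFrame to a temporary folder and reading it back. This is only a sketch: the data and the file name example.csv are hypothetical.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"name": ["A", "B"], "rank": [1, 2]})   # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")   # hypothetical file name
    df.to_csv(path, index=False)              # index=False: do not write the row labels
    df_back = pd.read_csv(path)               # read the same file into a new DataFrame

print(df_back)
```

Passing index=False is a design choice here: without it, to_csv also writes the row index, which then comes back as an extra unnamed column when the file is read again.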

Getting a preview of a DataFrame
● To view the top n records of a DataFrame
○ df.head(5)
● To view the bottom n records of a DataFrame
○ df.tail(5)
● To view the column names
○ df.columns
● To get a sub-DataFrame from a DataFrame
○ df['name'] (a single column) , df[['name','nations']] (several columns)

Sub-DataFrame as per Query
To display the records of India with ranking < 50:
display(df[(df['nations'] == "IND") & (df['rank'] < 50)])
Selecting data columns from a dataset by column name:
df[['col1', 'col2']]
With iloc (integer-location) based indexing for selection by position:
df.iloc[:, :-1]   # select all columns except the last one
df.iloc[:, 3:6]   # select all rows of the fourth, fifth and sixth columns (positions 3, 4, 5)
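The query and iloc selections above can be checked on a tiny table. The rows below are invented for illustration; only the column names follow the slides.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "nations": ["IND", "AUS", "IND", "IND"],
    "rank": [10, 20, 70, 45],
    "rating": [800, 780, 650, 700],
})

# Boolean mask: each condition in parentheses, combined with & (element-wise AND)
subset = df[(df["nations"] == "IND") & (df["rank"] < 50)]

features = df.iloc[:, :-1]   # every column except the last
middle = df.iloc[:, 1:3]     # columns at positions 1 and 2 (the end of a slice is exclusive)

print(subset)
print(list(features.columns))
print(list(middle.columns))
```

Positions in iloc are zero-based, so the "fourth" column of a table sits at position 3.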

Drop Columns from a DataFrame using the drop() method
Remove a specific single column:
k.drop(['rate_date'], axis=1)   # axis=1 denotes dropping a column of the dataset
Remove specific multiple columns:
k.drop(['rate_date', 'rating'], axis=1)
Remove columns based on column index:
k.drop(k.columns[[0, 1]], axis=1, inplace=True)
Remove all columns between one specific column and another:
k.drop(k.columns[3:5], axis=1)   # drops the columns at positions 3 and 4
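One point that is easy to miss with drop() is that it returns a new DataFrame unless inplace=True is passed or the result is assigned back. A short sketch with made-up rows:

```python
import pandas as pd

k = pd.DataFrame({
    "name": ["A", "B"],
    "rate_date": ["2017-12-09", "2017-12-09"],
    "rating": [800, 780],
})

dropped = k.drop(["rate_date"], axis=1)   # returns a new DataFrame
print(list(k.columns))                    # the original still has all three columns
print(list(dropped.columns))

k = k.drop(k.columns[[0, 1]], axis=1)     # drop by position and keep the result
print(list(k.columns))
```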

Code for Data Reading, Data Manipulation using Pandas
# importing the data reading and data manipulation library of Python
import pandas as pd
# upload the dataset, because local files are not available on Google Colab
from google.colab import files
upload = files.upload()
# reading the dataset using the read_csv function
df = pd.read_csv('rating.csv')
# to display the column headers of the dataset
df.columns
# to get the number of instances and associated features
df.shape
# to get insight into the data by grouping on one column
df.groupby('nations').size()
# to get a smaller dataset as per the query or subqueries
k = df[(df['nations'] == "IND") & (df['rank'] < 50)]
# to display the smaller subset of data
display(k)
# to drop the desired columns from the smaller set of data
k = k.drop(['name', 'rate_date', 'nations'], axis=1)

scikit-learn (sklearn): Free Machine Learning Library for Python
● It builds on Python's numerical and scientific libraries such as NumPy and SciPy.
● Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset. It can be applied across different types of models (e.g. logistic regression, SVM, KNN).
● The sklearn.model_selection module provides tools for this, e.g. from sklearn.model_selection import train_test_split
● Model parameters are the values that arise as a result of the fit.
Challenge of ML Program
The challenge of applied machine learning is in choosing
a model among a range of different models for your
problem.

Simple Predictive ML Program using Linear Regression Model
● SIMPLE_REGRESSION.ipynb on Google Colab
# importing the data reading and data manipulation library of Python
import pandas as pd
# upload the dataset, because local files are not available on Google Colab
from google.colab import files
upload = files.upload()
# reading the dataset using the read_csv function
df = pd.read_csv('rating.csv.csv')
# for plotting graphs
import matplotlib.pyplot as plt
# dividing the dataset into the feature matrix (X) and the target vector (y)
X = df.iloc[:, :-1].values   # all columns except the last
y = df.iloc[:, -1].values    # the last column

# from the machine learning library of Python (sklearn) import the train_test_split function
from sklearn.model_selection import train_test_split
# X is the feature matrix
# y is the target vector
# train_test_split divides X into X_train and X_test, and y into y_train and y_test;
# one third of the data is held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
X_test.shape
# import the Linear Regression model
from sklearn.linear_model import LinearRegression
# create an instance of the linear regression model
model = LinearRegression()
# find the relationship between input and output with the help of the fit function
model.fit(X_train, y_train)
# use the trained model on the unseen test data, i.e. X_test
y_pred = model.predict(X_test)

Visualization and Evaluation of Results
# visualization of results (assumes X has a single feature column)
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, model.predict(X_train), color='blue')
plt.title('PCM Marks vs Placement_Package (Training set)')
plt.xlabel('PCM Marks')
plt.ylabel('Placement_Package')
plt.show()
# include the numerical calculation library NumPy
import numpy as np
# importing metrics from sklearn to evaluate the predicted result
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

CLUSTERING : Grouping things together
UNSUPERVISED LEARNING

Cluster Analysis : A method of Unsupervised Learning
● Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are
more similar to each other than to those in other groups.
● Cluster analysis gives valuable insights into data by showing which groups the data points fall into when a clustering algorithm is applied.
● Example: to survey the academic performance of high school students, the entire population of a particular board can be divided into different clusters (Excellent Learner, Good Learner, Average Learner and Slow Learner).

K-Means Clustering
● Aims to partition ‘n’ observations into k clusters in which each observation belongs to the
cluster with the nearest mean, serving as a prototype of the cluster.
● K-Means falls under the category of centroid-based clustering.
•n = number of instances
•k = number of clusters
•t = number of iterations

The K-Means clustering algorithm involves the following steps:
● Choose the number of clusters K.
● Randomly select K data points as the initial cluster centers, preferably as far as possible from each other.
○ Calculate the distance between each data point and each cluster center using the given distance function.
○ Assign each data point to the cluster whose center is nearest to it.
○ Re-compute the centers of the newly formed clusters.
○ The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
● Keep repeating the above four steps until any of the following stopping criteria is met:
○ No change in the centers of the newly formed clusters
○ No change in the data points assigned to each cluster
○ The maximum number of iterations is reached
Metric to evaluate the quality of Clusters
● Inertia: the sum of squared distances of all the points within a cluster from the centroid of that cluster, added up over all clusters.
● It tells us how far the points within a cluster are from their centroid.
● These distances should be as low as possible.
from sklearn.cluster import KMeans
● With the K-Means++ algorithm, the step of randomly picking the initial cluster centroids is improved, so that the initial centroids tend to be spread out.
● kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)  # i is the number of clusters being tried
● The elbow method is used to find the optimal number of clusters.

An Elbow Method Algorithm
● The basic idea of the elbow rule is to compute, for a series of K values, the sum of the squared distances between the sample points in each cluster and the centroid of that cluster. This sum of squared errors (SSE) is used as the performance indicator: iterate over the K values and calculate the SSE for each.
● Smaller SSE values indicate that each cluster is more compact.
● SSE tends to fall as K grows, so the K chosen is the one at the "elbow", where the decrease in SSE levels off.
Clustering Example with K-Means

Agglomerative Clustering
● An agglomerative algorithm is a type of hierarchical clustering algorithm in which each individual element to be clustered starts in its own cluster. These clusters are merged iteratively until all the elements belong to one cluster.
● Hierarchical clustering is a powerful technique that allows one to build tree structures from data similarities.
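A brief sketch with sklearn's AgglomerativeClustering, which performs the bottom-up merging and stops when the requested number of clusters is left. The sample points, n_clusters=2 and Ward linkage are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Each point starts as its own cluster; the closest clusters are merged step by step
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

print(labels)
```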

Hierarchical Clustering Example

Applications of Clustering
● Search Engines.
● Spam Detection
● Customer Segmentation
