UNIT 1
INTRODUCTION
S.Revathi AP/CSE 1
Machine Learning: What & Why?
Machine Learning
• Machine Learning is about machines learning automatically from data, without being
explicitly programmed and without direct human intervention.
• The machine learning process starts with feeding the machines good-quality data and
then training them by building machine learning models using
the data and different algorithms.
• The choice of algorithms depends on what type of data we have and what kind of
task we are trying to automate.
Example
• Automatic recommendations on Netflix
• Amazon Prime
• Facebook or LinkedIn
Traditional Programming Vs Machine Learning
• In traditional programming, developers write explicit instructions to tell the
computer how to perform a specific task.
• In machine learning, instead of programming explicit rules, developers feed large amounts of data into
a learning algorithm, which then uses that data to identify patterns and
relationships.
• The algorithm is then able to generalize from the data and make predictions or
decisions on new, unseen data.
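The contrast can be sketched in a few lines of Python. This is only an illustration under assumed data: the temperatures, the labels, and the function names `is_hot_rule` and `learn_threshold` are hypothetical, and the "learner" is deliberately the simplest one possible (the midpoint between two class means).

```python
# Traditional programming: the developer hard-codes the decision rule.
def is_hot_rule(temp_c):
    return temp_c > 30  # threshold picked by hand


# Machine learning: the threshold is estimated from labelled examples.
def learn_threshold(temps, labels):
    hot = [t for t, y in zip(temps, labels) if y == 1]
    cold = [t for t, y in zip(temps, labels) if y == 0]
    # Midpoint between the two class means (a deliberately simple learner)
    return (sum(hot) / len(hot) + sum(cold) / len(cold)) / 2


# Hypothetical training data: temperature in Celsius, label 1 means "hot"
temps = [12, 18, 21, 25, 31, 33, 36, 40]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = learn_threshold(temps, labels)


def is_hot_learned(temp_c):
    return temp_c > threshold


# The same unseen input can be handled differently by the two approaches.
print(threshold, is_hot_rule(29), is_hot_learned(29))
```

If the data changes, only the learned threshold changes; the hand-written rule has to be edited by the developer.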
Types of Machine Learning
This unit covers two main types of Machine Learning:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
1. Supervised Machine Learning
• Supervised learning is the process of providing input data together with the correct output
data to the machine learning model.
• The aim of a supervised learning algorithm is to find a mapping function that maps
the input variable (x) to the output variable (y).
*Labelled data means that the input data is already tagged with the correct output.
How Does Supervised Learning Work?
• In supervised learning, models are trained using a labelled dataset, from which the
model learns about each type of data.
• Once the training process is completed, the model is tested on test
data (a held-out portion of the labelled dataset that was not used for training), and then it predicts the output.
Inputs:
A dataset of different types of shapes, which includes squares,
rectangles, triangles, and other polygons.
Model Training:
• If the given shape has 4 sides, and all the sides are
equal, then it will be labelled as a square.
• If the given shape has 3 sides, then it will be labelled
as a triangle.
• If the given shape has 6 equal sides, then it will be
labelled as a hexagon.
Output:
• When the model encounters a new shape, it classifies the shape
on the basis of its number of sides and predicts the
output.
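The shape example can be sketched as a tiny train-then-test workflow. Everything here is an assumption for illustration: the dataset, the feature encoding `(number_of_sides, all_sides_equal)`, and the helper names `fit` and `predict`. The "model" simply remembers the most frequent label for each feature combination.

```python
from collections import Counter, defaultdict

# Hypothetical labelled dataset: (number_of_sides, all_sides_equal) -> shape name
data = [
    ((3, True), "triangle"), ((3, False), "triangle"),
    ((4, True), "square"), ((4, False), "rectangle"),
    ((6, True), "hexagon"), ((3, False), "triangle"),
    ((4, True), "square"), ((4, False), "rectangle"),
    ((6, True), "hexagon"), ((3, True), "triangle"),
]

# Hold out the last four examples; they are never used for training.
train, test = data[:6], data[6:]


def fit(examples):
    counts = defaultdict(Counter)
    for features, label in examples:
        counts[features][label] += 1
    # For each feature combination, remember the most frequent label
    return {f: c.most_common(1)[0][0] for f, c in counts.items()}


def predict(model, features):
    # Feature combinations never seen during training cannot be classified
    return model.get(features, "unknown")


model = fit(train)
correct = sum(predict(model, f) == y for f, y in test)
accuracy = correct / len(test)
print(accuracy)
```

Measuring accuracy on examples the model has not seen is what makes the result a fair estimate of how it will behave on new shapes.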
Types of supervised Machine learning Algorithms:
a. Regression
• Regression models the relationship between dependent and independent variables.
Regression algorithms therefore help predict continuous variables such as house
prices, market trends, weather patterns, and oil and gas prices.
• Example: Predicting the price of a house given features of the house, such as its
size.
Here,
• Independent variable: Size of the house
• Dependent variable: Price of the house
Types of Regression
• Simple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• These are some of the commonly used types of regression algorithms in machine
learning.
• The choice of the regression algorithm depends on the specific problem, the
nature of the data, and the relationship between the input features and the target
variable.
Simple Linear Regression
• Linear regression is one of the simplest and most fundamental supervised
learning algorithms used for predictive modeling.
• It is used to model the relationship between a dependent variable (target) and one
or more independent variables (features) when the relationship is approximately
linear.
• Consider predicting the salary of an employee based on his/her age.
• We can easily see that there seems to be a correlation between an employee’s age
and salary (the higher the age, the higher the salary).
• The hypothesis of linear regression is: Y = aX + b
• Y represents salary, X is the employee’s age, and a and b are the coefficients of the
equation. So, in order to predict Y (salary) given X (age), we need to know the values
of a and b (the model’s coefficients).
• "a" is the slope of the line, representing how much the target variable changes for a
one-unit change in the input feature.
• "b" is the y-intercept, representing the value of the target variable when the input
feature is zero.
• The algorithm tries to find the values of "a" and "b" that minimize the error
between the predicted values and the actual target values in the training dataset.
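A minimal sketch of how the slope "a" and intercept "b" can be found, assuming the error being minimized is the sum of squared differences (ordinary least squares). The ages and salaries are made-up numbers, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of Y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Hypothetical training data
ages = [25, 30, 35, 40, 45]
salaries = [30000, 36000, 39000, 46000, 49000]

a, b = fit_line(ages, salaries)
predicted = a * 50 + b  # predicted salary for a 50-year-old employee
print(a, b, predicted)
```

Once "a" and "b" are known, prediction for any new age is a single multiplication and addition.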
b. Classification
• Classification algorithms are used when the output variable is categorical, which
means it takes one of two or more classes, such as Yes-No, Male-Female, True-False, etc.
• Unlike regression, the output variable of classification is a category, not a numeric value,
such as "Green or Blue", "fruit or animal", etc.
• Since classification is a supervised learning technique, it
takes labelled input data, which means each input comes with its corresponding
output.
Example
• In the diagram below, there are two classes, Class A and Class B. Items within each class
have features that are similar to each other and dissimilar to those of the other class.
Example
• Iris flower classification is a very popular machine learning project.
• The Iris dataset contains three classes of flowers (Versicolor, Setosa, Virginica),
and each flower is described by 4 features: ‘Sepal length’, ‘Sepal width’, ‘Petal length’,
‘Petal width’.
• The aim of Iris flower classification is to predict the class of a flower based on its
specific features.
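As a rough sketch of classification on Iris-style data, the following uses a 1-nearest-neighbour rule: a new flower gets the class of the closest training flower. The six training rows are illustrative measurements in centimetres (sepal length, sepal width, petal length, petal width), not the full dataset, and the nearest-neighbour choice is an assumption; the slides do not name a specific classifier.

```python
import math

# A handful of illustrative samples: (features, class label)
train = [
    ((5.1, 3.5, 1.4, 0.2), "Setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Virginica"),
    ((5.8, 2.7, 5.1, 1.9), "Virginica"),
]


def predict(sample):
    # Pick the training flower with the smallest Euclidean distance
    nearest = min(train, key=lambda item: math.dist(item[0], sample))
    return nearest[1]


print(predict((5.0, 3.4, 1.5, 0.2)))
print(predict((6.0, 2.9, 4.5, 1.5)))
```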
From this visualization, we can tell that Iris setosa is well separated from the other two flowers.
Iris virginica is the longest flower and Iris setosa is the shortest.
Other Applications of Classification:
• Document classification: A multinomial classification model can be trained to
classify documents in different categories.
• Spam filtering: An algorithm is trained to recognize spam email by learning the
characteristics of what constitutes spam vs non-spam email.
• Image classification: One of the most popular classification problems is image
classification: determining what type of object (or scene) is in a digital image.
2. Unsupervised Machine Learning
• Unsupervised learning is a machine learning technique in which the users do
not need to supervise the model.
• Instead, it allows the model to work on its own to discover patterns and
information that were previously undetected.
• It mainly deals with unlabelled data.
• However, unsupervised learning can be more unpredictable than other
learning methods.
• Unsupervised learning algorithms include clustering, anomaly detection, and some neural
networks (for example, autoencoders).
a. Discovering clusters
“Clustering” is the process of grouping similar entities together. The goal of this
unsupervised machine learning technique is to find similarities among the data points
and group similar data points together.
• The left side of the image
shows uncategorized data.
• On the right side, data has been
grouped into clusters that
consist of similar attributes.
Why Clustering Is Important
• Attributes of unique entities can be profiled more easily. This can subsequently enable
users to sort data and analyze specific groups.
• Clustering enables businesses to approach customer segments differently based on
their attributes and similarities. This helps in maximizing profits.
• It can help in dimensionality reduction if the dataset comprises too many
variables. Irrelevant clusters can be identified more easily and removed from the dataset.
Types of Clustering
• K-Means
• Hierarchical clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Gaussian Mixture Model (GMM)
K-Means Clustering Algorithm Steps:
1. Choose the value of K (the number of desired clusters).
2. Select K number of cluster centroids randomly.
3. Use the Euclidean distance (between centroids and data points) to assign every data
point to the closest cluster.
4. Recalculate the centers of all clusters (as the average of the data points that have been
assigned to each of them).
5. Steps 3-4 should be repeated until there is no further change.
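The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the points, the `seed` parameter, the `max_iter` safety limit, and the helper names `nearest` and `mean_point` are all assumptions, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def nearest(point, centroids):
    # Index of the centroid with the smallest Euclidean distance to the point
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean_point(points):
    # Coordinate-wise average of a list of points
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: pick K random data points
    labels = []
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid
        labels = [nearest(p, centroids) for p in points]
        # Step 4: recompute each centroid as the mean of its members
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centroids.append(mean_point(members) if members else centroids[i])
        # Step 5: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2-D data with two visible groups; step 1 chooses K = 2
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the starting centroids are random, different seeds can lead to different cluster numbering, and on harder data to different clusterings.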
Classification Vs Clustering
b. Discovering Latent Factors
• Discovering latent factors in machine learning refers to the process of identifying
hidden or unobserved variables that capture the underlying structure or patterns in
the data.
• These latent factors are not directly observable but can influence the observed data.
• Latent factor analysis is widely used in various machine learning and statistical
models to explain complex relationships in data and reduce dimensionality.
• A classic example of discovering latent factors is Principal Component Analysis
(PCA).
• Consider a dataset with two features: height and weight of individuals. Each data
point represents a person's height and weight.
• A latent factor here could be body size or overall body proportions, which influence both height and weight.
Such factors are not directly observable but can be inferred from the variations in
the data.
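The height and weight example can be sketched with NumPy by computing PCA through a singular value decomposition of the centered data. The six measurements are hypothetical, and reading the first component as "body size" is an interpretation, not something the algorithm states by itself.

```python
import numpy as np

# Hypothetical measurements for six people
heights_cm = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 185.0])
weights_kg = np.array([50.0, 58.0, 63.0, 68.0, 80.0, 84.0])

X = np.column_stack([heights_cm, weights_kg])
X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data

# Rows of Vt are the principal directions; S holds the singular values
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each component
explained = S**2 / np.sum(S**2)

# One number per person: the position along the first principal direction
latent = X_centered @ Vt[0]

print(explained)
print(latent)
```

When height and weight move together, the first component captures most of the variance, so two observed features can be summarized by one latent value per person.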
c. Discovering Graph Structure
• Discovering Graph Structure in machine learning refers to the process of
identifying the underlying relationships or connections between data points
represented as a graph.
• In a graph, data points are represented as nodes, and the relationships between
them are represented as edges.
• The graph structure can capture complex relationships that may not be easily
discernible in raw data.
• In the context of graphs, unsupervised learning methods aim to discover the graph
structure without using any labeled information about the nodes or edges.
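One very simple way to picture this, offered only as a sketch: connect two data points with an edge whenever they are closer than a chosen cutoff, then read off the connected groups. The points, the `threshold` value, and the helper `connected_components` are assumptions for illustration; real graph-structure learning methods are considerably more involved.

```python
import math
from itertools import combinations

# Hypothetical unlabelled data points (these become the nodes)
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
threshold = 1.5  # assumed cutoff: closer pairs are considered related

# Build an adjacency map: node index -> set of neighbouring node indices
edges = {i: set() for i in range(len(points))}
for i, j in combinations(range(len(points)), 2):
    if math.dist(points[i], points[j]) < threshold:
        edges[i].add(j)
        edges[j].add(i)


def connected_components(adjacency):
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


components = connected_components(edges)
print(edges)
print(components)
```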
Application areas include:
• Social network analysis
• Recommendation systems
• Bioinformatics
• Fraud detection
In these areas, data is naturally organized as a graph, and meaningful patterns can emerge from the
relationships between entities.
d. Matrix Completion
• Matrix completion in machine learning refers to the task of filling in missing or
unknown entries in a partially observed matrix.
• This is a common problem in various applications where data is naturally
represented as a matrix, but some entries are missing or unobserved.
• Matrix completion algorithms aim to estimate these missing values based on the
available data, exploiting patterns and relationships within the matrix.
• The goal is to predict or impute the missing entries to reconstruct the full matrix.
• Applications of matrix completion can be found in different domains, including
recommendation systems, collaborative filtering, image and video inpainting,
sensor networks.
Example: Movie Recommendations
Movie A Movie B Movie C Movie D
User 1 5 3 4
User 2 4 2 1
User 3 2 5 3
User 4 2 4
Each row represents a user and each column represents a movie; some ratings are missing.
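A minimal sketch of filling in such a matrix. The rating values follow the table above, but the positions of the gaps are assumed, since the table layout does not show which movie each user skipped. The estimate used here (user average plus movie average minus overall average, clipped to the range 1 to 5) is a simple baseline chosen for illustration, not a full matrix completion algorithm.

```python
# Rows are users 1-4, columns are movies A-D; None marks a missing rating.
# The placement of the None entries is an assumption.
ratings = [
    [5, 3, None, 4],
    [4, None, 2, 1],
    [None, 2, 5, 3],
    [2, 4, None, None],
]


def mean(values):
    return sum(values) / len(values)


observed = [v for row in ratings for v in row if v is not None]
global_mean = mean(observed)
row_means = [mean([v for v in row if v is not None]) for row in ratings]
col_means = [
    mean([row[j] for row in ratings if row[j] is not None])
    for j in range(len(ratings[0]))
]


def estimate(i, j):
    # Baseline guess: how this user and this movie deviate from the overall mean
    guess = row_means[i] + col_means[j] - global_mean
    return min(5.0, max(1.0, guess))  # keep the estimate on the 1-5 scale


completed = [
    [v if v is not None else round(estimate(i, j), 2) for j, v in enumerate(row)]
    for i, row in enumerate(ratings)
]

for row in completed:
    print(row)
```

Practical systems usually replace this baseline with low-rank factorization, which can capture the patterns shared between similar users and similar movies.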
Supervised Vs Unsupervised Learning
Curse of Dimensionality
• Curse of Dimensionality refers to a set of problems that arise when working with
high-dimensional data.
• The dimension of a dataset corresponds to the number of attributes/features that
exist in a dataset.
• A dataset with a large number of attributes, generally of the order of a hundred or
more, is referred to as high dimensional data.
• Some of the difficulties that come with high dimensional data manifest during
analyzing or visualizing the data to identify patterns, and some manifest while
training machine learning models.
• The difficulties related to training machine learning models on high-dimensional
data are referred to as the ‘Curse of Dimensionality’.
As the number of dimensions (features) in a dataset increases, several problems
emerge:
• Increased Data Sparsity: This means that the available data becomes
increasingly sparse, making it more challenging to find meaningful patterns and
relationships.
• Computational Complexity: As the number of dimensions grows, the
computational resources required for various algorithms also increase
significantly.
• Distance and Similarity Measure: In high-dimensional spaces, the concept of
distance becomes less meaningful. The distance between data points tends to
become more uniform, leading to reduced discrimination between points.
Contd.
• Overfitting: With high-dimensional data, there is an increased risk of
overfitting in machine learning models.
• Curse of Data Visualization: Visualizing data in high-dimensional spaces
becomes extremely challenging for human comprehension, as humans can
only visualize up to three dimensions effectively.
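The point about distances becoming more uniform can be checked with a small experiment. This sketch draws random points in a unit cube and compares the farthest and nearest distance from one query point; the function name `relative_contrast`, the number of points, and the seed are assumptions made for the illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows
for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the nearest and farthest neighbours are almost equally far away, which weakens any method that relies on distance, such as nearest-neighbour search or clustering.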
To mitigate the Curse of Dimensionality, some techniques are used:
• Dimensionality Reduction: Methods like Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbor Embedding (t-SNE) reduce the number of dimensions
while preserving as much of the data's variability or structure as possible.
• Feature Selection: Selecting the most relevant features can help reduce the
dimensionality while retaining essential information.
• Regularization: Applying regularization techniques in machine learning models can
prevent overfitting and improve generalization.
• Sampling: When data is sparse, proper sampling strategies can help make computations
more tractable.
Under Fitting & Over Fitting
• In machine learning, a model’s performance and accuracy are assessed through its prediction
errors.
• Let us consider that we are designing a machine learning model.
• A model is said to be a good machine learning model if it generalizes properly to any new input data
from the problem domain.
• This helps us make predictions about future data that the model has never
seen.
• Now, suppose we want to check how well our machine learning model learns and
generalizes to new data.
• For that, we look at overfitting and underfitting, which are the main causes of
poor performance in machine learning algorithms.
Under Fitting
• Underfitting occurs when the model is too simple or lacks the capacity to capture
the underlying patterns in the data. As a result, the model not only performs
poorly on the training data but also tends to perform poorly on new, unseen data
(testing data).
• Underfitting reduces the accuracy of our machine learning model.
• Its occurrence simply means that our model or algorithm does not fit the data well
enough.
• It usually happens when we have too little data to build an accurate model, or when
we try to fit a linear model to non-linear data.
• In such cases, the model will probably make a lot of wrong predictions.
• Underfitting can often be reduced by using more data and by increasing the capacity of the
model, for example with a more flexible model or more informative features.
Reasons for Underfitting
• The size of the training dataset used is not enough.
• The model is too simple.
• Training data is not cleaned and also contains noise in it.
Over Fitting
• When a model performs very well for training data but has poor performance with
test data (new data), it is known as overfitting.
• In this case, the machine learning model learns the details and noise in the training
data such that it negatively affects the performance of the model on test data.
• Overfitting can happen due to low bias and high variance.
Reasons for Overfitting:
• High variance and low bias.
• The model is too complex.
• The training dataset is too small for the complexity of the model.
Summary
• Overfitting occurs when a model is too complex and memorizes the training data,
while underfitting happens when a model is too simple to capture the underlying
patterns in the data.
• The ideal model finds the right balance between complexity and simplicity, so it
can generalize well to new data and make accurate predictions.
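The trade-off can be explored with a small NumPy experiment that fits polynomials of increasing degree to noisy data. All of it is an assumed setup: the underlying curve (a parabola), the noise level, the seed, and the degrees 1, 2 and 9 are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic relationship plus random noise
x_train = np.linspace(-1, 1, 15)
x_test = np.linspace(-0.95, 0.95, 15)
y_train = x_train**2 + rng.normal(0, 0.1, x_train.size)
y_test = x_test**2 + rng.normal(0, 0.1, x_test.size)


def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


train_err, test_err = {}, {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
    print(degree, train_err[degree], test_err[degree])
```

The straight line (degree 1) is too simple for a curved relationship and has high error on both sets, which is underfitting. The degree 9 polynomial always matches the training points at least as closely as degree 2, but that extra flexibility is spent on fitting noise, so its error on the held-out points is typically no better and often worse.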
Platform for Machine Learning
• In the past, people used to perform Machine Learning tasks by manually
coding all the algorithms and the mathematical and statistical formulas.
• This made the process time-consuming, tedious, and inefficient.
• Today, it has become much easier and more efficient
thanks to various Python libraries, frameworks, and modules.
• Python is now one of the most popular programming languages for this task
and has replaced many other languages in the industry. One of the reasons is its vast
collection of libraries.
Python libraries that are used in Machine Learning are:
• NumPy
• SciPy
• Scikit-learn
• Theano
• TensorFlow
• Keras
• PyTorch
• Pandas
• Matplotlib
a. NumPy
• NumPy is a very popular Python library for processing large multi-dimensional arrays and
matrices, supported by a large collection of high-level mathematical
functions.
• It is very useful for fundamental scientific computations in Machine Learning.
• It is particularly useful for its linear algebra, Fourier transform, and random number
capabilities.
• High-level libraries like TensorFlow interoperate with NumPy arrays when manipulating
tensors.
Example
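A short sketch of the three capabilities mentioned for NumPy: linear algebra, Fourier transforms, and random numbers. The matrix, signal, and seed are arbitrary illustrative values.

```python
import numpy as np

# Linear algebra: solve the system a @ x = b
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])
x = np.linalg.solve(a, b)

# Fourier transform of a short signal
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# Random numbers from a seeded generator (reproducible between runs)
rng = np.random.default_rng(42)
samples = rng.normal(size=3)

print(x)
print(spectrum)
print(samples)
```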
b. SciPy
• SciPy is a very popular library among Machine Learning enthusiasts as it contains
different modules for optimization, linear algebra, integration and statistics.
• SciPy is one of the core packages that make up the SciPy stack. It is also
very useful for image manipulation.
Example: Image Manipulation
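# Note: scipy.misc.imread, imsave and imresize have been removed from recent
# SciPy releases. This version assumes the imageio package for reading and
# writing, and uses scipy.ndimage.zoom for resizing.
import numpy as np
import imageio.v3 as iio
from scipy import ndimage

# Read a JPEG image into a NumPy array (assumed to be an RGB image)
img = iio.imread('D:/Programs/cat.jpg')  # path of the image
print(img.dtype, img.shape)

# Tint the image by scaling the R, G, B channels, then convert back to uint8
img_tint = (img * [1, 0.45, 0.3]).astype(np.uint8)

# Save the tinted image
iio.imwrite('D:/Programs/cat_tinted.jpg', img_tint)

# Resize the tinted image to 300 x 300 pixels (the colour axis is left unchanged)
height, width = img_tint.shape[:2]
img_tint_resize = ndimage.zoom(img_tint, (300 / height, 300 / width, 1), order=1)

# Save the resized tinted image
iio.imwrite('D:/Programs/cat_tinted_resized.jpg', img_tint_resize)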
[Figure: original image, tinted image, and resized tinted image]
c. Scikit-learn
• Scikit-learn is one of the most popular ML libraries for classical ML algorithms.
• It is built on top of two basic Python libraries, NumPy and SciPy.
• Scikit-learn supports most of the supervised and unsupervised learning algorithms.
• Scikit-learn can also be used for data mining and data analysis, which makes it a
great tool for anyone who is starting out with ML.
d. Theano
• Theano is a popular python library that is used to define, evaluate and optimize
mathematical expressions involving multi-dimensional arrays in an efficient
manner.
• This is achieved by optimizing the utilization of the CPU and GPU.
• It also provides extensive unit-testing and self-verification to detect and diagnose
different types of errors.
• Theano is a very powerful library that has been used in large-scale computationally
intensive scientific projects for a long time but is simple and approachable enough
to be used by individuals for their own projects.
e. TensorFlow
• TensorFlow is a very popular open-source library for high-performance numerical
computation, developed by the Google Brain team at Google.
• As the name suggests, TensorFlow is a framework that involves defining and
running computations involving tensors.
• It can train and run deep neural networks that can be used to develop several AI
applications.
• TensorFlow is widely used in the field of deep learning research and application.
f. Keras
• Keras is a very popular Machine Learning library for Python.
• It is a high-level neural networks API capable of running on top of TensorFlow or
Theano.
• It can run seamlessly on both CPU and GPU.
• Keras makes it really easy for ML beginners to build and design a neural network.
g. PyTorch
• PyTorch is a popular open-source Machine Learning library for Python based on
Torch, which is an open-source Machine Learning library.
• It has an extensive choice of tools and libraries that support Computer Vision,
Natural Language Processing (NLP), and many more ML programs.
• It allows developers to perform computations on Tensors with GPU acceleration
and also helps in creating computational graphs.
h. Pandas
• Pandas is a popular Python library for data analysis.
• As we know, the dataset must be prepared before training.
• In this case, Pandas comes in handy, as it was developed specifically for data
extraction and preparation.
• It provides high-level data structures and a wide variety of tools for data analysis.
• It provides many inbuilt methods for grouping, combining and filtering data.
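A brief sketch of filtering, grouping, and combining with Pandas. The table of houses, the column names, and the `zone` lookup table are all made-up values for illustration.

```python
import pandas as pd

# Hypothetical housing data
df = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Madurai", "Madurai", "Salem"],
    "size_sqft": [900, 1200, 800, 1500, 1000],
    "price_lakh": [45.0, 60.0, 30.0, 55.0, 35.0],
})

# Filtering: keep only the larger houses
large = df[df["size_sqft"] >= 1000]

# Grouping: average price per city
avg_price = df.groupby("city")["price_lakh"].mean()

# Combining: attach a second table through the shared "city" column
zones = pd.DataFrame({
    "city": ["Chennai", "Madurai", "Salem"],
    "zone": ["A", "B", "C"],
})
merged = df.merge(zones, on="city", how="left")

print(large)
print(avg_price)
print(merged)
```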
i. Matplotlib
• Matplotlib is a very popular Python library for data visualization.
• It particularly comes in handy when a programmer wants to visualize the patterns in
the data.
• It is a 2D plotting library used for creating 2D graphs and plots.
• A module named pyplot makes plotting easy for programmers, as it provides
features to control line styles, font properties, axis formatting, etc.
• It provides various kinds of graphs and plots for data visualization, such as histograms,
error charts, bar charts, etc.
Data Visualization
Machine Learning and Deep LEarning INTRODUCTION.pptx

  • 2. Machine Learning: What & Why? S.Revathi AP/CSE 2
  • 3. Machine Learning • Machine Learning is all about machines learning automatically without being explicitly programmed or learning without any direct human intervention. • This machine learning process starts with feeding them good quality data and then training the machines by building various machine learning models using the data and different algorithms. • The choice of algorithms depends on what type of data we have and what kind of task we are trying to automate. S.Revathi AP/CSE 3
  • 4. Example • Automatic recommendations on Netflix • Amazon Prime • Facebook or LinkedIn S.Revathi AP/CSE 4
  • 5. Traditional Programming Vs Machine Learning • In traditional programming, developers write explicit instructions to tell the computer how to perform a specific task. • Instead of programming explicit rules, developers feed large amounts of data into a learning algorithm, which then uses that data to identify patterns and relationships. • The algorithm is then able to generalize from the data and make predictions or decisions on new, unseen data. S.Revathi AP/CSE 5
  • 6. Types of Machine Learning There are two different types of Machine Learning: 1. Supervised Machine Learning 2. Unsupervised Machine Learning S.Revathi AP/CSE 6
  • 7. 1. Supervised Machine Learning • Supervised learning is a process of providing input data as well as correct output data to the machine learning model. • The aim of a supervised learning algorithm is to find a mapping function to map the input variable(x) with the output variable(y). *The labelled data means some input data is already tagged with the correct output. S.Revathi AP/CSE 7
  • 8. How Supervised Learning Works? • In supervised learning, models are trained using labelled dataset, where the model learns about each type of data. • Once the training process is completed, the model is tested on the basis of test data (a subset of the training set), and then it predicts the output. Inputs: Dataset of different types of shapes which includes square, rectangle, triangle, and Polygon. Model Training: • If the given shape has 4 sides, and all the sides are equal, then it will be labelled as a Square. • If the given shape has 3 sides, then it will be labelled as a triangle. • If the given shape has 6 equal sides then it will be labelled as hexagon. Output: • If the model finds a new shape, it classifies the shape on the bases of a number of sides, and predicts the output. S.Revathi AP/CSE 8
  • 9. Types of supervised Machine learning Algorithms: S.Revathi AP/CSE 9
  • 10. a. Regression • Regression finds correlations between dependent and independent variables. Therefore, regression algorithms help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices. • Example: Predicting prices of a house given the features of house like size, price etc. Here, • Independent variable: Size of the house • Dependent variable: Price of the house S.Revathi AP/CSE 10
  • 11. Types of Regression • Simple Linear Regression • Polynomial Regression • Support Vector Regression • Decision Tree Regression • Random Forest Regression • These are some of the commonly used types of regression algorithms in machine learning. • The choice of the regression algorithm depends on the specific problem, the nature of the data, and the relationship between the input features and the target variable. S.Revathi AP/CSE 11
  • 12. Simple Linear Regression • Linear regression is one of the simplest and most fundamental supervised learning algorithms used for predictive modeling. • It is used to model the relationship between a dependent variable (target) and one or more independent variables (features) when the relationship is approximately linear. S.Revathi AP/CSE 12
  • 13. • Consider predicting the salary of an employee based on his/her age. • We can easily identify that there seems to be a correlation between employee’s age and salary (more the age more is the salary). • The hypothesis of linear regression is: • Y represents salary, X is employee’s age and a and b are the coefficients of the equation. So in order to predict Y (salary) given X (age), we need to know the values of a and b (the model’s coefficients). • “a" is the slope of the line, representing how much the target variable changes for a one-unit change in the input feature. • "b" is the y-intercept, representing the value of the target variable when the input feature is zero. • The algorithm tries to find the values of “a" and "b" that minimize the error between the predicted values and the actual target values in the training dataset. S.Revathi AP/CSE 13
  • 14. b. Classification • Classification algorithms are used when the output variable is categorical, which means there are two classes such as Yes-No, Male-Female, True-false, etc. • Unlike regression, the output variable of Classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc. • Since the Classification algorithm is a Supervised learning technique, hence it takes labeled input data, which means it contains input with the corresponding output. S.Revathi AP/CSE 14
  • 15. Example • In the below diagram, there are two classes, class A and Class B. These classes have features that are similar to each other and dissimilar to other classes. S.Revathi AP/CSE 15
  • 16. Example • Iris flower classification is a very popular machine learning project. • The iris dataset contains three classes of flowers, Versicolor, Setosa, Virginica, and each class contains 4 features, ‘Sepal length’, ‘Sepal width’, ‘Petal length’, ‘Petal width’. • The aim of the iris flower classification is to predict flowers based on their specific features. S.Revathi AP/CSE 16
  • 17. From the visualization of the features, we can tell that iris-setosa is well separated from the other two species. Iris virginica has the longest petals and iris setosa the shortest.
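As a rough sketch of the iris task, the example below trains a classifier with scikit-learn. The choice of a k-nearest-neighbors model, the 70/30 split, and the sample measurements are assumptions made for illustration; the slides do not prescribe a particular algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # 150 samples, 4 features, 3 species

# Hold out 30% of the labeled samples to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

model = KNeighborsClassifier(n_neighbors=5)  # assumed model choice
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Classify one flower: sepal length, sepal width, petal length, petal width (cm).
sample = [[5.1, 3.5, 1.4, 0.2]]
predicted_name = iris.target_names[model.predict(sample)[0]]
print(f"test accuracy = {accuracy:.2f}, predicted species = {predicted_name}")
```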
  • 18. Other Applications of Classification:
    • Document classification: A multinomial classification model can be trained to classify documents into different categories.
    • Spam filtering: An algorithm is trained to recognize spam email by learning the characteristics that distinguish spam from non-spam email.
    • Image classification: One of the most popular classification problems is determining what type of object (or scene) is in a digital image.
  • 19. 2. Unsupervised Machine Learning
    • Unsupervised learning is a machine learning technique in which the users do not need to supervise the model.
    • Instead, the model works on its own to discover patterns and information that were previously undetected.
    • It mainly deals with unlabelled data.
    • Because there are no labels to check against, its results can be less predictable and harder to evaluate than those of supervised methods.
    • Unsupervised learning algorithms include clustering, anomaly detection, certain neural networks, etc.
  • 20. a. Discovering Clusters
    "Clustering" is the process of grouping similar entities together. The goal of this unsupervised machine learning technique is to find similarities among the data points and group similar data points together.
    • The left side of the image shows uncategorized data.
    • On the right side, the data has been grouped into clusters whose members have similar attributes.
  • 21. Why Clustering is Important
    • Attributes of distinct groups of entities can be profiled more easily. This enables users to sort data and analyze specific groups.
    • Clustering enables businesses to approach customer segments differently based on their attributes and similarities, which helps in maximizing profits.
    • It can help in simplifying a dataset that has too many variables: irrelevant clusters can be identified more easily and removed from the dataset.
  • 22. Types of Clustering
    • K-Means
    • Hierarchical clustering
    • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    • Gaussian Mixture Model (GMM)
  • 23. K-Means Clustering Algorithm
    Steps:
    1. Choose the value of K (the number of desired clusters).
    2. Select K cluster centroids randomly.
    3. Use the Euclidean distance (between centroids and data points) to assign every data point to the closest cluster.
    4. Recalculate the center of each cluster as the average of the data points that have been assigned to it.
    5. Repeat steps 3-4 until the assignments no longer change.
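The five steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions: the function name `k_means`, the `max_iters` safety cap, the choice of initial centroids from the data points, and the two synthetic blobs are all made up for the example.

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (n_samples x n_features) into k groups."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Step 1: choose K = 2 for two synthetic, well-separated blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can give different results on harder datasets; running the algorithm several times and keeping the best result is a common precaution.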
  • 25. b. Discovering Latent Factors
    • Discovering latent factors in machine learning refers to the process of identifying hidden or unobserved variables that capture the underlying structure or patterns in the data.
    • These latent factors are not directly observable but can influence the observed data.
    • Latent factor analysis is widely used in machine learning and statistical models to explain complex relationships in data and to reduce dimensionality.
    • A classic example of discovering latent factors is Principal Component Analysis (PCA).
    • Consider a dataset with two features: the height and weight of individuals. Each data point represents one person's height and weight.
    • A latent factor here could be body size or overall body proportions, which influence both height and weight. Such factors are not directly observable but can be inferred from the variations in the data.
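The height and weight example can be sketched with a small PCA computed from the covariance matrix. The data below is synthetic: a hidden `body_size` variable is assumed to drive both measurements, and all numeric constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
body_size = rng.normal(size=n)                                # hidden factor (assumed)
height = 170 + 8 * body_size + rng.normal(scale=2, size=n)    # cm
weight = 70 + 10 * body_size + rng.normal(scale=3, size=n)    # kg
data = np.column_stack([height, weight])

# Standardize each feature so that units do not dominate the analysis.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# PCA: eigen-decomposition of the covariance matrix, largest eigenvalue first.
cov = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()           # share of variance per component
first_component = standardized @ eigenvectors[:, 0]   # one score per person
correlation = np.corrcoef(first_component, body_size)[0, 1]

print(f"variance explained by first component: {explained[0]:.2f}")
print(f"|correlation| with the hidden factor: {abs(correlation):.2f}")
```

In this setup the first principal component tracks the hidden factor closely, so two observed features can be summarized by a single inferred one. The sign of a principal component is arbitrary, which is why the absolute correlation is reported.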
  • 26. c. Discovering Graph Structure
    • Discovering graph structure in machine learning refers to the process of identifying the underlying relationships or connections between data points represented as a graph.
    • In a graph, data points are represented as nodes, and the relationships between them are represented as edges.
    • The graph structure can capture complex relationships that may not be easily discernible in raw data.
    • In the context of graphs, unsupervised learning methods aim to discover the graph structure without using any labeled information about the nodes or edges.
  • 27. Applications where data is naturally organized as a graph and meaningful patterns can emerge from the relationships between entities:
    • Social network analysis
    • Recommendation systems
    • Bioinformatics
    • Fraud detection
  • 28. d. Matrix Completion
    • Matrix completion in machine learning refers to the task of filling in missing or unknown entries in a partially observed matrix.
    • This is a common problem in applications where data is naturally represented as a matrix, but some entries are missing or unobserved.
    • Matrix completion algorithms aim to estimate these missing values based on the available data, exploiting patterns and relationships within the matrix.
    • The goal is to predict or impute the missing entries to reconstruct the full matrix.
    • Applications of matrix completion can be found in different domains, including recommendation systems, collaborative filtering, image and video inpainting, and sensor networks.
  • 29. Example: Movie Recommendations
    Columns: Movie A, Movie B, Movie C, Movie D
    User 1: 5, 3, 4
    User 2: 4, 2, 1
    User 3: 2, 5, 3
    User 4: 2, 4
    Each row represents a user and each column represents a movie. Every user has at least one missing rating (the ratings above are listed in column order with the blanks omitted).
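One simple way to estimate missing ratings is iterative low-rank imputation: fill the blanks with a guess, approximate the matrix with a truncated SVD, and repeat. The sketch below uses the rating values from the example, but the positions of the missing entries (`np.nan`) are an assumption made for illustration, as are the function name `complete_matrix` and the choice of rank 2.

```python
import numpy as np

def complete_matrix(ratings, rank=2, n_iters=100):
    """Fill NaN entries using repeated rank-`rank` SVD approximations."""
    observed = ~np.isnan(ratings)
    # Start by filling every missing entry with the mean of the known ratings.
    filled = np.where(observed, ratings, np.nanmean(ratings))
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        # Keep the known ratings; update only the missing entries.
        filled = np.where(observed, ratings, low_rank)
    return filled

# Rows are users 1-4, columns are movies A-D.
# The NaN positions below are hypothetical.
ratings = np.array([
    [5.0, 3.0, np.nan, 4.0],
    [4.0, np.nan, 2.0, 1.0],
    [np.nan, 2.0, 5.0, 3.0],
    [2.0, 4.0, np.nan, np.nan],
])

completed = complete_matrix(ratings)
print(np.round(completed, 2))
```

The low-rank assumption encodes the idea that a few latent taste factors explain most ratings; with a matrix this small the estimates are only illustrative.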
  • 30. Supervised Vs Unsupervised Learning
  • 31. Curse of Dimensionality
    • The Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data.
    • The dimension of a dataset corresponds to the number of attributes/features that exist in the dataset.
    • A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high-dimensional data.
    • Some of the difficulties that come with high-dimensional data manifest while analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models.
    • The difficulties related to training machine learning models on high-dimensional data are referred to as the "Curse of Dimensionality".
  • 32. As the number of dimensions (features) in a dataset increases, several problems emerge:
    • Increased Data Sparsity: A fixed number of samples covers an ever smaller fraction of the feature space, so the available data becomes increasingly sparse, making it more challenging to find meaningful patterns and relationships.
    • Computational Complexity: As the number of dimensions grows, the computational resources required by many algorithms also increase significantly.
    • Distance and Similarity Measures: In high-dimensional spaces, the concept of distance becomes less meaningful. The distances between data points tend to become more uniform, leading to reduced discrimination between points.
  • 33. Ctd…
    • Overfitting: With high-dimensional data, there is an increased risk of overfitting in machine learning models.
    • Curse of Data Visualization: Visualizing data in high-dimensional spaces becomes extremely challenging, as humans can only visualize up to three dimensions effectively.
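The claim that distances become more uniform can be checked with a small experiment. The sketch below draws random points in a unit hypercube and measures how much farther the farthest point is than the nearest one, relative to the nearest; the function name `relative_contrast`, the uniform distribution, and the sample size are assumptions for illustration.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(max distance - min distance) / min distance from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

As the number of dimensions grows, the contrast shrinks toward zero, meaning the nearest and farthest neighbors are almost equally far away. This is one reason distance-based methods such as k-nearest neighbors or K-Means can degrade on high-dimensional data.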
  • 34. To mitigate the Curse of Dimensionality, some techniques are used:
    • Dimensionality Reduction: Methods like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) reduce the number of dimensions while preserving as much of the data's structure as possible (PCA preserves variance; t-SNE preserves local neighborhoods).
    • Feature Selection: Selecting the most relevant features can help reduce the dimensionality while retaining essential information.
    • Regularization: Applying regularization techniques in machine learning models can prevent overfitting and improve generalization.
    • Sampling: When data is sparse, proper sampling strategies can help make computations more tractable.
  • 35. Under Fitting & Over Fitting
    • In machine learning, a model's performance and accuracy are measured in terms of its prediction errors.
    • Let us consider that we are designing a machine learning model.
    • A model is said to be a good machine learning model if it generalizes properly to any new input data from the problem domain.
    • This allows us to make predictions about future data that the model has never seen.
    • Now, suppose we want to check how well our machine learning model learns and generalizes to new data.
    • For that, we look at overfitting and underfitting, which are the main causes of poor performance in machine learning algorithms.
  • 36. Under Fitting
    • Underfitting occurs when the model is too simple or lacks the capacity to capture the underlying patterns in the data. As a result, the model not only performs poorly on the training data but also tends to perform poorly on new, unseen data (testing data).
    • Underfitting degrades the accuracy of our machine learning model.
    • Its occurrence simply means that our model or algorithm does not fit the data well enough.
    • It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data.
    • In such cases, the model will probably make a lot of wrong predictions.
    • Underfitting can be reduced by using more data, increasing model complexity, adding relevant features, or reducing regularization.
  • 37. Reasons for Underfitting
    • The size of the training dataset used is not enough.
    • The model is too simple.
    • Training data is not cleaned and contains noise.
  • 38. Over Fitting
    • When a model performs very well on training data but poorly on test data (new data), it is known as overfitting.
    • In this case, the machine learning model learns the details and noise in the training data to such an extent that it negatively affects the performance of the model on test data.
    • Overfitting is associated with low bias and high variance.
  • 39. Reasons for Overfitting:
    • High variance and low bias.
    • The model is too complex.
    • The training dataset is too small for the complexity of the model.
  • 40. Summary
    • Overfitting occurs when a model is too complex and memorizes the training data, while underfitting happens when a model is too simple to capture the underlying patterns in the data.
    • The ideal model finds the right balance between complexity and simplicity, so it can generalize well to new data and make accurate predictions.
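The contrast between underfitting and overfitting can be sketched by fitting polynomials of increasing degree to noisy samples of a sine curve. Everything here is an assumption for illustration: the sine target, the noise level, the sample sizes, and the degrees 1, 3, and 15.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on the interval [0, 1]."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # unseen data

def mse(model, x, y):
    """Mean squared prediction error of `model` on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (1, 3, 15):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {errors[degree][0]:.4f}, "
          f"test MSE = {errors[degree][1]:.4f}")
```

The straight line (degree 1) is too simple, so both errors stay high. The training error always falls as the degree rises, but a degree-15 polynomial fitted to 20 points typically chases the noise, so its test error tends to be much worse than its training error. A moderate degree usually gives the best balance.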
  • 41. Platform for Machine Learning
    • In the past, people performed machine learning tasks by manually coding all the algorithms and the mathematical and statistical formulas.
    • This made the process time-consuming, tedious, and inefficient.
    • Today it has become much easier and more efficient, thanks to various Python libraries, frameworks, and modules.
    • Python is now one of the most popular programming languages for this task and has replaced many other languages in the industry; one of the reasons is its vast collection of libraries.
  • 43. Python libraries that are used in Machine Learning are:
    • NumPy
    • SciPy
    • Scikit-learn
    • Theano
    • TensorFlow
    • Keras
    • PyTorch
    • Pandas
    • Matplotlib
  • 44. a. NumPy
    • NumPy is a very popular Python library for processing large multi-dimensional arrays and matrices, supported by a large collection of high-level mathematical functions.
    • It is very useful for fundamental scientific computations in Machine Learning.
    • It is particularly useful for its linear algebra, Fourier transform, and random number capabilities.
    • High-level libraries such as TensorFlow interoperate closely with NumPy arrays when manipulating tensors.
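A short sketch of the three capabilities mentioned above (linear algebra, Fourier transform, random numbers). The matrix, the 3 Hz signal, and the seed are arbitrary values chosen for illustration.

```python
import numpy as np

# Linear algebra: solve the system  matrix @ x = vector.
matrix = np.array([[2.0, 1.0], [1.0, 3.0]])
vector = np.array([1.0, 2.0])
solution = np.linalg.solve(matrix, vector)

# Fourier transform: a 3 Hz sine wave sampled at 32 Hz for one second.
t = np.arange(32) / 32
spectrum = np.abs(np.fft.rfft(np.sin(2 * np.pi * 3 * t)))
dominant_frequency = int(np.argmax(spectrum))  # index of the strongest frequency bin

# Random numbers: reproducible samples from a seeded generator.
rng = np.random.default_rng(0)
samples = rng.normal(size=(3, 4))

print(solution, dominant_frequency, samples.shape)
```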
  • 46. b. SciPy
    • SciPy is a very popular library among Machine Learning enthusiasts as it contains different modules for optimization, linear algebra, integration, and statistics.
    • The SciPy library is one of the core packages that make up the SciPy stack. SciPy is also very useful for image manipulation.
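  • 47. Example: Image Manipulation
    # scipy.misc.imread, imsave and imresize are no longer available in current
    # SciPy releases, so Pillow handles file I/O and scipy.ndimage does the resizing.
    import numpy as np
    from PIL import Image
    from scipy import ndimage

    # Read a JPEG image into a NumPy array (assumed to be an RGB image)
    img = np.asarray(Image.open('D:/Programs/cat.jpg'))  # path of the image
    print(img.dtype, img.shape)

    # Tint the image by scaling the red, green and blue channels
    img_tint = (img * [1, 0.45, 0.3]).astype(np.uint8)

    # Save the tinted image
    Image.fromarray(img_tint).save('D:/Programs/cat_tinted.jpg')

    # Resize the tinted image to 300 x 300 pixels
    zoom_factors = (300 / img_tint.shape[0], 300 / img_tint.shape[1], 1)
    img_tint_resize = ndimage.zoom(img_tint, zoom_factors, order=1)

    # Save the resized tinted image
    Image.fromarray(img_tint_resize).save('D:/Programs/cat_tinted_resized.jpg')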
  • 48. [Figure: original image, tinted image, and resized tinted image]
  • 49. c. Scikit-learn
    • Scikit-learn is one of the most popular ML libraries for classical ML algorithms.
    • It is built on top of two basic Python libraries, NumPy and SciPy.
    • Scikit-learn supports most of the supervised and unsupervised learning algorithms.
    • Scikit-learn can also be used for data mining and data analysis, which makes it a great tool for anyone who is starting out with ML.
  • 50. d. Theano
    • Theano is a popular Python library that is used to define, evaluate, and optimize mathematical expressions involving multi-dimensional arrays in an efficient manner.
    • It achieves this by optimizing the utilization of the CPU and GPU.
    • It makes extensive use of unit testing and self-verification to detect and diagnose different types of errors.
    • Theano is a very powerful library that has been used in large-scale, computationally intensive scientific projects for a long time, but it is simple and approachable enough to be used by individuals for their own projects.
  • 51. e. TensorFlow
    • TensorFlow is a very popular open-source library for high-performance numerical computation, developed by the Google Brain team at Google.
    • As the name suggests, TensorFlow is a framework for defining and running computations involving tensors.
    • It can train and run deep neural networks that can be used to develop a variety of AI applications.
    • TensorFlow is widely used in deep learning research and applications.
  • 52. f. Keras
    • Keras is a very popular Machine Learning library for Python.
    • It is a high-level neural networks API capable of running on top of TensorFlow or Theano.
    • It can run seamlessly on both CPU and GPU.
    • Keras makes it really easy for ML beginners to build and design a neural network.
  • 53. g. PyTorch
    • PyTorch is a popular open-source Machine Learning library for Python based on Torch, which is itself an open-source Machine Learning library.
    • It has an extensive choice of tools and libraries that support Computer Vision, Natural Language Processing (NLP), and many more ML applications.
    • It allows developers to perform computations on tensors with GPU acceleration and also helps in creating computational graphs.
  • 54. h. Pandas
    • Pandas is a popular Python library for data analysis.
    • A dataset must be prepared before it can be used for training.
    • This is where Pandas comes in handy, as it was developed specifically for data extraction and preparation.
    • It provides high-level data structures and a wide variety of tools for data analysis.
    • It provides many built-in methods for grouping, combining, and filtering data.
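A brief sketch of the grouping, combining, and filtering operations mentioned above. The two small tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical employee and department tables.
employees = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dee"],
    "dept_id": [1, 2, 1, 2],
    "salary": [50, 60, 70, 80],
})
departments = pd.DataFrame({"dept_id": [1, 2], "dept": ["Sales", "Research"]})

# Combining: join the two tables on their shared key column.
merged = employees.merge(departments, on="dept_id")

# Grouping: average salary per department.
mean_salary = merged.groupby("dept")["salary"].mean()

# Filtering: keep only the rows that satisfy a condition.
high_earners = merged[merged["salary"] > 55]

print(mean_salary)
print(high_earners[["name", "dept", "salary"]])
```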
  • 55. i. Matplotlib
    • Matplotlib is a very popular Python library for data visualization.
    • It particularly comes in handy when a programmer wants to visualize the patterns in the data.
    • It is a 2D plotting library used for creating 2D graphs and plots.
    • A module named pyplot makes plotting easy for programmers, as it provides features to control line styles, font properties, axis formatting, etc.
    • It provides various kinds of graphs and plots for data visualization, such as histograms, error charts, bar charts, etc.
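A small pyplot sketch that draws a histogram and a bar chart side by side. The data values, colors, and figure size are arbitrary; the non-interactive "Agg" backend and the in-memory buffer are assumptions so that the script runs without a display, and in an interactive session `plt.show()` could be used instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)  # synthetic data for the histogram

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

ax_hist.hist(values, bins=20, color="steelblue")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("value")

ax_bar.bar(["A", "B", "C"], [3, 7, 5], color="darkorange")
ax_bar.set_title("Bar chart")

fig.tight_layout()

# Render the figure to PNG bytes in memory; pass a file name to savefig
# instead to write it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```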