1. MACHINE LEARNING
(22ISE62)
Module 1 & 2
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering
2. Course Outcomes
After completion of the course, students will be able to:
22ISE62.1: Describe the machine learning techniques, their types and data analysis framework.
22ISE62.2: Apply mathematical concepts for feature engineering and perform dimensionality
reduction to enhance model performance.
22ISE62.3: Develop similarity-based learning models and regression models for solving
classification and prediction tasks.
22ISE62.4: Build probabilistic learning models and design neural network models using perceptron
and multilayer architectures.
22ISE62.5: Utilize clustering algorithms to identify patterns in data and implement reinforcement
learning techniques.
Text Books:
1. S. Sridhar and M. Vijayalakshmi, "Machine Learning", Oxford University Press, 2021, First Edition.
2. M. N. Murty and V. S. Ananthanarayana, "Machine Learning: Theory and Practice", Universities Press, 2024.
3. T. M. Mitchell, "Machine Learning", McGraw Hill, 1997.
4. Andriy Burkov, "The Hundred-Page Machine Learning Book", Andriy Burkov, 2019.
3. Module -1: Machine Learning
What is Artificial Intelligence (AI)?
AI is technology that enables computers and machines to simulate human learning, comprehension,
problem solving, decision making, creativity and autonomy.
AI can learn from experience, adapt to new inputs, and perform tasks like recognizing images,
understanding speech, and playing games etc.
Types of AI
Natural language processing (NLP): Teaches computers to understand and communicate in human
language. This allows computers to answer questions and have conversations.
Machine learning: The process of training a piece of software, called a model, to make useful
predictions or generate content from data.
Deep learning: A type of machine learning that uses artificial neural networks to learn from data.
Artificial neural networks are inspired by the human brain, and they can be used to solve a wide
variety of problems, including image recognition, natural language processing, and speech
recognition.
Neural networks: A neural network is a type of artificial intelligence for information processing that
imitates the human brain. A neural network consists of connected units or nodes called artificial
neurons, which loosely model the neurons in the brain.
4. Machine Learning
• Machine learning (ML) is a type of Artificial Intelligence (AI) that allows
computers to learn from data and improve over time.
• It uses algorithms to analyze data, identify patterns, and make predictions.
How does ML work?
• ML systems are trained by feeding them large amounts of data.
• The systems learn from the data and improve their performance over time.
• The systems can make informed decisions based on the data.
Benefits of ML
ML can help companies make informed decisions, streamline operations,
and reduce workloads.
It can also help researchers analyze data, identify patterns, and predict
outcomes.
5. Need for Machine Learning
• Machine learning is important because it allows computers to learn and improve
without being explicitly programmed.
• This makes it useful for many tasks, including complex decision-making, analyzing large
amounts of data, and adapting to new situations.
Machine learning has become popular for three reasons:
1. High volume of available data to manage: Big companies such as Facebook, Twitter, and
YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated
that the data approximately doubles every year.
2. The cost of storage has reduced: The hardware cost has also dropped. Therefore, it is
easier now to capture, process, store, distribute, and transmit the digital information.
3. Third reason for popularity of machine learning is the availability of complex algorithms
now. Especially with the advent of deep learning, many algorithms are available for
machine learning.
6. Conti..
• It has become a dominant technology trend now.
• Let us first establish these terms: data, information, knowledge, intelligence, and wisdom.
Wisdom: The ability to use our knowledge and experience to make good decisions and judgments.
Figure 1: ML Knowledge Pyramid – data (mostly available as raw facts and symbols) → information (processed data) → knowledge (condensed information) → intelligence (applied knowledge) → wisdom.
7. Conti…
• An actionable form of knowledge is called intelligence.
• Computer systems have been successful up to this stage (intelligence).
• The ultimate objective of the knowledge pyramid is wisdom, which represents a maturity of mind that is, so far, exhibited only by humans.
• Here comes the need for machine learning.
• The objective of machine learning is to process these archival data for
organizations to take better decisions to design new products, improve the
business processes, and to develop effective decision support systems.
8. Machine Learning Explained
• “Machine learning is the field of study that gives computers the ability to learn without being
explicitly programmed.” – Arthur Samuel
• In conventional programming, after understanding the problem, a detailed design of the
program such as a flowchart or an algorithm needs to be created and converted into programs
using a suitable programming language.
• This approach could be difficult for many real-world problems such as puzzles, games, and
complex image recognition applications.
• Initially, AI aimed to understand these problems and develop general-purpose rules manually.
• Then, these rules are formulated into logic and implemented in a program to create intelligent
systems.
• This idea of developing intelligent systems by using logic and reasoning by converting an
expert’s knowledge into a set of rules and programs is called an expert system.
• An expert system like MYCIN was designed for medical diagnosis after converting the expert
knowledge of many doctors into a system.
• However, this approach did not progress much as programs lacked real intelligence. The word
MYCIN is derived from the fact that most of the antibiotics’ names end with ‘mycin’.
9. Conti..
• The focus of AI is to develop intelligent systems by using a data-driven approach, where data is used as an
input to develop intelligent models.
• The models can then be used to make predictions on new inputs.
• Thus, the aim of machine learning is to learn a model or set of rules from the given dataset automatically
so that it can predict the unknown data correctly.
• As humans take decisions based on experience, computers make models based on the patterns extracted
from the input data and then use these models for prediction and decision making.
• For computers, the learnt model is equivalent to human experience.
Figure 1.2: (a) A learning system for humans: experience → learning → decision. (b) A learning system for machine learning: data (from a database) → learning program → model.
10. Conti..
• Often, the quality of data determines the quality of experience and, therefore, the
quality of the learning system.
• In statistical learning, the relationship between the input x and output y is modeled as a
function in the form y = f(x). Here, f is the learning function that maps the input x to
output y.
• Learning of function f is the crucial aspect of forming a model in statistical learning.
• In machine learning, this is simply called mapping of input to output.
• The learning program summarizes the raw data in a model.
A model is an explicit description of patterns within the data in the form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
11. Conti..
In systems, experience is gathered by these steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is
used to generate concepts. This is equivalent to humans' idea of objects; for example, we
have some idea of what an elephant looks like.
3. Generalization converts the abstraction into an actionable form of intelligence. It can be
viewed as ordering of all possible concepts. So, generalization involves ranking of
concepts, inferencing from them and formation of heuristics, an actionable aspect of
intelligence. Heuristics are educated guesses for all tasks.
4. Heuristics (problem solving) normally work, but occasionally they may fail. That is not
the fault of the heuristic, as it is just a 'rule of thumb'; course correction is done by taking
evaluation measures.
12. Machine Learning in Relation to other Fields
• Machine learning uses the concepts of Artificial Intelligence, Data Science, and Statistics
primarily.
• It is the resultant of combined ideas of diverse fields.
1. Machine Learning and Artificial Intelligence
• The aim of AI is to develop intelligent agents. An agent can be a robot, a human, or any
autonomous system.
• The focus was on logic and logical inferences.
• It had seen many ups and downs.
• These down periods were called AI winters.
• The recovery in AI happened due to development of data driven systems.
• The aim is to find relations and regularities present in the data.
• It is a broad field that includes learning from examples and other areas like reinforcement
learning.
13. Conti…
Figure 1.3: Relationship of AI with Machine Learning
• Deep learning is a sub-branch of machine learning.
• In deep learning, the models are constructed using neural network technology.
• Neural networks are based on the human neuron models.
• Many neurons form a network, connected through activation functions that
trigger further neurons to perform tasks.
(Nested circles: deep learning within machine learning within AI.)
14. Conti…
2. Machine Learning, Data Science, Data Mining, and Data Analytics
Data Science:
• Data science is the study of data to derive meaningful insights, using a variety of methods from different fields.
• It is a multidisciplinary field that combines statistics, computer engineering, mathematics, and artificial
intelligence.
• Machine learning starts with data.
• Therefore, data science and machine learning are interlinked.
• Machine learning is a branch of data science.
• Data science deals with gathering of data for analysis.
It is a broad field that includes:
Big Data:
Data science is concerned with the collection of data. Big data is a field of data science that deals with data having the following
characteristics:
1. Volume: Huge amount of data is generated by big companies like Facebook, Twitter, YouTube.
2. Variety: Data is available in variety of forms like images, videos, and in different formats.
3. Velocity: It refers to the speed at which the data is generated and processed.
Big data is used by many machine learning algorithms for applications such as language translation and image
recognition.
15. Conti…
Data Mining:
The process of extracting knowledge or insights from large amounts of data using various statistical and
computational techniques.
Data Analytics:
• Another branch of data science is data analytics. It aims to extract useful knowledge from crude data.
There are different types of analytics; predictive data analytics is used for making predictions.
Pattern Recognition: It uses machine learning algorithms to extract features for pattern analysis and
pattern classification. A pattern is a set of items arranged in a sequence such that they are related to
each other by a specific rule.
Figure 1.4: Relationship of machine learning with other major fields – data mining, data analytics, pattern recognition, and big data, all within data science.
16. Conti…
3. Machine Learning and Statistics
• Statistics is a branch of mathematics that deals with collecting, organizing, and analyzing numerical data to
solve real-world problems, and it has a solid theoretical foundation for statistical learning.
• Both statistics and machine learning look for regularities in data, called patterns, but they differ in
approach:
• Initially, statistics sets a hypothesis and performs experiments to verify and validate the
hypothesis in order to find relationships among data.
• Statistics requires knowledge of the statistical procedures and the guidance of a good
statistician.
• It is mathematics intensive and models are often complicated equations and involve many
assumptions.
• Machine learning, comparatively, makes fewer assumptions and requires less statistical knowledge.
• But, it often requires interaction with various tools to automate the process of learning.
17. TYPES OF MACHINE LEARNING
Figure 1.5: Types of machine learning –
• Supervised learning: classification, regression
• Unsupervised learning: cluster analysis, association mining, dimension reduction
• Semi-supervised learning
• Reinforcement learning
18. Conti…
Labelled Data
• Raw data that has been assigned tags or labels to provide context and meaning.
• It's a foundation of machine learning models and is used in supervised learning.
• Accuracy: High-quality labeled data can help machine learning models make more
accurate predictions.
• Context: Labels provide context and categorization for machine learning models.
Labelled data vs. unlabelled data:
• Use – supervised learning vs. unsupervised learning
• Examples – photos with tags like "cat" or "car" vs. photos without tags
• How it's created – annotations are added by humans or experts vs. collected by observing and recording
• Ease of use – more difficult to acquire and store vs. easier to acquire and store
19. Conti…
Supervised Learning
• Supervised algorithms use labelled dataset.
• As the name suggests, there is a supervisor or teacher component in supervised learning.
• A supervisor provides labelled data with which the model is constructed and then tested on test data.
• The algorithm is trained on labeled data that specifies both the input and output.
• The algorithm learns the relationship between the input and output data.
• The algorithm creates a model that can predict correct outputs on new data.
Examples of supervised learning
• Image recognition: A supervised learning model can be trained to recognize handwritten
numbers by analyzing clusters of pixels and shapes.
• Weather forecasting: A supervised model can predict flight times based on weather conditions,
airport traffic, and more.
• Stock price forecasting: A supervised model can predict stock prices based on various factors.
Supervised learning has two methods:
1. Classification
2. Regression
20. Conti…
Classification
• A supervised learning method.
• The input attributes of the classification algorithms are
called independent variables.
• The target attribute is called label or dependent variable.
• The relationship between the input and target variable is
represented in the form of a structure which is
called a classification model.
• In classification, learning takes place in two stages.
• During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training-set samples are processed, the model is generated.
• In the second stage, the constructed model is tested with a test (unknown) sample, which is assigned a label.
Figure 1.7: Data classification
21. Conti…
Some of the key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks like CNN
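To make the two-stage training/testing process concrete, here is a minimal sketch using scikit-learn's decision tree classifier (an assumption: scikit-learn is installed; the iris dataset simply stands in for any labelled dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stage 1 (training): learn a model from labelled samples.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2 (testing): the constructed model assigns labels to unseen samples.
print("Test accuracy:", model.score(X_test, y_test))
```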
22. Conti…
Regression models:
• Regression in machine learning consists of mathematical methods that allow data scientists to predict a
continuous outcome (y) based on the value of one or more predictor variables (x).
• The regression model takes input x and generates a model in the form of a fitted line of the form y = f(x).
• The main difference is that regression models predict continuous variables such as product price, while
classification concentrates on assigning labels such as class.
Linear Regression: Y = a + bX
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable), and
a and b are the linear coefficients.
Two types of variables are present in regression:
• Dependent variable (target): the variable we are trying to predict, e.g., house price.
• Independent variables (features): the input variables that influence the prediction, e.g., locality, number of rooms.
Figure 1.8: Regression model
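As a hedged illustration of fitting the line y = a + bX, the following sketch uses NumPy's least-squares polynomial fit on hypothetical house-price data (all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical data: price in thousands (y) vs. number of rooms (x).
x = np.array([2, 3, 4, 5, 6])
y = np.array([150, 200, 240, 300, 350])

# Least-squares fit of y = a + b*x; polyfit returns [b, a] (highest degree first).
b, a = np.polyfit(x, y, deg=1)
print(f"y = {a:.1f} + {b:.1f}x")

# Predict a continuous outcome for a new input.
print("Predicted price for 7 rooms:", a + b * 7)
```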
23. Unsupervised Learning
• Unsupervised learning algorithms are tasked with finding patterns and relationships within the data
without any prior knowledge of the data’s meaning.
• A process that uses data to find patterns and structures without human intervention.
• It deals with unlabelled data.
• It can help identify natural groupings in data.
• It finds hidden patterns in data without any human intervention, i.e., we do not give outputs to our model.
• The training model has only input parameter values and discovers the groups or patterns on its own.
Figure 1.9: Unsupervised ML processing
24. Conti…
Unsupervised Learning Algorithms
There are mainly three types of algorithms used on unsupervised datasets:
1. Clustering
2. Association Rule Learning
3. Dimensionality Reduction
Clustering Analysis
• The process of grouping unlabeled data into clusters based on their similarities.
• The goal of clustering is to identify patterns and relationships in the data without any prior knowledge of
the data’s meaning.
• It aims to group objects into disjoint clusters or groups.
• Cluster analysis groups objects based on their attributes.
• All the data objects of the partitions are similar in some aspect and vary from the data objects in the
other partitions significantly.
• Some of the examples of clustering processes are — segmentation of a region of interest in an image,
detection of abnormal growth in a medical image, and determining clusters of signatures in a gene
database.
25. Conti…
Some of the key clustering algorithms are:
• k-means algorithm
• Hierarchical algorithms
Figure 1.10: Example of clustering in ML
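A minimal k-means sketch (assuming scikit-learn is installed; the 2-D points are hypothetical) showing how the algorithm discovers groups without being given any labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points; no output labels are supplied to the model.
X = np.array([[1, 1], [1.5, 2], [1, 1.5],
              [8, 8], [8.5, 9], [9, 8]])

# Ask k-means to partition the data into k = 2 disjoint clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:", km.labels_)
print("Cluster centres:", km.cluster_centers_)
```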
26. Conti…
Association rule
• In unsupervised learning, association rule mining is a technique that discovers interesting
relationships or patterns (called rules) between data items in large datasets, like "if X, then Y,"
without prior knowledge or labeled data.
• Association rule mining is a type of unsupervised learning that uses a rule-based approach to
discover interesting relationships between features in a dataset.
• Association rule learning is an example of unsupervised learning; it is used in market basket
analysis, intrusion detection, web usage mining, etc. A numeric sketch follows the applications list below.
Applications:
• Retail: Analyzing customer behavior, optimizing product placement, and suggesting related
products.
• Healthcare: Identifying correlations between symptoms and diseases, and predicting patient
outcomes.
• Finance: Detecting fraudulent transactions and identifying patterns in financial data.
• Social Media: Analyzing user interactions and trends.
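The two quantities behind association rules are support (how often an itemset occurs) and confidence (how often the rule "if X, then Y" holds). A hand-rolled sketch over hypothetical market-basket transactions:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # How often y appears in transactions that contain x.
    return support(x | y) / support(x)

print("support({bread, milk}) =", support({"bread", "milk"}))       # 0.5
print("conf(bread -> milk)   =", confidence({"bread"}, {"milk"}))   # ~0.67
```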
27. Dimensionality reduction
• Dimensionality reduction is the process of reducing the number of features (or dimensions) in a
dataset while retaining as much information as possible.
• While working with machine learning models, we often encounter datasets with a large number
of features.
• These datasets can lead to problems such as increased computation time and overfitting.
• To address these issues, we use dimensionality reduction techniques.
Figure 1.11: Representation of feature reduction
Supervised vs. unsupervised learning:
1. Data – uses labeled datasets, where each input is paired with a corresponding output label vs. works with unlabeled data, aiming to uncover hidden patterns or structures within the dataset.
2. Goal – predicts outcomes or classifies data based on known labels vs. discovers hidden patterns, structures, or groupings in data.
3. Complexity – less complex, as the model learns from labeled data with clear guidance vs. more complex, as the model must find patterns without any guidance.
28. Semi-supervised Learning
• A machine learning approach that utilizes both labeled and unlabeled data to train models, bridging the
gap between supervised and unsupervised learning, and often achieving better results than using only
labeled data.
• The dataset has a huge collection of unlabelled data and some labelled data.
• Method that uses a small amount of labeled data and a large amount of unlabeled data to train a model.
• The goal of semi-supervised learning is to learn a function that can accurately predict the output variable
based on the input variables, similar to supervised learning.
Figure 1.13: Semi-supervised learning
Examples of Applications:
Anomaly Detection: Identifying unusual
patterns or observations.
Speech Analysis: Labeling audio files.
Internet Content Classification: Categorizing
webpages.
Protein Sequence Classification: Analyzing
DNA strands.
29. Reinforcement Learning
• A machine learning paradigm where an agent learns to make decisions and take actions in an environment
to maximize a reward signal, similar to how humans learn through trial and error.
• A machine learning technique that teaches an agent to make decisions by rewarding correct actions and
punishing incorrect actions.
• The agent can be human, animal, robot, or any independent program. The rewards enable the agent to
gain experience.
• It's used in many fields, including robotics, gaming, and self-driving vehicles.
• The agent aims to maximize the reward.
• The reward can be positive or negative (Punishment). When the rewards are more, the behaviour gets
reinforced and learning becomes possible.
Figure 1.13: Reinforcement Learning agent and rewards
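A toy sketch of reward-driven learning: an epsilon-greedy agent on a two-armed bandit (a deliberately simple special case of reinforcement learning; the payout probabilities are invented for illustration):

```python
import random

true_means = [0.3, 0.7]   # reward probabilities, unknown to the agent
q = [0.0, 0.0]            # the agent's estimated action values
counts = [0, 0]
epsilon = 0.1             # exploration rate

for _ in range(1000):
    # Explore occasionally; otherwise exploit the best-known action.
    action = random.randrange(2) if random.random() < epsilon else q.index(max(q))
    reward = 1 if random.random() < true_means[action] else 0
    counts[action] += 1
    # Incremental average: the estimates improve with experience (reward feedback).
    q[action] += (reward - q[action]) / counts[action]

print("Estimated action values:", q)   # should approach [0.3, 0.7]
```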
30. Challenges of Machine Learning
• Computers are better than humans in performing tasks like computation.
• For example, while calculating the square root of large numbers, an average human may blink but
computers can display the result in seconds. Computers can play games like chess, GO, and even beat
professional players of that game.
Some of the challenges are listed below:
1. Problems – Machine learning can deal with 'well-posed' problems, where specifications are complete and available. It cannot solve 'ill-posed' problems, where the solution is non-existent, not unique, or highly sensitive to small changes in the input.
• Consider the data in the table below. Can a model for this data be multiplication,
that is, y = x1 × x2? It is true! But it is equally true that y may be
y = x1 ÷ x2, or y = x1^x2.
• So, there are three functions that fit the data.
• This means that the problem is ill-posed.
• To solve this problem, one needs more examples to check the model.
Input (x1, x2)   Output (y)
(1, 1)           1
(2, 1)           2
(3, 1)           3
(4, 1)           4
(5, 1)           5
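A quick check (plain Python, using the table above) that all three candidate models fit the data equally well, which is exactly what makes the problem ill-posed:

```python
# Each sample is ((x1, x2), y), taken from the table above.
data = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

candidates = [("x1 * x2", lambda a, b: a * b),
              ("x1 / x2", lambda a, b: a / b),
              ("x1 ** x2", lambda a, b: a ** b)]

for name, f in candidates:
    print(name, "fits:", all(f(x1, x2) == y for (x1, x2), y in data))  # all True
```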
31. Conti…
2. Huge data – Availability of quality data is a challenge. Quality data means it
should be large and should not have data problems such as missing or
incorrect values.
3. High computation power – With the availability of Big Data, the computational
resource requirement has also increased. Systems with Graphics Processing Unit
(GPU) or even Tensor Processing Unit (TPU) are required to execute machine
learning algorithms.
4. Complexity of the algorithms – The selection of algorithms, describing the
algorithms, application of algorithms to solve machine learning task, and
comparison of algorithms have become necessary for machine learning or data
scientists now. Design, select, and evaluate optimal algorithms.
5. Bias/Variance – Bias is the error from overly simple assumptions in the model, while
variance is the error from the model's sensitivity to the training data; together they lead
to the bias/variance tradeoff. A model that fits the training data correctly but fails for
test data lacks generalization; this is called overfitting. The reverse problem is called
underfitting, where the model fails even on the training data and therefore generalizes
poorly. Overfitting and underfitting are great challenges for machine learning.
32. Machine Learning Process
• Involves data collection and preparation, model selection and training, evaluation, and deployment, with
continuous monitoring and maintenance for improved performance.
• The emerging process model for data mining solutions for business organizations is CRISP-DM.
• CRISP-DM stands for CRoss Industry Standard Process for Data Mining.
• Since machine learning is like data mining, except for the aim, this process can be used for machine
learning.
Figure 1.11: A machine learning/data mining process – understanding the business → understanding the data → data preprocessing → modelling → model evaluation → model deployment.
33. Conti…
1. Understanding the business – Understanding the objectives and requirements of the
business organization, and deciding whether a single data mining algorithm is enough to give
the solution. This step includes the formulation of the problem statement for the data mining process.
2. Understanding the data – Data collection, study of the characteristics of the data,
formulation of hypothesis, and matching of patterns to the selected hypothesis.
3. Preparation of data – Producing the final dataset by cleaning the raw data and
preparation of data for the data mining process. The missing values may cause problems
during both training and testing phases. Missing data forces classifiers to produce
inaccurate results. This is a perennial problem for the classification models.
4. Modelling – Plays a role in the application of data mining algorithm to obtain a model.
5. Evaluate –The performance of the classifier is determined by evaluating the accuracy of
the classifier. The process of classification is a fuzzy issue. For example, classification of
emails requires extensive domain knowledge and requires domain experts. Hence,
performance of the classifier is very crucial.
6. Deployment – Making a trained model available for use in a production environment,
integrating it into existing systems, and enabling it to make predictions on new data.
34. Machine Learning Applications
Some common applications of machine learning in our day-to-day life are listed below.
Figure 1.14: Common ML applications in our day-to-day life
35. Conti…
1. Business – Predicting the bankruptcy of a business firm
2. Banking – Prediction of bank loan defaulters and detecting credit card frauds
3. Image processing – Image search engines, object identification, image classification, and generating synthetic images
4. Audio/Video – Chatbots like Alexa and Microsoft Cortana; developing chatbots for customer support, speech to text, and text to voice
5. Telecommunication – Trend analysis and identification of bogus calls, fraudulent calls and their callers, churn analysis
6. Marketing – Retail sales analysis, market basket analysis, product performance analysis, market segmentation analysis, and study of travel patterns of customers for marketing tours
7. Games – Game programs for Chess, GO, and Atari video games
8. Natural language translation – Google Translate, text summarization, and sentiment analysis
9. Web analysis and services – Identification of access patterns, detection of e-mail spams and viruses, personalized web services, search engines like Google
10. Medicine – Prediction of diseases given disease symptoms, such as cancer or diabetes; prediction of the effectiveness of treatment using patient history; and chatbots to interact with patients, as IBM Watson does
36. Understanding Data
• A collection of facts, numbers, words, or observations that can be used to learn about something or make
decisions.
• The quality and quantity of data significantly impact the performance of a machine learning
model. Poor data can lead to inaccurate predictions.
• In computer systems, bits encode facts present in numbers, text, images, audio, and video.
• Today, business organizations are accumulating vast and growing amounts of data, of the order of
gigabytes, terabytes, and exabytes.
• Data is available in different data sources like flat files, databases, or data warehouses.
• It can either be an operational data or a non-operational data.
• Operational data is the one that is encountered in normal business procedures and processes.
• Before using data for training, it often needs to be cleaned (e.g., removing outliers), transformed (e.g., scaling), and prepared
to make it suitable for machine learning algorithms.
• Processed data is called information that includes patterns, associations, or relationships among
data.
37. Elements of Big Data
"data" refers to the observations or measurements used to train and test models, enabling them
to make predictions or decisions based on patterns learned from the input data.
Poor data can lead to inaccurate predictions, while insufficient data can result in a model that is
unable to classification well to new, unseen data.
Big data, on the other hand, is a larger data whose volume is much larger than ‘small data’ and is
characterized as follows:
1. Volume – Since there has been a reduction in the cost of storage devices, there has been a
tremendous growth of data.
2. Velocity – The fast arrival speed of data and the resulting increase in data volume is noted as velocity.
The availability of IoT devices and Internet power ensures that the data arrives at a faster
rate.
3. Variety – The variety of Big Data includes the form, the function, and the source of the data.
4. Veracity of data – Veracity of data deals with aspects like conformity to the facts, truthfulness,
believability, and confidence in data.
5. Validity – The accuracy of the data for taking decisions or for any other goals that are needed
by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the information that
is extracted from the data and its influence on the decisions that are taken based on it.
38. Types of Data
There are three kinds of data:
1. Structured data
2. Unstructured data
3. Semi-structured data
Structured Data:
• Data is stored in an organized manner such as a database where it is available in the form of a table.
• The data can also be retrieved in an organized manner using tools like SQL.
The structured data frequently encountered in machine learning are listed below:
Record Data:
• We have a collection of objects in a dataset and each object has a set of measurements.
• The measurements can be arranged in the form of a matrix.
• Each row of the matrix represents an object and can be called an entity, case, or record.
• The columns of the dataset are called attributes, features, or fields.
39. Conti…
Data Matrix:
• A data matrix is a variation of record data in which all the attributes are numeric.
• Each object can then be treated as a point (vector) in multidimensional space, with the measurements
arranged as a matrix: rows for objects and columns for attributes.
Graph Data:
Represents entities as nodes and relationships between them as edges, enabling analysis of interconnected
information, such as social networks or molecular structures.
The nodes are web pages, and a hyperlink is an edge that connects the nodes.
Ordered Data:
For many Machine Learning algorithms on supervised learning problems, the order of training data samples
can affect the quality of the derived model and the accuracy of predictions.
The examples of ordered data are:
1. Temporal data – It is the data whose attributes are associated with time.
2. Sequence data – It is like sequential data but does not have time stamps. This data involves the
sequence of words or letters. For example, DNA data is a sequence of four characters – A T G C.
3. Spatial data – It has attributes such as positions or areas. For example, maps are spatial data where
the points are related by location.
40. Conti…
Unstructured data:
• Refers to information that does not have a predefined format or structure, making it challenging to organize,
process, and analyze.
• Unstructured data includes video, image, and audio.
• It also includes textual documents, programs, and blog data. It is estimated that 80% of the data are
unstructured data.
Semi-Structured Data:
Semi-structured data is not organized in a rigid, tabular format like structured data (e.g., relational databases), but it is also not completely
unstructured like free text or images.
Example:
• JSON (JavaScript Object Notation): A common format for transmitting data, often used for APIs and web
applications.
• XML (Extensible Markup Language): Used for storing and transporting data, often used for configuration files
and data exchange.
• HTML (Hypertext Markup Language): Used for web pages, providing structure and content.
• CSV (Comma-Separated Values): A simple format for storing tabular data, often used for exporting and
importing data.
• Email: The content of an email is semi-structured, with the header and body following a certain structure, but
the content itself is unstructured.
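A small sketch of handling a semi-structured record with Python's json module (the record itself is hypothetical, loosely modelled on the patient table used later in these notes):

```python
import json

# Semi-structured: keys give some structure, but there is no fixed schema.
raw = '{"name": "Miller", "age": 38, "tests": {"blood": "Positive", "fever": "High"}}'

record = json.loads(raw)   # parse the text into nested Python objects
print(record["name"], record["tests"]["blood"])
```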
41. Data Storage and Representation
• The methods and technologies used to store and manage the data used for training and deploying models.
• Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis.
• The goal of data storage management is to make data available for analysis.
• Flat Files:
• These are the simplest and most commonly available data sources.
• It is also the cheapest way of organizing the data.
• These flat files are the files where data is stored in plain ASCII or EBCDIC format. Some of the popular
spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated value files where the values are separated by commas.
These are used by spreadsheet and database applications. The first row may have attributes and the
rest of the rows represent the data.
• TSV files – TSV stands for Tab separated values files where values are separated by Tab. Both CSV and
TSV files are generic in nature and can be shared.
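A minimal sketch of reading such a flat file with Python's csv module (an in-memory string stands in for a real .csv file; passing delimiter="\t" would handle a TSV file the same way):

```python
import csv
import io

# First row holds the attributes; the remaining rows hold the data.
flat_file = io.StringIO("id,marks\n1,45\n2,60\n3,80\n")

reader = csv.DictReader(flat_file)   # uses the header row as field names
for row in reader:
    print(row["id"], row["marks"])
```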
• Database System:
• It normally consists of database files and a Database Management System (DBMS).
• Database files contain original data and metadata.
42. DIFFERENT TYPES OF DATABASES
1. A transactional database is a collection of transactional records. Each record is a transaction.
2. Time-series database stores time related information like log files where data is associated with a time stamp. This
data represents the sequences of data, which represent values or events obtained over a period (for example,
hourly, weekly or yearly) or repeated time span.
World Wide Web (WWW):
• It provides a diverse, worldwide online information source.
• The objective of data mining algorithms is to mine interesting patterns of information present in WWW.
XML (eXtensible Markup Language)
• It is both human and machine interpretable data format that can be used to represent data that needs to be shared
across the platforms.
Data Stream:
• It is dynamic data, which flows in and out of the observing environment.
• Typical characteristics of a data stream are huge volume, dynamic nature, fixed-order movement, and real-time constraints.
RSS (Really Simple Syndication):
• It is a format for sharing instant feeds across services.
JSON (JavaScript Object Notation):
• It is another useful data interchange format that is often used for many machine learning algorithms.
43. BIG DATA ANALYTICS AND TYPES OF ANALYTICS
• Big data analytics enables the extraction of valuable insights and predictions from massive datasets that would be
impossible to analyze manually.
• Data analysis is an activity that takes the data and generates useful information and insights for assisting
the organizations.
Data analytics, instead, concentrates more on the future and helps in prediction. There are four types of data
analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive Analytics:
• It is about describing the main features of the data.
• After data collection is done, descriptive analytics deals with the collected data and quantifies it.
• It is often stated that analytics is essentially statistics.
Diagnostic Analytics:
• It deals with the question – ‘Why?’. This is also known as causal analysis, as it aims to find out the cause
and effect of the events.
• For example, if a product is not selling, diagnostic analytics aims to find out the reason.
• There may be multiple reasons and associated effects are analyzed as part of it.
44. Conti…
Predictive Analytics:
• It deals with the future.
• It deals with the question – ‘What will happen in future given this data?’.
• This involves the application of algorithms to identify the patterns to
predict the future.
• The entire course of machine learning is mostly about predictive analytics
and forms the core of this book.
Prescriptive Analytics:
• A way to use historical data and machine learning to predict future events.
• It's also known as predictive modeling or predictive AI.
• Predictive analytics involves certain manipulations on data from existing
data sets with the goal of identifying some new trends and patterns.
45. BIG DATA ANALYSIS FRAMEWORK
• Big data framework is a layered architecture.
• Such an architecture has many advantages such as genericness. A 4-layer architecture has the following
layers:
1. Data connection layer
2. Data management layer
3. Data analytics layer
4. Presentation layer
Data Connection Layer:
• It has data ingestion mechanisms and data connectors.
• Data ingestion means taking raw data and importing it into appropriate data structures.
• It performs the tasks of the ETL process, that is, extract, transform, and load operations.
Data Management Layer
• It performs preprocessing of data.
• The purpose of this layer is to allow parallel execution of queries, and read, write and data management
tasks.
• There may be many schemes that can be implemented by this layer such as data-in-place, where the data
is not moved at all.
46. Conti…
Data Analytics Layer
• It has many functionalities, such as statistical tests, machine learning algorithms, and the
construction of machine learning models.
• This layer implements many model validation mechanisms and tools.
• Such layers are often hosted on cloud infrastructure; the hybrid cloud is the combination of two or more cloud types.
• The characteristics of cloud computing are:
1. Shared Infrastructure – Sharing of physical services, storage, and networking capabilities
2. Dynamic Provisioning – Resources assigned dynamically, based on demands
3. Dynamic Scaling – Expansion and contraction of service capability
4. Network Access – Needs to be accessed across the internet
5. Utility-based Metering – Uses metering to provide reporting and billing information
6. Multitenancy – Serves multiple customers
7. Reliability – Provides customers a reliable service
47. Conti…
Grid Computing:
• Grid Computing is a parallel and distributed computing framework consisting of a network of
computers offering a super computing service as a single virtual supercomputer.
• This high-performance computing is required to perform specialized tasks that require a high
computing power and a single computer cannot provide enough computing resources.
• The grid computing model forms a grid by connecting tens of thousands of nodes as a cluster
that runs on an operating system.
High-Performance Computing (HPC):
• It enables complex tasks to be performed at high speed.
• It aggregates computing power in such a way that provides much higher performance to solve
complex problems in science, engineering, research or business.
• It leverages parallel processing techniques for solving complex computational problems.
• HPC system achieves this sustained performance through concurrent use of computing
resources.
• An HPC system combines the computing power of thousands of compute nodes that work in
parallel to complete tasks faster.
48. Conti…
Presentation Layer
• It has mechanisms such as dashboards, and applications that display the
results of analytical engines and machine learning algorithms.
• Thus, the Big Data processing cycle involves data management that consists
of the following steps.
1. Data collection
2. Data preprocessing
3. Applications of machine learning algorithm
4. Interpretation of results and visualization of machine learning algorithm
This is an iterative process and is carried out on a permanent basis to ensure
that data is suitable for data mining.
49. Conti…
Data Collection:
The first task is the collection of data. It is often estimated that most of the time is spent on collecting good-quality data.
Good-quality data yields a better result, but it is often difficult to characterize 'good data'. 'Good data' has
the following properties:
1. Timeliness – The data should be current, not stale or obsolete.
2. Relevancy – The data should be relevant and ready for the machine learning or data mining algorithms. All the
necessary information should be available and there should be no bias in the data.
3. Knowledge about the data – The data should be understandable and interpretable, and should be self-sufficient
for the required application as desired by the domain knowledge engineer.
Broadly, the data source can be classified as open/public data, social media data and multimodal data.
1. Open or public data source – It is a data source that does not have any stringent copyright rules or restrictions.
Government census data are good examples of open data:
• Digital libraries that have huge amount of text data as well as document images
• Scientific domains with a huge collection of experimental data like genomic data and biological data
• Healthcare systems that use extensive databases like patient databases, health insurance data, doctors’
information, and bioinformatics information
2. Social media – It is the data that is generated by various social media platforms like Twitter, Facebook, YouTube, and
Instagram. An enormous amount of data is generated by these platforms.
3. Multimodal data – It includes data that involves many modes such as text, video, audio and mixed types.
50. Data Preprocessing
• Data preprocessing improves the quality of the data mining techniques.
• The raw data must be preprocessed to give accurate results.
• The process of detection and removal of errors in data is called data cleaning.
• Data wrangling means making the data processable for machine learning algorithms.
• Some of the data errors include human errors such as typographical errors or incorrect
measurement and structural errors like improper data formats.
• Data errors can also arise from omission and duplication of attributes.
• Noise is a random component and involves distortion of a value or introduction of spurious
objects.
In the real world, the available data is 'dirty'.
By this word ’dirty’, it means:
• Incomplete data • Inaccurate data • Outlier data • Data with missing values • Data with
inconsistent values • Duplicate data
51. Missing Data Analysis
The primary data cleaning process is missing data analysis.
Data cleaning routines attempt to fill up the missing values, smoothen the noise while
identifying the outliers and correct the inconsistencies of the data.
This enables data mining to avoid overfitting of the models.
The procedures that are given below can solve the problem of missing data:
1. Ignore the tuple – A tuple with missing data, especially the class label, is ignored. This
method is not effective when the percentage of the missing values increases.
2. Fill in the values manually – Here, the domain expert can analyse the data tables and
carry out the analysis and fill in the values manually.
3. A global constant can be used to fill in the missing attributes. The missing values may
be ’Unknown’ or be ’Infinity’.
4. The missing value may be filled in with the attribute mean.
5. Use the attribute mean for all samples belonging to the same class. Here, the average
value replaces the missing values of all tuples that fall in this group.
6. Use the most possible value to fill in the missing value. The most probable value can
be obtained from other methods like classification and decision tree prediction.
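A short pandas sketch of procedures 1, 3, and 4 above on a hypothetical column with one missing value (assuming pandas is installed):

```python
import pandas as pd

s = pd.Series([10.0, None, 30.0, 40.0])   # NaN marks the missing value

print(s.dropna())            # procedure 1: ignore the tuple
print(s.fillna("Unknown"))   # procedure 3: fill with a global constant
print(s.fillna(s.mean()))    # procedure 4: fill with the attribute mean
```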
52. Removal of Noisy or Outlier Data
• Noise is a random error or variance in a measured value.
• It can be removed by using binning, which is a method where the given data values are sorted
and distributed into equal frequency bins.
• The bins are also called as buckets.
• The binning method then uses the neighbor values to smooth the noisy data.
• Some of the techniques commonly used are 'smoothing by bin means', where the bin mean
replaces the values of the bin; 'smoothing by bin medians', where the bin median replaces the
bin values; and 'smoothing by bin boundaries', where each bin value is replaced by the closest bin
boundary.
• The maximum and minimum values are called bin boundaries.
• Binning methods may be used as a discretization technique.
There are three techniques for data smoothing:
1. Partitioning the data into equal-frequency bins
2. Smoothing the data by bin means
3. Smoothing the data by bin boundaries
53. Conti…
Problem 1: Unsorted data for price in dollars: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Apply various
binning techniques and show the result.
Solution: After sorting: 8,9,15,16,21,21,24,26,27,30,30,34
1. By the equal-frequency bin method, the data is distributed across three bins of four values each:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
2. Smoothing the data by bin means
Bin 1: 8,9,15,16
Mean of Bin 1: (8+9+15+16)/4 = 12
Bin 1: 12,12,12,12
Mean of Bin 2: (21+21+24+26)/4 = 23
Bin 2: 23,23,23,23
Mean of Bin 3: (27+30+30+34)/4 = 30.25 ≈ 30
Bin 3: 30,30,30,30
54. Conti…
Data smoothing by bin boundaries:
Before bin boundary: Bin 1: 8, 9, 15, 16
Here, 9 is nearer to 8, so 9 is treated as 8; 15 is nearer to 16 and further
away from 8, so 15 is treated as 16.
After bin boundary:
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34 (30 is nearer to the boundary 27 than to 34)
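The same three techniques, as a plain-Python sketch that reproduces the Problem 1 results (bin means are rounded to the nearest integer):

```python
def bins_of(data, n_bins):
    # Equal-frequency binning: sort, then split into bins of equal size.
    data = sorted(data)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value with the nearer of the bin's min/max (ties go low).
    return [[min((b[0], b[-1]), key=lambda e: abs(v - e)) for v in b] for b in bins]

data = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
print(bins_of(data, 3))                        # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smooth_by_means(bins_of(data, 3)))       # [[12, ...], [23, ...], [30, ...]]
print(smooth_by_boundaries(bins_of(data, 3)))  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
```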
55. Conti…
Problem-2: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}. Apply various binning techniques and show the result.
Solution: By the equal-frequency bin method, the data is distributed across three bins of size 3:
Bin 1: 12, 14, 19
Bin 2: 22, 24, 26
Bin 3: 28, 31, 34
By the smoothing-by-bin-means method, the bin values are replaced by the bin means. This results in:
Bin 1: 15, 15, 15
Bin 2: 24, 24, 24
Bin 3: 31, 31, 31
Using smoothing by bin boundaries method, the bins' values would be like:
Bin 1: 12, 12, 19
Bin 2: 22, 22, 26
Bin 3: 28, 34, 34
As per the method, the minimum and maximum values of the bin are determined, and it serves as bin boundary and does not change.
Rest of the values are transformed to the nearest value. It can be observed in Bin 1, the middle value 14 is compared with the
boundary values 12 and 19 and changed to the closest value, that is 12. This process is repeated for all bins.
56. Conti…
Problem-3: Consider the following set: S = {4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34,38,42,45}. Apply various
binning techniques and show the result.
57. Conti…
Problem-4: Consider the following set: S = {5,10,11,13,15,35,50,55,72,92,204,215}. Apply
various binning techniques and show the result.
58. Data Integration and Data Transformations
• The process of combining data from various sources to create a unified view, enabling organizations
to analyze data more effectively and make better decisions.
• Integrating data from multiple sources may lead to redundant data.
• The main goal of data integration is to detect and remove redundancies that arise from integration.
• Data transformation refers to the process of converting raw data into a format that is more suitable
for model building and analysis, ensuring accuracy, relevance, and usability for machine learning
algorithms.
• Data transformation routines perform operations like normalization to improve the performance of
the data mining algorithms.
• Normalization is one such technique - the attribute values are scaled to fit in a range (say 0-1) to
improve the performance of the data mining algorithm.
• Often, in neural networks, these techniques are used.
• Some of the normalization procedures used are:
1. Min-Max
2. z-Score
59. Conti…
• Min-Max Procedure:
• It is a normalization technique where each value V is normalized by its difference from
the minimum, divided by the range, and mapped to a new range, say 0–1.
• The formula to implement this normalization is:
Min-max = (V − min)/(max − min) × (new_max − new_min) + new_min
Here, max − min is the range; min and max are the minimum and maximum of the given data,
and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.
60. Conti…
Problem-1: Consider the data set: V = {88, 90, 92, 94}. Apply Min-Max procedure and map the marks to a
new range 0–1.
Solution: The minimum of the list V is 88 and maximum is 94. The new min and new max are 0 and 1,
respectively.
For marks 88: Min-max = (88 − 88)/(94 − 88) × (1 − 0) + 0 = 0
Similarly, the other marks can be computed as follows:
For marks 90: Min-max = (90 − 88)/(94 − 88) × (1 − 0) + 0 = 0.33
For marks 92: Min-max = (92 − 88)/(94 − 88) × (1 − 0) + 0 ≈ 0.67
For marks 94: Min-max = (94 − 88)/(94 − 88) × (1 − 0) + 0 = 1
So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67, 1}.
Thus, the Min-Max normalization range is between 0 and 1.
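The formula as a small Python function, checked against the worked example above:

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # V' = (V - min)/(max - min) * (new_max - new_min) + new_min
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

marks = [88, 90, 92, 94]
print([round(min_max(m, min(marks), max(marks)), 2) for m in marks])
# -> [0.0, 0.33, 0.67, 1.0]
```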
61. Conti…
• Problem-2: Consider the set: V = {1000,2000,3000,9000}. Apply Min-Max procedure and map the marks
to a new range 0–1.
Solution:
• max = 9000, the maximum of {1000, 2000, 3000, 9000}
• min = 1000, the minimum of {1000, 2000, 3000, 9000}
For 1000: Min-max = (1000 − 1000)/(9000 − 1000) × (1 − 0) + 0 = 0
Similarly, the other values can be computed as follows:
For 2000: Min-max = (2000 − 1000)/(9000 − 1000) × (1 − 0) + 0 = 0.125
For 3000: Min-max = (3000 − 1000)/(9000 − 1000) × (1 − 0) + 0 = 0.25
For 9000: Min-max = (9000 − 1000)/(9000 − 1000) × (1 − 0) + 0 = 1
• Hence, the normalized values of 1000, 2000, 3000, 9000 are 0, 0.125, 0.25, 1.
• Thus, the Min-Max normalization range is between 0 and 1.
62. Conti…
Problem-3: Consider the set: V = {200, 300, 400, 600, 1000}. Apply the Min-Max
procedure, setting new_min = 0 and new_max = 1.
63. z-Score Normalization
• z-scores are used for outlier detection.
• If the data value z-score function is either less than -3 or greater than +3, then it is possibly an outlier.
• The major disadvantage of z-score function is that it is extremely sensitive to outliers as it is dependent on
mean.
• z-score normalization, also known as standardization, transforms data to have a mean of 0 and a standard deviation of 1, making it
easier to compare datasets with different scales or ranges.
V* = (V – μ)/σ
Here,
V: original value
μ : mean of the data set.
σ : standard deviation of the dataset.
How is it used?
• Comparing data points: Z-scores are useful for comparing data points from different distributions, as they
provide a standardized scale.
• Identifying outliers (errors in dataset): A data point with a very high or low z-score (e.g., above +2 or below
-2) can be considered an outlier.
• Understanding probability: Z-scores can be used to determine the probability of a data point occurring
within a normal distribution.
64. Conti…
Problem 1: Consider the marks list V = {10, 20, 30}; convert the marks to z-scores.
Solution: The mean and sample standard deviation (s) of the list V are 20 and 10,
respectively. So the z-scores of these marks are calculated as V* = (V − μ)/σ:
Z-score of 10 = (10 − 20)/10 = −1
Z-score of 20 = (20 − 20)/10 = 0
Z-score of 30 = (30 − 20)/10 = 1
Hence, the z-scores of the marks 10, 20, 30 are −1, 0 and 1, respectively.
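The same computation as a sketch using Python's statistics module (stdev gives the sample standard deviation, matching Problem 1):

```python
from statistics import mean, stdev

def z_scores(values):
    mu, s = mean(values), stdev(values)   # sample standard deviation
    return [round((v - mu) / s, 2) for v in values]

print(z_scores([10, 20, 30]))   # -> [-1.0, 0.0, 1.0]
```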
65. Conti…
Problem 2: Consider the marks list V = {3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, 134};
convert the marks to z-scores.
Solution: With mean μ ≈ 21.19 and population standard deviation σ ≈ 29.84, the z-scores are
approximately: −0.61, −0.54, −0.54, −0.44, −0.41, −0.31, −0.31, −0.27, −0.21, −0.17, −0.14, −0.07,
0.03, 0.09, 0.13, 3.78. The value 134, with a z-score above +3, is an outlier.
66. Data Reduction
• The process of minimizing the size of datasets while retaining essential information,
making data more manageable for analysis, storage, and processing.
• It involves eliminating redundant or irrelevant information without losing critical
insights.
• There are different ways in which data reduction can be carried out such as data
aggregation, feature selection, and dimensionality reduction.
Data reduction can be achieved through various techniques, including:
• Data Compression: Reducing the number of bits needed to represent data, making files
smaller.
• Data Deduplication: Eliminating duplicate copies of data, storing only one instance and
referencing it.
• Dimensionality Reduction: Reducing the number of variables or attributes under
consideration.
• Data Discretization: Replacing numerical attributes with nominal ones.
• Attribute Subset Selection: Choosing a subset of relevant attributes for analysis.
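A hedged dimensionality-reduction sketch with PCA (assuming scikit-learn is installed; the data is synthetic, with a third feature that nearly duplicates the first, so one dimension carries almost no extra information):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.column_stack([x1, x2, x1 + 0.01 * rng.normal(size=100)])  # x3 ~ x1

# Keep just enough components to retain 95% of the variance.
pca = PCA(n_components=0.95).fit(X)
print("Reduced from 3 to", pca.n_components_, "dimensions")   # -> 2
```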
67. DESCRIPTIVE STATISTICS
• Descriptive statistics summarizes or describes the characteristics of a data set.
• It consists of three basic categories of measures: measures of central tendency,
measures of variability (or spread), and frequency distribution.
• Measures of central tendency describe the center of the data set (mean, median, mode).
• Measures of variability describe the dispersion of the data set (variance, standard
deviation).
• Measures of frequency distribution describe the occurrence of data within the data set
(count).
• Descriptive analytics and data visualization techniques help to understand the nature of
the data (Qualitative Data: Colors, textures, opinions, and categories, Quantitative
Data: Height, weight, age, and temperature).
• This step is often known as Exploratory Data Analysis (EDA).
• The focus of EDA is to understand the given data and to prepare it for machine learning
algorithms.
• EDA includes descriptive statistics and data visualization.
68. Dataset and Data Type
• A dataset is a structured collection of data points that a machine learning algorithm can analyze.
• It provides the model with examples to learn from, typically including features (input variables) and in
some cases, labels (output variables) for supervised learning tasks.
• Data types are broadly classified as qualitative (or categorical) and quantitative (or numerical), with
further sub-divisions like nominal, ordinal, discrete, and continuous.
Table-1: Dataset
Patient Id | Name   | Age | Blood test | Fever | Disease
1          | John   | 30  | Negative   | Low   | No
2          | Miller | 38  | Positive   | High  | Yes
3          | Andre  | 35  | Negative   | Low   | No
Data types: Categorical data (Nominal data, Ordinal data) and Numerical data (Interval data, Ratio data).
69. UNIVARIATE DATA ANALYSIS AND VISUALIZATION
UNIVARIATE DATA ANALYSIS
• Refers to data that consists of observations on only one characteristic or attribute.
• As the name indicates, the dataset has only one variable.
• A variable can also be called a category.
• Univariate does not deal with cause or relationships.
• The aim of univariate analysis is to describe data and find patterns.
• Univariate data description involves finding the frequency distributions, central
tendency measures, dispersion or variation, and shape of the data.
• Example:
• If you're analyzing the ages of people in a sample, univariate analysis would focus on
describing the distribution of those ages (e.g., average age, range of ages, most common
age).
70. Conti…
Data Visualization
• To understand data, graphical visualization is a must.
• Data visualization helps in understanding data and in presenting information to customers.
• Some of the graphs that are used in univariate data analysis are bar charts, histograms, frequency
polygons and pie charts.
• The advantages of the graphs are presentation of data, summarization of data, description of data,
exploration of data, and to make comparisons of data.
Bar Chart
• A Bar chart (or Bar graph) is used to display the frequency distribution for variables.
• Bar charts are used to illustrate discrete data.
• The charts can also help to explain the counts of nominal data. It also helps in comparing the frequency of
different groups.
• The bar chart for students' marks {45, 60, 60, 80, 85} plotted against Student ID is shown in Figure 1.15.
71. Conti…
Pie Chart
• A pie chart is a graph representing data in circular form; it is equally helpful in illustrating univariate data.
• The percentage frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in Figure 1.16.
Histogram
• It plays an important role in data mining for showing frequency distributions.
• The histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75 and 76-100 is shown in Figure 1.17.
Figure 1.15: Bar Chart
Figure 1.16: Pie Chart
Figure 1.17: Histogram Chart
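A minimal matplotlib sketch (illustrative, not the textbook's code) that reproduces the three chart types above from the marks data used in these slides:

    from collections import Counter
    import matplotlib.pyplot as plt

    marks = [45, 60, 60, 80, 85]
    ids = [1, 2, 3, 4, 5]                 # assumed Student IDs

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
    ax1.bar(ids, marks)                   # bar chart (cf. Figure 1.15)
    freq = Counter([22, 22, 40, 40, 70, 70, 70, 85, 90, 90])
    ax2.pie(list(freq.values()),          # pie of percentage frequencies (cf. Figure 1.16)
            labels=[str(k) for k in freq], autopct='%1.0f%%')
    ax3.hist(marks, bins=[0, 25, 50, 75, 100])  # histogram (cf. Figure 1.17)
    plt.show()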
72. Central Tendency
• Central tendency is a statistical measure that represents the entire distribution or dataset with a single value.
• It identifies the central position within the data.
• As such, measures of central tendency are sometimes called measures of central location.
• Such a condensation or summary of the data makes analysis easy and simple.
• Central tendency thus explains the characteristics of the data and further helps in comparison.
• Mass data tend to concentrate at certain values, normally in the central location.
• The mean, median and mode are all valid measures of central tendency, but under different conditions,
some measures of central tendency become more appropriate to use than others.
73. Conti…
1. Mean:
• Arithmetic average (or mean) is a measure of central tendency that
represents the ‘center’ of the dataset.
• This is the most common measure used in daily conversation, such as average income or average traffic.
• It can be found by adding all the data and dividing the sum by the number
of observations.
• Mathematically, the mean is defined by:
x̄ = (x₁ + x₂ + x₃ + ⋯ + x_N)/N = (1/N) Σᵢ₌₁ᴺ xᵢ
74. Conti…
2. Median:
• The middle value of the distribution is called the median.
• If the total number of items in the distribution is odd, the middle value is the median.
• If the number of items is even, the average of the two middle items is the median.
• The median is thus the value that divides the data into two equal halves, with half of the values lower than the median and half higher.
• A median class is the class where the (N/2)th item is present.
• In the continuous case, the median is given by the formula:
Median = L₁ + ((N/2 − cf)/f) × i
Here, i is the class interval of the median class, L₁ is the lower limit of the median class, f is the frequency of the median class, and cf is the cumulative frequency of all classes preceding the median class.
Ex: 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and 17 (14 values)
Median = (27 + 29)/2 = 28
75. Conti…
3. Mode
• Mode is the value that occurs most frequently in the dataset.
• In other words, the value that has the highest frequency is called the mode.
• Mode is meaningful mainly for discrete data; it is of little use for continuous data, as exact values rarely repeat there.
• The procedure for finding the mode is to calculate the frequencies of all the values in the data; the mode is the value (or values) with the highest frequency.
• A dataset is classified as unimodal, bimodal or trimodal when it has one, two or three modes, respectively.
• Ex: For the dataset (5, 4, 2, 3, 2, 1, 5, 4, 5), sorted as 5, 5, 5, 4, 4, 3, 2, 2, 1, the value 5 occurs three times, so the mode is 5.
76. Conti…
Problem 1: What is the median of the following data set?
(32, 6, 21, 10, 8, 11, 12, 36, 17, 16, 15, 18, 40, 24, 21, 23, 24, 24, 29, 16, 32, 31, 10, 30, 35, 32, 18, 39, 12, 20)
Solution:
The ascending order of the given data set is:
(6, 8, 10, 10, 11, 12, 12, 15, 16, 16, 17, 18, 18, 20, 21, 21, 23, 24, 24, 24, 29, 30, 31, 32, 32, 32, 35, 36, 39, 40)
Number of values in the data set = n = 30
n/2 = 30/2 = 15
15th data value = 21
(n/2) +1 = 16
16th data value = 21
Median = [(n/2)th observation + {(n/2)+1}th observation]/2
= (15th data value + 16th data value)/2
= (21 + 21)/2
= 21
77. Conti…
Problem 2: Identify the mode for the following data set:
(21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28, 62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94)
Solution:
Let us write the given data set in ascending order as follows:
(19, 19, 21, 21, 28, 28, 48, 48, 59, 59, 62, 62, 63, 63, 66, 66, 66, 66, 79, 79, 94, 94)
Here, we can observe that the number 66 occurred the maximum number of times.
Thus, the mode of the given data set is 66.
78. Conti…
Problem 3: Find the mean, median, mode and range for the given data:
(90, 94, 53, 68, 79, 94, 53, 65, 87, 90, 70, 69, 65, 89, 85, 53, 47, 61, 27, 80)
Solution:
Given: (90, 94, 53, 68, 79, 94, 53, 65, 87, 90, 70, 69, 65, 89, 85, 53, 47, 61, 27, 80)
Number of observations = 20
Mean = (90 + 94 + 53 + 68 + 79 + 94 + 53 + 65 + 87 + 90 + 70 + 69 + 65 + 89 + 85 +
53 + 47 + 61 + 27 + 80)/20
= 1419/20
= 70.95
Therefore, mean is 70.95.
Median:
The ascending order of given observations is:
(27, 47, 53, 53, 53, 61, 65, 65, 68, 69, 70, 79, 80, 85, 87, 89, 90, 90, 94,94)
Here, n = 20
79. Conti…
Median = [(n/2)th observation + ((n/2) + 1)th observation]/2
= (10th observation + 11th observation)/2
= (69 + 70)/2
= 139/2
= 69.5
Thus, the median is 69.5.
Mode:
The most frequently occurred value in the given data is 53.
Therefore, mode = 53
Range:
Range = Highest value – Lowest value
= 94 – 27
= 67
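These results can be verified with Python's statistics module; a minimal sketch for checking the arithmetic:

    import statistics as st

    data = [90, 94, 53, 68, 79, 94, 53, 65, 87, 90, 70, 69, 65, 89, 85, 53, 47, 61, 27, 80]
    print(st.mean(data))           # 70.95
    print(st.median(data))         # 69.5
    print(st.mode(data))           # 53 (occurs three times)
    print(max(data) - min(data))   # range = 94 - 27 = 67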
80. Dispersion
• The spread of a set of data around the central tendency (mean, median or mode) is called dispersion.
• Dispersion is represented by various ways such as range, variance, standard deviation, and
standard error.
• These are second order measures.
• The most common measures of the dispersion data are listed below:
Range
• Range is the difference between the maximum and minimum of values of the given list of data.
Standard Deviation
• The mean does not convey much more than a middle point.
• For example, the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20.
• The difference between these two sets is the spread of data.
• Standard deviation is the average distance from the mean of the dataset to each point.
• The formula for the sample standard deviation is:
s = √[ Σᵢ₌₁ᴺ (xᵢ − x̄)² / (N − 1) ]
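A quick check of this formula in Python (statistics.stdev uses the same N − 1 sample formula):

    import statistics as st

    print(st.stdev([10, 20, 30]))  # 10.0: sqrt(((10-20)^2 + 0 + (30-20)^2) / (3-1))
    print(st.stdev([10, 50, 0]))   # ~26.46: same mean of 20, but a much larger spread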
81. Conti…
Quartiles and Inter Quartile Range
• It is sometimes convenient to subdivide the dataset using percentile coordinates.
• Percentiles describe the data that lie at or below a given coordinate as a percentage of the total values.
• The kth percentile is the value xᵢ such that k% of the data lies at or below xᵢ.
• For example, the median is the 50th percentile and can be denoted as Q0.50.
• The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3).
• Another measure that is useful for measuring dispersion is the Inter Quartile Range (IQR), the difference between Q3 and Q1:
IQR = Q3 − Q1 = Q0.75 − Q0.25
• Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the first quartile.
82. Conti…
Problem 1: For patients’ age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position; here, 24 is the median.
• The first quartile is the median of the scores below the median, i.e., {12, 14, 19, 22}.
• That is the average of the second and third values: Q0.25 = (14 + 19)/2 = 16.5.
• Similarly, the third quartile is the median of the values above the median, i.e., {26, 28, 31, 34}.
• So Q0.75 is the average of the seventh and eighth scores: Q0.75 = (28 + 31)/2 = 59/2 = 29.5.
• Hence, IQR = Q3 − Q1 = Q0.75 − Q0.25 = 29.5 − 16.5 = 13.
• Half of the IQR is called the semi-quartile range. The Semi Inter Quartile Range (SIQR) is given as:
SIQR = (1/2) × IQR = (1/2) × 13 = 6.5
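A small sketch of the quartile convention used above (median of the lower and upper halves, excluding the middle value when n is odd); note that library routines such as numpy.percentile interpolate differently and may return slightly different quartiles:

    import statistics as st

    def iqr(values):
        v = sorted(values)
        half = len(v) // 2
        q1 = st.median(v[:half])    # median of the lower half
        q3 = st.median(v[-half:])   # median of the upper half
        return q3 - q1

    ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
    print(iqr(ages))       # 29.5 - 16.5 = 13.0
    print(iqr(ages) / 2)   # SIQR = 6.5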
83. Conti…
Problem 2: For patients’ age list {9, 11, 15, 21, 23, 26, 29, 30, 32, 36, 38}, find
the IQR.
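Solution (following the method of Problem 1): the median is the sixth value, 26. The first quartile is the median of the lower half {9, 11, 15, 21, 23}, so Q0.25 = 15; the third quartile is the median of the upper half {29, 30, 32, 36, 38}, so Q0.75 = 32. Hence IQR = Q3 − Q1 = 32 − 15 = 17, and SIQR = 17/2 = 8.5.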
84. Module 2 - Understanding Data – 2
Bivariate Data
• Bivariate analysis is one of the statistical analysis where two variables are observed.
• Typically, one variable is independent (X) while the other is dependent (Y).
• Bivariate data can be used to determine whether or not two variables are related.
• The aim of bivariate analysis is to find relationships among data.
• The relationships can then be used in comparisons, finding causes, and in further explorations.
• To do that, graphical display of the data is necessary.
• One such graph method is called scatter plot.
• A scatter plot is a 2D graph that presents the relationship between two variables in a dataset.
• It is useful in exploratory data analysis before calculating a correlation coefficient or fitting a regression curve.
85. Conti..
Table 2.1: Temperature in a Shop and Sales Data
Temperature (in centigrade) | Sales of Sweaters (in thousands)
5  | 300
12 | 250
15 | 200
20 | 110
23 | 45
27 | 10
35 | 5
Figure 2.11: Scatter Plot (Temperature vs. Sales of Sweaters)
Figure 2.12: Line Chart (line graphs are similar to scatter plots)
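A minimal matplotlib sketch (illustrative, not the textbook's code) that draws Figures 2.11 and 2.12 from Table 2.1; the inverse relationship between temperature and sales is clearly visible:

    import matplotlib.pyplot as plt

    temp = [5, 12, 15, 20, 23, 27, 35]
    sales = [300, 250, 200, 110, 45, 10, 5]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(temp, sales)           # Figure 2.11: scatter plot
    ax2.plot(temp, sales, marker='o')  # Figure 2.12: line chart
    for ax in (ax1, ax2):
        ax.set_xlabel('Temperature (centigrade)')
        ax.set_ylabel('Sales of Sweaters (in thousands)')
    plt.show()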
86. Bivariate Statistics
• Bivariate analysis is stated to be an analysis of any concurrent relation between two variables or
attributes.
• Examples: a student's study time vs. exam scores, ice cream sales vs. temperature, height vs. weight, income vs. years of education, and a patient's BMI vs. blood pressure.
• Covariance and Correlation are methods of bivariate statistics.
• Covariance is a measure of the joint variability of two random variables, say X and Y.
• It is defined as covariance(X, Y) or COV(X, Y) and is used to measure the variance between two
dimensions.
• The formula for finding the covariance for specific x and y is:
COV(X, Y) = (1/N) Σᵢ₌₁ᴺ (xᵢ − E(X))(yᵢ − E(Y))
Here, xᵢ and yᵢ are data values from X and Y, E(X) and E(Y) are the mean values of the xᵢ and yᵢ, and N is the number of data points.
Also, COV(X, Y) is the same as COV(Y, X).
87. Bivariate Statistics
Problem 1: Find the covariance of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: Mean(X) = E(X) = 15/5 = 3; Mean(Y) = E(Y) = 55/5 = 11
COV(X, Y) = (1/N) Σᵢ₌₁ᴺ (xᵢ − E(X))(yᵢ − E(Y))
= [(1 − 3)(1 − 11) + (2 − 3)(4 − 11) + (3 − 3)(9 − 11) + (4 − 3)(16 − 11) + (5 − 3)(25 − 11)]/5
= (20 + 7 + 0 + 5 + 28)/5 = 60/5 = 12
The covariance between X and Y is 12.
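A quick NumPy check; np.cov divides by N − 1 by default, so bias=True is needed to match the 1/N definition used here:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([1, 4, 9, 16, 25])
    print(np.cov(x, y, bias=True)[0, 1])   # 12.0, matching COV(X, Y) above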
88. Bivariate Statistics
Problem 2: Find the covariance between X and Y for the following data:
X: 3 4 5 8 7 9 6 2 1
Y: 4 3 4 7 8 7 6 3 2
Solution:
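Following the same steps as Problem 1:
Mean(X) = E(X) = 45/9 = 5; Mean(Y) = E(Y) = 44/9 ≈ 4.89
COV(X, Y) = (1/9) Σᵢ₌₁⁹ (xᵢ − 5)(yᵢ − 44/9) = (Σᵢ₌₁⁹ xᵢyᵢ)/9 − E(X)E(Y) = 263/9 − 220/9 = 43/9 ≈ 4.78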
89. Correlation
Correlation refers to a process for establishing the relationships between two
variables.
The correlation coefficient is a statistical measure of the strength of a linear relationship
between two variables. Its values can range from -1 to 1.
The sign is often more informative than the magnitude:
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that as one dimension increases, the other decreases.
3. If the value is zero, it indicates that the two dimensions are uncorrelated (no linear relationship).
If the given attributes are X = (𝑥1, 𝑥2, 𝑥3, …, 𝑥𝑛) and Y = (𝑦1, 𝑦2, 𝑦3, …, 𝑦𝑛), then the Pearson
correlation coefficient, that is denoted as r, is given as:
r = COV(X, Y)/(σX σY)
where σX and σY are the standard deviations of X and Y.
90. Conti..
Problem 1: Find the correlation coefficient of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution:
Step 1: The mean values of X and Y:
Mean(X) = X̄ = 15/5 = 3; Mean(Y) = Ȳ = 55/5 = 11
Step 2: Calculate the squared differences from the mean:
For X: (1 − 3)² = 4, (2 − 3)² = 1, (3 − 3)² = 0, (4 − 3)² = 1, (5 − 3)² = 4
For Y: (1 − 11)² = 100, (4 − 11)² = 49, (9 − 11)² = 4, (16 − 11)² = 25, (25 − 11)² = 196
Sum of squared differences for X: 10; sum of squared differences for Y: 374
91. Conti..
Step 3: Calculate the variance
• The variance of each set is the average of these squared differences (the population variance).
For X: Variance of X = 10/5 = 2
For Y: Variance of Y = 374/5 = 74.8
Step 4: Calculate the standard deviation
• The standard deviation is the square root of the variance.
For X: σX = √2 ≈ 1.414
For Y: σY = √74.8 ≈ 8.649
• The covariance (computed in the earlier problem) is:
COV(X, Y) = (1/N) Σᵢ₌₁ᴺ (xᵢ − E(X))(yᵢ − E(Y)) = [(1 − 3)(1 − 11) + (2 − 3)(4 − 11) + (3 − 3)(9 − 11) + (4 − 3)(16 − 11) + (5 − 3)(25 − 11)]/5 = 12
• Therefore, the correlation coefficient is:
r = COV(X, Y)/(σX σY) = 12/(1.414 × 8.649) ≈ 0.981
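For verification, NumPy's corrcoef implements the same Pearson formula:

    import numpy as np

    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]
    print(np.corrcoef(x, y)[0, 1])   # ~0.9811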
92. Conti..
Problem 1: Find the correlation coefficient of data X = {5,9,10,3,5,7} and Y = {6,11,6,4,6,9}.
Solution:
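Following the same steps as the previous worked example (population variances, N = 6):
Mean(X) = 39/6 = 6.5; Mean(Y) = 42/6 = 7
COV(X, Y) = (1/6) Σᵢ₌₁⁶ (xᵢ − 6.5)(yᵢ − 7) = 21/6 = 3.5
σX = √(35.5/6) ≈ 2.432; σY = √(32/6) ≈ 2.309
r = COV(X, Y)/(σX σY) = 3.5/(2.432 × 2.309) ≈ 0.62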
93. Multivariate Statistics
• Multivariate statistics refers to methods that examine the simultaneous effect of multiple variables.
• In machine learning, almost all datasets are multivariable.
• Multivariate data analysis examines more than two observable variables; often, thousands of measurements are collected for one or more subjects.
• Multivariate data is like bivariate data but may have more than two variables.
• Some of the multivariate analysis are regression analysis, principal component analysis, and path analysis.
id | Attribute-1 | Attribute-2 | Attribute-3
1  | 1           | 4           | 1
2  | 2           | 5           | 2
3  | 3           | 6           | 1
• The mean of multivariate data is a mean vector and the mean of the above three attributes is given as (2,
5, 1.33).
• The variance of multivariate data generalizes to the covariance matrix.
• The mean vector is called the centroid and the covariance matrix is called the dispersion matrix.
• Multivariate data has three or more variables.
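A minimal NumPy sketch (illustrative) computing the centroid and the dispersion matrix for the table above; np.cov with bias=True matches the 1/N convention used earlier:

    import numpy as np

    # rows = records, columns = Attribute-1, Attribute-2, Attribute-3
    data = np.array([[1, 4, 1],
                     [2, 5, 2],
                     [3, 6, 1]])
    print(data.mean(axis=0))                       # centroid: [2. 5. 1.33]
    print(np.cov(data, rowvar=False, bias=True))   # dispersion (covariance) matrix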
94. Heatmap
• In machine learning, a heatmap is a data visualization technique that uses color-coding to represent the
magnitude of individual values within a dataset, often displayed as a grid or matrix.
• It helps to identify patterns, correlations, and anomalies within complex datasets by highlighting areas of
significance.
• It takes a matrix as input and colours it.
• The darker colours indicate very large values and lighter colours indicate smaller values.
• The advantage of this method is that humans perceive colours well.
• So, by colour shaping, larger values can be perceived well.
• For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions
through heatmap.
Figure 2.3: Grid with Heatmap Pattern
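A minimal seaborn sketch (an assumed random example matrix, not the textbook's traffic data):

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    matrix = np.random.rand(8, 8)      # any matrix of magnitudes, e.g. traffic counts
    sns.heatmap(matrix, cmap='Reds')   # darker cells correspond to larger values
    plt.show()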
95. Pairplot
• A pairplot, also called a scatterplot matrix or scatter matrix, is a data visualization tool that displays pairwise relationships between all variables in a dataset, helping to understand distributions and correlations at a glance.
• It is a visual technique for multivariate data.
• A scatter matrix consists of several pair-wise scatter plots of variables of the multivariate data.
• All the results are presented in a matrix format.
• By visual examination of the chart, one can easily find relationships among the variables such as
correlation between the variables.
• A random matrix of three columns is chosen and the relationships of the columns is plotted as a pairplot.
Figure 1: Pairplot Visualization
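A minimal seaborn sketch matching the description (a random three-column matrix, as in the slide):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.randn(100, 3), columns=['A', 'B', 'C'])
    sns.pairplot(df)   # grid of pairwise scatter plots (histograms on the diagonal)
    plt.show()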
96. Essential Mathematics for Multivariate Data
• Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics,
Probability and Information theory.
• Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations.
• These are the driving forces of machine learning; machine learning cannot exist without these foundations.
Linear Systems and Gaussian Elimination for Multivariate Data
• A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution is x = A⁻¹y.
This holds provided A is invertible (non-singular).
• The logic can be extended to a set of N equations with n unknown variables: if A is the coefficient matrix and y = (y₁, y₂, …, yₙ), then the unknown vector is x = A⁻¹y.
97. Conti..
For solving a large system of equations, Gaussian elimination can be used.
The procedure for applying Gaussian elimination is as follows:
1. Write the given coefficient matrix A.
2. Append the vector y to the matrix A; the result is called the augmented matrix.
3. Keep the element a₁₁ as the pivot and eliminate a₂₁ in the second row using the row operation R₂ − (a₂₁/a₁₁)R₁, where a₂₁/a₁₁ is called the multiplier. The same logic is used to eliminate the first-column entries in all the remaining rows.
4. Repeat the same logic to reduce the matrix to row echelon form. Then the last unknown variable is:
xₙ = yₙ/aₙₙ
5. The remaining unknown variables can then be found by back-substitution:
x₍ₙ₋₁₎ = (y₍ₙ₋₁₎ − a₍ₙ₋₁₎ₙ xₙ) / a₍ₙ₋₁₎₍ₙ₋₁₎
This part is called backward substitution.
98. Conti..
Problem 4: Solve the following set of equations using Gaussian Elimination method.
2𝑥1 + 4𝑥2 = 6
4𝑥1 + 3𝑥2 = 7
Solution: The augmented matrix is
[ 2  4 | 6 ]
[ 4  3 | 7 ]
R1 → R1/2:
[ 1  2 | 3 ]
[ 4  3 | 7 ]
R2 → R2 − 4R1:
[ 1  2 | 3 ]
[ 0 −5 | −5 ]
R2 → R2/(−5):
[ 1  2 | 3 ]
[ 0  1 | 1 ]
R1 → R1 − 2R2:
[ 1  0 | 1 ]
[ 0  1 | 1 ]
Hence x1 = 1, x2 = 1.
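The same system can be verified with NumPy's linear solver (which uses an LU factorization, a variant of Gaussian elimination):

    import numpy as np

    A = np.array([[2.0, 4.0],
                  [4.0, 3.0]])
    y = np.array([6.0, 7.0])
    print(np.linalg.solve(A, y))   # [1. 1.]  ->  x1 = 1, x2 = 1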
99. Conti..
Problem 5: Solve the following set of equations using Gaussian Elimination
method. 2x+y=-1
3x-5y= -21
Solution: Writing the augmented matrix and reducing it in the same way gives
[ 1  0 | −2 ]
[ 0  1 |  3 ]
so x = −2, y = 3.
100. Machine Learning and Importance of Probability and Statistics
• Machine learning is linked with statistics and probability.
• Like linear algebra, statistics is the heart of machine learning.
• The importance of statistics needs to be stressed: without statistics, analysis of data is difficult.
• Probability is especially important for machine learning.
• In machine learning, probability is a fundamental concept that deals with the likelihood of events or
outcomes. It's used to model uncertainty and make predictions, especially in algorithms that deal with
probabilistic models like Naive Bayes.
Probability Distributions
• The mathematical function that gives the probabilities of occurrence of possible outcomes for
an experiment.
• In other words, distribution is a function that describes the relationship between the observations in a
sample space.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
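A small scipy.stats illustration contrasting the two types, with illustrative parameters (binomial as a discrete distribution, normal as a continuous one):

    from scipy import stats

    # Discrete: P(X = 3) for X ~ Binomial(n=10, p=0.5)
    print(stats.binom.pmf(3, n=10, p=0.5))     # ~0.117

    # Continuous: density of N(mean=0, std=1) at x = 0
    print(stats.norm.pdf(0, loc=0, scale=1))   # ~0.399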
101. FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES
• The process of selecting, transforming, and creating new features (or variables) from raw data to improve the
performance of machine learning models.
• It involves carefully preparing the input data so that machine learning algorithms can learn effectively and make
accurate predictions.
• Features are attributes.
• Feature engineering is about determining the subset of features that form an important part of the input that
improves the performance of the model, be it classification or any other model in machine learning.
• Feature engineering deals with two problems – Feature Transformation and Feature Selection.
• Feature transformation is extraction of features and creating new features that may be helpful in increasing
performance.
• For example, the height and weight may give a new attribute called Body Mass Index (BMI).
• Feature subset selection is another important aspect of feature engineering that focuses on selecting a subset of features to reduce training time without sacrificing reliability.
The features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more to classification than others.
For example, a mole on the face helps more in face detection than common features like the nose.
2. Feature redundancy – Some features are redundant.
For example, when a database table has a Date of Birth field, an Age field is redundant, as age can be computed easily from the date of birth. Removing the Age column reduces the dimensionality by one; a small sketch of both ideas follows.
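A minimal pandas sketch of both ideas; the column names are hypothetical: deriving a BMI feature from height and weight (feature transformation), and dropping a redundant age column (feature redundancy):

    import pandas as pd

    df = pd.DataFrame({'height_m': [1.60, 1.75],
                       'weight_kg': [55, 80],
                       'date_of_birth': ['1995-01-01', '1988-06-15'],
                       'age': [30, 37]})

    df['bmi'] = df['weight_kg'] / df['height_m'] ** 2   # new feature from existing ones
    df = df.drop(columns=['age'])   # redundant: derivable from date_of_birth
    print(df)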