ELH -4.2: MACHINE LEARNING: Supervised, Unsupervised, or Reinforcement Learning

ELH -4.2: MACHINE LEARNING UNIT – I
Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics, Kuvempu University, Jnanasahyadri, Shankaraghatta, 2023-24
INTRODUCTION
 Machine Learning (ML) is a field of artificial intelligence (AI) focused on developing
algorithms that enable computers to learn from and make decisions based on data.
 Its history dates back to the 1950s with Alan Turing's concept of machines simulating
human intelligence.
 The term "artificial intelligence" was coined in 1956, but early AI faced limitations due to
insufficient data and computational power. The 1980s saw the emergence of machine
learning methods, and the 1990s brought significant advances with the rise of the internet
and statistical techniques.
 The modern era, particularly from the 2010s onwards, has been dominated by deep
learning, leveraging neural networks and vast datasets to achieve breakthroughs in areas
like image and speech recognition, transforming various industries and applications.
Definitions and Examples
1. Learning
Definition: Learning, in the context of machine learning, refers to the process of gaining
knowledge or skills through experience, study, or being taught. In ML, it specifically means
improving performance on a task over time by gaining experience from data.

Example: Suppose you are learning to play chess. Initially, you might not know the best strategies,
but as you play more games and study strategies, you get better. Similarly, in ML, an algorithm
might start with little knowledge but improves its performance as it processes more data.
2. Machine
Definition: A machine, in this context, is any device that uses electrical or mechanical power to
perform tasks. In the realm of machine learning, "machine" usually refers to a computer or a system
that can process and analyze data.
Example: Your smartphone is a machine. It can perform various tasks like recognizing your voice
or face, suggesting the next word while typing, or recommending songs you might like based on
your listening history.
3. Natural Intelligence
Definition: Natural intelligence refers to the inherent ability of humans and certain animals to understand,
learn, and adapt to their environment using cognitive processes such as perception, reasoning, and
problem-solving. It encompasses a wide range of capabilities, including language comprehension,
social interactions, creativity, and emotional intelligence.
Example: A person walking through a crowded street demonstrates natural intelligence by
effortlessly navigating through the environment, avoiding obstacles, recognizing familiar faces,
interpreting traffic signals, and making decisions based on situational awareness. This ability to
process complex sensory information, analyze context, and respond appropriately showcases the
remarkable cognitive abilities inherent in natural intelligence.

4. Artificial Intelligence (AI)
Definition: AI is the broader field that encompasses any technique enabling computers to mimic
human intelligence. This includes problem-solving, understanding language, recognizing patterns,
and learning from experience.
Example: Siri, Apple's virtual assistant, is an example of AI. It can understand and respond to
your questions, set reminders, and perform tasks based on your voice commands. This involves
natural language processing and machine learning.
Types of AI
Artificial intelligence can be categorized in two main ways: by the capabilities of the AI and
by its functionality.

I) Based on Capabilities
| AI Type | Description | Examples |
|---|---|---|
| 1. Narrow AI | AI designed for specific tasks. | Apple Siri; playing chess; self-driving cars; speech recognition; image recognition |
| 2. General AI | AI with the ability to understand, learn, and apply intelligence across a wide range of tasks. | Currently theoretical; under research and development. |
| 3. Super AI | AI that surpasses human intelligence in all aspects, including creativity, problem-solving, and emotions. | Currently theoretical; under research and development. |
II) Based on Functionalities
| AI Type | Description | Examples |
|---|---|---|
| 1. Reactive Machines | Basic AI systems that react to current scenarios without storing memories or past experiences. | IBM's Deep Blue; Google's AlphaGo |
| 2. Limited Memory | AI systems capable of storing past experiences for a short period and using them to inform decisions. | Self-driving cars storing recent speed of nearby cars, distance, speed limits, etc. |
| 3. Theory of Mind | AI intended to understand human emotions and beliefs, and to interact socially like humans. | Currently in development; no existing examples. |

5. Machine Learning (ML)
Definition: Machine Learning (ML) is a subset of artificial intelligence (AI) that enables
computers to learn from and make predictions or decisions based on data. Instead of being
explicitly programmed to perform a task, a machine learning model is trained using data and
algorithms to find patterns and make decisions.
Example: A spam filter in your email uses ML to distinguish between spam and legitimate emails.
It learns from past emails marked as spam or not and uses that data to predict and filter future
emails.
Difference between Machine Learning and Traditional Programming
 Machine learning (ML) and traditional programming are two distinct approaches to solving
problems with computer systems.
 While traditional programming relies on explicit rules and human-crafted logic, machine
learning leverages algorithms that learn from data to make predictions or decisions.

| Characteristic | Machine Learning | Traditional Programming |
|---|---|---|
| Approach | Data-driven | Rule-based |
| Development Process | Model is trained using data and algorithms | Explicit rules and logic are coded by developers |
| Adaptability | Highly adaptable; can improve with more data | Limited to predefined rules; changes require code modifications |
| Human Intervention | Minimal after training; continuous learning | Ongoing maintenance and updates by developers |
| Handling Complexity | Handles complex patterns and large datasets effectively | Effective for well-defined problems and tasks |
| Required Input | Large datasets for training and testing | Detailed specifications and rules |
| Error Handling | Can handle noisy or incomplete data | Requires precise data and handling of edge cases |
| Performance | Performance improves with more data and better algorithms | Performance depends on code optimization |
| Learning from Data | Learns and improves from new data | Does not learn; behavior remains static unless reprogrammed |
| Flexibility | Can generalize well to new, unseen data | Limited flexibility; changes require code rewrite |
| Predictive Capability | Can make predictions based on patterns in data | Cannot predict; follows explicit instructions |
| Time to Deployment | Longer initial setup for training models | Quicker to deploy for well-defined tasks |
| Scalability | Scales well with more data and computational power | Scales with code complexity and hardware resources |
| Limitations | Can be limited by the quality and quantity of training data | Can be limited by the programmer's understanding and analysis of the problem |
| Advantages | Can handle complex tasks and adapt to new data and scenarios | Can be used for tasks that require specific functionality and are well-defined |
| Examples of Technologies | Neural networks, decision trees, support vector machines | Compilers, interpreters, databases |
| Applications | Used for complex tasks like image recognition and predictive analytics | Used for tasks like database management and website development |
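As a rough illustration of the contrast drawn in the table above, the sketch below puts a hand-written spam rule next to a "rule" that is derived from labeled examples. The keyword list, the word-scoring scheme, and the six training emails are all hypothetical and chosen only to keep the example small; real spam filters are considerably more elaborate.

```python
from collections import Counter

# Traditional programming: a developer writes the rule by hand.
def rule_based_is_spam(text):
    banned = {"lottery", "prize"}  # hypothetical hand-picked keywords
    return any(word in banned for word in text.lower().split())

# Machine learning: per-word scores are derived from labeled examples.
def train_word_scores(examples):
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        target = spam_counts if is_spam else ham_counts
        target.update(set(text.lower().split()))
    vocabulary = set(spam_counts) | set(ham_counts)
    # A positive score means the word appeared in more spam than legitimate emails.
    return {word: spam_counts[word] - ham_counts[word] for word in vocabulary}

def learned_is_spam(text, scores):
    return sum(scores.get(word, 0) for word in text.lower().split()) > 0

training_data = [  # illustrative labeled emails (True = spam)
    ("win a free prize now", True),
    ("free lottery ticket inside", True),
    ("claim your free voucher", True),
    ("meeting agenda for monday", False),
    ("your invoice for march", False),
    ("lunch on monday", False),
]
scores = train_word_scores(training_data)

print(rule_based_is_spam("free voucher waiting"))         # False: no banned keyword
print(learned_is_spam("free voucher waiting", scores))    # True: learned from the data
print(learned_is_spam("agenda for monday lunch", scores)) # False
```

Changing the behavior of the first function requires editing its code, whereas the second changes as soon as it is retrained on different examples.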

Need for Machine Learning
 The need for machine learning (ML) arises from the exponential expansion of data in the
digital age. Traditional analytical approaches are no longer adequate due to the vast
amounts of data being generated daily.
 Machine learning algorithms can find patterns, trends, and connections that humans would
not even be aware of, making them crucial for decision-making processes, optimizing
resource allocation, and spurring innovation in various sectors.
1. Handling Big Data: ML can process and analyze vast amounts of data, extracting
meaningful patterns and insights.
2. Complex Pattern Recognition: ML algorithms excel at identifying intricate patterns
in data that are difficult to detect using traditional methods.
3. Automation of Tasks: ML enables automation of repetitive tasks, reducing human
intervention and increasing efficiency.
4. Improved Decision Making: ML models can provide data-driven insights and
predictions, aiding in better decision-making processes.
5. Adaptability: ML models can adapt to new data and changing conditions, making them
flexible and robust.
6. Personalization: ML allows for personalized experiences, such as tailored
recommendations in e-commerce and streaming services.
7. Scalability: ML systems can scale with the amount of data and computational power,
improving performance and accuracy over time.
8. Real-Time Processing: ML can process data in real time, enabling applications like fraud
detection, autonomous vehicles, and instant recommendations.
9. Complex Problem Solving: ML can tackle problems that are too complex for traditional
algorithms, such as image and speech recognition.
10. Predictive Maintenance: ML can predict equipment failures and maintenance needs,
reducing downtime and saving costs.
11. Enhanced Customer Experience: ML-driven chatbots and virtual assistants provide
better customer support and interaction.

Life Cycle of Machine Learning
The Machine Learning life cycle encompasses the iterative process of developing, deploying, and
maintaining machine learning models. It involves steps such as data gathering, preprocessing,
model selection and training, evaluation, and deployment, ensuring the model's effectiveness and
adaptability to real-world scenarios. This cyclical approach enables continuous improvement and
refinement of models to meet evolving needs and challenges.
The seven steps of the machine learning life cycle are described below, using a fruit classification example:
1. Define Problem Statement
 Understand the problem to be solved and define the objectives of the machine learning
project.
 In this example, the goal is to develop a model that can classify fruits into different
categories based on their features, such as color, shape, and size.

2. Data Gathering
 Collect relevant data that will be used to train and test the machine learning model.
 This could involve gathering information about various types of fruits, including images
and corresponding labels indicating the fruit type.
3. Data Preparation
 Clean and preprocess the collected data to ensure it is in a suitable format for training the
machine learning model.
 This may involve tasks such as removing irrelevant features, handling missing values, and
normalizing the data.
4. Data Analysis
 Explore and analyze the prepared data to gain insights into its characteristics and identify
patterns that may be useful for training the model.
 For example, analyzing the distribution of different fruit types in the dataset and
visualizing the relationships between features.
5. Model Selection and Training
 Select an appropriate machine learning algorithm and train it using the prepared data.
 In this example, you might choose a classification algorithm such as a decision tree or a
neural network to train the model to classify fruits based on their features.
6. Model Testing
 Evaluate the performance of the trained model using a separate dataset that was not used
during training. This helps assess how well the model generalizes to new, unseen data.
 For fruit classification, you would test the model on a set of fruit images it hasn't seen
before and measure its accuracy in predicting the correct fruit type.

7. Deployment
 Deploy the trained model into a production environment where it can be used to make
predictions on new, incoming data.
 For example, you could develop a mobile app that allows users to take a picture of a fruit
and have the model classify it in real-time based on its features.
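The seven steps above can be sketched end to end in a few lines. This is a minimal sketch under stated assumptions, not a production pipeline: the fruit measurements are synthetic, the two features (weight in grams and a redness score) are hypothetical, and a nearest-centroid classifier stands in for the decision tree or neural network mentioned in step 5.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Step 1: problem statement = classify a fruit as "apple" or "banana"
# from two hypothetical features: (weight in grams, redness between 0 and 1).

# Step 2: data gathering (synthetic stand-in for real measurements).
def sample_fruit(label):
    if label == "apple":
        return (random.gauss(150, 10), random.gauss(0.8, 0.05)), label
    return (random.gauss(120, 10), random.gauss(0.2, 0.05)), label

dataset = [sample_fruit("apple") for _ in range(50)]
dataset += [sample_fruit("banana") for _ in range(50)]

# Step 3: data preparation (shuffle, split, min-max normalize using training data only).
random.shuffle(dataset)
train, test = dataset[:70], dataset[70:]
columns = list(zip(*(features for features, _ in train)))
ranges = [(min(col), max(col)) for col in columns]

def normalize(features):
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(features, ranges))

# Step 4: data analysis (here only the class balance of the training split).
class_counts = {}
for _, label in train:
    class_counts[label] = class_counts.get(label, 0) + 1

# Step 5: model selection and training (nearest centroid: one mean point per class).
def train_centroids(rows):
    sums = {}
    for features, label in rows:
        x = normalize(features)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = (tuple(t + v for t, v in zip(total, x)), count + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}

centroids = train_centroids(train)

# Step 7: deployment amounts to exposing a prediction function to new inputs.
def predict(features):
    x = normalize(features)
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=squared_distance)

# Step 6: model testing on data that was not used for training.
accuracy = sum(predict(f) == label for f, label in test) / len(test)
print(class_counts, accuracy)
```

The normalization ranges are computed from the training split only, so no information from the test split leaks into the model before it is evaluated.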
Types of Machine Learning Algorithms:
Machine learning algorithms are classified into three main types: supervised, unsupervised, and
reinforcement learning.
1. Supervised Learning:
Definition: Supervised learning involves training a model on a labeled dataset, where each input
is associated with a corresponding output label. The goal is for the model to learn the mapping
between inputs and outputs, enabling it to make predictions on new, unseen data.

Working: The algorithm learns from the labeled examples by adjusting its parameters to
minimize prediction errors. It generalizes from the training data to make predictions on new
instances, aiming to accurately predict the correct output labels.
Example: Given the features of a shape (e.g., number of sides, angles), the supervised learning
algorithm would analyze these features and learn patterns distinguishing between different types
of shapes. Once trained, the model can classify new shapes based on their features into categories
like square, rectangle, triangle, or polygon.
Applications: Classification tasks such as spam detection, sentiment analysis, image recognition,
and regression tasks like predicting house prices or stock prices.
Advantages: Ability to make precise predictions on new data, well-understood and widely
applicable across various domains.
Disadvantages: Requires labeled training data, which can be time-consuming and expensive to
obtain. Performance highly depends on the quality and quantity of labeled examples.
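To make the "adjusting its parameters to minimize prediction errors" idea concrete, here is a minimal sketch using the perceptron rule, which is only one of many supervised algorithms. The two features (number of sides, number of right angles) and the five labeled shapes are illustrative assumptions, not data from these notes.

```python
# Labeled training data: (number of sides, number of right angles) -> label.
# Label 0 = triangle, label 1 = quadrilateral (illustrative data).
training_data = [
    ((3, 0), 0), ((3, 1), 0),
    ((4, 4), 1), ((4, 0), 1), ((4, 2), 1),
]

weights = [0.0, 0.0]
bias = 0.0

def predict(features):
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0

# Perceptron rule: whenever a prediction is wrong, nudge the parameters
# in the direction that reduces that error.
for epoch in range(10_000):
    mistakes = 0
    for features, label in training_data:
        error = label - predict(features)  # -1, 0, or +1
        if error != 0:
            mistakes += 1
            weights = [w + error * x for w, x in zip(weights, features)]
            bias += error
    if mistakes == 0:  # every training example is now classified correctly
        break

# Prediction for a shape that was not in the training data.
print(weights, bias, predict((4, 1)))
```

Because the two classes here can be separated by a straight line, the loop is guaranteed to stop with zero training mistakes; on data that is not linearly separable the perceptron rule would keep cycling.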
2. Unsupervised Learning:
Definition: Unsupervised learning involves training a model on an unlabeled dataset, where the
algorithm learns to find patterns or structures within the data without predefined output labels.

Working: The algorithm identifies underlying patterns or structures in the data without the need
for labeled output. Common techniques include clustering similar data points together or reducing
the dimensionality of the data.
Example: An unsupervised learning algorithm could analyze the geometric properties of the
shapes (e.g., side lengths, angles) and identify clusters of shapes that exhibit similar characteristics.
This could result in clusters representing shapes with similar attributes, such as squares, rectangles,
triangles, and polygons.
Applications: Clustering (e.g., customer segmentation), dimensionality reduction (e.g., principal
component analysis), and anomaly detection.
Advantages: Can uncover hidden patterns or structures in data without labeled examples. Doesn't
require manual labeling of large datasets.
Disadvantages: May be more challenging to interpret results compared to supervised learning.
Relies on assumptions about the structure of the data.
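Clustering, the first technique mentioned above, can be sketched with a plain k-means loop. The six two-dimensional points and the choice of initial centroids are assumptions made for illustration; no labels are given to the algorithm.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = new_centroids
    return labels, centroids

# Unlabeled feature vectors (hypothetical geometric measurements of six shapes).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centroids=[points[0], points[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The cluster numbers carry no meaning by themselves; interpreting what each group represents is left to the analyst, which is the interpretability cost noted under Disadvantages.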

3. Reinforcement Learning:
Definition: Reinforcement learning involves training an agent to interact with an environment and
learn to make decisions based on feedback in the form of rewards or penalties.
Working: The agent takes actions in an environment and receives feedback in the form of rewards
or penalties. It learns to maximize cumulative rewards over time through trial and error, aiming to
discover the best sequence of actions to achieve its goals.

Example: In a maze scenario, the agent is tasked with navigating from the starting point to the
destination while avoiding obstacles and maximizing rewards. The maze consists of different
blocks, including a wall (S6), a fire pit (S8), and a diamond block (S4). The agent receives a +1
reward for reaching the diamond block (S4) and a -1 reward for falling into the fire pit (S8).
Applications: Game playing, robotics, recommendation systems, natural language processing, and
finance (e.g., algorithmic trading).
Advantages: Capable of learning complex behaviors through interaction with the environment.
Can handle situations with delayed feedback and uncertainty.
Disadvantages: Can be computationally expensive and require large amounts of data for training.
Training may be unstable or require careful tuning of hyperparameters. Learning from delayed
rewards can be slow and inefficient in some scenarios.
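The maze example can be sketched with tabular Q-learning, one common reinforcement learning algorithm. The grid layout below is an assumption (a 3x3 grid numbered row by row, with the agent starting at S3), as are the learning rate, discount factor, and episode count; only the wall at S6, the fire pit at S8, and the diamond at S4 come from the example.

```python
import random

random.seed(1)  # fixed seed for repeatability

# Assumed layout (hypothetical): a 3x3 grid numbered row by row,
#   S1 S2 S3
#   S4 S5 S6
#   S7 S8 S9
# start at S3, wall at S6, fire pit at S8 (-1), diamond at S4 (+1).
START, DIAMOND, WALL, FIRE = 3, 4, 6, 8
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for a deterministic move."""
    row, col = divmod(state - 1, 3)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    next_state = state  # bumping into the wall or the edge leaves the agent in place
    if 0 <= new_row < 3 and 0 <= new_col < 3:
        candidate = new_row * 3 + new_col + 1
        if candidate != WALL:
            next_state = candidate
    if next_state == DIAMOND:
        return next_state, 1.0, True
    if next_state == FIRE:
        return next_state, -1.0, True
    return next_state, 0.0, False

alpha, gamma = 0.5, 0.9  # learning rate and discount factor (illustrative values)
Q = {(s, a): 0.0 for s in range(1, 10) for a in ACTIONS}

for episode in range(2000):
    state = START
    for _ in range(100):  # cap on episode length
        action = random.choice(list(ACTIONS))  # trial and error: purely random exploration
        next_state, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if done:
            break

def greedy_path(state, max_steps=10):
    """Follow the highest-valued action from each state."""
    path = [state]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
        state, reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

print(greedy_path(START))
```

Q-learning is off-policy, so the agent can explore with random moves during training and still learn values for the greedy policy that is followed afterwards.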

Comparison of Supervised, Unsupervised, and Reinforcement Learning
| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Definition | Supervised learning involves training a model on a labeled dataset, where the algorithm learns to map input data to corresponding output labels or categories. | Unsupervised learning involves training a model on an unlabeled dataset, where the algorithm learns to find patterns or structures within the data without predefined output labels. | Reinforcement learning involves training an agent to interact with an environment and learn to make decisions based on feedback in the form of rewards or penalties. |
| Data Type | Requires labeled data for both input and output. | Works with unlabeled data; no output labels are provided during training. | Involves an environment where actions are taken and feedback is received in the form of rewards or penalties. |
| Feedback Mechanism | Feedback provided in the form of labeled examples, allowing the algorithm to adjust its parameters to minimize prediction errors. | No explicit feedback is provided; the algorithm learns to identify patterns based on the inherent structure of the data. | Feedback received in the form of rewards or penalties based on the actions taken by the agent in the environment. |
| Objective | Predict the output label for new, unseen data based on learned patterns from labeled examples. | Discover hidden patterns or structures within the data to gain insights or make sense of complex datasets. | Learn a policy or strategy that maximizes cumulative rewards over time, aiming to achieve specific goals or tasks. |
| Example | Image classification, sentiment analysis, regression tasks like predicting house prices. | Clustering similar data points together, dimensionality reduction, anomaly detection. | Training an agent to play games (e.g., chess, Go), robotics (e.g., navigating a maze), recommendation systems. |
| Applications | Classification (e.g., spam detection, image recognition); regression (e.g., house price prediction). | Clustering (e.g., customer segmentation, document clustering); dimensionality reduction (e.g., principal component analysis); anomaly detection. | Game playing (e.g., chess, Go); robotics (e.g., autonomous vehicles); recommendation systems; natural language processing; finance (e.g., algorithmic trading). |
| Advantages | Ability to make precise predictions on new, unseen data; well-understood and widely applicable in various domains. | Can uncover hidden patterns or structures in data without labeled examples; doesn't require manual labeling of large datasets. | Capable of learning complex behaviors through interaction with the environment; can handle situations with delayed feedback and uncertainty. |
| Disadvantages | Requires labeled training data, which may be time-consuming and expensive to obtain; performance highly dependent on the quality and quantity of labeled examples. | May be more challenging to interpret results compared to supervised learning; relies on assumptions about the structure of the data. | Can be computationally expensive and require large amounts of data for training; training may be unstable or require careful tuning of hyperparameters; learning from delayed rewards can be slow and inefficient in some scenarios. |

6. Deep Learning (DL)
Definition: DL is a subset of ML that uses neural networks with many layers (hence "deep") to
model and understand complex patterns in data. Deep learning is particularly powerful for tasks
like image and speech recognition.
Example: An application like Google Photos can automatically organize your photos by
recognizing faces, objects, and scenes. This is done using deep learning algorithms that have been
trained on vast amounts of image data to identify and categorize images accurately.

Comparison of AI, ML, and DL
| Aspect | Artificial Intelligence (AI) | Machine Learning (ML) | Deep Learning (DL) |
|---|---|---|---|
| Definition | AI refers to the broader concept of creating machines that can perform tasks requiring human-like intelligence. | ML involves the development of algorithms that enable computers to learn from data and improve over time. | DL is a subset of ML that uses artificial neural networks with multiple layers (deep architectures) to learn representations of data. |
| Approach | Mimics human intelligence and behavior to perform tasks. | Learns patterns from data and makes predictions or decisions. | Learns representations of data through hierarchical layers of abstraction. |
| Examples | Virtual assistants (e.g., Siri, Alexa), autonomous vehicles, game playing AI. | Spam filters, recommendation systems, image recognition. | Image and speech recognition, natural language processing, autonomous driving. |
| Data Size | Can handle both small and large datasets. | Can handle both small and large datasets. | Particularly effective with large volumes of data. |
| Complexity | Can be complex and may involve various approaches, including ML and DL. | Can range from simple linear models to complex deep neural networks. | Utilizes complex neural network architectures with multiple layers. |
| Interpretability | May lack interpretability due to the complexity of AI systems. | Depends on the complexity of the ML model; simpler models may be more interpretable. | Often considered less interpretable due to the hierarchical nature of deep neural networks. |
| Training Time | Can vary widely depending on the complexity of the AI system. | Training time depends on the complexity of the ML model and the size of the dataset. | Can be time-consuming, especially with large datasets and complex architectures. |
| Hardware | Can run on various hardware platforms, including CPUs and GPUs. | Can run on CPUs and GPUs, with specialized hardware (e.g., TPUs) available for ML tasks. | Often requires GPUs or specialized hardware accelerators for training and inference. |
| Applications | Wide range of applications across industries, including healthcare, finance, and gaming. | Numerous applications in fields such as healthcare, finance, e-commerce, and more. | Dominates fields such as computer vision, natural language processing, and speech recognition. |

WELL-POSED LEARNING PROBLEM
 A well-posed learning problem is a problem whose solution exists, is unique, and depends
continuously on the data, so that it is not sensitive to small changes in the data.
 It is formally defined by Tom Mitchell as: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E."
 The three essential components of a well-posed learning problem:
1. Task (T): The specific problem or task that the model is intended to solve.
2. Performance Measure (P): The metric used to evaluate the model's performance.
3. Experience (E): The data used to train and improve the model.
Criteria for a Well-Posed Learning Problem
1. Well-Defined Objective: The problem should have a clear and specific goal.
2. Relevant and Sufficient Data: The data should be relevant to the problem and sufficient
in quantity and quality to train the model effectively.
3. Measurable Performance: There must be a way to measure the performance of the model,
such as accuracy, precision, recall, F1 score, mean squared error, etc.
4. Feasibility and Practicality: The problem should be practically solvable given the current
technology, data availability, and resource constraints.
Examples of Well-Posed Learning Problems:
1. Learning to Play Checkers:
 Task: Play the checkers game.
 Performance Measure: Percentage of games won against the opponent.
 Experience: Playing practice games against itself.
2. Handwriting Recognition:
 Task: Recognizing and classifying handwritten words from images.
 Performance Measure: Percentage of correctly identified words.
 Experience: A set of handwritten words with their classifications in a database.
3. Robot Driving:
 Task: Driving on public four-lane highways using vision sensors.
 Performance Measure: Average distance traveled before an error.
 Experience: A sequence of images and steering commands recorded while
observing a human driver.
4. Spam Filtering:
 Task: Identifying whether or not an email is spam.
 Performance Measure: Percentage of emails correctly categorized as spam or
nonspam.
 Experience: Observing how you categorize emails as spam or nonspam.
5. Face Recognition:
 Task: Recognizing different faces.
 Performance Measure: Number of different faces the system identifies correctly.
 Experience: Training the system on as many varied datasets of facial photos as
possible.
DESIGNING A LEARNING SYSTEM
The basic design issues and approaches to machine learning are illustrated by designing a program
to learn to play checkers, with the goal of entering it in the world checkers tournament. It mainly
involves the following five steps:
1. Choosing the Training Experience
a) Type of Feedback: Direct vs. Indirect
b) Degree of Control over Training Sequence
c) Representation of Example Distribution
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
a) Linear Function
b) Neural Networks
c) Decision Trees
4. Choosing a Function Approximation Algorithm
a) Estimating training values
b) Adjusting the weights
5. The Final Design

1. Choosing the Training Experience
 The training experience defines the data or experiences the machine learning algorithm
will use to learn.
 The training data must reflect the overall characteristics of the dataset to ensure the
algorithm performs well in real-world scenarios. To select the optimal training experience,
consider these three key attributes:
a) Type of Feedback: Direct vs. Indirect
The first attribute is whether the training experience provides direct or indirect feedback on the
choices made by the learner.

 Direct Feedback: The training experience provides immediate feedback on each choice
made by the algorithm. For example, in a game, the algorithm gets feedback on every move
it makes.
Let f(t) represent the feedback function, where t is the time step or iteration.
$$f(t) = r(t)$$
where r(t) is the reward or feedback received at time t. For example:
$$\text{Reward}(s, a) = \begin{cases} +1 & \text{if the move } a \text{ leads to a win} \\ -1 & \text{if the move } a \text{ leads to a loss} \\ 0 & \text{otherwise} \end{cases}$$
Here, s represents the state (board configuration), and a represents the action (move).
 Indirect Feedback: The training experience provides feedback after a sequence of actions,
indicating the final outcome rather than the quality of each individual move. This is
common in scenarios where the algorithm needs to learn from the consequences of a series
of decisions, such as in strategic games or long-term planning tasks.
$$f(t) = R(T)$$
where R(T) is the cumulative reward or feedback received at the end of a sequence of
actions at time T.
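The difference between the two kinds of feedback can be sketched as follows. The function names are hypothetical, and the `spread_credit` helper shows one commonly used remedy for indirect feedback (discounting the final outcome back over earlier moves); that remedy is an assumption added for illustration and is not prescribed by these notes.

```python
def direct_feedback(move_rewards):
    # f(t) = r(t): one feedback value for every move, as soon as it is made.
    return list(enumerate(move_rewards, start=1))

def indirect_feedback(num_moves, final_outcome):
    # f(t) = R(T): a single value, reported only after the last move T.
    return [(num_moves, final_outcome)]

def spread_credit(num_moves, final_outcome, discount=0.9):
    # Assumed remedy: each earlier move receives a discounted share
    # of the final outcome, so moves closer to the end get more credit.
    return [final_outcome * discount ** (num_moves - t) for t in range(1, num_moves + 1)]

print(direct_feedback([0, 0, 1]))   # [(1, 0), (2, 0), (3, 1)]
print(indirect_feedback(3, 1))      # [(3, 1)]
print(spread_credit(3, 1.0))
```

With indirect feedback the learner has to decide for itself how much each earlier move contributed to the final result, which is known as the credit assignment problem.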
b) Degree of Control over Training Sequence:
 Teacher-Driven: The teacher (or supervisor) selects the training examples, providing
informative states and correct actions. This approach is structured but limits the learner's
ability to explore and understand the problem space independently.

Let S(t) represent the state at time t, and A(t) represent the action taken at time t.
$$A(t) = \text{Teacher}(S(t))$$
where Teacher is a function provided by the teacher that selects the action based on the
state.
 Learner-Driven: The learner selects the training examples by identifying challenging or
confusing states and requesting guidance from the teacher. This method promotes active
learning and helps the learner focus on areas where it needs the most improvement.
$$A(t) = \text{Learner}(S(t)) + \lambda \times \text{Teacher}(S(t))$$
where Learner(S(t)) is the action proposed by the learner, and λ is a mixing parameter
indicating the reliance on teacher feedback.
 Self-Learning: The learner has complete control over the training process, generating its
own examples and learning from them without external guidance. This method, often used
in reinforcement learning, allows the learner to explore a wide range of scenarios but
requires robust mechanisms to avoid overfitting and ensure generalization.
$$A(t) = \text{Learner}(S(t))$$
where the learner fully controls the action selection without external guidance.
c) Representation of Example Distribution:
 The training data should cover a diverse range of examples that reflect the distribution of
scenarios the algorithm will encounter in real-world use.
 Problem: If the training examples are biased or not representative of the overall data set,
the result can be overfitting or poor generalization.
 Example: Training only on games that the algorithm wins does not account for challenging
scenarios.
 Mitigation: Ensure the training set includes examples from both wins and losses,
various board states, and diverse opponent strategies.
$$\text{TrainingSet} = \{(s, a, r) \mid s \in \text{AllStates},\ a \in \text{AllActions},\ r \in \text{AllRewards}\}$$
 A diverse training set helps the algorithm generalize better, improving its performance
across different situations. Ensuring the training experience encompasses varied examples
is crucial for achieving robust and reliable performance.
When designing a checkers-playing program, it's essential to carefully select the training
experience to ensure that the algorithm learns effectively and generalizes well to new games.
Considering the type of feedback, the degree of control over training examples, and the distribution
of examples will significantly impact the success of the algorithm.
2. Choosing the Target Function
The target function represents the goal of the learning process, mapping from the current state of
the game to the desired outcome.
NextMove Function: The target function f could predict the value of the best move available in
a given state. It can be represented as:
$$f(s) = \max_{a \in \text{LegalMoves}(s)} V(s, a)$$
where s is the current board state, a is a possible move, and V(s,a) is the value of taking action a
in state s.
3. Choosing a Representation for the Target Function
The representation defines how the target function will be modeled, which can impact the learning
process's complexity and efficiency.
a) Linear Function: A simple linear combination of features.
$$V(s, a) = \sum_{i} w_i f_i(s, a)$$
where f_i(s,a) are features, and w_i are the weights.
b) Neural Networks: More complex and capable of capturing non-linear relationships.
$$V(s, a) = \text{NN}(s, a; \theta)$$
where θ are the parameters of the neural network.
c) Decision Trees: Hierarchical model splitting decisions based on feature values.
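As a small illustration of the linear representation in a), the sketch below scores each legal move with a weighted sum of features and then picks the highest-valued move, in the spirit of the target function from step 2. The three features, the weights, and the move names are all hypothetical.

```python
def linear_value(weights, features):
    # V(s, a) = sum_i w_i * f_i(s, a)
    return sum(w * f for w, f in zip(weights, features))

def best_move(weights, legal_moves):
    # legal_moves maps each move name to its feature vector f(s, a).
    return max(legal_moves, key=lambda move: linear_value(weights, legal_moves[move]))

# Hypothetical features for three candidate moves in one board state:
# (pieces captured, own pieces left exposed, kings gained)
weights = [1.0, -0.5, 2.0]
legal_moves = {
    "advance": (0, 1, 0),
    "capture": (1, 2, 0),
    "crown": (0, 0, 1),
}

for move, features in legal_moves.items():
    print(move, linear_value(weights, features))
print(best_move(weights, legal_moves))  # crown
```

A linear form keeps learning simple because only the weights need to be adjusted, at the cost of not being able to express interactions between features.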
4. Choosing a Function Approximation Algorithm
The function approximation algorithm determines how the target function will be learned from the
training data.
a) Estimating Training Values
To train the model, we need to estimate the value of different moves. This can be done using
techniques like:
 Monte Carlo Simulation: Running simulations to estimate the value of each move based
on the outcomes of simulated games.
$$V(s, a) = \frac{1}{N} \sum_{i=1}^{N} \text{Outcome}(G_i)$$
where G_i is the i-th simulated game starting from state s after move a.
 Temporal Difference Learning: Updating estimates based on the difference between
successive state values.
$$V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)$$
where α is the learning rate, γ is the discount factor, r is the reward, and s′ is the next
state.
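Both estimation techniques can be sketched in a few lines. The `simulate_game` function is a hypothetical stand-in for playing out a game after a move (here it simply wins with probability 0.7), and the values of α and γ are illustrative.

```python
import random

def monte_carlo_value(simulate_game, n_games=1000):
    # V(s, a) = (1/N) * sum of the outcomes of N simulated games.
    return sum(simulate_game() for _ in range(n_games)) / n_games

def td_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    # V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return values[state]

random.seed(0)

def simulate_game():
    # Hypothetical stand-in for one simulated game: outcome 1.0 for a win, 0.0 otherwise.
    return 1.0 if random.random() < 0.7 else 0.0

estimate = monte_carlo_value(simulate_game, n_games=1000)

values = {"s": 0.5, "s_next": 1.0}
updated = td_update(values, "s", reward=0.0, next_state="s_next")
print(estimate, updated)
```

The Monte Carlo estimate needs complete games before it can be computed, while the temporal difference update can be applied after every single move because it relies on the current estimate of the next state.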
b) Adjusting the Weights
Using a learning algorithm to adjust the weights of the representation based on the estimated
values.
 Gradient Descent: For a neural network, weights are adjusted to minimize the error in the
estimated values.
$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)$$
where η is the learning rate, and L(θ) is the loss function.
 Least Squares Method: For linear functions, weights can be adjusted using the least
squares method to fit the function to the training data.
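The sum of squared errors (SSE) is given by:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Substituting the linear prediction ŷ_i = w⊤x_i + b gives:
$$SSE = \sum_{i=1}^{n} \left( y_i - (w^\top x_i + b) \right)^2$$
where:
 w=[w_1,w_2,…,w_d]⊤ is the weight vector (one weight per feature),
 x_i is the feature vector of the i-th training example,
 b is the bias term,
 ŷ_i is the predicted value for the i-th example,
 y_i is the corresponding target value,
 n is the number of training examples.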
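As a minimal sketch of least squares, the code below fits a line to a tiny made-up dataset using the closed-form solution for a single feature. The single-feature restriction and the data are assumptions made for brevity; with several features the same idea requires solving the normal equations.

```python
def sse(xs, ys, w, b):
    """Sum of squared errors of the line y = w*x + b on the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares fit for one feature: returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Made-up data generated from y = 2x + 1 without noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = fit_line(xs, ys)
print(w, b, sse(xs, ys, w, b))
```

Because the data lie exactly on a line, the fitted weights reproduce w = 2 and b = 1 and the SSE is zero; any other line gives a larger SSE.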

By following these steps, a machine learning-based checkers program can be developed,
optimized, and made ready for competitive play.
5. The Final Design
 The final design of a checkers learning system can be described by four distinct program
modules that represent the central components in many learning systems. These modules
work together to facilitate the learning and improvement of the system over time through
a series of iterations involving performance, critique, generalization, and experimentation.
 The final design integrates all components, resulting in a checkers-playing program
capable of competing in the world checkers tournament.

Overall Workflow:
1. The Performance System plays a new game of checkers and records the game history.
2. The Critic analyzes this game history to generate training examples.
3. The Generalizer uses these training examples to update its hypothesis about the best
moves in checkers.
4. The Experiment Generator uses the updated hypothesis to select a new initial board state
for the next game, and the cycle repeats.
Through this iterative process, the system continuously improves its ability to play checkers by
learning from each game played, evaluating its performance, generalizing from its experiences,
and exploring new game scenarios.

PERSPECTIVE AND ISSUES IN MACHINE LEARNING
Machine learning encompasses various perspectives, from supervised learning's reliance on
labeled data to reinforcement learning's dynamic environment interactions, yet faces challenges
such as data bias and interpretability concerns.
Perspectives in Machine Learning:
1. Data-Centric Perspective:
 Machine learning focuses on leveraging data to extract meaningful patterns,
insights, and knowledge.
 It emphasizes the importance of data quality, quantity, and relevance in training
accurate models.
2. Model-Centric Perspective:
 Machine learning involves designing and developing models that can learn from
data and make predictions or decisions.
 Models can range from simple linear models to complex deep neural networks,
and their selection depends on the problem and data characteristics.
3. Algorithmic Perspective:
 Machine learning encompasses various algorithms and techniques that enable
models to learn from data.
 These include supervised learning, unsupervised learning, reinforcement learning,
and deep learning, among others.
Issues in Machine Learning
1. Data Quality and Quantity:
o Issue: Insufficient or poor-quality data can lead to inaccurate models and biased
results.
o Solution: Collecting more high-quality data, preprocessing data to handle missing
values and outliers, and ensuring data is representative of the problem domain.

2. Overfitting and Underfitting:
o Issue: Overfitting occurs when a model learns the training data too well but fails to
generalize to new, unseen data. Underfitting happens when the model is too simple
to capture the underlying structure of the data.
o Solution: Regularization techniques, cross-validation, and adjusting model
complexity can help mitigate overfitting and underfitting.
3. Interpretability and Explainability:
o Issue: Complex machine learning models often lack interpretability, making it
challenging to understand and trust their decisions, especially in critical
applications like healthcare or finance.

o Solution: Using simpler, more interpretable models when possible, or employing
techniques such as feature importance analysis and model explanation methods.
4. Bias and Fairness:
o Issue: Models can inadvertently learn and perpetuate biases present in the training
data, leading to unfair or discriminatory outcomes.
o Solution: Careful selection and preprocessing of training data, fairness-aware
algorithms, and post-processing techniques to mitigate bias.
5. Computational Resources:
o Issue: Training and deploying complex machine learning models can require
significant computational resources, including processing power and memory.
o Solution: Optimizing algorithms and model architectures, utilizing distributed
computing frameworks, and leveraging cloud computing resources.
6. Privacy and Security:
o Issue: Machine learning models trained on sensitive data may inadvertently leak
private information or be vulnerable to adversarial attacks.
o Solution: Implementing privacy-preserving techniques such as differential privacy,
federated learning, and robust model training against adversarial attacks.
7. Ethical Considerations:
o Issue: Machine learning applications raise ethical concerns regarding issues like
data privacy, consent, transparency, and potential societal impacts.
o Solution: Adhering to ethical guidelines and regulations, fostering interdisciplinary
collaboration, and engaging in transparent communication with stakeholders.
Addressing these issues requires a combination of technical expertise, ethical considerations, and
interdisciplinary collaboration to ensure responsible and effective deployment of machine learning
systems.

CONCEPT LEARNING
Concept learning is a key task in machine learning, aimed at discovering general patterns or
concepts from labeled examples.
Definition: Concept learning is the task of inferring a Boolean-valued function from training
examples of its input and output.
It involves the following steps and objectives:
1. Inference of Hypotheses: The process starts by inferring a hypothesis that accurately
describes the target concept based on observed instances. For example, understanding what
a "bird" is by analyzing various examples of birds and identifying their common
characteristics.
2. Generalization: The goal is to derive a general rule or concept from specific examples.
This allows the model to generalize beyond the training data, making accurate predictions
on new, unseen instances.
3. Pattern Recognition and Classification: Concept learning is crucial for tasks such as
classification and pattern recognition. By identifying the underlying rules or patterns that
define a concept, systems can make predictions or decisions based on the learned
knowledge.
In the study of concept learning, there are two aspects to consider:
i) Concept Learning Task
ii) Concept Learning as Search

Types of Concept Learning
i) Concept Learning Task
Definition: A concept learning task involves identifying a general rule (or concept) from specific
examples, allowing for the classification of new, unseen examples. The concept learning task
typically involves the following components:
1. Instance Space (X):
 The instance space refers to the set of all possible instances or examples that can be
observed or encountered in the domain of interest.
$$X = \{x_1, x_2, \ldots, x_n\}$$
 Each instance x in X is described by a vector of attribute values.
 For example, x = (Sunny, Warm, Normal, Strong, Warm, Same).
2. Hypothesis Space (H):
 The hypothesis space represents the set of possible hypotheses or concept descriptions that
can be considered during the concept learning process.
$$h : X \rightarrow \{0, 1\}$$
 Each hypothesis is a potential concept description that can classify instances into positive
or negative examples of the target concept.
 For example, h(x)=1 if the hypothesis predicts that Aldo enjoys the sport on day x, and
h(x)=0 otherwise.
 For example, some hypotheses in the hypothesis space could be:
If Sky = Sunny and AirTemp = Warm, then EnjoySport = Yes
If Humidity = High and Water = Warm, then EnjoySport = No
3. Training Examples (D):
 The training examples are the provided instances along with their corresponding class
labels (EnjoySport).
$$D = \{(x_1, c(x_1)), (x_2, c(x_2)), \ldots, (x_n, c(x_n))\}$$

 Each training example consists of attribute values and the target concept's class label (Yes
or No).
4. Target Concept (c):
 The target concept is the concept or category (whether Aldo enjoys the sport: Yes or No)
that we want to learn from the training examples.
$$c : X \rightarrow \{0, 1\}$$
 For example, c(x) = 1 (positive example) if Aldo enjoys the sport on day x, and c(x) = 0
(negative example) otherwise.
• Each hypothesis h in H represents a Boolean-valued function defined over X:
$$h : X \rightarrow \{0, 1\}$$
 The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X:
$$\forall x \in X,\; h(x) = c(x)$$
• The aim of concept learning is to infer a concept description or hypothesis that accurately
predicts the EnjoySport label for new, unseen instances based on the provided training
examples.
Example:
 Let's consider learning the target concept "Days on which Aldo enjoys his favorite water
sport."
 We have a table of data with various attributes (like Sky, AirTemp, Humidity, Wind,
Water, Forecast) and whether Aldo enjoyed the sport (EnjoySport) on those days.

Training Data: (EnjoySport Dataset)
Task: The goal is to learn a rule (hypothesis) that can predict the value of EnjoySport for any new
day based on the values of its other attributes (Sky, AirTemp, Humidity, Wind, Water, and
Forecast).
Hypothesis Representation:
 General Hypothesis: A rule that applies to many instances (e.g., Aldo enjoys the sport on
any day).
 Specific Hypothesis: A rule that applies to very specific instances (e.g., Aldo enjoys the
sport only on Sunny and Warm days).
Each hypothesis can be represented as a vector of constraints on the attributes:
 "?" means any value is acceptable.
 A specific value (e.g., "Warm") means only that value is acceptable.
 "Φ" means no value is acceptable.
Examples:
 Hypothesis for enjoying the sport on cold days with high humidity: h=(?,Cold,High,?,?,?)
 The most general hypothesis (every day is positive): h=(?,?,?,?,?,?)
 The most specific hypothesis (no day is positive): h=(Φ,Φ,Φ,Φ,Φ,Φ)
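The constraint-vector representation can be made concrete with a small Python sketch. The function name `matches` and the use of plain tuples are illustrative assumptions; the symbols "?" and "Φ" follow the convention described above.

```python
ANY = "?"    # any value is acceptable
NONE = "Φ"   # no value is acceptable


def matches(h, x):
    """Return True if instance x satisfies every constraint of hypothesis h."""
    return all(c == ANY or (c != NONE and c == v) for c, v in zip(h, x))


x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")

cold_and_humid = ("?", "Cold", "High", "?", "?", "?")
most_general = ("?",) * 6
most_specific = ("Φ",) * 6

print(matches(cold_and_humid, x))  # False: x is Warm and Normal
print(matches(most_general, x))    # True: every day is positive
print(matches(most_specific, x))   # False: no day is positive
```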

The Inductive Learning Hypothesis
 Inductive learning, also known as inductive reasoning or inductive inference, is a type of
learning that involves generalizing from specific instances to form general rules or
concepts.
 It is a fundamental process used by humans and machines to acquire knowledge and make
predictions based on observed examples.
 Any hypothesis found to approximate the target function well over a sufficiently large set
of training examples will also approximate the target function well over other unobserved
examples.
If h(x) ≈ c(x) for all x in the training set D, we say that h approximates c well.
In simpler terms, if a rule (hypothesis) works well for the examples we've seen, it should also work
well for new examples we haven't seen yet, provided we've seen enough examples to make this
judgment.

Example:
Suppose we have the following training data for the target concept "Days on which Aldo enjoys
his favorite water sport":
Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes
Hypothesis h: Aldo enjoys the sport on sunny and warm days, represented as
h=(Sunny,Warm,?,?,?,?)
 If our hypothesis h correctly predicts the EnjoySport value for all days in the training set
D, and D is large and diverse enough, the inductive learning hypothesis suggests that h will
likely also predict well for new, unseen days.
 Hence, the inductive learning hypothesis justifies expecting a hypothesis that performs
well on a large training set to generalize to other examples. It is an assumption underlying
inductive learning rather than a guarantee.
ii) Concept Learning as Search
 Concept learning can be viewed as a search through a space of possible hypotheses to find
the one that best matches the training examples.
 The search process involves exploring the hypothesis space to find a hypothesis that
minimizes the errors or inconsistencies between the predicted labels and the true labels.
 Find a hypothesis that best fits training examples
 Efficient search in hypothesis space (finite/infinite)

Example:
Consider the instances X and hypotheses H in the EnjoySport learning task
Sunny Warm Normal Weak Warm Same Yes
Cloudy Warm High Strong Warm Same Yes
Search Space for Hypotheses
 Instance Space: The total number of possible combinations of attribute values
Each attribute can take on a fixed number of values: Sky has three possible values (Sunny,
Cloudy, Rainy), and each of the other five attributes has two.
Total instances = 3×2×2×2×2×2 = 96 distinct instances
 Syntactically Distinct Hypotheses (including ? and Φ): Counts all possible combinations
of attribute constraints, including the "don't care" (?) and "always false" (Φ) symbols.
Number of choices = Number of attribute values + 2 (for ? and Φ)
For each attribute with n possible values, there are n+2 choices (including "?" and "Φ").
Total syntactically distinct hypotheses = 5×4×4×4×4×4 = 5120
 Semantically Distinct Hypotheses (excluding redundant ones with Φ): Counts
meaningful hypotheses, treating all hypotheses that classify every instance as negative as one.
Hypotheses with one or more "Φ" symbols are considered empty because they classify
every instance as negative, so they are all equivalent and are counted only once. Every other
hypothesis has n+1 choices per attribute (a specific value or "?"):
Semantically distinct hypotheses = 1 + (4×3×3×3×3×3) = 1 + 972 = 973
In the context of Concept Learning as Search, the task is to navigate through this large hypothesis
space to find a hypothesis that best matches the training examples and generalizes well to new
instances.
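The three counts can be reproduced with a short calculation. The list of attribute sizes is read off the products above (one attribute with three values, five with two); the variable names are illustrative.

```python
from math import prod

# Number of possible values per attribute: Sky has 3, the other five have 2.
values_per_attribute = [3, 2, 2, 2, 2, 2]

# Every combination of attribute values is a distinct instance.
instances = prod(values_per_attribute)

# Each constraint is a value, "?" or "Φ": n + 2 choices per attribute.
syntactic = prod(n + 2 for n in values_per_attribute)

# All hypotheses containing "Φ" are equivalent (counted once);
# the rest have n + 1 choices per attribute (a value or "?").
semantic = 1 + prod(n + 1 for n in values_per_attribute)

print(instances, syntactic, semantic)  # 96 5120 973
```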
General-to-Specific Ordering of Hypotheses
 General-to-specific ordering of hypotheses is a method of organizing hypotheses in a way that
progresses from broader, more general statements to narrower, more specific ones.
 This ordering helps in systematically exploring and narrowing down potential explanations or
predictions.
 A more general hypothesis covers a broader range of instances (e.g., "Aldo enjoys the
sport on any day").
 A more specific hypothesis covers a narrower range of instances (e.g., "Aldo enjoys
the sport only on Sunny, Warm, and Windy days").
 We can order hypotheses from general to specific based on their constraints.

Example:
Consider two hypotheses:
 h1=(Sunny,?,?,Strong,?,?) (Sunny days with Strong wind) [More Specific]
 h2=(Sunny,?,?,?,?,?) (Any Sunny day) [More General]
Since h2 imposes fewer constraints, it classifies more instances as positive and is more general
than h1.
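For conjunctive hypotheses of this form, the comparison can be checked attribute by attribute. The sketch below is one possible implementation; the function name is an assumption, and it relies on every attribute having at least one possible value.

```python
def more_general_or_equal(hj, hk):
    """True if every instance that satisfies hk also satisfies hj."""
    if "Φ" in hk:
        # hk covers no instance at all, so any hypothesis is at least as general.
        return True
    # Otherwise each constraint of hj must be "?" or equal to hk's constraint.
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))


h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")

print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False: h1 adds the Wind constraint
```

The relation is only a partial order: two hypotheses such as (Sunny,?,?,?,?,?) and (?,Warm,?,?,?,?) are not comparable in either direction.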
General-to-Specific Ordering:
 General Hypothesis: h2 (more general, covers more instances).
 Specific Hypothesis: h1 (more specific, covers fewer instances).
Definition: Hypothesis hj is more-general-than-or-equal-to hypothesis hk if every instance
satisfying hk also satisfies hj:
$$(\forall x \in X)\,[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]$$

Find-S (Find-Specific) Algorithm: Finding a Maximally Specific Hypothesis
 The Find-S (Find-Specific) algorithm is a simple supervised machine learning algorithm
used for finding the most specific hypothesis that fits all the positive examples in a given
dataset.
 It starts with the most specific hypothesis and generalizes it by incorporating positive
examples, while ignoring negative examples during the learning process.
 The algorithm represents the hypothesis using a vector of attribute constraints. The most
specific hypothesis is represented as {φ, φ, φ, ..., φ}, where φ means no value is acceptable
for that attribute.
 The most general hypothesis is represented as {?, ?, ?, ..., ?}, where ? means any value is
acceptable for that attribute
FIND-S Algorithm
1. Initialize the hypothesis h to the most specific hypothesis possible.
2. For each positive training example x:
 For each attribute constraint ai in h:
 If ai is satisfied by x, do nothing
 Else, replace ai in h with the next more general constraint that is satisfied
by x
3. Output the final hypothesis h
Illustrative Example 1:
To illustrate this algorithm, assume the learner is given the sequence of training examples
from the EnjoySport task

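Step-by-Step Execution of Find-S Algorithm:
Using the four EnjoySport training examples given earlier (three positive, one negative):
h0 = <Φ,Φ,Φ,Φ,Φ,Φ>
After example 1 (positive): h1 = <Sunny,Warm,Normal,Strong,Warm,Same>
After example 2 (positive): h2 = <Sunny,Warm,?,Strong,Warm,Same>
After example 3 (negative): ignored, so h3 = h2
After example 4 (positive): h4 = <Sunny,Warm,?,Strong,?,?>
The final hypothesis h after processing all instances is <Sunny,Warm,?,Strong,?,?>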
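A compact Python sketch of Find-S on the EnjoySport examples follows. The data layout (tuples paired with a Boolean label) and the function name are illustrative choices, not part of the original algorithm statement.

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses.

    examples: list of (instance, is_positive) pairs, each instance a tuple.
    Returns the maximally specific hypothesis covering all positive examples.
    """
    h = ["Φ"] * len(examples[0][0])      # start with the most specific hypothesis
    for x, is_positive in examples:
        if not is_positive:
            continue                     # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == "Φ":
                h[i] = value             # first positive example: copy its value
            elif h[i] != value:
                h[i] = "?"               # mismatch: generalize to "any value"
    return tuple(h)


enjoy_sport = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

result = find_s(enjoy_sport)
print(result)  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```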

Illustrative Example 2:
Example  Color   Hardness  Smell  Surface
1        GREEN   HARD      NO     WRINKLED
2        GREEN   HARD      NO     SMOOTH
3        GREEN   SOFT      YES    WRINKLED
4        ORANGE  HARD      NO     WRINKLED
5        GREEN   SOFT      YES    SMOOTH
The class labels used in the steps below are: examples 1, 4, and 5 are positive; examples 2 and 3
are negative.
1. Initialize the hypothesis h to the most specific {φ, φ, φ, φ}.
2. Consider example 1: {GREEN, HARD, NO, WRINKLED}
 Since this is a positive example, we generalize the hypothesis to match it: h =
{GREEN, HARD, NO, WRINKLED}
3. Example 2 is negative, so we ignore it and h remains the same.
4. Example 3 is negative, so we ignore it and h remains the same.
5. Example 4 (positive): {ORANGE, HARD, NO, WRINKLED}
 We compare each attribute to h and replace any mismatches with ? to generalize:
h = {?, HARD, NO, WRINKLED}
6. Example 5 (positive): {GREEN, SOFT, YES, SMOOTH}
 Comparing to h, we replace mismatches with ?:
h = {?, ?, ?, ?}
The final hypothesis h after processing all instances is h = {?, ?, ?, ?}
Note that this hypothesis also covers the negative examples 2 and 3: because Find-S never
checks negative examples, it can return a hypothesis that is inconsistent with them.
Advantages of the FIND-S algorithm
 Simplicity: Easy to understand and implement, making it ideal for introducing machine
learning concepts.
 Efficiency: Computationally efficient for small to moderate-sized datasets, updating the
hypothesis with individual examples.
 Maximally Specific Hypothesis: Outputs the most specific hypothesis in H that covers all
positive examples. This hypothesis is also consistent with the negative examples, provided the
target concept is in H and the training data is error-free.
Limitations of the FIND-S algorithm
 Assumes noiseless data: Find-S assumes that all positive instances are correctly labeled
and there are no errors.
 Ignores negative instances: It only considers positive examples for generalization.
 Cannot handle inconsistent data: If there is noise or inconsistency in the data, Find-S
might not perform well.
Unanswered by FIND-S
 Has the learner converged to the correct target concept?
 Why prefer the most specific hypothesis?
 Are the training examples consistent?
 What if there are several maximally specific consistent hypotheses?

Exercise
Apply the Find-S algorithm to each of the following datasets to determine the most specific
hypothesis that fits all positive instances.

Instance  Weather   Temperature  Humidity  Wind    PlayTennis
1         Sunny     Hot          High      Weak    Yes
2         Sunny     Hot          High      Strong  No
3         Overcast  Hot          High      Weak    Yes
4         Rainy     Mild         High      Weak    Yes
5         Rainy     Cool         Normal    Weak    Yes

Instance  Outlook   Temperature  Humidity  Wind    PlayGolf
1         Sunny     Hot          High      Weak    Yes
2         Sunny     Hot          High      Strong  No
3         Overcast  Hot          High      Weak    Yes
4         Rainy     Mild         High      Weak    Yes
5         Rainy     Cool         Normal    Weak    Yes
6         Rainy     Cool         Normal    Strong  No

Instance  Day      Weather  Temperature  Humidity  Wind    Surfing
1         Weekday  Sunny    Warm         High      Strong  Yes
2         Weekend  Rainy    Cold         High      Weak    No
3         Weekend  Sunny    Warm         Normal    Strong  Yes
4         Weekday  Sunny    Warm         High      Weak    No
5         Weekday  Rainy    Warm         Normal    Strong  No
6         Weekend  Sunny    Hot          Normal    Strong  Yes

Instance  Color  Size   Shape   Texture  PlayGame
1         Red    Small  Round   Smooth   Yes
2         Red    Small  Square  Rough    No
3         Blue   Large  Round   Smooth   Yes
4         Red    Small  Round   Rough    Yes
5         Blue   Small  Round   Smooth   Yes

Instance  Sky       AirTemp  Humidity  Wind    Water  Forecast  GoHiking
1         Sunny     Hot      High      Weak    Warm   Same      Yes
2         Sunny     Hot      High      Strong  Cool   Change    No
3         Overcast  Cool     Normal    Weak    Warm   Same      Yes
4         Rainy     Mild     High      Weak    Warm   Same      Yes
5         Sunny     Cool     Normal    Weak    Warm   Same      Yes
6         Overcast  Hot      Normal    Strong  Cool   Same      Yes

Instance  Color   Size    Shape  Texture  Fruit
1         Red     Large   Round  Smooth   Yes
2         Yellow  Medium  Oval   Rough    No
3         Red     Small   Round  Smooth   Yes
4         Green   Large   Oval   Rough    No
5         Red     Large   Round  Rough    Yes

Consistent Hypothesis and Version Space
a) Consistent Hypothesis
Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x)=c(x)
for each example (x, c(x)) in D.
$$\text{Consistent}(h, D) \equiv (\forall \langle x, c(x) \rangle \in D)\; h(x) = c(x)$$

Satisfies Hypothesis:
 An example x is said to satisfy hypothesis h when h(x)=1, regardless of whether x is a
positive or negative example of the target concept.
 An example x is consistent with hypothesis h iff h(x)=c(x).
Example  Citations  Size   In Library  Price       Editions  Buy
1        Some       Small  No          Affordable  One       No
2        Many       Big    No          Expensive   Many      Yes
h1=(?, ?, No, ?, Many)
h2=(?, ?, No, ?, ?)
Hypothesis  Example 1     Example 2   Consistent with D?
h1          Consistent    Consistent  Yes (all examples match)
h2          Inconsistent  Consistent  No (mismatch in Example 1)
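The consistency check can be sketched in Python for the two hypotheses in this example. The attribute values of the second example are assumed to be Many (citations and editions) and Expensive (price); function names are illustrative.

```python
def classify(h, x):
    """Return 1 if instance x satisfies hypothesis h, otherwise 0."""
    return int(all(c == "?" or c == v for c, v in zip(h, x)))


def consistent(h, D):
    """True if h(x) == c(x) for every training example (x, c(x)) in D."""
    return all(classify(h, x) == label for x, label in D)


# (Citations, Size, In Library, Price, Editions) -> Buy (1 = Yes, 0 = No)
D = [
    (("Some", "Small", "No", "Affordable", "One"), 0),
    (("Many", "Big", "No", "Expensive", "Many"), 1),
]

h1 = ("?", "?", "No", "?", "Many")
h2 = ("?", "?", "No", "?", "?")

print(consistent(h1, D))  # True
print(consistent(h2, D))  # False: h2 classifies example 1 as positive
```

Example 1 satisfies h2 (h2(x) = 1) even though its label is No, which is exactly the difference between satisfying a hypothesis and being consistent with it.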

b) Version Space
The version space VS(H,D) with respect to a hypothesis space H and a set of training examples D
is the subset of hypotheses from H that are consistent with the training examples in D.
$$VS_{H,D} \equiv \{h \in H \mid \text{Consistent}(h, D)\}$$
Here:
 H is the hypothesis space, which is the set of all possible hypotheses that can be
formulated based on the given problem.
 D is the set of training examples, which consist of input-output pairs used to train the
model.
 A hypothesis h is said to be consistent with the training examples D if it correctly
classifies all the examples in D.

LIST-THEN-ELIMINATE (LTE) Algorithm
 The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all
hypotheses in H and then eliminates any hypothesis found inconsistent with any training
example.
 List-Then-Eliminate works in principle, so long as the hypothesis space H is finite.
 However, since it requires exhaustive enumeration of all hypotheses in H, it is not feasible
in practice for realistic hypothesis spaces.
Illustrative Example:
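The following sketch applies List-Then-Eliminate to the four EnjoySport training examples. It enumerates the 973 semantically distinct conjunctive hypotheses explicitly; the attribute domains are assumed to be those used in the EnjoySport task (three values for Sky, two for each other attribute), and the function names are illustrative.

```python
from itertools import product

# Assumed attribute domains for EnjoySport.
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),   # Sky
    ("Warm", "Cold"),               # AirTemp
    ("Normal", "High"),             # Humidity
    ("Strong", "Weak"),             # Wind
    ("Warm", "Cool"),               # Water
    ("Same", "Change"),             # Forecast
]


def classify(h, x):
    """True if x satisfies h; the all-Φ hypothesis matches nothing."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def list_then_eliminate(domains, examples):
    # List: every hypothesis built from values and "?", plus one all-Φ hypothesis.
    version_space = list(product(*[d + ("?",) for d in domains]))
    version_space.append(("Φ",) * len(domains))
    # Eliminate: drop every hypothesis that misclassifies some example.
    for x, label in examples:
        version_space = [h for h in version_space if classify(h, x) == label]
    return version_space


examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

vs = list_then_eliminate(DOMAINS, examples)
for h in vs:
    print(h)
```

Six hypotheses survive, ranging from the most specific (Sunny, Warm, ?, Strong, ?, ?) to the most general (Sunny, ?, ?, ?, ?, ?) and (?, Warm, ?, ?, ?, ?).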

Advantages of LTE
 Simplicity: The algorithm is conceptually simple and easy to understand. It directly
maintains and updates the version space by removing inconsistent hypotheses.
 Correctness: If the hypothesis space is finite and contains a hypothesis consistent with the
training data, the algorithm is guaranteed to find it.
Disadvantages of LTE
 Infeasibility for Large Hypothesis Spaces: The primary drawback is that the hypothesis
space H can be extremely large, making it impractical to enumerate and store all hypotheses
explicitly.
o For example, if the hypothesis space contains 2^n hypotheses, where n is the
number of possible binary features, the algorithm becomes computationally
infeasible.
 Exhaustive Enumeration: The requirement to exhaustively enumerate all hypotheses
makes the algorithm inefficient for large or infinite hypothesis spaces.

A More Compact Representation for Version Spaces
The concept of representing version spaces through their most general and least general members
(the general boundary G and the specific boundary S) is an elegant and efficient way to manage
hypotheses in machine learning.
i)General Boundary (G):
 The set of maximally general hypotheses in the hypothesis space H that are consistent with
the training data D.
$$G \equiv \{g \in H \mid \text{Consistent}(g, D) \land (\neg \exists g' \in H)[(g' >_g g) \land \text{Consistent}(g', D)]\}$$
where g' >_g g means that g' is strictly more general than g.
 These hypotheses are as general as possible while still being consistent with the data. No
more general hypothesis exists in H that is also consistent with D.
ii) Specific Boundary (S):
 The set of minimally general (i.e., maximally specific) hypotheses in the hypothesis space
H that are consistent with the training data D.
$$S \equiv \{s \in H \mid \text{Consistent}(s, D) \land (\neg \exists s' \in H)[(s >_g s') \land \text{Consistent}(s', D)]\}$$
where s >_g s' means that s is strictly more general than s'.
 These hypotheses are as specific as possible while still being consistent with the data. No
more specific hypothesis exists in H that is also consistent with D.
Version Space Representation Theorem
The Version Space representation theorem states that the version space can be compactly
represented using the general boundary G and the specific boundary S:
$$VS_{H,D} = \{h \in H \mid (\exists s \in S)(\exists g \in G)(g \geq_g h \geq_g s)\}$$
where ≥_g denotes the more-general-than-or-equal-to relation.

Implications and Advantages
 Compact Representation: Instead of storing all hypotheses in the version space, we only
need to store the boundaries G and S. This significantly reduces memory
requirements.
 Efficient Updates: Updating the version space with new training examples involves
adjusting G and S, which is typically more efficient than handling the entire set of
consistent hypotheses.
 Boundary Manipulation:
o Adding a Positive Example: When a new positive example is encountered, we
need to generalize the specific boundary S (make it less specific) and ensure the
general boundary G remains consistent.
o Adding a Negative Example: When a new negative example is encountered, we
need to specialize the general boundary G (make it less general) and ensure the
specific boundary S remains consistent.
CANDIDATE-ELIMINATION(CE) Algorithm
 The CANDIDATE-ELIMINATION algorithm operates similarly to the LIST-THEN-
ELIMINATE algorithm but uses a more compact representation of the version space.
 It represents the version space by its most General (G) and Specific (S) boundaries.
 These boundaries form general and specific boundary sets, which delimit the version space
within the partially ordered hypothesis space.
 The key idea is to output a description of all hypotheses consistent with the training
examples.
 The algorithm incrementally builds the version space given a hypothesis space H and a set
of examples.
 Examples are added one by one, each potentially shrinking the version space by removing
inconsistent hypotheses.
 The algorithm updates the general and specific boundaries with each new example.
 It is an extended form of the Find-S algorithm and the LIST-THEN-ELIMINATE
algorithm.

 The algorithm considers both positive and negative examples:
 Positive examples generalize the specific hypothesis.
 Negative examples make the general hypothesis more specific.
Candidate-Elimination Algorithm
1. Initialization: Initialize both the specific and the general boundary.
The general boundary G is set to the most general hypothesis (?).
G = {<?, ?, ?, ..., ?>}
The specific boundary S is set to the most specific hypothesis (Φ).
S = {<Φ, Φ, Φ, ..., Φ>}
2. Processing Training Examples: For each training example, the algorithm checks if it is
positive or negative.
2.1 If the example is positive:
Remove from G any hypothesis that does not cover the example.
For each attribute constraint in S:
if attribute_value == hypothesis_value:
do nothing
else:
replace the constraint with '?' (minimally generalizing S); a Φ constraint is
replaced by the attribute value itself
2.2 If the example is negative:
Remove from S any hypothesis that covers the example.
Replace each hypothesis in G that covers the example with its minimal specializations
that exclude the example and are still more general than S.
3. Updating Version Space (G and S): After each training example, the version space
delimited by G and S contains exactly the hypotheses consistent with all examples seen so far.

An Illustrative Example
Initialization
 Specific Hypothesis (S): S = [Φ, Φ, Φ, Φ, Φ, Φ]
 General Hypothesis (G): G = [[?, ?, ?, ?, ?, ?]]
Iteration through Examples
Example 1: [Sunny, Warm, Normal, Strong, Warm, Same, Yes] (+)
 Update S: S = [Sunny, Warm, Normal, Strong, Warm, Same]
 G remains unchanged: G = [[?, ?, ?, ?, ?, ?]]
Example 2: [Sunny, Warm, High, Strong, Warm, Same, Yes] (+)
 Update S:
 Compare each attribute: S[2] ≠ High ⇒ S[2] = ?
 S = [Sunny, Warm, ?, Strong, Warm, Same]
 G remains unchanged: G = [[?, ?, ?, ?, ?, ?]]
Example 3: [Rainy, Cold, High, Strong, Warm, Change, No] (−)
 Update G:
Current S = [Sunny, Warm, ?, Strong, Warm, Same]
 For each hypothesis in G: G = [[?, ?, ?, ?, ?, ?]]
Create new hypotheses:
For attribute 0 (Sky): S[0] = Sunny, attributes[0] = Rainy
New hypothesis: [Sunny, ?, ?, ?, ?, ?]
For attribute 1 (AirTemp): S[1] = Warm, attributes[1] = Cold
New hypothesis: [?, Warm, ?, ?, ?, ?]
For attribute 2 (Humidity), do not specialize since S[2] = ?
For attributes 3 (Wind) and 4 (Water), do not specialize since S has the same value as the
negative example (Strong and Warm)
For attribute 5 (Forecast): S[5] = Same, attributes[5] = Change
New hypothesis: [?, ?, ?, ?, ?, Same]
G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?], [?, ?, ?, ?, ?, Same]]
Example 4: [Sunny, Warm, High, Strong, Cool, Change, Yes] (+)
 Update S:
Compare each attribute:
S[4] ≠ Cool ⇒ S[4] = ?
S[5] ≠ Change ⇒ S[5] = ?
S = [Sunny, Warm, ?, Strong, ?, ?]
 Update G:
Filter out hypotheses that are inconsistent with the positive example:
[Sunny, ?, ?, ?, ?, ?] is consistent
[?, Warm, ?, ?, ?, ?] is consistent
[?, ?, ?, ?, ?, Same] is inconsistent (the example has Forecast = Change) and is removed
G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]]
Final Hypotheses
 Specific Hypothesis (S): S = [Sunny, Warm, ?, Strong, ?, ?]
 General Hypotheses (G): G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]]
Specific Hypothesis (S) represents the most specific generalization that covers all positive
examples.
General Hypotheses (G) represent the broadest set of hypotheses that exclude the negative
examples while being consistent with the positive examples.

Comparison of Find-S, List-Then-Eliminate, Candidate-Elimination
Algorithm | Find-S | List-Then-Eliminate | Candidate-Elimination
Hypothesis Space | Specific Hypotheses | Version Spaces | Version Spaces
Search Strategy | Specific-to-general | Exhaustive enumeration of H | Specific-to-general (S) and general-to-specific (G)
Handling Overfitting | Prone to overfitting | Prone to overfitting | Less prone (uses negative examples), but sensitive to noisy data
Iterative Process | Yes | No | Yes
Negative Examples Handling | Ignore | Eliminate Hypotheses | Refine Boundaries
Completeness | Not guaranteed | Guaranteed | Guaranteed
Complexity | O(n·d) for n examples with d attributes (constant work per attribute of each example) | O(|H|·n), since every hypothesis is tested against every example | Depends on the sizes of S and G; G can grow exponentially in the worst case
Advantages | Efficient for small hypothesis spaces. Produces a single, consistent hypothesis. | Handles both positive and negative instances. Allows complex hypothesis spaces. | Handles both positive and negative instances. Represents the version space compactly by its boundaries.
Disadvantages | Prone to overgeneralization if negative instances are not considered. Limited to simple hypothesis spaces. | Can be computationally expensive for large hypothesis spaces. May generate redundant hypotheses. | Requires storing and manipulating large sets of hypotheses. Can be computationally expensive.

Inductive Bias
 Bias: In the context of machine learning, bias refers to the error introduced by
approximating a real-world problem with a simplified model.
 Inductive bias refers to the assumptions a learning algorithm makes to generalize from
observed data to unseen instances.
Fundamental Questions for Inductive Inference
1. What if the target concept is not in the hypothesis space?
The algorithm cannot learn the target concept accurately.
2. Can we avoid this difficulty by using a hypothesis space that includes every possible
hypothesis?
In theory, yes, but it has practical limitations such as computational complexity and
overfitting.
3. How does the size of this hypothesis space influence the ability to generalize to
unobserved instances?
A larger hypothesis space can lead to overfitting, reducing the ability to generalize well to
new data.
4. How does the size of the hypothesis space influence the number of training examples
required?
A larger hypothesis space generally requires more training examples to accurately learn
the target concept without overfitting.
Biased Hypothesis Space: An Example
 Consider the "EnjoySport" example, where we want to predict whether a sport is enjoyable
based on certain weather conditions.
 If the hypothesis space is restricted to only conjunctions (e.g., "Sky = Sunny AND
Temperature = Warm"), it cannot represent disjunctions (e.g., "Sky = Sunny OR Sky =
Cloudy").

Example Training Data
 ⟨Sunny,Warm,Normal,Strong,Cool,Change⟩ Y
 ⟨Cloudy,Warm,Normal,Strong,Cool,Change⟩ Y
 ⟨Rainy,Warm,Normal,Strong,Cool,Change⟩ N
Using the Candidate Elimination algorithm with this restricted hypothesis space:
1. After the first two positive examples, the specific hypothesis (S) becomes overly general:
⟨?,Warm,Normal,Strong,Cool,Change⟩
2. This overly general hypothesis incorrectly covers the third negative example.
Thus, the hypothesis space needs to be more expressive to include disjunctions.
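The failure of the conjunctive hypothesis space on this data can be reproduced with a minimal sketch. The helper names `generalize` and `covers` are hypothetical, and only the specific boundary S is tracked here (not the full Candidate Elimination algorithm).

```python
def generalize(h, x):
    """Minimally generalize conjunctive hypothesis h so that it covers instance x."""
    if h is None:                      # no positive example seen yet
        return tuple(x)
    return tuple(a if a == b else "?" for a, b in zip(h, x))

def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(a == "?" or a == b for a, b in zip(h, x))

positives = [
    ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"),
    ("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change"),
]
negative = ("Rainy", "Warm", "Normal", "Strong", "Cool", "Change")

s = None
for x in positives:
    s = generalize(s, x)

print(s)                    # ('?', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
print(covers(s, negative))  # True: the negative example is wrongly covered
```

Because the only conjunctive way to cover both Sunny and Cloudy is to drop the Sky constraint entirely, the resulting hypothesis necessarily covers Rainy as well.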
An Unbiased Learner
To avoid bias, we can define a hypothesis space (H') that includes every possible subset of
instances, known as the power set. For the "EnjoySport" example with six attributes, the instance
space contains 96 distinct instances, so there are 2^96 (about 10^28) possible target concepts.
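The size of the unbiased hypothesis space can be checked with a few lines of arithmetic. The sketch below assumes the attribute value counts of the standard EnjoySport task (three values for Sky and two for each of the other five attributes), which this section does not list explicitly.

```python
import math

# Assumed number of possible values per attribute:
# Sky, AirTemp, Humidity, Wind, Water, Forecast
value_counts = [3, 2, 2, 2, 2, 2]

n_instances = math.prod(value_counts)      # size of the instance space X
n_concepts = 2 ** n_instances              # size of the power set of X

# Conjunctive hypotheses: each attribute is a specific value or "?",
# plus one hypothesis that rejects every instance.
n_conjunctive = 1 + math.prod(v + 1 for v in value_counts)

print(n_instances)     # 96
print(n_concepts)      # 79228162514264337593543950336
print(n_conjunctive)   # 973
```

The gap between 973 conjunctive hypotheses and 2^96 possible concepts shows how strongly the conjunctive representation restricts what can be learned.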
Example Unbiased Hypothesis
The target concept "Sky = Sunny OR Sky = Cloudy" can be represented as:
⟨Sunny,?,?,?,?,?⟩ OR ⟨Cloudy,?,?,?,?,?⟩
Definition of Inductive Bias
Inductive bias refers to the assumptions an algorithm makes to generalize from the training data.
It can be defined as the minimal set of assertions (B) that guide the algorithm in making
predictions.

Example:
For the Candidate Elimination algorithm, the inductive bias is the assertion that "H contains the
target concept." This bias helps the algorithm generalize beyond the observed data by modeling
inductive systems as equivalent deductive systems.
By characterizing inductive systems through their biases, we can compare different algorithms
based on their generalization policies.

DECISION TREE LEARNING
 Decision tree learning is a popular machine learning method used for both classification
and regression tasks.
 Decision tree learning is a method for approximating discrete-valued target functions,
in which the learned function is represented by a decision tree.
 It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
 A decision tree simply asks a question and, based on the answer (Yes/No), further splits
into subtrees.

Decision Tree Representation:
 It builds a model in the form of a tree structure where each internal node represents a
decision based on a feature or attribute, each branch represents the outcome of the decision,
and each leaf node represents the prediction label or value.
 Below diagram explains the general structure of a decision tree:
1. Nodes:
 Root Node: The topmost node in the tree, representing the initial decision point. It contains
the entire dataset.
 Internal Nodes (Decision Nodes): Nodes that split the dataset based on a feature or
attribute value. They lead to child nodes based on the outcome of the split.
 Leaf Nodes (Terminal Nodes): Nodes that predict the outcome. They do not split
further and represent the final prediction label or value.
2. Edges:
 Edges/branches: Connect nodes and represent the outcome of a decision or a set of
decisions based on a feature's value.

3. Splitting Criteria:
 Decision trees use various criteria to split nodes, such as information gain or Gini impurity
(for classification) and variance reduction (for regression). The goal is to maximize the
reduction in impurity at each split.
4. Tree Depth:
 The depth of a tree is the number of edges from the root node to the farthest leaf node. It
determines the complexity of the model and influences its ability to generalize.
Example:
• Suppose a candidate has a job offer and wants to decide whether to accept it or not. To
solve this problem, the decision tree starts with the root node (the Salary attribute, chosen
by an Attribute Selection Measure, ASM).
• The root node splits further into the next decision node (distance from the office) and one
leaf node based on the corresponding labels.
• The next decision node further gets split into one decision node (Cab facility) and one leaf
node.
• Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the diagram below:

Advantages of Decision Trees:
 Interpretability: Easy to interpret and visualize, making them useful for explaining
decisions.
 Non-linear Relationships: Can capture non-linear relationships between features and
target variables.
 Handles Missing Values: Can handle missing values in the dataset.
 No Need for Feature Scaling: Not sensitive to feature scaling unlike some other models
like SVMs or neural networks.
Disadvantages of Decision Trees:
 Overfitting: Prone to overfitting, especially with complex trees that capture noise in the
training data.
 Instability: Small variations in the data can result in a completely different tree structure.
 Bias towards Dominant Classes: In classification tasks, can create biased trees if one
class dominates the dataset.
 Greedy Nature: The greedy approach to find the best split at each node may not result in
the globally optimal tree.
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
A classic example of decision tree learning is the Play Tennis problem.

Play-Tennis decision tree example
Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs
 Each instance or example in the dataset is described by a fixed set of attributes and their
corresponding values.
 This structured format allows decision trees to efficiently partition the data based on
attribute conditions.
 Example:
Attributes: {Outlook, Temperature, Humidity, Wind}
Values: Sunny, Overcast, Rainy; Hot, Mild, Cool; High, Normal; Weak, Strong
2. The target function has discrete output values
 Decision trees naturally handle classification tasks where the target function outputs
discrete values (e.g., categories or classes).
 This includes binary classifications (yes/no) as well as multi-class classifications.
 Example:
Output: Whether to play tennis or not (Yes or No)

3. Disjunctive descriptions may be required
 Decision trees can easily handle disjunctive (or) conditions in the data, where multiple
attributes or their combinations may lead to the same outcome.
 For example, a rule could be: "Play tennis if the outlook is Sunny OR Overcast."
4. The training data may contain errors
 Decision tree algorithms such as ID3, C4.5, and CART are fairly robust to errors in the
training data, including misclassified examples and incorrect attribute values,
particularly when pruning is used to keep the tree from fitting the noise.
5. The training data may contain missing attribute values
 Decision tree methods can handle missing values by skipping over them during the
decision-making process.
 For instance, if the "Humidity" value is missing for a particular instance, the decision tree
can still classify based on the available attributes.
BASIC DECISION TREE LEARNING ALGORITHM
 The ID3 (Iterative Dichotomiser 3) algorithm is a fundamental decision tree learning
algorithm that constructs decision trees from a dataset.
 ID3 was one of the earliest algorithms developed for constructing decision trees.
 It was developed by Ross Quinlan and is based on the concept of information gain.
 C4.5 is an extension of ID3 developed by Ross Quinlan. It addresses some limitations of
ID3, such as handling both categorical and numerical attributes, handling missing values,
and pruning trees to avoid overfitting.
 CART is a versatile decision tree algorithm developed by Breiman et al. It can be used for
both classification and regression tasks.
 Random Forest is an ensemble learning method based on decision trees. It builds multiple
decision trees and combines their predictions to improve accuracy and reduce overfitting.

ID3 (Iterative Dichotomiser 3) algorithm
It is a top-down, greedy search algorithm that generates a decision tree from a given dataset by
iteratively selecting the best attribute to split the data based on information gain.
Step 1: Calculate the entropy of the entire dataset.
Step 2: For each feature, calculate the information gain.
Step 3: Select the feature with the highest information gain as the best feature to split the data.
Step 4: Create a branch node in the decision tree using the selected feature.
Step 5: For each unique value of the selected feature, repeat steps 1 to 4 (recursion).
Step 6: Continue building the tree until all data is classified correctly or there are no features left
to split on.
Step 7: The decision tree is complete.
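The steps above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a reference implementation: it assumes categorical attributes only, has no pruning, and uses the standard 14-example PlayTennis table from the machine learning literature as assumed data (the table itself is not listed in these notes). The names `entropy`, `information_gain`, and `id3` are chosen for illustration.

```python
import math
from collections import Counter

# Assumed data: the standard 14-example PlayTennis table.
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rainy Mild High Weak Yes
Rainy Cool Normal Weak Yes
Rainy Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rainy Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rainy Mild High Strong No
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

def entropy(labels):
    """Step 1: entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Step 2: entropy reduction obtained by splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def id3(rows, attrs, target):
    """Steps 3 to 6: pick the best attribute, branch on its values, recurse."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:              # all examples agree: leaf node
        return labels[0]
    if not attrs:                          # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in rows})}}

tree = id3(ROWS, COLUMNS[:-1], "PlayTennis")
print(tree)
```

On this data the sketch places Outlook at the root, tests Humidity under Sunny and Wind under Rainy, and makes Overcast a pure "Yes" leaf.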
Attribute Selection Measures (ASM)
 Attribute Selection Measures (ASM) are criteria used in decision tree algorithms to select
the best attribute for splitting the data at each node of the tree.
 These measures quantify how well an attribute separates the training examples into their
target classes.
 Decision tree algorithms like ID3, C4.5, and CART use these measures to recursively build
trees by selecting attributes that optimize the chosen measure at each step.
 Here are some commonly used Attribute Selection Measures in decision tree algorithms
1. Entropy
2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance

1.Entropy(S):
 Entropy is a measure of impurity or uncertainty in a dataset.
 In the context of decision trees, entropy is used to quantify the amount of information
disorder or unpredictability in the data before and after splitting based on an attribute.
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
Where:
S is the dataset at a given node.
c is the number of classes in the dataset.
p_i is the proportion of examples in class i in dataset S.
• The entropy is 0 if all members of S belong to the same class (highly pure dataset)
• The entropy is 1 when the collection contains an equal number of positive and negative
examples (mixed dataset/impure)
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1
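The three cases above can be verified numerically. The following is a small sketch in which the function name `entropy` and the example proportions (including the 9 positive / 5 negative collection) are illustrative assumptions.

```python
import math

def entropy(proportions):
    """Entropy in bits for a list of class proportions that sum to 1."""
    # Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(-p * math.log2(p) for p in proportions if p > 0)

pure = entropy([1.0])                  # all examples in one class
balanced = entropy([0.5, 0.5])         # equal positives and negatives
unequal = entropy([9 / 14, 5 / 14])    # 9 positive, 5 negative examples

print(pure, balanced, round(unequal, 3))
```

The pure collection gives 0, the balanced one gives 1, and the unequal one gives about 0.940, which lies between 0 and 1 as stated.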

2. Information Gain:
 Information gain is the reduction in entropy or uncertainty achieved by splitting the data
on a particular attribute.
 The decision tree algorithm selects the attribute that maximizes information gain at each
node.
 Constructing a decision tree is all about finding, at each node, the attribute that returns
the highest information gain, i.e., the smallest weighted entropy after the split.
$$\mathrm{InformationGain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
Where:
A is the attribute being considered for splitting.
Values(A) is the set of all possible values of attribute A.
Sv is the subset of S where attribute A has value v.
∣S∣ is the total number of examples in S.
∣Sv∣ is the number of examples in subset Sv.
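As a worked illustration of the formula, the sketch below computes the gain from class counts. The counts are an assumed example: a collection with 9 positive and 5 negative examples, split by a two-valued attribute into subsets with [6+, 2-] and [3+, 3-].

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits from per-class example counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Entropy(S) minus the size-weighted entropy of the subsets S_v."""
    total = sum(parent_counts)
    weighted = sum(sum(ch) / total * entropy_from_counts(ch) for ch in child_counts)
    return entropy_from_counts(parent_counts) - weighted

gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 3))  # 0.048
```

Here Entropy(S) is about 0.940, the weighted subset entropy is (8/14)(0.811) + (6/14)(1.0), and the difference is about 0.048, so this attribute is only weakly informative.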

3. Gini Index:
 Gini index measures the impurity or the likelihood of an incorrect classification of a
randomly chosen element in the dataset if it were randomly labeled according to the
distribution of labels in the subset.
 It is used to evaluate splits in the dataset. A low Gini index suggests that a particular
attribute provides good separation of the classes.
 Higher value of Gini index implies higher inequality, higher heterogeneity.
$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2$$
Where:
pi is the proportion of examples in class i in dataset S.
 Example Calculation: Suppose we have a dataset S with 10 examples, where 6 examples
belong to class A and 4 examples belong to class B.
Proportion of class A: $P_A = \frac{6}{10} = 0.6$
Proportion of class B: $P_B = \frac{4}{10} = 0.4$
$$\mathrm{Gini}(S) = 1 - (0.6)^2 - (0.4)^2$$
$$\mathrm{Gini}(S) = 0.48$$
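The same calculation can be scripted, and extended to score a candidate split by the size-weighted Gini index of its subsets. In the sketch below the function names and the split into [5 A, 1 B] and [1 A, 3 B] are hypothetical, chosen only to illustrate the idea.

```python
def gini(counts):
    """Gini index from per-class example counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(child_counts):
    """Size-weighted Gini index of the subsets produced by a split."""
    total = sum(sum(ch) for ch in child_counts)
    return sum(sum(ch) / total * gini(ch) for ch in child_counts)

parent = gini([6, 4])                          # 6 examples of class A, 4 of class B
after_split = weighted_gini([[5, 1], [1, 3]])  # hypothetical split of the same 10 examples

print(round(parent, 2))        # 0.48
print(round(after_split, 3))   # 0.317
```

A split is considered useful when the weighted Gini index after the split is lower than the Gini index of the parent node, as it is here.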
4. Gain Ratio:
 Gain ratio is an extension of information gain that takes into account the intrinsic
information of a split by normalizing the information gain using the split information.
 Gain ratio adjusts for the bias towards attributes with a large number of distinct values.
$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{InformationGain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$
Where:
$$\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
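A short sketch of the gain ratio computation follows. The class counts are an assumed example: 14 examples (9 positive, 5 negative) split by a three-valued attribute into subsets with [2+, 3-], [4+, 0-], and [3+, 2-].

```python
import math

def entropy_from_counts(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    total = sum(parent_counts)
    weighted = sum(sum(ch) / total * entropy_from_counts(ch) for ch in child_counts)
    return entropy_from_counts(parent_counts) - weighted

def split_information(child_counts):
    """Entropy of the partition sizes themselves, ignoring class labels."""
    return entropy_from_counts([sum(ch) for ch in child_counts])

def gain_ratio(parent_counts, child_counts):
    return information_gain(parent_counts, child_counts) / split_information(child_counts)

children = [[2, 3], [4, 0], [3, 2]]
gain = information_gain([9, 5], children)
split_info = split_information(children)
ratio = gain_ratio([9, 5], children)

print(round(gain, 3), round(split_info, 3), round(ratio, 3))
```

Split information grows with the number of branches and with how evenly the examples are spread over them, so dividing by it penalizes attributes that fragment the data into many small subsets. A practical implementation would also need to guard against a split information of zero, which this sketch does not do.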

5. Reduction in Variance:
 Reduction in variance is used in decision trees for regression tasks, where it measures the
amount by which variance in the target variable is reduced when a dataset is split based on
an attribute.
 It seeks to minimize the variance of the target variable within each node of the tree.
$$\mathrm{ReductionInVariance}(S, A) = \mathrm{Variance}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Variance}(S_v)$$
Where:
Variance(S) is the variance of the target variable in dataset S.
Variance(Sv) is the variance of the target variable in subset Sv.
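The formula can be illustrated with a small sketch that uses the population variance from Python's standard library. The target values below are hypothetical: six numeric targets that an attribute splits into two groups of three.

```python
from statistics import pvariance

def variance_reduction(groups):
    """Variance(S) minus the size-weighted variance of the subsets S_v.

    groups: one list of target values per attribute value v.
    """
    all_values = [y for g in groups for y in g]
    n = len(all_values)
    weighted = sum(len(g) / n * pvariance(g) for g in groups)
    return pvariance(all_values) - weighted

# Hypothetical regression targets, grouped by the value of the split attribute.
groups = [[10, 12, 11], [20, 22, 21]]
reduction = variance_reduction(groups)
print(round(reduction, 4))  # 25.0
```

The overall variance is 154/6 (about 25.67) and each subset has variance 2/3, so the split removes almost all of the variance; a regression tree would favor such an attribute.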
Examples 1 to 4: [worked examples appeared as figures in the original and are not reproduced in this text]

Exercise
Given the following dataset, construct a decision tree using the ID3 algorithm
1)

2)
Hours Studied | Attendance | Homework Completed | Participation | Passed
5 | High | Yes | Active | Yes
3 | Medium | Yes | Inactive | No
4 | High | No | Active | Yes
2 | Low | No | Inactive | No
1 | Low | No | Inactive | No
5 | Medium | Yes | Inactive | Yes
4 | Medium | Yes | Inactive | Yes
2 | Low | Yes | Active | No
3)
Age | Income Level | Credit Score | Previous Purchase | Purchase Decision
25 | High | Excellent | Yes | Yes
40 | Medium | Good | No | No
35 | Low | Poor | No | No
28 | High | Good | Yes | Yes
50 | Medium | Good | No | Yes
45 | Low | Poor | No | No
30 | High | Excellent | No | Yes
55 | Medium | Good | Yes | Yes
60 | Low | Poor | Yes | No
20 | Medium | Excellent | No | Yes
4)
Weather | Distance | Traffic | Car Availability | Commute By Car
Sunny | Short | Low | Yes | Yes
Rainy | Long | High | Yes | No
Sunny | Long | Medium | No | No
Overcast | Short | Low | Yes | Yes
Rainy | Short | Medium | No | No
Sunny | Long | Low | Yes | Yes
Overcast | Long | High | Yes | No
Rainy | Short | Low | No | No
Sunny | Short | Medium | Yes | Yes
Overcast | Long | Medium | No | No

HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
 ID3 algorithm can be understood as a search through the space of hypotheses to find one
that best fits the training examples.
 The hypothesis space in ID3 is the set of all possible decision trees.
 ID3 starts with the simplest hypothesis (an empty tree) and progressively explores more
complex hypotheses, guided by information gain.
Key Characteristics of ID3's Search Strategy:
1. Complete Hypothesis Space:
 The hypothesis space in ID3 includes all possible decision trees that can be formed from
the given attributes.
 This ensures that the hypothesis space is complete because any finite discrete-valued
function can be represented by some decision tree.
 ID3 avoids the risk that the target function is not within the hypothesis space, a problem
common in incomplete hypothesis spaces.
2. Single Hypothesis Approach:
 ID3 maintains only one current hypothesis at any point during the search.
 This contrasts with the version-space Candidate-Elimination method, which maintains
the set of all hypotheses consistent with the training data.
o Limitation: By focusing on a single hypothesis, ID3 loses the ability to:

 Determine the number of alternative decision trees consistent with the training
data.
 Pose new instance queries to resolve among competing hypotheses.
3. No Backtracking:
 Once ID3 selects an attribute to test at a node, it does not reconsider this choice.
 The search path is fixed and may lead to a locally optimal solution.
 Limitation: This locally optimal solution may not be as desirable as other potential trees
that could have been found by exploring different branches of the search.
4. Statistical Decision-Making:
 ID3 uses all training examples at each step to make statistically based decisions about
refining the current hypothesis.
 Advantage: This approach makes ID3 less sensitive to errors in individual training
examples.
 Handling Noisy Data: ID3 can be extended to handle noisy data by adjusting the
termination criterion to accept hypotheses that imperfectly fit the training data.
INDUCTIVE BIAS IN DECISION TREE LEARNING
Inductive Bias refers to the set of assumptions or predispositions that a learning algorithm uses to
generalize from the training data to new, unseen instances. It shapes how the algorithm selects and
prioritizes certain hypotheses (models) over others.
ID3 Algorithm and Its Bias
ID3 (Iterative Dichotomiser 3) is a classic decision tree algorithm that constructs decision trees
from a set of training data. Its approximate inductive bias includes:
1. Preference for Shorter Trees:
 ID3 prefers simpler (shorter) decision trees over longer ones.
 This preference stems from Occam's razor, which suggests that simpler hypotheses
are more likely to generalize well to new, unseen data.
 Shorter trees are less complex and less prone to overfitting.

2. High Information Gain:
 ID3 uses the information gain heuristic to decide which attribute to split on at each
node of the tree.
 It places attributes with higher information gain closer to the root of the tree.
 This strategy aims to reduce uncertainty (entropy) most effectively, leading to
more informative splits.
Types of Inductive Bias
1. Preference Bias:
 ID3 exhibits a preference bias because its bias arises primarily from its search
strategy within a complete hypothesis space.
 It favors hypotheses (decision trees) that are simpler (shorter) and provide higher
information gain.
2. Restriction Bias:
 In contrast, algorithms like the Candidate-Elimination Algorithm might exhibit a
restriction bias because they operate within a more limited hypothesis space.
 This limitation can potentially exclude the true target function if it falls outside the
predefined constraints of the hypothesis space.
Occam's Razor and Preference for Short Hypotheses
Occam's Razor states that among competing hypotheses, the one with the fewest assumptions
should be selected. In the context of machine learning:
 Favoring Simplicity: Occam’s razor supports the preference for shorter hypotheses (or
models). This preference is justified because simpler hypotheses are less likely to fit the
training data coincidentally (overfitting) and are more likely to capture the underlying
patterns that generalize well.
 Arguments in Favor: There are far fewer short hypotheses than long ones, so a short
hypothesis that fits the training data is less likely to do so by coincidence. Short
hypotheses also tend to give clearer insights into the data and are cheaper to apply.

 Arguments Opposed: The same counting argument applies to many other small sets of
hypotheses that could be defined arbitrarily, so it is unclear why short hypotheses deserve
special preference. Moreover, the size of a hypothesis depends on the representation used
by the learner, so different learners may derive different hypotheses from the same data.
This preference aims to strike a balance between model simplicity and predictive power, enhancing
generalization to new data while avoiding overfitting. Understanding these biases helps in
selecting appropriate algorithms and interpreting their results effectively in real-world
applications.
ISSUES IN DECISION TREE LEARNING
Issues in learning decision trees include
1. Avoiding Overfitting the Data
2. Incorporating Continuous-Valued Attributes
3. Alternative Measures for Selecting Attributes
4. Handling Training Examples with Missing Attribute Values
5. Handling Attributes with Differing Costs
1. Avoiding Overfitting the Data
Overfitting occurs when a decision tree model is too complex and captures noise in the training
data rather than the underlying patterns.
To prevent overfitting, several strategies can be employed:
a) Pre-pruning (avoidance): Pre-pruning, also known as early stopping, involves stopping the
growth of the decision tree early, before it perfectly classifies the training data.
b) Post-pruning (recovery): Post-pruning involves growing the full tree and then pruning it back
to avoid overfitting.

c) Converting a Decision Tree into Rules: Decision trees can be easily converted into a set of if-
then rules, which can be more interpretable than the tree structure. This can help mitigate
overfitting by providing a more concise and generalized representation of the decision process.
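Converting a tree into rules amounts to emitting one rule per root-to-leaf path. The sketch below assumes a tree stored as nested dictionaries of the form {attribute: {value: subtree_or_label}}; both this representation and the example tree (a commonly shown PlayTennis tree) are illustrative assumptions.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) pair for every root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: the path so far is a complete rule
        return [(conditions, tree)]
    (attr, branches), = tree.items()        # each internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

def format_rule(conditions, label):
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {tests} THEN {label}"

# Hypothetical tree used only for illustration.
tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rainy": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

rules = tree_to_rules(tree)
for conditions, label in rules:
    print(format_rule(conditions, label))
```

Once the tree is in rule form, each rule can be pruned independently by dropping conditions whose removal does not hurt estimated accuracy, which is more flexible than removing whole subtrees.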
2. Incorporating Continuous-Valued Attributes
Handling continuous-valued attributes involves splitting the attribute's value range into intervals.
There are two methods for handling continuous attributes:
a) Define new discrete valued attributes that partition the continuous attribute value into a
discrete set of intervals.
E.g., {High ≡ Temp > 35 °C, Med ≡ 10 °C < Temp ≤ 35 °C, Low ≡ Temp ≤ 10 °C}
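The interval definition above maps directly onto a small function. The name `discretize_temp` is hypothetical; the boundaries are the ones given in the example.

```python
def discretize_temp(temp_c):
    """Map a temperature in degrees Celsius to the discrete values High, Med, Low."""
    if temp_c > 35:
        return "High"
    if temp_c > 10:      # 10 < temp_c <= 35
        return "Med"
    return "Low"         # temp_c <= 10

print([discretize_temp(t) for t in (5, 10, 22, 35, 40)])
# ['Low', 'Low', 'Med', 'Med', 'High']
```

After this preprocessing step the attribute can be handled by ID3 like any other discrete-valued attribute.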

b) Using thresholds for splitting nodes: To define a threshold-based Boolean attribute for a
continuous attribute like Temperature, you need to determine a threshold value, t, and then create
a Boolean condition that evaluates whether the attribute's value is above or below this threshold
(e.g., Temperature > t).
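One common way to choose t, sketched below, is to sort the examples by the attribute, take the midpoints between adjacent examples whose class labels differ as candidates, and keep the candidate with the highest information gain. The temperature values and labels are an assumed textbook-style example, and the function names are hypothetical.

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum(
        (labels.count(c) / total) * math.log2(labels.count(c) / total)
        for c in set(labels)
    )

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:]) if a[1] != b[1]]

def gain_for_threshold(values, labels, t):
    """Information gain of the Boolean test value > t."""
    below = [y for x, y in zip(values, labels) if x <= t]
    above = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    remainder = len(below) / n * entropy(below) + len(above) / n * entropy(above)
    return entropy(labels) - remainder

# Assumed example data.
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]

candidates = candidate_thresholds(temps, play)
best = max(candidates, key=lambda t: gain_for_threshold(temps, play, t))
print(candidates, best)  # [54.0, 85.0] 54.0
```

The resulting Boolean attribute (here Temperature > 54) then competes with the other attributes at that node in the usual way.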
3. Alternative Measures for Selecting Attributes
Different criteria can be used to select the best attribute for splitting the data:
a) Gain Ratio: Adjusts information gain by taking into account the intrinsic information of a
split.
$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{InformationGain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$
Where:
$$\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
b) Gini Index: Measures impurity based on the probability of a randomly chosen element being
incorrectly labeled.
$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2$$
Where:
pi is the proportion of examples in class i in dataset S.

4. Handling Training Examples with Missing Attribute Values
Dealing with missing values can be approached in several ways:
a) Ignore Examples: Discard examples that have missing values for any attribute. This can lead
to a significant reduction in the size of the training dataset, potentially removing valuable
information and reducing the overall robustness of the model.
Original Dataset:
Age | Salary | Purchased
25 | 50000 | Yes
30 | 60000 | No
35 | ? | Yes
? | 45000 | No
After Ignoring Examples with Missing Values:
Age | Salary | Purchased
25 | 50000 | Yes
30 | 60000 | No
b) Assign Most Common Value: For categorical attributes, assign the most common (mode)
value of that attribute among the remaining examples. This approach can introduce bias into the
model, as it may skew the representation of certain values and does not account for the variability
and true distribution of the data.
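Original Dataset:
Age | Salary | Purchased
25 | 50000 | Yes
33 | 60000 | Yes
30 | 60000 | No
35 | ? | Yes
30 | 45000 | No
After Assigning Most Common Values (the most common Salary is 60000):
Age | Salary | Purchased
25 | 50000 | Yes
33 | 60000 | Yes
30 | 60000 | No
35 | 60000 | Yes
30 | 45000 | No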
c) Assign Mean Value: For continuous attributes, replace the missing values with the mean value
of that attribute from the other examples. Similar to assigning the most common value, this can
introduce bias and does not capture the potential range of variability within the data.
Original Dataset:
Age | Salary | Purchased
25 | 50000 | Yes
30 | 60000 | No
35 | ? | Yes
? | 45000 | No
 Mean Age: (25 + 30 + 35) / 3 = 30
 Mean Salary: (50000 + 60000 + 45000) / 3 = 51666.67
After Assigning Mean Values:
Age | Salary | Purchased
25 | 50000 | Yes
30 | 60000 | No
35 | 51666.67 | Yes
30 | 45000 | No
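The three strategies can be sketched in plain Python, with `None` standing in for the "?" entries. The helper names are hypothetical; the data are the small Age/Salary/Purchased tables used in the examples above.

```python
from collections import Counter
from statistics import mean

rows = [
    {"Age": 25, "Salary": 50000, "Purchased": "Yes"},
    {"Age": 30, "Salary": 60000, "Purchased": "No"},
    {"Age": 35, "Salary": None, "Purchased": "Yes"},
    {"Age": None, "Salary": 45000, "Purchased": "No"},
]

def observed(rows, attr):
    """Values of attr that are actually present."""
    return [r[attr] for r in rows if r[attr] is not None]

def drop_incomplete(rows):
    """a) Ignore examples that have any missing value."""
    return [r for r in rows if None not in r.values()]

def fill(rows, attr, value):
    """Replace missing entries of attr with value, leaving other rows unchanged."""
    return [{**r, attr: value} if r[attr] is None else r for r in rows]

def most_common(rows, attr):
    """b) Mode of the observed values of attr."""
    return Counter(observed(rows, attr)).most_common(1)[0][0]

# c) Mean imputation for the continuous attributes.
mean_age = mean(observed(rows, "Age"))
mean_salary = mean(observed(rows, "Salary"))
imputed = fill(fill(rows, "Age", mean_age), "Salary", mean_salary)

print(len(drop_incomplete(rows)))          # 2
print(mean_age, round(mean_salary, 2))     # 30 51666.67

# b) Mode imputation on the five-row table, where Salary 60000 occurs twice.
salary_rows = [{"Salary": s} for s in (50000, 60000, 60000, None, 45000)]
print(most_common(salary_rows, "Salary"))  # 60000
```

Dropping rows halves this tiny dataset, which illustrates why imputation is often preferred when data are scarce, at the cost of the bias noted above.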
5. Handling Attributes with Differing Costs
 When building decision trees, it is important to consider the case where the attributes
(features) have differing costs associated with them.
 This is a common scenario in real-world applications, such as medical diagnosis, where
certain tests or examinations may be more expensive or invasive than others.

a) Cost-Sensitive Splitting Criterion
 Instead of using the standard information gain or Gini impurity as the splitting criterion, a
cost-sensitive version can be used.
 The idea is to divide the information gain by the cost of the attribute
$$\mathrm{CostSensitiveGain} = \frac{\mathrm{InformationGain}}{\mathrm{AttributeCost}}$$
 This encourages the algorithm to select lower-cost attributes, as they will have a higher
cost-sensitive gain.
b) Weighted Information Gain
 Another approach is to use a weighted information gain, where the weight is inversely
proportional to the attribute cost:
$$\mathrm{WeightedInformationGain} = \mathrm{InformationGain} \times \frac{1}{\mathrm{AttributeCost}}$$
 This has a similar effect to the cost-sensitive splitting criterion, biasing the algorithm
towards lower-cost attributes.
Example:
Assume we have three attributes for a medical test:
 Attribute A (Cost: $10)
 Attribute B (Cost: $50)
 Attribute C (Cost: $100)
a) Using Cost-Sensitive Splitting Criterion
Cost-sensitive gain is calculated by dividing the information gain by the attribute cost.
Attribute | Cost | Information Gain | Cost-Sensitive Gain
A | $10 | 0.3 | 0.3 / 10 = 0.03
B | $50 | 0.5 | 0.5 / 50 = 0.01
C | $100 | 0.8 | 0.8 / 100 = 0.008
The algorithm would choose Attribute A despite its lower information gain because it has a
higher cost-sensitive gain.

b) Using Weighted Information Gain
Weighted information gain is calculated by multiplying the information gain by the inverse of
the attribute cost.
Attribute | Cost | Information Gain | Weighted Information Gain
A | $10 | 0.3 | 0.3 × (1/10) = 0.03
B | $50 | 0.5 | 0.5 × (1/50) = 0.01
C | $100 | 0.8 | 0.8 × (1/100) = 0.008
Again, the algorithm would prefer Attribute A due to its higher weighted information gain.
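Both criteria reduce to the same ranking, which a short sketch makes explicit. The dictionary layout and variable names are illustrative; the costs and gains are the ones from the example above.

```python
attributes = {
    "A": {"cost": 10, "gain": 0.3},
    "B": {"cost": 50, "gain": 0.5},
    "C": {"cost": 100, "gain": 0.8},
}

# a) Cost-sensitive gain: information gain divided by attribute cost.
cost_sensitive = {name: a["gain"] / a["cost"] for name, a in attributes.items()}

# b) Weighted gain: information gain multiplied by the inverse of the cost.
weighted = {name: a["gain"] * (1 / a["cost"]) for name, a in attributes.items()}

best_by_gain = max(attributes, key=lambda n: attributes[n]["gain"])
best_by_cost = max(cost_sensitive, key=cost_sensitive.get)

print({n: round(v, 3) for n, v in cost_sensitive.items()})
print(best_by_gain, best_by_cost)  # C A
```

Plain information gain would pick Attribute C, while either cost-aware score picks Attribute A. Dividing by the raw cost is a strong penalty, so in practice the cost term is sometimes softened, which this sketch does not attempt.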
c) Minimum Cost Path
 Instead of optimizing the tree for overall accuracy, the goal can be to find the minimum
cost path from the root to a leaf node.
 This can be achieved by modifying the pruning algorithm to consider the total cost of the
path, not just the accuracy.
For example, consider two paths:
 The task is to identify and follow the path that minimizes the total cost between
two nodes.
 The path marked with a green arrow has costs of 10, 10, 10, and 10, resulting in a
total cost of 40.
 The path marked with a red arrow has costs of 10, 20, and 30, resulting in a total
cost of 60.
**********
