ELH -4.2: MACHINE LEARNING UNIT – I
Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24
ELH -4.2: MACHINE LEARNING
INTRODUCTION
 Machine Learning (ML) is a field of artificial intelligence (AI) focused on developing
algorithms that enable computers to learn from and make decisions based on data.
 Its history dates back to the 1950s with Alan Turing's concept of machines simulating
human intelligence.
 The term "artificial intelligence" was coined in 1956, but early AI faced limitations due to
insufficient data and computational power. The 1980s saw the emergence of machine
learning methods, and the 1990s brought significant advances with the rise of the internet
and statistical techniques.
 The modern era, particularly from the 2010s onwards, has been dominated by deep
learning, leveraging neural networks and vast datasets to achieve breakthroughs in areas
like image and speech recognition, transforming various industries and applications.
Definitions and Examples
1. Learning
Definition: Learning, in the context of machine learning, refers to the process of gaining
knowledge or skills through experience, study, or being taught. In ML, it specifically means
improving performance on a task over time by gaining experience from data.
Example: Suppose you are learning to play chess. Initially, you might not know the best strategies,
but as you play more games and study strategies, you get better. Similarly, in ML, an algorithm
might start with little knowledge but improves its performance as it processes more data.
2. Machine
Definition: A machine, in this context, is any device that uses electrical or mechanical power to
perform tasks. In the realm of machine learning, "machine" usually refers to a computer or a system
that can process and analyze data.
Example: Your smartphone is a machine. It can perform various tasks like recognizing your voice
or face, suggesting the next word while typing, or recommending songs you might like based on
your listening history.
3. Natural intelligence
Definition: Natural intelligence refers to the inherent ability of humans and certain animals to understand,
learn, and adapt to their environment using cognitive processes such as perception, reasoning, and
problem-solving. It encompasses a wide range of capabilities, including language comprehension,
social interactions, creativity, and emotional intelligence.
Example: A person walking through a crowded street demonstrates natural intelligence by
effortlessly navigating through the environment, avoiding obstacles, recognizing familiar faces,
interpreting traffic signals, and making decisions based on situational awareness. This ability to
process complex sensory information, analyze context, and respond appropriately showcases the
remarkable cognitive abilities inherent in natural intelligence.
4. Artificial Intelligence (AI)
Definition: AI is the broader field that encompasses any technique enabling computers to mimic
human intelligence. This includes problem-solving, understanding language, recognizing patterns,
and learning from experience.
Example: Siri, Apple's virtual assistant, is an example of AI. It can understand and respond to
your questions, set reminders, and perform tasks based on your voice commands. This involves
natural language processing and machine learning.
Types of AI
Artificial Intelligence can be categorized in two main ways: based on its capabilities and based
on its functionality.
I) Based on Capabilities
| AI Type | Description | Examples |
|---|---|---|
| 1. Narrow AI | AI designed for specific tasks. | Apple Siri; playing chess; self-driving cars; speech recognition; image recognition |
| 2. General AI | AI with the ability to understand, learn, and apply intelligence across a wide range of tasks. | Currently theoretical; under research and development. |
| 3. Super AI | AI that surpasses human intelligence in all aspects, including creativity, problem-solving, and emotions. | Currently theoretical; under research and development. |
II) Based on Functionalities
| AI Type | Description | Examples |
|---|---|---|
| 1. Reactive Machines | Basic AI systems that react to current scenarios without storing memories or past experiences. | IBM's Deep Blue; Google's AlphaGo |
| 2. Limited Memory | AI systems capable of storing past experiences for a short period and using them to inform decisions. | Self-driving cars storing recent speed of nearby cars, distance, speed limits, etc. |
| 3. Theory of Mind | AI intended to understand human emotions and beliefs, and to interact socially like humans. | Currently in development; no existing examples. |
4. Machine Learning (ML)
Definition: Machine Learning (ML) is a subset of artificial intelligence (AI) that enables
computers to learn from and make predictions or decisions based on data. Instead of being
explicitly programmed to perform a task, a machine learning model is trained using data and
algorithms to find patterns and make decisions.
Example: A spam filter in your email uses ML to distinguish between spam and legitimate emails.
It learns from past emails marked as spam or not and uses that data to predict and filter future
emails.
Difference between Machine Learning and Traditional Programming
 Machine learning (ML) and traditional programming are two distinct approaches to solving
problems with computer systems.
 While traditional programming relies on explicit rules and human-crafted logic, machine
learning leverages algorithms that learn from data to make predictions or decisions.
| Characteristic | Machine Learning | Traditional Programming |
|---|---|---|
| Approach | Data-driven | Rule-based |
| Development Process | Model is trained using data and algorithms | Explicit rules and logic are coded by developers |
| Adaptability | Highly adaptable; can improve with more data | Limited to predefined rules; changes require code modifications |
| Human Intervention | Minimal after training; continuous learning | Ongoing maintenance and updates by developers |
| Handling Complexity | Handles complex patterns and large datasets effectively | Effective for well-defined problems and tasks |
| Required Input | Large datasets for training and testing | Detailed specifications and rules |
| Error Handling | Can handle noisy or incomplete data | Requires precise data and handling of edge cases |
| Performance | Performance improves with more data and better algorithms | Performance depends on code optimization |
| Learning from Data | Learns and improves from new data | Does not learn; behavior remains static unless reprogrammed |
| Flexibility | Can generalize well to new, unseen data | Limited flexibility; changes require code rewrite |
| Predictive Capability | Can make predictions based on patterns in data | Cannot predict; follows explicit instructions |
| Time to Deployment | Longer initial setup for training models | Quicker to deploy for well-defined tasks |
| Scalability | Scales well with more data and computational power | Scales with code complexity and hardware resources |
| Limitations | Can be limited by the quality and quantity of training data | Can be limited by the programmer's understanding and analysis of the problem |
| Advantages | Can handle complex tasks and adapt to new data and scenarios | Can be used for tasks that require specific functionality and are well-defined |
| Examples of Technologies | Neural networks, decision trees, support vector machines | Compilers, interpreters, databases |
| Applications | Used for complex tasks like image recognition and predictive analytics | Used for tasks like database management and website development |
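The contrast between the two approaches can be sketched on the spam filter example. This is a minimal illustration only: the four training messages, the word-score scheme, and all function names are hypothetical and chosen for brevity, not taken from any real spam filter.

```python
from collections import Counter

# Hypothetical toy data: (message, is_spam) pairs invented for illustration.
TRAINING = [
    ("win a free prize now", True),
    ("free money claim now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
]

def rule_based_is_spam(message):
    """Traditional programming: a developer hard-codes the rule."""
    return "free" in message.split()

def train_word_scores(examples):
    """Machine learning: derive per-word spam scores from labeled data."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        (spam_counts if is_spam else ham_counts).update(text.split())
    vocabulary = set(spam_counts) | set(ham_counts)
    # Positive score: the word appeared more often in spam than in legitimate mail.
    return {word: spam_counts[word] - ham_counts[word] for word in vocabulary}

def learned_is_spam(message, scores):
    """Classify by summing learned scores; unknown words contribute 0."""
    return sum(scores.get(word, 0) for word in message.split()) > 0

scores = train_word_scores(TRAINING)
print(rule_based_is_spam("claim your prize now"))       # the fixed rule only checks for "free"
print(learned_is_spam("claim your prize now", scores))  # uses patterns found in the data
```

Changing the behavior of the rule-based version requires editing its code, whereas the learned version changes when the training examples change.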
Need for Machine Learning
 The need for machine learning (ML) arises from the exponential expansion of data in the
digital age. Traditional analytical approaches are no longer adequate due to the vast
amounts of data being generated daily.
 Machine learning algorithms can find patterns, trends, and connections that humans would
not even be aware of, making them crucial for decision-making processes, optimizing
resource allocation, and spurring innovation in various sectors.
1. Handling Big Data: ML can process and analyze vast amounts of data, extracting
meaningful patterns and insights.
2. Complex Pattern Recognition: ML algorithms excel at identifying intricate patterns
in data that are difficult to detect using traditional methods.
3. Automation of Tasks: ML enables automation of repetitive tasks, reducing human
intervention and increasing efficiency.
4. Improved Decision Making: ML models can provide data-driven insights and
predictions, aiding in better decision-making processes.
5. Adaptability: ML models can adapt to new data and changing conditions, making them
flexible and robust.
6. Personalization: ML allows for personalized experiences, such as tailored
recommendations in e-commerce and streaming services.
7. Scalability: ML systems can scale with the amount of data and computational power,
improving performance and accuracy over time.
8. Real-Time Processing: ML can process data in real time, enabling applications like fraud
detection, autonomous vehicles, and instant recommendations.
9. Complex Problem Solving: ML can tackle problems that are too complex for traditional
algorithms, such as image and speech recognition.
10. Predictive Maintenance: ML can predict equipment failures and maintenance needs,
reducing downtime and saving costs.
11. Enhanced Customer Experience: ML-driven chatbots and virtual assistants provide
better customer support and interaction.
Life Cycle of Machine Learning
The Machine Learning life cycle encompasses the iterative process of developing, deploying, and
maintaining machine learning models. It involves steps such as data gathering, preprocessing,
model selection and training, evaluation, and deployment, ensuring the model's effectiveness and
adaptability to real-world scenarios. This cyclical approach enables continuous improvement and
refinement of models to meet evolving needs and challenges.
Here are the seven steps of the Machine Learning life cycle, illustrated with a fruit classification example:
1. Define Problem Statement
 Understand the problem to be solved and define the objectives of the machine learning
project.
 In this example, the goal is to develop a model that can classify fruits into different
categories based on their features, such as color, shape, and size.
2. Data Gathering
 Collect relevant data that will be used to train and test the machine learning model.
 This could involve gathering information about various types of fruits, including images
and corresponding labels indicating the fruit type.
3. Data Preparation
 Clean and preprocess the collected data to ensure it is in a suitable format for training the
machine learning model.
 This may involve tasks such as removing irrelevant features, handling missing values, and
normalizing the data.
4. Data Analysis
 Explore and analyze the prepared data to gain insights into its characteristics and identify
patterns that may be useful for training the model.
 For example, analyzing the distribution of different fruit types in the dataset and
visualizing the relationships between features.
5. Model Selection and Training
 Select an appropriate machine learning algorithm and train it using the prepared data.
 In this example, you might choose a classification algorithm such as a decision tree or a
neural network to train the model to classify fruits based on their features.
6. Model Testing
 Evaluate the performance of the trained model using a separate dataset that was not used
during training. This helps assess how well the model generalizes to new, unseen data.
 For fruit classification, you would test the model on a set of fruit images it hasn't seen
before and measure its accuracy in predicting the correct fruit type.
7. Deployment
 Deploy the trained model into a production environment where it can be used to make
predictions on new, incoming data.
 For example, you could develop a mobile app that allows users to take a picture of a fruit
and have the model classify it in real-time based on its features.
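The seven steps can be walked through in a compact sketch. Everything specific here is an assumption made for illustration: the two features (weight in grams, length in centimetres), the eight data points, and the nearest-centroid classifier, which stands in for the decision tree or neural network mentioned in step 5 because it fits in a few lines.

```python
import math
from collections import Counter

# Steps 1-2: the problem is to classify fruit from (weight in g, length in cm).
# The data below is hypothetical.
DATA = [
    ((150, 7.0), "apple"), ((170, 7.5), "apple"), ((160, 7.2), "apple"),
    ((120, 18.0), "banana"), ((130, 19.0), "banana"), ((125, 17.5), "banana"),
    ((155, 6.8), "apple"), ((118, 18.5), "banana"),
]
train, test = DATA[:6], DATA[6:]  # hold out the last two examples for testing

# Step 3: data preparation, here min-max normalisation fitted on the training split.
columns = list(zip(*(features for features, _ in train)))
lows = [min(column) for column in columns]
highs = [max(column) for column in columns]

def normalise(features):
    return tuple((x - lo) / (hi - lo) for x, lo, hi in zip(features, lows, highs))

# Step 4: data analysis, here just the class balance of the training split.
class_counts = Counter(label for _, label in train)

# Step 5: model training, one centroid (mean point) per fruit type.
def fit_centroids(examples):
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(normalise(features))
    return {label: tuple(sum(col) / len(col) for col in zip(*rows))
            for label, rows in grouped.items()}

centroids = fit_centroids(train)

# Step 7: the function a deployed application would call on new input.
def predict(features):
    point = normalise(features)
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

# Step 6: model testing on examples that were not used for training.
accuracy = sum(predict(features) == label for features, label in test) / len(test)
print(class_counts, accuracy)
```

A real project would use far more data and would repeat the cycle whenever testing or deployment reveals weaknesses.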
Types of Machine Learning Algorithms:
Machine learning algorithms are classified into three main types: supervised, unsupervised, and
reinforcement learning.
1. Supervised Learning:
Definition: Supervised learning involves training a model on a labeled dataset, where each input
is associated with a corresponding output label. The goal is for the model to learn the mapping
between inputs and outputs, enabling it to make predictions on new, unseen data.
Working: The algorithm learns from the labeled examples by adjusting its parameters to
minimize prediction errors. It generalizes from the training data to make predictions on new
instances, aiming to accurately predict the correct output labels.
Example: Given the features of a shape (e.g., number of sides, angles), the supervised learning
algorithm would analyze these features and learn patterns distinguishing between different types
of shapes. Once trained, the model can classify new shapes based on their features into categories
like square, rectangle, triangle, or polygon.
Applications: Classification tasks such as spam detection, sentiment analysis, image recognition,
and regression tasks like predicting house prices or stock prices.
Advantages: Ability to make precise predictions on new data, well-understood and widely
applicable across various domains.
Disadvantages: Requires labeled training data, which can be time-consuming and expensive to
obtain. Performance highly depends on the quality and quantity of labeled examples.
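The idea of adjusting parameters to minimize prediction errors can be sketched with a regression task such as house price prediction. The data, the learning rate, and the number of steps below are hypothetical; the model is a straight line fitted by gradient descent on the mean squared error, one of many possible supervised methods.

```python
# Hypothetical labeled data: house area (in 100 sq. m) -> price (arbitrary units).
areas = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]  # generated from price = 2 * area + 1

def train_linear_model(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error of the current parameters on every labeled example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Move the parameters a small step in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = train_linear_model(areas, prices)

def predict_price(area):
    """Apply the learned mapping to a new, unseen input."""
    return w * area + b

print(round(w, 3), round(b, 3), round(predict_price(5.0), 3))
```

The labels (prices) are what make this supervised: each update is driven by the difference between the predicted and the known output.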
2. Unsupervised Learning:
Definition: Unsupervised learning involves training a model on an unlabeled dataset, where the
algorithm learns to find patterns or structures within the data without predefined output labels.
Working: The algorithm identifies underlying patterns or structures in the data without the need
for labeled output. Common techniques include clustering similar data points together or reducing
the dimensionality of the data.
Example: An unsupervised learning algorithm could analyze the geometric properties of the
shapes (e.g., side lengths, angles) and identify clusters of shapes that exhibit similar characteristics.
This could result in clusters representing shapes with similar attributes, such as squares, rectangles,
triangles, and polygons.
Applications: Clustering (e.g., customer segmentation), dimensionality reduction (e.g., principal
component analysis), and anomaly detection.
Advantages: Can uncover hidden patterns or structures in data without labeled examples. Doesn't
require manual labeling of large datasets.
Disadvantages: May be more challenging to interpret results compared to supervised learning.
Relies on assumptions about the structure of the data.
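Clustering, the most common unsupervised technique named above, can be sketched with a plain k-means loop. The six two-dimensional points and the choice of initial centroids are hypothetical; no labels are given to the algorithm at any point.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The number of clusters and the starting centroids are choices made by the user, which is one form of the "assumptions about the structure of the data" listed under disadvantages.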
3. Reinforcement Learning:
Definition: Reinforcement learning involves training an agent to interact with an environment and
learn to make decisions based on feedback in the form of rewards or penalties.
Working: The agent takes actions in an environment and receives feedback in the form of rewards
or penalties. It learns to maximize cumulative rewards over time through trial and error, aiming to
discover the best sequence of actions to achieve its goals.
Example: Consider a maze in which the agent must navigate from the starting point to the
destination while avoiding obstacles and maximizing rewards. The maze consists of different
blocks, including a wall (S6), a fire pit (S8), and a diamond block (S4). The agent receives a +1
reward for reaching the diamond block (S4) and a -1 reward for falling into the fire pit (S8).
Applications: Game playing, robotics, recommendation systems, natural language processing, and
finance (e.g., algorithmic trading).
Advantages: Capable of learning complex behaviors through interaction with the environment.
Can handle situations with delayed feedback and uncertainty.
Disadvantages: Can be computationally expensive and require large amounts of data for training.
Training may be unstable or require careful tuning of hyperparameters. Learning from delayed
rewards can be slow and inefficient in some scenarios.
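The maze example can be sketched with tabular Q-learning, one common reinforcement learning algorithm. Only the wall (S6), fire pit (S8), diamond (S4), and the +1/-1 rewards come from the example; the 3 x 4 grid layout numbered row by row, the start state S9, and the values of alpha, gamma, and epsilon are assumptions made for this sketch.

```python
import random

ROWS, COLS = 3, 4                        # assumed layout: S1..S12 numbered row by row
WALL, FIRE, DIAMOND, START = 6, 8, 4, 9  # START = S9 is an assumption
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for one deterministic move."""
    row, col = divmod(state - 1, COLS)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    next_state = state  # bumping into the wall or the border leaves the agent in place
    if 0 <= new_row < ROWS and 0 <= new_col < COLS:
        candidate = new_row * COLS + new_col + 1
        if candidate != WALL:
            next_state = candidate
    if next_state == DIAMOND:
        return next_state, 1.0, True
    if next_state == FIRE:
        return next_state, -1.0, True
    return next_state, 0.0, False

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, ROWS * COLS + 1) for a in ACTIONS}
    for _episode in range(episodes):
        state = START
        for _move in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start state."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q_table = train()
print(greedy_path(q_table))
```

The reward arrives only when the diamond or the fire pit is reached, so the value of earlier moves has to propagate backwards over many episodes, which illustrates why learning from delayed rewards can be slow.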
Comparison of Supervised, Unsupervised, and Reinforcement Learning
| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Definition | Supervised learning involves training a model on a labeled dataset, where the algorithm learns to map input data to corresponding output labels or categories. | Unsupervised learning involves training a model on an unlabeled dataset, where the algorithm learns to find patterns or structures within the data without predefined output labels. | Reinforcement learning involves training an agent to interact with an environment and learn to make decisions based on feedback in the form of rewards or penalties. |
| Data Type | Requires labeled data for both input and output. | Works with unlabeled data; no output labels are provided during training. | Involves an environment where actions are taken and feedback is received in the form of rewards or penalties. |
| Feedback Mechanism | Feedback provided in the form of labeled examples, allowing the algorithm to adjust its parameters to minimize prediction errors. | No explicit feedback is provided; the algorithm learns to identify patterns based on the inherent structure of the data. | Feedback received in the form of rewards or penalties based on the actions taken by the agent in the environment. |
| Objective | Predict the output label for new, unseen data based on learned patterns from labeled examples. | Discover hidden patterns or structures within the data to gain insights or make sense of complex datasets. | Learn a policy or strategy that maximizes cumulative rewards over time, aiming to achieve specific goals or tasks. |
| Example | Image classification, sentiment analysis, regression tasks like predicting house prices. | Clustering similar data points together, dimensionality reduction, anomaly detection. | Training an agent to play games (e.g., chess, Go), robotics (e.g., navigating a maze), recommendation systems. |
| Applications | Classification (e.g., spam detection, image recognition); regression (e.g., house price prediction). | Clustering (e.g., customer segmentation, document clustering); dimensionality reduction (e.g., principal component analysis); anomaly detection. | Game playing (e.g., chess, Go); robotics (e.g., autonomous vehicles); recommendation systems; natural language processing; finance (e.g., algorithmic trading). |
| Advantages | Ability to make precise predictions on new, unseen data; well-understood and widely applicable in various domains. | Can uncover hidden patterns or structures in data without labeled examples; doesn't require manual labeling of large datasets. | Capable of learning complex behaviors through interaction with the environment; can handle situations with delayed feedback and uncertainty. |
| Disadvantages | Requires labeled training data, which may be time-consuming and expensive to obtain; performance highly dependent on the quality and quantity of labeled examples. | May be more challenging to interpret results compared to supervised learning; relies on assumptions about the structure of the data. | Can be computationally expensive and require large amounts of data for training; training may be unstable or require careful tuning of hyperparameters; learning from delayed rewards can be slow and inefficient in some scenarios. |
5. Deep Learning (DL)
Definition: DL is a subset of ML that uses neural networks with many layers (hence "deep") to
model and understand complex patterns in data. Deep learning is particularly powerful for tasks
like image and speech recognition.
Example: An application like Google Photos can automatically organize your photos by
recognizing faces, objects, and scenes. This is done using deep learning algorithms that have been
trained on vast amounts of image data to identify and categorize images accurately.
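What "layers" means can be sketched with a very small network. This is an illustration under strong simplifications: the network has only one hidden layer, and its weights are set by hand rather than learned from data, whereas real deep learning models have many layers and millions of trained parameters. The function and variable names are hypothetical.

```python
def relu(values):
    """Activation function: negative values become 0."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the weights of output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hand-set (not trained) parameters that make the network compute XOR.
HIDDEN_W, HIDDEN_B = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
OUTPUT_W, OUTPUT_B = [[1.0, -2.0]], [0.0]

def forward(x):
    """Pass the input through the hidden layer, then through the output layer."""
    hidden = relu(dense(x, HIDDEN_W, HIDDEN_B))
    return dense(hidden, OUTPUT_W, OUTPUT_B)[0]

for pair in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(pair, forward(pair))
```

XOR cannot be computed by a single linear layer, so even this tiny example shows why stacking layers with a nonlinearity in between increases what a network can represent.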
Comparison of AI, ML, and DL
| Aspect | Artificial Intelligence (AI) | Machine Learning (ML) | Deep Learning (DL) |
|---|---|---|---|
| Definition | AI refers to the broader concept of creating machines that can perform tasks requiring human-like intelligence. | ML involves the development of algorithms that enable computers to learn from data and improve over time. | DL is a subset of ML that uses artificial neural networks with multiple layers (deep architectures) to learn representations of data. |
| Approach | Mimics human intelligence and behavior to perform tasks. | Learns patterns from data and makes predictions or decisions. | Learns representations of data through hierarchical layers of abstraction. |
| Examples | Virtual assistants (e.g., Siri, Alexa), autonomous vehicles, game-playing AI. | Spam filters, recommendation systems, image recognition. | Image and speech recognition, natural language processing, autonomous driving. |
| Data Size | Can handle both small and large datasets. | Can handle both small and large datasets. | Particularly effective with large volumes of data. |
| Complexity | Can be complex and may involve various approaches, including ML and DL. | Can range from simple linear models to complex deep neural networks. | Utilizes complex neural network architectures with multiple layers. |
| Interpretability | May lack interpretability due to the complexity of AI systems. | Depends on the complexity of the ML model; simpler models may be more interpretable. | Often considered less interpretable due to the hierarchical nature of deep neural networks. |
| Training Time | Can vary widely depending on the complexity of the AI system. | Training time depends on the complexity of the ML model and the size of the dataset. | Can be time-consuming, especially with large datasets and complex architectures. |
| Hardware | Can run on various hardware platforms, including CPUs and GPUs. | Can run on CPUs and GPUs, with specialized hardware (e.g., TPUs) available for ML tasks. | Often requires GPUs or specialized hardware accelerators for training and inference. |
| Applications | Wide range of applications across industries, including healthcare, finance, and gaming. | Numerous applications in fields such as healthcare, finance, e-commerce, and more. | Dominates fields such as computer vision, natural language processing, and speech recognition. |
WELL-POSED LEARNING PROBLEM
 A well-posed learning problem is a problem whose solution exists, is unique, and depends
continuously on the data, so that it is not sensitive to small changes in the data.
 It is formally defined (following Tom Mitchell) as: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E."
 The three essential components of a well-posed learning problem:
1. Task (T): The specific problem or task that the model is intended to solve.
2. Performance Measure (P): The metric used to evaluate the model's performance.
3. Experience (E): The data used to train and improve the model.
Criteria for a Well-Posed Learning Problem
1. Well-Defined Objective: The problem should have a clear and specific goal.
2. Relevant and Sufficient Data: The data should be relevant to the problem and sufficient
in quantity and quality to train the model effectively.
3. Measurable Performance: There must be a way to measure the performance of the model,
such as accuracy, precision, recall, F1 score, mean squared error, etc.
4. Feasibility and Practicality: The problem should be practically solvable given the current
technology, data availability, and resource constraints.
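The performance measures named in criterion 3 can be computed directly from their standard definitions. The sketch below uses hypothetical label lists and function names; it assumes binary classification with 1 as the positive class.

```python
def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall, and F1 score for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def mean_squared_error(actual, predicted):
    """Average squared difference, used for regression tasks."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical labels: 1 = spam, 0 = not spam.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))
```

Which measure is appropriate depends on the task: accuracy can be misleading when one class is rare, which is why precision and recall are often reported alongside it.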
Examples of Well-Posed Learning Problems:
1. Learning to Play Checkers:
 Task: Play the checkers game.
 Performance Measure: Percentage of games won against the opponent.
 Experience: Playing practice games against itself.
2. Handwriting Recognition:
 Task: Recognizing and classifying handwritten words from images.
 Performance Measure: Percentage of correctly identified words.
 Experience: A set of handwritten words with their classifications in a database.
3. Robot Driving:
 Task: Driving on public four-lane highways using vision sensors.
 Performance Measure: Average distance traveled before an error.
 Experience: A sequence of images and steering commands recorded while
observing a human driver.
4. Spam Filtering:
 Task: Identifying whether or not an email is spam.
 Performance Measure: Percentage of emails correctly categorized as spam or
nonspam.
 Experience: Observing how you categorize emails as spam or nonspam.
5. Face Recognition:
 Task: Recognizing and distinguishing different faces.
 Performance Measure: Number of different faces correctly recognized.
 Experience: Training the system on as many varied datasets of facial photos as
possible.
DESIGNING A LEARNING SYSTEM
The basic design issues and approaches to machine learning are illustrated by designing a program
to learn to play checkers, with the goal of entering it in the world checkers tournament. The design
mainly involves the following five steps.
1. Choosing the Training Experience
a) Type of Feedback: Direct vs. Indirect
b) Degree of Control over Training Sequence
c) Representation of Example Distribution
2. Choosing the Target Function
a) Linear Function
b) Neural Networks
c) Decision Trees
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
a) Estimating training values
b) Adjusting the weights
5. The Final Design
1. Choosing the Training Experience
 The training experience defines the data or experiences the machine learning algorithm
will use to learn.
 The training data must reflect the overall characteristics of the dataset to ensure the
algorithm performs well in real-world scenarios. To select the optimal training experience,
consider these three key attributes:
a) Type of Feedback: Direct vs. Indirect
The training experience can provide either direct feedback on each individual choice the learner
makes, or only indirect feedback on the outcome of a whole sequence of choices.
 Direct Feedback: The training experience provides immediate feedback on each choice
made by the algorithm. For example, in a game, the algorithm gets feedback on every move
it makes.
Let f(t) represent the feedback function, where t is the time step or iteration.
$$f(t) = r(t)$$
where r(t) is the reward or feedback received at time t.
$$\mathrm{Reward}(s, a) = \begin{cases} +1 & \text{if the move } a \text{ leads to a win} \\ -1 & \text{if the move } a \text{ leads to a loss} \\ 0 & \text{otherwise} \end{cases}$$
Here, s represents the state (board configuration), and a represents the action (move).
 Indirect Feedback: The training experience provides feedback after a sequence of actions,
indicating the final outcome rather than the quality of each individual move. This is
common in scenarios where the algorithm needs to learn from the consequences of a series
of decisions, such as in strategic games or long-term planning tasks.
$$f(t) = R(T)$$
where R(T) is the cumulative reward or feedback received at the end of a sequence of
actions at time T.
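The difference between the two feedback types can be sketched in code. The outcome strings ("win", "loss", "neutral") and the function names are hypothetical; the reward values follow the Reward(s, a) definition above, with the state and action replaced by the outcome they lead to.

```python
def move_reward(outcome):
    """Reward(s, a): +1 if the move leads to a win, -1 for a loss, 0 otherwise."""
    return {"win": 1, "loss": -1}.get(outcome, 0)

def direct_feedback(move_outcomes):
    """Direct feedback f(t) = r(t): one reward for every move."""
    return [move_reward(outcome) for outcome in move_outcomes]

def indirect_feedback(move_outcomes):
    """Indirect feedback f(t) = R(T): a single cumulative value at the end."""
    return sum(move_reward(outcome) for outcome in move_outcomes)

game = ["neutral", "neutral", "neutral", "win"]  # hypothetical move outcomes
print(direct_feedback(game))    # one value per move
print(indirect_feedback(game))  # one value for the whole game
```

With indirect feedback the learner sees only the final value, so it must work out for itself which of the earlier moves deserve credit for the result.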
b) Degree of Control over Training Sequence:
 Teacher-Driven: The teacher (or supervisor) selects the training examples, providing
informative states and correct actions. This approach is structured but limits the learner's
ability to explore and understand the problem space independently.
Let S(t) represent the state at time t, and A(t) represent the action taken at time t.
$$A(t) = \mathrm{Teacher}(S(t))$$
where Teacher is a function provided by the teacher selecting the action based on the
state.
 Learner-Driven: The learner selects the training examples by identifying challenging or
confusing states and requesting guidance from the teacher. This method promotes active
learning and helps the learner focus on areas where it needs the most improvement.
$$A(t) = \mathrm{Learner}(S(t)) + \lambda \times \mathrm{Teacher}(S(t))$$
where Learner(S(t)) is the action proposed by the learner, and λ is a mixing parameter
indicating the reliance on teacher feedback.
 Self-Learning: The learner has complete control over the training process, generating its
own examples and learning from them without external guidance. This method, often used
in reinforcement learning, allows the learner to explore a wide range of scenarios but
requires robust mechanisms to avoid overfitting and ensure generalization.
$A(t) = \text{Learner}(S(t))$
where the learner fully controls the action selection without external guidance.
c) Representation of Example Distribution:
 The training data should cover a diverse range of examples that reflect the distribution of
scenarios the algorithm will encounter in real-world use.
 If the training examples are biased or not representative of the overall data set, the
algorithm may overfit or generalize poorly.
 Example of bias: training only on games that the algorithm wins does not account for
challenging scenarios.
 Bias can be mitigated by ensuring the training set includes examples from both wins and
losses, various board states, and diverse opponent strategies.
$\text{TrainingSet} = \{(s, a, r) \mid s \in \text{AllStates},\ a \in \text{AllActions},\ r \in \text{AllRewards}\}$
 A diverse training set helps the algorithm generalize better, improving its performance
across different situations. Ensuring the training experience encompasses varied examples
is crucial for achieving robust and reliable performance.
When designing a checkers-playing program, it's essential to carefully select the training
experience to ensure that the algorithm learns effectively and generalizes well to new games.
Considering the type of feedback, the degree of control over training examples, and the distribution
of examples will significantly impact the success of the algorithm.
2. Choosing the Target Function
The target function represents the goal of the learning process, mapping from the current state of
the game to the desired outcome.
NextMove Function: The target function f could predict the value of making a particular move in
a given state. It can be represented as:
$f(s) = \max_{a \in \text{LegalMoves}(s)} V(s, a)$
where s is the current board state, a is a possible move, and V(s,a) is the value of taking action a
in state s.
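As a rough illustration of this target function, the sketch below picks the legal move with the highest value. The names best_move, legal_moves and value, and the toy move table, are hypothetical stand-ins; a real checkers program would supply its own move generator and value estimate.

```python
from typing import Callable, Hashable, Iterable, Tuple

State = Hashable
Move = Hashable


def best_move(
    state: State,
    legal_moves: Callable[[State], Iterable[Move]],
    value: Callable[[State, Move], float],
) -> Tuple[Move, float]:
    """Return the move a maximizing V(s, a) over LegalMoves(s), and that maximum f(s)."""
    moves = list(legal_moves(state))
    if not moves:
        raise ValueError("no legal moves in this state")
    chosen = max(moves, key=lambda a: value(state, a))
    return chosen, value(state, chosen)


# Toy stand-ins (assumed for illustration): three moves with fixed values.
toy_values = {"advance": 0.2, "capture": 0.9, "retreat": -0.4}
move, f_s = best_move("start", lambda s: toy_values.keys(), lambda s, a: toy_values[a])
```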
3. Choosing a Representation for the Target Function
The representation defines how the target function will be modeled, which can impact the learning
process's complexity and efficiency.
a) Linear Function: A simple linear combination of features.
$V(s, a) = \sum_{i} w_i f_i(s, a)$
where f_i(s, a) are the features and w_i are the weights.
b) Neural Networks: More complex and capable of capturing non-linear relationships.
$V(s, a) = \text{NN}(s, a; \theta)$
where θ are the parameters of the neural network.
c) Decision Trees: Hierarchical model splitting decisions based on feature values.
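The linear representation is the simplest of the three to write down. The sketch below evaluates a weighted sum of features; the three features (own pieces, opponent pieces, own kings) and the weights are assumed purely for illustration.

```python
from typing import Sequence


def linear_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """V(s, a) as the weighted sum of feature values f_i(s, a)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical features of a position reached by some move:
# [own pieces, opponent pieces, own kings]
features = [12, 10, 2]
weights = [1.0, -1.0, 0.5]
v = linear_value(weights, features)
```

A neural network or decision tree would replace linear_value with a more expressive function of the same features, at the cost of needing more training data.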
4. Choosing a Function Approximation Algorithm
The function approximation algorithm determines how the target function will be learned from the
training data.
a) Estimating Training Values
To train the model, we need to estimate the value of different moves. This can be done using
techniques like:
 Monte Carlo Simulation: Running simulations to estimate the value of each move based
on the outcomes of simulated games.
$V(s, a) = \frac{1}{N} \sum_{i=1}^{N} \text{Outcome}(G_i)$
where G_i is the i-th simulated game starting from state s after move a.
 Temporal Difference Learning: Updating estimates based on the difference between
successive state values.
$V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)$
where α is the learning rate, γ is the discount factor, r is the reward, and s′ is the next
state.
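Both estimation techniques can be sketched briefly. In the code below, simulate_game is an assumed callable that plays one game to the end and returns its outcome, and the value table V is a plain dictionary; neither is prescribed by the notes.

```python
from typing import Callable, Dict, Hashable


def monte_carlo_value(simulate_game: Callable[[], float], n: int) -> float:
    """Estimate V(s, a) as the average outcome of n simulated games."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(simulate_game() for _ in range(n)) / n


def td_update(
    V: Dict[Hashable, float],
    s: Hashable,
    r: float,
    s_next: Hashable,
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Apply V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) in place."""
    v_s = V.get(s, 0.0)  # unseen states default to a value of 0
    V[s] = v_s + alpha * (r + gamma * V.get(s_next, 0.0) - v_s)
    return V[s]


# Monte Carlo with four pre-recorded outcomes standing in for simulated games.
outcomes = iter([1, -1, 1, 1])
mc_estimate = monte_carlo_value(lambda: next(outcomes), 4)

# One temporal-difference step between two states with toy values.
V = {"s": 0.0, "s_next": 1.0}
new_value = td_update(V, "s", r=0.0, s_next="s_next", alpha=0.5, gamma=1.0)
```

Monte Carlo waits for complete games before updating, while the temporal-difference rule updates after every move using the current estimate of the next state.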
b) Adjusting the Weights
Using a learning algorithm to adjust the weights of the representation based on the estimated
values.
 Gradient Descent: For a neural network, weights are adjusted to minimize the error in the
estimated values.
$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta)$
where η is the learning rate, and L(θ) is the loss function.
 Least Squares Method: For linear functions, weights can be adjusted using the least
squares method to fit the function to the training data.
The sum of squared errors (SSE) is given by:
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$SSE = \sum_{i=1}^{n} \left( y_i - (w^{\top} x_i + b) \right)^2$
where:
 w=[w1,w2,…,wn]⊤ is the weight vector,
 b is the bias term,
 x_i is the feature vector of the i-th training example,
 ŷ_i is the predicted value,
 y_i is the corresponding target value.
By following these steps, a machine learning-based checkers program can be developed,
optimized, and made ready for competitive play.
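To connect the two weight-adjustment ideas, the sketch below uses gradient descent to minimize the sum of squared errors of a linear model. For brevity it assumes a single scalar feature x (so w is one number rather than a vector), and the data points, learning rate and iteration count are made up for illustration.

```python
from typing import Sequence, Tuple


def sse(w: float, b: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sum of squared errors of the model y_hat = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))


def gradient_step(
    w: float, b: float, xs: Sequence[float], ys: Sequence[float], eta: float
) -> Tuple[float, float]:
    """One update theta <- theta - eta * gradient of SSE, with theta = (w, b)."""
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = -2 * sum(x * r for x, r in zip(xs, residuals))  # d SSE / d w
    grad_b = -2 * sum(residuals)                             # d SSE / d b
    return w - eta * grad_w, b - eta * grad_b


# Toy data generated from y = 2x + 1, so the best fit is w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
initial_error = sse(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, eta=0.01)
final_error = sse(w, b, xs, ys)
```

For a linear model the same minimum can also be found in closed form by least squares; gradient descent is shown here because it carries over to representations such as neural networks.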
5. The Final Design
 The final design of a checkers learning system can be described by four distinct program
modules that represent the central components in many learning systems. These modules
work together to facilitate the learning and improvement of the system over time through
a series of iterations involving performance, critique, generalization, and experimentation.
 The final design integrates all components, resulting in a checkers-playing program
capable of competing in the world checkers tournament.
Overall Workflow:
1. The Performance System plays a new game of checkers and records the game history.
2. The Critic analyzes this game history to generate training examples.
3. The Generalizer uses these training examples to update its hypothesis about the best
moves in checkers.
4. The Experiment Generator uses the updated hypothesis to select a new initial board state
for the next game, and the cycle repeats.
Through this iterative process, the system continuously improves its ability to play checkers by
learning from each game played, evaluating its performance, generalizing from its experiences,
and exploring new game scenarios.
PERSPECTIVE AND ISSUES IN MACHINE LEARNING
Machine learning encompasses various perspectives, from supervised learning's reliance on
labeled data to reinforcement learning's dynamic environment interactions, yet faces challenges
such as data bias and interpretability concerns.
Perspective in Machine Learning:
1. Data-Centric Perspective:
 Machine learning focuses on leveraging data to extract meaningful patterns,
insights, and knowledge.
 It emphasizes the importance of data quality, quantity, and relevance in training
accurate models.
2. Model-Centric Perspective:
 Machine learning involves designing and developing models that can learn from
data and make predictions or decisions.
 Models can range from simple linear models to complex deep neural networks,
and their selection depends on the problem and data characteristics.
3. Algorithmic Perspective:
 Machine learning encompasses various algorithms and techniques that enable
models to learn from data.
 These include supervised learning, unsupervised learning, reinforcement learning,
and deep learning, among others.
Issues in Machine Learning
1. Data Quality and Quantity:
o Issue: Insufficient or poor-quality data can lead to inaccurate models and biased
results.
o Solution: Collecting more high-quality data, preprocessing data to handle missing
values and outliers, and ensuring data is representative of the problem domain.
2. Overfitting and Underfitting:
o Issue: Overfitting occurs when a model learns the training data too well but fails to
generalize to new, unseen data. Underfitting happens when the model is too simple
to capture the underlying structure of the data.
o Solution: Regularization techniques, cross-validation, and adjusting model
complexity can help mitigate overfitting and underfitting.
3. Interpretability and Explainability:
o Issue: Complex machine learning models often lack interpretability, making it
challenging to understand and trust their decisions, especially in critical
applications like healthcare or finance.
o Solution: Using simpler, more interpretable models when possible, or employing
techniques such as feature importance analysis and model explanation methods.
4. Bias and Fairness:
o Issue: Models can inadvertently learn and perpetuate biases present in the training
data, leading to unfair or discriminatory outcomes.
o Solution: Careful selection and preprocessing of training data, fairness-aware
algorithms, and post-processing techniques to mitigate bias.
5. Computational Resources:
o Issue: Training and deploying complex machine learning models can require
significant computational resources, including processing power and memory.
o Solution: Optimizing algorithms and model architectures, utilizing distributed
computing frameworks, and leveraging cloud computing resources.
6. Privacy and Security:
o Issue: Machine learning models trained on sensitive data may inadvertently leak
private information or be vulnerable to adversarial attacks.
o Solution: Implementing privacy-preserving techniques such as differential privacy,
federated learning, and robust model training against adversarial attacks.
7. Ethical Considerations:
o Issue: Machine learning applications raise ethical concerns regarding issues like
data privacy, consent, transparency, and potential societal impacts.
o Solution: Adhering to ethical guidelines and regulations, fostering interdisciplinary
collaboration, and engaging in transparent communication with stakeholders.
Addressing these issues requires a combination of technical expertise, ethical considerations, and
interdisciplinary collaboration to ensure responsible and effective deployment of machine learning
systems.
CONCEPT LEARNING
Concept learning is a key task in machine learning, aimed at discovering general patterns or
concepts from labeled examples.
Definition: Concept learning is the task of inferring a Boolean-valued function from training
examples of its input and output.
It involves the following steps and objectives:
1. Inference of Hypotheses: The process starts by inferring a hypothesis that accurately
describes the target concept based on observed instances. For example, understanding what
a "bird" is by analyzing various examples of birds and identifying their common
characteristics.
2. Generalization: The goal is to derive a general rule or concept from specific examples.
This allows the model to generalize beyond the training data, making accurate predictions
on new, unseen instances.
3. Pattern Recognition and Classification: Concept learning is crucial for tasks such as
classification and pattern recognition. By identifying the underlying rules or patterns that
define a concept, systems can make predictions or decisions based on the learned
knowledge.
Concept learning is studied here from two viewpoints:
i) Concept Learning Task
ii) Concept Learning as Search
Types of Concept Learning
i) Concept Learning Task
Definition: A concept learning task involves identifying a general rule (or concept) from specific
examples, allowing for the classification of new, unseen examples. The concept learning task
typically involves the following components:
1. Instance Space (X):
 The instance space refers to the set of all possible instances or examples that can be
observed or encountered in the domain of interest.
$X = \{x_1, x_2, \ldots, x_n\}$
 Each instance x in X is described by a vector of attribute values.
 For example, x = (Sunny, Warm, Normal, Strong, Warm, Same).
2. Hypothesis Space (H):
 The hypothesis space represents the set of possible hypotheses or concept descriptions that
can be considered during the concept learning process.
$h: X \to \{0, 1\}$
 Each hypothesis is a potential concept description that can classify instances into positive
or negative examples of the target concept.
 For example, h(x)=1 if the hypothesis predicts that Aldo enjoys the sport on day x, and
h(x)=0 otherwise.
 For example, some hypotheses in the hypothesis space could be:
If Sky = Sunny and AirTemp = Warm, then EnjoySport = Yes
If Humidity = High and Water = Warm, then EnjoySport = No
3. Training Examples (D):
 The training examples are the provided instances along with their corresponding class
labels (EnjoySport).
$D = \{(x_1, c(x_1)), (x_2, c(x_2)), \ldots, (x_n, c(x_n))\}$
 Each training example consists of attribute values and the target concept's class label (Yes
or No).
4. Target Concept (c):
 The target concept is the concept or category (Yes or No) that we want to
learn from the training examples.
$c: X \to \{0, 1\}$
 For example, c(x) = 1 (positive example) if Aldo enjoys the sport on day x, and c(x) = 0
(negative example) otherwise.
 Each hypothesis h in H represents a Boolean-valued function defined over X:
$h: X \to \{0, 1\}$
 The goal of the learner is to find a hypothesis h such that:
$\forall x \in X,\ h(x) = c(x)$
• The aim of concept learning is to infer a concept description or hypothesis that accurately
predicts the EnjoySport label for new, unseen instances based on the provided training
examples.
Example:
 Let's consider learning the target concept "Days on which Aldo enjoys his favorite water
sport."
 We have a table of data with various attributes (like Sky, AirTemp, Humidity, Wind,
Water, Forecast) and whether Aldo enjoyed the sport (EnjoySport) on those days.
Training Data: (EnjoySport Dataset)
Task: The goal is to learn a rule (hypothesis) that can predict the value of EnjoySport for any new
day based on the values of its other attributes (Sky, AirTemp, Humidity, Wind, Water, and
Forecast).
Hypothesis Representation:
 General Hypothesis: A rule that applies to many instances (e.g., Aldo enjoys the sport on
any day).
 Specific Hypothesis: A rule that applies to very specific instances (e.g., Aldo enjoys the
sport only on Sunny and Warm days).
Each hypothesis can be represented as a vector of constraints on the attributes:
 "?" means any value is acceptable.
 A specific value (e.g., "Warm") means only that value is acceptable.
 "Φ" means no value is acceptable.
Examples:
 Hypothesis for enjoying the sport on cold days with high humidity: h=(?,Cold,High,?,?,?)
 The most general hypothesis (every day is positive): h=(?,?,?,?,?,?)
 The most specific hypothesis (no day is positive): h=(Φ,Φ,Φ,Φ,Φ,Φ)
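The constraint-vector representation translates directly into code. The sketch below is one possible encoding, with hypotheses and instances as tuples of strings; the function name covers and the constants ANY and NONE are assumptions made for this example.

```python
from typing import Sequence

ANY = "?"        # any value is acceptable
NONE = "\u03a6"  # the symbol Φ: no value is acceptable


def covers(h: Sequence[str], x: Sequence[str]) -> bool:
    """Return True when instance x satisfies every constraint in h, i.e. h(x) = 1.

    A NONE constraint never equals an attribute value, so any hypothesis
    containing it classifies every instance as negative.
    """
    return all(c == ANY or c == v for c, v in zip(h, x))


x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")

cold_and_humid = (ANY, "Cold", "High", ANY, ANY, ANY)
most_general = (ANY,) * 6
most_specific = (NONE,) * 6

results = [covers(h, x) for h in (cold_and_humid, most_general, most_specific)]
```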
The Inductive Learning Hypothesis
 Inductive learning, also known as inductive reasoning or inductive inference, is a type of
learning that involves generalizing from specific instances to form general rules or
concepts.
 It is a fundamental process used by humans and machines to acquire knowledge and make
predictions based on observed examples.
 Any hypothesis found to approximate the target function well over a sufficiently large set
of training examples will also approximate the target function well over other unobserved
examples.
If h(x) ≈ c(x) for all x in the training set D, we say that h approximates c well.
Example:
In simpler terms, if a rule (hypothesis) works well for the examples we've seen, it should also work
well for new examples we haven't seen yet, provided we've seen enough examples to make this
judgment.
Example:
Suppose we have the following training data for the target concept "Days on which Aldo enjoys
his favorite water sport":
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Hypothesis h: Aldo enjoys the sport on sunny and warm days, represented as
h = (Sunny, Warm, ?, ?, ?, ?)
 If our hypothesis h correctly predicts the EnjoySport value for all days in the training set
D, and D is large and diverse enough, the inductive learning hypothesis suggests h will
likely also predict well for new, unseen days.
 Hence, the inductive learning hypothesis is what justifies expecting a hypothesis that
performs well on a large training set to generalize to other examples. It is an assumption
underlying inductive learning rather than a guarantee.
ii) Concept Learning as Search
 Concept learning can be viewed as a search through a space of possible hypotheses to find
the one that best matches the training examples.
 The search process involves exploring the hypothesis space to find a hypothesis that
minimizes the errors or inconsistencies between the predicted labels and the true labels.
 Find a hypothesis that best fits training examples
 Efficient search in hypothesis space (finite/infinite)
Example:
Consider the instances X and hypotheses H in the EnjoySport learning task
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Weak Warm Same Yes
Cloudy Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Search Space for Hypotheses
 Instance Space: The total number of possible combinations of attribute values.
Sky can take three values, and each of the other five attributes can take two:
Total instances = 3×2×2×2×2×2 = 96 distinct instances
 Syntactically Distinct Hypotheses (including ? and Φ): Counts all possible combinations
of attribute constraints, including the "don't care" (?) and "always false" (Φ) symbols.
For each attribute with n possible values, there are n+2 choices (its n values plus "?" and "Φ"):
Total syntactically distinct hypotheses H = 5×4×4×4×4×4 = 5120
 Semantically Distinct Hypotheses (excluding redundant ones with Φ): Counts only
hypotheses that classify instances differently from one another.
Every hypothesis containing one or more "Φ" symbols classifies every instance as negative,
so all of them are equivalent and are counted as a single hypothesis.
For the remaining hypotheses, each attribute with n possible values has n+1 choices (its n
values plus "?"):
Semantically distinct hypotheses H = 1 + (4×3×3×3×3×3) = 1 + 972 = 973
In the context of Concept Learning as Search, the task is to navigate through this large hypothesis
space to find a hypothesis that best matches the training examples and generalizes well to new
instances.
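The three counts can be reproduced with a short calculation. The sketch assumes, as in the arithmetic above, that Sky has three possible values and every other attribute has two; the variable names are illustrative.

```python
from math import prod

# Number of possible values per attribute (assumed from the counts above).
values_per_attribute = {
    "Sky": 3, "AirTemp": 2, "Humidity": 2, "Wind": 2, "Water": 2, "Forecast": 2,
}
counts = list(values_per_attribute.values())

# Every instance picks one concrete value per attribute.
instances = prod(counts)

# Syntactically, each attribute may also be "?" or "Φ": n + 2 choices.
syntactic_hypotheses = prod(n + 2 for n in counts)

# Semantically, all hypotheses containing "Φ" collapse into one;
# the rest use a value or "?": n + 1 choices per attribute.
semantic_hypotheses = 1 + prod(n + 1 for n in counts)

print(instances, syntactic_hypotheses, semantic_hypotheses)  # 96 5120 973
```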
General-to-Specific Ordering of Hypotheses
 General-to-specific ordering of hypotheses is a method of organizing hypotheses in a way that
progresses from broader, more general statements to narrower, more specific ones.
 This ordering helps in systematically exploring and narrowing down potential explanations or
predictions.
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
 A more general hypothesis covers a broader range of instances (e.g., "Aldo enjoys the
sport on any day").
 A more specific hypothesis covers a narrower range of instances (e.g., "Aldo enjoys
the sport only on Sunny, Warm, and Windy days").
 We can order hypotheses from general to specific based on their constraints.
Example:
Consider two hypotheses:
 h1 = (Sunny, ?, ?, Strong, ?, ?) (Sunny days with Strong wind) [more specific]
 h2 = (Sunny, ?, ?, ?, ?, ?) (Any Sunny day) [more general]
Since h2 imposes fewer constraints, it classifies more instances as positive and is more general
than h1.
General-to-Specific Ordering:
 General Hypothesis: h2 (more general, covers more instances).
 Specific Hypothesis: h1 (more specific, covers fewer instances).
Definition: Hypothesis hj is more-general-than-or-equal-to hypothesis hk if every instance
satisfying hk also satisfies hj:
$(\forall x \in X)\,[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]$
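For conjunctive hypotheses of this form, the ordering can be checked attribute by attribute instead of enumerating every instance. The sketch below is one way to do that; the function name more_general_or_equal is an assumption, and it relies on every attribute having at least one possible value.

```python
from typing import Sequence

ANY = "?"
NONE = "\u03a6"  # the symbol Φ


def more_general_or_equal(hj: Sequence[str], hk: Sequence[str]) -> bool:
    """Return True when every instance satisfying hk also satisfies hj."""
    # A hypothesis containing Φ is satisfied by no instance, so the
    # condition holds vacuously for any hj.
    if NONE in hk:
        return True
    # Otherwise each constraint of hj must be "?" or identical to hk's.
    return all(cj == ANY or cj == ck for cj, ck in zip(hj, hk))


h1 = ("Sunny", ANY, ANY, "Strong", ANY, ANY)
h2 = ("Sunny", ANY, ANY, ANY, ANY, ANY)

h2_over_h1 = more_general_or_equal(h2, h1)
h1_over_h2 = more_general_or_equal(h1, h2)
```

The relation is only a partial order: two hypotheses such as (Sunny, ?, ...) and (?, Warm, ...) are not comparable in either direction.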
Find-S (Find-Specific) Algorithm: Finding a Maximally Specific Hypothesis
 The Find-S (Find-Specific) algorithm is a simple supervised machine learning algorithm
used for finding the most specific hypothesis that fits all the positive examples in a given
dataset.
 It starts with the most specific hypothesis and generalizes it by incorporating positive
examples, while ignoring negative examples during the learning process.
 The algorithm represents the hypothesis using a vector of attribute constraints. The most
specific hypothesis is represented as {φ, φ, φ, ..., φ}, where φ means no value is acceptable
for that attribute.
 The most general hypothesis is represented as {?, ?, ?, ..., ?}, where ? means any value is
acceptable for that attribute
FIND-S Algorithm
1. Initialize the hypothesis h to the most specific hypothesis possible.
2. For each positive training example x:
 For each attribute constraint ai in h:
 If ai is satisfied by x, do nothing
 Else, replace ai in h with the next more general constraint that is satisfied
by x
3. Output the final hypothesis h
Illustrative Example 1:
To illustrate this algorithm, assume the learner is given the sequence of training examples
from the EnjoySport task
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Step-by-Step Execution of Find-S Algorithm:
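h0 = <Φ, Φ, Φ, Φ, Φ, Φ>
Example 1 (positive): h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
Example 2 (positive): h2 = <Sunny, Warm, ?, Strong, Warm, Same>
Example 3 (negative): ignored, so h3 = h2
Example 4 (positive): h4 = <Sunny, Warm, ?, Strong, ?, ?>
The final hypothesis h after processing all instances is <Sunny, Warm, ?, Strong, ?, ?>.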
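The Find-S steps listed earlier can be sketched in Python and checked against the four EnjoySport training examples. The tuple encoding and the function name find_s are choices made for this illustration.

```python
from typing import List, Sequence, Tuple

ANY = "?"
NONE = "\u03a6"  # the symbol Φ

Example = Tuple[Sequence[str], bool]


def find_s(examples: List[Example], n_attributes: int) -> Tuple[str, ...]:
    """Return the maximally specific conjunctive hypothesis covering all positives."""
    h = [NONE] * n_attributes          # start with the most specific hypothesis
    for x, is_positive in examples:
        if not is_positive:
            continue                   # negative examples are ignored
        for i, value in enumerate(x):
            if h[i] == NONE:
                h[i] = value           # first positive example: copy its value
            elif h[i] != value:
                h[i] = ANY             # conflicting values: generalize to "?"
    return tuple(h)


enjoy_sport = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

h = find_s(enjoy_sport, 6)
print(h)  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```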
Illustrative Example 2:
Example Color Hardness Smell Surface
1 GREEN HARD NO WRINKLED
2 GREEN HARD NO SMOOTH
3 GREEN SOFT YES WRINKLED
4 ORANGE HARD NO WRINKLED
5 GREEN SOFT YES SMOOTH
1. Initialize the hypothesis h to the most specific {φ, φ, φ, φ}.
2. Consider example 1: {GREEN, HARD, NO, WRINKLED}
 Since this is a positive example, we generalize the hypothesis to match it: h =
{GREEN, HARD, NO, WRINKLED}
3. Example 2 is negative, so we ignore it and h remains the same.
4. Example 3 is negative, so we ignore it and h remains the same.
5. Example 4: {ORANGE, HARD, NO, WRINKLED}
 We compare each attribute to h and replace any mismatches with ? to generalize:
h = {?, HARD, NO, WRINKLED}
6. Example 5: {GREEN, SOFT, YES, SMOOTH}
 Comparing to h, we replace mismatches with ?:
h = {?, ?, ?, ?}
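The final hypothesis h after processing all instances is h = {?, ?, ?, ?}. Note that this
hypothesis also classifies the negative examples 2 and 3 as positive, because Find-S never
checks its result against negative examples.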
Advantages of the FIND-S algorithm
 Simplicity: Easy to understand and implement, making it ideal for introducing machine
learning concepts.
 Efficiency: Computationally efficient for small to moderate-sized datasets, updating the
hypothesis with individual examples.
 Maximally Specific Hypothesis: Outputs the most specific hypothesis in H that covers all
positive examples. It is also consistent with the negative examples, provided the target
concept is in H and the training data are correct.
Limitations of the FIND-S algorithm
 Assumes noiseless data: Find-S assumes that all positive instances are correctly labeled
and there are no errors.
 Ignores negative instances: It only considers positive examples for generalization.
 Cannot handle inconsistent data: If there is noise or inconsistency in the data, Find-S
might not perform well.
Unanswered by FIND-S
 Has the learner converged to the correct target concept?
 Why prefer the most specific hypothesis?
 Are the training examples consistent?
 What if there are several maximally specific consistent hypotheses?
Exercise
Apply the Find-S algorithm to each of the following datasets to determine the most specific
hypothesis that fits all positive instances.
Instance Weather Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak Yes
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rainy Mild High Weak Yes
5 Rainy Cool Normal Weak Yes
Instance Outlook Temperature Humidity Wind PlayGolf
1 Sunny Hot High Weak Yes
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rainy Mild High Weak Yes
5 Rainy Cool Normal Weak Yes
6 Rainy Cool Normal Strong No
Instance Day Weather Temperature Humidity Wind Surfing
1 Weekday Sunny Warm High Strong Yes
2 Weekend Rainy Cold High Weak No
3 Weekend Sunny Warm Normal Strong Yes
4 Weekday Sunny Warm High Weak No
5 Weekday Rainy Warm Normal Strong No
6 Weekend Sunny Hot Normal Strong Yes
Instance Color Size Shape Texture PlayGame
1 Red Small Round Smooth Yes
2 Red Small Square Rough No
3 Blue Large Round Smooth Yes
4 Red Small Round Rough Yes
5 Blue Small Round Smooth Yes
Instance Sky AirTemp Humidity Wind Water Forecast GoHiking
1 Sunny Hot High Weak Warm Same Yes
2 Sunny Hot High Strong Cool Change No
3 Overcast Cool Normal Weak Warm Same Yes
4 Rainy Mild High Weak Warm Same Yes
5 Sunny Cool Normal Weak Warm Same Yes
6 Overcast Hot Normal Strong Cool Same Yes
Instance Color Size Shape Texture Fruit
1 Red Large Round Smooth Yes
2 Yellow Medium Oval Rough No
3 Red Small Round Smooth Yes
4 Green Large Oval Rough No
5 Red Large Round Rough Yes
Consistent Hypothesis and Version Space
a) Consistent Hypothesis
Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x)=c(x)
for each example (x, c(x)) in D.
$\text{Consistent}(h, D) \equiv (\forall \langle x, c(x) \rangle \in D)\ h(x) = c(x)$
Satisfies Hypothesis:
 An example x is said to satisfy hypothesis h when h(x)=1, regardless of whether x is a
positive or negative example of the target concept.
 An example x is consistent with hypothesis h iff h(x)=c(x).
Example  Citations  Size   In Library  Price       Editions  Buy
1        Some       Small  No          Affordable  One       No
2        Many       Big    No          Expensive   Many      Yes
h1 = (?, ?, No, ?, Many)
h2 = (?, ?, No, ?, ?)
Hypothesis  Example 1 Consistency  Example 2 Consistency  Consistent    Check
h1          Consistent             Consistent             Consistent    Yes (all examples match)
h2          Inconsistent           Consistent             Inconsistent  No (mismatch in Example 1)
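The consistency check in the table can be reproduced with a few lines of code. The sketch below assumes the second example's garbled values read "Many" and "Expensive", and the function names predict and consistent are illustrative.

```python
from typing import List, Sequence, Tuple

ANY = "?"


def predict(h: Sequence[str], x: Sequence[str]) -> bool:
    """h(x): True when instance x satisfies every constraint in h."""
    return all(c == ANY or c == v for c, v in zip(h, x))


def consistent(h: Sequence[str], D: List[Tuple[Sequence[str], bool]]) -> bool:
    """Consistent(h, D): h(x) equals c(x) for every training example in D."""
    return all(predict(h, x) == label for x, label in D)


# Attributes: Citations, Size, In Library, Price, Editions; label: Buy.
D = [
    (("Some", "Small", "No", "Affordable", "One"), False),
    (("Many", "Big", "No", "Expensive", "Many"), True),
]

h1 = (ANY, ANY, "No", ANY, "Many")
h2 = (ANY, ANY, "No", ANY, ANY)

h1_ok = consistent(h1, D)
h2_ok = consistent(h2, D)
```

Example 1 satisfies h2 even though its label is No, which is exactly the difference between satisfying a hypothesis and being consistent with it.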
b) Version Space
The version space VS(H,D) with respect to a hypothesis space H and a set of training examples D
is the subset of hypotheses from H that are consistent with the training examples in D.
$VS_{H,D} = \{h \in H \mid \text{Consistent}(h, D)\}$
Here:
 H is the hypothesis space, which is the set of all possible hypotheses that can be
formulated based on the given problem.
 D is the set of training examples, which consist of input-output pairs used to train the
model.
 A hypothesis h is said to be consistent with the training examples D if it correctly
classifies all the examples in D.
LIST-THEN-ELIMINATE (LTE) Algorithm
 The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all
hypotheses in H and then eliminates any hypothesis found inconsistent with any training
example.
 List-Then-Eliminate works in principle, so long as the hypothesis space H is finite.
 However, since it requires exhaustive enumeration of all hypotheses, it is not feasible in
practice for large hypothesis spaces.
Illustrative Example:
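One way to see the algorithm at work is to run it on the EnjoySport task, whose hypothesis space is small enough to enumerate. The sketch below assumes the attribute values that appear in the tables of these notes and uses the four training examples from the Find-S illustration; the function names are illustrative.

```python
from itertools import product

ANY = "?"
NONE = "\u03a6"  # the symbol Φ

# Possible values per attribute: Sky, AirTemp, Humidity, Wind, Water, Forecast.
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),
    ("Warm", "Cold"),
    ("Normal", "High"),
    ("Strong", "Weak"),
    ("Warm", "Cool"),
    ("Same", "Change"),
]

TRAINING = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def covers(h, x):
    """h(x): True when x satisfies every constraint in h."""
    return all(c == ANY or c == v for c, v in zip(h, x))


def all_hypotheses(domains):
    """Yield the semantically distinct hypotheses: one all-Φ plus value-or-? vectors."""
    yield tuple(NONE for _ in domains)
    yield from product(*[values + (ANY,) for values in domains])


def list_then_eliminate(domains, examples):
    """Start from every hypothesis and drop those inconsistent with any example."""
    version_space = list(all_hypotheses(domains))
    for x, label in examples:
        version_space = [h for h in version_space if covers(h, x) == label]
    return version_space


version_space = list_then_eliminate(DOMAINS, TRAINING)
for h in version_space:
    print(h)
```

Only 973 hypotheses are enumerated here, so the run is instant; the same approach stops being practical once the number of attributes or values grows.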
Advantages of LTE
 Simplicity: The algorithm is conceptually simple and easy to understand. It directly
maintains and updates the version space by removing inconsistent hypotheses.
 Correctness: If the hypothesis space H is finite and contains a hypothesis consistent with
the training data, the algorithm is guaranteed to find it.
Disadvantages of LTE
 Infeasibility for Large Hypothesis Spaces: The primary drawback is that the hypothesis
space H can be extremely large, making it impractical to enumerate and store all hypotheses
explicitly.
o For example, if the hypothesis space contains 2^n hypotheses, where n is the
number of possible binary features, the algorithm becomes computationally
infeasible.
 Exhaustive Enumeration: The requirement to exhaustively enumerate all hypotheses
makes the algorithm inefficient for large or infinite hypothesis spaces.
A More Compact Representation for Version Spaces
The concept of representing version spaces through their most general and least general members
(the general boundary G and the specific boundary S) is an elegant and efficient way to manage
hypotheses in machine learning.
i) General Boundary (G):
 The set of maximally general hypotheses in the hypothesis space H that are consistent with
the training data D.
$$G \equiv \{\, g \in H \mid Consistent(g, D) \land (\lnot \exists g' \in H)[(g' >_g g) \land Consistent(g', D)] \,\}$$
 These hypotheses are as general as possible while still being consistent with the data. No
more general hypothesis exists in H that is also consistent with D.
ii) Specific Boundary (S):
 The set of minimally general (i.e., maximally specific) hypotheses in the hypothesis space
H that are consistent with the training data D.
$$S \equiv \{\, s \in H \mid Consistent(s, D) \land (\lnot \exists s' \in H)[(s >_g s') \land Consistent(s', D)] \,\}$$
 These hypotheses are as specific as possible while still being consistent with the data. No
more specific hypothesis exists in H that is also consistent with D.
Version Space Representation Theorem
The Version Space representation theorem states that the version space can be compactly
represented using the general boundary G and the specific boundary S:
$$VS_{H,D} = \{\, h \in H \mid (\exists s \in S)(\exists g \in G)(g \ge_g h \ge_g s) \,\}$$
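The theorem gives a direct membership test: a hypothesis belongs to the version space if it lies between some member of S and some member of G. The following is a minimal sketch for conjunctive hypotheses written as tuples, where '?' accepts any value and 'Ø' accepts none. The function names and the boundary sets are illustrative assumptions.

```python
def more_general_or_equal(h1, h2):
    """True if h1 >=_g h2 for conjunctive hypotheses ('?' = any value, 'Ø' = no value)."""
    if "Ø" in h2:  # h2 covers nothing, so every hypothesis is at least as general
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def in_version_space(h, S, G):
    """Membership test from the theorem: some g in G and some s in S with g >= h >= s."""
    return (any(more_general_or_equal(g, h) for g in G)
            and any(more_general_or_equal(h, s) for s in S))

# Illustrative boundary sets (assumed for this sketch)
S = [("Sunny", "Warm", "?", "Strong", "?", "?")]
G = [("Sunny", "?", "?", "?", "?", "?"), ("?", "Warm", "?", "?", "?", "?")]

inside = in_version_space(("Sunny", "?", "?", "Strong", "?", "?"), S, G)
outside = in_version_space(("?", "?", "?", "Strong", "?", "?"), S, G)
print(inside, outside)
```

The second hypothesis is rejected because no member of G is at least as general as it, so it lies above the general boundary.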
Implications and Advantages
 Compact Representation: Instead of storing all hypotheses in the version space, we only
need to store the boundaries G and S. This significantly reduces memory
requirements.
 Efficient Updates: Updating the version space with new training examples involves
adjusting G and S, which is typically more efficient than handling the entire set of
consistent hypotheses.
 Boundary Manipulation:
o Adding a Positive Example: When a new positive example is encountered, we
need to generalize the specific boundary S (make it less specific) and ensure the
general boundary G remains consistent.
o Adding a Negative Example: When a new negative example is encountered, we
need to specialize the general boundary G (make it less general) and ensure the
specific boundary S remains consistent.
CANDIDATE-ELIMINATION (CE) Algorithm
 The CANDIDATE-ELIMINATION algorithm operates similarly to the LIST-THEN-
ELIMINATE algorithm but uses a more compact representation of the version space.
 It represents the version space by its most General (G) and Specific (S) boundaries.
 These boundaries form general and specific boundary sets, which delimit the version space
within the partially ordered hypothesis space.
 The key idea is to output a description of all hypotheses consistent with the training
examples.
 The algorithm incrementally builds the version space given a hypothesis space H and a set
of examples.
 Examples are added one by one, each potentially shrinking the version space by removing
inconsistent hypotheses.
 The algorithm updates the general and specific boundaries with each new example.
 It is an extended form of the Find-S algorithm and the LIST-THEN-ELIMINATE
algorithm.
 The algorithm considers both positive and negative examples:
 Positive examples generalize the specific hypothesis.
 Negative examples make the general hypothesis more specific.
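Candidate-Elimination Algorithm
1. Initialization: Initialize both the specific and the general boundary.
The general boundary is set to the most general hypothesis (?), which accepts every instance.
G = {?, ?, ?, ?, ?, ...}
The specific boundary is set to the most specific hypothesis (ϕ), which accepts no instance.
S = {ϕ, ϕ, ϕ, ϕ, ϕ, ...}
2. Processing Training Examples: For each training example, the algorithm checks whether it is
positive or negative.
2.1 If the example is positive:
Remove from G any hypothesis that does not cover the example.
For each attribute of the hypothesis in S:
if attribute_value == hypothesis_value:
do nothing
else:
replace the hypothesis value with '?' (if it is still ϕ, replace it with the attribute value
instead); this minimally generalizes S
2.2 If the example is negative:
Remove from S any hypothesis that covers the example.
Replace each hypothesis in G that covers the example with its minimal specializations that
exclude the example and remain more general than some hypothesis in S.
3. Updating Version Space (G and S): After each training example, the version space delimited
by G and S no longer contains any hypothesis that is inconsistent with the example.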
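The steps above can be sketched as follows for conjunctive hypotheses. This is a simplified illustration and not a definitive implementation: the function names, the tuple encoding ('?' for any value, 'Ø' for no value), and the attribute domains of the EnjoySport task are assumptions made for this sketch.

```python
def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    if "Ø" in h2:
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def generalize(s, x):
    """Minimal generalization of s that covers the positive instance x."""
    return tuple(v if c == "Ø" else (c if c == v else "?") for c, v in zip(s, x))

def specialize(g, x, domains):
    """Minimal specializations of g that exclude the negative instance x."""
    out = []
    for i, c in enumerate(g):
        if c == "?":
            out += [g[:i] + (value,) + g[i + 1:] for value in domains[i] if value != x[i]]
    return out

def candidate_elimination(examples, domains):
    S = [("Ø",) * len(domains)]
    G = [("?",) * len(domains)]
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            S = [generalize(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                new_G += [h for h in specialize(g, x, domains)
                          if any(more_general_or_equal(h, s) for s in S)]
            new_G = list(dict.fromkeys(new_G))  # drop duplicates, keep order
            # keep only the maximally general members
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return S, G

# Assumed attribute domains: Sky, AirTemp, Humidity, Wind, Water, Forecast
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(EXAMPLES, DOMAINS)
print("S =", S)
print("G =", G)
```

Because S here always holds a single conjunctive hypothesis, the sketch omits the step that removes non-minimal members of S; a general implementation would need it.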
An Illustrative Example
Initialization
 Specific Hypothesis (S): S = [Ø, Ø, Ø, Ø, Ø, Ø]
 General Hypothesis (G): G = [[?, ?, ?, ?, ?, ?]]
Iteration through Examples
Example 1: [Sunny, Warm, Normal, Strong, Warm, Same, Yes] (+)
 Update S: S = [Sunny, Warm, Normal, Strong, Warm, Same]
 G remains unchanged: G = [[?, ?, ?, ?, ?, ?]]
Example 2: [Sunny, Warm, High, Strong, Warm, Same, Yes] (+)
 Update S:
 Compare each attribute: S[2] ≠ High → S[2] = ?
 S = [Sunny, Warm, ?, Strong, Warm, Same]
 G remains unchanged: G = [[?, ?, ?, ?, ?, ?]]
Example 3: [Rainy, Cold, High, Strong, Warm, Change, No] (−)
 Update G:
Current S = [Sunny, Warm, ?, Strong, Warm, Same]
 For each hypothesis in G: G = [[?, ?, ?, ?, ?, ?]]
Create new hypotheses by specializing one attribute at a time, using the value in S wherever it
differs from the negative example:
For attribute 0 (Sky): S[0] = Sunny, attributes[0] = Rainy
New hypothesis: [Sunny, ?, ?, ?, ?, ?]
For attribute 1 (AirTemp): S[1] = Warm, attributes[1] = Cold
New hypothesis: [?, Warm, ?, ?, ?, ?]
For attribute 2 (Humidity): do not specialize, since S[2] = ?
For attributes 3 and 4 (Wind, Water): do not specialize, since S agrees with the negative
example (Strong, Warm)
For attribute 5 (Forecast): S[5] = Same, attributes[5] = Change
New hypothesis: [?, ?, ?, ?, ?, Same]
G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?], [?, ?, ?, ?, ?, Same]]
 S remains unchanged, since it does not cover the negative example.
Example 4: [Sunny, Warm, High, Strong, Cool, Change, Yes] (+)
 Update S:
Compare each attribute:
S[4] ≠ Cool → S[4] = ?
S[5] ≠ Change → S[5] = ?
S = [Sunny, Warm, ?, Strong, ?, ?]
 Update G:
Filter out hypotheses inconsistent with the positive example:
[Sunny, ?, ?, ?, ?, ?] is consistent
[?, Warm, ?, ?, ?, ?] is consistent
[?, ?, ?, ?, ?, Same] is inconsistent (the example has Forecast = Change) and is removed
G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]]
Final Hypotheses
 Specific Hypothesis (S): S = [Sunny, Warm, ?, Strong, ?, ?]
 General Hypotheses (G): G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]]
Specific Hypothesis (S) represents the most specific generalization that covers all positive
examples.
General Hypotheses (G) represent the broadest set of hypotheses that exclude the negative
examples while being consistent with the positive examples.
Comparison of Find-S, List-Then-Eliminate, Candidate-Elimination
| Algorithm | Find-S | List-Then-Eliminate | Candidate-Elimination |
| Hypothesis Space | Specific Hypotheses | Version Spaces | Version Spaces |
| Search Strategy | Specific-to-general | Exhaustive enumeration of H | Bidirectional: S moves specific-to-general, G moves general-to-specific |
| Handling Overfitting | Prone to overfitting | Prone to overfitting | Handles overfitting |
| Iterative Process | Yes | No | Yes |
| Negative Examples Handling | Ignore | Eliminate Hypotheses | Refine Boundaries |
| Completeness | Not guaranteed | Guaranteed | Guaranteed |
| Complexity | O(1) per update step, i.e. a constant amount of time is required for each attribute comparison | O(n) | O(n^2) |
| Advantages | Efficient for small hypothesis spaces. Produces a single, consistent hypothesis. | Handles both positive and negative instances. Allows complex hypothesis spaces. | Handles both positive and negative instances. Can handle continuous-valued attributes. |
| Disadvantages | Prone to overgeneralization if negative instances are not considered. Limited to simple hypothesis spaces. | Can be computationally expensive for large hypothesis spaces. May generate redundant hypotheses. | Requires storing and manipulating large sets of hypotheses. Can be computationally expensive. |
Inductive Bias
 Bias: In the context of machine learning, bias refers to the error introduced by
approximating a real-world problem with a simplified model.
 Inductive bias refers to the assumptions a learning algorithm makes in order to generalize
from observed data to unseen instances.
Fundamental Questions for Inductive Inference
1. What if the target concept is not in the hypothesis space?
The algorithm cannot learn the target concept accurately.
2. Can we avoid this difficulty by using a hypothesis space that includes every possible
hypothesis?
In theory, yes, but it has practical limitations such as computational complexity and
overfitting.
3. How does the size of this hypothesis space influence the ability to generalize to
unobserved instances?
A larger hypothesis space can lead to overfitting, reducing the ability to generalize well to
new data.
4. How does the size of the hypothesis space influence the number of training examples
required?
A larger hypothesis space generally requires more training examples to accurately learn
the target concept without overfitting.
Biased Hypothesis Space: An Example
 Consider the "EnjoySport" example, where we want to predict whether a sport is enjoyable
based on certain weather conditions.
 If the hypothesis space is restricted to only conjunctions (e.g., "Sky = Sunny AND
Temperature = Warm"), it cannot represent disjunctions (e.g., "Sky = Sunny OR Sky =
Cloudy").
Example Training Data
 ⟨Sunny,Warm,Normal,Strong,Cool,Change⟩ Y
 ⟨Cloudy,Warm,Normal,Strong,Cool,Change⟩ Y
 ⟨Rainy,Warm,Normal,Strong,Cool,Change⟩ N
Using the Candidate Elimination algorithm with this restricted hypothesis space:
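1. After the first two positive examples, the specific hypothesis (S) is generalized to:
⟨?,Warm,Normal,Strong,Cool,Change⟩
2. Although this is the most specific conjunctive hypothesis consistent with both positive
examples, it is already overly general: it incorrectly covers the third (negative) example.
Thus no hypothesis in the restricted space is consistent with all three examples, and the
hypothesis space needs to be more expressive to include disjunctions.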
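The failure can be reproduced with a short sketch. The helper names `covers` and `generalize` are hypothetical; the three instances are the ones listed in the example training data.

```python
def covers(h, x):
    """A conjunctive hypothesis covers x if every constraint is '?' or matches."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def generalize(s, x):
    """Keep matching constraints, replace mismatches with '?'."""
    return tuple(c if c == v else "?" for c, v in zip(s, x))

positives = [
    ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"),
    ("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change"),
]
negative = ("Rainy", "Warm", "Normal", "Strong", "Cool", "Change")

s = positives[0]
for x in positives[1:]:
    s = generalize(s, x)

print(s)
print("covers the negative example:", covers(s, negative))
```

Since the minimal conjunctive generalization of the positives already covers the negative instance, every more general conjunctive hypothesis does too.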
An Unbiased Learner
To avoid bias, we can define a hypothesis space (H') that includes every possible subset of
instances, known as the power set of the instance space. For the "EnjoySport" example with six
attributes, there are 96 distinct instances and therefore 2^96 (about 10^28) possible target concepts.
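The size of the unbiased space can be checked with a few lines of Python. The value counts below are an assumption (Sky takes three values, the other five attributes two each), following the usual definition of the EnjoySport task; the variable names are illustrative.

```python
from math import prod

# Assumed number of possible values per attribute
value_counts = {"Sky": 3, "AirTemp": 2, "Humidity": 2,
                "Wind": 2, "Water": 2, "Forecast": 2}

instances = prod(value_counts.values())            # size of the instance space
target_concepts = 2 ** instances                   # size of the power set
# Each attribute is either a specific value or '?', plus one all-empty hypothesis
conjunctive = 1 + prod(n + 1 for n in value_counts.values())

print("distinct instances:", instances)
print("possible target concepts:", target_concepts)
print("semantically distinct conjunctive hypotheses:", conjunctive)
```

The gap between the two hypothesis counts shows how small a fraction of all target concepts the conjunctive space can represent.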
Example Unbiased Hypothesis
The target concept "Sky = Sunny OR Sky = Cloudy" can be represented as:
⟨Sunny,?,?,?,?,?⟩ OR ⟨Cloudy,?,?,?,?,?⟩
Definition of Inductive Bias
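Inductive bias refers to the assumptions an algorithm makes to generalize from the training data.
It can be defined as the minimal set of assertions (B) such that, for any new instance, the
classification the algorithm assigns follows deductively from B together with the training data
and the instance.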
Example:
For the Candidate Elimination algorithm, the inductive bias is the assertion that "H contains the
target concept." Given this assumption, every classification on which all hypotheses in the
version space agree follows deductively from the training data, so the inductive system can be
modeled as an equivalent deductive system.
By characterizing inductive systems through their biases, we can compare different algorithms
based on their generalization policies.
DECISION TREE LEARNING
 Decision tree learning is a popular machine learning method used for both classification
and regression tasks.
 Decision tree learning is a method for approximating discrete-valued target functions,
in which the learned function is represented by a decision tree.
 It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
 A decision tree simply asks a question and, based on the answer (for example Yes/No), it
further splits the tree into subtrees.
Decision Tree Representation:
 It builds a model in the form of a tree structure where each internal node represents a
decision based on a feature or attribute, each branch represents the outcome of the decision,
and each leaf node represents the prediction label or value.
 Below diagram explains the general structure of a decision tree:
1. Nodes:
 Root Node: The topmost node in the tree, representing the initial decision point. It contains
the entire dataset.
 Internal Nodes: Nodes that split the dataset based on a feature or attribute value. They
lead to child nodes based on the outcome of the split.
 Leaf Nodes (Terminal Nodes): Nodes that give the final prediction label or value. They do
not split further.
2. Edges:
 Edges/branches: Connect nodes and represent the outcome of a decision or a set of
decisions based on a feature's value.
3. Splitting Criteria:
 Decision trees use various criteria to split nodes, such as information gain or Gini impurity
(for classification) or variance reduction (for regression). The goal is to choose, at each
split, the attribute that makes the resulting child nodes as pure as possible.
4. Tree Depth:
 The depth of a tree is the number of edges from the root node to the farthest leaf node. It
determines the complexity of the model and influences its ability to generalize.
Example:
• Suppose a candidate has a job offer and wants to decide whether to accept it or not. To
solve this problem, the decision tree starts with the root node (the Salary attribute, chosen
by an attribute selection measure, ASM).
• The root node splits into the next decision node (Distance from the office) and one
leaf node, based on the corresponding labels.
• That decision node in turn splits into one decision node (Cab facility) and one leaf
node.
• Finally, the last decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:
Advantages of Decision Trees:
 Interpretability: Easy to interpret and visualize, making them useful for explaining
decisions.
 Non-linear Relationships: Can capture non-linear relationships between features and
target variables.
 Handles Missing Values: Can handle missing values in the dataset.
 No Need for Feature Scaling: Not sensitive to feature scaling unlike some other models
like SVMs or neural networks.
Disadvantages of Decision Trees:
 Overfitting: Prone to overfitting, especially with complex trees that capture noise in the
training data.
 Instability: Small variations in the data can result in a completely different tree structure.
 Bias towards Dominant Classes: In classification tasks, can create biased trees if one
class dominates the dataset.
 Greedy Nature: The greedy approach to find the best split at each node may not result in
the globally optimal tree.
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
A classic example of a problem to which decision tree learning is applied is Play Tennis.
Play-Tennis decision tree example
Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs
 Each instance or example in the dataset is described by a fixed set of attributes and their
corresponding values.
 This structured format allows decision trees to efficiently partition the data based on
attribute conditions.
 Example:
Attributes: {Outlook, Temperature, Humidity, Wind}
Values: Outlook: Sunny, Overcast, Rainy; Temperature: Hot, Mild, Cool; Humidity: High, Normal; Wind: Weak, Strong
2. The target function has discrete output values
 Decision trees naturally handle classification tasks where the target function outputs
discrete values (e.g., categories or classes).
 This includes binary classifications (yes/no) as well as multi-class classifications.
 Example:
Output: Whether to play tennis or not (Yes or No)
3. Disjunctive descriptions may be required
 Decision trees can easily handle disjunctive (or) conditions in the data, where multiple
attributes or their combinations may lead to the same outcome.
 For example, a rule could be: "Play tennis if the outlook is Sunny OR Overcast."
4. The training data may contain errors
 Decision tree algorithms such as ID3, C4.5, and CART are robust to errors in the training
data, including misclassified examples and incorrect attribute values.
 With a suitable stopping criterion or pruning, they can tolerate a moderate amount of noise
and outliers without a significant loss in performance.
5. The training data may contain missing attribute values
 Decision tree methods can be used even when some examples have unknown values for
some attributes.
 For instance, if the "Humidity" value is missing for a particular instance, the decision tree
can still classify it whenever the path it follows does not test Humidity.
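The characteristics above can be illustrated with a small sketch. The tree below is an assumption: it is the widely used Play Tennis tree with Outlook at the root, encoded here as a nested dictionary. The helper names `classify` and `paths_to` are hypothetical.

```python
# Assumed tree: internal nodes are {attribute: {value: subtree}}, leaves are labels
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][instance[attribute]]
    return tree

def paths_to(tree, label, conditions=()):
    """Collect every root-to-leaf path ending in `label` as a conjunction of tests."""
    if not isinstance(tree, dict):
        return [conditions] if tree == label else []
    attribute = next(iter(tree))
    found = []
    for value, subtree in tree[attribute].items():
        found += paths_to(subtree, label, conditions + ((attribute, value),))
    return found

day = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "Normal", "Wind": "Weak"}
print(classify(TREE, day))

# The tree as a disjunction (OR) of conjunctions (AND), one per path to "Yes"
for path in paths_to(TREE, "Yes"):
    print(" AND ".join(f"{a} = {v}" for a, v in path))

# An instance with Humidity missing is still classified when its path never tests it
print(classify(TREE, {"Outlook": "Overcast", "Wind": "Strong"}))
```

Each instance is a set of attribute-value pairs, the output is discrete, and the learned concept is a disjunction of the conjunctive paths that end in "Yes".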
BASIC DECISION TREE LEARNING ALGORITHM
 The ID3 (Iterative Dichotomiser 3) algorithm is a fundamental decision tree learning
algorithm that constructs decision trees from a dataset.
 ID3 was one of the earliest algorithms developed for constructing decision trees.
 It was developed by Ross Quinlan and is based on the concept of information gain.
 C4.5 is an extension of ID3 developed by Ross Quinlan. It addresses some limitations of
ID3, such as handling both categorical and numerical attributes, handling missing values,
and pruning trees to avoid overfitting.
 CART is a versatile decision tree algorithm developed by Breiman et al. It can be used for
both classification and regression tasks.
 Random Forest is an ensemble learning method based on decision trees. It builds multiple
decision trees and combines their predictions to improve accuracy and reduce overfitting.
ID3 (Iterative Dichotomiser 3) algorithm
It is a top-down, greedy search algorithm that generates a decision tree from a given dataset by
iteratively selecting the best attribute to split the data based on information gain.
Step 1: Calculate the entropy of the entire dataset.
Step 2: For each feature, calculate the information gain.
Step 3: Select the feature with the highest information gain as the best feature to split the data.
Step 4: Create a branch node in the decision tree using the selected feature.
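Step 5: For each unique value of the selected feature, create a branch and repeat steps 1 to 4 on
the subset of examples that have that value, using only the remaining features (recursion).
Step 6: Stop growing a branch when all of its examples belong to the same class (the branch
ends in a leaf with that class) or when there are no features left to split on (the branch ends
in a leaf with the majority class).
Step 7: When every branch ends in a leaf, the decision tree is complete.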
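The steps above can be sketched as a short recursive function. This is a minimal illustration, not a full ID3 implementation (no pruning, no handling of unseen values). The function names are hypothetical, and the data is assumed to be the classic 14-example Play Tennis set, which is not reproduced in these notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the node minus the weighted entropy of its subsets."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:              # all examples in one class: leaf
        return labels[0]
    if not attributes:                     # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining, target)
                   for value in sorted({r[best] for r in rows})}}

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
DATA = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rainy Mild High Weak Yes
Rainy Cool Normal Weak Yes
Rainy Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rainy Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rainy Mild High Strong No
"""
rows = [dict(zip(COLUMNS, line.split())) for line in DATA.strip().splitlines()]
tree = id3(rows, COLUMNS[:-1], "PlayTennis")
print(tree)
```

The greedy choice in `max(...)` is made once per node and never revisited, which is why the result is not guaranteed to be the globally smallest tree.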
Attribute Selection Measures (ASM)
 Attribute Selection Measures (ASM) are criteria used in decision tree algorithms to select
the best attribute for splitting the data at each node of the tree.
 These measures quantify how well an attribute separates the training examples into their
target classes.
 Decision tree algorithms like ID3, C4.5, and CART use these measures to recursively build
trees by selecting attributes that optimize the chosen measure at each step.
 Some commonly used Attribute Selection Measures in decision tree algorithms are:
1. Entropy
2. Information Gain
3. Gini Index
4. Gain Ratio
5. Reduction in Variance
1. Entropy(S):
 Entropy is a measure of impurity or uncertainty in a dataset.
 In the context of decision trees, entropy is used to quantify the amount of disorder or
unpredictability in the data before and after splitting based on an attribute.
$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
Where:
S is the dataset at a given node.
c is the number of classes in the dataset.
p_i is the proportion of examples in class i in dataset S.
For a two-class (positive/negative) problem:
• The entropy is 0 if all members of S belong to the same class (highly pure dataset)
• The entropy is 1 when the collection contains an equal number of positive and negative
examples (mixed dataset/impure)
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1
With c classes, the entropy can be as large as log2(c).
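A minimal sketch of the entropy calculation follows. The function name `entropy` and the label lists are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

pure = entropy(["Yes"] * 4)                      # all examples in one class
balanced = entropy(["Yes"] * 2 + ["No"] * 2)     # equal split of two classes
skewed = entropy(["Yes"] * 9 + ["No"] * 5)       # unequal split of two classes
four_way = entropy(["A", "B", "C", "D"])         # four equally likely classes

print(pure, balanced, round(skewed, 3), four_way)
```

Only classes that actually occur are counted, so the sum never evaluates log2(0).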
2. Information Gain:
 Information gain is the reduction in entropy or uncertainty achieved by splitting the data
on a particular attribute.
 The decision tree algorithm selects the attribute that maximizes information gain at each
node.
 Constructing a decision tree is all about finding, at each node, the attribute that returns the
highest information gain, i.e. the smallest weighted entropy of the resulting subsets.
$$InformationGain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$
Where:
S is the dataset at a given node.
A is the attribute being considered for splitting.
Values(A) is the set of all possible values of attribute A.
Sv is the subset of S where attribute A has value v.
∣S∣ is the total number of examples in S.
∣Sv∣ is the number of examples in subset Sv.
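As a worked illustration, assume the classic Play Tennis collection S with 9 positive and 5 negative examples, where the attribute Wind splits S into Weak (6 positive, 2 negative) and Strong (3 positive, 3 negative). These counts are an assumption taken from the standard version of that dataset, which is not reproduced in these notes.

```latex
\begin{align*}
Entropy(S) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
Entropy(S_{Weak}) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
Entropy(S_{Strong}) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
InformationGain(S, Wind) &= 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048
\end{align*}
```

A gain this close to zero means that knowing Wind alone removes very little uncertainty about the class.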
3. Gini Index:
 Gini index measures the impurity or the likelihood of an incorrect classification of a
randomly chosen element in the dataset if it were randomly labeled according to the
distribution of labels in the subset.
 It is used to evaluate splits in the dataset. A low Gini index suggests that a particular
attribute provides good separation of the classes.
 Higher value of Gini index implies higher inequality, higher heterogeneity.
$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$
Where:
S is the dataset at a given node.
c is the number of classes in the dataset.
pi is the proportion of examples in class i in dataset S.
 Example Calculation: Suppose we have a dataset S with 10 examples, where 6 examples
belong to class A and 4 examples belong to class B.
Proportion of class A: $P_A = \frac{6}{10} = 0.6$
Proportion of class B: $P_B = \frac{4}{10} = 0.4$
$$Gini(S) = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16)$$
$$Gini(S) = 0.48$$
4. Gain Ratio:
 Gain ratio is an extension of information gain that takes into account the intrinsic
information of a split by normalizing the information gain using the split information.
 Gain ratio adjusts for the bias towards attributes with a large number of distinct values.
$$GainRatio(S, A) = \frac{InformationGain(S, A)}{SplitInformation(S, A)}$$
Where:
$$SplitInformation(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2\left(\frac{|S_v|}{|S|}\right)$$
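The following sketch shows why the normalization matters. It uses a hypothetical four-example dataset in which a unique row identifier and a genuinely useful attribute both reach the maximum information gain, but receive different gain ratios. All names and data are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(items):
    total = len(items)
    return sum(-(n / total) * log2(n / total) for n in Counter(items).values())

def information_gain(values, labels):
    remainder = 0.0
    for v in set(values):
        subset = [label for value, label in zip(values, labels) if value == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def split_information(values):
    """Split information is the entropy of the attribute's own value distribution."""
    return entropy(values)

def gain_ratio(values, labels):
    si = split_information(values)
    return information_gain(values, labels) / si if si > 0 else 0.0

labels = ["Yes", "Yes", "No", "No"]      # hypothetical target
row_id = ["a", "b", "c", "d"]            # unique per example, useless for prediction
useful = ["x", "x", "y", "y"]            # two values that separate the classes

print(information_gain(row_id, labels), information_gain(useful, labels))
print(gain_ratio(row_id, labels), gain_ratio(useful, labels))
```

Both attributes have the same information gain, but the many-valued identifier is penalized by its larger split information. The guard for zero split information is a design choice of this sketch, covering attributes that take a single value.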
5. Reduction in Variance:
 Reduction in variance is used in decision trees for regression tasks, where it measures the
amount by which variance in the target variable is reduced when a dataset is split based on
an attribute.
 It seeks to minimize the variance of the target variable within each node of the tree.
$$ReductionInVariance(S, A) = Variance(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Variance(S_v)$$
Where:
Variance(S) is the variance of the target variable in dataset S.
Variance(Sv) is the variance of the target variable in subset Sv.
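A minimal sketch of this measure follows, assuming population variance (dividing by the number of examples) and a small hypothetical regression dataset. The function and column names are illustrative.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(rows, attribute, target):
    """Variance of the node minus the weighted variance of its subsets."""
    weighted = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        weighted += len(subset) / len(rows) * variance(subset)
    return variance([r[target] for r in rows]) - weighted

# Hypothetical data: hours spent outdoors depending on the weather
rows = [
    {"Weather": "Sunny", "Hours": 1.0},
    {"Weather": "Sunny", "Hours": 3.0},
    {"Weather": "Rainy", "Hours": 7.0},
    {"Weather": "Rainy", "Hours": 9.0},
]
reduction = variance_reduction(rows, "Weather", "Hours")
print(reduction)
```

Here the overall variance is 10 and each subset has variance 1, so splitting on Weather removes most of the spread in the target.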
Example 1:
Example 2:
Example 3:
Example 4:
Exercise
Given the following dataset, construct a decision tree using the ID3 algorithm
1)
2)
Hours Studied Attendance Homework Completed Participation Passed
5 High Yes Active Yes
3 Medium Yes Inactive No
4 High No Active Yes
2 Low No Inactive No
6 High Yes Active Yes
1 Low No Inactive No
5 Medium Yes Inactive Yes
3 High Yes Active Yes
4 Medium Yes Inactive Yes
2 Low Yes Active No
3)
Age Income Level Credit Score Previous Purchase Purchase Decision
25 High Excellent Yes Yes
40 Medium Good No No
35 Low Poor No No
28 High Good Yes Yes
50 Medium Good No Yes
45 Low Poor No No
30 High Excellent No Yes
55 Medium Good Yes Yes
60 Low Poor Yes No
20 Medium Excellent No Yes
4)
Weather Distance Traffic Car Availability Commute By Car
Sunny Short Low Yes Yes
Rainy Long High Yes No
Sunny Long Medium No No
Overcast Short Low Yes Yes
Rainy Short Medium No No
Sunny Long Low Yes Yes
Overcast Long High Yes No
Rainy Short Low No No
Sunny Short Medium Yes Yes
Overcast Long Medium No No
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
 ID3 algorithm can be understood as a search through the space of hypotheses to find one
that best fits the training examples.
 The hypothesis space in ID3 is the set of all possible decision trees.
 ID3 starts with the simplest hypothesis (an empty tree) and progressively explores more
complex hypotheses, guided by information gain.
Key Characteristics of ID3's Search Strategy:
1. Complete Hypothesis Space:
 The hypothesis space in ID3 includes all possible decision trees that can be formed from
the given attributes.
 This ensures that the hypothesis space is complete because any finite discrete-valued
function can be represented by some decision tree.
 ID3 avoids the risk that the target function is not within the hypothesis space, a problem
common in incomplete hypothesis spaces.
2. Single Hypothesis Approach:
 ID3 maintains only one current hypothesis at any point during the search.
 This contrasts with methods such as the version space Candidate-Elimination algorithm,
which maintain the set of all hypotheses consistent with the training data.
o Limitation: By focusing on a single hypothesis, ID3 loses the ability to:
 Determine the number of alternative decision trees consistent with the training
data.
 Pose new instance queries to resolve among competing hypotheses.
3. No Backtracking:
 Once ID3 selects an attribute to test at a node, it does not reconsider this choice.
 The search path is fixed and may lead to a locally optimal solution.
 Limitation: This locally optimal solution may not be as desirable as other potential trees
that could have been found by exploring different branches of the search.
4. Statistical Decision-Making:
 ID3 uses all training examples at each step to make statistically based decisions about
refining the current hypothesis.
 Advantage: This approach makes ID3 less sensitive to errors in individual training
examples.
 Handling Noisy Data: ID3 can be extended to handle noisy data by adjusting the
termination criterion to accept hypotheses that imperfectly fit the training data.
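The search strategy described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the toy weather rows, the attribute names, and the function names `entropy`, `information_gain` and `id3` are all assumptions made for the example. It shows the greedy, single-hypothesis search: every example at a node contributes to the choice of attribute, and once an attribute is chosen it is never revisited.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def id3(rows, attrs, target):
    """Grow one tree greedily; earlier attribute choices are never reconsidered."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:              # pure node: return the class label
        return labels[0]
    if not attrs:                          # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # All examples at this node take part in choosing the split attribute.
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Hypothetical training data, for illustration only.
rows = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]

tree = id3(rows, ["Outlook", "Wind"], "Play")
print(tree)
```

On this toy data Outlook has the larger information gain, so it is placed at the root and Wind is only tested under the Rain branch.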
INDUCTIVE BIAS IN DECISION TREE LEARNING
Inductive Bias refers to the set of assumptions or predispositions that a learning algorithm uses to
generalize from the training data to new, unseen instances. It shapes how the algorithm selects and
prioritizes certain hypotheses (models) over others.
ID3 Algorithm and Its Bias
ID3 (Iterative Dichotomiser 3) is a classic decision tree algorithm that constructs decision trees
from a set of training data. Its approximate inductive bias includes:
1. Preference for Shorter Trees:
 ID3 prefers simpler (shorter) decision trees over longer ones.
 This preference stems from Occam's razor, which suggests that simpler hypotheses
are more likely to generalize well to new, unseen data.
 Shorter trees are less complex and less prone to overfitting.
2. High Information Gain:
 ID3 uses the information gain heuristic to decide which attribute to split on at each
node of the tree.
 It places attributes with higher information gain closer to the root of the tree.
 This strategy aims to reduce uncertainty (entropy) most effectively, leading to
more informative splits.
Types of Inductive Bias
1. Preference Bias:
 ID3 exhibits a preference bias because its bias arises primarily from its search
strategy within a complete hypothesis space.
 It favors hypotheses (decision trees) that are simpler (shorter) and provide higher
information gain.
2. Restriction Bias:
 In contrast, algorithms like the Candidate-Elimination Algorithm might exhibit a
restriction bias because they operate within a more limited hypothesis space.
 This limitation can potentially exclude the true target function if it falls outside the
predefined constraints of the hypothesis space.
Occam's Razor and Preference for Short Hypotheses
Occam's Razor states that among competing hypotheses, the one with the fewest assumptions
should be selected. In the context of machine learning:
 Favoring Simplicity: Occam’s razor supports the preference for shorter hypotheses (or
models). This preference is justified because simpler hypotheses are less likely to fit the
training data coincidentally (overfitting) and are more likely to capture the underlying
patterns that generalize well.
 Arguments in Favor: Shorter hypotheses are fewer in number and less likely to overfit.
They often provide clearer insights into the data and are computationally efficient to learn
and apply.
 Arguments Opposed: The pool of short hypotheses that fit any arbitrary data might be
limited, which can make finding a suitable hypothesis challenging. Moreover, what
constitutes simplicity can be subjective, potentially leading different learners to derive
different hypotheses from the same data.
This preference aims to strike a balance between model simplicity and predictive power, enhancing
generalization to new data while avoiding overfitting. Understanding these biases helps in
selecting appropriate algorithms and interpreting their results effectively in real-world
applications.
ISSUES IN DECISION TREE LEARNING
Issues in learning decision trees include:
1. Avoiding Overfitting the Data
2. Incorporating Continuous-Valued Attributes
3. Alternative Measures for Selecting Attributes
4. Handling Training Examples with Missing Attribute Values
5. Handling Attributes with Differing Costs
1. Avoiding Overfitting the Data
Overfitting occurs when a decision tree model is too complex and captures noise in the training
data rather than the underlying patterns.
To prevent overfitting, several strategies can be employed:
a) Pre-pruning (avoidance): Pre-pruning, also known as early stopping, involves stopping the
growth of the decision tree early, before it perfectly classifies the training data.
b) Post-pruning (recovery): Post-pruning involves growing the full tree and then pruning it back
to avoid overfitting.
c) Converting a Decision Tree into Rules: A decision tree can be converted into an equivalent set
of if-then rules, one rule per path from the root to a leaf, which can be more interpretable than the
tree structure. This helps mitigate overfitting because each rule can then be pruned independently,
by removing any precondition whose removal does not reduce the rule's estimated accuracy.
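As a rough sketch of the tree-to-rules conversion, the snippet below walks a tree stored as nested dictionaries and emits one if-then rule per root-to-leaf path. The dictionary representation, the example tree and the function name `tree_to_rules` are assumptions chosen for illustration, not part of these notes.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):               # leaf: emit the accumulated rule
        return [(list(conditions), tree)]
    (attr, branches), = tree.items()             # each internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

# Hypothetical tree, for illustration only.
tree = {"Outlook": {"Sunny": "No",
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

rules = tree_to_rules(tree)
for conds, label in rules:
    body = " AND ".join(f"{attr} = {value}" for attr, value in conds)
    print(f"IF {body} THEN Play = {label}")
```

Because each rule is a separate list of conditions, a later pruning step could drop a condition from one rule without affecting the others, which is not possible while the tests are shared inside a single tree.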
2. Incorporating Continuous-Valued Attributes
Handling continuous-valued attributes involves splitting the attribute's value range into intervals.
There are two methods for handling continuous attributes:
a) Define new discrete valued attributes that partition the continuous attribute value into a
discrete set of intervals.
E.g., {High ≡ Temp > 35 °C, Med ≡ 10 °C < Temp ≤ 35 °C, Low ≡ Temp ≤ 10 °C}
b) Using thresholds for splitting nodes: To define a threshold-based Boolean attribute for a
continuous attribute such as Temperature, determine a threshold value t and create a Boolean
condition (for example, Temperature > t) that tests whether the attribute's value is above or
below this threshold.
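One common way to pick the threshold t, sketched below under stated assumptions, is to sort the examples by the continuous attribute, take the midpoint between adjacent values whose class labels differ as a candidate, and keep the candidate with the highest information gain. The temperature values, labels and the function name `best_threshold` are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (t, gain) for the Boolean test `value > t` with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:        # only class boundaries give candidates
            continue
        t = (v1 + v2) / 2               # midpoint between adjacent values
        below = [y for v, y in pairs if v <= t]
        above = [y for v, y in pairs if v > t]
        remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical temperature readings and class labels.
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]

threshold, gain = best_threshold(temps, play)
print(f"Temperature > {threshold} (information gain {gain:.3f})")
```

Here the candidates are 54 and 85, and the test Temperature > 54 gives the larger gain.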
3. Alternative Measures for Selecting Attributes
Different criteria can be used to select the best attribute for splitting the data:
a) Gain Ratio: Adjusts information gain by taking into account the intrinsic information of a
split.
\[
\mathrm{GainRatio}(S, A) = \frac{\mathrm{InformationGain}(S, A)}{\mathrm{SplitInformation}(S, A)}
\]
Where:
\[
\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
\]
b) Gini Index: Measures impurity based on the probability of a randomly chosen element being
incorrectly labeled.
\[
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} (p_i)^2
\]
Where:
S is the dataset at a given node.
c is the number of classes in the dataset.
p_i is the proportion of examples in class i in dataset S.
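The two measures can be tried on a small example. The sketch below assumes a toy dataset with a hypothetical many-valued attribute `ids` and a two-valued attribute `wind`; the function names are likewise illustrative. Split information is computed as the entropy of the attribute's own values, which is what the formula for gain ratio's denominator expresses.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete items."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def information_gain(values, labels):
    """Entropy reduction from splitting labels by the attribute values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by split information."""
    split_info = entropy(values)        # entropy of S with respect to A's values
    if split_info == 0:                 # single-valued attribute: ratio undefined
        return 0.0
    return information_gain(values, labels) / split_info

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

# Hypothetical data: `ids` is unique per example, `wind` has two values.
labels = ["Yes", "Yes", "No", "No"]
ids = ["a", "b", "c", "d"]
wind = ["Weak", "Weak", "Strong", "Strong"]

print(information_gain(ids, labels), gain_ratio(ids, labels))
print(information_gain(wind, labels), gain_ratio(wind, labels))
print(gini(labels))
```

Both attributes separate the classes perfectly and so have the same information gain, but the many-valued attribute has a larger split information and therefore a lower gain ratio.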
4. Handling Training Examples with Missing Attribute Values
Dealing with missing values can be approached in several ways:
a) Ignore Examples: Discard examples that have missing values for any attribute. This can lead
to a significant reduction in the size of the training dataset, potentially removing valuable
information and reducing the overall robustness of the model.
Original Dataset:
Age Salary Purchased
25 50000 Yes
30 60000 No
35 ? Yes
? 45000 No
After Ignoring Examples with Missing Values:
Age Salary Purchased
25 50000 Yes
30 60000 No
b) Assign Most Common Value: For categorical attributes, assign the most common (mode)
value of that attribute among the remaining examples. This approach can introduce bias into the
model, as it may skew the representation of certain values and does not account for the variability
and true distribution of the data.
Original Dataset:
Age Salary Purchased
25 50000 Yes
33 60000 Yes
30 60000 No
35 ? Yes
30 45000 No
After Assigning Most Common Values:
Age Salary Purchased
25 50000 Yes
33 60000 Yes
30 60000 No
35 60000 Yes
30 45000 No
c) Assign Mean Value: For continuous attributes, replace the missing values with the mean value
of that attribute from the other examples. Similar to assigning the most common value, this can
introduce bias and does not capture the potential range of variability within the data.
Original Dataset:
Age Salary Purchased
25 50000 Yes
30 60000 No
35 ? Yes
? 45000 No
 Mean Age: (25 + 30 + 35) / 3 = 30
 Mean Salary: (50000 + 60000 + 45000) / 3 = 51666.67
After Assigning Mean Values:
Age Salary Purchased
25 50000 Yes
30 60000 No
35 51666.67 Yes
30 45000 No
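The three strategies can be reproduced with the Python standard library. This is a minimal sketch that uses the small tables from this section; representing a missing value as `None` and the helper names `drop_incomplete` and `impute` are assumptions made for the example.

```python
from statistics import mean, mode

# The four-row table used in strategies (a) and (c); None marks a missing value.
rows = [
    {"Age": 25, "Salary": 50000, "Purchased": "Yes"},
    {"Age": 30, "Salary": 60000, "Purchased": "No"},
    {"Age": 35, "Salary": None, "Purchased": "Yes"},
    {"Age": None, "Salary": 45000, "Purchased": "No"},
]

def drop_incomplete(rows):
    """(a) Discard every example that has at least one missing value."""
    return [r for r in rows if None not in r.values()]

def impute(rows, attr, statistic):
    """(b)/(c) Fill missing values of attr with a statistic of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    fill = statistic(known)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

complete = drop_incomplete(rows)
mean_filled = impute(impute(rows, "Age", mean), "Salary", mean)

# The five-row table used in strategy (b).
rows_b = [
    {"Age": 25, "Salary": 50000, "Purchased": "Yes"},
    {"Age": 33, "Salary": 60000, "Purchased": "Yes"},
    {"Age": 30, "Salary": 60000, "Purchased": "No"},
    {"Age": 35, "Salary": None, "Purchased": "Yes"},
    {"Age": 30, "Salary": 45000, "Purchased": "No"},
]
mode_filled = impute(rows_b, "Salary", mode)

print(len(complete), mean_filled[2]["Salary"], mean_filled[3]["Age"], mode_filled[3]["Salary"])
```

Dropping rows halves the four-row table, while the two imputation strategies keep every example at the cost of inserting values that were never observed.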
5. Handling Attributes with Differing Costs
 When building decision trees, it is important to consider the case where the attributes
(features) have differing costs associated with them.
 This is a common scenario in real-world applications, such as medical diagnosis, where
certain tests or examinations may be more expensive or invasive than others.
a) Cost-Sensitive Splitting Criterion
 Instead of using the standard information gain or Gini impurity as the splitting criterion, a
cost-sensitive version can be used.
 The idea is to divide the information gain by the cost of the attribute
\[
\mathrm{CostSensitiveGain} = \frac{\mathrm{InformationGain}}{\mathrm{AttributeCost}}
\]
 This encourages the algorithm to select lower-cost attributes, as they will have a higher
cost-sensitive gain.
b) Weighted Information Gain
 Another approach is to use a weighted information gain, where the weight is inversely
proportional to the attribute cost:
\[
\mathrm{WeightedInformationGain} = \mathrm{InformationGain} \times \frac{1}{\mathrm{AttributeCost}}
\]
 This has a similar effect to the cost-sensitive splitting criterion, biasing the algorithm
towards lower-cost attributes.
Example:
Assume we have three attributes for a medical test:
 Attribute A (Cost: $10)
 Attribute B (Cost: $50)
 Attribute C (Cost: $100)
a) Using Cost-Sensitive Splitting Criterion
Cost-sensitive gain is calculated by dividing the information gain by the attribute cost.
Attribute   Cost   Information Gain   Cost-Sensitive Gain
A           $10    0.3                0.3 / 10 = 0.03
B           $50    0.5                0.5 / 50 = 0.01
C           $100   0.8                0.8 / 100 = 0.008
The algorithm would choose Attribute A despite its lower information gain because it has the
highest cost-sensitive gain.
b) Using Weighted Information Gain
Weighted information gain is calculated by multiplying the information gain by the inverse of
the attribute cost.
Attribute   Cost   Information Gain   Weighted Information Gain
A           $10    0.3                0.3 × (1/10) = 0.03
B           $50    0.5                0.5 × (1/50) = 0.01
C           $100   0.8                0.8 × (1/100) = 0.008
Again, the algorithm would prefer Attribute A due to its higher weighted information gain.
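The worked example can be checked with a short script. The costs and information gains are the ones given in this example; the dictionary layout and the function names are assumptions made for illustration.

```python
# Costs and information gains from the worked example.
attributes = {
    "A": {"cost": 10, "gain": 0.3},
    "B": {"cost": 50, "gain": 0.5},
    "C": {"cost": 100, "gain": 0.8},
}

def cost_sensitive_gain(gain, cost):
    """Information gain divided by attribute cost."""
    return gain / cost

def weighted_gain(gain, cost):
    """Information gain multiplied by a weight of 1 / cost."""
    return gain * (1 / cost)

scores = {name: cost_sensitive_gain(a["gain"], a["cost"]) for name, a in attributes.items()}
weighted = {name: weighted_gain(a["gain"], a["cost"]) for name, a in attributes.items()}

best = max(scores, key=scores.get)
best_weighted = max(weighted, key=weighted.get)
print(best, best_weighted)
```

With a weight of exactly 1/cost the two criteria are the same quantity, so they always produce the same ranking. They would only differ if the weight were some other decreasing function of cost.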
c) Minimum Cost Path
 Instead of optimizing the tree for overall accuracy, the goal can be to find the minimum
cost path from the root to a leaf node.
 This can be achieved by modifying the pruning algorithm to consider the total cost of the
path, not just the accuracy.
For example, consider two paths:
 The task is to identify and follow the path that minimizes the total cost between
two nodes.
 The path marked with a green arrow has costs of 10, 10, 10, and 10, resulting in a
total cost of 40.
 The path marked with a red arrow has costs of 10, 20, and 30, resulting in a total
cost of 60.
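The comparison of the two paths amounts to summing the costs along each path and keeping the smaller total. A minimal sketch, using the costs quoted above and hypothetical names for the paths:

```python
# Per-step costs of the two paths from the example.
paths = {
    "green": [10, 10, 10, 10],
    "red": [10, 20, 30],
}

totals = {name: sum(costs) for name, costs in paths.items()}
cheapest = min(totals, key=totals.get)
print(totals, cheapest)
```

The green path is preferred even though it has more steps, because only the total cost matters.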
**********

ELH -4.2: MACHINE LEARNING :supervised, unsupervised or reinforcement learning

  • 1. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 1 ELH -4.2: MACHINE LEARNING INTRODUCTION  Machine Learning (ML) is a field of artificial intelligence (AI) focused on developing algorithms that enable computers to learn from and make decisions based on data.  Its history dates back to the 1950s with Alan Turing's concept of machines simulating human intelligence.  The term "artificial intelligence" was coined in 1956, but early AI faced limitations due to insufficient data and computational power. The 1980s saw the emergence of machine learning methods, and the 1990s brought significant advances with the rise of the internet and statistical techniques.  The modern era, particularly from the 2010s onwards, has been dominated by deep learning, leveraging neural networks and vast datasets to achieve breakthroughs in areas like image and speech recognition, transforming various industries and applications. Definitions and Examples 1. Learning Definition: Learning, in the context of machine learning, refers to the process of gaining knowledge or skills through experience, study, or being taught. In ML, it specifically means improving performance on a task over time by gaining experience from data.
  • 2. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 2 Example: Suppose you are learning to play chess. Initially, you might not know the best strategies, but as you play more games and study strategies, you get better. Similarly, in ML, an algorithm might start with little knowledge but improves its performance as it processes more data. 2. Machine Definition: A machine, in this context, is any device that uses electrical or mechanical power to perform tasks. In the realm of machine learning, "machine" usually refers to a computer or a system that can process and analyze data. Example: Your smartphone is a machine. It can perform various tasks like recognizing your voice or face, suggesting the next word while typing, or recommending songs you might like based on your listening history 3. Natural intelligence Natural intelligence refers to the inherent ability of humans and certain animals to understand, learn, and adapt to their environment using cognitive processes such as perception, reasoning, and problem-solving. It encompasses a wide range of capabilities, including language comprehension, social interactions, creativity, and emotional intelligence. Example: A person walking through a crowded street demonstrates natural intelligence by effortlessly navigating through the environment, avoiding obstacles, recognizing familiar faces, interpreting traffic signals, and making decisions based on situational awareness. This ability to process complex sensory information, analyze context, and respond appropriately showcases the remarkable cognitive abilities inherent in natural intelligence.
  • 3. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 3 4. Artificial Intelligence (AI) Definition: AI is the broader field that encompasses any technique enabling computers to mimic human intelligence. This includes problem-solving, understanding language, recognizing patterns, and learning from experience. Example: Siri, Apple's virtual assistant, is an example of AI. It can understand and respond to your questions, set reminders, and perform tasks based on your voice commands. This involves natural language processing and machine learning. Types of AI Artificial Intelligence can be divided in various types, there are mainly two types of main categorization which are based on capabilities and based on functionally of AI.
  • 4. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 4 I) Based on Capabilities AI Type Description Examples 1. Narrow AI AI designed for specific tasks. - Apple Siri - Playing chess - Self-driving cars - Speech recognition - Image recognition 2. General AI AI with the ability to understand, learn, and apply intelligence across a wide range of tasks. Currently theoretical; under research and development. 3. Super AI AI that surpasses human intelligence in all aspects, including creativity, problem- solving, and emotions. Currently theoretical; under research and development. II) Based on Functionalities AI Type Description Examples 1. Reactive Machines Basic AI systems that react to current scenarios without storing memories or past experiences. - IBM's Deep Blue - Google's AlphaGo 2. Limited Memory AI systems capable of storing past experiences for a short period and using them to inform decisions. - Self-driving cars storing recent speed of nearby cars, distance, speed limits, etc. 3. Theory of Mind AI intended to understand human emotions, beliefs, and interact socially like humans. Currently in development; no existing examples.
  • 5. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 5 4. Machine Learning (ML) Definition: Machine Learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from and make predictions or decisions based on data. Instead of being explicitly programmed to perform a task, a machine learning model is trained using data and algorithms to find patterns and make decisions. Example: A spam filter in your email uses ML to distinguish between spam and legitimate emails. It learns from past emails marked as spam or not and uses that data to predict and filter future emails. Difference between Machine Learning and Traditional Programming  Machine learning (ML) and traditional programming are two distinct approaches to solving problems with computer systems.  While traditional programming relies on explicit rules and human-crafted logic, machine learning leverages algorithms that learn from data to make predictions or decisions
  • 6. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 6 Characteristic Machine Learning Traditional Programming Approach Data-driven Rule-based Development Process Model is trained using data and algorithms Explicit rules and logic are coded by developers Adaptability Highly adaptable; can improve with more data Limited to predefined rules; changes require code modifications Human Intervention Minimal after training; continuous learning Ongoing maintenance and updates by developers Handling Complexity Handles complex patterns and large datasets effectively Effective for well-defined problems and tasks Required Input Large datasets for training and testing Detailed specifications and rules Error Handling Can handle noisy or incomplete data Requires precise data and handling of edge cases Performance Performance improves with more data and better algorithms Performance depends on code optimization Learning from Data Learns and improves from new data Does not learn; behavior remains static unless reprogrammed Flexibility Can generalize well to new, unseen data Limited flexibility; changes require code rewrite Predictive Capability Can make predictions based on patterns in data Cannot predict; follows explicit instructions Time to Deployment Longer initial setup for training models Quicker to deploy for well-defined tasks Scalability Scales well with more data and computational power Scales with code complexity and hardware resources Limitations Can be limited by the quality and quantity of training data Can be limited by the programmer's understanding and analysis of the problem Advantages Can handle complex tasks and adapt to new data and scenarios Can be used for tasks that require specific functionality and are well- defined Examples of Technologies Neural networks, decision trees, support vector machines Compilers, interpreters, 
databases Applications Used for complex tasks like Image recognition, and predictive analytics Used for tasks like database management and website development
  • 7. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 7 Need of Machine Learning  The need for machine learning (ML) arises from the exponential expansion of data in the digital age. Traditional analytical approaches are no longer adequate due to the vast amounts of data being generated daily.  Machine learning algorithms can find patterns, trends, and connections that humans would not even be aware of, making them crucial for decision-making processes, optimizing resource allocation, and spurring innovation in various sectors 1. Handling Big Data: ML can process and analyze vast amounts of data, extracting meaningful patterns and insights. 2. Complex Pattern Recognition: ML algorithms excel at identifying intricate patterns in data that are difficult to detect using traditional methods. 3. Automation of Tasks:ML enables automation of repetitive tasks, reducing human intervention and increasing efficiency. 4. Improved Decision Making:ML models can provide data-driven insights and predictions, aiding in better decision-making processes. 5. Adaptability:ML models can adapt to new data and changing conditions, making them flexible and robust. 6. Personalization:ML allows for personalized experiences, such as tailored recommendations in e-commerce and streaming services. 7. Scalability:ML systems can scale with the amount of data and computational power, improving performance and accuracy over time. 8. Real-Time Processing:ML can process data in real time, enabling applications like fraud detection, autonomous vehicles, and instant recommendations. 9. Complex Problem Solving: ML can tackle problems that are too complex for traditional algorithms, such as image and speech recognition. 10. Predictive Maintenance: ML can predict equipment failures and maintenance needs, reducing downtime and saving costs. 11. 
Enhanced Customer Experience: ML-driven chatbots and virtual assistants provide better customer support and interaction.
  • 8. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 8 Life Cycle of Machine Learning The Machine Learning life cycle encompasses the iterative process of developing, deploying, and maintaining machine learning models. It involves steps such as data gathering, preprocessing, model selection and training, evaluation, and deployment, ensuring the model's effectiveness and adaptability to real-world scenarios. This cyclical approach enables continuous improvement and refinement of models to meet evolving needs and challenges. Here's 7 step in the Machine Learning life cycle using a fruit classification example: 1. Define Problem Statement  Understand the problem to be solved and define the objectives of the machine learning project.  In this example, the goal is to develop a model that can classify fruits into different categories based on their features, such as color, shape, and size.
  • 9. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 9 2. Data Gathering  Collect relevant data that will be used to train and test the machine learning model.  This could involve gathering information about various types of fruits, including images and corresponding labels indicating the fruit type. 3. Data Preparation  Clean and preprocess the collected data to ensure it is in a suitable format for training the machine learning model.  This may involve tasks such as removing irrelevant features, handling missing values, and normalizing the data. 4. Data Analysis  Explore and analyze the prepared data to gain insights into its characteristics and identify patterns that may be useful for training the model.  For example, analyzing the distribution of different fruit types in the dataset and visualizing the relationships between features. 5. Model Selection and Training  Select an appropriate machine learning algorithm and train it using the prepared data.  In this example, you might choose a classification algorithm such as a decision tree or a neural network to train the model to classify fruits based on their features. 6. Model Testing  Evaluate the performance of the trained model using a separate dataset that was not used during training. This helps assess how well the model generalizes to new, unseen data.  For fruit classification, you would test the model on a set of fruit images it hasn't seen before and measure its accuracy in predicting the correct fruit type.
  • 10. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 10 7. Deployment  Deploy the trained model into a production environment where it can be used to make predictions on new, incoming data.  For example, you could develop a mobile app that allows users to take a picture of a fruit and have the model classify it in real-time based on its features. Types of Machine Learning Algorithms: Machine learning algorithms are classified into three main types: supervised, unsupervised, and reinforcement learning. 1. Supervised Learning: Definition: Supervised learning involves training a model on a labeled dataset, where each input is associated with a corresponding output label. The goal is for the model to learn the mapping between inputs and outputs, enabling it to make predictions on new, unseen data.
  • 11. ELH -4.2: MACHINE LEARNING UNIT – I Notes by Mr. Chandrakantha T S, Dept. of PG Studies & Research in Electronics Kuvempu University, Jnanasahyadri Shankaraghatta,2023-24 P a g e | 11 Working: The algorithm learns from the labeled examples by adjusting its parameters to minimize prediction errors. It generalizes from the training data to make predictions on new instances, aiming to accurately predict the correct output labels. Example: Given the features of a shape (e.g., number of sides, angles), the supervised learning algorithm would analyze these features and learn patterns distinguishing between different types of shapes. Once trained, the model can classify new shapes based on their features into categories like square, rectangle, triangle, or polygon. Applications: Classification tasks such as spam detection, sentiment analysis, image recognition, and regression tasks like predicting house prices or stock prices. Advantages: Ability to make precise predictions on new data, well-understood and widely applicable across various domains. Disadvantages: Requires labeled training data, which can be time-consuming and expensive to obtain. Performance highly depends on the quality and quantity of labeled examples. 2. Unsupervised Learning: Definition: Unsupervised learning involves training a model on an unlabeled dataset, where the algorithm learns to find patterns or structures within the data without predefined output labels.
  • 12. Working: The algorithm identifies underlying patterns or structures in the data without the need for labeled output. Common techniques include clustering similar data points together and reducing the dimensionality of the data.
    Example: An unsupervised learning algorithm could analyze the geometric properties of the shapes (e.g., side lengths, angles) and identify clusters of shapes with similar characteristics. The resulting clusters could correspond to groups such as squares, rectangles, triangles, and other polygons, even though no shape names were supplied.
    Applications: Clustering (e.g., customer segmentation), dimensionality reduction (e.g., principal component analysis), and anomaly detection.
    Advantages: Can uncover hidden patterns or structures in data without labeled examples. Doesn't require manual labeling of large datasets.
    Disadvantages: Results may be harder to interpret than in supervised learning. Relies on assumptions about the structure of the data.
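    To contrast the two settings described so far, here is a minimal sketch in plain Python. The shape features (number of sides, interior angle in degrees), the sample values, the 1-nearest-neighbour rule, and the k-means routine are illustrative choices, not something the notes prescribe. The first function uses labels; the second sees only the feature values.

```python
import math


def nearest_neighbor_predict(labeled_examples, query):
    """Supervised: return the label of the closest labeled example."""
    _, label = min(labeled_examples, key=lambda pair: math.dist(pair[0], query))
    return label


def k_means(points, centroids, iterations=10):
    """Unsupervised: group unlabeled points around the given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical features: (number of sides, interior angle in degrees).
labeled_shapes = [
    ((3, 60.0), "triangle"),
    ((4, 90.0), "quadrilateral"),
    ((5, 108.0), "pentagon"),
]
predicted = nearest_neighbor_predict(labeled_shapes, (4, 88.0))

unlabeled_shapes = [(3, 60.0), (3, 61.0), (4, 90.0), (4, 89.0)]
clusters, centroids = k_means(unlabeled_shapes,
                              centroids=[unlabeled_shapes[0], unlabeled_shapes[2]])

print(predicted)
print(clusters)
```

    The supervised call needs the label attached to every training shape, while the clustering call only groups similar feature vectors and leaves the naming of each group to the user.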
  • 13. 3. Reinforcement Learning:
    Definition: Reinforcement learning involves training an agent to interact with an environment and learn to make decisions based on feedback in the form of rewards or penalties.
    Working: The agent takes actions in an environment and receives feedback in the form of rewards or penalties. Through trial and error it learns to maximize cumulative reward over time, aiming to discover the best sequence of actions to achieve its goals.
  • 14. Example: In this maze scenario, the agent must navigate from the starting point to the destination while avoiding obstacles and maximizing reward. The maze consists of different blocks, including a wall (S6), a fire pit (S8), and a diamond block (S4). The agent receives a reward of +1 for reaching the diamond block (S4) and a reward of -1 for falling into the fire pit (S8).
    Applications: Game playing, robotics, recommendation systems, natural language processing, and finance (e.g., algorithmic trading).
    Advantages: Capable of learning complex behaviors through interaction with the environment. Can handle situations with delayed feedback and uncertainty.
    Disadvantages: Can be computationally expensive and require large amounts of data for training. Training may be unstable or require careful tuning of hyperparameters. Learning from delayed rewards can be slow and inefficient in some scenarios.
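    The maze example can be sketched with tabular Q-learning, one common reinforcement learning method (the notes do not name a specific algorithm). Because the maze figure is not reproduced here, the layout is an assumption: a 3 x 4 grid numbered S1 to S12 row by row, with the agent starting at S9. The learning rate, discount factor, exploration rate, and episode count are also illustrative values.

```python
import random

ROWS, COLS = 3, 4                          # assumed grid size
WALL, FIRE, DIAMOND, START = 6, 8, 4, 9    # S6, S8, S4 from the example; S9 start is assumed
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Return (next_state, reward, done) for one deterministic move."""
    row, col = divmod(state - 1, COLS)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    next_state = state                     # bumping into an edge or the wall: stay put
    if 0 <= new_row < ROWS and 0 <= new_col < COLS:
        candidate = new_row * COLS + new_col + 1
        if candidate != WALL:
            next_state = candidate
    if next_state == DIAMOND:
        return next_state, 1.0, True
    if next_state == FIRE:
        return next_state, -1.0, True
    return next_state, 0.0, False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q(s, a) by trial and error with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(1, ROWS * COLS + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = START
        for _ in range(100):               # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start state."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = train()
path = greedy_path(q_table)
print(path)
```

    After training, the greedy path should end at the diamond block without passing through the fire pit; the reward of +1 is propagated backwards through the table by the discount factor, which is how the delayed feedback reaches the early moves.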
  • 15. Comparison of Supervised, Unsupervised, and Reinforcement Learning
    Definition
    • Supervised Learning: Training a model on a labeled dataset, where the algorithm learns to map input data to corresponding output labels or categories.
    • Unsupervised Learning: Training a model on an unlabeled dataset, where the algorithm learns to find patterns or structures within the data without predefined output labels.
    • Reinforcement Learning: Training an agent to interact with an environment and learn to make decisions based on feedback in the form of rewards or penalties.
    Data Type
    • Supervised Learning: Requires labeled data, i.e., inputs paired with their output labels.
    • Unsupervised Learning: Works with unlabeled data; no output labels are provided during training.
    • Reinforcement Learning: Involves an environment in which actions are taken and feedback is received as rewards or penalties.
    Feedback Mechanism
    • Supervised Learning: Feedback comes from labeled examples, allowing the algorithm to adjust its parameters to minimize prediction errors.
    • Unsupervised Learning: No explicit feedback is provided; the algorithm identifies patterns from the inherent structure of the data.
    • Reinforcement Learning: Feedback is received as rewards or penalties based on the actions the agent takes in the environment.
    Objective
    • Supervised Learning: Predict the output label for new, unseen data based on patterns learned from labeled examples.
    • Unsupervised Learning: Discover hidden patterns or structures within the data to gain insights or make sense of complex datasets.
    • Reinforcement Learning: Learn a policy or strategy that maximizes cumulative reward over time in order to achieve specific goals or tasks.
    Example
    • Supervised Learning: Image classification, sentiment analysis, regression tasks such as predicting house prices.
    • Unsupervised Learning: Clustering similar data points together, dimensionality reduction, anomaly detection.
    • Reinforcement Learning: Training an agent to play games (e.g., chess, Go), robotics (e.g., navigating a maze), recommendation systems.
  • 16. Applications
    • Supervised Learning: Classification (e.g., spam detection, image recognition); regression (e.g., house price prediction).
    • Unsupervised Learning: Clustering (e.g., customer segmentation, document clustering); dimensionality reduction (e.g., principal component analysis); anomaly detection.
    • Reinforcement Learning: Game playing (e.g., chess, Go); robotics (e.g., autonomous vehicles); recommendation systems; natural language processing; finance (e.g., algorithmic trading).
    Advantages
    • Supervised Learning: Can make precise predictions on new, unseen data; well understood and widely applicable in various domains.
    • Unsupervised Learning: Can uncover hidden patterns or structures in data without labeled examples; doesn't require manual labeling of large datasets.
    • Reinforcement Learning: Capable of learning complex behaviors through interaction with the environment; can handle situations with delayed feedback and uncertainty.
    Disadvantages
    • Supervised Learning: Requires labeled training data, which may be time-consuming and expensive to obtain; performance depends heavily on the quality and quantity of labeled examples.
    • Unsupervised Learning: Results may be harder to interpret than in supervised learning; relies on assumptions about the structure of the data.
    • Reinforcement Learning: Can be computationally expensive and require large amounts of data for training; training may be unstable or require careful tuning of hyperparameters; learning from delayed rewards can be slow and inefficient in some scenarios.
  • 17. 5. Deep Learning (DL)
    Definition: DL is a subset of ML that uses neural networks with many layers (hence "deep") to model and understand complex patterns in data. Deep learning is particularly powerful for tasks like image and speech recognition.
    Example: An application like Google Photos can automatically organize your photos by recognizing faces, objects, and scenes. This is done using deep learning algorithms that have been trained on vast amounts of image data to identify and categorize images accurately.
  • 18. Comparison of AI, ML, and DL
    Definition
    • Artificial Intelligence (AI): The broader concept of creating machines that can perform tasks requiring human-like intelligence.
    • Machine Learning (ML): The development of algorithms that enable computers to learn from data and improve over time.
    • Deep Learning (DL): A subset of ML that uses artificial neural networks with multiple layers (deep architectures) to learn representations of data.
    Approach
    • AI: Mimics human intelligence and behavior to perform tasks.
    • ML: Learns patterns from data and makes predictions or decisions.
    • DL: Learns representations of data through hierarchical layers of abstraction.
    Examples
    • AI: Virtual assistants (e.g., Siri, Alexa), autonomous vehicles, game-playing AI.
    • ML: Spam filters, recommendation systems, image recognition.
    • DL: Image and speech recognition, natural language processing, autonomous driving.
    Data Size
    • AI: Can handle both small and large datasets.
    • ML: Can handle both small and large datasets.
    • DL: Particularly effective with large volumes of data.
    Complexity
    • AI: Can be complex and may involve various approaches, including ML and DL.
    • ML: Can range from simple linear models to complex deep neural networks.
    • DL: Utilizes complex neural network architectures with multiple layers.
    Interpretability
    • AI: May lack interpretability due to the complexity of AI systems.
    • ML: Depends on the complexity of the ML model; simpler models may be more interpretable.
    • DL: Often considered less interpretable due to the hierarchical nature of deep neural networks.
    Training Time
    • AI: Can vary widely depending on the complexity of the AI system.
    • ML: Depends on the complexity of the ML model and the size of the dataset.
    • DL: Can be time-consuming, especially with large datasets and complex architectures.
    Hardware
    • AI: Can run on various hardware platforms, including CPUs and GPUs.
    • ML: Can run on CPUs and GPUs, with specialized hardware (e.g., TPUs) available for ML tasks.
    • DL: Often requires GPUs or specialized hardware accelerators for training and inference.
    Applications
    • AI: Wide range of applications across industries, including healthcare, finance, and gaming.
    • ML: Numerous applications in fields such as healthcare, finance, e-commerce, and more.
    • DL: Dominates fields such as computer vision, natural language processing, and speech recognition.
  • 19. WELL-POSED LEARNING PROBLEM
    • A well-posed problem is one whose solution exists, is unique, and depends continuously on the data, so that it is not overly sensitive to small changes in the data.
    • A well-posed learning problem is formally defined as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
    • The three essential components of a well-posed learning problem:
    1. Task (T): The specific problem or task that the model is intended to solve.
    2. Performance Measure (P): The metric used to evaluate the model's performance.
    3. Experience (E): The data used to train and improve the model.
    Criteria for a Well-Posed Learning Problem
    1. Well-Defined Objective: The problem should have a clear and specific goal.
    2. Relevant and Sufficient Data: The data should be relevant to the problem and sufficient in quantity and quality to train the model effectively.
    3. Measurable Performance: There must be a way to measure the performance of the model, such as accuracy, precision, recall, F1 score, or mean squared error.
    4. Feasibility and Practicality: The problem should be practically solvable given the current technology, data availability, and resource constraints.
    Examples of Well-Posed Learning Problems:
    1. Learning to Play Checkers:
    • Task: Playing checkers.
    • Performance Measure: Percentage of games won against opponents.
    • Experience: Playing practice games against itself.
    2. Handwriting Recognition:
    • Task: Recognizing and classifying handwritten words from images.
    • Performance Measure: Percentage of correctly identified words.
    • Experience: A database of handwritten words with their classifications.
    3. Robot Driving:
    • Task: Driving on public four-lane highways using vision sensors.
    • Performance Measure: Average distance traveled before an error.
    • Experience: A sequence of images and steering commands recorded while observing a human driver.
  • 20. 4. Spam Filtering:
    • Task: Identifying whether or not an email is spam.
    • Performance Measure: Percentage of emails correctly categorized as spam or nonspam.
    • Experience: Observing how you categorize emails as spam or nonspam.
    5. Face Recognition:
    • Task: Identifying different types of faces.
    • Performance Measure: The number of different types of faces predicted correctly.
    • Experience: Training the system with as many datasets of varied facial photos as possible.
    DESIGNING A LEARNING SYSTEM
    The basic design issues and approaches to machine learning are illustrated by designing a program that learns to play checkers, with the goal of entering it in the world checkers tournament. The design involves the following five steps.
    1. Choosing the Training Experience
       a) Type of Feedback: Direct vs. Indirect
       b) Degree of Control over Training Sequence
       c) Representation of Example Distribution
    2. Choosing the Target Function
    3. Choosing a Representation for the Target Function
       a) Linear Function
       b) Neural Networks
       c) Decision Trees
    4. Choosing a Function Approximation Algorithm
       a) Estimating training values
       b) Adjusting the weights
    5. The Final Design
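    Before working through the five steps, it can help to write the checkers problem down as an explicit (T, P, E) triple. The sketch below is only an illustration: the `LearningProblem` class, the `percent_won` helper, and the sample list of game outcomes are hypothetical names and data, not part of the notes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LearningProblem:
    """Container for the three components of a well-posed learning problem."""
    task: str                                              # T
    performance_measure: Callable[[Sequence[str]], float]  # P
    experience: str                                        # E


def percent_won(outcomes: Sequence[str]) -> float:
    """Performance measure for checkers: percentage of games won."""
    if not outcomes:
        raise ValueError("at least one game outcome is required")
    return 100.0 * sum(outcome == "win" for outcome in outcomes) / len(outcomes)


checkers = LearningProblem(
    task="Playing checkers",
    performance_measure=percent_won,
    experience="Practice games played against itself",
)

# Hypothetical results of four games, used only to exercise the measure.
score = checkers.performance_measure(["win", "loss", "win", "draw"])
print(score)
```

    Writing P as a function makes the "measurable performance" criterion concrete: if no such function can be written for a problem, the problem is not yet well posed.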
  • 21. 1. Choosing the Training Experience
    • The training experience defines the data or experiences the machine learning algorithm will use to learn.
    • The training data must reflect the situations the system will actually face, so that the algorithm performs well in real-world scenarios.
    To select the optimal training experience, consider these three key attributes:
    a) Type of Feedback: Direct vs. Indirect
    In the checkers setting the learner can have significant control over the sequence of training examples, which lets it explore different strategies and adjust them based on the feedback it receives.
  • 22. • Direct Feedback: The training experience provides immediate feedback on each choice made by the algorithm. For example, in a game, the algorithm gets feedback on every move it makes. Let f(t) represent the feedback function, where t is the time step or iteration:
    $$ f(t) = r(t) $$
    where r(t) is the reward or feedback received at time t.
    $$ \text{Reward}(s, a) = \begin{cases} +1 & \text{if the move } a \text{ leads to a win} \\ -1 & \text{if the move } a \text{ leads to a loss} \\ 0 & \text{otherwise} \end{cases} $$
    Here, s represents the state (board configuration), and a represents the action (move).
    • Indirect Feedback: The training experience provides feedback after a sequence of actions, indicating the final outcome rather than the quality of each individual move. This is common in scenarios where the algorithm needs to learn from the consequences of a series of decisions, such as in strategic games or long-term planning tasks.
    $$ f(T) = R(T) $$
    where R(T) is the cumulative reward or feedback received at time T, the end of a sequence of actions.
    b) Degree of Control over Training Sequence:
    • Teacher-Driven: The teacher (or supervisor) selects the training examples, providing informative states and correct actions. This approach is structured but limits the learner's ability to explore and understand the problem space independently.
  • 23. Let S(t) represent the state at time t, and A(t) represent the action taken at time t.
    $$ A(t) = \text{Teacher}(S(t)) $$
    where Teacher is a function provided by the teacher that selects the action based on the state.
    • Learner-Driven: The learner selects the training examples by identifying challenging or confusing states and requesting guidance from the teacher. This method promotes active learning and helps the learner focus on areas where it needs the most improvement.
    $$ A(t) = \text{Learner}(S(t)) + \lambda \times \text{Teacher}(S(t)) $$
    where Learner(S(t)) is the action proposed by the learner, and λ is a mixing parameter indicating the reliance on teacher feedback.
    • Self-Learning: The learner has complete control over the training process, generating its own examples and learning from them without external guidance. This method, often used in reinforcement learning, allows the learner to explore a wide range of scenarios but requires robust mechanisms to avoid overfitting and ensure generalization.
    $$ A(t) = \text{Learner}(S(t)) $$
    where the learner fully controls the action selection without external guidance.
    c) Representation of Example Distribution:
    • The training data should cover a diverse range of examples that reflect the distribution of scenarios the algorithm will encounter in real-world use.
    • If the training examples are biased or not representative of the overall data, the result can be overfitting or poor generalization.
  • 24. • An example of such bias is training only on games that the algorithm wins, which leaves out challenging scenarios.
    • Bias can be mitigated by ensuring the training set includes examples from both wins and losses, various board states, and diverse opponent strategies.
    $$ \text{TrainingSet} = \{ (s, a, r) \mid s \in \text{AllStates},\ a \in \text{AllActions},\ r \in \text{AllRewards} \} $$
    • A diverse training set helps the algorithm generalize better, improving its performance across different situations. Ensuring the training experience encompasses varied examples is crucial for achieving robust and reliable performance.
    When designing a checkers-playing program, it is essential to select the training experience carefully so that the algorithm learns effectively and generalizes well to new games. The type of feedback, the degree of control over training examples, and the distribution of examples all significantly affect the success of the algorithm.
    2. Choosing the Target Function
    The target function represents the goal of the learning process, mapping from the current state of the game to the desired outcome.
    NextMove Function: The target function f could predict the value of the best move available in a given state. It can be represented as:
    $$ f(s) = \max_{a \in \text{LegalMoves}(s)} V(s, a) $$
    where s is the current board state, a is a possible move, and V(s,a) is the value of taking action a in state s. The move played is the legal move that attains this maximum.
    3. Choosing a Representation for the Target Function
    The representation defines how the target function will be modeled, which affects the complexity and efficiency of the learning process.
    a) Linear Function: A simple linear combination of features.
    $$ V(s, a) = \sum_{i} w_i f_i(s, a) $$
    where f_i(s,a) are features, and w_i are the weights.
  • 25. b) Neural Networks: More complex and capable of capturing non-linear relationships.
    $$ V(s, a) = \text{NN}(s, a; \theta) $$
    where θ are the parameters of the neural network.
    c) Decision Trees: Hierarchical model splitting decisions based on feature values.
    4. Choosing a Function Approximation Algorithm
    The function approximation algorithm determines how the target function will be learned from the training data.
    a) Estimating Training Values
    To train the model, we need to estimate the value of different moves. This can be done using techniques like:
    • Monte Carlo Simulation: Running simulations to estimate the value of each move based on the outcomes of simulated games.
    $$ V(s, a) = \frac{1}{N} \sum_{i=1}^{N} \text{Outcome}(G_i) $$
    where G_i is the i-th simulated game starting from state s after move a.
    • Temporal Difference Learning: Updating estimates based on the difference between successive state values.
    $$ V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) $$
  • 26. where α is the learning rate, γ is the discount factor, r is the reward, and s′ is the next state.
    b) Adjusting the Weights
    A learning algorithm adjusts the weights of the representation based on the estimated values.
    • Gradient Descent: For a neural network, weights are adjusted to minimize the error in the estimated values.
    $$ \theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) $$
    where η is the learning rate, and L(θ) is the loss function.
    • Least Squares Method: For linear functions, weights can be adjusted using the least squares method to fit the function to the training data. The sum of squared errors (SSE) is given by:
    $$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
    $$ SSE = \sum_{i=1}^{n} \left( y_i - (w^{\top} x_i + b) \right)^2 $$
    where:
    • $w = [w_1, w_2, \ldots, w_m]^{\top}$ is the weight vector (one weight per feature),
    • $x_i$ is the feature vector of the i-th training example,
    • b is the bias term,
    • $\hat{y}_i$ is the predicted value,
    • $y_i$ is the corresponding target value.
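    The linear representation, the temporal difference update, and the weight adjustment can be tied together in a short sketch. It assumes a linear evaluation function with a bias term and minimizes the SSE by batch gradient descent. The two features (own pieces, opponent pieces), the synthetic target values, the learning rate, and the step count are illustrative assumptions.

```python
def predict(weights, bias, features):
    """Linear value estimate: the weighted sum of the features plus the bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sse(weights, bias, xs, ys):
    """Sum of squared errors between targets and predictions."""
    return sum((y - predict(weights, bias, x)) ** 2 for x, y in zip(xs, ys))


def gradient_step(weights, bias, xs, ys, eta):
    """One gradient descent step on the SSE."""
    errors = [y - predict(weights, bias, x) for x, y in zip(xs, ys)]
    # dSSE/dw_j = -2 * sum_i error_i * x_ij  and  dSSE/db = -2 * sum_i error_i
    grad_w = [-2 * sum(e * x[j] for e, x in zip(errors, xs))
              for j in range(len(weights))]
    grad_b = -2 * sum(errors)
    new_weights = [w - eta * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - eta * grad_b


def td_update(value, reward, next_value, alpha, gamma):
    """Temporal difference update of a single state value."""
    return value + alpha * (reward + gamma * next_value - value)


# Hypothetical features per board: (own pieces, opponent pieces).
xs = [(1, 0), (0, 1), (1, 1), (2, 1), (3, 2)]
# Synthetic training values generated from 2*x1 - x2 + 0.5, for illustration only.
ys = [2 * x1 - x2 + 0.5 for x1, x2 in xs]

weights, bias = [0.0, 0.0], 0.0
for _ in range(2000):
    weights, bias = gradient_step(weights, bias, xs, ys, eta=0.01)

final_error = sse(weights, bias, xs, ys)
print(weights, bias, final_error)
```

    In a real checkers learner the targets `ys` would come from the estimation step (for example from `td_update` or from averaged simulation outcomes) rather than from a known formula; the synthetic targets are used here only so that the recovered weights can be checked.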
  • 27. By following these steps, a machine learning-based checkers program can be developed, optimized, and made ready for competitive play.
    5. The Final Design
    • The final design of a checkers learning system can be described by four distinct program modules that represent the central components of many learning systems: the Performance System, the Critic, the Generalizer, and the Experiment Generator. These modules work together so that the system learns and improves over time through repeated iterations of performance, critique, generalization, and experimentation.
    • The final design integrates all components, resulting in a checkers-playing program intended to compete in the world checkers tournament.
  • 28. Overall Workflow:
    1. The Performance System plays a new game of checkers and records the game history.
    2. The Critic analyzes this game history to generate training examples.
    3. The Generalizer uses these training examples to update its hypothesis about the best moves in checkers.
    4. The Experiment Generator uses the updated hypothesis to select a new initial board state for the next game, and the cycle repeats.
    Through this iterative process, the system continuously improves its ability to play checkers by learning from each game played, evaluating its performance, generalizing from its experiences, and exploring new game scenarios.
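    The four-step workflow above can be sketched as a loop that passes data between four callables. Only the wiring is shown: the stand-in functions at the bottom are hypothetical placeholders that count games and examples, not real checkers logic.

```python
def run_learning_cycles(initial_board, hypothesis, performance_system,
                        critic, generalizer, experiment_generator, cycles):
    """Run the Performance System -> Critic -> Generalizer -> Experiment Generator loop."""
    board = initial_board
    for _ in range(cycles):
        game_history = performance_system(board, hypothesis)     # step 1: play a game
        training_examples = critic(game_history)                 # step 2: make examples
        hypothesis = generalizer(training_examples, hypothesis)  # step 3: update hypothesis
        board = experiment_generator(hypothesis)                 # step 4: pick next start
    return hypothesis


# Hypothetical stand-ins: a "board" is an integer and a "game" is three states.
def toy_performance_system(board, hypothesis):
    return [board, board + 1, board + 2]


def toy_critic(game_history):
    return [(state, 0.0) for state in game_history]  # (state, training value) pairs


def toy_generalizer(training_examples, hypothesis):
    return {
        "games_seen": hypothesis["games_seen"] + 1,
        "examples_seen": hypothesis["examples_seen"] + len(training_examples),
    }


def toy_experiment_generator(hypothesis):
    return hypothesis["games_seen"]


final_hypothesis = run_learning_cycles(
    initial_board=0,
    hypothesis={"games_seen": 0, "examples_seen": 0},
    performance_system=toy_performance_system,
    critic=toy_critic,
    generalizer=toy_generalizer,
    experiment_generator=toy_experiment_generator,
    cycles=3,
)
print(final_hypothesis)
```

    Keeping the modules as separate callables mirrors the design in the notes: each one can be replaced (for example, a different Critic that assigns training values another way) without touching the loop.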
  • 29. PERSPECTIVE AND ISSUES IN MACHINE LEARNING
    Machine learning encompasses various perspectives, from supervised learning's reliance on labeled data to reinforcement learning's dynamic interaction with an environment, yet it faces challenges such as data bias and interpretability concerns.
    Perspective in Machine Learning:
    1. Data-Centric Perspective:
    • Machine learning focuses on leveraging data to extract meaningful patterns, insights, and knowledge.
    • It emphasizes the importance of data quality, quantity, and relevance in training accurate models.
    2. Model-Centric Perspective:
    • Machine learning involves designing and developing models that can learn from data and make predictions or decisions.
    • Models can range from simple linear models to complex deep neural networks, and their selection depends on the problem and data characteristics.
    3. Algorithmic Perspective:
    • Machine learning encompasses various algorithms and techniques that enable models to learn from data.
    • These include supervised learning, unsupervised learning, reinforcement learning, and deep learning, among others.
    Issues in Machine Learning
    1. Data Quality and Quantity:
       o Issue: Insufficient or poor-quality data can lead to inaccurate models and biased results.
       o Solution: Collecting more high-quality data, preprocessing data to handle missing values and outliers, and ensuring data is representative of the problem domain.
  • 30. 2. Overfitting and Underfitting:
       o Issue: Overfitting occurs when a model fits the training data too closely, including its noise, and therefore fails to generalize to new, unseen data. Underfitting happens when the model is too simple to capture the underlying structure of the data.
       o Solution: Regularization techniques, cross-validation, and adjusting model complexity can help mitigate overfitting and underfitting.
    3. Interpretability and Explainability:
       o Issue: Complex machine learning models often lack interpretability, making it challenging to understand and trust their decisions, especially in critical applications like healthcare or finance.
  • 31.    o Solution: Using simpler, more interpretable models when possible, or employing techniques such as feature importance analysis and model explanation methods.
    4. Bias and Fairness:
       o Issue: Models can inadvertently learn and perpetuate biases present in the training data, leading to unfair or discriminatory outcomes.
       o Solution: Careful selection and preprocessing of training data, fairness-aware algorithms, and post-processing techniques to mitigate bias.
    5. Computational Resources:
       o Issue: Training and deploying complex machine learning models can require significant computational resources, including processing power and memory.
       o Solution: Optimizing algorithms and model architectures, utilizing distributed computing frameworks, and leveraging cloud computing resources.
    6. Privacy and Security:
       o Issue: Machine learning models trained on sensitive data may inadvertently leak private information or be vulnerable to adversarial attacks.
       o Solution: Implementing privacy-preserving techniques such as differential privacy and federated learning, and training models to be robust against adversarial attacks.
    7. Ethical Considerations:
       o Issue: Machine learning applications raise ethical concerns about data privacy, consent, transparency, and potential societal impacts.
       o Solution: Adhering to ethical guidelines and regulations, fostering interdisciplinary collaboration, and engaging in transparent communication with stakeholders.
    Addressing these issues requires a combination of technical expertise, ethical considerations, and interdisciplinary collaboration to ensure responsible and effective deployment of machine learning systems.
  • 32. CONCEPT LEARNING
    Concept learning is a key task in machine learning, aimed at discovering general patterns or concepts from labeled examples.
    Definition: Concept learning is the task of inferring a Boolean-valued function from training examples of its input and output.
    It involves the following steps and objectives:
    1. Inference of Hypotheses: The process starts by inferring a hypothesis that accurately describes the target concept based on observed instances. For example, understanding what a "bird" is by analyzing various examples of birds and identifying their common characteristics.
    2. Generalization: The goal is to derive a general rule or concept from specific examples. This allows the model to generalize beyond the training data, making accurate predictions on new, unseen instances.
    3. Pattern Recognition and Classification: Concept learning is crucial for tasks such as classification and pattern recognition. By identifying the underlying rules or patterns that define a concept, systems can make predictions or decisions based on the learned knowledge.
    Concept learning is discussed here under two headings:
    i) Concept Learning Task
    ii) Concept Learning as Search
  • 33. Types of Concept Learning
    i) Concept Learning Task
    Definition: A concept learning task involves identifying a general rule (or concept) from specific examples, allowing for the classification of new, unseen examples. The concept learning task typically involves the following components:
    1. Instance Space (X):
    • The instance space is the set of all possible instances or examples that can be observed or encountered in the domain of interest.
    $$ X = \{ x_1, x_2, \ldots, x_n \} $$
    • Each instance x in X is described by a vector of attribute values.
    • For example, x = (Sunny, Warm, Normal, Strong, Warm, Same).
    2. Hypothesis Space (H):
    • The hypothesis space is the set of possible hypotheses or concept descriptions that can be considered during the concept learning process.
    $$ h : X \rightarrow \{0, 1\} $$
    • Each hypothesis is a potential concept description that can classify instances into positive or negative examples of the target concept.
    • For example, h(x)=1 if the hypothesis predicts that Aldo enjoys the sport on day x, and h(x)=0 otherwise.
    • For example, some hypotheses in the hypothesis space could be:
      If Sky = Sunny and AirTemp = Warm, then EnjoySport = Yes
      If Humidity = High and Water = Warm, then EnjoySport = No
    3. Training Examples (D):
    • The training examples are the provided instances along with their corresponding class labels (EnjoySport).
    $$ D = \{ (x_1, c(x_1)), (x_2, c(x_2)), \ldots, (x_n, c(x_n)) \} $$
  • 34. • Each training example consists of attribute values and the target concept's class label (Yes or No).
    4. Target Concept (c):
    • The target concept is the concept or category that we want to learn from the training examples; here, whether Aldo enjoys the sport (Yes) or not (No).
    $$ c : X \rightarrow \{0, 1\} $$
    • For example, c(x)=1 (positive example) if Aldo enjoys the sport on day x, and c(x)=0 (negative example) otherwise.
    • Each hypothesis h in H represents a Boolean-valued function defined over X:
    $$ h : X \rightarrow \{0, 1\} $$
    • The goal of the learner is to find a hypothesis h such that:
    $$ \forall x \in X,\ h(x) = c(x) $$
    • The aim of concept learning is to infer a concept description or hypothesis that accurately predicts the EnjoySport label for new, unseen instances based on the provided training examples.
    Example:
    • Let's consider learning the target concept "Days on which Aldo enjoys his favorite water sport."
    • We have a table of data with various attributes (Sky, AirTemp, Humidity, Wind, Water, Forecast) and whether Aldo enjoyed the sport (EnjoySport) on those days.
  • 35. Training Data: (EnjoySport Dataset)
    Task: The goal is to learn a rule (hypothesis) that can predict the value of EnjoySport for any new day based on the values of its other attributes (Sky, AirTemp, Humidity, Wind, Water, and Forecast).
    Hypothesis Representation:
    • General Hypothesis: A rule that applies to many instances (e.g., Aldo enjoys the sport on any day).
    • Specific Hypothesis: A rule that applies to very specific instances (e.g., Aldo enjoys the sport only on Sunny and Warm days).
    Each hypothesis can be represented as a vector of constraints on the attributes:
    • "?" means any value is acceptable.
    • A specific value (e.g., "Warm") means only that value is acceptable.
    • "Φ" means no value is acceptable.
    Examples:
    • Hypothesis for enjoying the sport on cold days with high humidity: h=(?,Cold,High,?,?,?)
    • The most general hypothesis (every day is positive): h=(?,?,?,?,?,?)
    • The most specific hypothesis (no day is positive): h=(Φ,Φ,Φ,Φ,Φ,Φ)
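    The constraint-vector representation can be made concrete with a small matching function. This is a minimal sketch: the function name `matches` and the tuple encoding are illustrative choices, while the three hypotheses and the attribute order (Sky, AirTemp, Humidity, Wind, Water, Forecast) come from the notes.

```python
ANY = "?"    # any value is acceptable
NONE = "Φ"   # no value is acceptable


def matches(hypothesis, instance):
    """Return True (h(x) = 1) if every constraint accepts the instance's value."""
    return all(
        constraint == ANY or (constraint != NONE and constraint == value)
        for constraint, value in zip(hypothesis, instance)
    )


# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast.
cold_and_humid = (ANY, "Cold", "High", ANY, ANY, ANY)
most_general = (ANY,) * 6
most_specific = (NONE,) * 6

sunny_day = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
rainy_day = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")

print(matches(most_general, sunny_day))
print(matches(most_specific, sunny_day))
print(matches(cold_and_humid, sunny_day))
print(matches(cold_and_humid, rainy_day))
```

    The most general hypothesis accepts every day, the most specific one accepts none, and the cold-and-humid hypothesis accepts only days whose AirTemp and Humidity values satisfy both constraints.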
  • 36. The Inductive Learning Hypothesis
• Inductive learning, also known as inductive reasoning or inductive inference, is a type of learning that involves generalizing from specific instances to form general rules or concepts.
• It is a fundamental process used by humans and machines to acquire knowledge and make predictions based on observed examples.
• Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples. If h(x)≈c(x) for all x in the training set D, we say h approximates c well over D.
Example: In simpler terms, if a rule (hypothesis) works well for the examples we've seen, it should also work well for new examples we haven't seen yet, provided we've seen enough examples to make this judgment.
  • 37. Example: Suppose we have the following training data for the target concept "Days on which Aldo enjoys his favorite water sport":
Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
Hypothesis h: Aldo enjoys the sport on sunny and warm days, represented as h=(Sunny,Warm,?,?,?,?)
• If our hypothesis h correctly predicts the EnjoySport value for all days in the training set D, and D is large and diverse enough, the inductive learning hypothesis suggests h will likely also predict well for new, unseen days.
• Hence, the inductive learning hypothesis is the assumption that lets us expect a hypothesis that performs well on a large training set to generalize well to other examples.
ii) Concept Learning as Search
• Concept learning can be viewed as a search through a space of possible hypotheses to find the one that best matches the training examples.
• The search process involves exploring the hypothesis space to find a hypothesis that minimizes the errors or inconsistencies between the predicted labels and the true labels.
• Find a hypothesis that best fits the training examples.
• Search the hypothesis space (finite or infinite) efficiently.
  • 38. Example: Consider the instances X and hypotheses H in the EnjoySport learning task
Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
Sunny | Warm | Normal | Weak | Warm | Same | Yes
Cloudy | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
Search Space for Hypotheses
• Instance Space: The total number of possible combinations of attribute values. Sky can take three values and each of the other five attributes can take two:
Total instances = 3×2×2×2×2×2 = 96 distinct instances
• Syntactically Distinct Hypotheses (including ? and Φ): Counts all possible combinations of attribute constraints, including the "don't care" (?) and "always false" (Φ) symbols.
For each attribute with n possible values, there are n+2 choices (its n values plus "?" and "Φ").
Total syntactically distinct hypotheses = 5×4×4×4×4×4 = 5120
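The counting argument above can be checked with a few lines of Python. This is only a sketch: the dictionary name `values_per_attribute` is an assumption, and the value counts (three for Sky, two for every other attribute) are the ones implied by the products 3×2×2×2×2×2 and 5×4×4×4×4×4.

```python
from math import prod

# Number of possible values per attribute, as implied by the products in the text
values_per_attribute = {
    "Sky": 3, "AirTemp": 2, "Humidity": 2,
    "Wind": 2, "Water": 2, "Forecast": 2,
}
counts = list(values_per_attribute.values())

# Every combination of attribute values is one instance
instances = prod(counts)

# Each attribute constraint may be one of its values, "?" or "Φ"
syntactic = prod(n + 2 for n in counts)

print(instances)  # 96
print(syntactic)  # 5120
```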
  • 39. • Semantically Distinct Hypotheses (excluding redundant ones with Φ): Counts hypotheses that classify instances differently. Every hypothesis containing one or more "Φ" symbols is empty, because it classifies every instance as negative, so all such hypotheses are counted as a single one. For the remaining hypotheses, each attribute has its values plus "?":
Semantically distinct hypotheses = 1 + (4×3×3×3×3×3) = 1 + 972 = 973
In the context of Concept Learning as Search, the task is to navigate through this large hypothesis space to find a hypothesis that best matches the training examples and generalizes well to new instances.
General-to-Specific Ordering of Hypotheses
• General-to-specific ordering of hypotheses is a method of organizing hypotheses in a way that progresses from broader, more general statements to narrower, more specific ones.
• This ordering helps in systematically exploring and narrowing down potential explanations or predictions.
Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
• A more general hypothesis covers a broader range of instances (e.g., "Aldo enjoys the sport on any day").
• A more specific hypothesis covers a narrower range of instances (e.g., "Aldo enjoys the sport only on Sunny, Warm, and Windy days").
• We can order hypotheses from general to specific based on their constraints.
  • 40. Example: Consider two hypotheses:
• h1=(Sunny,?,?,Strong,?,?) (Sunny days with Strong wind) [more specific]
• h2=(Sunny,?,?,?,?,?) (Any Sunny day) [more general]
Since h2 imposes fewer constraints, it classifies more instances as positive and is more general than h1.
General-to-Specific Ordering:
• General Hypothesis: h2 (more general, covers more instances).
• Specific Hypothesis: h1 (more specific, covers fewer instances).
Definition: Hypothesis hj is more-general-than-or-equal-to hypothesis hk if every instance satisfying hk also satisfies hj:
$$(\forall x \in X)\,[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]$$
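The definition above can be tested directly by enumerating the instance space. The sketch below is a brute-force check, not an efficient method; the names `DOMAINS`, `satisfies` and `more_general_or_equal` are assumptions, and the attribute values are those that appear in the EnjoySport tables of these notes.

```python
from itertools import product

# Attribute values in the order Sky, AirTemp, Humidity, Wind, Water, Forecast
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),
    ("Warm", "Cold"),
    ("Normal", "High"),
    ("Strong", "Weak"),
    ("Warm", "Cool"),
    ("Same", "Change"),
]


def satisfies(h, x):
    """h(x) = 1 when every constraint is '?' or equals the attribute value."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(hj, hk):
    """True if every instance that satisfies hk also satisfies hj."""
    return all(satisfies(hj, x) for x in product(*DOMAINS) if satisfies(hk, x))


h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")

print(more_general_or_equal(h2, h1))  # True
print(more_general_or_equal(h1, h2))  # False
```

Enumeration works here because the instance space has only 96 members; for larger spaces a per-attribute comparison of the constraints would be the more practical choice.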
  • 41. Find-S (Find-Specific) Algorithm: Finding a Maximally Specific Hypothesis
• The Find-S (Find-Specific) algorithm is a simple supervised machine learning algorithm used for finding the most specific hypothesis that fits all the positive examples in a given dataset.
• It starts with the most specific hypothesis and generalizes it by incorporating positive examples, while ignoring negative examples during the learning process.
• The algorithm represents the hypothesis using a vector of attribute constraints. The most specific hypothesis is represented as {φ, φ, φ, ..., φ}, where φ means no value is acceptable for that attribute.
• The most general hypothesis is represented as {?, ?, ?, ..., ?}, where ? means any value is acceptable for that attribute.
FIND-S Algorithm
1. Initialize the hypothesis h to the most specific hypothesis possible.
2. For each positive training example x:
   • For each attribute constraint ai in h:
      • If ai is satisfied by x, do nothing.
      • Else, replace ai in h with the next more general constraint that is satisfied by x.
3. Output the final hypothesis h.
Illustrative Example 1: To illustrate this algorithm, assume the learner is given the sequence of training examples from the EnjoySport task
Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
  • 42. Step-by-Step Execution of Find-S Algorithm:
• Initialization: h0 = <Φ,Φ,Φ,Φ,Φ,Φ>
• Example 1 (positive): h1 = <Sunny,Warm,Normal,Strong,Warm,Same>
• Example 2 (positive): Humidity differs, so h2 = <Sunny,Warm,?,Strong,Warm,Same>
• Example 3 (negative): ignored, so h3 = h2
• Example 4 (positive): Water and Forecast differ, so h4 = <Sunny,Warm,?,Strong,?,?>
The final hypothesis h after processing all instances is <Sunny,Warm,?,Strong,?,?>
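The FIND-S steps can be sketched in Python as follows. This is a minimal illustration for attribute vectors with "Yes"/"No" labels; the function name `find_s` and the data layout (a list of `(attributes, label)` pairs) are assumptions made for the example.

```python
def find_s(examples):
    """Return the maximally specific hypothesis covering all positive examples.

    examples: list of (attribute_tuple, label) pairs, label is "Yes" or "No".
    """
    n = len(examples[0][0])
    h = ["Φ"] * n  # step 1: most specific hypothesis
    for x, label in examples:
        if label != "Yes":
            continue  # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == "Φ":
                h[i] = value  # first positive example: adopt its value
            elif h[i] != value:
                h[i] = "?"    # mismatch: generalize to "any value"
    return tuple(h)


enjoy_sport = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

print(find_s(enjoy_sport))
# ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

For this hypothesis language the "next more general constraint" is always either the observed value (when the constraint is Φ) or "?", which is why two branches are enough.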
  • 43. Illustrative Example 2:
Example | Color | Hardness | Smell | Surface
1 | GREEN | HARD | NO | WRINKLED
2 | GREEN | HARD | NO | SMOOTH
3 | GREEN | SOFT | YES | WRINKLED
4 | ORANGE | HARD | NO | WRINKLED
5 | GREEN | SOFT | YES | SMOOTH
(The class label column is not shown in the table; in the steps that follow, examples 1, 4 and 5 are treated as positive and examples 2 and 3 as negative.)
1. Initialize the hypothesis h to the most specific {φ, φ, φ, φ}.
2. Consider example 1: {GREEN, HARD, NO, WRINKLED}
   • Since this is a positive example, we generalize the hypothesis to match it: h = {GREEN, HARD, NO, WRINKLED}
3. Example 2 is negative, so we ignore it and h remains the same.
  • 44. 4. Example 3 is negative, so we ignore it and h remains the same.
5. Example 4: {ORANGE, HARD, NO, WRINKLED}
   • We compare each attribute to h and replace any mismatches with ? to generalize: h = {?, HARD, NO, WRINKLED}
6. Example 5: {GREEN, SOFT, YES, SMOOTH}
   • Comparing to h, we replace mismatches with ?: h = {?, ?, ?, ?}
The final hypothesis h after processing all instances is h = {?, ?, ?, ?}
Note that this hypothesis also classifies the negative examples 2 and 3 as positive, which illustrates the limitations listed below.
Advantages of the FIND-S algorithm
• Simplicity: Easy to understand and implement, making it ideal for introducing machine learning concepts.
• Efficiency: Computationally efficient for small to moderate-sized datasets, updating the hypothesis with individual examples.
• Maximally Specific Hypothesis: Outputs the most specific hypothesis in H that covers all positive examples. This hypothesis is also consistent with the negative examples, provided the target concept is in H and the training examples are correct.
Limitations of the FIND-S algorithm
• Assumes noiseless data: Find-S assumes that all positive instances are correctly labeled and there are no errors.
• Ignores negative instances: It only considers positive examples for generalization.
• Cannot handle inconsistent data: If there is noise or inconsistency in the data, Find-S might not perform well.
Unanswered by FIND-S
• Has the learner converged to the correct target concept?
• Why prefer the most specific hypothesis?
• Are the training examples consistent?
• What if there are several maximally specific consistent hypotheses?
  • 45. Exercise
Apply the Find-S algorithm to determine the most specific hypothesis that fits all positive instances.

Dataset 1 (PlayTennis)
Instance | Weather | Temperature | Humidity | Wind | PlayTennis
1 | Sunny | Hot | High | Weak | Yes
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rainy | Mild | High | Weak | Yes
5 | Rainy | Cool | Normal | Weak | Yes

Dataset 2 (PlayGolf)
Instance | Outlook | Temperature | Humidity | Wind | PlayGolf
1 | Sunny | Hot | High | Weak | Yes
2 | Sunny | Hot | High | Strong | No
3 | Overcast | Hot | High | Weak | Yes
4 | Rainy | Mild | High | Weak | Yes
5 | Rainy | Cool | Normal | Weak | Yes
6 | Rainy | Cool | Normal | Strong | No

Dataset 3 (Surfing)
Instance | Day | Weather | Temperature | Humidity | Wind | Surfing
1 | Weekday | Sunny | Warm | High | Strong | Yes
2 | Weekend | Rainy | Cold | High | Weak | No
3 | Weekend | Sunny | Warm | Normal | Strong | Yes
4 | Weekday | Sunny | Warm | High | Weak | No
5 | Weekday | Rainy | Warm | Normal | Strong | No
6 | Weekend | Sunny | Hot | Normal | Strong | Yes

Dataset 4 (PlayGame)
Instance | Color | Size | Shape | Texture | PlayGame
1 | Red | Small | Round | Smooth | Yes
2 | Red | Small | Square | Rough | No
3 | Blue | Large | Round | Smooth | Yes
4 | Red | Small | Round | Rough | Yes
5 | Blue | Small | Round | Smooth | Yes
  • 46. Dataset 5 (GoHiking)
Instance | Sky | AirTemp | Humidity | Wind | Water | Forecast | GoHiking
1 | Sunny | Hot | High | Weak | Warm | Same | Yes
2 | Sunny | Hot | High | Strong | Cool | Change | No
3 | Overcast | Cool | Normal | Weak | Warm | Same | Yes
4 | Rainy | Mild | High | Weak | Warm | Same | Yes
5 | Sunny | Cool | Normal | Weak | Warm | Same | Yes
6 | Overcast | Hot | Normal | Strong | Cool | Same | Yes

Dataset 6 (Fruit)
Instance | Color | Size | Shape | Texture | Fruit
1 | Red | Large | Round | Smooth | Yes
2 | Yellow | Medium | Oval | Rough | No
3 | Red | Small | Round | Smooth | Yes
4 | Green | Large | Oval | Rough | No
5 | Red | Large | Round | Rough | Yes
  • 47. Consistent Hypothesis and Version Space
a) Consistent Hypothesis
Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x)=c(x) for each example (x, c(x)) in D.
$$Consistent(h, D) \equiv (\forall \langle x, c(x) \rangle \in D)\; h(x) = c(x)$$
Satisfies Hypothesis:
• An example x is said to satisfy hypothesis h when h(x)=1, regardless of whether x is a positive or negative example of the target concept.
• An example x is consistent with hypothesis h iff h(x)=c(x).
Example | Citations | Size | In Library | Price | Editions | Buy
1 | Some | Small | No | Affordable | One | No
2 | Many | Big | No | Expensive | Many | Yes
h1=(?, ?, No, ?, Many)
h2=(?, ?, No, ?, ?)
Hypothesis | Example 1 | Example 2 | Overall | Consistent?
h1 | Consistent | Consistent | Consistent | Yes (all examples match)
h2 | Inconsistent | Consistent | Inconsistent | No (mismatch in Example 1)
h1 rejects Example 1 (Editions is One, not Many), which agrees with its label No, and accepts Example 2, which agrees with its label Yes. h2 accepts Example 1 although its label is No, so h2 is inconsistent.
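The consistency check for h1 and h2 can be reproduced with a short sketch. The function names `predict` and `consistent` are assumptions; labels are encoded as 1 for Buy = Yes and 0 for Buy = No, and the attribute values are spelled here as "Many" and "Expensive".

```python
def predict(h, x):
    """h(x): 1 if x satisfies every constraint of h, otherwise 0."""
    return 1 if all(c == "?" or c == v for c, v in zip(h, x)) else 0


def consistent(h, D):
    """Consistent(h, D): h(x) equals c(x) for every example (x, c(x)) in D."""
    return all(predict(h, x) == c for x, c in D)


# Attribute order: Citations, Size, In Library, Price, Editions
D = [
    (("Some", "Small", "No", "Affordable", "One"), 0),  # Buy = No
    (("Many", "Big", "No", "Expensive", "Many"), 1),    # Buy = Yes
]

h1 = ("?", "?", "No", "?", "Many")
h2 = ("?", "?", "No", "?", "?")

print(consistent(h1, D))  # True
print(consistent(h2, D))  # False
```

Note that Example 1 does not satisfy h1 (h1 predicts 0 for it) yet is still consistent with h1, because its label is also 0.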
  • 48. b) Version Space
The version space VS(H,D) with respect to a hypothesis space H and a set of training examples D is the subset of hypotheses from H that are consistent with the training examples in D.
$$VS_{H,D} \equiv \{\, h \in H \mid Consistent(h, D) \,\}$$
Here:
• H is the hypothesis space, which is the set of all possible hypotheses that can be formulated based on the given problem.
• D is the set of training examples, which consist of input-output pairs used to train the model.
• A hypothesis h is said to be consistent with the training examples D if it correctly classifies all the examples in D.
  • 49. LIST-THEN-ELIMINATE (LTE) Algorithm
• The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all hypotheses in H and then eliminates any hypothesis found inconsistent with any training example.
• List-Then-Eliminate works in principle, so long as the hypothesis space H is finite.
• However, since it requires exhaustive enumeration of all hypotheses in H, it is not feasible in practice for realistic hypothesis spaces.
  • 50. Advantages of LTE
• Simplicity: The algorithm is conceptually simple and easy to understand. It directly maintains and updates the version space by removing inconsistent hypotheses.
• Correctness: If the hypothesis space is finite and a consistent hypothesis exists in it, the algorithm is guaranteed to find it.
Disadvantages of LTE
• Infeasibility for Large Hypothesis Spaces: The primary drawback is that the hypothesis space H can be extremely large, making it impractical to enumerate and store all hypotheses explicitly.
   - For example, if the hypothesis space contains 2^n hypotheses, where n is the number of binary features, the algorithm becomes computationally infeasible as n grows.
• Exhaustive Enumeration: The requirement to exhaustively enumerate all hypotheses makes the algorithm inefficient for large hypothesis spaces and inapplicable to infinite ones.
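For a small hypothesis space the List-Then-Eliminate procedure can be written out directly. The sketch below enumerates the conjunctive EnjoySport hypotheses and filters them against the four training examples used earlier in these notes; the names `DOMAINS`, `predicts_positive` and `list_then_eliminate` are assumptions for illustration.

```python
from itertools import product

# Attribute values in the order Sky, AirTemp, Humidity, Wind, Water, Forecast
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),
    ("Warm", "Cold"),
    ("Normal", "High"),
    ("Strong", "Weak"),
    ("Warm", "Cool"),
    ("Same", "Change"),
]


def predicts_positive(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))


def list_then_eliminate(domains, examples):
    # List: one value or "?" per attribute, plus the single all-Φ hypothesis
    version_space = list(product(*[d + ("?",) for d in domains]))
    version_space.append(("Φ",) * len(domains))
    # Eliminate: drop every hypothesis that misclassifies a training example
    for x, label in examples:
        version_space = [h for h in version_space
                         if predicts_positive(h, x) == label]
    return version_space


examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

vs = list_then_eliminate(DOMAINS, examples)
print(len(vs))  # 6
for h in vs:
    print(h)
```

The initial list holds 973 hypotheses, which is manageable here; the same approach stops being practical as soon as the number of attributes or values grows.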
  • 51. A More Compact Representation for Version Spaces
The concept of representing version spaces through their most general and least general members (the general boundary G and the specific boundary S) is an elegant and efficient way to manage hypotheses in machine learning.
i) General Boundary (G):
• The set of maximally general hypotheses in the hypothesis space H that are consistent with the training data D.
$$G \equiv \{\, g \in H \mid Consistent(g, D) \wedge (\neg \exists g' \in H)[(g' >_g g) \wedge Consistent(g', D)] \,\}$$
• These hypotheses are as general as possible while still being consistent with the data. No more general hypothesis exists in H that is also consistent with D.
ii) Specific Boundary (S):
• The set of minimally general (i.e., maximally specific) hypotheses in the hypothesis space H that are consistent with the training data D.
$$S \equiv \{\, s \in H \mid Consistent(s, D) \wedge (\neg \exists s' \in H)[(s >_g s') \wedge Consistent(s', D)] \,\}$$
• These hypotheses are as specific as possible while still being consistent with the data. No more specific hypothesis exists in H that is also consistent with D.
Version Space Representation Theorem
The Version Space representation theorem states that the version space can be compactly represented using the general boundary G and the specific boundary S:
$$VS_{H,D} = \{\, h \in H \mid (\exists s \in S)(\exists g \in G)(g \geq_g h \geq_g s) \,\}$$
  • 52. Implications and Advantages
• Compact Representation: Instead of storing all hypotheses in the version space, we only need to store the boundaries G and S. This significantly reduces memory requirements.
• Efficient Updates: Updating the version space with new training examples involves adjusting G and S, which is typically more efficient than handling the entire set of consistent hypotheses.
• Boundary Manipulation:
   - Adding a Positive Example: When a new positive example is encountered, we need to generalize the specific boundary S (make it less specific) and ensure the general boundary G remains consistent.
   - Adding a Negative Example: When a new negative example is encountered, we need to specialize the general boundary G (make it less general) and ensure the specific boundary S remains consistent.
CANDIDATE-ELIMINATION (CE) Algorithm
• The CANDIDATE-ELIMINATION algorithm operates similarly to the LIST-THEN-ELIMINATE algorithm but uses a more compact representation of the version space.
• It represents the version space by its most General (G) and Specific (S) boundaries.
• These boundaries form general and specific boundary sets, which delimit the version space within the partially ordered hypothesis space.
• The key idea is to output a description of all hypotheses consistent with the training examples.
• The algorithm incrementally builds the version space given a hypothesis space H and a set of examples.
• Examples are added one by one, each potentially shrinking the version space by removing inconsistent hypotheses.
• The algorithm updates the general and specific boundaries with each new example.
• It is an extended form of the Find-S algorithm and the LIST-THEN-ELIMINATE algorithm.
  • 53. The algorithm considers both positive and negative examples:
   • Positive examples generalize the specific hypothesis.
   • Negative examples make the general hypothesis more specific.
Candidate-Elimination Algorithm
1. Initialization: Initialize both the specific and the general hypotheses.
   The general hypothesis is set to the most general hypothesis (?): G = {?, ?, ?, ?, ?, ...}
   The specific hypothesis is set to the most specific hypothesis (ϕ): S = {ϕ, ϕ, ϕ, ϕ, ϕ, ...}
2. Processing Training Examples: For each training example, the algorithm checks if it is positive or negative.
   2.1 If the example is positive, then for each attribute:
         if attribute_value == hypothesis_value: do nothing
         else: replace the hypothesis value with '?' (this generalizes S)
   2.2 If the example is negative: make the general hypothesis G more specific, just enough to exclude the example.
3. Updating Version Space (G and S): The version space is updated after each training example by removing any hypotheses that are inconsistent with the example.
  • 54. An Illustrative Example
Initialization
• Specific Hypothesis (S): S = [Ø, Ø, Ø, Ø, Ø, Ø]
• General Hypothesis (G): G = [[?, ?, ?, ?, ?, ?]]
Iteration through Examples
Example 1: [Sunny, Warm, Normal, Strong, Warm, Same], Yes (+)
• Update S: S = [Sunny, Warm, Normal, Strong, Warm, Same]
• G remains unchanged: G = [[?, ?, ?, ?, ?, ?]]
Example 2: [Sunny, Warm, High, Strong, Warm, Same], Yes (+)
• Update S: Compare each attribute. S[2] = Normal ≠ High, so S[2] = ?
  S = [Sunny, Warm, ?, Strong, Warm, Same]
• G remains unchanged: G = [[?, ?, ?, ?, ?, ?]]
Example 3: [Rainy, Cold, High, Strong, Warm, Change], No (−)
• Update G: Current S = [Sunny, Warm, ?, Strong, Warm, Same]
  For each hypothesis in G = [[?, ?, ?, ?, ?, ?]], create new hypotheses:
  For attribute 0 (Sky): S[0] = Sunny ≠ Rainy. New hypothesis: [Sunny, ?, ?, ?, ?, ?]
  For attribute 1 (AirTemp): S[1] = Warm ≠ Cold. New hypothesis: [?, Warm, ?, ?, ?, ?]
  For attributes 2 to 4: do not specialize, since S[2] = ?, and S[3] = Strong and S[4] = Warm equal the values in the negative example.
  For attribute 5 (Forecast): S[5] = Same ≠ Change. New hypothesis: [?, ?, ?, ?, ?, Same]
  G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?], [?, ?, ?, ?, ?, Same]]
Example 4: [Sunny, Warm, High, Strong, Cool, Change], Yes (+)
• Update S: Compare each attribute. S[4] = Warm ≠ Cool, so S[4] = ?; S[5] = Same ≠ Change, so S[5] = ?
  S = [Sunny, Warm, ?, Strong, ?, ?]
• Update G: Filter out hypotheses that are inconsistent with the positive example.
  [Sunny, ?, ?, ?, ?, ?] is consistent
  [?, Warm, ?, ?, ?, ?] is consistent
  [?, ?, ?, ?, ?, Same] is inconsistent (the example has Forecast = Change) and is removed
  G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]]
Final Hypotheses
• Specific Hypothesis (S): S = [Sunny, Warm, ?, Strong, ?, ?]
• General Hypotheses (G): G = [[Sunny, ?, ?, ?, ?, ?], [?, Warm, ?, ?, ?, ?]]
Specific Hypothesis (S) represents the most specific generalization that covers all positive examples.
General Hypotheses (G) represent the broadest set of hypotheses that exclude the negative examples while being consistent with the positive examples.
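The worked example can be reproduced with a compact sketch of Candidate-Elimination for conjunctive hypotheses. It is an illustration under stated assumptions, not a general implementation: the names `covers`, `more_general_or_equal` and `candidate_elimination` are hypothetical, S is kept as a single hypothesis (which suffices for this hypothesis language), and the training data are assumed to be noise-free.

```python
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"),
]


def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(hj, hk):
    """Syntactic test for conjunctive hypotheses."""
    if "Φ" in hk:
        return True  # hk covers nothing, so any hypothesis is at least as general
    return all(a == "?" or a == b for a, b in zip(hj, hk))


def candidate_elimination(examples, domains):
    n = len(domains)
    S = ("Φ",) * n    # specific boundary
    G = [("?",) * n]  # general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]  # drop members that reject x
            # Minimally generalize S so that it covers x
            S = tuple(v if s == "Φ" else (s if s == v else "?")
                      for s, v in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)  # already excludes the negative example
                    continue
                # Minimal specializations: fix one "?" to a value other than x[i]
                for i, values in enumerate(domains):
                    if g[i] != "?":
                        continue
                    for v in values:
                        spec = g[:i] + (v,) + g[i + 1:]
                        if v != x[i] and more_general_or_equal(spec, S):
                            new_G.append(spec)
            new_G = list(dict.fromkeys(new_G))  # remove duplicates, keep order
            # Keep only the maximally general members
            G = [g for g in new_G
                 if not any(o != g and more_general_or_equal(o, g) for o in new_G)]
    return S, G


examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

S, G = candidate_elimination(examples, DOMAINS)
print(S)  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print(G)  # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

Requiring each specialization to be more general than S is what keeps G from proposing hypotheses that would reject the positive examples already seen.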
  • 57. Comparison of Find-S, List-Then-Eliminate, and Candidate-Elimination
Algorithm | Find-S | List-Then-Eliminate | Candidate-Elimination
Hypothesis Space | Single most specific hypothesis | Version space (listed explicitly) | Version space (S and G boundaries)
Search Strategy | Specific-to-general | Exhaustive enumeration followed by elimination | Specific-to-general for S, general-to-specific for G
Handling Overfitting | Prone to overfitting | Prone to overfitting | Does not commit to a single hypothesis, but is not robust to noisy data
Iterative Process | Yes | No | Yes
Negative Examples Handling | Ignore | Eliminate hypotheses | Refine boundaries
Completeness | Not guaranteed | Guaranteed | Guaranteed
Complexity | Linear in the number of training examples | Proportional to the size of H, since every hypothesis is enumerated | Depends on the sizes of S and G, which can grow large
Advantages | Efficient for small hypothesis spaces. Produces a single, consistent hypothesis. | Handles both positive and negative instances. Conceptually simple. | Handles both positive and negative instances. Represents the version space compactly without enumerating it.
Disadvantages | Prone to overgeneralization if negative instances are not considered. Limited to simple hypothesis spaces. | Can be computationally expensive for large hypothesis spaces. May generate redundant hypotheses. | Requires storing and manipulating large sets of hypotheses. Can be computationally expensive.
  • 58. Inductive Bias
• Bias: In the context of machine learning, bias refers to the error introduced by approximating a real-world problem with a simplified model.
• Inductive bias refers to the assumptions a learning algorithm makes in order to generalize from observed data to unseen instances.
Fundamental Questions for Inductive Inference
1. What if the target concept is not in the hypothesis space?
   The algorithm cannot learn the target concept accurately.
2. Can we avoid this difficulty by using a hypothesis space that includes every possible hypothesis?
   In theory, yes, but it has practical limitations such as computational complexity and overfitting: with no restriction on H, the learner has no basis for classifying instances it has not seen.
3. How does the size of this hypothesis space influence the ability to generalize to unobserved instances?
   A larger hypothesis space can lead to overfitting, reducing the ability to generalize well to new data.
4. How does the size of the hypothesis space influence the number of training examples required?
   A larger hypothesis space generally requires more training examples to accurately learn the target concept without overfitting.
Biased Hypothesis Space: An Example
• Consider the "EnjoySport" example, where we want to predict whether a sport is enjoyable based on certain weather conditions.
• If the hypothesis space is restricted to only conjunctions (e.g., "Sky = Sunny AND Temperature = Warm"), it cannot represent disjunctions (e.g., "Sky = Sunny OR Sky = Cloudy").
  • 59. Example Training Data
• ⟨Sunny,Warm,Normal,Strong,Cool,Change⟩ Y
• ⟨Cloudy,Warm,Normal,Strong,Cool,Change⟩ Y
• ⟨Rainy,Warm,Normal,Strong,Cool,Change⟩ N
Using the Candidate Elimination algorithm with this restricted hypothesis space:
1. After the first two positive examples, the most specific conjunctive hypothesis that covers both is already overly general: ⟨?,Warm,Normal,Strong,Cool,Change⟩
2. This overly general hypothesis incorrectly covers the third, negative example.
Thus, the hypothesis space needs to be more expressive to include disjunctions.
An Unbiased Learner
To avoid bias, we can define a hypothesis space (H') that includes every possible subset of instances, known as the power set. For the "EnjoySport" example, the six attributes give 96 distinct instances, so there are 2^96 possible target concepts.
Example Unbiased Hypothesis
The target concept "Sky = Sunny OR Sky = Cloudy" can be represented as: ⟨Sunny,?,?,?,?,?⟩ OR ⟨Cloudy,?,?,?,?,?⟩
Definition of Inductive Bias
Inductive bias refers to the assumptions an algorithm makes to generalize from the training data. It can be defined as the minimal set of assertions (B) which, together with the training data, deductively justify the classifications the algorithm assigns to new instances.
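The biased hypothesis space example above can be verified with a short sketch. The helper names `generalize` and `covers` are assumptions; `generalize` performs the minimal conjunctive generalization of one vector to cover another.

```python
def generalize(s, x):
    """Most specific conjunctive hypothesis covering both s and x."""
    return tuple(a if a == b else "?" for a, b in zip(s, x))


def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))


p1 = ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change")   # positive
p2 = ("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change")  # positive
neg = ("Rainy", "Warm", "Normal", "Strong", "Cool", "Change")  # negative

S = generalize(p1, p2)
print(S)               # ('?', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
print(covers(S, neg))  # True: no conjunctive hypothesis separates the data

# Size of the unbiased hypothesis space: every subset of the 96 instances
instances = 3 * 2 * 2 * 2 * 2 * 2
print(2 ** instances)  # roughly 7.9e28 target concepts
```

The second print shows why the restricted space fails here, and the last one shows why the unrestricted space is too large to learn from a handful of examples.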
  • 60. Example: For the Candidate Elimination algorithm, the inductive bias is the assertion that "H contains the target concept."
Given this assertion, the classifications the algorithm assigns to new instances follow deductively from the training data, so the inductive system can be modeled as an equivalent deductive system. By characterizing inductive systems through their biases, we can compare different algorithms based on their generalization policies.
  • 61. DECISION TREE LEARNING
• Decision tree learning is a popular machine learning method used for both classification and regression tasks.
• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
  • 62. Decision Tree Representation:
• It builds a model in the form of a tree structure where each internal node represents a decision based on a feature or attribute, each branch represents the outcome of the decision, and each leaf node represents the prediction label or value.
• The general structure of a decision tree is as follows:
1. Nodes:
• Root Node: The topmost node in the tree, representing the initial decision point. It contains the entire dataset.
• Internal Nodes (Decision Nodes): Nodes that split the dataset based on a feature or attribute value. They lead to child nodes based on the outcome of the split.
• Leaf Nodes: Terminal nodes that predict the outcome. They do not split further and represent the final prediction label or value.
2. Edges:
• Edges/branches: Connect nodes and represent the outcome of a decision or a set of decisions based on a feature's value.
  • 63. 3. Splitting Criteria:
• Decision trees use various criteria to split nodes, such as information gain or Gini impurity (for classification) or variance reduction (for regression). The goal is to choose, at each split, the attribute that most reduces impurity.
4. Tree Depth:
• The depth of a tree is the number of edges from the root node to the farthest leaf node. It determines the complexity of the model and influences its ability to generalize.
Example:
• Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by the attribute selection measure, ASM).
• The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels.
• The next decision node further gets split into one decision node (Cab facility) and one leaf node.
• Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
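The splitting criteria named in item 3 can be sketched in a few lines of Python. The function names `gini`, `entropy` and `information_gain` are assumptions, and the four-row job-offer dataset is hypothetical toy data invented for this illustration, not data from the notes.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute_index):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: (Salary, Distance) -> was the offer accepted?
rows = [("High", "Near"), ("High", "Far"), ("Low", "Near"), ("Low", "Far")]
labels = ["Yes", "Yes", "No", "No"]

print(gini(labels))                       # 0.5
print(information_gain(rows, labels, 0))  # 1.0 (Salary separates the classes)
print(information_gain(rows, labels, 1))  # 0.0 (Distance tells us nothing)
```

In this toy data an attribute selection measure based on information gain would place Salary at the root, since it yields the larger gain.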
  • 64. Advantages of Decision Trees:
• Interpretability: Easy to interpret and visualize, making them useful for explaining decisions.
• Non-linear Relationships: Can capture non-linear relationships between features and target variables.
• Handles Missing Values: Some algorithms (for example C4.5) can handle missing values in the dataset.
• No Need for Feature Scaling: Not sensitive to feature scaling, unlike models such as SVMs or neural networks.
Disadvantages of Decision Trees:
• Overfitting: Prone to overfitting, especially with complex trees that capture noise in the training data.
• Instability: Small variations in the data can result in a completely different tree structure.
• Bias towards Dominant Classes: In classification tasks, can create biased trees if one class dominates the dataset.
• Greedy Nature: The greedy approach to find the best split at each node may not result in the globally optimal tree.
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
A classic example of decision tree learning is the Play Tennis problem.
  • 65.
    [Figure: Play-Tennis decision tree example]
    Decision tree learning is generally best suited to problems with the following characteristics:
    1. Instances are represented by attribute-value pairs
        • Each instance or example in the dataset is described by a fixed set of attributes and their corresponding values.
        • This structured format allows decision trees to efficiently partition the data based on attribute conditions.
        • Example:
            Attributes: {Outlook, Temperature, Humidity, Wind}
            Values: Outlook: Sunny, Overcast, Rainy; Temperature: Hot, Mild, Cool; Humidity: High, Normal; Wind: Weak, Strong
    2. The target function has discrete output values
        • Decision trees naturally handle classification tasks where the target function outputs discrete values (e.g., categories or classes).
        • This includes binary classifications (yes/no) as well as multi-class classifications.
        • Example: Output: Whether to play tennis or not (Yes or No)
  • 66.
    3. Disjunctive descriptions may be required
        • Decision trees naturally represent disjunctive (OR) conditions, where different attribute values or combinations of values lead to the same outcome.
        • For example, a rule could be: "Play tennis if the outlook is Sunny OR Overcast."
    4. The training data may contain errors
        • Decision tree algorithms such as ID3, C4.5, and CART are fairly robust to errors in the training data, including misclassified examples and incorrect attribute values, so a moderate amount of noise usually does not change the learned tree drastically.
    5. The training data may contain missing attribute values
        • Decision tree methods can be used even when some examples have unknown values for some attributes.
        • For instance, if the "Humidity" value is missing for a particular instance, the decision tree can still classify it based on the available attributes.
    BASIC DECISION TREE LEARNING ALGORITHM
        • The ID3 (Iterative Dichotomiser 3) algorithm is a fundamental decision tree learning algorithm that constructs decision trees from a dataset.
        • ID3 was one of the earliest algorithms developed for constructing decision trees.
        • It was developed by Ross Quinlan and is based on the concept of information gain.
        • C4.5 is an extension of ID3, also developed by Ross Quinlan. It addresses some limitations of ID3 by handling both categorical and numerical attributes, handling missing values, and pruning trees to avoid overfitting.
        • CART is a versatile decision tree algorithm developed by Breiman et al. It can be used for both classification and regression tasks.
        • Random Forest is an ensemble learning method based on decision trees. It builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
  • 67.
    ID3 (Iterative Dichotomiser 3) algorithm
    ID3 is a top-down, greedy search algorithm that generates a decision tree from a given dataset by repeatedly selecting the attribute with the highest information gain to split the data.
        Step 1: Calculate the entropy of the entire dataset.
        Step 2: For each feature, calculate the information gain.
        Step 3: Select the feature with the highest information gain as the best feature to split the data.
        Step 4: Create a branch node in the decision tree using the selected feature.
        Step 5: For each unique value of the selected feature, repeat steps 1 to 4 on the subset of examples having that value (recursion).
        Step 6: Continue building the tree until all data is classified correctly or there are no features left to split on.
        Step 7: The decision tree is complete.
    Attribute Selection Measures (ASM)
        • Attribute Selection Measures (ASM) are criteria used in decision tree algorithms to select the best attribute for splitting the data at each node of the tree.
        • These measures quantify how well an attribute separates the training examples into their target classes.
        • Decision tree algorithms like ID3, C4.5, and CART use these measures to recursively build trees by selecting attributes that optimize the chosen measure at each step.
        • Commonly used Attribute Selection Measures in decision tree algorithms are:
            1. Entropy
            2. Information Gain
            3. Gini Index
            4. Gain Ratio
            5. Reduction in Variance
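    Steps 1 to 7 of ID3 above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names (`entropy`, `information_gain`, `id3`, `classify`), the nested-dictionary tree representation, and the six-row toy dataset are all assumptions made for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels (Step 1)."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy after the split (Step 2)."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:           # Step 6: all examples agree -> leaf
        return labels[0]
    if not attributes:                  # Step 6: nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Step 3: attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}                   # Step 4: branch node
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)   # Step 5: recurse
    return tree

def classify(tree, instance):
    """Follow branches until a leaf label is reached (raises KeyError on unseen values)."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][instance[attribute]]
    return tree

# Hypothetical toy data, invented for this sketch
rows = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Overcast", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Rainy",    "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rainy",    "Wind": "Strong", "Play": "No"},
]
tree = id3(rows, ["Outlook", "Wind"], "Play")
print(tree)
print(classify(tree, {"Outlook": "Rainy", "Wind": "Weak"}))
```

    On this toy data Outlook has the larger information gain, so it becomes the root, and Wind is only tested on the Rainy branch. The sketch does no pruning and does not handle missing or continuous values.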
  • 68.
    1. Entropy(S):
        • Entropy is a measure of impurity or uncertainty in a dataset.
        • In the context of decision trees, entropy quantifies the disorder or unpredictability of the class labels before and after splitting on an attribute.
    \[ \mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \]
    Where:
        S is the dataset at a given node.
        c is the number of classes in the dataset.
        p_i is the proportion of examples in class i in dataset S.
        • The entropy is 0 if all members of S belong to the same class (a pure dataset).
        • For a two-class problem, the entropy is 1 when the collection contains an equal number of positive and negative examples (a maximally mixed, impure dataset).
        • If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.
        • With c classes, the maximum entropy is \log_2 c, reached when all classes are equally frequent.
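    The entropy formula above can be checked numerically with a short Python sketch. The helper name `entropy` and the example label lists are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i), with p_i the share of class i in labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen for this sketch: an empty set has zero entropy
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

pure = ["Yes"] * 6                    # one class only
balanced = ["Yes"] * 5 + ["No"] * 5   # equal positives and negatives
skewed = ["Yes"] * 9 + ["No"] * 5     # unequal positives and negatives

for name, labels in [("pure", pure), ("balanced", balanced), ("skewed", skewed)]:
    print(name, round(entropy(labels), 3))
```

    The three cases correspond to the three bullet points: the pure set gives 0, the balanced two-class set gives 1, and the skewed set falls in between.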
  • 69.
    2. Information Gain:
        • Information gain is the reduction in entropy or uncertainty achieved by splitting the data on a particular attribute.
        • The decision tree algorithm selects the attribute that maximizes information gain at each node.
        • Constructing a decision tree is all about finding, at each node, the attribute that gives the highest information gain, that is, the smallest weighted entropy after the split.
    \[ \mathrm{InformationGain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v) \]
    Where:
        S is the dataset at a given node.
        A is the attribute being considered for splitting.
        Values(A) is the set of all possible values of attribute A.
        S_v is the subset of S where attribute A has value v.
        |S| is the total number of examples in S.
        |S_v| is the number of examples in subset S_v.
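    As a rough illustration of the information gain formula, the sketch below computes the gain of one attribute on a small dataset. The function names and the 14-example dataset (a Wind attribute with 8 Weak and 6 Strong examples) are assumptions made up for this example, not data from the notes.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    gain = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:          # Values(A) seen in S
        subset = [r[target] for r in rows if r[attribute] == value]   # labels of S_v
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# Hypothetical dataset: Weak -> 6 Yes / 2 No, Strong -> 3 Yes / 3 No
rows = (
    [{"Wind": "Weak", "Play": "Yes"}] * 6
    + [{"Wind": "Weak", "Play": "No"}] * 2
    + [{"Wind": "Strong", "Play": "Yes"}] * 3
    + [{"Wind": "Strong", "Play": "No"}] * 3
)
wind_gain = information_gain(rows, "Wind", "Play")
print(round(wind_gain, 3))
```

    Here the split on Wind leaves both subsets fairly mixed, so the gain is small; an attribute that produced purer subsets would score higher and be preferred.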
  • 70.
    3. Gini Index:
        • The Gini index measures impurity: the likelihood that a randomly chosen element of the dataset would be classified incorrectly if it were labeled at random according to the distribution of labels in the subset.
        • It is used to evaluate splits in the dataset. A low Gini index after a split suggests that the attribute provides good separation of the classes.
        • A higher value of the Gini index implies higher heterogeneity (a more mixed set of classes).
    \[ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^{2} \]
    Where:
        S is the dataset at a given node.
        c is the number of classes in the dataset.
        p_i is the proportion of examples in class i in dataset S.
        • Example Calculation: Suppose we have a dataset S with 10 examples, where 6 examples belong to class A and 4 examples belong to class B.
            Proportion of class A: p_A = 6/10 = 0.6
            Proportion of class B: p_B = 4/10 = 0.4
    \[ \mathrm{Gini}(S) = 1 - (0.6^{2} + 0.4^{2}) = 1 - 0.52 = 0.48 \]
    4. Gain Ratio:
        • Gain ratio is an extension of information gain that takes into account the intrinsic information of a split by normalizing the information gain by the split information.
        • Gain ratio adjusts for the bias of information gain towards attributes with a large number of distinct values.
    \[ \mathrm{GainRatio}(S, A) = \frac{\mathrm{InformationGain}(S, A)}{\mathrm{SplitInformation}(S, A)} \]
    Where:
    \[ \mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \]
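    The Gini example and the gain ratio formula above can be reproduced with the following sketch. The helper names, the list-based interface, and the choice to return 0 when the split information is 0 (an attribute with a single value) are assumptions of this illustration.

```python
import math
from collections import Counter

def entropy(items):
    total = len(items)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(items).values())

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gain_ratio(values, labels):
    """values[i] is the attribute value of example i, labels[i] its class."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)      # class labels of each subset S_v
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0   # assumed convention for 0/0

# The example from the text: 6 examples of class A, 4 of class B
print(round(gini(["A"] * 6 + ["B"] * 4), 2))

# Hypothetical attribute that separates two balanced classes perfectly
print(gain_ratio(["x", "x", "y", "y"], ["Yes", "Yes", "No", "No"]))
```

    An attribute with many distinct values has a large split information, which lowers its gain ratio even when its raw information gain is high.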
  • 71.
    5. Reduction in Variance:
        • Reduction in variance is used in decision trees for regression tasks, where it measures the amount by which the variance of the target variable is reduced when a dataset is split on an attribute.
        • It seeks to minimize the variance of the target variable within each node of the tree.
    \[ \mathrm{ReductionInVariance}(S, A) = \mathrm{Variance}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Variance}(S_v) \]
    Where:
        Variance(S) is the variance of the target variable in dataset S.
        Variance(S_v) is the variance of the target variable in subset S_v.
    Example 1:
  • 75.
    Example 2:
  • 82.
    Example 3:
  • 84.
    Example 4:
  • 88.
    Exercise
    For each of the following datasets, construct a decision tree using the ID3 algorithm.
    1)
  • 89.
    2)
        Hours Studied   Attendance   Homework Completed   Participation   Passed
        5               High         Yes                  Active          Yes
        3               Medium       Yes                  Inactive        No
        4               High         No                   Active          Yes
        2               Low          No                   Inactive        No
        6               High         Yes                  Active          Yes
        1               Low          No                   Inactive        No
        5               Medium       Yes                  Inactive        Yes
        3               High         Yes                  Active          Yes
        4               Medium       Yes                  Inactive        Yes
        2               Low          Yes                  Active          No
    3)
        Age   Income Level   Credit Score   Previous Purchase   Purchase Decision
        25    High           Excellent      Yes                 Yes
        40    Medium         Good           No                  No
        35    Low            Poor           No                  No
        28    High           Good           Yes                 Yes
        50    Medium         Good           No                  Yes
        45    Low            Poor           No                  No
        30    High           Excellent      No                  Yes
        55    Medium         Good           Yes                 Yes
        60    Low            Poor           Yes                 No
        20    Medium         Excellent      No                  Yes
    4)
        Weather    Distance   Traffic   Car Availability   Commute By Car
        Sunny      Short      Low       Yes                Yes
        Rainy      Long       High      Yes                No
        Sunny      Long       Medium    No                 No
        Overcast   Short      Low       Yes                Yes
        Rainy      Short      Medium    No                 No
        Sunny      Long       Low       Yes                Yes
        Overcast   Long       High      Yes                No
        Rainy      Short      Low       No                 No
        Sunny      Short      Medium    Yes                Yes
        Overcast   Long       Medium    No                 No
  • 90.
    HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
        • The ID3 algorithm can be understood as a search through a space of hypotheses for one that best fits the training examples.
        • The hypothesis space in ID3 is the set of all possible decision trees.
        • ID3 starts with the simplest hypothesis (an empty tree) and progressively considers more elaborate hypotheses, guided by information gain.
    Key Characteristics of ID3's Search Strategy:
    1. Complete Hypothesis Space:
        • The hypothesis space in ID3 includes all possible decision trees that can be formed from the given attributes.
        • This space is complete because any finite discrete-valued function can be represented by some decision tree.
        • ID3 therefore avoids the risk that the target function lies outside the hypothesis space, a problem common to methods that search incomplete hypothesis spaces.
    2. Single Hypothesis Approach:
        • ID3 maintains only one current hypothesis at any point during the search.
        • This contrasts with methods such as the version space Candidate-Elimination algorithm, which maintain the set of all hypotheses consistent with the training examples.
        o Limitation: By focusing on a single hypothesis, ID3 loses the ability to:
  • 91.
            • Determine the number of alternative decision trees consistent with the training data.
            • Pose new instance queries to resolve among competing hypotheses.
    3. No Backtracking:
        • Once ID3 selects an attribute to test at a node, it does not reconsider this choice.
        • The search therefore follows a single path and may converge to a locally optimal solution.
        • Limitation: This locally optimal solution may be less desirable than other trees that could have been found by exploring different branches of the search.
    4. Statistical Decision-Making:
        • ID3 uses all training examples at each step to make statistically based decisions about refining the current hypothesis.
        • Advantage: This approach makes ID3 less sensitive to errors in individual training examples.
        • Handling Noisy Data: ID3 can be extended to handle noisy data by adjusting the termination criterion to accept hypotheses that imperfectly fit the training data.
    INDUCTIVE BIAS IN DECISION TREE LEARNING
    Inductive bias refers to the set of assumptions or predispositions that a learning algorithm uses to generalize from the training data to new, unseen instances. It shapes how the algorithm selects and prioritizes certain hypotheses (models) over others.
    ID3 Algorithm and Its Bias
    ID3 (Iterative Dichotomiser 3) is a classic decision tree algorithm that constructs decision trees from a set of training data. Its approximate inductive bias includes:
    1. Preference for Shorter Trees:
        • ID3 prefers simpler (shorter) decision trees over longer ones.
        • This preference is in line with Occam's razor, which suggests that simpler hypotheses are more likely to generalize well to new, unseen data.
        • Shorter trees are less complex and less prone to overfitting.
  • 92.
    2. High Information Gain:
        • ID3 uses the information gain heuristic to decide which attribute to split on at each node of the tree.
        • It places attributes with higher information gain closer to the root of the tree.
        • This strategy aims to reduce uncertainty (entropy) most effectively, leading to more informative splits.
    Types of Inductive Bias
    1. Preference Bias:
        • ID3 exhibits a preference bias: its bias arises from its search strategy, while the hypothesis space it searches is complete.
        • It favors hypotheses (decision trees) that are simpler (shorter) and that place high-information-gain attributes near the root.
    2. Restriction Bias:
        • In contrast, algorithms like the Candidate-Elimination algorithm exhibit a restriction bias because they search a more limited hypothesis space.
        • This limitation can exclude the true target function if it falls outside the predefined constraints of the hypothesis space.
    Occam's Razor and Preference for Short Hypotheses
    Occam's razor states that among competing hypotheses, the one with the fewest assumptions should be selected. In the context of machine learning:
        • Favoring Simplicity: Occam's razor supports the preference for shorter hypotheses (or models). This preference is justified because simpler hypotheses are less likely to fit the training data coincidentally (overfitting) and are more likely to capture the underlying patterns that generalize well.
        • Arguments in Favor: There are fewer short hypotheses than long ones, so a short hypothesis that fits the training data is less likely to do so by coincidence. Short hypotheses also tend to give clearer insight into the data and are cheaper to learn and apply.
  • 93.
        • Arguments Opposed: The pool of short hypotheses that fit any arbitrary data might be limited, which can make finding a suitable hypothesis challenging. Moreover, what counts as simple depends on the representation a learner uses, so different learners may derive different hypotheses from the same data.
    This preference aims to strike a balance between model simplicity and predictive power, enhancing generalization to new data while avoiding overfitting. Understanding these biases helps in selecting appropriate algorithms and interpreting their results effectively in real-world applications.
    ISSUES IN DECISION TREE LEARNING
    Issues in learning decision trees include:
        1. Avoiding Overfitting the Data
        2. Incorporating Continuous-Valued Attributes
        3. Alternative Measures for Selecting Attributes
        4. Handling Training Examples with Missing Attribute Values
        5. Handling Attributes with Differing Costs
    1. Avoiding Overfitting the Data
    Overfitting occurs when a decision tree model is too complex and captures noise in the training data rather than the underlying patterns. To prevent overfitting, several strategies can be employed:
        a) Pre-pruning (avoidance): Pre-pruning, also known as early stopping, stops the growth of the decision tree early, before it perfectly classifies the training data.
        b) Post-pruning (recovery): Post-pruning grows the full tree and then prunes it back to avoid overfitting.
  • 94.
        c) Converting a Decision Tree into Rules: A decision tree can be converted into a set of if-then rules, one per path from the root to a leaf, which can be more interpretable than the tree structure. This helps mitigate overfitting because each rule can then be pruned (generalized) independently of the others, giving a more concise representation of the decision process.
    2. Incorporating Continuous-Valued Attributes
    Handling continuous-valued attributes involves splitting the attribute's value range into intervals. There are two methods for handling continuous attributes:
        a) Define new discrete-valued attributes that partition the continuous attribute's values into a discrete set of intervals.
           E.g., {High ≡ Temp > 35 °C, Med ≡ 10 °C < Temp ≤ 35 °C, Low ≡ Temp ≤ 10 °C}
  • 95.
        b) Using thresholds for splitting nodes: To define a threshold-based Boolean attribute for a continuous attribute such as Temperature, determine a threshold value t and create the Boolean test Temperature > t, which is true when the attribute's value is above the threshold and false otherwise.
    3. Alternative Measures for Selecting Attributes
    Different criteria can be used to select the best attribute for splitting the data:
        a) Gain Ratio: Adjusts information gain by taking into account the intrinsic information of a split.
    \[ \mathrm{GainRatio}(S, A) = \frac{\mathrm{InformationGain}(S, A)}{\mathrm{SplitInformation}(S, A)} \]
    Where:
    \[ \mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \]
        b) Gini Index: Measures impurity based on the probability of a randomly chosen element being incorrectly labeled.
    \[ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^{2} \]
    Where:
        S is the dataset at a given node.
        c is the number of classes in the dataset.
        p_i is the proportion of examples in class i in dataset S.
  • 96.
    4. Handling Training Examples with Missing Attribute Values
    Dealing with missing values can be approached in several ways:
        a) Ignore Examples: Discard examples that have missing values for any attribute. This can lead to a significant reduction in the size of the training dataset, potentially removing valuable information and reducing the overall robustness of the model.
    Original Dataset:
        Age   Salary   Purchased
        25    50000    Yes
        30    60000    No
        35    ?        Yes
        ?     45000    No
    After Ignoring Examples with Missing Values:
        Age   Salary   Purchased
        25    50000    Yes
        30    60000    No
        b) Assign Most Common Value: For categorical attributes, assign the most common (mode) value of that attribute among the remaining examples. This approach can introduce bias into the model, as it may skew the representation of certain values and does not account for the variability and true distribution of the data.
    Original Dataset:
        Age   Salary   Purchased
        25    50000    Yes
        33    60000    Yes
        30    60000    No
        35    ?        Yes
        30    45000    No
  • 97.
    After Assigning Most Common Values (the most common Salary among the known values is 60000):
        Age   Salary   Purchased
        25    50000    Yes
        33    60000    Yes
        30    60000    No
        35    60000    Yes
        30    45000    No
        c) Assign Mean Value: For continuous attributes, replace the missing values with the mean value of that attribute from the other examples. Similar to assigning the most common value, this can introduce bias and does not capture the potential range of variability within the data.
    Original Dataset:
        Age   Salary   Purchased
        25    50000    Yes
        30    60000    No
        35    ?        Yes
        ?     45000    No
        • Mean Age: (25 + 30 + 35) / 3 = 30
        • Mean Salary: (50000 + 60000 + 45000) / 3 = 51666.67
    After Assigning Mean Values:
        Age   Salary     Purchased
        25    50000      Yes
        30    60000      No
        35    51666.67   Yes
        30    45000      No
    5. Handling Attributes with Differing Costs
        • When building decision trees, it is important to consider the case where the attributes (features) have differing costs associated with them.
        • This is a common scenario in real-world applications, such as medical diagnosis, where certain tests or examinations may be more expensive or invasive than others.
  • 98.
        a) Cost-Sensitive Splitting Criterion
            • Instead of using the standard information gain or Gini impurity as the splitting criterion, a cost-sensitive version can be used.
            • The idea is to divide the information gain by the cost of the attribute:
    \[ \mathrm{CostSensitiveGain} = \frac{\mathrm{InformationGain}}{\mathrm{AttributeCost}} \]
            • This encourages the algorithm to select lower-cost attributes, as they will have a higher cost-sensitive gain.
        b) Weighted Information Gain
            • Another approach is to use a weighted information gain, where the weight is inversely proportional to the attribute cost:
    \[ \mathrm{WeightedInformationGain} = \mathrm{InformationGain} \times \frac{1}{\mathrm{AttributeCost}} \]
            • With the weight taken as exactly 1/AttributeCost, this is algebraically the same quantity as the cost-sensitive gain, so it biases the algorithm towards lower-cost attributes in the same way.
    Example: Assume we have three attributes for a medical test:
        • Attribute A (Cost: $10)
        • Attribute B (Cost: $50)
        • Attribute C (Cost: $100)
    a) Using Cost-Sensitive Splitting Criterion
    Cost-sensitive gain is calculated by dividing the information gain by the attribute cost.
        Attribute   Cost   Information Gain   Cost-Sensitive Gain
        A           $10    0.3                0.3 / 10 = 0.03
        B           $50    0.5                0.5 / 50 = 0.01
        C           $100   0.8                0.8 / 100 = 0.008
    The algorithm would choose Attribute A despite its lower information gain because it has the highest cost-sensitive gain.
  • 99.
    b) Using Weighted Information Gain
    Weighted information gain is calculated by multiplying the information gain by the inverse of the attribute cost.
        Attribute   Cost   Information Gain   Weighted Information Gain
        A           $10    0.3                0.3 × (1/10) = 0.03
        B           $50    0.5                0.5 × (1/50) = 0.01
        C           $100   0.8                0.8 × (1/100) = 0.008
    Again, the algorithm would prefer Attribute A due to its higher weighted information gain.
        c) Minimum Cost Path
            • Instead of optimizing the tree for overall accuracy, the goal can be to find the minimum cost path from the root to a leaf node.
            • This can be achieved by modifying the pruning algorithm to consider the total cost of the path, not just the accuracy.
    For example, consider two paths:
        • The task is to identify and follow the path that minimizes the total cost between two nodes.
        • The path marked with a green arrow has costs of 10, 10, 10, and 10, resulting in a total cost of 40.
        • The path marked with a red arrow has costs of 10, 20, and 30, resulting in a total cost of 60.
        • The green path is therefore the minimum-cost path, even though it has more steps.
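    The three cost-based calculations above (cost-sensitive gain, weighted information gain, and the two path totals) can be recomputed with a short Python sketch. The numbers are copied from the example; the function and variable names are assumptions made for illustration.

```python
import math

# (attribute cost, information gain) for each attribute, as in the example
attributes = {"A": (10, 0.3), "B": (50, 0.5), "C": (100, 0.8)}

def cost_sensitive_gain(gain, cost):
    """Information gain divided by the attribute cost."""
    return gain / cost

def weighted_information_gain(gain, cost):
    """Information gain multiplied by a weight of 1 / cost."""
    return gain * (1 / cost)

cost_sensitive = {name: cost_sensitive_gain(g, c) for name, (c, g) in attributes.items()}
weighted = {name: weighted_information_gain(g, c) for name, (c, g) in attributes.items()}

# Both criteria rank the attributes identically.
best = max(cost_sensitive, key=cost_sensitive.get)
print(best, cost_sensitive, weighted)

# Minimum cost path: sum the step costs along each candidate path.
paths = {"green": [10, 10, 10, 10], "red": [10, 20, 30]}
totals = {name: sum(costs) for name, costs in paths.items()}
cheapest = min(totals, key=totals.get)
print(cheapest, totals)
```

    Because the two gain criteria are the same quantity up to floating-point rounding, an implementation only needs one of them.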