3. • K-nearest neighbors (KNN) is considered a “lazy learner”: it builds no explicit
model during training, and all computation is deferred to prediction time.
• For a new data point, predictions are made by searching through the entire training
set for the K most similar instances (the neighbors) and summarizing the output
variable for those K instances.
• To determine which of the K instances in the training dataset are most similar to a
new input, a distance measure is used.
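The search-and-summarize procedure above can be sketched in a few lines. This is a minimal brute-force illustration, not a production implementation; the dataset and the `knn_predict` helper are made up for the example, and the output is summarized by majority vote, as is typical for classification.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points (brute-force search, Euclidean distance)."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy 2-D dataset: two clusters labeled "A" and "B"
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (1.1, 0.9), k=3))  # → "A"
```

Note that the entire training set is scanned for every prediction, which is exactly why KNN has no training phase but a relatively expensive prediction step.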
4. • The most popular distance measure is Euclidean distance, which is calculated
as the square root of the sum of the squared differences between a point a and
a point b across all n input attributes i:
d(a, b) = sqrt( Σ_{i=1}^{n} (a_i − b_i)^2 )
• Euclidean distance is a good distance measure to use if the input variables are
similar in type.
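As a quick numeric check of the formula, here is the distance between two hypothetical 2-D points (the values are chosen so the result comes out to a round number):

```python
import math

a = (2.0, 3.0)  # hypothetical point a
b = (5.0, 7.0)  # hypothetical point b

# d(a, b) = sqrt(sum of squared per-attribute differences)
d = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
print(d)  # sqrt(3^2 + 4^2) = 5.0
```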
5. Advantages of KNN:
1. No Training Required – No learning phase, making it easy to use.
2. Handles New Data Well – New training data can be added at any time without retraining a model.
3. Easy to Understand – Intuitive and simple to implement.
4. Supports Multiclass Classification – Naturally handles multiple classes.
5. Can Learn Complex Decision Boundaries – Adapts well to different patterns.
6. Effective with Large Datasets – Performs well when enough data is available.
7. Robust to Noise – Can handle noisy data without filtering outliers.
6. Disadvantages of KNN:
1. Choosing a Distance Metric is Challenging – Hard to justify the best one.
2. Performs Poorly on High-Dimensional Data – Struggles when features are too
many.
3. Slow and Expensive for Predictions – Must compute the distance to every
training point at prediction time.
4. Sensitive to Noise – Can be affected by noisy data.
5. Requires Manual Handling of Missing Values and Outliers – Needs
preprocessing.
6. Feature Scaling is Necessary – Standardization or normalization is
required to avoid incorrect predictions.
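To illustrate why scaling matters (point 6 above): heights in centimeters span a much wider numeric range than weights in kilograms, so unscaled heights would dominate the Euclidean distance. A minimal standardization sketch, using made-up feature values and a hand-rolled z-score helper rather than a library call:

```python
def standardize(column):
    """Scale one feature column to zero mean and unit variance (z-scores),
    so that no single feature dominates the distance calculation."""
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]

heights_cm = [150, 160, 170, 180]  # hypothetical raw features
weights_kg = [50, 60, 70, 80]

print(standardize(heights_cm))  # all features now on a comparable scale
print(standardize(weights_kg))
```

In practice one would typically use a library utility such as scikit-learn's `StandardScaler`, fitted on the training data only.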
7. • The model can be represented by a binary tree (or decision tree), where each
internal node is an input variable x with a split point and each leaf contains an
output variable y used for prediction.
• The figure shows an example of a simple classification tree that predicts
whether a person is male or female from two inputs: height (in centimeters)
and weight (in kilograms).
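A tree of the kind shown in the figure is just a chain of threshold tests. The sketch below is hand-built for illustration; the split points (170 cm, 65 kg) are invented here and are not taken from any fitted model or from the figure itself:

```python
def predict_sex(height_cm, weight_kg):
    """A hand-built two-level classification tree: each `if` is an internal
    node (input variable + split point), each `return` is a leaf (output y).
    Thresholds are illustrative only."""
    if height_cm > 170:      # root node: split on height
        return "male"
    elif weight_kg > 65:     # second-level node: split on weight
        return "male"
    else:
        return "female"

print(predict_sex(182, 77))  # → "male"
print(predict_sex(160, 55))  # → "female"
```

Prediction with a fitted tree works the same way: follow the splits from the root until a leaf is reached.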
9. Advantages of CART (Classification and Regression Trees):
1. Easy to Interpret – Simple to understand and visualize.
2. Can Learn Complex Relationships – Captures non-linear patterns
effectively.
3. Minimal Data Preparation – Does not require scaling or extensive
preprocessing.
4. Built-in Feature Importance – Identifies important features naturally.
5. Performs Well on Large Datasets – Scales effectively with more data.
6. Supports Both Regression and Classification – Versatile for different tasks.
10. Disadvantages of CART:
1. Prone to Overfitting – Needs pruning to prevent excessive complexity.
2. Non-Robust to Small Changes – Slight variations in data can drastically
change the tree.
3. Sensitive to Noisy Data – Can easily pick up irrelevant patterns.
4. Greedy Algorithm – Makes local optimal decisions at each step, which may
not lead to a globally optimal tree.
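The "greedy" behavior in disadvantage 4 can be made concrete: at each node, CART scans candidate thresholds on a feature and keeps whichever split looks best locally, with no lookahead. A minimal sketch using Gini impurity on a made-up one-feature dataset (the `best_split` helper and the data are assumptions for illustration):

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Greedily pick the threshold on one feature that minimizes the
    weighted Gini impurity of the two child nodes it creates."""
    best = (float("inf"), None)
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # skip splits that leave a child empty
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        best = min(best, (score, t))
    return best  # (weighted impurity, threshold)

heights = [150, 155, 175, 180]            # hypothetical feature values
sexes = ["female", "female", "male", "male"]
print(best_split(heights, sexes))  # → (0.0, 155): a perfectly pure split
```

Because each such decision is made one node at a time, a sequence of locally optimal splits need not produce the globally optimal tree.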