Dependency modelling
AI
Outliers
• In machine learning, outliers are data points that significantly differ from the
rest of the dataset. They can be unusually high or low values compared to
the majority of data points and may result from errors, variability in
measurements, or rare occurrences.
Types of Outliers
• Global Outliers (Point Anomalies)
• A data point that deviates significantly from the entire dataset.
• Example: In a dataset of human weights (mostly between 40-100 kg), a value
of 500 kg would be a global outlier.
• Contextual Outliers (Conditional Anomalies):
• A value that is normal in one context but an outlier in another.
• Example: A temperature of 30°C is normal in summer but an outlier in winter.
• Collective Outliers:
• A group of data points that, when considered together, behave differently from the rest.
• Example: A sudden spike in website traffic at midnight for an e-commerce site may indicate a cyber attack.
Causes of Outliers
• Measurement errors (faulty sensors, human input mistakes)
• Data entry errors (typos, incorrect units)
• Experimental errors
• Natural variations (legitimate extreme values)
• Fraudulent activities (e.g., fraudulent transactions in banking)
Effects of Outliers
• Skew statistical results (e.g., mean, variance)
• Affect model performance (e.g., linear regression, KNN)
• Mislead training in machine learning models
How to Handle Outliers
• Detection Methods:
• Box Plot (IQR Method): Identifies outliers based on interquartile range
(IQR).
• Z-Score: Values with Z-score > 3 or < -3 are considered outliers.
• DBSCAN Clustering: Detects density-based outliers.
• Isolation Forests & LOF (Local Outlier Factor): Machine learning
methods to detect anomalies.
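The IQR and z-score rules above can be sketched in a few lines of NumPy. The weights below are made up for illustration, echoing the 500 kg example from earlier:

```python
import numpy as np

# Hypothetical human weights in kg; 500 kg is a global outlier
data = np.array([52.0, 55.0, 48.0, 60.0, 50.0, 47.0, 500.0])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Z-score method: flag points more than 3 standard deviations from the mean.
# Note: on a tiny sample, the extreme value itself inflates the mean and std,
# so the |z| > 3 rule can fail to flag it -- a known weakness of this method.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

print(iqr_outliers)  # [500.]
```

Here the IQR rule catches the 500 kg point while the z-score rule misses it, which is one reason the IQR method is often preferred for small or heavily skewed samples.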
Handling Techniques
• Remove outliers if they are due to errors.
• Transform data (e.g., log transformation) to reduce impact.
• Cap the values (winsorization) to limit extreme values.
• Use robust models (e.g., tree-based models, median-based methods) that
are less sensitive to outliers.
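Two of the handling techniques above, winsorization (capping) and a log transform, can be sketched with NumPy. The data and the 5th/95th percentile caps are illustrative choices, not fixed rules:

```python
import numpy as np

data = np.array([47.0, 48.0, 50.0, 52.0, 55.0, 60.0, 500.0])

# Winsorization: cap values at the 5th and 95th percentiles instead of dropping them
lo, hi = np.percentile(data, [5, 95])
capped = np.clip(data, lo, hi)

# Log transform: compresses the scale so extreme values have far less influence
logged = np.log1p(data)

print(capped.max())  # much smaller than the original 500
print(logged.max())  # log1p(500), roughly 6.2
```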
Evaluation metrics in machine learning
• Accuracy is one of the most commonly used evaluation metrics in machine
learning, especially for classification problems. It measures how often the
model correctly predicts the target class.
• Formula for Accuracy
• Accuracy = (Number of Correct Predictions / Total Number of Predictions) × 100
• Or, in terms of a confusion matrix:
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Where:
• TP (True Positive): Correctly predicted positive cases
• TN (True Negative): Correctly predicted negative cases
• FP (False Positive): Incorrectly predicted as positive when it was negative
• FN (False Negative): Incorrectly predicted as negative when it was positive
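Plugging some counts into the confusion-matrix formula (the counts are made up for this example):

```python
# Illustrative confusion-matrix counts, not from a real model
tp, tn, fp, fn = 40, 45, 5, 10

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 85 correct out of 100 -> 0.85
```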
• When is Accuracy Useful?
• Accuracy is a good metric when:
✅ The dataset is balanced (roughly equal numbers of examples per class).
✅ False positives and false negatives have similar costs (e.g., spam detection).
• When is Accuracy Misleading?
• Accuracy can be misleading on imbalanced datasets where one class dominates.
• Example:
• Imagine a diabetes prediction system where:
• 95% of people are non-diabetic (negative class)
• 5% of people are diabetic (positive class)
• If a model predicts "non-diabetic" for everyone, the accuracy would be 95%, but the model completely fails to detect diabetes.
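The diabetes example is easy to reproduce in plain Python: the always-negative model scores 95% accuracy yet detects no diabetic patients at all.

```python
# 95 non-diabetic (0) and 5 diabetic (1) patients, as in the example above
y_true = [0] * 95 + [1] * 5
# A useless model that predicts "non-diabetic" for everyone
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall: fraction of actual diabetics the model found
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = detected / sum(y_true)

print(accuracy, recall)  # 0.95 0.0
```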
Better Metrics for Imbalanced Datasets
• Precision (Positive Predictive Value)
• Precision = TP / (TP + FP)
• Measures how many predicted positives are actually positive.
• Useful when false positives are costly (e.g., cancer detection).
Recall
• Recall (Sensitivity, True Positive Rate)
• Recall = TP / (TP + FN)
• Measures how many actual positives were detected.
• Important when false negatives are costly (e.g., missing a diabetes case).
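With some hypothetical counts, precision and recall come straight from the formulas above:

```python
# Hypothetical counts for a classifier (made up for illustration)
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 of 40 predicted positives are real -> 0.75
recall = tp / (tp + fn)     # 30 of 50 actual positives were found -> 0.6
print(precision, recall)
```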
F1-Score (Harmonic Mean of Precision &
Recall)
• F1 = 2 × (Precision × Recall) / (Precision + Recall)
• A balance between precision and recall.
• ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
• Measures the ability of the model to distinguish between classes.
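The F1 arithmetic is one line; in practice, scikit-learn's `sklearn.metrics.f1_score` and `roc_auc_score` compute these metrics directly from labels and scores. A minimal sketch with hypothetical precision and recall values:

```python
precision, recall = 0.75, 0.6  # hypothetical values for illustration

# Harmonic mean punishes imbalance: if either metric is low, F1 is low
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6667
```

Because the harmonic mean is dominated by the smaller of the two inputs, a model with high precision but near-zero recall still gets a near-zero F1, which is exactly why F1 is preferred over accuracy on imbalanced data.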
