Libyan Academy for Postgraduate Studies
Subject name: Knowledge Management & Data Mining
Presented by: Hani Ahmed Jolgham
Semester: Spring 2023
Outline
 Problem statement.
 Objectives.
 Significance.
 Introduction.
 Related Work.
 Methodology.
 Results and Discussion.
 Conclusion.
 Future Work.
Problem statement
 High loan default incidence leads to low efficiency in
loan collection in the Philippines.
 Financial institutions believe that loan defaults can be
predicted using recommender systems, and that such
innovations can be driven by machine learning
approaches.
Objectives
 This study proposes solutions that aim to help
loan-extending institutions by applying supervised and
unsupervised data mining approaches to derive the best
classifier of loan default.
 Four algorithms were implemented to identify the best
classifier: J48, k-nearest neighbors (k-NN), naïve Bayes,
and logistic regression.
Significance
 This classifier (recommender system) will assist credit
risk management in deciding whether to approve a
loan.
 Making the right decision is a key factor in a bank's
success, since many losses result from wrong decisions
and wrongly approved loans.
Introduction
 Technology is changing rapidly, and many
organizations, including banking institutions, are
adapting to such changes.
 Data mining extracts information from the available
data and predicts the results of different scenarios,
which helps top-level management make business
decisions and increase customer familiarity and
satisfaction.
 Financial sectors use data mining for profitability
analysis, customer segmentation, tracing fraudulent
transactions, and checking high-risk loan applications.
Related work
 Data mining is one of the important techniques
banks use to discover knowledge from
databases.
 Hamid and Ahmed [6] presented a new model
for classifying loan risk in the banking sector
using data mining. The model aims to predict the
standing of loans in the banking sector and uses
the J48, Bayes Net, and naïve Bayes algorithms.
The study found that J48 has the highest
accuracy of the three algorithms.
Contd.
 Lahsasna et al. [17] introduced a loan default
prediction model based on the random forest
algorithm. Their experimental results show
that random forest achieves higher prediction
accuracy (98%) than the decision tree,
support vector machine (SVM), and logistic
regression algorithms, which reached only
95%, 75%, and 73% accuracy, respectively.
Methodology
Dataset: data on loan default was provided
by a loan-extending agency located in Davao City,
Philippines.
1) The dataset contains 29 attributes.
2) These comprise 27 explanatory attributes, 1 class
attribute, and 1 ID attribute.
3) It has 1,000 instances.
4) 900 instances were used for training and cross-validation.
5) 100 instances were used for prediction as a test set.
Data mining applications in assignment for PhD
Data Preparation
1. An unsupervised instance filter replaces missing
values with the mean for numeric attributes and the
mode (most frequent value) for nominal
attributes.
2. An attribute with many missing values, or with only
one distinct value, can be considered irrelevant, as it
provides no variation with respect to the target
attribute (i.e., class).
Example: attributes coded A12 and A13. A13 has 1,000
instances with one distinct response (F), while A12 has 999
instances with one distinct response (T). Both must be removed.
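The two preparation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Weka filter used in the study: the function name `impute_and_prune`, the toy rows, and the column names `amount` and `purpose` are hypothetical; only the codes A12 and A13 come from the slides.

```python
from collections import Counter
from statistics import mean

def impute_and_prune(rows, numeric_cols):
    """Fill missing values (None), then drop attributes with one distinct value.

    rows is a list of dicts sharing the same keys (hypothetical layout).
    Assumes every column has at least one non-missing value.
    """
    columns = list(rows[0].keys())
    fill = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        if col in numeric_cols:
            fill[col] = mean(present)                          # mean for numeric
        else:
            fill[col] = Counter(present).most_common(1)[0][0]  # mode for nominal
    imputed = [
        {c: (r[c] if r[c] is not None else fill[c]) for c in columns}
        for r in rows
    ]
    # An attribute with a single distinct value carries no information about the class.
    keep = [c for c in columns if len({r[c] for r in imputed}) > 1]
    return [{c: r[c] for c in keep} for r in imputed]

# Toy data: A13 is constant, A12 is constant apart from one missing value.
rows = [
    {"amount": 1000.0, "A12": "T", "A13": "F", "purpose": "car"},
    {"amount": None, "A12": "T", "A13": "F", "purpose": "car"},
    {"amount": 3000.0, "A12": None, "A13": "F", "purpose": None},
    {"amount": 2000.0, "A12": "T", "A13": "F", "purpose": "home"},
]
cleaned = impute_and_prune(rows, numeric_cols={"amount"})
```

In this toy run, A12 becomes constant once its single missing value is filled with the mode, so it is dropped together with A13, mirroring the example on the slide.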
Data normality
 The standard deviation is high (SD = 2,822.7), which may
lead to less reliable prediction performance.
 Therefore, the numeric attributes need to be rescaled
to values between 0 and 1.
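Rescaling to the range 0 to 1 is commonly done with min-max normalization, x' = (x - min) / (max - min). The sketch below assumes that is the rescaling meant here; the function name `min_max_scale` and the sample amounts are hypothetical.

```python
def min_max_scale(values):
    """Map numeric values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical loan amounts with a wide spread.
loan_amounts = [500.0, 1500.0, 8000.0, 3000.0]
scaled = min_max_scale(loan_amounts)
```

After scaling, the smallest amount maps to 0 and the largest to 1, so attributes measured on very different scales contribute comparably to distance-based classifiers such as k-NN.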
Feature selection
 Ensures that relevant attributes are included prior to
the classification procedure.
 Selects the attributes most correlated with the class
attribute.
 Prominent feature selection algorithms in Weka:
1. correlation-based feature selection
2. information gain-based (entropy-based) feature
selection
3. learner-based feature selection
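As an illustration of the information gain-based option, the sketch below computes the gain of a nominal attribute with respect to the class as the class entropy minus the entropy remaining after splitting on the attribute. It is a hand-rolled example, not Weka's evaluator; the attribute names `employed` and `region` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: 1 = repaid, 0 = default (hypothetical).
labels = [1, 1, 0, 0, 1, 0]
employed = ["yes", "yes", "no", "no", "yes", "no"]  # separates the classes perfectly
region = ["a", "b", "a", "b", "a", "b"]             # nearly uninformative

gain_employed = information_gain(employed, labels)
gain_region = information_gain(region, labels)
```

Ranking attributes by this score and keeping the top ones is one way to select the attributes most related to the class before training.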
Data imbalance & Cross validation
 The training set has a class imbalance:
 250 instances of class label 0 versus 650 instances of class
label 1.
 Algorithms tend to become biased toward the class with more
observations, since predicting it raises overall accuracy.
 To solve this, a filter called the synthetic minority oversampling
technique (SMOTE) was applied.
 Increasing the zero-labeled class from 250 to 650 requires the
addition of 400 instances to bring it on par with the one-labeled class.
 They used 13 folds of 100 instances each in the cross-validation
process to develop the prediction model on the training set.
 To achieve better classification performance, each fold should
have 50 instances of the zero-labeled class and 50 instances of the
one-labeled class.
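The oversampling and fold arithmetic above can be sketched as follows. This is a simplified SMOTE variant under stated assumptions (random base instance, Euclidean distance, k = 5 neighbours, two numeric features), not the Weka filter itself; `smote`, `stratified_folds`, and the random toy points are hypothetical.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    instance and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

def stratified_folds(class0, class1, n_folds):
    """Deal each class round-robin into folds so every fold is balanced."""
    folds = [[] for _ in range(n_folds)]
    for label, items in ((0, class0), (1, class1)):
        for i, x in enumerate(items):
            folds[i % n_folds].append((x, label))
    return folds

# Toy training set with the class sizes reported on the slide.
rng = random.Random(42)
class0 = [(rng.random(), rng.random()) for _ in range(250)]
class1 = [(rng.random() + 0.5, rng.random() + 0.5) for _ in range(650)]

needed = len(class1) - len(class0)                 # 650 - 250 = 400
class0_balanced = class0 + smote(class0, needed)   # now 650 instances
folds = stratified_folds(class0_balanced, class1, n_folds=13)
```

With 650 instances per class after oversampling, the 1,300 training instances divide into 13 folds of 100, each holding 50 instances of each class.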
Results and discussion
 Classification accuracy
 Eleven cross-validations were conducted using the four
classifiers.
 The confidence factor in J48 was set to 0.25, 0.5, 0.75, and
1.0. Any branch with a confidence level below the threshold
is pruned from the tree to reduce complexity and
overcome overfitting.
Contd.
Classifier comparison
 Three factors were considered in the assessment:
1. Average F-measure
2. Correctly classified instances
3. Kappa statistic
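All three factors can be derived from a binary confusion matrix. The sketch below is illustrative only: the counts are made up, the function name `evaluate` is hypothetical, and the average F-measure is taken as the unweighted mean over the two classes (a support-weighted mean is another common choice and coincides with it when the classes are balanced).

```python
def evaluate(tp, fn, fp, tn):
    """Return (accuracy, average F-measure, Cohen's kappa) for a 2x2
    confusion matrix. Assumes no row or column of the matrix is all zero."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total  # share of correctly classified instances

    def f_measure(tp_c, fp_c, fn_c):
        precision = tp_c / (tp_c + fp_c)
        recall = tp_c / (tp_c + fn_c)
        return 2 * precision * recall / (precision + recall)

    f_pos = f_measure(tp, fp, fn)  # class 1 treated as positive
    f_neg = f_measure(tn, fn, fp)  # class 0 treated as positive
    avg_f = (f_pos + f_neg) / 2

    # Kappa compares observed agreement with agreement expected by chance.
    p_yes = ((tp + fn) / total) * ((tp + fp) / total)
    p_no = ((fp + tn) / total) * ((fn + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, avg_f, kappa

# Hypothetical fold of 100 instances, 50 per class.
accuracy, avg_f, kappa = evaluate(tp=45, fn=5, fp=10, tn=40)
```

For these made-up counts the accuracy is 0.85 and kappa is 0.70, which shows how kappa discounts the 50% agreement a balanced fold would reach by chance.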
Prediction result
 The algorithms were used to assess the model on
the test set, which was the last 100 instances of
the originally supplied *.csv file, with unlabeled
classes.
 The criterion for choosing the best classifier is that
the predicted classes should be close to 50
zero-labeled and 50 one-labeled instances.
 k-NN predicted 48 instances with class label 0
and 52 with class label 1.
 Logistic regression predicted 44 instances with class
label 0 and 56 with class label 1.
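The selection criterion can be made concrete by measuring how far each classifier's predicted count of zero-labeled instances is from the target of 50. The counts below are the ones reported on the slide; the dictionary layout and the helper `distance_from_target` are hypothetical.

```python
# (predicted class-0 count, predicted class-1 count) on the 100 test instances
predicted_counts = {
    "k-NN": (48, 52),
    "logistic": (44, 56),
}

def distance_from_target(counts, target=50):
    """Absolute gap between the predicted class-0 count and the target."""
    return abs(counts[0] - target)

gaps = {name: distance_from_target(c) for name, c in predicted_counts.items()}
closest = min(gaps, key=gaps.get)
```

Here k-NN is off by 2 instances and logistic by 6, so k-NN is the closer of the two under this criterion.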
Conclusion
 Different supervised and unsupervised data
mining algorithms were implemented to
identify the best classifier for a given loan
default dataset.
 J48 with a 0.50 confidence factor has the best
classification accuracy among its variants.
 The classifier with the best classification
accuracy is k-nearest neighbors with k = 3.
Future work
 It is recommended that the implemented
classifiers be applied to larger datasets
to further validate their accuracy.
