1. Libyan Academy for Postgraduate Studies
Subject name: Knowledge Management & Data Mining
Presented by: Hani Ahmed Jolgham
Semester: Spring 2023
2. Outline
Problem statement.
Objectives.
Significance.
Introduction.
Related Work.
Methodology.
Results and Discussion.
Conclusion
Future Work.
3. Problem statement
High loan default incidence leads to low efficiency in loan collection in the Philippines.
Financial institutions believe that loan defaults could be predicted using recommender systems. Such systems could be driven by machine learning approaches.
4. Objectives
This study proposes solutions that aim at helping loan-extending institutions. This could be achieved by applying supervised and unsupervised data mining approaches to derive the best classifier of loan default.
Four algorithms were implemented to identify the best classifier: J48, k-nearest neighbors (k-NN), naïve Bayes, and logistic regression.
5. Significance
This classifier (recommender system) will assist credit risk management in making decisions about loan approval.
Taking the right decision is a key factor in bank institutions' success, since many losses result from wrong decisions and wrong credit loan approvals.
6. Introduction
Technology is rapidly changing, and many
organizations are adapting to such changes,
including bank institutions.
Data mining allows extracting information from the available data and predicting the results of different scenarios, which helps top-level management make business decisions and increase customer familiarity and satisfaction.
Financial sectors use data mining for profitability analysis, customer segmentation, tracing fraudulent transactions, and checking high-risk loan applications.
7. Related work
Data mining is one of the important techniques banks use to discover knowledge from databases.
Hamid and Ahmed [6] presented a new model for classifying loan risks in the banking sector employing data mining. The model aims to predict the standing of loans in the banking sector. The proposed model made use of the J48, Bayes Net, and naïve Bayes algorithms. The study found that the J48 algorithm has the highest accuracy among the three algorithms.
8. Contd.
The study by Lahsasna et al. [17] on predicting loan default introduced a prediction model based on the random forest algorithm. The study's experimental results show that random forest achieves higher prediction accuracy (98%) than the decision tree, support vector machine (SVM), and logistic regression algorithms, which gained only 95%, 75%, and 73% accuracy, respectively.
9. Methodology
Dataset: data on loan default was provided
by a loan-extending agency located in Davao City,
Philippines.
1) The dataset contained 29 attributes.
2) It included 27 explanatory attributes, 1 class attribute, and 1 attribute for ID.
3) It has 1,000 instances.
4) 900 were used for training and cross-validation.
5) 100 were used for prediction as a test set.
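The study uses Weka, but the 900/100 split above can be sketched in Python with pandas. The column names and values below are synthetic stand-ins for the agency's CSV, which is not available here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the loan dataset described on the slide:
# 1,000 instances (the real file has 27 explanatory attributes,
# 1 class attribute, and 1 ID attribute; "amount" is hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": range(1000),
    "amount": rng.normal(5000, 2800, 1000),
    "label": rng.integers(0, 2, 1000),
})

train = df.iloc[:900]   # first 900 rows: training and cross-validation
test = df.iloc[900:]    # last 100 rows: held-out prediction set
```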
11. Data Preparation
1. An unsupervised instance filter replaces missing values with the mean for numeric attributes and the mode ("most frequent value") for nominal attributes.
2. An attribute with many missing values, or an attribute with only one distinct value, can be considered irrelevant, as it provides no variation towards the target attribute (i.e., the class).
Ex: attributes coded A12 and A13. A13 has 1,000 instances with one distinct response (F), while A12 has 999 instances with one distinct response (T). They must be removed.
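The two preparation steps above (mean/mode imputation and dropping single-valued attributes) can be sketched outside Weka with scikit-learn. The toy frame and its column names are hypothetical; `A13` mimics the single-valued attribute on the slide.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a numeric and a nominal column containing gaps,
# plus a single-valued column like A13 on the slide (made-up data).
df = pd.DataFrame({
    "income": [3000.0, np.nan, 5000.0, 4000.0],
    "status": ["T", "T", np.nan, "F"],
    "A13":    ["F", "F", "F", "F"],   # one distinct value -> irrelevant
})

# Drop attributes with only one distinct (non-missing) value.
df = df.loc[:, df.nunique(dropna=True) > 1]

# Mean for numeric attributes, mode ("most frequent") for nominal ones.
num = df.select_dtypes(include="number").columns
nom = df.columns.difference(num)
df[num] = SimpleImputer(strategy="mean").fit_transform(df[num])
df[nom] = SimpleImputer(strategy="most_frequent").fit_transform(df[nom])
```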
12. Data normality
The standard deviation (SD = 2,822.7) is high, which may lead to less reliable prediction performance.
Therefore, we need to rescale the numeric attributes to values between 0 and 1.
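The rescaling to [0, 1] described above is min-max normalization; a minimal sketch with scikit-learn, using made-up values in place of the real attribute:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A numeric attribute with a large spread (the slide reports SD ~ 2,822.7);
# the values here are illustrative only.
amounts = np.array([[250.0], [1200.0], [9800.0], [4500.0]])

# Maps the smallest value to 0 and the largest to 1.
scaled = MinMaxScaler().fit_transform(amounts)
```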
13. Feature selection
Feature selection ensures that relevant attributes are included prior to the classification procedure, selecting the attributes most correlated with the class attribute.
Prominent feature selection algorithms in Weka:
1. correlation-based feature selection
2. information gain-based (entropy-based) feature selection
3. learner-based feature selection
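As a rough analogue of the second option (information-gain-based selection) outside Weka, scikit-learn can rank attributes by mutual information with the class attribute. The synthetic data below stands in for the loan attributes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in: 10 attributes, of which 3 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Information-gain-style selection: keep the 3 attributes sharing
# the most mutual information with the class attribute.
selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
X_sel = selector.transform(X)
```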
14. Data imbalance & Cross validation
The training set appears to have an imbalance of class attributes: 250 instances of class label 0 versus 650 instances of class label 1.
Algorithms tend to become biased, skewing the overall accuracy towards the class with more observations.
To solve this, a filter called synthetic minority oversampling technique (SMOTE) was applied: raising the 250 zero-labeled instances to 650 requires adding 400 synthetic instances to be at par with the one-labeled class.
Thirteen folds of 100 instances each were used for the cross-validation process to develop the prediction model on the training set.
To achieve better classification performance, each fold should have 50 instances of the zero-labeled class and 50 instances of the one-labeled class.
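The study applies Weka's SMOTE filter; the core idea, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched with numpy and scikit-learn. This is a simplified illustration assuming all-numeric features, not the Weka implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen sample and one of its k nearest minority
    neighbors (the basic SMOTE idea)."""
    if rng is None:
        rng = np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, idx = nn.kneighbors(minority)        # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = idx[i, rng.integers(1, k + 1)]  # a random true neighbor
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Slide's numbers: grow the 250 zero-labeled samples by 400 to reach 650
# (the 5 features here are synthetic placeholders).
minority = np.random.default_rng(1).normal(size=(250, 5))
new = smote(minority, n_new=400)
```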
15. Results and discussion
Classification accuracy
There were 11 cross-validations conducted using the four classifiers.
The confidence factor in J48 was set to 0.25, 0.5, 0.75, and 1.0; any branch with a confidence level below the threshold is pruned from the tree to reduce complexity and overcome overfitting.
17. Classifier comparison
Three factors were considered in assessment:
1. Average F-measure
2. Correctly classified instances
3. Kappa statistic
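The three assessment factors above correspond to standard metrics that scikit-learn also provides, sketched here on hypothetical labels from one cross-validation run:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Hypothetical true vs. predicted class labels for illustration.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

f1 = f1_score(y_true, y_pred, average="macro")  # average F-measure
acc = accuracy_score(y_true, y_pred)            # correctly classified share
kappa = cohen_kappa_score(y_true, y_pred)       # agreement beyond chance
```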
18. Prediction result
The algorithms were used to assess the model on the test set, which was the last 100 instances of the original supplied *.csv file, with unlabeled classes.
The best classifier is the one whose predicted classes come closest to 50 zero-labeled and 50 one-labeled instances.
k-NN predicted 48 instances with class label 0 and 52 with class label 1.
Logistic regression predicted 44 instances with class label 0 and 56 with class label 1.
19. Conclusion
Different supervised and unsupervised data mining algorithms were implemented to identify the best classifier for a given loan default dataset.
J48 with a 0.50 confidence factor has the best classification accuracy among its variants.
Overall, the classifier with the best classification accuracy is k-nearest neighbors with k = 3.
20. Future work
It is recommended that the implemented classifiers be applied to bigger datasets to further validate their accuracy.