SlideShare a Scribd company logo
1
LOAN PREDICTION SYSTEM USING
MACHINE LEARNING
A Report for the Evaluation of Project
Submitted by
SOUMA MAITI (27500120016)
TRIASHA SAMANTA (27500120005)
In partial fulfillment for the award of the degree
Of
BACHELOR OF TECHNOLOGY (B. TECH) IN
COMPUTER SCIENCE AND ENGINEERING
MAULANA ABUL KALAM AZAD UNIVERSITY OF
TECHNOLOGY
Under the Supervision of Dr. Dhrubajyoti Ghosh
DECEMBER-2023
2
OMDAYAL GROUP OF INSTITUTION
SCHOOL OF COMPUTING AND SCIENCE AND
ENGINEERING
BONAFIDE CERTIFICATE
Certified that this project report “LOAN
PREDICTION SYSTEM” is the bonafide work of
“SOUMA MAITI (27500120016)” & “TRIASHA
SAMANTA(27500120005)” who carried out the
project work under my supervision.
Dipankar Hazra
Teacher in charge
Computing Science &
Engineering Department
OMDAYAL GROUP OF
INSTITUTION.
Dr. Dhrubajyoti Ghosh
Assistant Professor
Computing Science &
Engineering Department.
OMDAYAL GROUP OF
INSTITUTION
3
ACKNOWLEDGEMENT
I am pleased to acknowledge my sincere thanks to Board
of Management of OMDAYAL GROUP OF INSTITUTION for
their kind encouragement in doing this project and for
completing it successfully. I am grateful to them.
I convey thanks to Dipankar Hazra, Head of the
Department, Department of Computer Science
Engineering for providing us the necessary support and
details at the right time during the progressive reviews.
I would like to express my sincere and deep sense of
gratitude to my Project Guide Dr Dhrubajyoti Ghosh,
Assistant Professor for her valuable guidance,
suggestions, and constant encouragement paved way for
the successful completion of my project.
I wish to express our thanks to all Teaching and Non-
teaching staff members of the Department of COMPUTER
SCIENCE AND ENGINEERING who were helpful in many
ways for the completion of the project.
4
LOAN PREDICTION
SYSTEM USING
MACHINE
LEARNING
5
CHAPTER
NO
TITLE PAGE
NO
1. Abstract of the Project 6
2. Literature Survey 7-10
3. Introduction: Machine Learning 11-12
3.1 How Machine Learning Works 12-13
3.2 Terminologies of Machine Learning 13
3.3 Machine Learning Types 14-16
4. Various Machine Learning Algorithm 17
4.1 Logistic Regression 16-18
4.2 Support Vector Classifier 18-20
4.3 Random Forest Algorithm 20
4.4 Naive Bayes 21-22
4.5 Decision Tree Classification 22-23
4.6 Gradient Boosting Algorithm 24-25
5. Implementation Of Model 26
5.1 Implementation Of Model: Existing System 26
5.2 Implementation Of Model: Proposed System 26-28
6. Requirement: Hardware & Software 29
6.1 Various Python Libraries Used 30
7. Architecture Design 31
7.1 Sequence Diagram & Use Case Diagram 32-33
7.2 Activity Diagram & Collaboration Diagram 34
8. Methodology 35-48
9. Source Code 49-56
10. Summary & Conclusion 57
11. References 58
TABLE OF CONTENTS
6
1) ABSTRACT OF THE PROJECT
Technology has boosted the existence of human kind the quality
of life they live. Every day we are planning to create something
new and different. We have a solution for every other problem
we have machines to support our lives and make us somewhat
complete in the banking sector candidate gets proofs/ backup
before approval of the loan amount. The application approved or
not approved depends upon the historical data of the candidate
by the system. Every day lots of people applying for the loan in
the banking sector but Bank would have limited funds. In this
case, the right prediction would be very beneficial using some
classes-function algorithm. An example the logistic regression,
random forest classifier, support vector machine classifier, etc.
A Bank's profit and loss depend on the amount of the loans that
is whether the Client or customer is paying back the loan.
Recovery of loans is the most important for the banking sector.
The improvement process plays an important role in the banking
sector. The historical data of candidates was used to build a
machine learning model using different classification algorithms.
The main objective of this paper is to predict whether a new
applicant granted the loan or not using machine learning models
trained on the historical data.
7
2) LITERATURE SURVEY
A literature review is a body of text that aims to review the critical
points of current knowledge on and/or methodological approaches to
a particular topic. It is secondary sources and discuss published
information in a particular subject area and sometimes information
in a particular subject area within a certain time period. Its ultimate
goal is to bring the reader up to date with current literature on a topic
and forms the basis for another goal, such as future research that
may be needed in the area and precedes a research proposal and may
be just a simple summary of sources. Usually, it has an
organizational pattern and combines both summary and synthesis. A
summary is a recap of important information about the source, but a
synthesis is a reorganization, reshuffling of information. It might
give a new interpretation of old material or combine new with old
interpretations or it might trace the intellectual progression of the
field, including major debates. Depending on the situation, the
literature review may evaluate the sources and advise the reader on
the most pertinent or relevant of them.
Review of Literature Survey:
1) Title: A benchmark of machine learning approaches for
credit score prediction.
Author: Vincenzo Moscato, Antonio Picariello, Giancarl Sperlí
Year : 2021
Credit risk assessment plays a key role for correctly supporting
financial institutes in defining their bank policies and commercial
strategies. Over the last decade, the emerging of social lending
platforms has disrupted traditional services for credit risk assessment.
Through these platforms, lenders and borrowers can easily interact
among them without any involvement of financial institutes. In
particular, they support borrowers in the fundraising process,
enabling the participation of any number and size of lenders.
However, the lack of lenders’ experience and missing or uncertain
information about 4 borrower’s credit history can increase risks in
8
social lending platforms, requiring an accurate credit risk scoring. To
overcome such issues, the credit risk assessment problem of
financial operations is usually modeled as a binary problem on the
basis of debt’s repayment and proper machine learning techniques
can be consequently exploited. In this paper, we propose a bench
marking study of some of the most used credit risk scoring models to
predict if a loan will be repaid in a P2P platform. We deal with a
class imbalance problem and leverage several classifiers among the
most used in the literature, which are based on different sampling
techniques. A real social lending platform (Lending Club) data-set,
composed by 877,956 samples, has been used to perform the
experimental analysis considering different evaluation metrics (i.e.
AUC, Sensitivity, Specificity), also comparing the obtained
outcomes with respect to the state-of-the-art approaches. Finally, the
three best approaches have also been evaluated in terms of their
explain-ability by means of different explainable Artificial
Intelligence (XAI) tools.
2) Title : An Approach for Prediction of Loan approval using
Machine Learning Algorithm.
Author: Mohammad Ahmad Sheikh, Amit Kumar Goel, Tapas
Kumar
Year : 2020
In our banking system, banks have many products to sell but main
source of income of any banks is on its credit line. So they can earn
from interest of those loans which they credits.A bank’s profit or a
loss depends to a large extent on loans i.e. whether the customers are
paying back the loan or defaulting. By predicting the loan defaulters,
the bank can reduce its Non Performing Assets. This makes the
study of this phenomenon very important. Previous research in this
era has shown that there are so many methods to study the problem
of controlling loan default. But as the right predictions are very
important for the maximization of profits, it is essential to study the
nature of the different methods and their comparison. A very
important approach in predictive analytic is used to study the
problem of predicting loan defaulters: The Logistic regression model.
The data is collected from the Kaggle for studying and prediction.
9
Logistic Regression models have been performed and the different
measures of performances are computed. The models are compared
on the basis of the performance measures such as sensitivity and 5
specificity. The final results have shown that the model produce
different results.Model is marginally better because it includes
variables (personal attributes of customer like age, purpose, credit
history, credit amount, credit duration, etc.) other than checking
account information (which shows wealth of a customer) that should
be taken into account to calculate the probability of default on loan
correctly. Therefore, by using a logistic regression approach, the
right customers to be targeted for granting loan can be easily
detected by evaluating their likelihood of default on loan. The model
concludes that a bank should not only target the rich customers for
granting loan but it should assess the other attributes of a customer
as well which play a very important part in credit granting decisions
and predicting the loan defaulters.
3) Title : Predict Loan Approval in Banking System Machine
Learning Approach for Cooperative Banks Loan Approval.
Author: Amruta S. Aphale, Dr. Sandeep R. Shinde.
Year : 2020
In today’s world, taking loans from financial institutions has become
a very common phenomenon. Everyday a large number of people
make application for loans, for a variety of purposes. But all these
applicants are not reliable and everyone cannot be approved. Every
year, we read about a number of cases where people do not repay
bulk of the loan amount to the banks due to which they suffers huge
losses. The risk associated with making a decision on loan approval
is immense. So the idea of this project is to gather loan data from
multiple data sources and use various machine learning algorithms
on this data to extract important information. This model can be used
by the organizations in making the right decision to approve or reject
the loan request of the customers. In this paper, we examine a real
bank credit data and conduct several machine learning algorithms on
the data for that determine credit worthiness of customers in order to
formulate bank risk automated system.
10
4) Title : Loan Approval Prediction Using Machine Learning
Author: Yash Divate, Prashant Rana, Pratik Chavan
Year : 2021
With the upgrade in the financial area loads of individuals are
applying for bank advances however the bank has its restricted
resources which it needs to allow to restricted individuals just, so
discovering to whom the credit can be conceded which will be a
more secure choice for the bank is a commonplace interaction. So in
this task we attempt to decrease this danger factor behind choosing
the protected individual in order to save bunches of bank endeavors
and resources. This is finished by mining the Data of the past
records of individuals to whom the advance was conceded
previously and based on these records/encounters the machine was
prepared utilizing the AI model which give the most precise
outcome. The principle objective of this paper is to anticipate
whether relegating the advance to specific individual will be
protected or not. This paper is separated into four areas (i)Data
Collection (ii) Comparison of AI models on gathered information (iii)
Training of framework on most encouraging model (iv) Testing.
11
3) INTRODUCTION
The immense increase in capitalism, the fast-paced development and instantaneous
changes in the lifestyle has us in awe. EMI, loans at nominal rate, housing loans, vehicle
loans, these are some of the few words which have skyrocketed from the past few years.
The needs, wants and demands have never been increased this before. People gets loan
from banks; however, it may be baffling for the bankers to judge who can pay back the
loan nevertheless the bank shouldn’t be in loss. Banks earn most of their profits through
the loan sanctioning. Generally, banks pass loan after completing the numerous
verification processes despite all these, it is still not confirmed that the borrower will pay
back the loan or not. To get over the dilemma, I have built up a prediction model which
says if the loan has been assigned in the safe hands or not. Government agencies like keep
under surveillance why one person got a loan and the other person could not. In Machine
Learning techniques which include classification and prediction can be applied to conquer
this to a brilliant extent. Machine learning has eased today’s world by developing these
prediction models. Here we will be using the fine techniques of machine learning –
Decision tree algorithm to build this prediction model for loan assessment. It is as so
because decision tree gives accuracy in the prediction and is often used in the industry for
these models.
Machine Learning :
Machine learning (ML) is a type of artificial intelligence (AI) focused on building
computer systems that learn from data. The broad range of techniques ML encompasses
enables software applications to improve their performance over time.
Machine learning algorithm are trained to find relationships and patterns in data. They use
historical data as input to make predictions, classify information, cluster data points,
reduce dimensionality and even help generate new content, as demonstrated by new ML-
fueled applications such as Chat GPT, Dall-E 2 and GitHub Copilot.
Machine learning is widely applicable across many industries. Recommendation System ,
for example, are used by e-commerce, social media and news organizations to suggest
content based on a customer's past behavior. Machine learning algorithms and machine
vision are a critical component of self-driving cars, helping them navigate the roads safely.
In healthcare, machine learning is used to diagnose and suggest treatment plans. Other
12
common ML use cases include fraud detection, spam filtering, malware threat detection,
predictive maintenance and business process automation.
While machine learning is a powerful tool for solving problems, improving business
operations and automating tasks, it's also a complex and challenging technology, requiring
deep expertise and significant resources. Choosing the right algorithm for a task calls for a
strong grasp of mathematics and statistics. Training machine learning algorithms often
involves large amounts of good quality data to produce accurate results. The results
themselves can be difficult to understand -- particularly the outcomes produced by
complex algorithms, such as the deep learning neural network patterned after the human
brain. And ML Models can be costly to run and tune.
Still, most organizations either directly or indirectly through ML-infused products are
embracing machine learning. According to the "2023 AI and Machine Learning Research
Report" from Rackspace Technology, 72% of companies surveyed said that AI and
machine learning are part of their IT and business strategies, and 69% described AI/ML as
the most important technology. Companies that have adopted it reported using it to
improve existing processes (67%), predict business performance and industry trends (60%)
and reduce risk (53%).
Tech Target's guide to machine learning is a primer on this important field of computer
science, further explaining what machine learning is, how to do it and how it is applied in
business. You'll find information on the various types of machine learning algorithms, the
challenges and best practices associated with developing and destroying ML Models, and
what the future holds for machine learning. Throughout the guide, there are hyperlinks to
related articles that cover the topics in greater depth.
3.1) How Machine Learning works:
Machine learning uses two types of techniques: supervised learning, which trains a model
on known input and output data so that it can predict future outputs, and unsupervised
learning, which finds hidden patterns or intrinsic structures in input data.
The Machine Learning process starts with inputting training data into the selected
algorithm. Training data being known or unknown data to develop the final Machine
Learning algorithm. The type of training data input does impact the algorithm, and that
concept will be covered further momentarily.
13
3.2) Terminologies of Machine Learning :
 Model : A model is a specific representation learned from data by applying some
machine learning algorithm. A model is also called hypothesis.
 Feature : A feature is an individual measurable property of our data. A set of
numeric features can be conveniently described by a feature vector. Feature
vectors are fed as input to the model. For example, in order to predict a fruit,
there may be features like color, smell, taste, etc.
 Target(Label): A target variable or label is the value to be predicted by our
model. For the fruit example discussed in the features section, the label with each
set of input would be the name of the fruit like apple, orange, banana, etc.
 Training: The idea is to give a set of inputs(features) and it’s expected
outputs(labels), so after training, we will have a model (hypothesis) that will then
map new data to one of the categories trained on.
 Prediction: Once our model is ready, it can be fed a set of inputs to which it will
provide a predicted output(label).
Fig 1- How machine learning works
14
3.3) Machine Learning Types:
Learning is, of course, a very wide domain. Consequently, the field of machine learning
has branched into several sub-fields dealing with
different types of learning tasks. We give a rough taxonomy of learning paradigms, aiming
to provide some perspective of where the content sits within the wide field of machine
learning.
Terms frequently used are:
 Labeled data : Data consisting of a set of training examples, where each example is a
pair consisting of an input and a desired output value (also called the supervisory signal,
labels, etc)
 Classification : The goal is to predict discrete values, e.g. {1,0}, {True, False},
{spam, not spam}.
 Regression : The goal is to predict continuous values, e.g. home prices.
There some variations of how to define the types of Machine Learning Algorithms but
commonly they can be divided into categories according to their purpose and the main
categories are the following:
 Supervised learning
 Unsupervised Learning
 Semi-supervised Learning
 Reinforcement Learning
3.3.1) Supervised learning:
Learned to perform that task. Supervised learning algorithms include classification and
regression. Classification algorithms are used when the outputs are restricted to a limited
set of values, and regression algorithms are used when the outputs may have any numerical
value within a range. Similarity learning is an area of supervised machine learning closely
related to regression and classification, but the goal is to learn from examples using a
similarity function that measures how similar or related two objects are. It has applications
in ranking, recommendation systems, visual identity tracking, face verification, and
speaker verification. 16 In the case of semi-supervised learning algorithms, some of the
training examples are missing training labels, but they can nevertheless be used to improve
15
the quality of a model. In weakly supervised learning, the training labels are noisy, limited,
or imprecise; however, these labels are often cheaper to obtain, resulting in larger effective
training sets.
List of Common Algorithms:
• Nearest Neighbour
• Naive Bayes
• Decision Trees
• Linear Regression
• Support Vector Machines (SVM)
• Neural Networks
3.3.2) Unsupervised learning:
Unsupervised learning algorithms take a set of data that contains only inputs, and find
structure in the data, like grouping or clustering of data points. The algorithms, therefore,
learn from test data that has not been labeled, classified or categorized. Instead of
responding to feedback, unsupervised learning algorithms identify commonalities in the
data and react based on the presence or absence of such commonalities in each new piece
of data. A central application of unsupervised learning is in the field of density estimation
in statistics, though unsupervised learning encompasses other domains involving
summarizing and explaining data features. Cluster analysis is the assignment of a set of
observations into subsets (called clusters) so that observations within the same cluster are
similar according to one or more pre designated criteria, while observations drawn from
different clusters are dissimilar. Different clustering techniques make different
assumptions on the structure of the data, often defined by some similarity metric and
evaluated, for example, by internal compactness, or the similarity between members of the
same cluster, and separation, the difference between clusters. Other methods are based on
estimated density and graph connectivity.
List of Common Algorithms: • K-means clustering ,Association Rules.
16
3.3.3) Semi-supervised learning:
Semi-supervised learning falls between unsupervised learning (without any labeled
training data) and supervised learning (with completely labeled training data). Many
machine-learning researchers have found that unlabeled data, when used in conjunction
with a small amount of labeled data, can produce a considerable improvement in learning
accuracy.
3.3.4) Reinforcement Learning:
Reinforcement Learning is a type of Machine Learning, and thereby also a branch of
Artificial Intelligence. It allows machines and software agents to automatically determine
the ideal behaviour within a specific context, in order to maximize its performance. Simple
reward feedback is required for the agent to learn its behaviour; this is known as the
reinforcement signal. There are many different algorithms that tackle this issue. As a
matter of fact, Reinforcement Learning is defined by a specific type of problem, and all its
solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is
supposed decide the best action to select based on his current state. When this step is
repeated, the problem is known as a Markov Decision Process.
List of Common Algorithms:
• Q-Learning • Temporal Difference (TD) • Deep Adversarial Networks
Use cases: Some applications of the reinforcement learning algorithms are computer
played board games (Chess, Go), robotic hands, and self-driving cars.
17
4) Various Machine Learning Algorithms Widely Used :
4.1 ) Logistic regression:
Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables. Logistic regression predicts
the output of a categorical dependent variable. Therefore the outcome must be a
categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. But
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1. Logistic regression is used for solving the classification problems
It uses a logistic function called a sigmoid function to map predictions and their
probabilities. The sigmoid function refers to an S- shaped curve that converts any real
value to a range between 0 and 1.
The logit function is mathematically represented as
Fig 2- logistic Regression
18
4.2) Support Vector Classifier:
A support vector machine (SVM) is a type of supervised machine learning algorithm used
in machine learning to solve classification and regression tasks; SVMs are particularly
good at solving binary classification problems, which require classifying the elements of a
data set into two groups.
The aim of a support vector machine algorithm is to find the best possible line, or decision
boundary, that separates the data points of different data classes. This boundary is called
a hyperplane when working in high-dimensional feature spaces. The idea is to maximize
the margin, which is the distance between the hyperplane and the closest data points of
each category, thus making it easy to distinguish data classes.
SVMs are useful for analyzing complex data that can't be separated by a simple straight
line. Called nonlinear SMVs, they do this by using a mathematical trick that transforms
data into higher-dimensional space, where it is easier to find a boundary.
How do support vector machines work?
The key idea behind SVMs is to transform the input data into a higher-dimensional feature
space.
This transformation makes it easier to find a linear separation or to more effectively
classify the data set.
To do this, SVMs use a kernel function. Instead of explicitly calculating the coordinates of
the transformed space, the kernel function enables the SVM to implicitly compute the dot
products between the transformed feature vectors and avoid handling expensive,
unnecessary computations for extreme cases.
SVMs can handle both linearly separable and non-linearly separable data. They do this by
using different types of kernel functions, such as the linear kernel, polynomial kernel or
radial basis function (RBF) kernel. These kernels enable SVMs to effectively capture
complex relationships and patterns in the data.
During the training phase, SVMs use a mathematical formulation to find the optimal
hyperplane in a higher-dimensional space, often called the kernel space. This hyperplane is
crucial because it maximizes the margin between data points of different classes, while
minimizing the classification errors.
19
The kernel function plays a critical role in SVMs, as it makes it possible to map the data
from the original feature space to the kernel space. The choice of kernel function can have
a significant impact on the performance of the SVM algorithm; choosing the best kernel
function for a particular problem depends on the characteristics of the data.
Some of the most popular kernel functions for SVMs are the following:
 Linear Kernel : This is the simplest kernel function, and it maps the data to a higher-
dimensional space, where the data is linearly separable.
 Polynomial Kernel: This kernel function is more powerful than the linear kernel, and
it can be used to map the data to a higher-dimensional space, where the data is non-
linearly separable.
 RBF Kernel: This is the most popular kernel function for SVMs, and it is effective for
a wide range of classification problems.
 Sigmoid Kernel: This kernel function is similar to the RBF kernel, but it has a
different shape that can be useful for some classification problems.
The choice of kernel function for an SVM algorithm is a trade-off between accuracy and
complexity. The more powerful kernel functions, such as the RBF kernel, can achieve
higher accuracy than the simpler kernel functions, but they also require more data and
Fig 3- Support Vector Classifier
20
computation time to train the SVM algorithm. But this is becoming less of an issue due to
technological advances.
Once trained, SVMs can classify new, unseen data points by determining which side of the
decision boundary they fall on. The output of the SVM is the class label associated with
the side of the decision boundary.
4.3) Random Forest Algorithm:
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and based on the majority votes of predictions,
and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem
of over fitting.
Fig 4 - Random Forest Algorithm
21
4.4) Naive Bayes Algorithm:
It is a classification technique based on Bayes’ Theorem with an independence assumption
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.
The Naive Bayes classifier is a popular supervised machine learning algorithm used for
classification tasks such as text classification. It belongs to the family of generative
learning algorithms, which means that it models the distribution of inputs for a given class
or category. This approach is based on the assumption that the features of the input data are
conditionally independent given the class, allowing the algorithm to make predictions
quickly and accurately. In statistics, naive Bayes classifiers are considered as simple
probabilistic classifiers that apply Bayes’ theorem. This theorem is based on the
probability of a hypothesis, given the data and some prior knowledge. The naive Bayes
classifier assumes that all features in the input data are independent of each other,
which is often not true in real-world scenarios. However, despite this simplifying
assumption, the naive Bayes classifier is widely used because of its efficiency and good
performance in many real-world applications.
Moreover, it is worth noting that naive Bayes classifiers are among the simplest Bayesian
network models, yet they can achieve high accuracy levels when coupled with kernel
density estimation. This technique involves using a kernel function to estimate the
probability density function of the input data, allowing the classifier to improve its
performance in complex scenarios where the data distribution is not well-defined. As a
result, the naive Bayes classifier is a powerful tool in machine learning, particularly in text
classification, spam filtering, and sentiment analysis, among others.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches
in diameter. Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability that this fruit is
an apple and that is why it is known as ‘Naive’.
An NB model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification
methods.
Bayes theorem provides a way of computing posterior probability P(c|x) from P(c), P(x)
and P(x|c). Look at the equation below:
22
Above,
 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
 P(c) is the prior probability of class.
 P(x|c) is the likelihood which is the probability of the predictor given class.
 P(x) is the prior probability of the predictor
4.5) Decision Tree Classification Algorithm:
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a datasets, branches
represent the decision rules and each leaf node represents the outcome. In a Decision tree,
there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used to
make any decision and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branch.
Fig 5- Equation of Naive Bayes
23
Decision Tree Terminologies:
 Root Node: Root node is from where the decision tree starts. It represents the entire
datasets, which further gets divided into two or more homogeneous sets
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
Fig 6- Decision Tree
24
4.6) Gradient Boosting Algorithm:
Gradient Boosting is a powerful boosting algorithm that combines several weak learners
into strong learners, in which each new model is trained to minimize the loss function such
as mean squared error or cross-entropy of the previous model using gradient descent. In
each iteration, the algorithm computes the gradient of the loss function with respect to the
predictions of the current ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the ensemble, and the process
is repeated until a stopping criterion is met.
In contrast to Ada Boost, the weights of the training instances are not tweaked, instead,
each predictor is trained using the residual errors of the predecessor as labels. There is a
technique called the Gradient Boosted Trees whose base learner is CART (Classification
and Regression Trees). The below diagram explains how gradient-boosted trees are trained
for regression problems.
Fig 7- Gradient Boosting Classifier
25
The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the
labels y. The predictions labeled y1(hat) are used to determine the training set residual
errors r1. Tree2 is then trained using the feature matrix X and the residual errors r1 of
Tree1 as labels. The predicted results r1(hat) are then used to determine the residual r2.
The process is repeated until all the M trees forming the ensemble are trained. There is an
important parameter used in this technique known as Shrinkage. Shrinkage refers to the
fact that the prediction of each tree in the ensemble is shrunk after it is multiplied by the
learning rate (eta) which ranges between 0 to 1. There is a trade-off between eta and the
number of estimators, decreasing learning rate needs to be compensated with increasing
estimators in order to reach certain model performance. Since all trees are trained now,
predictions can be made. Each tree predicts a label and the final prediction is given by the
formula
y(pred) = y1 + (eta * r1) + (eta * r2) + ....... + (eta * rN)
26
5) IMPLEMENTATION OF MODEL
5.1) EXISTING SYSTEM :
Banks need to analyze for the person who applies for the loan will repay the loan or not.
Sometime it happens that customer has provided partial data to the bank, in this case
person may get the loan without proper verification and bank may end up with loss.
Bankers cannot analyze the huge amounts of data manually, it may become a big headache
to check whether a person will repay its loan or not. It is very much necessary to know the
person getting loan is going in safe hand or not. So, it is pretty much important to have a
automated model which should predict the customer getting the loan will repay the loan or
not.
Disadvantage : To apply the loan we need to go to bank to apply it
The model will be able to predict whether a loan applicant will default on a given
loan.The system architecture is as below.
5.2) PROPOSED SYSTEM
The proposed model focuses on predicting the credibility of customers for loan repayment
by analyzing their details. The input to the model is the customer details collected. On the
output from the classifier, decision on whether to approve or reject the customer request
can be made. Using different data analytic tools loan prediction and there severity can be
forecast ed. In this process it is required to train the data using different algorithms and
then compare user data with trained data to predict the nature of loan. The training data set
is now supplied to machine learning model; on the basis of this data set the model is
trained. Every new applicant details filled at the time of application form acts as a test data
set. After the operation of testing, 8 model predict whether the new applicant is a fit case
for approval of the loan or not based upon the inference it conclude on the basis of the
training data sets. By providing real time input on the web app. In our project, Logistic
Regression gives high accuracy level compared with other algorithms. Finally, we are
predicting the result via data visualization and display the predicted output using web app
using flask.
27
The proposed model focuses on predicting the credibility of customers for loan repayment
by analyzing their details. The input to the model is the customer details collected. On the
output from the classifier, decision on whether to approve or reject the customer request can
be made. Using different data analytic tools loan prediction and there severity can be
forecast ed. In this process it is required to train the data using different algorithms and then
compare user data with trained data to predict the nature of loan. The training data set is
now supplied to machine learning model; on the basis of this data set the model is trained.
Every new applicant details filled at the time of application form acts as a test data set. After
the operation of testing, 8 model predict whether the new applicant is a fit case for approval
of the loan or not based upon the inference it conclude on the basis of the training data sets.
By providing real time input on the web app. In our project, Logistic Regression gives high
accuracy level compared with other algorithms. Finally, we are predicting the result via data
visualization and display the predicted output using web app using flask.
Advantage: No need to go to bank We can do the transaction from house, we can
consume the time doing from home.
Random
Forest
Loan
Train model Naive Bayes
Predict
best model Default 1
Logistic
Regression
0
Fig 8- Proposed Model
28
 Step 1: The Loan application goes through the trained model where the three
classification algorithms are applied.
 Step 2: The machine learning with the best performance in accuracy is selected.
 Step 3 : The machine learning algorithm is applied to the loan application.
 Step 4: The machine learning algorithm determines the probability of default. 1, being
true and 0 being false
29
6) REQUIREMENT SPECIFICATIONS
Prediction of modernized loan approval system based on machine learning approach is a
loan approval system from where we can know whether the loan will pass or not. In this
system, we take some data from the user like his monthly income, marriage status, loan
amount, loan duration, etc. Then the bank will decide according to its parameters whether
the client will get the loan or not. So there is a classification system, in this system, a
training set is employed to make the model and the classifier may classify the data items
into their appropriate class. A test datasets is created that trains the data and gives the
appropriate result that, is the client potential and can repay the loan. Prediction of a
modernized loan approval system is incredibly helpful for banks and also the clients. This
system checks the candidate on his priority basis. Customer can submit his application
directly to the bank so the bank will do the whole process, no third party or stockholder will
interfere in it. And finally, the bank will decide that the candidate is deserving or not on its
priority basis. The only object of this research paper is that the deserving candidate gets
straight forward and quick results.
HARDWARE AND SOFTWARE SPECIFICATION
HARDWARE REQUIREMENTS
● Hard disk : 500 GB and above. ● Processor : i3 and above.
● Ram : 4GB and above.
SOFTWARE REQUIREMENTS
● Operating System : Windows 10/11 ● Tools :Jupiter Note Book IDE
●Programming Language: Python 3 ● Streamlit App
● Visual Studio Code Editor
30
6.1) Python Libraries Used:
The machine learning models are implemented using python version 3.7 on
a Jupyter notebook with the listed libraries: numpy, pandas , matplotlib,
seaborn , and sklearn.
 Jupyter notebooks are a web-based interface in which you can
write, visualize, and execute python code in cells. It is good for
exploratory analysis and enable to run individual code cells.
 Numpy is a Python library that may be used to work with multi-
dimensional arrays, linear algebra, the Fourier transform, and matrices.
 Pandas is a data manipulation and analysis package written in
Python.
 Matplotlib is a Python package that allows you to create
static,animated, and interactive visualizations.
 Seaborn is a matplotlib-based python data visualization package. It
has a high-level interface for creating visually appealing and instructive
statistics visuals.
 Sklearn is a Python toolkit that allows you to create machine
learning and statistical models including clustering, classification, and
regression.
31
7) ARCHITECTURE DESIGN
7.1) Architecture Diagram:
Datasets Prepossessing
User Input
Web App
32
7.2) Sequence Diagram:
A Sequence diagram is a kind of interaction diagram that shows how
processes operate with one another and in what order. It is a construct of
Message Sequence diagrams are sometimes called event diagrams, event
sceneries and timing diagram..
Fig 9- Sequence Diagram
33
7.3) Use Case Diagram:
Unified Modeling Language (UML) is a standardized general-purpose
modeling language in the field of software engineering. The standard is
managed and was created by the Object Management Group. UML
includes a set of graphic notation techniques to create visual models of
software intensive systems. This language is used to specify, visualize,
modify, construct and document the artifacts of an object oriented
software intensive system under development.
Fig 10 - Use Case Diagram
34
7.4) Activity Diagram:
Activity diagram is a graphical representation of workflows of
stepwise activities and actions with support for choice, iteration and
concurrency. An activity diagram shows the overallflow of control.
7.5) Collaboration Diagram:
DATA
COLLECTION
DATA
PREPROCESSING
MACHINE
LEARNING
ALGORITHM
LOAN
PREDICTION
WEB
APPLICATION
●OUTPUT
DATA DATA DATA
DATA
DATA
Fig 11 - Activity Diagram
Fig 12 - Collaboration Diagram
35
8) METHEDOLOGY
DATA PREPROCESSING:
Data preprocessing is a process of preparing the raw data and making it
suitable for a machine learning model. It is the first and crucial step
while creating a machine learning model.
Data preprocessing is required tasks for cleaning the data and making it
suitable for a machine learning model which also increases the accuracy
and efficiency of a machine learning model.
It involves below steps:
 Getting the dataset
 Importing libraries
 Importing datasets
 Finding Missing Data
 Encoding Categorical Data
 Splitting dataset into training and test set
 Feature scaling
Data-set:
Datasets is provided to Machine Learning models on the basis of the
facts this version is trained.
We have collected the dataset from a website called Kaggle.
There are a total of 614 rows and 13 columns in the dataset.
There are columns for Loan_Id, Gender, Married or Not, No. Of
Dependents, Education Background of the loan seeker, Employment
status of the loan seeker, Income of Applicant, Income of Co-
applicant,Loan Amount, Credit History of the applicant and the loan
status of the applicant.
36
Importing Python libraries:
In order to perform data prepossessing using Python, we need to import
some predefined Python libraries. These libraries are used to perform
some specific jobs. There are three specific libraries that we will use for
data prepossessing.
We have imported python libraries like Pandas, Numpy, Seaborn, Sci-kit
Learn, matplotlib for our work.
Importing the Data set:
Now We have imported the dataset which we will use as historical data to
train the model.
Fig-13 - Data Set
37
Understanding the Data:
First of all we use the data.describe() method to shows the important
information from the data-set.
It provides the count, mean, standard deviation
(std), min, quartile and max in its output.
Another method is info () , This method show us the information about
the data set
38
As we can see in the output.
 There are 614 entries
 There are total 13 features (0 to 12)
 There are three types of datatype dtypes: float64(4), int64(1), object(8)
 It's Memory usage that is, memory usage: 62.5+ KB
 Also, We can check how many missing values available in the Non-
Null Count column
Exploratory Data Analysis:
In this section, We learn about extra information about data and it's
characteristics.
39
Data Cleaning:
In this step of data cleaning by checking we have eliminated all the
missing values because they affect the accuracy of the model. We have
achieved this by either filling the the missing values with a mean or mode
function or by dropping all missing values.
First we have checked the number of null values in each column of the
dataset.
Now we have checked the percentage of of missing values in each
column of the dataset.
40
Now we will handle the missing data entries in the data set. We can see
the number of missing values in four columns in the data set-
Gender,Dependents,Loan Amount, Loan Term are less than 5% so we
will drop the rows with the missing values.
Rest two columns- Self_Employed and Credit History which have greater
than 5% missing values we will use mode function to fill up the null
values.
41
Handling the categorical columns:
Loan_Status feature numeric values, we will replace the columns with
numeric values.
There are some values in the dependent column as 3+. we will replace it
by numeric value 4.
Feature Scaling:
Feature Selection is the method of reducing the input variable to your
model by using only relevant data and getting rid of noise in data.
It is the process of automatically choosing relevant features for your
machine learning model based on the type of problem you are trying to
solve. We do this by including or excluding important features without
changing them. It helps in cutting down the noise in our data and
reducing the size of our input data.
42
Splitting The Datasets Into The Training Set And Test Set
& Applying K-Fold Cross Validation:
Now we have split the datasets into two sets for training and testing. We
will apply cross validation and will check the accuracy of the various
models we have used in this work.
Implementing various machine learning models:
We will implement all the five machine learning algorithm- Logistic
Regression, Support Vector Classifier, Random Forest Classifier,
Decision Tree Classifier and Gradient Boosting Classifier and check the
accuracy of all the algorithms with average cross validation score
43
Logistic Regression:
So the accuracy of this model is 0.8018018018018
Support Vector Classifier:
So the accuracy of this model is 0.79279279279
Decision Tree Classifier:
44
So the accuracy of this model is 0.7117117117117117
Random Forest Classifier:
So the accuracy of this model is 0.765765765765
Gradient Boosting Classifier:
So the accuracy of this model is 0.79279279
45
HYPERPARAMETER TUNING:
Hyper parameters tuning is the process of determining the right
combination of hyper parameters that maximizes the model performance.
It works by running multiple trials in a single training process. Each trial
is a complete execution of your training application with values for your
chosen hyper parameters, set within the limits you specify. This process
once finished will give you the set of hyper parameter values that are best
suited for the model to give optimal results.
We have used Random Search CV for the tuning.
46
After hyper parameter tuning the accuracy of the best 3 models are-
So we get the Random Forest Classifier model gives the best accuracy of
80.67 so we have choose this model for this work.
47
Model Deployment:
Finally, we are done so far. The last step is to deploy our model in
production map.
So we need to export our model and bind with web application API.
Using pickle we can export our model and store in to rf_model.pkl file, so
we can early access this file and calculate customize prediction using
Web App API.
User Interface:
The user interface of the app is made on Streamlit App. Streamlit is a free
and open-source framework to rapidly build and share beautiful machine
learning and data science web apps.
It is a Python-based library specifically designed for machine learning
engineers.
We have loaded the rf_model.pkl file in the streamlit app. code
48
Predicting Results:
We wil give some input in the app and app will give us the output
whether the loan is approved or not.
49
9) Source Code:
import pandas as pd
data = pd.read_csv('loan_prediction.csv')
# Loan_ID : Unique Loan ID
# Gender : Male/ Female
# Married : Applicant married (Y/N)
# Dependents : Number of dependents
# Education : Applicant Education (Graduate/ Under Graduate)
# Self_Employed : Self employed (Y/N)
# ApplicantIncome : Applicant income
# CoapplicantIncome : Coapplicant income
# LoanAmount : Loan amount in thousands of dollars
# Loan_Amount_Term : Term of loan in months
# Credit_History : Credit history meets guidelines yes or no
# Property_Area : Urban/ Semi Urban/ Rural
# Loan_Status : Loan approved (Y/N) this is the target variable
1. Display Top 5 Rows of The Dataset
data.head()
2. Check Last 5 Rows of The Dataset
data.tail()
3. Find Shape of Our Dataset (Number of Rows And Number of
Columns)
data.shape
print("Number of Rows",data.shape[0])print("Number of
Columns",data.shape[1])
4. Get Information About Our Dataset Like Total Number Rows,
Total Number of Columns, Datatypes of Each Column And Memory
Requirement
data.info()
5. Check Null Values In The Dataset
data.isnull().sum()
data.isnull().sum()*100 / len(data)
6. Handling The missing Values
data = data.drop('Loan_ID',axis=1)
50
data.head(1)
columns = ['Gender','Dependents','LoanAmount','Loan_Amount_Term']
data = data.dropna(subset=columns)
data.isnull().sum()*100 / len(data)
data['Self_Employed'].mode()[0]
data['Self_Employed']=data['Self_Employed'].fillna(data['Self_Employe
d'].mode()[0])
data.isnull().sum()*100 / len(data)
data['Gender'].unique()
data['Self_Employed'].unique()
data['Credit_History'].mode()[0]
ata['Credit_History']=data['Credit_History'].fillna(data['Credit_History']
mode()[0])
data.isnull().sum()*100 / len(data)
7. Handling Categorical Columns
data.sample(5)
data['Dependents']=data['Dependents'].replace(to_replace="3+",value=
4')
data['Dependents'].unique()
data['Loan_Status'].unique()
data['Gender'] =
data['Gender'].map({'Male':1,'Female':0}).astype('int')
data['Married'] =
data['Married'].map({'Yes':1,'No':0}).astype('int')data['Education'] =
data['Education'].map({'Graduate':1,'Not
Graduate':0}).astype('int')data['Self_Employed'] =
data['Self_Employed'].map({'Yes':1,'No':0}).astype('int')data['Property_
Area'] =
data['Property_Area'].map({'Rural':0,'Semiurban':2,'Urban':1}).astype('i
nt')
data['Loan_Status'] = data['Loan_Status'].map({'Y':1,'N':0}).astype('int')
data.head()
8. Store Feature Matrix In X And Response (Target) In Vector y
X = data.drop('Loan_Status',axis=1)
y = data['Loan_Status']
9. Feature Scaling
51
data.head()
cols =
['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_
Term']
from sklearn.preprocessing import StandardScalerst =
StandardScaler()X[cols]=st.fit_transform(X[cols])X
10. Splitting The Dataset Into The Training Set And Test Set &
Applying K-Fold Cross Validation
from sklearn.model_selection import train_test_splitfrom
sklearn.model_selection import cross_val_scorefrom sklearn.metrics
import accuracy_scoreimport numpy as np
model_df={}def model_val(model,X,y):
X_train,X_test,y_train,y_test=train_test_split(X,y,
test_size=0.20,
random_state=42)
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(f"{model} accuracy is {accuracy_score(y_test,y_pred)}")
score = cross_val_score(model,X,y,cv=5)
print(f"{model} Avg cross val score is {np.mean(score)}")
model_df[model]=round(np.mean(score)*100,2)
model_df
11. Logistic Regression
from sklearn.linear_model import LogisticRegressionmodel =
LogisticRegression()model_val(model,X,y)
LogisticRegression() accuracy is 0.8018018018018018
LogisticRegression() Avg cross val score is 0.8047829647829647
12. SVC
from sklearn import svmmodel = svm.SVC()model_val(model,X,y)
SVC() accuracy is 0.7927927927927928
SVC() Avg cross val score is 0.7938902538902539
13. Decision Tree Classifier
From sklearn.tree import DecisionTreeClassifiermodel =
DecisionTreeClassifier()model_val(model,X,y)
DecisionTreeClassifier() accuracy is 0.7117117117117117
52
DecisionTreeClassifier() Avg cross val score is 0.7089434889434889
14. Random Forest Classifier
from sklearn.ensemble import RandomForestClassifiermodel
=RandomForestClassifier()model_val(model,X,y)
RandomForestClassifier() accuracy is 0.7567567567567568
RandomForestClassifier() Avg cross val score is 0.7776412776412777
15. Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifiermodel
=GradientBoostingClassifier()model_val(model,X,y)
GradientBoostingClassifier() accuracy is 0.7927927927927928
GradientBoostingClassifier() Avg cross val score is 0.774004914004914
16. Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
Logistic Regression
log_reg_grid={"C":np.logspace(-4,4,20),
"solver":['liblinear']}
rs_log_reg=RandomizedSearchCV(LogisticRegression(),
param_distributions=log_reg_grid,
n_iter=20,cv=5,verbose=True)
rs_log_reg.fit(X,y)
rs_log_reg.best_score_
rs_log_reg.best_params_
SVC
svc_grid = {'C':[0.25,0.50,0.75,1],"kernel":["linear"]}
rs_svc=RandomizedSearchCV(svm.SVC(),
param_distributions=svc_grid,
cv=5,
n_iter=20,
verbose=True)
rs_svc.fit(X,y)
rs_svc.best_score_
rs_svc.best_params_
53
Random Forest Classifier
RandomForestClassifier()
rf_grid={'n_estimators':np.arange(10,1000,10),
'max_features':['auto','sqrt'],
'max_depth':[None,3,5,10,20,30],
'min_samples_split':[2,5,20,50,100],
'min_samples_leaf':[1,2,5,10]
}
rs_rf=RandomizedSearchCV(RandomForestClassifier(),
param_distributions=rf_grid,
cv=5,
n_iter=20,
verbose=True)
rs_rf.fit(X,y)
rs_rf.best_score_
17. Save The Model
X = data.drop('Loan_Status',axis=1)y = data['Loan_Status']
rf = RandomForestClassifier(n_estimators=270,
min_samples_split=5,
min_samples_leaf=5,
max_features='sqrt',
max_depth=5)
rf.fit(X,y)
RandomForestClassifier(max_depth=5, max_features='sqrt',
min_samples_leaf=5,
min_samples_split=5, n_estimators=270)
:
import joblib
joblib.dump(rf,'loan_status_predict')
['loan_status_predict']
model = joblib.load('loan_status_predict')
import pandas as pddf = pd.DataFrame({
'Gender':1,
54
'Married':1,
'Dependents':2,
'Education':0,
'Self_Employed':0,
'ApplicantIncome':2889,
'CoapplicantIncome':0.0,
'LoanAmount':45,
'Loan_Amount_Term':180,
'Credit_History':0,
'Property_Area':1},index=[0])
df
result = model.predict(df)
if result==1:
print("Loan Approved")else:
print("Loan Not Approved")
Graphical User Interface(GUI)
import numpy as np
import streamlit as st
import joblib
import pandas as pd
#Loading the model
model=joblib.load('C:/Users/Souma
Maiti/OneDrive/Desktop/Project/rf_model.pkl')
def loan_prediction(inputs):
input_as_np_array=np.array(inputs).reshape(1,-1)
prediction=model.predict(input_as_np_array)
print(prediction)
if (prediction[0]==0):
return 'THE LOAN IS NOT APPROVED'
else:
return 'THE LOAN IS APPROVED FOR YOU'
def main():
55
#give a title
st.title('Loan Status Prediction App')
#Getting the input from user
#GENDER
Gender = st.selectbox(
'Gender',('Male', 'Female'))
st.write('You Selected:', Gender)
if Gender=='Male':
Gender=1
else:
Gender=0
#MARRIED
Married = st.selectbox(
'Married',('Yes', 'No'))
st.write('You Selected:',Married)
if Married=='Yes':
Married=1
else:
Married=0
#DEPENDENTS
Dependents=st.slider('Dependents',0,10,1)
#EDUCATION
Education = st.selectbox(
'Education',('Graduate', 'Not Graduate'))
st.write('You Selected:',Education)
if Education=='Graduate':
Education=1
else:
Education=0
#SELF_EMPLOYED
Self_Employed = st.selectbox(
'Self_Employed',('Yes', 'No'))
st.write('You Selected:',Self_Employed)
if Self_Employed=='Yes':
56
Self_Employed=1
else:
Self_Employed=0
ApplicantIncome =st.text_input('Applicant Income')
CoapplicantIncome=st.text_input('Co-Applicant Income')
LoanAmount=st.text_input('Loan Amount')
Loan_Amount_Term=st.text_input('Loan Amount Terms')
#CREDIT HISTORY
Credit_History= st.selectbox(
'Credit History',('Yes', 'No'))
st.write('You Selected:',Credit_History)
if Credit_History=='Yes':
Credit_History=1
else:
Credit_History=0
#PROPERTY AREA
Property_Area = st.selectbox(
'Property Area',('Rural', 'Semi Urban','Urban'))
st.write('You Selected:',Property_Area)
if Property_Area=='Rural':
Property_Area=0
if Property_Area=='Semi Urban':
Property_Area=1
else:
Property_Area=2
#Code for prediction
pred = ''
if st.button('Predict'):
pred=loan_prediction([Gender,Married,Dependents,Education,Self_Empl
oyed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_
Term,Credit_History,Property_Area])
st.success(pred)
if __name__== '__main__':
main()
57
10) SUMMARY AND CONCLUSION
SUMMARY:
This project objective is to predict the Loan Approval of the
user. So this online banking loan approval system will reduce
the paper work and reduce the wastage of bank asserts and
efforts and also saves the valuable time of the customer.
In our work, a total of five machine learning algorithms which
includes Logistic Regression, Decision Tree, Random forest
classification,Support Vectior Classifier and Gradient Boosting
Classifier are applied to predict the loan approval of customers.
The experimental results conclude that the accuracy of Random
Forest Classification machine learning algorithm is better
compared to others machine learning approach.
CONCLUSION:
The analytical process started from data cleaning and processing,
missing value, exploratory analysis and finally model building
and evaluation. The best accuracy on public test set is higher
accuracy score has been found out. This application can help to
find the Prediction of Bank Loan Approval.
FUTURE WORK:
• Bank Loan Approval prediction to connect with cloud.
• To optimize the work to implement in Artificial Intelligence
environment.
58
11) REFERENCES:
[1] Amruta S. Aphale , Dr. Sandeep R. Shinde, 2020, Predict Loan
Approval in Banking System Machine Learning Approach for
Cooperative Banks Loan Approval, International Journal Of Engineering
Research & Technology (IJERT) Volume 09, Issue 08 (August 2020)
[2] Ashwini S. Kadam, Shraddha R. Nikam, Ankita A. Aher, Gayatri V.
Shelke, Amar S.Chandgude, “Prediction for Loan Approval using
Machine Learning Algorithm” (IRJET) Volume: 08 Issue: 04 | Apr 2021.
[3] M. A. Sheikh, A. K. Goel and T. Kumar, "An Approach for Prediction
of Loan Approval using Machine Learning Algorithm," 2020
International Conference on Electronics and Sustainable Communication
Systems (ICESC), 2020, pp. 490- 494, doi:
10.1109/ICESC48915.2020.9155614.
[4] Rath, Golak & Das, Debasish & Acharya, Biswaranjan. (2021).
Modern Approach for Loan Sanctioning in Banks Using Machine
Learning. Pages={179-188} 10.1007/978-981-15-5243-4_15.
[5] Vincenzo Moscato, Antonio Picariello, Giancarlo Sperlí, A
benchmark of machine learning approaches for credit score prediction,
Expert Systems with Applications, Volume 165, 2021, 113986, ISSN
0957-4174.
[6]Yash Divate, Prashant Rana, Pratik Chavan, “Loan Approval
Prediction Using Machine Learning” International Research Journal of
Engineering and Technology (IRJET) Volume: 08 Issue: 05 | May 2021
[7] WWW.JAVAPOINT.COM

More Related Content

PPTX
F.I.T.T. Principles
PPTX
Fitt principle
PPTX
Loan Prediction System Using Machine Learning.pptx
PDF
Rise of Cloud AI in India 2024 - Bessemer Venture Partners
PPTX
Cyber security business plan
PDF
Variable expenses and contribution profit across fintech business models - Be...
PPTX
Diabetes Mellitus
PPTX
Hypertension
F.I.T.T. Principles
Fitt principle
Loan Prediction System Using Machine Learning.pptx
Rise of Cloud AI in India 2024 - Bessemer Venture Partners
Cyber security business plan
Variable expenses and contribution profit across fintech business models - Be...
Diabetes Mellitus
Hypertension

What's hot (20)

PDF
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
PPTX
Introduction to computer graphics
PDF
Loan approval prediction based on machine learning approach
PPTX
raster and random scan
PPT
computer graphics
PPTX
Computer graphics ppt
PPTX
Machine Learning for Disease Prediction
PPTX
Multimedia System Architecture details.pptx
PPT
Putnam Resource allocation model.ppt
PPTX
Vision of cloud computing
PPTX
Vm migration techniques
PPTX
Color Models
PDF
Chapter 2. Digital Image Fundamentals.pdf
PPTX
Risk Mitigation, Monitoring and Management Plan (RMMM)
PPTX
Machine learning seminar ppt
PPTX
Fundamentals and image compression models
PPT
Software Engineering (Project Planning & Estimation)
PPT
Computer graphics1
PPT
Digital Image Processing
PPTX
Prediction of heart disease using machine learning.pptx
LOAN APPROVAL PRDICTION SYSTEM USING MACHINE LEARNING.
Introduction to computer graphics
Loan approval prediction based on machine learning approach
raster and random scan
computer graphics
Computer graphics ppt
Machine Learning for Disease Prediction
Multimedia System Architecture details.pptx
Putnam Resource allocation model.ppt
Vision of cloud computing
Vm migration techniques
Color Models
Chapter 2. Digital Image Fundamentals.pdf
Risk Mitigation, Monitoring and Management Plan (RMMM)
Machine learning seminar ppt
Fundamentals and image compression models
Software Engineering (Project Planning & Estimation)
Computer graphics1
Digital Image Processing
Prediction of heart disease using machine learning.pptx
Ad

Similar to Loan Prediction System Using Machine Learning Algorithms Project Report (20)

PPTX
RST_REVIEW.ppt is used for the loan prediction
PDF
Sandip Finwmwmmwmwmmmenenneal Project.pdf
PDF
Loan Default Prediction Using Machine Learning Techniques
PDF
Corporate bankruptcy prediction using Deep learning techniques
PDF
Improving the credit scoring model of microfinance
PDF
Applying Convolutional-GRU for Term Deposit Likelihood Prediction
PDF
Supervised and unsupervised data mining approaches in loan default prediction
PPTX
Credit Risk Ppt management analysis varisble.pptx
PDF
B510519.pdf
PDF
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
PDF
Data mining on Financial Data
DOCX
IMPACT ON BANKING SECTOR LIQUIDITY AFTER INCREASE IN USE OF DIGITAL WALLET IN...
PDF
BANK LOAN PREDICTION USING MACHINE LEARNING
PDF
Predictive Analytics in Education Context
PDF
Decision support system using decision tree and neural networks
DOCX
Selection of Entrepreneurs Group(6)
PDF
An Explanation Framework for Interpretable Credit Scoring
PDF
AN EXPLANATION FRAMEWORK FOR INTERPRETABLE CREDIT SCORING
PDF
In Banking Loan Approval Prediction Using Machine Learning
DOCX
This form must be approved for a candidate to register for t
RST_REVIEW.ppt is used for the loan prediction
Sandip Finwmwmmwmwmmmenenneal Project.pdf
Loan Default Prediction Using Machine Learning Techniques
Corporate bankruptcy prediction using Deep learning techniques
Improving the credit scoring model of microfinance
Applying Convolutional-GRU for Term Deposit Likelihood Prediction
Supervised and unsupervised data mining approaches in loan default prediction
Credit Risk Ppt management analysis varisble.pptx
B510519.pdf
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
Data mining on Financial Data
IMPACT ON BANKING SECTOR LIQUIDITY AFTER INCREASE IN USE OF DIGITAL WALLET IN...
BANK LOAN PREDICTION USING MACHINE LEARNING
Predictive Analytics in Education Context
Decision support system using decision tree and neural networks
Selection of Entrepreneurs Group(6)
An Explanation Framework for Interpretable Credit Scoring
AN EXPLANATION FRAMEWORK FOR INTERPRETABLE CREDIT SCORING
In Banking Loan Approval Prediction Using Machine Learning
This form must be approved for a candidate to register for t
Ad

More from Souma Maiti (20)

PDF
Mental Health prrediction system using Machine Learning Algoritms
PPTX
Mental Health Prediction System Using Machine Learning
PPTX
Types of Cyber Security Attacks- Active & Passive Attak
PPTX
E-Commerce Analysis & Strategy Presentation
PPTX
Principles of Network Security-CIAD TRIAD
PDF
Decision Tree in Machine Learning
PDF
Idea on Entreprenaurship
PDF
System Based Attacks - CYBER SECURITY
PDF
Operation Research
PDF
Loan Approval Prediction Using Machine Learning
PDF
Constitution of India
PDF
COMIPLER_DESIGN_1[1].pdf
PDF
Heuristic Search Technique- Hill Climbing
DOCX
SATELLITE INTERNET AND STARLINK
PPTX
Fundamental Steps Of Image Processing
PPTX
Join in SQL - Inner, Self, Outer Join
PPTX
K means Clustering Algorithm
PDF
Errors in Numerical Analysis
PPTX
Open Systems Interconnection (OSI) MODEL
PPTX
Internet of Things(IOT)
Mental Health prrediction system using Machine Learning Algoritms
Mental Health Prediction System Using Machine Learning
Types of Cyber Security Attacks- Active & Passive Attak
E-Commerce Analysis & Strategy Presentation
Principles of Network Security-CIAD TRIAD
Decision Tree in Machine Learning
Idea on Entreprenaurship
System Based Attacks - CYBER SECURITY
Operation Research
Loan Approval Prediction Using Machine Learning
Constitution of India
COMIPLER_DESIGN_1[1].pdf
Heuristic Search Technique- Hill Climbing
SATELLITE INTERNET AND STARLINK
Fundamental Steps Of Image Processing
Join in SQL - Inner, Self, Outer Join
K means Clustering Algorithm
Errors in Numerical Analysis
Open Systems Interconnection (OSI) MODEL
Internet of Things(IOT)

Recently uploaded (20)

PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Machine learning based COVID-19 study performance prediction
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Electronic commerce courselecture one. Pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
KodekX | Application Modernization Development
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PPTX
Cloud computing and distributed systems.
MIND Revenue Release Quarter 2 2025 Press Release
Programs and apps: productivity, graphics, security and other tools
Diabetes mellitus diagnosis method based random forest with bat algorithm
Empathic Computing: Creating Shared Understanding
Digital-Transformation-Roadmap-for-Companies.pptx
Machine learning based COVID-19 study performance prediction
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
MYSQL Presentation for SQL database connectivity
Spectral efficient network and resource selection model in 5G networks
Network Security Unit 5.pdf for BCA BBA.
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Electronic commerce courselecture one. Pdf
Chapter 3 Spatial Domain Image Processing.pdf
NewMind AI Weekly Chronicles - August'25 Week I
KodekX | Application Modernization Development
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Cloud computing and distributed systems.

Loan Prediction System Using Machine Learning Algorithms Project Report

  • 1. 1 LOAN PREDICTION SYSTEM USING MACHINE LEARNING A Report for the Evaluation of Project Submitted by SOUMA MAITI (27500120016) TRIASHA SAMANTA (27500120005) In partial fulfillment for the award of the degree Of BACHELOR OF TECHNOLOGY (B. TECH) IN COMPUTER SCIENCE AND ENGINEERING MAULANA ABUL KALAM AZAD UNIVERSITY OF TECHNOLOGY Under the Supervision of Dr. Dhrubajyoti Ghosh DECEMBER-2023
  • 2. 2 OMDAYAL GROUP OF INSTITUTION SCHOOL OF COMPUTING AND SCIENCE AND ENGINEERING BONAFIDE CERTIFICATE Certified that this project report “LOAN PREDICTION SYSTEM” is the bonafide work of “SOUMA MAITI (27500120016)” & “TRIASHA SAMANTA(27500120005)” who carried out the project work under my supervision. Dipankar Hazra Teacher in charge Computing Science & Engineering Department OMDAYAL GROUP OF INSTITUTION. Dr. Dhrubajyoti Ghosh Assistant Professor Computing Science & Engineering Department. OMDAYAL GROUP OF INSTITUTION
  • 3. 3 ACKNOWLEDGEMENT I am pleased to acknowledge my sincere thanks to Board of Management of OMDAYAL GROUP OF INSTITUTION for their kind encouragement in doing this project and for completing it successfully. I am grateful to them. I convey thanks to Dipankar Hazra, Head of the Department, Department of Computer Science Engineering for providing us the necessary support and details at the right time during the progressive reviews. I would like to express my sincere and deep sense of gratitude to my Project Guide Dr Dhrubajyoti Ghosh, Assistant Professor for her valuable guidance, suggestions, and constant encouragement paved way for the successful completion of my project. I wish to express our thanks to all Teaching and Non- teaching staff members of the Department of COMPUTER SCIENCE AND ENGINEERING who were helpful in many ways for the completion of the project.
  • 5. 5 CHAPTER NO TITLE PAGE NO 1. Abstract of the Project 6 2. Literature Survey 7-10 3. Introduction: Machine Learning 11-12 3.1 How Machine Learning Works 12-13 3.2 Terminologies of Machine Learning 13 3.3 Machine Learning Types 14-16 4. Various Machine Learning Algorithm 17 4.1 Logistic Regression 16-18 4.2 Support Vector Classifier 18-20 4.3 Random Forest Algorithm 20 4.4 Naive Bayes 21-22 4.5 Decision Tree Classification 22-23 4.6 Gradient Boosting Algorithm 24-25 5. Implementation Of Model 26 5.1 Implementation Of Model: Existing System 26 5.2 Implementation Of Model: Proposed System 26-28 6. Requirement: Hardware & Software 29 6.1 Various Python Libraries Used 30 7. Architecture Design 31 7.1 Sequence Diagram & Use Case Diagram 32-33 7.2 Activity Diagram & Collaboration Diagram 34 8. Methodology 35-48 9. Source Code 49-56 10. Summary & Conclusion 57 11. References 58 TABLE OF CONTENTS
  • 6. 6 1) ABSTRACT OF THE PROJECT Technology has boosted the existence of human kind the quality of life they live. Every day we are planning to create something new and different. We have a solution for every other problem we have machines to support our lives and make us somewhat complete in the banking sector candidate gets proofs/ backup before approval of the loan amount. The application approved or not approved depends upon the historical data of the candidate by the system. Every day lots of people applying for the loan in the banking sector but Bank would have limited funds. In this case, the right prediction would be very beneficial using some classes-function algorithm. An example the logistic regression, random forest classifier, support vector machine classifier, etc. A Bank's profit and loss depend on the amount of the loans that is whether the Client or customer is paying back the loan. Recovery of loans is the most important for the banking sector. The improvement process plays an important role in the banking sector. The historical data of candidates was used to build a machine learning model using different classification algorithms. The main objective of this paper is to predict whether a new applicant granted the loan or not using machine learning models trained on the historical data.
  • 7. 7 2) LITERATURE SURVEY A literature review is a body of text that aims to review the critical points of current knowledge on and/or methodological approaches to a particular topic. It is secondary sources and discuss published information in a particular subject area and sometimes information in a particular subject area within a certain time period. Its ultimate goal is to bring the reader up to date with current literature on a topic and forms the basis for another goal, such as future research that may be needed in the area and precedes a research proposal and may be just a simple summary of sources. Usually, it has an organizational pattern and combines both summary and synthesis. A summary is a recap of important information about the source, but a synthesis is a reorganization, reshuffling of information. It might give a new interpretation of old material or combine new with old interpretations or it might trace the intellectual progression of the field, including major debates. Depending on the situation, the literature review may evaluate the sources and advise the reader on the most pertinent or relevant of them. Review of Literature Survey: 1) Title: A benchmark of machine learning approaches for credit score prediction. Author: Vincenzo Moscato, Antonio Picariello, Giancarl Sperlí Year : 2021 Credit risk assessment plays a key role for correctly supporting financial institutes in defining their bank policies and commercial strategies. Over the last decade, the emerging of social lending platforms has disrupted traditional services for credit risk assessment. Through these platforms, lenders and borrowers can easily interact among them without any involvement of financial institutes. In particular, they support borrowers in the fundraising process, enabling the participation of any number and size of lenders. However, the lack of lenders’ experience and missing or uncertain information about 4 borrower’s credit history can increase risks in
  • 8. 8 social lending platforms, requiring an accurate credit risk scoring. To overcome such issues, the credit risk assessment problem of financial operations is usually modeled as a binary problem on the basis of debt’s repayment and proper machine learning techniques can be consequently exploited. In this paper, we propose a bench marking study of some of the most used credit risk scoring models to predict if a loan will be repaid in a P2P platform. We deal with a class imbalance problem and leverage several classifiers among the most used in the literature, which are based on different sampling techniques. A real social lending platform (Lending Club) data-set, composed by 877,956 samples, has been used to perform the experimental analysis considering different evaluation metrics (i.e. AUC, Sensitivity, Specificity), also comparing the obtained outcomes with respect to the state-of-the-art approaches. Finally, the three best approaches have also been evaluated in terms of their explain-ability by means of different explainable Artificial Intelligence (XAI) tools. 2) Title : An Approach for Prediction of Loan approval using Machine Learning Algorithm. Author: Mohammad Ahmad Sheikh, Amit Kumar Goel, Tapas Kumar Year : 2020 In our banking system, banks have many products to sell but main source of income of any banks is on its credit line. So they can earn from interest of those loans which they credits.A bank’s profit or a loss depends to a large extent on loans i.e. whether the customers are paying back the loan or defaulting. By predicting the loan defaulters, the bank can reduce its Non Performing Assets. This makes the study of this phenomenon very important. Previous research in this era has shown that there are so many methods to study the problem of controlling loan default. But as the right predictions are very important for the maximization of profits, it is essential to study the nature of the different methods and their comparison. A very important approach in predictive analytic is used to study the problem of predicting loan defaulters: The Logistic regression model. The data is collected from the Kaggle for studying and prediction.
  • 9. 9 Logistic Regression models have been performed and the different measures of performances are computed. The models are compared on the basis of the performance measures such as sensitivity and 5 specificity. The final results have shown that the model produce different results.Model is marginally better because it includes variables (personal attributes of customer like age, purpose, credit history, credit amount, credit duration, etc.) other than checking account information (which shows wealth of a customer) that should be taken into account to calculate the probability of default on loan correctly. Therefore, by using a logistic regression approach, the right customers to be targeted for granting loan can be easily detected by evaluating their likelihood of default on loan. The model concludes that a bank should not only target the rich customers for granting loan but it should assess the other attributes of a customer as well which play a very important part in credit granting decisions and predicting the loan defaulters. 3) Title : Predict Loan Approval in Banking System Machine Learning Approach for Cooperative Banks Loan Approval. Author: Amruta S. Aphale, Dr. Sandeep R. Shinde. Year : 2020 In today’s world, taking loans from financial institutions has become a very common phenomenon. Everyday a large number of people make application for loans, for a variety of purposes. But all these applicants are not reliable and everyone cannot be approved. Every year, we read about a number of cases where people do not repay bulk of the loan amount to the banks due to which they suffers huge losses. The risk associated with making a decision on loan approval is immense. So the idea of this project is to gather loan data from multiple data sources and use various machine learning algorithms on this data to extract important information. This model can be used by the organizations in making the right decision to approve or reject the loan request of the customers. In this paper, we examine a real bank credit data and conduct several machine learning algorithms on the data for that determine credit worthiness of customers in order to formulate bank risk automated system.
  • 10. 10 4) Title : Loan Approval Prediction Using Machine Learning Author: Yash Divate, Prashant Rana, Pratik Chavan Year : 2021 With the upgrade in the financial area loads of individuals are applying for bank advances however the bank has its restricted resources which it needs to allow to restricted individuals just, so discovering to whom the credit can be conceded which will be a more secure choice for the bank is a commonplace interaction. So in this task we attempt to decrease this danger factor behind choosing the protected individual in order to save bunches of bank endeavors and resources. This is finished by mining the Data of the past records of individuals to whom the advance was conceded previously and based on these records/encounters the machine was prepared utilizing the AI model which give the most precise outcome. The principle objective of this paper is to anticipate whether relegating the advance to specific individual will be protected or not. This paper is separated into four areas (i)Data Collection (ii) Comparison of AI models on gathered information (iii) Training of framework on most encouraging model (iv) Testing.
  • 11. 11 3) INTRODUCTION The immense increase in capitalism, the fast-paced development and instantaneous changes in the lifestyle has us in awe. EMI, loans at nominal rate, housing loans, vehicle loans, these are some of the few words which have skyrocketed from the past few years. The needs, wants and demands have never been increased this before. People gets loan from banks; however, it may be baffling for the bankers to judge who can pay back the loan nevertheless the bank shouldn’t be in loss. Banks earn most of their profits through the loan sanctioning. Generally, banks pass loan after completing the numerous verification processes despite all these, it is still not confirmed that the borrower will pay back the loan or not. To get over the dilemma, I have built up a prediction model which says if the loan has been assigned in the safe hands or not. Government agencies like keep under surveillance why one person got a loan and the other person could not. In Machine Learning techniques which include classification and prediction can be applied to conquer this to a brilliant extent. Machine learning has eased today’s world by developing these prediction models. Here we will be using the fine techniques of machine learning – Decision tree algorithm to build this prediction model for loan assessment. It is as so because decision tree gives accuracy in the prediction and is often used in the industry for these models. Machine Learning : Machine learning (ML) is a type of artificial intelligence (AI) focused on building computer systems that learn from data. The broad range of techniques ML encompasses enables software applications to improve their performance over time. Machine learning algorithm are trained to find relationships and patterns in data. They use historical data as input to make predictions, classify information, cluster data points, reduce dimensionality and even help generate new content, as demonstrated by new ML- fueled applications such as Chat GPT, Dall-E 2 and GitHub Copilot. Machine learning is widely applicable across many industries. Recommendation System , for example, are used by e-commerce, social media and news organizations to suggest content based on a customer's past behavior. Machine learning algorithms and machine vision are a critical component of self-driving cars, helping them navigate the roads safely. In healthcare, machine learning is used to diagnose and suggest treatment plans. Other
  • 12. 12 common ML use cases include fraud detection, spam filtering, malware threat detection, predictive maintenance and business process automation. While machine learning is a powerful tool for solving problems, improving business operations and automating tasks, it's also a complex and challenging technology, requiring deep expertise and significant resources. Choosing the right algorithm for a task calls for a strong grasp of mathematics and statistics. Training machine learning algorithms often involves large amounts of good quality data to produce accurate results. The results themselves can be difficult to understand -- particularly the outcomes produced by complex algorithms, such as the deep learning neural network patterned after the human brain. And ML Models can be costly to run and tune. Still, most organizations either directly or indirectly through ML-infused products are embracing machine learning. According to the "2023 AI and Machine Learning Research Report" from Rackspace Technology, 72% of companies surveyed said that AI and machine learning are part of their IT and business strategies, and 69% described AI/ML as the most important technology. Companies that have adopted it reported using it to improve existing processes (67%), predict business performance and industry trends (60%) and reduce risk (53%). Tech Target's guide to machine learning is a primer on this important field of computer science, further explaining what machine learning is, how to do it and how it is applied in business. You'll find information on the various types of machine learning algorithms, the challenges and best practices associated with developing and destroying ML Models, and what the future holds for machine learning. Throughout the guide, there are hyperlinks to related articles that cover the topics in greater depth. 3.1) How Machine Learning works: Machine learning uses two types of techniques: supervised learning, which trains a model on known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structures in input data. The Machine Learning process starts with inputting training data into the selected algorithm. Training data being known or unknown data to develop the final Machine Learning algorithm. The type of training data input does impact the algorithm, and that concept will be covered further momentarily.
  • 13. 13 3.2) Terminologies of Machine Learning :  Model : A model is a specific representation learned from data by applying some machine learning algorithm. A model is also called hypothesis.  Feature : A feature is an individual measurable property of our data. A set of numeric features can be conveniently described by a feature vector. Feature vectors are fed as input to the model. For example, in order to predict a fruit, there may be features like color, smell, taste, etc.  Target(Label): A target variable or label is the value to be predicted by our model. For the fruit example discussed in the features section, the label with each set of input would be the name of the fruit like apple, orange, banana, etc.  Training: The idea is to give a set of inputs(features) and it’s expected outputs(labels), so after training, we will have a model (hypothesis) that will then map new data to one of the categories trained on.  Prediction: Once our model is ready, it can be fed a set of inputs to which it will provide a predicted output(label). Fig 1- How machine learning works
  • 14. 14 3.3) Machine Learning Types: Learning is, of course, a very wide domain. Consequently, the field of machine learning has branched into several sub-fields dealing with different types of learning tasks. We give a rough taxonomy of learning paradigms, aiming to provide some perspective of where the content sits within the wide field of machine learning. Terms frequently used are:  Labeled data : Data consisting of a set of training examples, where each example is a pair consisting of an input and a desired output value (also called the supervisory signal, labels, etc)  Classification : The goal is to predict discrete values, e.g. {1,0}, {True, False}, {spam, not spam}.  Regression : The goal is to predict continuous values, e.g. home prices. There some variations of how to define the types of Machine Learning Algorithms but commonly they can be divided into categories according to their purpose and the main categories are the following:  Supervised learning  Unsupervised Learning  Semi-supervised Learning  Reinforcement Learning 3.3.1) Supervised learning: Learned to perform that task. Supervised learning algorithms include classification and regression. Classification algorithms are used when the outputs are restricted to a limited set of values, and regression algorithms are used when the outputs may have any numerical value within a range. Similarity learning is an area of supervised machine learning closely related to regression and classification, but the goal is to learn from examples using a similarity function that measures how similar or related two objects are. It has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification. 16 In the case of semi-supervised learning algorithms, some of the training examples are missing training labels, but they can nevertheless be used to improve
  • 15. 15 the quality of a model. In weakly supervised learning, the training labels are noisy, limited, or imprecise; however, these labels are often cheaper to obtain, resulting in larger effective training sets. List of Common Algorithms: • Nearest Neighbour • Naive Bayes • Decision Trees • Linear Regression • Support Vector Machines (SVM) • Neural Networks 3.3.2) Unsupervised learning: Unsupervised learning algorithms take a set of data that contains only inputs, and find structure in the data, like grouping or clustering of data points. The algorithms, therefore, learn from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. A central application of unsupervised learning is in the field of density estimation in statistics, though unsupervised learning encompasses other domains involving summarizing and explaining data features. Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to one or more pre designated criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness, or the similarity between members of the same cluster, and separation, the difference between clusters. Other methods are based on estimated density and graph connectivity. List of Common Algorithms: • K-means clustering ,Association Rules.
  • 16. 16 3.3.3) Semi-supervised learning: Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce a considerable improvement in learning accuracy. 3.3.4) Reinforcement Learning: Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize its performance. Simple reward feedback is required for the agent to learn its behaviour; this is known as the reinforcement signal. There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is supposed decide the best action to select based on his current state. When this step is repeated, the problem is known as a Markov Decision Process. List of Common Algorithms: • Q-Learning • Temporal Difference (TD) • Deep Adversarial Networks Use cases: Some applications of the reinforcement learning algorithms are computer played board games (Chess, Go), robotic hands, and self-driving cars.
  • 17. 17 4) Various Machine Learning Algorithms Widely Used : 4.1 ) Logistic regression: Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables. Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. But instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1. Logistic regression is used for solving the classification problems It uses a logistic function called a sigmoid function to map predictions and their probabilities. The sigmoid function refers to an S- shaped curve that converts any real value to a range between 0 and 1. The logit function is mathematically represented as Fig 2- logistic Regression
  • 18. 18 4.2) Support Vector Classifier: A support vector machine (SVM) is a type of supervised machine learning algorithm used in machine learning to solve classification and regression tasks; SVMs are particularly good at solving binary classification problems, which require classifying the elements of a data set into two groups. The aim of a support vector machine algorithm is to find the best possible line, or decision boundary, that separates the data points of different data classes. This boundary is called a hyperplane when working in high-dimensional feature spaces. The idea is to maximize the margin, which is the distance between the hyperplane and the closest data points of each category, thus making it easy to distinguish data classes. SVMs are useful for analyzing complex data that can't be separated by a simple straight line. Called nonlinear SMVs, they do this by using a mathematical trick that transforms data into higher-dimensional space, where it is easier to find a boundary. How do support vector machines work? The key idea behind SVMs is to transform the input data into a higher-dimensional feature space. This transformation makes it easier to find a linear separation or to more effectively classify the data set. To do this, SVMs use a kernel function. Instead of explicitly calculating the coordinates of the transformed space, the kernel function enables the SVM to implicitly compute the dot products between the transformed feature vectors and avoid handling expensive, unnecessary computations for extreme cases. SVMs can handle both linearly separable and non-linearly separable data. They do this by using different types of kernel functions, such as the linear kernel, polynomial kernel or radial basis function (RBF) kernel. These kernels enable SVMs to effectively capture complex relationships and patterns in the data. During the training phase, SVMs use a mathematical formulation to find the optimal hyperplane in a higher-dimensional space, often called the kernel space. This hyperplane is crucial because it maximizes the margin between data points of different classes, while minimizing the classification errors.
  • 19. 19 The kernel function plays a critical role in SVMs, as it makes it possible to map the data from the original feature space to the kernel space. The choice of kernel function can have a significant impact on the performance of the SVM algorithm; choosing the best kernel function for a particular problem depends on the characteristics of the data. Some of the most popular kernel functions for SVMs are the following:  Linear Kernel : This is the simplest kernel function, and it maps the data to a higher- dimensional space, where the data is linearly separable.  Polynomial Kernel: This kernel function is more powerful than the linear kernel, and it can be used to map the data to a higher-dimensional space, where the data is non- linearly separable.  RBF Kernel: This is the most popular kernel function for SVMs, and it is effective for a wide range of classification problems.  Sigmoid Kernel: This kernel function is similar to the RBF kernel, but it has a different shape that can be useful for some classification problems. The choice of kernel function for an SVM algorithm is a trade-off between accuracy and complexity. The more powerful kernel functions, such as the RBF kernel, can achieve higher accuracy than the simpler kernel functions, but they also require more data and Fig 3- Support Vector Classifier
  • 20. 20 computation time to train the SVM algorithm. But this is becoming less of an issue due to technological advances. Once trained, SVMs can classify new, unseen data points by determining which side of the decision boundary they fall on. The output of the SVM is the class label associated with the side of the decision boundary. 4.3) Random Forest Algorithm: Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and based on the majority votes of predictions, and it predicts the final output. The greater number of trees in the forest leads to higher accuracy and prevents the problem of over fitting. Fig 4 - Random Forest Algorithm
  • 21. 21 4.4) Naive Bayes Algorithm: It is a classification technique based on Bayes’ Theorem with an independence assumption among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The Naive Bayes classifier is a popular supervised machine learning algorithm used for classification tasks such as text classification. It belongs to the family of generative learning algorithms, which means that it models the distribution of inputs for a given class or category. This approach is based on the assumption that the features of the input data are conditionally independent given the class, allowing the algorithm to make predictions quickly and accurately. In statistics, naive Bayes classifiers are considered as simple probabilistic classifiers that apply Bayes’ theorem. This theorem is based on the probability of a hypothesis, given the data and some prior knowledge. The naive Bayes classifier assumes that all features in the input data are independent of each other, which is often not true in real-world scenarios. However, despite this simplifying assumption, the naive Bayes classifier is widely used because of its efficiency and good performance in many real-world applications. Moreover, it is worth noting that naive Bayes classifiers are among the simplest Bayesian network models, yet they can achieve high accuracy levels when coupled with kernel density estimation. This technique involves using a kernel function to estimate the probability density function of the input data, allowing the classifier to improve its performance in complex scenarios where the data distribution is not well-defined. As a result, the naive Bayes classifier is a powerful tool in machine learning, particularly in text classification, spam filtering, and sentiment analysis, among others. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’. An NB model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. Bayes theorem provides a way of computing posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:
  • 22. 22 Above,  P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).  P(c) is the prior probability of class.  P(x|c) is the likelihood which is the probability of the predictor given class.  P(x) is the prior probability of the predictor 4.5) Decision Tree Classification Algorithm: Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a datasets, branches represent the decision rules and each leaf node represents the outcome. In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branch. Fig 5- Equation of Naive Bayes
  • 23. 23 Decision Tree Terminologies:  Root Node: Root node is from where the decision tree starts. It represents the entire datasets, which further gets divided into two or more homogeneous sets  Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf node.  Splitting: Splitting is the process of dividing the decision node/root node into sub- nodes according to the given conditions  Branch/Sub Tree: A tree formed by splitting the tree.  Pruning: Pruning is the process of removing the unwanted branches from the tree.  Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes. Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider the below diagram: Fig 6- Decision Tree
  • 24. 24 4.6) Gradient Boosting Algorithm: Gradient Boosting is a powerful boosting algorithm that combines several weak learners into strong learners, in which each new model is trained to minimize the loss function such as mean squared error or cross-entropy of the previous model using gradient descent. In each iteration, the algorithm computes the gradient of the loss function with respect to the predictions of the current ensemble and then trains a new weak model to minimize this gradient. The predictions of the new model are then added to the ensemble, and the process is repeated until a stopping criterion is met. In contrast to Ada Boost, the weights of the training instances are not tweaked, instead, each predictor is trained using the residual errors of the predecessor as labels. There is a technique called the Gradient Boosted Trees whose base learner is CART (Classification and Regression Trees). The below diagram explains how gradient-boosted trees are trained for regression problems. Fig 7- Gradient Boosting Classifier
  • 25. 25 The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y. The predictions labeled y1(hat) are used to determine the training set residual errors r1. Tree2 is then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The predicted results r1(hat) are then used to determine the residual r2. The process is repeated until all the M trees forming the ensemble are trained. There is an important parameter used in this technique known as Shrinkage. Shrinkage refers to the fact that the prediction of each tree in the ensemble is shrunk after it is multiplied by the learning rate (eta) which ranges between 0 to 1. There is a trade-off between eta and the number of estimators, decreasing learning rate needs to be compensated with increasing estimators in order to reach certain model performance. Since all trees are trained now, predictions can be made. Each tree predicts a label and the final prediction is given by the formula y(pred) = y1 + (eta * r1) + (eta * r2) + ....... + (eta * rN)
  • 26. 26 5) IMPLEMENTATION OF MODEL 5.1) EXISTING SYSTEM : Banks need to analyze for the person who applies for the loan will repay the loan or not. Sometime it happens that customer has provided partial data to the bank, in this case person may get the loan without proper verification and bank may end up with loss. Bankers cannot analyze the huge amounts of data manually, it may become a big headache to check whether a person will repay its loan or not. It is very much necessary to know the person getting loan is going in safe hand or not. So, it is pretty much important to have a automated model which should predict the customer getting the loan will repay the loan or not. Disadvantage : To apply the loan we need to go to bank to apply it The model will be able to predict whether a loan applicant will default on a given loan.The system architecture is as below. 5.2) PROPOSED SYSTEM The proposed model focuses on predicting the credibility of customers for loan repayment by analyzing their details. The input to the model is the customer details collected. On the output from the classifier, decision on whether to approve or reject the customer request can be made. Using different data analytic tools loan prediction and there severity can be forecast ed. In this process it is required to train the data using different algorithms and then compare user data with trained data to predict the nature of loan. The training data set is now supplied to machine learning model; on the basis of this data set the model is trained. Every new applicant details filled at the time of application form acts as a test data set. After the operation of testing, 8 model predict whether the new applicant is a fit case for approval of the loan or not based upon the inference it conclude on the basis of the training data sets. By providing real time input on the web app. In our project, Logistic Regression gives high accuracy level compared with other algorithms. Finally, we are predicting the result via data visualization and display the predicted output using web app using flask.
  • 27. 27 The proposed model focuses on predicting the credibility of customers for loan repayment by analyzing their details. The input to the model is the customer details collected. On the output from the classifier, decision on whether to approve or reject the customer request can be made. Using different data analytic tools loan prediction and there severity can be forecast ed. In this process it is required to train the data using different algorithms and then compare user data with trained data to predict the nature of loan. The training data set is now supplied to machine learning model; on the basis of this data set the model is trained. Every new applicant details filled at the time of application form acts as a test data set. After the operation of testing, 8 model predict whether the new applicant is a fit case for approval of the loan or not based upon the inference it conclude on the basis of the training data sets. By providing real time input on the web app. In our project, Logistic Regression gives high accuracy level compared with other algorithms. Finally, we are predicting the result via data visualization and display the predicted output using web app using flask. Advantage: No need to go to bank We can do the transaction from house, we can consume the time doing from home. Random Forest Loan Train model Naive Bayes Predict best model Default 1 Logistic Regression 0 Fig 8- Proposed Model
  • 28. 28  Step 1: The Loan application goes through the trained model where the three classification algorithms are applied.  Step 2: The machine learning with the best performance in accuracy is selected.  Step 3 : The machine learning algorithm is applied to the loan application.  Step 4: The machine learning algorithm determines the probability of default. 1, being true and 0 being false
  • 29. 29 6) REQUIREMENT SPECIFICATIONS Prediction of modernized loan approval system based on machine learning approach is a loan approval system from where we can know whether the loan will pass or not. In this system, we take some data from the user like his monthly income, marriage status, loan amount, loan duration, etc. Then the bank will decide according to its parameters whether the client will get the loan or not. So there is a classification system, in this system, a training set is employed to make the model and the classifier may classify the data items into their appropriate class. A test datasets is created that trains the data and gives the appropriate result that, is the client potential and can repay the loan. Prediction of a modernized loan approval system is incredibly helpful for banks and also the clients. This system checks the candidate on his priority basis. Customer can submit his application directly to the bank so the bank will do the whole process, no third party or stockholder will interfere in it. And finally, the bank will decide that the candidate is deserving or not on its priority basis. The only object of this research paper is that the deserving candidate gets straight forward and quick results. HARDWARE AND SOFTWARE SPECIFICATION HARDWARE REQUIREMENTS ● Hard disk : 500 GB and above. ● Processor : i3 and above. ● Ram : 4GB and above. SOFTWARE REQUIREMENTS ● Operating System : Windows 10/11 ● Tools :Jupiter Note Book IDE ●Programming Language: Python 3 ● Streamlit App ● Visual Studio Code Editor
  • 30. 30 6.1) Python Libraries Used: The machine learning models are implemented using python version 3.7 on a Jupyter notebook with the listed libraries: numpy, pandas , matplotlib, seaborn , and sklearn.  Jupyter notebooks are a web-based interface in which you can write, visualize, and execute python code in cells. It is good for exploratory analysis and enable to run individual code cells.  Numpy is a Python library that may be used to work with multi- dimensional arrays, linear algebra, the Fourier transform, and matrices.  Pandas is a data manipulation and analysis package written in Python.  Matplotlib is a Python package that allows you to create static,animated, and interactive visualizations.  Seaborn is a matplotlib-based python data visualization package. It has a high-level interface for creating visually appealing and instructive statistics visuals.  Sklearn is a Python toolkit that allows you to create machine learning and statistical models including clustering, classification, and regression.
  • 31. 31 7) ARCHITECTURE DESIGN 7.1) Architecture Diagram: Datasets Prepossessing User Input Web App
  • 32. 32 7.2) Sequence Diagram: A Sequence diagram is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of Message Sequence diagrams are sometimes called event diagrams, event sceneries and timing diagram.. Fig 9- Sequence Diagram
  • 33. 33 7.3) Use Case Diagram: Unified Modeling Language (UML) is a standardized general-purpose modeling language in the field of software engineering. The standard is managed and was created by the Object Management Group. UML includes a set of graphic notation techniques to create visual models of software intensive systems. This language is used to specify, visualize, modify, construct and document the artifacts of an object oriented software intensive system under development. Fig 10 - Use Case Diagram
  • 34. 34 7.4) Activity Diagram: Activity diagram is a graphical representation of workflows of stepwise activities and actions with support for choice, iteration and concurrency. An activity diagram shows the overallflow of control. 7.5) Collaboration Diagram: DATA COLLECTION DATA PREPROCESSING MACHINE LEARNING ALGORITHM LOAN PREDICTION WEB APPLICATION ●OUTPUT DATA DATA DATA DATA DATA Fig 11 - Activity Diagram Fig 12 - Collaboration Diagram
  • 35. 35 8) METHEDOLOGY DATA PREPROCESSING: Data preprocessing is a process of preparing the raw data and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model. Data preprocessing is required tasks for cleaning the data and making it suitable for a machine learning model which also increases the accuracy and efficiency of a machine learning model. It involves below steps:  Getting the dataset  Importing libraries  Importing datasets  Finding Missing Data  Encoding Categorical Data  Splitting dataset into training and test set  Feature scaling Data-set: Datasets is provided to Machine Learning models on the basis of the facts this version is trained. We have collected the dataset from a website called Kaggle. There are a total of 614 rows and 13 columns in the dataset. There are columns for Loan_Id, Gender, Married or Not, No. Of Dependents, Education Background of the loan seeker, Employment status of the loan seeker, Income of Applicant, Income of Co- applicant,Loan Amount, Credit History of the applicant and the loan status of the applicant.
  • 36. 36 Importing Python libraries: In order to perform data prepossessing using Python, we need to import some predefined Python libraries. These libraries are used to perform some specific jobs. There are three specific libraries that we will use for data prepossessing. We have imported python libraries like Pandas, Numpy, Seaborn, Sci-kit Learn, matplotlib for our work. Importing the Data set: Now We have imported the dataset which we will use as historical data to train the model. Fig-13 - Data Set
  • 37. 37 Understanding the Data: First of all we use the data.describe() method to shows the important information from the data-set. It provides the count, mean, standard deviation (std), min, quartile and max in its output. Another method is info () , This method show us the information about the data set
  • 38. 38 As we can see in the output.  There are 614 entries  There are total 13 features (0 to 12)  There are three types of datatype dtypes: float64(4), int64(1), object(8)  It's Memory usage that is, memory usage: 62.5+ KB  Also, We can check how many missing values available in the Non- Null Count column Exploratory Data Analysis: In this section, We learn about extra information about data and it's characteristics.
  • 39. 39 Data Cleaning: In this step of data cleaning by checking we have eliminated all the missing values because they affect the accuracy of the model. We have achieved this by either filling the the missing values with a mean or mode function or by dropping all missing values. First we have checked the number of null values in each column of the dataset. Now we have checked the percentage of of missing values in each column of the dataset.
  • 40. 40 Now we will handle the missing data entries in the data set. We can see the number of missing values in four columns in the data set- Gender,Dependents,Loan Amount, Loan Term are less than 5% so we will drop the rows with the missing values. Rest two columns- Self_Employed and Credit History which have greater than 5% missing values we will use mode function to fill up the null values.
  • 41. 41 Handling the categorical columns: Loan_Status feature numeric values, we will replace the columns with numeric values. There are some values in the dependent column as 3+. we will replace it by numeric value 4. Feature Scaling: Feature Selection is the method of reducing the input variable to your model by using only relevant data and getting rid of noise in data. It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. We do this by including or excluding important features without changing them. It helps in cutting down the noise in our data and reducing the size of our input data.
  • 42. 42 Splitting The Datasets Into The Training Set And Test Set & Applying K-Fold Cross Validation: Now we have split the datasets into two sets for training and testing. We will apply cross validation and will check the accuracy of the various models we have used in this work. Implementing various machine learning models: We will implement all the five machine learning algorithm- Logistic Regression, Support Vector Classifier, Random Forest Classifier, Decision Tree Classifier and Gradient Boosting Classifier and check the accuracy of all the algorithms with average cross validation score
  • 43. 43 Logistic Regression: So the accuracy of this model is 0.8018018018018 Support Vector Classifier: So the accuracy of this model is 0.79279279279 Decision Tree Classifier:
  • 44. 44 So the accuracy of this model is 0.7117117117117117 Random Forest Classifier: So the accuracy of this model is 0.765765765765 Gradient Boosting Classifier: So the accuracy of this model is 0.79279279
  • 45. 45 HYPERPARAMETER TUNING: Hyper parameters tuning is the process of determining the right combination of hyper parameters that maximizes the model performance. It works by running multiple trials in a single training process. Each trial is a complete execution of your training application with values for your chosen hyper parameters, set within the limits you specify. This process once finished will give you the set of hyper parameter values that are best suited for the model to give optimal results. We have used Random Search CV for the tuning.
  • 46. 46 After hyper parameter tuning the accuracy of the best 3 models are- So we get the Random Forest Classifier model gives the best accuracy of 80.67 so we have choose this model for this work.
  • 47. 47 Model Deployment: Finally, we are done so far. The last step is to deploy our model in production map. So we need to export our model and bind with web application API. Using pickle we can export our model and store in to rf_model.pkl file, so we can early access this file and calculate customize prediction using Web App API. User Interface: The user interface of the app is made on Streamlit App. Streamlit is a free and open-source framework to rapidly build and share beautiful machine learning and data science web apps. It is a Python-based library specifically designed for machine learning engineers. We have loaded the rf_model.pkl file in the streamlit app. code
  • 48. 48 Predicting Results: We wil give some input in the app and app will give us the output whether the loan is approved or not.
  • 49. 49 9) Source Code: import pandas as pd data = pd.read_csv('loan_prediction.csv') # Loan_ID : Unique Loan ID # Gender : Male/ Female # Married : Applicant married (Y/N) # Dependents : Number of dependents # Education : Applicant Education (Graduate/ Under Graduate) # Self_Employed : Self employed (Y/N) # ApplicantIncome : Applicant income # CoapplicantIncome : Coapplicant income # LoanAmount : Loan amount in thousands of dollars # Loan_Amount_Term : Term of loan in months # Credit_History : Credit history meets guidelines yes or no # Property_Area : Urban/ Semi Urban/ Rural # Loan_Status : Loan approved (Y/N) this is the target variable 1. Display Top 5 Rows of The Dataset data.head() 2. Check Last 5 Rows of The Dataset data.tail() 3. Find Shape of Our Dataset (Number of Rows And Number of Columns) data.shape print("Number of Rows",data.shape[0])print("Number of Columns",data.shape[1]) 4. Get Information About Our Dataset Like Total Number Rows, Total Number of Columns, Datatypes of Each Column And Memory Requirement data.info() 5. Check Null Values In The Dataset data.isnull().sum() data.isnull().sum()*100 / len(data) 6. Handling The missing Values data = data.drop('Loan_ID',axis=1)
  • 50. 50 data.head(1) columns = ['Gender','Dependents','LoanAmount','Loan_Amount_Term'] data = data.dropna(subset=columns) data.isnull().sum()*100 / len(data) data['Self_Employed'].mode()[0] data['Self_Employed']=data['Self_Employed'].fillna(data['Self_Employe d'].mode()[0]) data.isnull().sum()*100 / len(data) data['Gender'].unique() data['Self_Employed'].unique() data['Credit_History'].mode()[0] ata['Credit_History']=data['Credit_History'].fillna(data['Credit_History'] mode()[0]) data.isnull().sum()*100 / len(data) 7. Handling Categorical Columns data.sample(5) data['Dependents']=data['Dependents'].replace(to_replace="3+",value= 4') data['Dependents'].unique() data['Loan_Status'].unique() data['Gender'] = data['Gender'].map({'Male':1,'Female':0}).astype('int') data['Married'] = data['Married'].map({'Yes':1,'No':0}).astype('int')data['Education'] = data['Education'].map({'Graduate':1,'Not Graduate':0}).astype('int')data['Self_Employed'] = data['Self_Employed'].map({'Yes':1,'No':0}).astype('int')data['Property_ Area'] = data['Property_Area'].map({'Rural':0,'Semiurban':2,'Urban':1}).astype('i nt') data['Loan_Status'] = data['Loan_Status'].map({'Y':1,'N':0}).astype('int') data.head() 8. Store Feature Matrix In X And Response (Target) In Vector y X = data.drop('Loan_Status',axis=1) y = data['Loan_Status'] 9. Feature Scaling
  • 51. 51 data.head() cols = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_ Term'] from sklearn.preprocessing import StandardScalerst = StandardScaler()X[cols]=st.fit_transform(X[cols])X 10. Splitting The Dataset Into The Training Set And Test Set & Applying K-Fold Cross Validation from sklearn.model_selection import train_test_splitfrom sklearn.model_selection import cross_val_scorefrom sklearn.metrics import accuracy_scoreimport numpy as np model_df={}def model_val(model,X,y): X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.20, random_state=42) model.fit(X_train,y_train) y_pred=model.predict(X_test) print(f"{model} accuracy is {accuracy_score(y_test,y_pred)}") score = cross_val_score(model,X,y,cv=5) print(f"{model} Avg cross val score is {np.mean(score)}") model_df[model]=round(np.mean(score)*100,2) model_df 11. Logistic Regression from sklearn.linear_model import LogisticRegressionmodel = LogisticRegression()model_val(model,X,y) LogisticRegression() accuracy is 0.8018018018018018 LogisticRegression() Avg cross val score is 0.8047829647829647 12. SVC from sklearn import svmmodel = svm.SVC()model_val(model,X,y) SVC() accuracy is 0.7927927927927928 SVC() Avg cross val score is 0.7938902538902539 13. Decision Tree Classifier From sklearn.tree import DecisionTreeClassifiermodel = DecisionTreeClassifier()model_val(model,X,y) DecisionTreeClassifier() accuracy is 0.7117117117117117
  • 52. 52 DecisionTreeClassifier() Avg cross val score is 0.7089434889434889 14. Random Forest Classifier from sklearn.ensemble import RandomForestClassifiermodel =RandomForestClassifier()model_val(model,X,y) RandomForestClassifier() accuracy is 0.7567567567567568 RandomForestClassifier() Avg cross val score is 0.7776412776412777 15. Gradient Boosting Classifier from sklearn.ensemble import GradientBoostingClassifiermodel =GradientBoostingClassifier()model_val(model,X,y) GradientBoostingClassifier() accuracy is 0.7927927927927928 GradientBoostingClassifier() Avg cross val score is 0.774004914004914 16. Hyperparameter Tuning from sklearn.model_selection import RandomizedSearchCV Logistic Regression log_reg_grid={"C":np.logspace(-4,4,20), "solver":['liblinear']} rs_log_reg=RandomizedSearchCV(LogisticRegression(), param_distributions=log_reg_grid, n_iter=20,cv=5,verbose=True) rs_log_reg.fit(X,y) rs_log_reg.best_score_ rs_log_reg.best_params_ SVC svc_grid = {'C':[0.25,0.50,0.75,1],"kernel":["linear"]} rs_svc=RandomizedSearchCV(svm.SVC(), param_distributions=svc_grid, cv=5, n_iter=20, verbose=True) rs_svc.fit(X,y) rs_svc.best_score_ rs_svc.best_params_
  • 53. 53 Random Forest Classifier RandomForestClassifier() rf_grid={'n_estimators':np.arange(10,1000,10), 'max_features':['auto','sqrt'], 'max_depth':[None,3,5,10,20,30], 'min_samples_split':[2,5,20,50,100], 'min_samples_leaf':[1,2,5,10] } rs_rf=RandomizedSearchCV(RandomForestClassifier(), param_distributions=rf_grid, cv=5, n_iter=20, verbose=True) rs_rf.fit(X,y) rs_rf.best_score_ 17. Save The Model X = data.drop('Loan_Status',axis=1)y = data['Loan_Status'] rf = RandomForestClassifier(n_estimators=270, min_samples_split=5, min_samples_leaf=5, max_features='sqrt', max_depth=5) rf.fit(X,y) RandomForestClassifier(max_depth=5, max_features='sqrt', min_samples_leaf=5, min_samples_split=5, n_estimators=270) : import joblib joblib.dump(rf,'loan_status_predict') ['loan_status_predict'] model = joblib.load('loan_status_predict') import pandas as pddf = pd.DataFrame({ 'Gender':1,
  • 54. 54 'Married':1, 'Dependents':2, 'Education':0, 'Self_Employed':0, 'ApplicantIncome':2889, 'CoapplicantIncome':0.0, 'LoanAmount':45, 'Loan_Amount_Term':180, 'Credit_History':0, 'Property_Area':1},index=[0]) df result = model.predict(df) if result==1: print("Loan Approved")else: print("Loan Not Approved") Graphical User Interface(GUI) import numpy as np import streamlit as st import joblib import pandas as pd #Loading the model model=joblib.load('C:/Users/Souma Maiti/OneDrive/Desktop/Project/rf_model.pkl') def loan_prediction(inputs): input_as_np_array=np.array(inputs).reshape(1,-1) prediction=model.predict(input_as_np_array) print(prediction) if (prediction[0]==0): return 'THE LOAN IS NOT APPROVED' else: return 'THE LOAN IS APPROVED FOR YOU' def main():
  • 55. 55 #give a title st.title('Loan Status Prediction App') #Getting the input from user #GENDER Gender = st.selectbox( 'Gender',('Male', 'Female')) st.write('You Selected:', Gender) if Gender=='Male': Gender=1 else: Gender=0 #MARRIED Married = st.selectbox( 'Married',('Yes', 'No')) st.write('You Selected:',Married) if Married=='Yes': Married=1 else: Married=0 #DEPENDENTS Dependents=st.slider('Dependents',0,10,1) #EDUCATION Education = st.selectbox( 'Education',('Graduate', 'Not Graduate')) st.write('You Selected:',Education) if Education=='Graduate': Education=1 else: Education=0 #SELF_EMPLOYED Self_Employed = st.selectbox( 'Self_Employed',('Yes', 'No')) st.write('You Selected:',Self_Employed) if Self_Employed=='Yes':
  • 56. 56 Self_Employed=1 else: Self_Employed=0 ApplicantIncome =st.text_input('Applicant Income') CoapplicantIncome=st.text_input('Co-Applicant Income') LoanAmount=st.text_input('Loan Amount') Loan_Amount_Term=st.text_input('Loan Amount Terms') #CREDIT HISTORY Credit_History= st.selectbox( 'Credit History',('Yes', 'No')) st.write('You Selected:',Credit_History) if Credit_History=='Yes': Credit_History=1 else: Credit_History=0 #PROPERTY AREA Property_Area = st.selectbox( 'Property Area',('Rural', 'Semi Urban','Urban')) st.write('You Selected:',Property_Area) if Property_Area=='Rural': Property_Area=0 if Property_Area=='Semi Urban': Property_Area=1 else: Property_Area=2 #Code for prediction pred = '' if st.button('Predict'): pred=loan_prediction([Gender,Married,Dependents,Education,Self_Empl oyed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_ Term,Credit_History,Property_Area]) st.success(pred) if __name__== '__main__': main()
  • 57. 57 10) SUMMARY AND CONCLUSION SUMMARY: This project objective is to predict the Loan Approval of the user. So this online banking loan approval system will reduce the paper work and reduce the wastage of bank asserts and efforts and also saves the valuable time of the customer. In our work, a total of five machine learning algorithms which includes Logistic Regression, Decision Tree, Random forest classification,Support Vectior Classifier and Gradient Boosting Classifier are applied to predict the loan approval of customers. The experimental results conclude that the accuracy of Random Forest Classification machine learning algorithm is better compared to others machine learning approach. CONCLUSION: The analytical process started from data cleaning and processing, missing value, exploratory analysis and finally model building and evaluation. The best accuracy on public test set is higher accuracy score has been found out. This application can help to find the Prediction of Bank Loan Approval. FUTURE WORK: • Bank Loan Approval prediction to connect with cloud. • To optimize the work to implement in Artificial Intelligence environment.
  • 58. 58 11) REFERENCES: [1] Amruta S. Aphale , Dr. Sandeep R. Shinde, 2020, Predict Loan Approval in Banking System Machine Learning Approach for Cooperative Banks Loan Approval, International Journal Of Engineering Research & Technology (IJERT) Volume 09, Issue 08 (August 2020) [2] Ashwini S. Kadam, Shraddha R. Nikam, Ankita A. Aher, Gayatri V. Shelke, Amar S.Chandgude, “Prediction for Loan Approval using Machine Learning Algorithm” (IRJET) Volume: 08 Issue: 04 | Apr 2021. [3] M. A. Sheikh, A. K. Goel and T. Kumar, "An Approach for Prediction of Loan Approval using Machine Learning Algorithm," 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), 2020, pp. 490- 494, doi: 10.1109/ICESC48915.2020.9155614. [4] Rath, Golak & Das, Debasish & Acharya, Biswaranjan. (2021). Modern Approach for Loan Sanctioning in Banks Using Machine Learning. Pages={179-188} 10.1007/978-981-15-5243-4_15. [5] Vincenzo Moscato, Antonio Picariello, Giancarlo Sperlí, A benchmark of machine learning approaches for credit score prediction, Expert Systems with Applications, Volume 165, 2021, 113986, ISSN 0957-4174. [6]Yash Divate, Prashant Rana, Pratik Chavan, “Loan Approval Prediction Using Machine Learning” International Research Journal of Engineering and Technology (IRJET) Volume: 08 Issue: 05 | May 2021 [7] WWW.JAVAPOINT.COM