IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 04 Issue: 05 | May-2015, Available @ http://www.ijret.org 273
AN EFFICIENT DATA PRE PROCESSING FRAME WORK FOR LOAN
CREDIBILITY PREDICTION SYSTEM
Soni P M1, Varghese Paul2, M. Sudheep Elayidom3
1Assistant Professor, Dept. of MCA, SNGIST, Manjali, Kerala, India
2Associate Professor, Computer Engineering, CUSAT, Kochi, Kerala, India
3Associate Professor, Information Technology, CUSAT, Kochi, Kerala, India
Abstract
Data mining has become increasingly popular across application areas, especially in the
banking industry. Organizations hold large volumes of data and have ample technology, yet often lack useful
information, which is why a data mining process is needed. The importance of data mining is growing, and studies
in many domains have applied various data mining techniques to a wide range of problems. Preparing data for
data mining is the most important and most time-consuming phase. In developing countries like India, bankers
should be vigilant against fraudsters, who create serious problems for banking organizations. Data mining
techniques can be used to build a predictive model that helps bankers take the proper decision. This paper
covers the set of techniques under the umbrella of data preprocessing, based on a case study of bank loan
transaction data. The proposed model will help to distinguish borrowers who repay loans promptly from those
who do not. The framework helps organizations implement better CRM through improved prediction ability.
Keywords: Data preprocessing, Customer behavior, Input columns, Outlier columns, Target column, Dataset, CRM
--------------------------------------------------------------------***------------------------------------------------------------------
1. INTRODUCTION
Data mining tools can be used in the banking industry for
customer segmentation, banking profitability, credit scoring
and approval, predicting payment from customers,
marketing, detecting fraudulent transactions, cash
management and forecasting operations, optimising stock
portfolios, and ranking investments [6]. Figure 1 depicts the
stages of the data mining process in a business application.
The first phase, business understanding, studies the domain
in which data mining is to be performed. The domain
considered here is predicting payment from customers in
banking. Nowadays the way banking transactions are
performed is changing remarkably. It is very important to
consider the customer relationship management of the
enterprise in order to satisfy both the customers and the
business as a whole. The banking industry widely
recognizes the importance of the information it holds about
its customers [1]. The department wants data mining to find
patterns that distinguish borrowers who repay promptly
from those who do not [7]. Data mining provides the
technology to analyze huge volumes of data and detect
hidden patterns, converting raw data into valuable
information. In effect, data mining helps to identify patterns
and relationships in the data [1]. Data preprocessing is an
often ignored but major step in the data mining process.
Banking systems collect huge amounts of data on a
day-to-day basis: customer information, transaction details,
risk profiles, credit card details, limit and collateral details,
compliance and Anti-Money Laundering (AML) related
information, trade finance data, and SWIFT and telex
messages [2].
Fig-1: Data Mining Stages
The first part of the paper discusses the concepts and data
preprocessing methods that can be applied to collected data
for the purpose of data mining. The second part presents the
problem statement, which covers business domain
understanding. A detailed system study was required to
understand the business domain; here, loan transactions are
considered within the banking domain. The third part is data
understanding, which deals with the study of the loan data
of each customer from a premier recognized cooperative
bank. The paper explains in detail the data collected from
the customers. The collected data may contain out-of-range
values, impossible data combinations, missing values, etc.
This will
lead to misleading results. Thus, the fourth part of this paper
explains how to preprocess the data for the banking loan
domain. The paper also notes the technologies used for data
preprocessing. The result expected after a reliable chain of
data preprocessing tasks is a final dataset that can be
considered correct and useful for further data mining
algorithms. Kotsiantis et al. present well-known algorithms,
including GALA, for the steps of data preprocessing [2].
Data mining can assist critical decision-making processes in
a bank [3]. This paper describes a framework for data
preprocessing that retrieves a Boolean value indicating
whether or not a loan should be sanctioned to the customer.
As banking is a service industry, maintaining a strong and
effective CRM is a critical issue [1].
Data mining techniques help to distinguish borrowers who
repay loans promptly from those who do not. They also help
to predict when a borrower is likely to default and whether
providing a loan to a particular customer will result in a bad
loan [8]. Figure 2 depicts an effective CRM instrument in
which customer insight is the main focus. Customer
satisfaction is the key factor of a CRM framework.
Fig-2: CRM instrument
The last part of this paper presents the conclusions and the
future scope of the work.
2. PROBLEM DOMAIN
There are numerous areas in which data mining can be used
in the banking industry to support customer relationship
management. Data mining techniques help to distinguish
borrowers who repay loans promptly from those who do
not. They also help to predict the creditworthiness of a
borrower by analyzing the behavior and reliability of
customers. With data mining techniques, banks can
thoroughly profile and rank their branches with respect to
loan fraud risk. In developing countries like India, bankers
face more problems with fraudsters. Data mining techniques
make it easier to build a successful predictive model and to
present its results as meaningful information to the user [8].
Data mining can be applied to reduce the lending risk due to
fraud as well as to find an appropriate solution to the
borrower’s need for funds, with proper assessment of risk
and the inclusion of sufficient control systems to ensure
repayment. Loan officers enter the borrower’s credit data,
while the system computes the risk and sends the result to
the database for an approval decision by the loan committee
at the head office. The decision is sent to the loan database,
where it is assessed by the loan officer, who then informs
the customer about the decision. Banks and retailers that
have realized the utility of data mining and are building a
data mining environment for their decision-making process
stand to gain substantial benefits in the future. The problem
statement is: “Propose a data mining methodology to
analyse, design and test an efficient data mining framework
for customer loan credibility prediction.”
3. CONCEPTS USED
The main concepts used are data representation, the
different types of columns and attributes, data preprocessing
methods, and a framework based on data preprocessing.
3.1 Structure of Data
Data in a data mining process is normally represented in
tabular form. Understanding this structure is the most
important step in preparing data for data mining. Operations
on the database treat the row as the unit of action, so the
level of granularity within the database is a row, which
often corresponds to a separate customer transaction. Each
column contains values. The range of a column is the set of
values it is allowed to take. For numeric columns, the
minimum and maximum values can also be found. Columns
that hold a single value in every row are referred to as
unary-valued columns; they cannot be used to distinguish
between rows.
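The column properties described above (range, minimum and maximum, unary-valued columns) can be checked with a few lines of code. The following is a minimal sketch in plain Python; the function name `profile_column`, the sample rows and their field names are hypothetical and are not part of the bank data set used in this paper.

```python
from typing import Any


def profile_column(values: list[Any]) -> dict[str, Any]:
    """Summarise one column: distinct count, numeric min/max, unary flag."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    # Only true numbers take part in the min/max computation.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {
        "distinct": len(distinct),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        # A unary-valued column holds the same value in every row.
        "unary": len(distinct) <= 1,
    }


# Hypothetical rows; each row stands for one loan transaction.
rows = [
    {"loan_amount": 50000, "category": "A", "purpose": "Housing"},
    {"loan_amount": 120000, "category": "A", "purpose": "Business"},
    {"loan_amount": 75000, "category": "A", "purpose": "Housing"},
]

profiles = {name: profile_column([r[name] for r in rows]) for name in rows[0]}
unary_columns = [name for name, p in profiles.items() if p["unary"]]
```

In this sample, `category` is reported as unary because every row holds the same value, so it would be a candidate for removal before mining.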
3.2 Classification of Columns
Columns are classified as input columns, target column(s),
ignored columns, identification columns, weight columns
and cost columns. Input columns are used as input to the
model. Target columns hold the values that a predictive
model is built to predict. Columns that are not used are
referred to as ignored columns. Identification columns
uniquely identify each row and are ignored for data mining
purposes. A weight column specifies a “weight” to be
applied to a row: a record with a weight of three counts
three times as much as a record with a weight of one. A cost
column specifies a cost associated with a row; for example,
a customer’s value can be treated as a cost.
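To illustrate what a weight column does, the sketch below computes a weighted proportion in which each record counts as many times as its weight. The field names `defaulted` and `weight` and the sample records are assumptions made for illustration only.

```python
def weighted_default_rate(records):
    """Share of defaulting records, where each record counts `weight` times."""
    total = sum(r["weight"] for r in records)
    defaults = sum(r["weight"] for r in records if r["defaulted"])
    return defaults / total


# Hypothetical records: the first one carries a weight of three.
records = [
    {"defaulted": True, "weight": 3},
    {"defaulted": False, "weight": 1},
    {"defaulted": False, "weight": 1},
]

rate = weighted_default_rate(records)          # 3 of 5 weighted units
unweighted = sum(r["defaulted"] for r in records) / len(records)  # 1 of 3 rows
```

With the weights applied, the single defaulting record accounts for three of the five weighted units, whereas without weights it is only one of three rows.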
3.3 Data Pre Processing
The data collected for a mining process may contain
missing values, noise or inconsistencies, which lead the
mining process to produce inconsistent information. A data
mining process fed with high-quality data produces better
results. To improve the quality of the data and consequently
the mining results, the collected data must
be preprocessed, which also improves the efficiency of the
data mining process. Data preprocessing is one of the
critical steps in the data mining process; it deals with the
preparation and transformation of the initial data set into the
final data set. The following categories of data
preprocessing are applied to convert the initial data set into
the final data set:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data cleaning cleans the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and
resolving inconsistencies. If users believe that the data are
dirty, they will not trust the results of a data mining process
applied to them [4]. Data integration is the process of
combining data from different sources. Data transformation
addresses the problem that certain attributes of the customer
data take much larger values than others [4]; the solution to
this problem is normalization. Data reduction produces a
reduced representation of the data set that is much smaller
in volume yet should produce the same result [4].
Fig - 3: Data preprocessing methods
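Two of the steps named above, filling missing values and normalization, can be sketched as follows. Mean imputation and min-max scaling, x' = (x - min) / (max - min), are chosen here only as common examples; the paper does not state which variants were applied to the loan data, and the sample values are invented.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical loan amounts with one missing entry.
loan_amounts = [50000, None, 150000, 100000]
cleaned = fill_missing_with_mean(loan_amounts)
scaled = min_max_normalize(cleaned)
```

After scaling, an attribute with large raw values such as a loan amount lies in the same range as any other normalized attribute, so it no longer dominates purely because of its magnitude.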
Data preprocessing is the most time-consuming phase of a
data mining process. Data cleaning of the loan data removed
several attributes that have no bearing on the behavior of a
customer. Data integration, data reduction and data
transformation are also applicable to loan data. For easier
analysis, the data is reduced to a smaller number of records.
Data preprocessing techniques use several statistical
measures such as the mean, mode, median, standard
deviation, range and variance. Graphical tools are necessary
to reveal outlier data so that it can be handled. Examples of
graphical tools are histograms, quantile plots, q-q plots and
scatter plots. The binning method is suitable for smoothing
the data. Clustering is a data mining technique that can be
used to detect outliers.
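The statistical measures and the binning method mentioned above can be illustrated with a short sketch. Smoothing by bin means over equal-frequency bins is used here as one common form of binning; the bin size and the sample values are assumptions, not figures from the loan data.

```python
from statistics import mean, median


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace
    each value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Hypothetical numeric attribute, for example fine amounts.
fines = [4, 8, 15, 21, 21, 24, 25, 28, 34]

summary = {
    "mean": mean(fines),
    "median": median(fines),
    "range": max(fines) - min(fines),
}
smoothed = smooth_by_bin_means(fines, 3)
```

Each group of three sorted values is replaced by its mean, which removes small fluctuations while keeping the overall ordering of the data.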
4. DATA USED
Data was collected from a premier cooperative bank that
provides loans to individuals, business firms, etc., so as to
meet the requirements of all types of customers. Data
collection was carried out through on-site observation and
interviews with the concerned authority. A detailed study of
loan processing and banking transactions was also made.
The available data consists of 2500 records of bank loan
transactions with 25 data fields. Some of the fields were
removed directly by manual data preprocessing. The data
fields remaining after this manual removal of unnecessary
fields are listed in Table 1.
Table-1: List of attributes
No.  Attribute
1    Loan Number
2    Loan Date
3    Due Date
4    Loan Amount
5    Opening
6    Payment
7    Receipt
8    Int_rcvd
9    Fine_rcvd
10   MemNo
11   Action
12   Secured
13   Loan Balance
14   Interest Rate
15   Category
16   Purpose
17   Gender
18   Occupation
5. PROPOSED ARCHITECTURE
The main focus of this paper is to propose a framework for
analyzing the behavior of bank customers and predicting the
credibility of their loan repayment. Customer behavior is
collected as input to the framework, and the decision on
whether a loan is to be sanctioned depends on the classifier
value produced by the framework. The collected data are
classified as in Table 2. Each attribute is classified by type
as an input column, target column or ignored column. Only
the input column attributes are given to the model as input.
According to the literature, an account must first be created
for each customer in the bank, and customers should enter
their personal details, income details, insurance details, loan
details and the
account information of the corresponding customer in other
banks [4]. Here, when a loan is sanctioned, the customer
details as well as the loan details are entered into the
database. Loan number is recognized as the identification
column that uniquely identifies each record. Generally this
column is ignored in the data mining process. Unary-valued
columns are likewise removed, because they do not contain
any information that helps to distinguish between different
rows; the Category column is removed for this reason. The
major input columns from the collection are Loan amount,
Fine received, Action, Loan balance, Occupation, and
Purpose. These values are considered as the input to the
model. The target column has to be designed and is used
when building predictive models. Here the target column
specifies whether a loan is to be sanctioned or not,
depending on the input values to the model. Some columns
are considered ignored columns or outlier columns; these
are not used in designing the model. The model does not
need the actual loan balance amount.
Table- 2: Classification of attributes
Data fields or attributes | Type of attribute | Remarks
Loan Number | Identification column | Not used for data mining process.
Loan amount, Fine received, Action, Loan balance, Interest Rate, Occupation, and Purpose | Input columns | These are the inputs to the model.
Loan Status | Target column | Decides whether loan is to be sanctioned or not.
Loan Date, Due Date, Opening, Payment, Receipt, Int_rcvd, MemNo | Outlier columns | These are not necessary to build the model.
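The classification in Table 2 can be expressed as a simple mapping from column name to role, from which the model inputs and the target are selected. This is a minimal sketch under stated assumptions: the role labels, the function name `split_record` and the sample record are hypothetical, and the column spellings follow Table 1.

```python
# Roles follow Table 2; column spellings follow Table 1.
COLUMN_ROLES = {
    "Loan Number": "identification",
    "Loan Amount": "input",
    "Fine_rcvd": "input",
    "Action": "input",
    "Loan Balance": "input",
    "Interest Rate": "input",
    "Occupation": "input",
    "Purpose": "input",
    "Loan Status": "target",
    "Loan Date": "outlier",
    "Due Date": "outlier",
    "Opening": "outlier",
    "Payment": "outlier",
    "Receipt": "outlier",
    "Int_rcvd": "outlier",
    "MemNo": "outlier",
}


def split_record(record):
    """Return (inputs, target) for one record, dropping every other column."""
    inputs = {name: value for name, value in record.items()
              if COLUMN_ROLES.get(name) == "input"}
    targets = [value for name, value in record.items()
               if COLUMN_ROLES.get(name) == "target"]
    target = targets[0] if targets else None
    return inputs, target


# Hypothetical record; the values are invented for illustration only.
record = {
    "Loan Number": "L-0001",
    "Loan Date": "2014-01-10",
    "Loan Amount": 50000,
    "Fine_rcvd": 0,
    "Action": "Closed",
    "Loan Balance": 0,
    "Interest Rate": 12.5,
    "Occupation": "Teacher",
    "Purpose": "Housing",
    "MemNo": 417,
    "Loan Status": "Yes",
}
inputs, target = split_record(record)
```

Keeping the roles in one mapping means that reclassifying a column only requires changing a single entry rather than the selection logic.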
The model only needs to know whether or not there is an
amount in the loan balance column, in order to separate
customers who have closed their loan from those who have
not. To achieve this, the value is replaced by 1 or 0: if there
is a value in the loan balance column it is replaced with 1,
otherwise with 0. Missing values in the Purpose and
Occupation (job) columns are left untreated, because the
data collection is large and these few missing values may
not materially affect the models. The framework for the
loan transaction processing system is described in Fig 4.
The framework takes customer data as input and produces
the target column as output. The customer data is processed
through loan data preprocessing, which separates the input
data into input columns, identification columns and outlier
columns. Outlier columns carry no importance for the data
mining process. Only the input columns are given to the
model for loan data processing. The output obtained from
the model is the target column. This is the classifier value
that decides whether or not the loan is to be sanctioned,
based on customer behavior. To implement this, data mining
techniques such as decision trees, SVM and neural networks
are to be designed and tested. An algorithm also needs to be
designed in a data mining environment to check the validity
of this framework.
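Fig 4: Framework for loan credibility prediction
[Diagram: Customer data as input -> Loan Data Preprocessing -> Input Columns, Identification Columns, Outlier Columns; Input data to the model -> Loan Transaction Processing Algorithm -> Classifier value (Target Column)]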
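The recoding of the loan balance column described in this section can be sketched as follows. Treating an empty cell, a missing value and a zero balance all as "no value" is an assumption of this sketch; the paper only states that a present value becomes 1 and anything else becomes 0. The function name `balance_flag` is hypothetical.

```python
def balance_flag(balance):
    """Return 1 when an outstanding balance is recorded, otherwise 0.

    Assumption: None, an empty string and a zero amount all count as
    'no value in the loan balance column'.
    """
    if balance is None:
        return 0
    if isinstance(balance, str):
        balance = balance.strip()
        if balance == "":
            return 0
        balance = float(balance)
    return 1 if balance != 0 else 0


# Hypothetical raw cells as they might appear in a spreadsheet export.
raw_balances = [12500.0, None, "", "0", " 300 "]
flags = [balance_flag(b) for b in raw_balances]
```

The resulting 0/1 column replaces the actual amount, so the model sees only whether the loan is still open.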
5.1 Advantages
The advantage of this framework for a banking organization
is that only the transactions of valid customers are entered
into its data warehouse. Fraudulent customers are filtered
out by applying this data mining model at the time of first
interaction with the concerned authorities. This helps to
solve many problems related to fraudsters in the banking
sector, and will be an asset for developing nations like India
in stabilizing their financial records.
6. TECHNOLOGY USED
The Weka suite contains a collection of visualization tools
and algorithms for data analysis and predictive modeling,
together with graphical user interfaces for easy access to this
functionality [5]. Weka is a powerful tool for data
preprocessing, classification, clustering and Visualization. It
is freely available as well as platform-independent software.
As shown in Fig 5, the Weka GUI Chooser offers four
applications: Explorer, Experimenter, KnowledgeFlow and
Simple CLI. A text editor such as Notepad can be used to
create the data set in .ARFF format, which is Weka's native
file format. Weka can apply several data mining algorithms
and techniques, making it possible to compare the results
that different techniques produce on the same task.
Microsoft Excel is a powerful tool for managing data in
tabular form, the most important data format used by data
mining algorithms. The initial and final data can be easily
represented in tabular form in Excel.
Fig 5: GUI of Weka
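As an alternative to typing the .ARFF file by hand, the file can be generated from tabular data. The sketch below builds ARFF text with an `@relation` line, one `@attribute` line per column and a `@data` section, writing `?` for missing values. The helper name `to_arff`, the attribute names and the sample rows are hypothetical; names are kept free of spaces so that no quoting is needed.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    `attributes` is a list of (name, kind) pairs, where kind is either a
    type keyword such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
attributes = [
    ("loan_amount", "numeric"),
    ("fine_received", "numeric"),
    ("occupation", ["farmer", "teacher", "trader"]),
    ("loan_status", ["yes", "no"]),
]
rows = [
    [50000, 0, "farmer", "yes"],
    [120000, 250, None, "no"],
]

arff_text = to_arff("loan", attributes, rows)
```

The returned text can be saved with a `.arff` extension, for example with `open("loan.arff", "w").write(arff_text)`, and then loaded in the Weka Explorer.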
7. FUTURE SCOPE
The main focus of this work is the design of data mining
models that distinguish customers who repay loans
promptly from those who do not. Models based on SVM,
neural network analysis, etc. are to be designed and tested.
The prediction is made by analyzing the behavior and
reliability of customers using a prediction algorithm that
takes as input characteristics of the customer such as nature
of job, past history, income and other related fields. The
output expected from the system is an indicator of whether
the customer is a good prospect or not. Algorithms that
outperform popular data mining models also have to be
developed and tested for the CRM domain.
8. CONCLUSION
In this work, a data preprocessing framework for loan
credibility prediction is proposed. The data was collected
from a premier organization and classified by applying
preprocessing techniques such as data cleaning, data
integration and normalization. The behavior of customers
after preprocessing is taken as the input to the model. The
target column is a classifier value that decides whether or
not the loan is to be sanctioned. The framework helps to
predict the loan repayment credibility of a customer by
considering the customer's behavior. This will help banking
organizations keep fraudsters out of their transactions.
REFERENCES
[1]. Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(3):57.
[2]. S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Pre-processing for Supervised Leaning", International Journal of Computer Science, 2006, Vol. 1, No. 2, pp. 111–117.
[3]. Sreekumar Pulakkazhy and R.V.S. Balan, "Data Mining in banking and its applications – a review", Journal of Computer Science 9 (10): 1252-1259, 2013, ISSN: 1549-3636, © 2013 Science Publications, doi:10.3844/jcssp.2013.1252.1259
[4]. Neethu Baby, Priyanka L.T, "Customer Classification And Prediction Based On Data Mining Technique", International Journal of Emerging Technology and Advanced Engineering (ISSN 2250-2459, ISO 9001:2008 Certified Journal), Volume 2, Issue 12, December 2012.
[5]. Swasti Singhal, Monika Jena, "A Study on WEKA Tool for Data Preprocessing, Classification and Clustering", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-2, Issue-6, May 2013.
[6]. Dileep B. Desai, R.V. Kulkarni, "A Review: Application of Data Mining Tools in CRM for Selected Banks", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 4 (2), 2013, 199–201.
[7]. Rob Gerritsen, "Loan Risks: A Data Mining Case Study".
[8]. K. Chitra, B. Subashini, "Data Mining Techniques and its Applications in Banking Sector", International Journal of Emerging Technology and Advanced Engineering, www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal), Volume 3, Issue 8, August 2013.
More Related Content

PDF
20 ccp using logistic
PDF
B05840510
PDF
A simulated decision trees algorithm (sdt)
PDF
Applications of machine learning
PDF
Proposed ranking for point of sales using data mining for telecom operators
PDF
An impact of knowledge mining on satisfaction of consumers in super bazaars
PDF
A data mining approach to predict
PDF
A Comparative Study of Techniques to Predict Customer Churn in Telecommunicat...
20 ccp using logistic
B05840510
A simulated decision trees algorithm (sdt)
Applications of machine learning
Proposed ranking for point of sales using data mining for telecom operators
An impact of knowledge mining on satisfaction of consumers in super bazaars
A data mining approach to predict
A Comparative Study of Techniques to Predict Customer Churn in Telecommunicat...

What's hot (19)

PDF
Discovering diamonds under coal piles: Revealing exclusive business intellige...
PDF
Using Data Mining Techniques in Customer Segmentation
PDF
THE EFFECTIVENESS OF DATA MINING TECHNIQUES IN BANKING
PDF
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
PDF
ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS
PDF
Applying Convolutional-GRU for Term Deposit Likelihood Prediction
PDF
Improving the credit scoring model of microfinance
PDF
Df24693697
PDF
International Refereed Journal of Engineering and Science (IRJES)
PDF
50120140506009
PPTX
Business Analytics Unit III: Developing analytical talent
PDF
IRJET- Customer Buying Prediction using Machine-Learning Techniques: A Survey
PDF
Full Paper: Analytics: Key to go from generating big data to deriving busines...
PPSX
Information systems and it's types
PDF
Gc3310851089
PDF
Study of Data Mining Methods and its Applications
DOCX
Metadata
PPTX
Business intelligence systems
PDF
2 b1 enterprisemasterdataarchitecture
Discovering diamonds under coal piles: Revealing exclusive business intellige...
Using Data Mining Techniques in Customer Segmentation
THE EFFECTIVENESS OF DATA MINING TECHNIQUES IN BANKING
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
ANOMALY DETECTION AND ATTRIBUTION USING AUTO FORECAST AND DIRECTED GRAPHS
Applying Convolutional-GRU for Term Deposit Likelihood Prediction
Improving the credit scoring model of microfinance
Df24693697
International Refereed Journal of Engineering and Science (IRJES)
50120140506009
Business Analytics Unit III: Developing analytical talent
IRJET- Customer Buying Prediction using Machine-Learning Techniques: A Survey
Full Paper: Analytics: Key to go from generating big data to deriving busines...
Information systems and it's types
Gc3310851089
Study of Data Mining Methods and its Applications
Metadata
Business intelligence systems
2 b1 enterprisemasterdataarchitecture
Ad

Viewers also liked (11)

PPT
Improving Reporting Performance
PDF
D2 10 Sygnity Mortgage Loan Processing System
PDF
What is hadoop and how it works?
PPTX
Emerging technologies /frameworks in Big Data
PPTX
introduction to big data frameworks
PDF
Online Loan Application & Its Verification System
PPTX
Online Loan Management System
PPS
Loan Management System
PDF
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
PDF
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
PPTX
Big Data Analytics with Hadoop
Improving Reporting Performance
D2 10 Sygnity Mortgage Loan Processing System
What is hadoop and how it works?
Emerging technologies /frameworks in Big Data
introduction to big data frameworks
Online Loan Application & Its Verification System
Online Loan Management System
Loan Management System
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Big Data Analytics with Hadoop
Ad

Similar to An efficient data pre processing frame work for loan credibility prediction system (20)

PDF
Supervised and unsupervised data mining approaches in loan default prediction
PPTX
Data warehousing and mining furc
PDF
A Review On Data Mining In Banking Sector
PPTX
Week-1-Introduction to Data Mining.pptx
DOCX
Seminar Report Vaibhav
PPTX
Data mining & Decison Trees
PDF
Data mining on Financial Data
PPT
1328cvkdlgkdgjfdkjgjdfgdfkgdflgkgdfglkjgld8679 - Copy.ppt
PPT
Dma unit 1
PPTX
Data mining
PPT
introduction to data mining applications
PDF
Data Mining – analyse Bank Marketing Data Set by WEKA.
PPT
Data mining
PPTX
1 introduction
PPTX
Data mining
PDF
ii mca juno
DOCX
notes_dmdw_chap1.docx
PDF
Overview of Data Mining
PPTX
Dw dm-banking
 
Supervised and unsupervised data mining approaches in loan default prediction
Data warehousing and mining furc
A Review On Data Mining In Banking Sector
Week-1-Introduction to Data Mining.pptx
Seminar Report Vaibhav
Data mining & Decison Trees
Data mining on Financial Data
1328cvkdlgkdgjfdkjgjdfgdfkgdflgkgdfglkjgld8679 - Copy.ppt
Dma unit 1
Data mining
introduction to data mining applications
Data Mining – analyse Bank Marketing Data Set by WEKA.
Data mining
1 introduction
Data mining
ii mca juno
notes_dmdw_chap1.docx
Overview of Data Mining
Dw dm-banking
 

More from eSAT Journals (20)

PDF
Mechanical properties of hybrid fiber reinforced concrete for pavements
PDF
Material management in construction – a case study
PDF
Managing drought short term strategies in semi arid regions a case study
PDF
Life cycle cost analysis of overlay for an urban road in bangalore
PDF
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
PDF
Laboratory investigation of expansive soil stabilized with natural inorganic ...
PDF
Influence of reinforcement on the behavior of hollow concrete block masonry p...
PDF
Influence of compaction energy on soil stabilized with chemical stabilizer
PDF
Geographical information system (gis) for water resources management
PDF
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
PDF
Factors influencing compressive strength of geopolymer concrete
PDF
Experimental investigation on circular hollow steel columns in filled with li...
PDF
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
PDF
Evaluation of punching shear in flat slabs
PDF
Evaluation of performance of intake tower dam for recent earthquake in india
PDF
Evaluation of operational efficiency of urban road network using travel time ...
PDF
Estimation of surface runoff in nallur amanikere watershed using scs cn method
PDF
Estimation of morphometric parameters and runoff using rs & gis techniques
PDF
Effect of variation of plastic hinge length on the results of non linear anal...
PDF
Effect of use of recycled materials on indirect tensile strength of asphalt c...
Mechanical properties of hybrid fiber reinforced concrete for pavements
Material management in construction – a case study
Managing drought short term strategies in semi arid regions a case study
Life cycle cost analysis of overlay for an urban road in bangalore
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
Laboratory investigation of expansive soil stabilized with natural inorganic ...
Influence of reinforcement on the behavior of hollow concrete block masonry p...
Influence of compaction energy on soil stabilized with chemical stabilizer
Geographical information system (gis) for water resources management
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
Factors influencing compressive strength of geopolymer concrete
Experimental investigation on circular hollow steel columns in filled with li...
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
Evaluation of punching shear in flat slabs
Evaluation of performance of intake tower dam for recent earthquake in india
Evaluation of operational efficiency of urban road network using travel time ...
Estimation of surface runoff in nallur amanikere watershed using scs cn method
An efficient data pre processing frame work for loan credibility prediction system

IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 04 Issue: 05 | May-2015, Available @ http://www.ijret.org 273

AN EFFICIENT DATA PRE PROCESSING FRAME WORK FOR LOAN CREDIBILITY PREDICTION SYSTEM

Soni P M1, Varghese Paul2, M.Sudheep Elayidom3
1 Assistant Professor, Dept. of MCA, SNGIST, Manjali, Kerala, India
2 Associate Professor, Computer Engineering, CUSAT, Kochi, Kerala, India
3 Associate Professor, Information Technology, CUSAT, Kochi, Kerala, India

Abstract
In today's world, data mining has become increasingly interesting and popular across applications, especially in the banking industry. We have too much data and too much technology but too little useful information, which is why a data mining process is needed. The importance of data mining is increasing, and studies in many domains have used various data mining techniques to solve a wide range of problems. Preparing data for data mining is the most important and time-consuming phase. In developing countries like India, bankers should be vigilant against fraudsters, who create serious problems for banking organizations. Data mining techniques make it possible to build a predictive model that helps bankers take the proper decision. This paper covers the set of techniques under the umbrella of data preprocessing, based on a case study of bank loan transaction data. The proposed model will help to distinguish borrowers who repay loans promptly from those who do not. The framework helps organizations implement better CRM through improved prediction ability.
Keywords: Data preprocessing, Customer behavior, Input columns, Outlier columns, Target column, Dataset, CRM

1. INTRODUCTION

Data mining tools can be used in the banking industry for customer segmentation, banking profitability, credit scoring and approval, predicting payment from customers, marketing, detecting fraudulent transactions, cash management and forecasting operations, optimising stock portfolios, and ranking investments [6]. Figure 1 depicts the stages of the data mining process in a business application. The first phase, business understanding, establishes the domain in which data mining is to be performed. The domain considered here is predicting payment from customers in banking. Nowadays the way banking transactions are performed is changing dramatically. An enterprise must consider customer relationship management in order to satisfy its customers as well as the business as a whole. The banking industry widely recognizes the importance of the information it holds about its customers [1]. The department wants data mining to find patterns that distinguish borrowers who repay promptly from those who don't [7]. Data mining provides the technology to analyze huge volumes of data and detect hidden patterns, converting raw data into valuable information; in effect, it helps to identify patterns and relationships in the data [1]. Data preprocessing is an often ignored but major step in the data mining process. Banking systems collect huge amounts of data on a day-to-day basis: customer information, transaction details, risk profiles, credit card details, limit and collateral details, compliance and Anti-Money Laundering (AML) related information, trade finance data, and SWIFT and telex messages [2].
Fig-1: Data Mining Stages

The first part of the paper discusses the concepts and data preprocessing methods that can be applied to the collected data for the purpose of data mining. The second part deals with the problem statement, which explains the business domain. A detailed system study was required to understand the business domain; here, loan transactions within the banking domain are considered. The third part is data understanding, which studies the loan data corresponding to each customer of a premier recognized cooperative bank. The paper describes in detail the data collected from the customers. The collected data may contain out-of-range values, impossible data combinations, missing values, etc. This will
lead to misleading results. Thus, the fourth part of this paper explains how to preprocess the data for the banking loan domain, and also points out the technologies used for data preprocessing. The result expected after a reliable chain of data preprocessing tasks is a final dataset that can be considered correct and useful for further data mining algorithms. Kotsiantis et al. present a well-known algorithm, GALA, for each step of data preprocessing [2]. Data mining can assist critical decision-making processes in a bank [3]. This paper describes a framework for data preprocessing that yields a Boolean value used to decide whether or not a loan is sanctioned to the customer. As banking is a service industry, maintaining a strong and effective CRM is a critical issue [1]. Data mining techniques help to distinguish borrowers who repay loans promptly from those who don't. They also help to predict when a borrower is at default and whether providing a loan to a particular customer will result in a bad loan [8]. Fig. 2 depicts an effective CRM instrument in which customer insight is the main focus. Customer satisfaction is the key factor of a CRM framework.

Fig-2: CRM instrument

The last part of this paper presents the conclusions and the future scope of the work.

2. PROBLEM DOMAIN

There are numerous areas in which data mining can be used in the banking industry to support customer relationship management. Data mining techniques will help to distinguish borrowers who repay loans promptly from those who do not.
It also helps to predict the creditworthiness of a borrower by analyzing the behavior and reliability of customers. With data mining techniques, banks can thoroughly profile and rank their branches with respect to loan fraud risk. In developing countries like India, bankers face more problems with fraudsters. Using data mining techniques, it is straightforward to build a predictive model and present its report as meaningful information to the user [8]. Data mining can be applied to reduce the lending risk due to fraud and to find an appropriate solution to the borrower's need for funds, with proper assessment of risk and sufficient control systems to ensure repayment. Loan officers enter the borrower's credit data, while the system computes the risk and sends the result to the database for an approval decision by the loan committee at head office. The decision is sent to the loan database, where it is reviewed by the loan officer, who then informs the customer. Banks and retailers that have realized the utility of data mining and are building a data mining environment for their decision-making processes will gain immense benefits in future. The problem statement is: “Propose a data mining methodology to analyse, design and test an efficient data mining framework for customer loan credibility prediction.”

3. CONCEPTS USED

The main concepts used are data representation, the different types of columns and attributes, data preprocessing methods, and a framework based on data preprocessing.

3.1 Structure of Data

Data in a data mining process is normally represented in tabular form. Awareness of this structure is the most important step in preparing data for data mining. Actions within the database are performed with the row as the unit of action.
So the level of granularity within the database is a row, which often corresponds to a separate customer transaction. Each column contains values, and the range of a column is its set of allowable values. For a numeric column it is also possible to find the minimum and maximum values. Columns that hold a single value are referred to as unary-valued columns; they cannot be used to distinguish different rows.

3.2 Classification of Columns

Columns are classified as input columns, target column(s), ignored columns, identification columns, weight columns and cost columns. Input columns are used as input to the model. Target column(s) are used for building predictive models. Columns that are not used are referred to as ignored columns. Identification columns uniquely identify the data and are ignored for data mining purposes. A weight column specifies a “weight” to be applied to a row: a record with a weight of three counts three times as much as a record with a weight of one. A cost column specifies a cost associated with a row; a customer's value can be considered a cost.

3.3 Data Pre Processing

The data collected for the mining process may contain missing values, noise or inconsistencies, which lead the mining process to produce inconsistent information. High-quality data produces better data mining results. To improve the quality of data and consequently the mining results, the collected data is to
be preprocessed so as to improve the efficiency of the data mining process. Data preprocessing is one of the critical steps in data mining; it deals with the preparation and transformation of the initial data set into the final data set. The following categories of data preprocessing are applied to convert the initial data set into the final data set.

• Data cleaning
• Data integration
• Data transformation
• Data reduction

Data cleaning cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe that the data are dirty, they will not trust the results of any data mining applied to it [4]. Data integration is the process of combining data from different sources. Customer data may contain attributes that take much larger values than others [4]; the solution to this problem is normalization, a data transformation that rescales values to a common range. Data reduction produces a reduced representation of the data set that is much smaller in volume yet should produce the same result [4].

Fig - 3: Data preprocessing methods

Data preprocessing is the most time-consuming phase of a data mining process. Data cleaning of the loan data removed several attributes that have no significance for the behavior of a customer. Data integration, data reduction and data transformation are also applicable to the loan data. For easy analysis, the data is reduced to a minimum number of records. Data preprocessing techniques use several statistical tools such as mean, mode, median, standard deviation, range and variance. Graphical tools are necessary to reveal and avoid outlier data.
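The statistical tools and the normalization step mentioned above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the loan amounts are made-up values, the function names are hypothetical, and the 1.5 x IQR rule is a common convention for flagging outliers rather than a method taken from this paper.

```python
import statistics

# Hypothetical loan amounts; 900000 is a deliberately extreme value.
loan_amounts = [25000, 40000, 15000, 60000, 30000, 900000, 45000, 20000]


def summarize(values):
    """Return some of the descriptive statistics named in the text."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "range": max(values) - min(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant (unary-valued) column carries no information to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(summarize(loan_amounts))
print(iqr_outliers(loan_amounts))
print(min_max_normalize([10, 20, 30]))
```

Flagged values would still need a domain decision (correct, remove or keep); the sketch only identifies candidates.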
Examples of graphical tools are histograms, quantile plots, q-q plots and scatter plots. The binning method is suitable for smoothing the data. Clustering is a data mining technique that can be used to detect outliers.

4. DATA USED

Data was collected from a premier cooperative bank that provides loans to individuals, business firms, etc., so as to meet the requirements of all types of customers. Data collection was carried out through on-site observation and interviews with the concerned authority. A detailed study of loan processing and banking transactions was also made. The available data consists of 2500 records of bank loan transaction data with 25 data fields. Some of the fields were removed directly by manual data preprocessing. The following data fields remain after the unnecessary fields were removed.

Table-1: List of attributes
No.  Attribute
1    Loan Number
2    Loan Date
3    Due Date
4    Loan Amount
5    Opening
6    Payment
7    Receipt
8    Int_rcvd
9    Fine_rcvd
10   MemNo
11   Action
12   Secured
13   Loan Balance
14   Interest Rate
15   Category
16   Purpose
17   Gender
18   Occupation

5. PROPOSED ARCHITECTURE

The main focus of this paper is to propose a framework for analyzing the behavior of bank customers and predicting the credibility of loan repayment. Customer behavior is collected as input to the framework, and the decision whether a loan is to be sanctioned depends on the classifier value retrieved from the framework. The collected data are classified as in Table 2, based on the type of attribute: input column, target column or ignored column. Only the input column attributes are given to the model as input. From the literature review, an account must first be created for each customer in the bank, and customers should enter their personal details, income details, insurance details, loan details and the
account information of the corresponding customer in other banks [4]. Here, when a loan is sanctioned, the customer details as well as the loan details are entered into the database. Loan Number is recognized as the identification column that uniquely identifies the data. Generally this column is ignored in the data mining process. Unary-valued columns are also removed, because they contain no information that helps to distinguish between rows; the Category column is removed for this reason. The major input columns are Loan Amount, Fine received, Action, Loan Balance, Occupation and Purpose. These values are the input to the model. The target column has to be designed and is used when building predictive models; here it specifies whether a loan is to be sanctioned, depending on the input values to the model. Some columns are considered ignored columns or outlier columns and are not used in designing the model. The model does not need the actual loan balance.

Table-2: Classification of attributes
Data fields or attributes | Type of attribute | Remarks
Loan Number | Identification column | Not used for data mining process.
Loan Amount, Fine received, Action, Loan Balance, Interest Rate, Occupation, Purpose | Input columns | These are the inputs to the model.
Loan Status | Target column | Decides whether the loan is to be sanctioned or not.
Loan Date, Due Date, Opening, Payment, Receipt, Int_rcvd, MemNo | Outlier columns | These are not necessary to build the model.

The model needs to know only whether or not there is an amount in the Loan Balance column, in order to distinguish customers who have closed their loan from those who have not.
To do this, the Loan Balance values are replaced with 1 and 0: if there is a value in the Loan Balance column it is replaced with 1, otherwise with 0. The few missing values in the Purpose and job columns are left untreated ("do nothing"), because the data collection is large and these few missing values may not materially affect the models.

The framework for the loan transaction processing system is shown in Fig. 4. It takes customer data as input and produces the target column as output. The customer data passes through loan data preprocessing, which separates the input data into input columns, identification columns and outlier columns. Outlier columns have no importance for the data mining process. Only the input columns are given to the model for loan data processing. The output obtained from the model is the target column: a classifier value that decides whether or not the loan is to be sanctioned, based on customer behavior. To implement this, further data mining techniques such as decision trees, SVM and neural networks are to be designed and tested. An algorithm also needs to be designed in a data mining environment to check the validity of this framework concept.

Fig 4: Framework for loan credibility prediction

5.1 Advantages

The advantage of this framework for a banking organization is that only the transactions of valid customers enter its data warehouse. Fraudulent customers are filtered out by applying the data mining model at the time of first interaction with the concerned authorities. This helps solve many problems related to fraudsters in the banking sector, and will be an asset for developing nations like India in stabilizing their financial records.

6. TECHNOLOGY USED

The Weka suite contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality [5].
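As a hedged illustration of how the steps of Section 5 could feed into Weka, the sketch below keeps a subset of the input columns of Table 2, recodes Loan Balance to 1 or 0, and writes the result in ARFF, Weka's native text format. The function names, the sample record, the attribute types and the treatment of a zero balance as "no balance" are assumptions made for the example, not details taken from the paper.

```python
# Assumed attribute types for a subset of the input columns in Table 2.
NUMERIC_INPUTS = ["Loan Amount", "Fine_rcvd", "Interest Rate"]
NOMINAL_INPUTS = ["Occupation", "Purpose"]


def preprocess(record):
    """Keep only input columns; identification and outlier columns are dropped."""
    row = {name: record.get(name) for name in NUMERIC_INPUTS + NOMINAL_INPUTS}
    # Recode Loan Balance: 1 if an amount is present, otherwise 0.
    # Assumption: a missing, empty or zero balance all count as "no balance".
    row["Loan Balance"] = 1 if record.get("Loan Balance") else 0
    return row


def to_arff(rows, relation="loan"):
    """Serialize preprocessed rows as ARFF text; missing values become '?'."""
    lines = [f"@relation {relation}", ""]
    for name in NUMERIC_INPUTS:
        lines.append(f"@attribute '{name}' numeric")
    lines.append("@attribute 'Loan Balance' {0,1}")
    for name in NOMINAL_INPUTS:
        # Nominal labels are taken from the data; labels are assumed to
        # contain no spaces or commas, so they are written unquoted.
        labels = ",".join(sorted({r[name] for r in rows if r[name] is not None}))
        lines.append(f"@attribute '{name}' {{{labels}}}")
    lines += ["", "@data"]
    order = NUMERIC_INPUTS + ["Loan Balance"] + NOMINAL_INPUTS
    for r in rows:
        lines.append(",".join("?" if r[c] is None else str(r[c]) for c in order))
    return "\n".join(lines) + "\n"


# Hypothetical record using field names from Table 1.
sample = {
    "Loan Number": "L001", "MemNo": 77, "Loan Amount": 50000, "Fine_rcvd": 0,
    "Interest Rate": 12.5, "Loan Balance": 18000,
    "Occupation": "farmer", "Purpose": "agriculture",
}
print(to_arff([preprocess(sample)]))
```

The returned text can be saved with a .arff extension and opened in the Weka Explorer. The target column is left out here because the paper states that it still has to be designed.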
Weka is a powerful tool for data preprocessing, classification, clustering and visualization. It
is freely available, platform-independent software. As shown in Fig. 5, the Weka GUI Chooser offers four applications: Explorer, Experimenter, Knowledge Flow and Simple CLI. Notepad can be used to create the data set in .ARFF format; a file in .ARFF format is needed to open the data in Weka. Weka can apply several data mining algorithms and techniques, making it possible to compare the results of different techniques on the same task. Microsoft Excel is a powerful tool for managing data in tabular form, the most important data format used by data mining algorithms. The initial and final data can easily be represented in tabular form in Excel.

Fig 5: GUI of Weka

7. FUTURE SCOPE

The main focus of this work is the design of data mining models to distinguish customers who repay loans promptly from those who do not. Models based on SVM, neural network analysis, etc. are to be designed and tested. The prediction is made by analyzing the behavior and reliability of customers using a prediction algorithm that takes as input the characteristics of the customer, such as nature of job, past history, income and other related fields. The output expected from the system is an indicator of whether the customer is prospective or not. Algorithms that outperform popular data mining models also have to be developed and tested for the CRM domain.

8. CONCLUSION

In this work, a data preprocessing framework for loan credibility prediction is proposed. The data was collected from a premier organization and classified by applying preprocessing techniques such as data cleaning, data integration and normalization.
The behavior of customers after preprocessing is considered as the input to the model. The target column is a classifier that decides whether or not the loan is to be sanctioned. The framework helps to predict the loan repayment credibility of a customer by considering the customer's behavior. This will help the banking organization to keep fraudsters out of its transactions.

REFERENCES

[1]. Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(3):57.
[2]. S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Pre-processing for Supervised Leaning", International Journal of Computer Science, 2006, Vol 1 N. 2, pp 111–117.
[3]. Sreekumar Pulakkazhy and R.V.S. Balan, “Data Mining in banking and its applications – a review”, Journal of Computer Science 9 (10): 1252-1259, 2013, ISSN: 1549-3636 © 2013 Science Publications, doi:10.3844/jcssp.2013.1252.1259
[4]. Ms. Neethu Baby, Mrs. Priyanka L.T, “Customer Classification And Prediction Based On Data Mining Technique”, International Journal of Emerging Technology and Advanced Engineering (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 2, Issue 12, December 2012)
[5]. Swasti Singhal, Monika Jena, “A Study on WEKA Tool for Data Preprocessing, Classification and Clustering”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-2, Issue-6, May 2013
[6]. Dileep B. Desai, Dr. R.V.Kulkarni, “A Review: Application of Data Mining Tools in CRM for Selected Banks”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 4 (2), 2013, 199 – 201.
[7]. Rob Gerritsen, “Loan Risks: A Data Mining Case Study”
[8]. Dr. K. Chitra, B.
Subashini, “Data Mining Techniques and its Applications in Banking Sector”, International Journal of Emerging Technology and Advanced Engineering, Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 8, August 2013)