
40th jubilee international convention MIPRO 2017, 22–26 May 2017, Opatija, Croatia

Selection of Variables for Credit Risk Data Mining Models: Preliminary Research

M. Pejić Bach*, J. Zoroja*, B. Jaković*, N. Šarlija**


* University of Zagreb, Faculty of Economics & Business, Zagreb, Croatia
** University of Osijek, Faculty of Economics, Osijek, Croatia
1. Introduction
2. Literature review
3. Methodology
4. Results
5. Discussion
6. Conclusion
Introduction
• data mining methods → finding undiscovered valuable information from large databases
• “to extract knowledge in order to make successful management decisions”
• usage: banking, marketing, finance, …
Data mining usage in financial analysis
• prediction of credit risk (prediction of credit default) → banking industry
• several data mining techniques → the most popular are decision trees
Decision trees
• grouping variables into one or more categories of the target variable
• important steps in applying decision trees (sketched in code below):
1. determine the sample
2. choose the variables
3. select an appropriate algorithm
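The paper's experiments run in Weka, so these three steps map onto Weka's Java API. A minimal sketch under stated assumptions (the file name, the CfsSubsetEval evaluator, and the 10-fold cross-validation are illustrative choices, not the paper's exact configuration):

// Minimal sketch of the three steps with Weka's Java API.
// Assumptions: file name "credit_applicants.arff", CfsSubsetEval as the variable
// selector, 10-fold cross-validation; the paper's exact setup may differ.
import java.util.Random;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeSteps {
    public static void main(String[] args) throws Exception {
        // 1. Determine the sample: load the applicants' data.
        Instances data = DataSource.read("credit_applicants.arff");
        data.setClassIndex(data.numAttributes() - 1);   // goal variable as the last attribute

        // 2. Choose the variables: correlation-based subset selection.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);
        Instances reduced = selector.reduceDimensionality(data);

        // 3. Select an appropriate algorithm: C4.5, implemented in Weka as J48.
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}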
Goal of the paper
To compare the classification of banking clients according to credit default with the C4.5 decision tree algorithm, using different sets of variables (a sketch of building such subsets follows the list):
• Entrepreneurial idea
• Growth plan
• Marketing plan
• Personal characteristics of entrepreneurs
• Characteristics of the SME
• Characteristics of the credit program
• Relationship between the entrepreneur and a financial institution
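One way to compare such variable sets is to restrict the dataset to the columns of each group before growing a tree. A hedged sketch with Weka's Remove filter; the attribute index range is a placeholder, since the actual column positions of each group are not listed on this slide:

// Hedged sketch: keep only one group of variables (plus the goal variable) using
// Weka's Remove filter. The index range "2-6" is a placeholder, not the paper's layout.
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class VariableGroups {
    static Instances keepGroup(Instances data, String indices) throws Exception {
        Remove keep = new Remove();
        keep.setAttributeIndices(indices);   // e.g. "2-6,last" = one variable group + goal variable
        keep.setInvertSelection(true);       // keep the listed attributes, drop the rest
        keep.setInputFormat(data);
        return Filter.useFilter(data, keep);
    }

    static void example(Instances data) throws Exception {
        // Hypothetical columns for the "Growth plan" group.
        Instances growthPlanOnly = keepGroup(data, "2-6,last");
        growthPlanOnly.setClassIndex(growthPlanOnly.numAttributes() - 1);
        // ...build and evaluate a C4.5 tree on growthPlanOnly, as in the later sketches.
    }
}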
Contribution of the paper
• variable selection algorithms → tested on a real-world credit risk dataset of a Croatian financial institution’s business clients
Literature overview – Data mining methodology
• a high growth of data in databases → the need for the use of different methodologies
• data mining combines different approaches: machine learning, statistics, and database management
Most commonly used methods:
• classification
• regression
• clustering
• visualization
• decision trees
• association rules
• neural networks
• support vector machines
Decision trees analysis
Advantages:
• interpretability of the results
• simple usage and implementation
Disadvantages:
• high variance, which requires extra attention from analysts
• overfitting
Literature overview – Predicting credit default with a data mining approach
• a detailed analysis of data on the characteristics of current and previous credit users → an important factor in forecasting the future credit default of new clients
Literature overview – Variable and technique selection approaches
• behavioral and demographic variables of previous and current clients (e.g. marital status, monthly income, …) are used to predict credit default
• methods used in credit scoring: the credit scorecard, logistic regression, decision tree models
Methodology – Data
• sample: 200 applicants
• data collection: entrepreneurship credit dataset
• nominal and numeric variables (see the code sketch after Tables I–IV below)
Table I. Variables related to the future plans for the SME
Table II. Variables related to the characteristics of the entrepreneur and the SME
Table III. Variables related to the credit program and the bank
Table IV. Goal variable used for the credit scoring
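The dataset mixes nominal and numeric variables, grouped as in Tables I–IV. A small hedged sketch of confirming how Weka typed each attribute after loading (the file name is an assumption):

// Hedged sketch: list each attribute's declared type after loading the dataset.
// "credit_applicants.arff" is an assumed file name, not the paper's actual file.
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectDataset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit_applicants.arff");   // 200 applicants
        data.setClassIndex(data.numAttributes() - 1);                 // credit-default goal variable
        System.out.println("Instances: " + data.numInstances());
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute a = data.attribute(i);
            System.out.printf("%-35s %s%n", a.name(), a.isNominal() ? "nominal" : "numeric");
        }
    }
}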
Methodology – Decision trees
Table V. Weka description of the used algorithm
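Table V refers to Weka's J48 classifier, the open-source implementation of C4.5. The sketch below shows how such a tree is typically configured and built; the parameter values are Weka's defaults, shown for illustration only, not the settings reported in the paper:

// Hedged sketch: build a C4.5 tree (Weka's J48) on data whose class index is already set.
// Parameter values are Weka's defaults, used here only for illustration.
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BuildC45 {
    static J48 buildTree(Instances train) throws Exception {
        J48 c45 = new J48();
        c45.setConfidenceFactor(0.25f);   // pruning confidence
        c45.setMinNumObj(2);              // minimum number of instances per leaf
        c45.buildClassifier(train);
        System.out.println(c45);          // textual form of the tree (cf. Figures 1–3)
        System.out.println("Leaves: " + c45.measureNumLeaves()
                + ", tree size: " + c45.measureTreeSize());          // cf. Table VII
        return c45;
    }
}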
Methodology – Variable selection
Three approaches to variable selection (sketched in code below):
1. CfsSubsetEval algorithm
2. ChiSquaredVariableEval algorithm
3. ConsistencySubsetEval algorithm
Table VI. Variables selected by different algorithms
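In Weka these approaches correspond to attribute-selection evaluators paired with a search method. The sketch below is a hedged reconstruction: it assumes that the slide's "ChiSquaredVariableEval" refers to Weka's ChiSquaredAttributeEval class, that the subset evaluators are paired with best-first search, and that a ranker cut-off of 10 is used; note that ChiSquaredAttributeEval and ConsistencySubsetEval ship with older Weka releases and are optional packages in newer ones:

// Hedged sketch of the three variable selection approaches via Weka's attribute selection API.
// Assumptions: the slide's "ChiSquaredVariableEval" maps to Weka's ChiSquaredAttributeEval;
// subset evaluators use BestFirst search; the Ranker cut-off of 10 is illustrative.
import java.util.Arrays;
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.ConsistencySubsetEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class SelectVariables {
    // Runs one evaluator/search pair, prints the chosen attribute indices,
    // and returns the dataset reduced to those attributes (cf. Table VI).
    static Instances select(Instances data, ASEvaluation eval, ASSearch search) throws Exception {
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(eval);
        sel.setSearch(search);
        sel.SelectAttributes(data);
        System.out.println(Arrays.toString(sel.selectedAttributes()));
        return sel.reduceDimensionality(data);
    }

    static void runAll(Instances data) throws Exception {
        select(data, new CfsSubsetEval(), new BestFirst());            // 1. correlation-based subset
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10);                                     // illustrative cut-off
        select(data, new ChiSquaredAttributeEval(), ranker);           // 2. chi-squared ranking
        select(data, new ConsistencySubsetEval(), new BestFirst());    // 3. consistency-based subset
    }
}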
Results
Table VII. Characteristics of the trees developed with different variable selection approaches
Figure 1. Decision tree developed using variables selected by the CfsSubsetEval algorithm
Figure 2. Decision tree developed using variables selected by the ChiSquaredVariableEval algorithm
Figure 3. Decision tree developed using variables selected by the ConsistencySubsetEval algorithm
Discussion
Table VIII. Classification efficiency measures
Table IX. Classification matrices
Figure 4. Falsely predicted good and bad debtors with different variable selection approaches
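The efficiency measures of Table VIII and the classification matrices of Table IX can be reproduced with Weka's Evaluation class. A hedged sketch (the fold count, random seed, and position of the "bad" class are assumptions, not details restated from the paper):

// Hedged sketch: classification efficiency measures and the classification (confusion)
// matrix for a J48 tree. Fold count, seed, and the row index of the "bad" class are assumptions.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class EvaluateTree {
    static void evaluate(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());   // correctly classified instances etc. (cf. Table VIII)
        System.out.println(eval.toMatrixString());    // classification matrix (cf. Table IX)

        // Bad debtors falsely predicted as good (cf. Figure 4): an off-diagonal cell of the
        // confusion matrix; which row is the "bad" class depends on the class value order.
        double[][] cm = eval.confusionMatrix();
        System.out.println("Bad debtors predicted as good: " + cm[1][0]);
    }
}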
Conclusion
• the variables selected by the CfsSubsetEval algorithm give the best results regarding the percentage of correctly classified instances
• according to the percentage of bad debtors falsely predicted as good ones, the decision tree generated using the variables selected by ChiSquaredVariableEval is the worst
• according to the criterion of the minimal percentage of bad debtors falsely predicted as good, the best approaches were the decision trees generated using the variables selected by CfsSubsetEval or by ConsistencySubsetEval
• variables related to the Entrepreneurial idea, Growth plan, and Marketing plan were more relevant than the other variables
THANK YOU FOR YOUR
ATTENTION!
QUESTIONS?
