
40th jubilee international convention MIPRO 2017, 22–26 May 2017, Opatija, Croatia

Selection of Variables for Credit Risk Data Mining Models: Preliminary Research

M. Pejić Bach*, J. Zoroja*, B. Jaković*, N. Šarlija**


* University of Zagreb, Faculty of Economics & Business, Zagreb, Croatia
** University of Osijek, Faculty of Economics, Osijek, Croatia
1. Introduction
2. Literature review
3. Methodology
4. Results
5. Discussion
6. Conclusion
Introduction
• data mining methods → finding undiscovered valuable information from large databases
• “to extract knowledge in order to make successful management decisions”
• usage: banking, marketing, finance, …
Data mining usage in financial analysis
• prediction of credit risk (prediction of credit default) → banking industry
• several data mining techniques → the most popular are decision trees
Decision trees
• grouping variables into one or more categories of the target variable
• important steps in applying decision trees (sketched in code below):
1. determine the sample
2. choose the variables
3. select an appropriate algorithm
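The paper's experiments run in Weka, so these three steps map onto Weka's Java API. A minimal sketch under stated assumptions (the file name, the CfsSubsetEval evaluator, and the 10-fold cross-validation are illustrative choices, not the paper's exact configuration):

// Minimal sketch of the three steps with Weka's Java API.
// Assumptions: file name "credit_applicants.arff", CfsSubsetEval as the variable
// selector, 10-fold cross-validation; the paper's exact setup may differ.
import java.util.Random;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeSteps {
    public static void main(String[] args) throws Exception {
        // 1. Determine the sample: load the applicants' data.
        Instances data = DataSource.read("credit_applicants.arff");
        data.setClassIndex(data.numAttributes() - 1);   // goal variable as the last attribute

        // 2. Choose the variables: correlation-based subset selection.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);
        Instances reduced = selector.reduceDimensionality(data);

        // 3. Select an appropriate algorithm: C4.5, implemented in Weka as J48.
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}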
Goal of the paper
To compare the classification of banking clients according to credit default with the C4.5 decision tree algorithm, using different sets of variables (a sketch of building such subsets follows the list):
• Entrepreneurial idea
• Growth plan
• Marketing plan
• Personal characteristics of entrepreneurs
• Characteristics of the SME
• Characteristics of the credit program
• Relationship between the entrepreneur and a financial institution
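One way to compare such variable sets is to restrict the dataset to the columns of each group before growing a tree. A hedged sketch with Weka's Remove filter; the attribute index range is a placeholder, since the actual column positions of each group are not listed on this slide:

// Hedged sketch: keep only one group of variables (plus the goal variable) using
// Weka's Remove filter. The index range "2-6" is a placeholder, not the paper's layout.
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class VariableGroups {
    static Instances keepGroup(Instances data, String indices) throws Exception {
        Remove keep = new Remove();
        keep.setAttributeIndices(indices);   // e.g. "2-6,last" = one variable group + goal variable
        keep.setInvertSelection(true);       // keep the listed attributes, drop the rest
        keep.setInputFormat(data);
        return Filter.useFilter(data, keep);
    }

    static void example(Instances data) throws Exception {
        // Hypothetical columns for the "Growth plan" group.
        Instances growthPlanOnly = keepGroup(data, "2-6,last");
        growthPlanOnly.setClassIndex(growthPlanOnly.numAttributes() - 1);
        // ...build and evaluate a C4.5 tree on growthPlanOnly, as in the later sketches.
    }
}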
Contribution of the paper
• variable selection algorithms → tested on a real-world credit risk dataset of a Croatian financial institution’s business clients
Literature overview – Data mining methodology
• a high growth of data in databases → the need for the use of different methodologies
• data mining combines different approaches: machine learning, statistics, and database management
Most commonly used methods:
• classification
• regression
• clustering
• visualization
• decision trees
• association rules
• neural networks
• support vector machines
Decision trees analysis
Advantages:
• interpretability of the results
• simple usage and implementation
Disadvantages:
• high variance, which requires extra attention from analysts
• overfitting
Literature overview – Predicting credit default with a data mining approach
• a detailed analysis of data on the characteristics of current and previous credit users → an important factor in forecasting the future credit default of new clients
Literature overview – Variable and technique selection approaches
• behavioral and demographic variables of previous and current clients (e.g. marital status, monthly income, …) are used to predict credit default
• methods used in credit scoring: the credit scorecard, logistic regression, decision tree models
Methodology – Data
• sample: 200 applicants
• data collection: entrepreneurship credit dataset
• nominal and numeric variables (see the code sketch after Tables I–IV below)
Table I. Variables related to the future plans for the SME
Table II. Variables related to the characteristics of the entrepreneur and the SME
Table III. Variables related to the credit program and the bank
Table IV. Goal variable used for the credit scoring
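The dataset mixes nominal and numeric variables, grouped as in Tables I–IV. A small hedged sketch of confirming how Weka typed each attribute after loading (the file name is an assumption):

// Hedged sketch: list each attribute's declared type after loading the dataset.
// "credit_applicants.arff" is an assumed file name, not the paper's actual file.
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectDataset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit_applicants.arff");   // 200 applicants
        data.setClassIndex(data.numAttributes() - 1);                 // credit-default goal variable
        System.out.println("Instances: " + data.numInstances());
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute a = data.attribute(i);
            System.out.printf("%-35s %s%n", a.name(), a.isNominal() ? "nominal" : "numeric");
        }
    }
}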
Methodology – Decision trees
Table V. Weka description of the used algorithm
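Table V refers to Weka's J48 classifier, the open-source implementation of C4.5. The sketch below shows how such a tree is typically configured and built; the parameter values are Weka's defaults, shown for illustration only, not the settings reported in the paper:

// Hedged sketch: build a C4.5 tree (Weka's J48) on data whose class index is already set.
// Parameter values are Weka's defaults, used here only for illustration.
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BuildC45 {
    static J48 buildTree(Instances train) throws Exception {
        J48 c45 = new J48();
        c45.setConfidenceFactor(0.25f);   // pruning confidence
        c45.setMinNumObj(2);              // minimum number of instances per leaf
        c45.buildClassifier(train);
        System.out.println(c45);          // textual form of the tree (cf. Figures 1–3)
        System.out.println("Leaves: " + c45.measureNumLeaves()
                + ", tree size: " + c45.measureTreeSize());          // cf. Table VII
        return c45;
    }
}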
Methodology – Variable selection
Three approaches to variable selection (sketched in code below):
1. CfsSubsetEval algorithm
2. ChiSquaredVariableEval algorithm
3. ConsistencySubsetEval algorithm
Table VI. Variables selected by different algorithms
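In Weka these approaches correspond to attribute-selection evaluators paired with a search method. The sketch below is a hedged reconstruction: it assumes that the slide's "ChiSquaredVariableEval" refers to Weka's ChiSquaredAttributeEval class, that the subset evaluators are paired with best-first search, and that a ranker cut-off of 10 is used; note that ChiSquaredAttributeEval and ConsistencySubsetEval ship with older Weka releases and are optional packages in newer ones:

// Hedged sketch of the three variable selection approaches via Weka's attribute selection API.
// Assumptions: the slide's "ChiSquaredVariableEval" maps to Weka's ChiSquaredAttributeEval;
// subset evaluators use BestFirst search; the Ranker cut-off of 10 is illustrative.
import java.util.Arrays;
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.ConsistencySubsetEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class SelectVariables {
    // Runs one evaluator/search pair, prints the chosen attribute indices,
    // and returns the dataset reduced to those attributes (cf. Table VI).
    static Instances select(Instances data, ASEvaluation eval, ASSearch search) throws Exception {
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(eval);
        sel.setSearch(search);
        sel.SelectAttributes(data);
        System.out.println(Arrays.toString(sel.selectedAttributes()));
        return sel.reduceDimensionality(data);
    }

    static void runAll(Instances data) throws Exception {
        select(data, new CfsSubsetEval(), new BestFirst());            // 1. correlation-based subset
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10);                                     // illustrative cut-off
        select(data, new ChiSquaredAttributeEval(), ranker);           // 2. chi-squared ranking
        select(data, new ConsistencySubsetEval(), new BestFirst());    // 3. consistency-based subset
    }
}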
Results
Table VII. Characteristics of the trees developed with different variable selection approaches
Figure 1. Decision tree developed using variables selected by the CfsSubsetEval algorithm
Figure 2. Decision tree developed using variables selected by the ChiSquaredVariableEval algorithm
Figure 3. Decision tree developed using variables selected by the ConsistencySubsetEval algorithm
Discussion
Table VIII. Classification efficiency measures
Table IX. Classification matrices
Figure 4. Falsely predicted good and bad debtors with different variable selection approaches
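The efficiency measures of Table VIII and the classification matrices of Table IX can be reproduced with Weka's Evaluation class. A hedged sketch (the fold count, random seed, and position of the "bad" class are assumptions, not details restated from the paper):

// Hedged sketch: classification efficiency measures and the classification (confusion)
// matrix for a J48 tree. Fold count, seed, and the row index of the "bad" class are assumptions.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class EvaluateTree {
    static void evaluate(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());   // correctly classified instances etc. (cf. Table VIII)
        System.out.println(eval.toMatrixString());    // classification matrix (cf. Table IX)

        // Bad debtors falsely predicted as good (cf. Figure 4): an off-diagonal cell of the
        // confusion matrix; which row is the "bad" class depends on the class value order.
        double[][] cm = eval.confusionMatrix();
        System.out.println("Bad debtors predicted as good: " + cm[1][0]);
    }
}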
Conclusion
• the variables selected by the CfsSubsetEval algorithm give the best results regarding the percentage of correctly classified instances
• according to the percentage of bad debtors falsely predicted as good ones, the decision tree generated using the variables selected by ChiSquaredVariableEval is the worst
• according to the criterion of the minimal percentage of bad debtors falsely predicted as good, the best approaches were the decision trees generated using the variables selected by CfsSubsetEval or by ConsistencySubsetEval
• variables related to the Entrepreneurial idea, Growth plan, and Marketing plan were more relevant than the other variables
THANK YOU FOR YOUR
ATTENTION!
QUESTIONS?
