Building credit scoring models using genetic programming
Introduction
Credit scoring models have been widely used by financial institutions to determine whether loan applicants belong to a good applicant group or a bad applicant group. The advantages of credit scoring models include reducing the cost of credit analysis, enabling faster credit decisions, ensuring credit collections, and diminishing possible risk (Lee et al., 2002, West, 2000). Since an improvement in accuracy of even a fraction of a percent might translate into significant savings (West, 2000), this paper proposes a more sophisticated model to improve the accuracy of credit scoring.
In order to obtain a satisfactory credit scoring model, numerous methods have been proposed. Roughly, these methods can be classified into parametric statistical methods (e.g. discriminant analysis and logistic regression), non-parametric statistical methods (e.g. k nearest neighbor and decision trees), and soft-computing approaches (e.g. artificial neural networks (ANNs) and rough sets). Recently, ANNs have become the most popular tool for credit scoring, and their accuracy has been reported to be superior to that of traditional statistical methods in credit scoring problems, especially with regard to non-linear patterns (Desai et al., 1996, Desai et al., 1997, Malhotra and Malhotra, 2003, Jensen, 1992, Piramuthu, 1999). On the other hand, ANNs have been criticized for their poor performance when irrelevant attributes are incorporated or when data sets are small (Castillo et al., 2003, Feraud and Cleror, 2002, Nath et al., 1997).
In order to build an effective discriminant function, two issues should be considered. First, the relationships among attributes and classes may be linear or non-linear. Second, irrelevant attributes should be removed in order to increase the accuracy of the classification model. In this paper, genetic programming (GP) is employed to automatically and heuristically determine the adequate discriminant functions and the valid attributes simultaneously. In addition, unlike ANNs, which are only suited to large data sets, GP can perform well even on small data sets (Nath et al., 1997).
To obtain the discriminant function efficiently, the data set is preprocessed by discretization. Two real-world cases are then used to compare the accuracy of GP with that of other classification models, including logistic regression, ANN, decision trees, and rough sets. On the basis of the results, we conclude that GP can provide better performance than the other models.
The rest of this paper is organized as follows. Section 2 describes the models for credit scoring. Discretization and genetic programming are proposed in Section 3. Two real-world examples are used to demonstrate the proposed method in Section 4. Discussions are presented in Section 5 and conclusions are in Section 6.
Credit scoring models
In this section, we describe three popular models used in building credit scoring models. The first model is logistic regression, which is widely used for classification problems in statistics. The second model is ANN, which is known for its excellent ability to learn non-linear relationships in a system. The third model is rough sets, a kind of induction-based algorithm that has been widely used in classification problems since the 1990s.
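As a minimal illustration of the first of these models, logistic regression maps a weighted sum of applicant attributes to a probability of belonging to the good applicant group. The sketch below is not the fitted model from this paper: the function names, the weights, the bias, the attribute values, and the 0.5 cut-off are all hypothetical and chosen only to show the mechanics.

```python
import math


def logistic_score(weights, bias, attributes):
    """Return the modelled probability that an applicant is 'good'."""
    # Linear combination of the attributes (the log-odds).
    z = bias + sum(w * x for w, x in zip(weights, attributes))
    # The logistic function squashes the log-odds into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Assign a group label from a probability (threshold is an assumption)."""
    return "good" if probability >= threshold else "bad"


# Hypothetical coefficients and one hypothetical applicant.
weights = [0.8, -1.2, 0.5]
bias = -0.1
applicant = [1.0, 0.5, 2.0]

p = logistic_score(weights, bias, applicant)
label = classify(p)
```

In practice the weights would be estimated from labelled credit data by maximum likelihood rather than set by hand.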
Genetic programming
Genetic programming was proposed by Koza (1992) to automatically extract intelligible relationships in a system and has been used in many applications, such as symbolic regression (Davidson, Savic, & Walters, 2003) and classification (Stefano et al., 2002, Zhang and Bhattacharyya, 2004). A GP individual can be viewed as a tree-based structure composed of elements from a function set and a terminal set. The function set consists of operators, functions or statements such as arithmetic operators
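The tree-based representation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the encoding used in the paper: the nested-tuple layout, the protected division, the attribute names `x1` to `x4`, and the rule "non-negative output means good applicant" are all hypothetical. It also shows how a tree selects attributes implicitly, since an attribute that does not appear as a terminal has no effect on the discriminant value.

```python
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the divisor is (near) zero."""
    return a / b if abs(b) > 1e-9 else 1.0


# Function set: binary arithmetic operators (an assumed, minimal choice).
FUNCTION_SET = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree, sample):
    """Recursively evaluate a tree on one sample (dict of attribute values)."""
    if isinstance(tree, tuple):  # internal node: (operator, left, right)
        op, left, right = tree
        return FUNCTION_SET[op](evaluate(left, sample), evaluate(right, sample))
    if isinstance(tree, str):  # terminal: attribute name
        return sample[tree]
    return float(tree)  # terminal: numeric constant


def used_attributes(tree):
    """Collect the attribute names that appear as terminals in the tree."""
    if isinstance(tree, tuple):
        _, left, right = tree
        return used_attributes(left) | used_attributes(right)
    return {tree} if isinstance(tree, str) else set()


# Hypothetical discriminant function: x1 * x3 - (x2 + 0.5)
tree = ("-", ("*", "x1", "x3"), ("+", "x2", 0.5))
sample = {"x1": 2.0, "x2": 1.0, "x3": 1.5, "x4": 9.0}

value = evaluate(tree, sample)
label = "good" if value >= 0 else "bad"
selected = used_attributes(tree)  # x4 is absent, so it is effectively dropped
```

A full GP run would evolve a population of such trees through selection, crossover, and mutation, scoring each tree by its classification accuracy on the training data.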
Empirical analysis
In this section, GP is compared to the multilayer perceptron (MLP), classification and regression tree (CART), C4.5, rough sets, and logistic regression (LR) using two real-world data sets. The first data set consists of Australian credit scoring data, with 307 examples of creditworthy customers and 383 examples of non-creditworthy customers. It contains 14 attributes, of which six are continuous and eight are categorical. The second data set, called the German Credit Data Set, was provided by Prof.
Discussions
Due to the rapid growth of the credit industry, building an effective credit scoring model has become an important task for reducing costs and making decisions efficiently. Although many novel approaches have been proposed, further issues should be considered in order to increase the accuracy of the credit scoring model.
First, irrelevant variables destroy the structure of the data and decrease the accuracy of the discriminant function. Second, the credit scoring model should determine the
Conclusions
Building a credit scoring model involves the problems of variable selection and model identification. Although many approaches have been proposed, flexible and accurate methods remain scarce. In this paper, GP is employed to build the discriminant function for credit scoring problems. On the basis of the empirical results, we conclude that GP is more flexible and achieves significantly better accuracy in credit scoring problems.
References (36)
- et al. (2000). The integrated methodology of rough set theory and artificial neural network for business failure prediction. Expert Systems with Applications.
- et al. (2001). Variable precision rough set theory and data discretisation: an application to corporate failure prediction. OMEGA: the International Journal of Management Science.
- et al. (2000). Fuzziness in rough sets. Fuzzy Sets and Systems.
- et al. (2003). Symbolic and numerical regression: Experiments and applications. Information Sciences.
- et al. (1996). A comparison of neural networks and linear scoring models in credit union environment. European Journal of Operational Research.
- et al. (1999). Business failure prediction using rough sets. European Journal of Operational Research.
- et al. (2002). A methodology to explain neural network classification. Neural Networks.
- et al. (2002). Credit scoring using the hybrid neural discriminant technique. Expert Systems with Applications.
- (2001). Rough set theory applied to (fuzzy) ideal theory. Fuzzy Sets and Systems.
- et al. (1997). Determining the saliency of input variables in neural network classifiers. Computers and Operations Research.